Alternative Splicing Analysis - Algorithm and Workflow

1. Biological Background

1.1 What is Alternative Splicing

In eukaryotes, a single gene can produce multiple mRNA isoforms through alternative splicing (AS) -- the process by which different combinations of exons are joined during pre-mRNA splicing. This greatly expands proteomic diversity without increasing gene count.

A gene with N exons can theoretically produce up to 2^(N-1) splice variants. In practice, most genes produce 2-10 isoforms, but some (e.g., Drosophila Dscam) can produce tens of thousands.

1.2 Why Computational Analysis

High-throughput transcriptome sequencing (RNA-seq, full-length cDNA sequencing) produces thousands of transcript alignments against a reference genome. Manually inspecting each gene for splicing variation is infeasible. A computational pipeline is needed to:

Cluster transcripts that belong to the same gene locus
Identify which transcripts are genuine splice variants vs. artifacts
Classify the type of each splicing variation
Quantify and summarize AS events across the genome

1.3 The AS Code Concept

The AS Code system (Sammeth et al., 2008) provides a compact, unambiguous notation for describing any alternative splicing event. The key insight is:

For any pair of overlapping transcripts, the splice sites that differ between them define the AS event. By numbering these differential sites and marking each as donor (^) or acceptor (-), we obtain a code that uniquely describes the event structure.

This code enables systematic classification of AS events into biologically meaningful categories.

2. Input Data and Preprocessing

2.1 Required Input Files

+-------------------+------------------+-----------------------------------+
| File              | Format           | Content                           |
+-------------------+------------------+-----------------------------------+
| Genome            | FASTA            | Reference genome sequences        |
| Gene annotation   | GTF              | Gene models with exon coordinates |
| cDNA alignment    | GFF3 (cDNA_match)| Full-length cDNA alignments       |
+-------------------+------------------+-----------------------------------+

The gene annotation provides the reference transcript structures. The cDNA alignment provides experimentally observed transcript structures that may reveal novel splice variants not in the annotation.

2.2 Parsing Gene/CDNA Annotations

Both GTF (gene) and GFF3 (cDNA) files are parsed to extract exon coordinates for each transcript:

GTF line:  chr1  ensembl  exon  1000  1200  .  +  .  gene_id "G1"; transcript_id "T1";
GFF3 line: chr1  est     cDNA_match  1000  1200  .  +  .  ID=T2;

Extracted:
  transcript T1 on chr1: exon at [1000, 1200], strand +
  transcript T2 on chr1: match at [1000, 1200], strand +

Each transcript is stored as an ordered list of exon coordinates:

Transcript T1:  [[1000, 1200, '+'], [3000, 3500, '+'], [5000, 5500, '+']]
                  exon1                exon2               exon3

2.3 Deriving Introns from Exons

Introns are the gaps between consecutive exons on the same transcript. For a transcript with exons sorted by start position:

Exons:   [1000, 1200]    [3000, 3500]    [5000, 5500]
              exon1           exon2           exon3
              |               |               |
Introns:      |  [1201, 2999] |  [3501, 4999] |
              |    intron1    |    intron2    |

Rule: intron = [exon_i.end + 1, exon_(i+1).start - 1, strand]

Short introns (default: < 9 bp) are filtered out, and the flanking exons are merged into a single exon. This handles alignment artifacts:

Before merge:  [1000, 1200]  [1205, 1300]  [3000, 3500]
                   exon1    short intron     exon2

After merge:   [1000, 1300]                [3000, 3500]
                  merged exon1                  exon2

2.4 Extracting Splice Site Sequences

For each intron, the flanking 6 bp sequences are extracted from the genome:

Genome:  ...ATCGAT|GT  .....intron.....  AG|GCTAGC...
                donor                   acceptor
                (6bp)                  (6bp)

Donor site:    genome[intron.start-1 : intron.start-1+6]  = "GT...."
Acceptor site: genome[intron.end-6   : intron.end]        = "....AG"

On the minus strand, positions are the same but sequences are reverse- complemented, and donor/acceptor labels are swapped:

Minus strand:  ...ATCGAT|GT  .....intron.....  AG|GCTAGC...
               (genomic 5'->3')

Biology reads: 3'<-AG....|.....intron.....|CT....AC<-5'
                   acceptor              donor

Canonical splice sites are GT/AG (major) and GC/AG (minor).

2.5 Computing Gene Ranges

For each gene, the genomic span is the range from the leftmost exon start to the rightmost exon end across all its transcripts:

Gene G1 transcripts:
  T1: [1000, 1200] [3000, 3500] [5000, 5500]
  T2: [1000, 1200] [3000, 3500]
  T3:       [1100, 1200] [3000, 3500] [5000, 5800]

Gene range: [1000, 5800, '+']
            ^first start   ^last end

3. Core Workflow

3.1 Overview

                                              +-----------------+
                                              |  Genome (FASTA) |
                                              +--------+--------+
                                                       |
                                                       v
+-----------------+    +-----------------+    +-----------------+
| Gene GTF        |    | cDNA GFF3       |    | Parse FASTA     |
| (reference)     |    | (evidence)      |    | -> {chr: seq}   |
+--------+--------+    +--------+--------+    +--------+--------+
         |                       |                     |
         v                       v                     |
    Parse GTF              Parse GFF3                  |
    Extract exons          Extract matches             |
    Derive introns         Derive introns              |
    Get gene ranges        Get gene ranges             |
    Get splice seqs        Get splice seqs <-----------+
         |                       | 
         +-----------+-----------+ 
                     |
                     v
          +---------------------+          +---------------------+
          | STEP 1: Clustering  |          | STEP 2: Same-       |
          +  Gene-cDNA          +--------->+ transcript detect.  |
          |  overlap            |          | (graph clustering)  |
          +----------+----------+          +----------+----------+
                                                      |
                                                      v
          +---------------------+          +---------------------+
          | STEP 4:             |          | STEP 3:             |
          +  AS Type            +<---------+  AS Code            +
          |  classification     |          |  generation         |
          +----------+----------+          +----------+----------+
                     |
                     v
          +------------------------------------------------------+
          |          Output results + SVG visualization          |
          +------------------------------------------------------+

3.2 Step 1 -- Transcript Clustering

Goal: Assign each cDNA transcript to the gene it likely originates from.

Method: Check whether the cDNA's genomic range overlaps with any gene's range.

Gene G1 range: [1000, 5800]
  cDNA T4 range: [1050, 5500]  --> OVERLAP --> T4 joins cluster G1
  cDNA T5 range: [7000, 9000]  --> NO OVERLAP --> T5 is a novel gene

Gene G2 range: [6000, 8000]
  cDNA T5 range: [7000, 9000]  --> OVERLAP --> T5 joins cluster G2

For each gene cluster, we collect:

All reference gene transcripts belonging to this gene
All cDNA transcripts overlapping this gene's range

Additional checks:

If a cDNA's strand conflicts with the gene's strand, it is flagged as an orientation error
Genes with no overlapping cDNAs are flagged as "unproved"
cDNAs overlapping no gene are flagged as "novel genes"

3.3 Step 2 -- Same-Transcript Detection

Goal: Within each gene cluster, identify which transcripts are actually the same splice variant (and should be collapsed), leaving only genuine alternative splicing differences.

This is the most algorithmically complex step. It uses graph-based clustering.

3.3.1 The Problem

A gene cluster may contain many transcripts from both gene annotation and cDNA evidence. Some of these represent the same splice isoform (e.g., two cDNA sequences that happen to align to the same exon-intron structure), while others represent genuinely different isoforms.

Gene G1 cluster contains: T1(gene), T2(gene), T3(cDNA), T4(cDNA), T5(cDNA)

Possible scenario:
  T1 and T3 are the same isoform (just from different sources)
  T2 and T5 are the same isoform
  T4 is a unique isoform

We need to detect this automatically.

3.3.2 Pairwise Comparison Algorithm

Two transcripts are considered "the same isoform" if their intron structures are sufficiently similar. The comparison proceeds as follows:

Step A: Check strand consistency If the two transcripts are on opposite strands, they cannot be the same.

Step B: Check splice site sequence availability The script checks whether splice site sequence information is available for both transcripts (both have spliceSeq records, or both do not). It does NOT compare canonical vs non-canonical status -- only presence of the records matters.

Step C: Intron overlap check

The script uses nested loops to find a matching intron pair: for each intron in A, scan all introns in B to find one that overlaps with sufficient coverage. Once a match is found, the remaining introns are compared sequentially via extension check.

Transcript A introns:  I1a  I2a  I3a
Transcript B introns:  I1b  I2b

For each Iia in A, scan all Ijb in B:
  - If Iia and Ijb overlap with coverage >= threshold (default 0.9) on BOTH sides
    --> match found, proceed to extension check on remaining introns
  - If they overlap with coverage > (1 - threshold) on one side
    --> too different to be the same, return FALSE
  - If Iia ends before Ijb starts and Iia is contained within B's gene range
    --> A has an extra intron that B doesn't, return FALSE

Step D: Extension check

When one transcript runs out of introns before the other:

Case 1: A has more introns than B (B exhausted first)
  Remaining introns of A must NOT be contained within B's exon range
  (otherwise A has extra splicing that B lacks --> different isoforms)

Case 2: B has more introns than A (A exhausted first)
  Remaining introns of B must NOT be contained within A's exon range
  (otherwise B has extra splicing that A lacks --> different isoforms)

Case 3: Equal number of introns, all matched
  --> Same isoform

3.3.3 Graph Construction and Clustering

Build an undirected graph where:

Each node = one transcript
An edge connects two transcripts if they are "the same isoform" (per above)

Connected components of this graph represent families of identical isoforms:

Graph for Gene G1:

  T1 ---- T3          T2 ---- T5          T4
  (connected)         (connected)         (isolated)

Connected components:
  Family 1: {T1, T3}  -- T1 is gene transcript, T3 is cDNA (proved)
  Family 2: {T2, T5}  -- T2 is gene transcript, T5 is cDNA (proved)
  Family 3: {T4}      -- T4 is cDNA only (unproved)

Within each family, the longest transcript (by total exon length) is selected as the representative. All others are considered redundant and removed from the alternative splicing analysis.

The remaining transcripts (after collapsing) constitute the set of genuine alternative splice variants for this gene.

3.4 Step 3 -- AS Code Generation

Goal: For each pair of alternative splice transcripts, compute a compact code that describes exactly how their splicing differs.

3.4.1 Algorithm

Given two transcripts A and B (both with introns, on the same strand):

Step 1: Collect all intron coordinates from both transcripts

Transcript A introns:  [1201, 2999, +]  [3501, 4999, +]
Transcript B introns:  [1201, 2499, +]  [3501, 4999, +]

Step 2: Cluster introns by genomic overlap

Introns are sorted by start position, then grouped into clusters by continuously extending the current cluster's range as long as each new intron overlaps it. Each cluster represents a region where splicing variation may occur.

Cluster 1: {[1201,2999], [1201,2499]}  -- overlapping introns
Cluster 2: {[3501,4999], [3501,4999]}  -- identical introns (no variation)

Step 3: Filter clusters

Keep only clusters where the cluster range has a containment relationship with BOTH transcripts' ranges (overlap type 3 or 4): either the transcript range contains the cluster (type 3), or the cluster contains the transcript range (type 4). This ensures we only analyze variation in regions relevant to both transcripts.

Step 4: Extract differential splice sites

Within each cluster, collect all intron boundary positions (starts and ends). A position is "alternative" if it appears in only one transcript's introns (not shared by both).

Cluster 1 boundaries:
  A: start=1201, end=2999
  B: start=1201, end=2499

  Shared:  1201 (both start here)
  Diff:    2999 (only A), 2499 (only B)

Step 5: Filter微小差异

If two differential positions are within a small distance (default: 3 bp), they are considered sequencing/alignment noise and removed.

Step 6: Assign AS Code

Sort the remaining differential positions by genomic coordinate (reverse for minus strand). Number them 1, 2, 3, ... and mark each as donor (^) or acceptor (-):

Sorted differential positions: 2499, 2999

Position 2499: This is the END of B's intron --> acceptor --> 1-
Position 2999: This is the END of A's intron --> acceptor --> 2-

For transcript A: position 2999 is present --> code "2-"
For transcript B: position 2499 is present --> code "1-"

AS Code: 2-,1-

The donor/acceptor determination depends on whether the position is an intron start (donor, ^) or intron end (acceptor, -), and on the strand:

Plus strand:
  Intron start position --> donor (^)
  Intron end position   --> acceptor (-)

Minus strand:
  Intron start position --> acceptor (-)  [because 5' end is the biological 3' end]
  Intron end position   --> donor (^)

Implementation note: on the minus strand, the script achieves this by reversing the sorted intron coordinate lists before code generation, rather than explicitly checking strand in the code-assignment loop. The loop always uses "even index = donor, odd index = acceptor", and the list reversal makes the biological 5'->3' order correct for minus strand.

3.4.2 AS Code Format

The full AS Code for a pair is written as:

code_A,code_B

Where each code is a concatenation of: position_number + type_symbol
  position_number: 1, 2, 3, ... (relative order in the event)
  type_symbol: ^ (donor) or - (acceptor)

If a transcript has no alternative sites in this event, its code is "0".

Examples:

1^,2^    -- A uses donor site 1, B uses donor site 2
1-,0     -- A uses acceptor site 1, B has no alternative sites (intron retention)
1^2-,3^4- -- A uses sites 1(donor) and 2(acceptor), B uses 3(donor) and 4(acceptor)

3.5 Step 4 -- AS Type Classification

The AS Code is mapped to one of six biological categories:

+----------+-----------+--------------------------------------------------+
| Type     | Code      | Biological Meaning                               |
+----------+-----------+--------------------------------------------------+
| ExonS    | 0,X^      | Exon Skipping: one transcript skips an exon      |
|          | X^,0      | that the other includes                          |
+----------+-----------+--------------------------------------------------+
| IntronR  | 0,X-      | Intron Retention: one transcript retains an      |
|          | X-,0      | intron that the other splices out                |
+----------+-----------+--------------------------------------------------+
| AltD     | 1^,2^     | Alternative Donor: two transcripts use different |
|          | 2^,1^     | 5' splice sites (donor sites)                    |
+----------+-----------+--------------------------------------------------+
| AltA     | 1-,2-     | Alternative Acceptor: two transcripts use        |
|          | 2-,1-     | different 3' splice sites (acceptor sites)       |
+----------+-----------+--------------------------------------------------+
| AltP     | X^Y-,     | Alternative Position: introns overlap but differ |
|          | W^Z-      | at both 5' and 3' ends                          |
+----------+-----------+--------------------------------------------------+
| Other    | (rest)    | Complex events not fitting above categories      |
+----------+-----------+--------------------------------------------------+

Classification rules in detail:

if one code is "0":
    if the other ends with "^":  --> ExonS  (the ^ marks a donor, meaning
                                         an exon-intron boundary that
                                         creates an extra exon)
    if the other ends with "-":  --> IntronR (the - marks an acceptor,
                                          meaning an intron-exon boundary
                                          that retains an intron)

elif codes are "1^,2^" or "2^,1^":  --> AltD

elif codes are "1-,2-" or "2-,1-":  --> AltA

elif both codes match pattern [1-4][\^-][1-4][\^-]:  --> AltP

else:  --> Other

4. AS Event Types -- Structural Visualization

4.1 Notation Guide

======   Exon (solid block)
------   Intron (line connecting exons)
  ^      Donor splice site (5' end of intron, exon->intron boundary)
  |      Acceptor splice site (3' end of intron, intron->exon boundary)
 -->     Strand direction

4.2 Exon Skipping (ExonS)

One transcript includes a cassette exon that the other skips entirely.

The skipped exon is flanked by two introns in the skipping transcript, but is spliced into the mRNA in the including transcript.

  Gene:   [===E1===]-----------[===E2===]-----------[===E3===]
                     ^donor              ^acceptor

  Transcript A (skipping):    Transcript B (including):
  [===E1===]---------------------------[===E3===]
               |      skipped E2       |
               |                       |

  [===E1===]-------[===E2===]--------[===E3===]

  AS Code: 0, 1-2^

  Explanation:
    Transcript A: no alternative sites --> code "0"
    Transcript B: acceptor at E1-E2 boundary (-), donor at E2-E3 boundary (^)
                  --> positions 1 and 2 --> code "1-2^"
    The code ends with "^" --> ExonS

4.3 Intron Retention (IntronR)

One transcript retains an intron that is spliced out in the other.

  Gene:   [===E1===]---[===E2===]---[===E3===]

  Transcript A (retained):       Transcript B (spliced):
  [===E1===]---[intr]---[===E3===]   [===E1===]-------[===E3===]
               |     on                     |
               |  intron                    |  intron spliced
               |  retained                  |  out
               |                            |

  Detailed view:

  Transcript A:
  [===E1===]---[==intron==]---[===E3===]
              intron retained as part of mRNA

  Transcript B:
  [===E1===]---------------------------[===E3===]
              intron spliced out

  AS Code: 0, 1^2-

  Explanation:
    Transcript A: no alternative sites (intron retained) --> code "0"
    Transcript B: donor at intron start (^), acceptor at intron end (-)
                  --> positions 1 and 2 --> code "1^2-"
    The code ends with "-" --> IntronR

  Key distinction from ExonS:
    IntronR: the "0" transcript LACKS the intron boundaries (intron retained)
             the non-zero transcript HAS intron boundaries (donor then acceptor: ^...-)
    ExonS:   the "0" transcript LACKS the exon boundaries (exon skipped)
             the non-zero transcript HAS exon boundaries (acceptor then donor: -...^)

4.4 Alternative Donor Site (AltD)

Two transcripts share the same acceptor site but use different donor sites, resulting in different 5' splice site positions.

  Gene:   [===E1===]---[===E2===]---[===E3===]

  Transcript A (proximal donor):  Transcript B (distal donor):

  [===E1===]---[==E2==]---[===E3===]    [===E1===]------[E2]---[===E3===]
              ^D1                    ^D2              ^D2

  Detailed view (zoom into variable region):

  Transcript A:                Transcript B:
  [===E1===]                   [===E1===]
       |  D1                        |         D2
       |  v                         |         v
       +--.....intron.....---+      +--..........intron..........---+
       |  ^donor_A           |      |  ^donor_B                    |
       |                     |      |                              |
  [==E2==]              [===E3===]  [E2]                     [===E3===]
       ^same acceptor                 ^same acceptor

  AS Code: 1^, 2^

  Explanation:
    Two different donor sites, numbered 1 and 2 by position
    Transcript A uses donor 1 (proximal) --> code "1^"
    Transcript B uses donor 2 (distal) --> code "2^"
    Both codes are single donors --> AltD

4.5 Alternative Acceptor Site (AltA)

Two transcripts share the same donor site but use different acceptor sites, resulting in different 3' splice site positions.

  Gene:   [===E1===]---[===E2===]---[===E3===]

  Transcript A (proximal acceptor):  Transcript B (distal acceptor):

  [===E1===]---[==E2==]---[===E3===]    [===E1===]---[E2]------[===E3===]
              ^same donor               ^same donor        ^
                                         A1                A2

  Detailed view (zoom into variable region):

  Transcript A:                Transcript B:
                        [===E3===]                    [===E3===]
                        ^A1 acceptor_A                ^A2 acceptor_B
                        |                             |
       +--.....intron.....---+      +--..........intron..........---+
       |  ^same donor        |      |  ^same donor                 |
       |                     |      |                              |
  [===E1===]            [==E2==]   [===E1===]                [E2]

  AS Code: 1-, 2-

  Explanation:
    Two different acceptor sites, numbered 1 and 2 by position
    Transcript A uses acceptor 1 (proximal) --> code "1-"
    Transcript B uses acceptor 2 (distal) --> code "2-"
    Both codes are single acceptors --> AltA

4.6 Alternative Position (AltP)

Two introns overlap but differ at BOTH the donor and acceptor sites.

  Gene:   [===E1===]---[===E2===]---[===E3===]---[===E4===]

  Transcript A:                        Transcript B:
  [===E1===]----[==E2==]---[===E3===]  [===E1===]---[E2]------[E3]---[===E4===]
              ^D_A       ^A_A                     ^D_B            ^A_B

  Detailed view (zoom into variable region):

  Transcript A:                Transcript B:
       ^D_A                        ^D_B
       |                           |
       +--.....intron_A.....---+   +--........intron_B..........---+
       |                     |   |                                |
       ^A_A                  ^A_B
       (same region, different boundaries at both ends)

  Overlap diagram:

  Transcript A intron:  |<------D_A..............A_A------>|
  Transcript B intron:     |<---D_B...............A_B------------>|
                              ^                           ^
                              different donor          different acceptor

  AS Code: 1^2-, 3^4-

  Explanation:
    4 alternative sites total, ordered by position:
      1: donor of A (^)       2: acceptor of A (-)
      3: donor of B (^)       4: acceptor of B (-)
    Transcript A: sites 1 and 2 --> code "1^2-"
    Transcript B: sites 3 and 4 --> code "3^4-"
    Both codes have 2 sites with mixed ^ and - --> AltP

4.7 Summary of All AS Types

  Type     Structure                          AS Code    Key Feature
  ------   --------------------------------   --------   -------------------
  ExonS    [E1]----------[E3]                 0, 1-2^    One side = 0,
            [E1]---[E2]--[E3]                            ends with ^

  IntronR  [E1]----------[E3]                 0, 1^2-    One side = 0,
            [E1]--[intr]--[E3]                            ends with -

  AltD     [E1]-D1--[E2]---[E3]              1^, 2^     Two donors
            [E1]-D2----[E2]-[E3]

  AltA     [E1]---[E2]-A1-[E3]               1-, 2-     Two acceptors
            [E1]---[E2]---A2-[E3]

  AltP     [E1]-D1..A1-[E2]-[E3]             1^2-,      Both ends differ
            [E1]-D2....A2--[E3]               3^4-

  Other    Complex multi-site variations      various    >4 sites or
                                                           mixed patterns

5. Coordinate Comparison Engine

The coordinate comparison is a fundamental operation used throughout the pipeline. Given two intervals [s1, e1] and [s2, e2], it determines their overlap relationship.

5.1 Six Overlap Types

  Type 1: No overlap (B before A)     Type 6: No overlap (A before B)

  s2----e2     s1----e1              s1----e1     s2----e2
       |  |    |      |                   |  |    |      |

  Type 2: B overlaps A's left         Type 5: B overlaps A's right

       s2----s1----e2----e1           s1----s2----e1----e2
       |     |    |     |             |     |    |     |
       |  overlap     |               |  overlap     |

  Type 3: B contains A               Type 4: A contains B

  s2----s1----------e1----e2         s1----s2----------e2----e1
  |     |            |     |         |     |            |     |
  |     |  A inside B  |             |     |  B inside A  |

  Overlap length:
    Type 1,6: 0
    Type 2:   e2 - s1 + 1
    Type 3:   e1 - s1 + 1  (= length of A)
    Type 4:   e2 - s2 + 1  (= length of B)
    Type 5:   e1 - s2 + 1

5.2 Usage in the Pipeline

Clustering: Check if cDNA range overlaps gene range (any type != 1,6)
Same-transcript detection: Check intron-by-intron overlap with coverage threshold (type 2-5, compute overlap / intron_length)
AS Code generation: Check if a cluster range is contained within both transcript ranges (type 3 or 4 means the cluster is internal to a transcript)
Gene range computation: Find the outermost coordinates across all exons

6. Output Specification

6.1 File List

Output directory structure:

  alternative.splice.result/
  |
  |-- transcript.cluster.list.txt      Gene -> cDNA cluster mapping
  |-- alternative.splice.list.txt      Gene -> AS transcript list
  |-- proved.transcript.list.txt       Gene transcript -> matched cDNAs
  |-- unproved.transcript.list.txt     cDNAs with no gene transcript match
  |-- unproved.gene.list.txt           Genes with no cDNA evidence
  |-- novel.gene.list.txt              cDNAs matching no known gene
  |-- error.orient.list.txt            Strand-conflict transcripts
  |
  |-- acceptor.list.txt                Acceptor site sequences (optional)
  |-- donor.list.txt                   Donor site sequences (optional)
  |
  |-- splice.ascode.list.txt           AS Code details (GTF-like format)
  |-- splice.ascode.stat.txt           AS type statistics (optional)
  |
  |-- gene.cluster.picture/            SVG visualization (optional)
      |-- GENEID.svg

6.2 AS Code Output Format

GTF-like format with custom attributes:

chr1  Undefined  as_event  1201  2999  .  +  .  transcript_id T1,T2; gene_id G1; \
    structure 2-,1-; splice_chain 2999-,2499-

Fields:

col 1: chromosome/scaffold
col 2: "Undefined" (placeholder)
col 3: "as_event"
col 4-5: AS event genomic range (start, end)
col 6: "."
col 7: strand
col 8: "."
col 9: attributes
- transcript_id: the two compared transcripts
- gene_id: parent gene
- structure: AS Code (code_A,code_B)
- splice_chain: coordinate-level code (position_A,position_B)

6.3 AS Type Statistics Format

AS_Number    ExonS      1523
AS_Number    IntronR     847
AS_Number    AltD       2105
AS_Number    AltA       1893
AS_Number    AltP        672
AS_Number    Other       234

###########################

Gene_Number  ExonS       892
Gene_Number  IntronR     501
Gene_Number  AltD       1203
Gene_Number  AltA       1067
Gene_Number  AltP        398
Gene_Number  Other       156

###########################

Code_Number  0,1-2^      892
Code_Number  0,1^2-      501
Code_Number  1^,2^      1203
Code_Number  1-,2-      1067
Code_Number  1^2-,3^4-   398
...

AS_Number: total count of each AS event type
Gene_Number: number of unique genes involved in each type
Code_Number: count of each distinct AS Code pattern

7. Complete Algorithm Pseudocode

7.1 Main Pipeline

FUNCTION main(genome_fasta, gene_gtf, cdna_gff3, params):

    genome = parse_fasta(genome_fasta)              // {chr: sequence}

    gene = parse_gtf(gene_gtf, feature='exon')      // extract exon coords
    gene.introns = derive_introns(gene.exons, min_length=params.min_intron)
    gene.ranges = compute_gene_ranges(gene.exons)
    gene.splice_seqs = extract_splice_sites(gene.introns, genome)

    cdna = parse_gff3(cdna_gff3, feature='cDNA_match')
    cdna.introns = derive_introns(cdna.exons, min_length=params.min_intron)
    cdna.ranges = compute_gene_ranges(cdna.exons)
    cdna.splice_seqs = extract_splice_sites(cdna.introns, genome)

    // Step 1: Cluster cDNAs to genes
    clusters = {}
    FOR each geneID, gene_range IN gene.ranges:
        FOR each cdnaID, cdna_range IN cdna.ranges:
            IF overlap(gene_range, cdna_range) > 0:
                clusters[geneID].add(cdnaID)
                clusters[geneID].add(all_transcripts_of(geneID))

    // Step 2: Detect same transcripts
    FOR each geneID IN clusters:
        G = empty_graph()
        FOR each pair (ti, tj) IN clusters[geneID]:
            IF is_same_transcript(ti, tj, params.coverage):
                G.add_edge(ti, tj)
        families = connected_components(G)
        FOR each family IN families:
            representative = longest_transcript(family)
            alternative_splice[geneID].add(representative)

    // Step 3: Generate AS codes
    FOR each geneID IN alternative_splice:
        FOR each pair (ti, tj) IN alternative_splice[geneID]:
            as_code = compute_as_code(ti, tj, params)
            IF as_code is not empty:
                store as_code for (ti, tj)

    // Step 4: Classify AS types
    FOR each as_code IN all_as_codes:
        as_type = classify_as_code(as_code)
        count[as_type] += 1
        genes[as_type].add(geneID)

    OUTPUT all results

7.2 Same-Transcript Detection

FUNCTION is_same_transcript(transcript_a, transcript_b, coverage):
    IF strand(a) != strand(b): RETURN FALSE
    // splice site check: both have records or both do not (presence only,
    // NOT canonical/non-canonical comparison)
    IF splice_seq_presence_mismatch(a, b): RETURN FALSE

    introns_a = sorted(a.introns)
    introns_b = sorted(b.introns)
    range_a = (a.first_exon.start, a.last_exon.end)
    range_b = (b.first_exon.start, b.last_exon.end)

    // Nested loop: for each intron in A, scan all introns in B for a match
    FOR i IN range(len(introns_a)):
        FOR j IN range(len(introns_b)):
            overlap = compute_overlap(introns_a[i], introns_b[j])
            IF overlap / length(introns_a[i]) > coverage AND
               overlap / length(introns_b[j]) > coverage:
                // Match found, extend-check remaining introns
                RETURN extend_overlap(introns_a[i+1:], introns_b[j+1:],
                                      range_a, range_b)
            IF overlap / length(introns_a[i]) > 1-coverage OR
               overlap / length(introns_b[j]) > 1-coverage:
                RETURN FALSE  // too different
            IF introns_a[i] ends before introns_b[j] AND
               introns_a[i] contained in range_b: RETURN FALSE
            IF introns_b[j] ends before introns_a[i] AND
               introns_b[j] contained in range_a: RETURN FALSE
    RETURN FALSE

FUNCTION extend_overlap(remaining_a, remaining_b, range_a, range_b):
    i = 0
    WHILE i < min(len(remaining_a), len(remaining_b)):
        overlap = compute_overlap(remaining_a[i], remaining_b[i])
        IF overlap / length(remaining_a[i]) < coverage: RETURN FALSE
        IF overlap / length(remaining_b[i]) < coverage: RETURN FALSE
        i += 1

    // Check remaining introns (containment in other's range => different)
    IF len(remaining_a) > len(remaining_b):
        IF intron_contained_in_range(remaining_a[i], range_b): RETURN FALSE
    ELSE IF len(remaining_b) > len(remaining_a):
        IF intron_contained_in_range(remaining_b[i], range_a): RETURN FALSE

    RETURN TRUE

7.3 AS Code Generation

FUNCTION compute_as_code(transcript_a, transcript_b, params):
    all_introns = introns_a + introns_b
    clusters = cluster_overlapping_introns(all_introns)

    code_a = ""
    code_b = ""

    FOR each cluster IN clusters:
        IF cluster NOT internal_to_both(cluster, range_a, range_b): CONTINUE

        boundaries = collect_all_boundaries(cluster)
        diff_sites = boundaries appearing in only one transcript
        diff_sites = filter_small_differences(diff_sites, params.asvaryedge)

        IF diff_sites is empty: CONTINUE

        SORT diff_sites by position (reverse if minus strand)

        site_number = 1
        FOR each site IN diff_sites:
            symbol = "^" if site is donor else "-"
            IF site belongs to transcript_a:
                code_a += str(site_number) + symbol
            ELSE:
                code_b += str(site_number) + symbol
            site_number += 1

    IF code_a is empty: code_a = "0"
    IF code_b is empty: code_b = "0"

    RETURN (code_a, code_b)

7.4 AS Type Classification

FUNCTION classify_as_code(code_a, code_b):
    IF code_a == "0" OR code_b == "0":
        non_zero = code_a if code_b == "0" else code_b
        IF non_zero ends with "^": RETURN "ExonS"
        IF non_zero ends with "-": RETURN "IntronR"

    IF (code_a == "1^" AND code_b == "2^") OR
       (code_a == "2^" AND code_b == "1^"):
        RETURN "AltD"

    IF (code_a == "1-" AND code_b == "2-") OR
       (code_a == "2-" AND code_b == "1-"):
        RETURN "AltA"

    IF both codes match pattern /^[1-4][\^-][1-4][\^-]$/:
        RETURN "AltP"

    RETURN "Other"

8. Key Parameters and Their Effects

+-------------------+---------+-------------------------------------------+
| Parameter         | Default | Effect                                    |
+-------------------+---------+-------------------------------------------+
| coverage          | 0.9     | Minimum overlap fraction for two introns  |
|                   |         | to be considered "matching". Lower values |
|                   |         | are more permissive (more transcripts     |
|                   |         | collapsed as "same").                     |
+-------------------+---------+-------------------------------------------+
| minintronlength   | 9       | Introns shorter than this are removed and |
|                   |         | flanking exons merged. Prevents spurious  |
|                   |         | introns from alignment gaps.              |
+-------------------+---------+-------------------------------------------+
| asvaryedge        | 3       | Alternative splice sites within this many |
|                   |         | bp of each other are filtered out.        |
|                   |         | Reduces noise from imprecise alignments.  |
+-------------------+---------+-------------------------------------------+
| canonical         | FALSE   | If TRUE, only consider introns with       |
|                   |         | canonical splice sites (GT/AG or GC/AG).  |
|                   |         | Filters out non-canonical events.         |
+-------------------+---------+-------------------------------------------+
| type              | exon    | GTF feature type to parse. "exon" for     |
|                   |         | full transcript structure, "CDS" for      |
|                   |         | coding-region-only analysis.              |
+-------------------+---------+-------------------------------------------+

9. Worked Example

9.1 Input

Gene G1 on chr1, plus strand, with three transcripts:

T1 (gene):  [100, 200] [500, 700] [900, 1100]
              exon1       exon2       exon3

T2 (gene):  [100, 200] [500, 700]
              exon1       exon2

T3 (cDNA):  [100, 200] [600, 700] [900, 1100]
              exon1       exon2       exon3

9.2 Step 1: Derive Introns

T1 introns:  [201, 499, +]  [701, 899, +]
T2 introns:  [201, 499, +]
T3 introns:  [201, 599, +]  [701, 899, +]

9.3 Step 2: Cluster

All three overlap with gene G1 range [100, 1100], so cluster = {T1, T2, T3}.

9.4 Step 3: Same-Transcript Detection

Compare T1 vs T2:
  T1 has 2 introns, T2 has 1 intron
  Intron 1 matches (both [201, 499])
  T1's second intron [701, 899] vs T2's range [100, 700]:
    701 > 700, so the intron is NOT within T2's range (overlap type 1)
    --> T1's extra intron is outside the shared region --> SAME isoform

Compare T1 vs T3:
  Intron 1: T1=[201,499], T3=[201,599]
    overlap = 499-201+1 = 299
    coverage_A = 299/299 = 1.0 >= 0.9
    coverage_B = 299/399 = 0.75 < 0.9 --> NOT same

Compare T2 vs T3:
  T2 has 1 intron, T3 has 2 introns
  First intron: T2=[201,499], T3=[201,599]
    coverage_T3 = 299/399 = 0.75, which is > (1-0.9)=0.1 but < 0.9
    --> too different to be same --> NOT same

Graph: T1 ---- T2 (same), T3 (isolated)
Families: {T1, T2}, {T3}
Representative of {T1, T2}: T1 (longer, 3 exons)
Alternative splice transcripts: T1, T3

9.5 Step 4: AS Code for T1 vs T3

All introns: [201,499,+], [701,899,+], [201,599,+], [701,899,+]

Cluster by overlap:
  Cluster [201, 599]: {[201,499], [201,599]}  -- T1 intron1 and T3 intron1 overlap
  Cluster [701, 899]: {[701,899], [701,899]}  -- identical, no variation

Process Cluster [201, 599]:
  Boundaries: 201(T1,T3), 499(T1), 599(T3)
  Shared: 201
  Differential: 499 (T1 only), 599 (T3 only)
  Filter: |599-499| = 100 > 3, keep both

  Sort by position: 499, 599
  Position 499: end of T1's intron --> acceptor --> 1-
  Position 599: end of T3's intron --> acceptor --> 2-

  T1 code: "1-"
  T3 code: "2-"

AS Code: 1-,2- --> AltA (Alternative Acceptor)

This makes biological sense: T1 and T3 share the same donor site at position 201, but T1's intron ends at 499 while T3's intron ends at 599, meaning T3 uses a more distal acceptor site.

  T1:  [===E1===]---[===E2===]---[===E3===]
               ^D              ^A1

  T3:  [===E1===]------[==E2==]---[===E3===]
               ^D                 ^A2

  Same donor (^D), different acceptors (A1 vs A2) --> AltA