1. Biological Background
1.1 What is Alternative Splicing
In eukaryotes, a single gene can produce multiple mRNA isoforms through alternative splicing (AS) -- the process by which different combinations of exons are joined during pre-mRNA splicing. This greatly expands proteomic diversity without increasing gene count.
A gene with N exons can theoretically produce up to 2^(N-1) splice variants. In practice, most genes produce 2-10 isoforms, but some (e.g., Drosophila Dscam) can produce tens of thousands.
1.2 Why Computational Analysis
High-throughput transcriptome sequencing (RNA-seq, full-length cDNA sequencing) produces thousands of transcript alignments against a reference genome. Manually inspecting each gene for splicing variation is infeasible. A computational pipeline is needed to:
- Cluster transcripts that belong to the same gene locus
- Identify which transcripts are genuine splice variants vs. artifacts
- Classify the type of each splicing variation
- Quantify and summarize AS events across the genome
1.3 The AS Code Concept
The AS Code system (Sammeth et al., 2008) provides a compact, unambiguous notation for describing any alternative splicing event. The key insight is:
For any pair of overlapping transcripts, the splice sites that differ between them define the AS event. By numbering these differential sites and marking each as donor (^) or acceptor (-), we obtain a code that uniquely describes the event structure.
This code enables systematic classification of AS events into biologically meaningful categories.
2. Input Data and Preprocessing
2.1 Required Input Files
+-------------------+------------------+-----------------------------------+
| File | Format | Content |
+-------------------+------------------+-----------------------------------+
| Genome | FASTA | Reference genome sequences |
| Gene annotation | GTF | Gene models with exon coordinates |
| cDNA alignment | GFF3 (cDNA_match)| Full-length cDNA alignments |
+-------------------+------------------+-----------------------------------+
The gene annotation provides the reference transcript structures. The cDNA alignment provides experimentally observed transcript structures that may reveal novel splice variants not in the annotation.
2.2 Parsing Gene/CDNA Annotations
Both GTF (gene) and GFF3 (cDNA) files are parsed to extract exon coordinates for each transcript:
GTF line: chr1 ensembl exon 1000 1200 . + . gene_id "G1"; transcript_id "T1";
GFF3 line: chr1 est cDNA_match 1000 1200 . + . ID=T2;
Extracted:
transcript T1 on chr1: exon at [1000, 1200], strand +
transcript T2 on chr1: match at [1000, 1200], strand +
Each transcript is stored as an ordered list of exon coordinates:
Transcript T1: [[1000, 1200, '+'], [3000, 3500, '+'], [5000, 5500, '+']]
exon1 exon2 exon3
2.3 Deriving Introns from Exons
Introns are the gaps between consecutive exons on the same transcript. For a transcript with exons sorted by start position:
Exons: [1000, 1200] [3000, 3500] [5000, 5500]
exon1 exon2 exon3
| | |
Introns: | [1201, 2999] | [3501, 4999] |
| intron1 | intron2 |
Rule: intron = [exon_i.end + 1, exon_(i+1).start - 1, strand]
Short introns (default: < 9 bp) are filtered out, and the flanking exons are merged into a single exon. This handles alignment artifacts:
Before merge: [1000, 1200] [1205, 1300] [3000, 3500]
exon1 short intron exon2
After merge: [1000, 1300] [3000, 3500]
merged exon1 exon2
2.4 Extracting Splice Site Sequences
For each intron, the flanking 6 bp sequences are extracted from the genome:
Genome: ...ATCGAT|GT .....intron..... AG|GCTAGC...
donor acceptor
(6bp) (6bp)
Donor site: genome[intron.start-1 : intron.start-1+6] = "GT...."
Acceptor site: genome[intron.end-6 : intron.end] = "....AG"
On the minus strand, positions are the same but sequences are reverse- complemented, and donor/acceptor labels are swapped:
Minus strand: ...ATCGAT|GT .....intron..... AG|GCTAGC...
(genomic 5'->3')
Biology reads: 3'<-AG....|.....intron.....|CT....AC<-5'
acceptor donor
Canonical splice sites are GT/AG (major) and GC/AG (minor).
2.5 Computing Gene Ranges
For each gene, the genomic span is the range from the leftmost exon start to the rightmost exon end across all its transcripts:
Gene G1 transcripts:
T1: [1000, 1200] [3000, 3500] [5000, 5500]
T2: [1000, 1200] [3000, 3500]
T3: [1100, 1200] [3000, 3500] [5000, 5800]
Gene range: [1000, 5800, '+']
^first start ^last end
3. Core Workflow
3.1 Overview
+-----------------+
| Genome (FASTA) |
+--------+--------+
|
v
+-----------------+ +-----------------+ +-----------------+
| Gene GTF | | cDNA GFF3 | | Parse FASTA |
| (reference) | | (evidence) | | -> {chr: seq} |
+--------+--------+ +--------+--------+ +--------+--------+
| | |
v v |
Parse GTF Parse GFF3 |
Extract exons Extract matches |
Derive introns Derive introns |
Get gene ranges Get gene ranges |
Get splice seqs Get splice seqs <-----------+
| |
+-----------+-----------+
|
v
+---------------------+ +---------------------+
| STEP 1: Clustering | | STEP 2: Same- |
+ Gene-cDNA +--------->+ transcript detect. |
| overlap | | (graph clustering) |
+----------+----------+ +----------+----------+
|
v
+---------------------+ +---------------------+
| STEP 4: | | STEP 3: |
+ AS Type +<---------+ AS Code +
| classification | | generation |
+----------+----------+ +----------+----------+
|
v
+------------------------------------------------------+
| Output results + SVG visualization |
+------------------------------------------------------+
3.2 Step 1 -- Transcript Clustering
Goal: Assign each cDNA transcript to the gene it likely originates from.
Method: Check whether the cDNA's genomic range overlaps with any gene's range.
Gene G1 range: [1000, 5800]
cDNA T4 range: [1050, 5500] --> OVERLAP --> T4 joins cluster G1
cDNA T5 range: [7000, 9000] --> NO OVERLAP --> T5 is a novel gene
Gene G2 range: [6000, 8000]
cDNA T5 range: [7000, 9000] --> OVERLAP --> T5 joins cluster G2
For each gene cluster, we collect:
- All reference gene transcripts belonging to this gene
- All cDNA transcripts overlapping this gene's range
Additional checks:
- If a cDNA's strand conflicts with the gene's strand, it is flagged as an orientation error
- Genes with no overlapping cDNAs are flagged as "unproved"
- cDNAs overlapping no gene are flagged as "novel genes"
3.3 Step 2 -- Same-Transcript Detection
Goal: Within each gene cluster, identify which transcripts are actually the same splice variant (and should be collapsed), leaving only genuine alternative splicing differences.
This is the most algorithmically complex step. It uses graph-based clustering.
3.3.1 The Problem
A gene cluster may contain many transcripts from both gene annotation and cDNA evidence. Some of these represent the same splice isoform (e.g., two cDNA sequences that happen to align to the same exon-intron structure), while others represent genuinely different isoforms.
Gene G1 cluster contains: T1(gene), T2(gene), T3(cDNA), T4(cDNA), T5(cDNA)
Possible scenario:
T1 and T3 are the same isoform (just from different sources)
T2 and T5 are the same isoform
T4 is a unique isoform
We need to detect this automatically.
3.3.2 Pairwise Comparison Algorithm
Two transcripts are considered "the same isoform" if their intron structures are sufficiently similar. The comparison proceeds as follows:
Step A: Check strand consistency If the two transcripts are on opposite strands, they cannot be the same.
Step B: Check splice site sequence availability The script checks whether splice site sequence information is available for both transcripts (both have spliceSeq records, or both do not). It does NOT compare canonical vs non-canonical status -- only presence of the records matters.
Step C: Intron overlap check
The script uses nested loops to find a matching intron pair: for each intron in A, scan all introns in B to find one that overlaps with sufficient coverage. Once a match is found, the remaining introns are compared sequentially via extension check.
Transcript A introns: I1a I2a I3a
Transcript B introns: I1b I2b
For each Iia in A, scan all Ijb in B:
- If Iia and Ijb overlap with coverage >= threshold (default 0.9) on BOTH sides
--> match found, proceed to extension check on remaining introns
- If they overlap with coverage > (1 - threshold) on one side
--> too different to be the same, return FALSE
- If Iia ends before Ijb starts and Iia is contained within B's gene range
--> A has an extra intron that B doesn't, return FALSE
Step D: Extension check
When one transcript runs out of introns before the other:
Case 1: A has more introns than B (B exhausted first)
Remaining introns of A must NOT be contained within B's exon range
(otherwise A has extra splicing that B lacks --> different isoforms)
Case 2: B has more introns than A (A exhausted first)
Remaining introns of B must NOT be contained within A's exon range
(otherwise B has extra splicing that A lacks --> different isoforms)
Case 3: Equal number of introns, all matched
--> Same isoform
3.3.3 Graph Construction and Clustering
Build an undirected graph where:
- Each node = one transcript
- An edge connects two transcripts if they are "the same isoform" (per above)
Connected components of this graph represent families of identical isoforms:
Graph for Gene G1:
T1 ---- T3 T2 ---- T5 T4
(connected) (connected) (isolated)
Connected components:
Family 1: {T1, T3} -- T1 is gene transcript, T3 is cDNA (proved)
Family 2: {T2, T5} -- T2 is gene transcript, T5 is cDNA (proved)
Family 3: {T4} -- T4 is cDNA only (unproved)
Within each family, the longest transcript (by total exon length) is selected as the representative. All others are considered redundant and removed from the alternative splicing analysis.
The remaining transcripts (after collapsing) constitute the set of genuine alternative splice variants for this gene.
3.4 Step 3 -- AS Code Generation
Goal: For each pair of alternative splice transcripts, compute a compact code that describes exactly how their splicing differs.
3.4.1 Algorithm
Given two transcripts A and B (both with introns, on the same strand):
Step 1: Collect all intron coordinates from both transcripts
Transcript A introns: [1201, 2999, +] [3501, 4999, +]
Transcript B introns: [1201, 2499, +] [3501, 4999, +]
Step 2: Cluster introns by genomic overlap
Introns are sorted by start position, then grouped into clusters by continuously extending the current cluster's range as long as each new intron overlaps it. Each cluster represents a region where splicing variation may occur.
Cluster 1: {[1201,2999], [1201,2499]} -- overlapping introns
Cluster 2: {[3501,4999], [3501,4999]} -- identical introns (no variation)
Step 3: Filter clusters
Keep only clusters where the cluster range has a containment relationship with BOTH transcripts' ranges (overlap type 3 or 4): either the transcript range contains the cluster (type 3), or the cluster contains the transcript range (type 4). This ensures we only analyze variation in regions relevant to both transcripts.
Step 4: Extract differential splice sites
Within each cluster, collect all intron boundary positions (starts and ends). A position is "alternative" if it appears in only one transcript's introns (not shared by both).
Cluster 1 boundaries:
A: start=1201, end=2999
B: start=1201, end=2499
Shared: 1201 (both start here)
Diff: 2999 (only A), 2499 (only B)
Step 5: Filter微小差异
If two differential positions are within a small distance (default: 3 bp), they are considered sequencing/alignment noise and removed.
Step 6: Assign AS Code
Sort the remaining differential positions by genomic coordinate (reverse for minus strand). Number them 1, 2, 3, ... and mark each as donor (^) or acceptor (-):
Sorted differential positions: 2499, 2999
Position 2499: This is the END of B's intron --> acceptor --> 1-
Position 2999: This is the END of A's intron --> acceptor --> 2-
For transcript A: position 2999 is present --> code "2-"
For transcript B: position 2499 is present --> code "1-"
AS Code: 2-,1-
The donor/acceptor determination depends on whether the position is an intron start (donor, ^) or intron end (acceptor, -), and on the strand:
Plus strand:
Intron start position --> donor (^)
Intron end position --> acceptor (-)
Minus strand:
Intron start position --> acceptor (-) [because 5' end is the biological 3' end]
Intron end position --> donor (^)
Implementation note: on the minus strand, the script achieves this by reversing the sorted intron coordinate lists before code generation, rather than explicitly checking strand in the code-assignment loop. The loop always uses "even index = donor, odd index = acceptor", and the list reversal makes the biological 5'->3' order correct for minus strand.
3.4.2 AS Code Format
The full AS Code for a pair is written as:
code_A,code_B
Where each code is a concatenation of: position_number + type_symbol
position_number: 1, 2, 3, ... (relative order in the event)
type_symbol: ^ (donor) or - (acceptor)
If a transcript has no alternative sites in this event, its code is "0".
Examples:
1^,2^ -- A uses donor site 1, B uses donor site 2
1-,0 -- A uses acceptor site 1, B has no alternative sites (intron retention)
1^2-,3^4- -- A uses sites 1(donor) and 2(acceptor), B uses 3(donor) and 4(acceptor)
3.5 Step 4 -- AS Type Classification
The AS Code is mapped to one of six biological categories:
+----------+-----------+--------------------------------------------------+
| Type | Code | Biological Meaning |
+----------+-----------+--------------------------------------------------+
| ExonS | 0,X^ | Exon Skipping: one transcript skips an exon |
| | X^,0 | that the other includes |
+----------+-----------+--------------------------------------------------+
| IntronR | 0,X- | Intron Retention: one transcript retains an |
| | X-,0 | intron that the other splices out |
+----------+-----------+--------------------------------------------------+
| AltD | 1^,2^ | Alternative Donor: two transcripts use different |
| | 2^,1^ | 5' splice sites (donor sites) |
+----------+-----------+--------------------------------------------------+
| AltA | 1-,2- | Alternative Acceptor: two transcripts use |
| | 2-,1- | different 3' splice sites (acceptor sites) |
+----------+-----------+--------------------------------------------------+
| AltP | X^Y-, | Alternative Position: introns overlap but differ |
| | W^Z- | at both 5' and 3' ends |
+----------+-----------+--------------------------------------------------+
| Other | (rest) | Complex events not fitting above categories |
+----------+-----------+--------------------------------------------------+
Classification rules in detail:
if one code is "0":
if the other ends with "^": --> ExonS (the ^ marks a donor, meaning
an exon-intron boundary that
creates an extra exon)
if the other ends with "-": --> IntronR (the - marks an acceptor,
meaning an intron-exon boundary
that retains an intron)
elif codes are "1^,2^" or "2^,1^": --> AltD
elif codes are "1-,2-" or "2-,1-": --> AltA
elif both codes match pattern [1-4][\^-][1-4][\^-]: --> AltP
else: --> Other
4. AS Event Types -- Structural Visualization
4.1 Notation Guide
====== Exon (solid block)
------ Intron (line connecting exons)
^ Donor splice site (5' end of intron, exon->intron boundary)
| Acceptor splice site (3' end of intron, intron->exon boundary)
--> Strand direction
4.2 Exon Skipping (ExonS)
One transcript includes a cassette exon that the other skips entirely.
The skipped exon is flanked by two introns in the skipping transcript, but is spliced into the mRNA in the including transcript.
Gene: [===E1===]-----------[===E2===]-----------[===E3===]
^donor ^acceptor
Transcript A (skipping): Transcript B (including):
[===E1===]---------------------------[===E3===]
| skipped E2 |
| |
[===E1===]-------[===E2===]--------[===E3===]
AS Code: 0, 1-2^
Explanation:
Transcript A: no alternative sites --> code "0"
Transcript B: acceptor at E1-E2 boundary (-), donor at E2-E3 boundary (^)
--> positions 1 and 2 --> code "1-2^"
The code ends with "^" --> ExonS
4.3 Intron Retention (IntronR)
One transcript retains an intron that is spliced out in the other.
Gene: [===E1===]---[===E2===]---[===E3===]
Transcript A (retained): Transcript B (spliced):
[===E1===]---[intr]---[===E3===] [===E1===]-------[===E3===]
| on |
| intron | intron spliced
| retained | out
| |
Detailed view:
Transcript A:
[===E1===]---[==intron==]---[===E3===]
intron retained as part of mRNA
Transcript B:
[===E1===]---------------------------[===E3===]
intron spliced out
AS Code: 0, 1^2-
Explanation:
Transcript A: no alternative sites (intron retained) --> code "0"
Transcript B: donor at intron start (^), acceptor at intron end (-)
--> positions 1 and 2 --> code "1^2-"
The code ends with "-" --> IntronR
Key distinction from ExonS:
IntronR: the "0" transcript LACKS the intron boundaries (intron retained)
the non-zero transcript HAS intron boundaries (donor then acceptor: ^...-)
ExonS: the "0" transcript LACKS the exon boundaries (exon skipped)
the non-zero transcript HAS exon boundaries (acceptor then donor: -...^)
4.4 Alternative Donor Site (AltD)
Two transcripts share the same acceptor site but use different donor sites, resulting in different 5' splice site positions.
Gene: [===E1===]---[===E2===]---[===E3===]
Transcript A (proximal donor): Transcript B (distal donor):
[===E1===]---[==E2==]---[===E3===] [===E1===]------[E2]---[===E3===]
^D1 ^D2 ^D2
Detailed view (zoom into variable region):
Transcript A: Transcript B:
[===E1===] [===E1===]
| D1 | D2
| v | v
+--.....intron.....---+ +--..........intron..........---+
| ^donor_A | | ^donor_B |
| | | |
[==E2==] [===E3===] [E2] [===E3===]
^same acceptor ^same acceptor
AS Code: 1^, 2^
Explanation:
Two different donor sites, numbered 1 and 2 by position
Transcript A uses donor 1 (proximal) --> code "1^"
Transcript B uses donor 2 (distal) --> code "2^"
Both codes are single donors --> AltD
4.5 Alternative Acceptor Site (AltA)
Two transcripts share the same donor site but use different acceptor sites, resulting in different 3' splice site positions.
Gene: [===E1===]---[===E2===]---[===E3===]
Transcript A (proximal acceptor): Transcript B (distal acceptor):
[===E1===]---[==E2==]---[===E3===] [===E1===]---[E2]------[===E3===]
^same donor ^same donor ^
A1 A2
Detailed view (zoom into variable region):
Transcript A: Transcript B:
[===E3===] [===E3===]
^A1 acceptor_A ^A2 acceptor_B
| |
+--.....intron.....---+ +--..........intron..........---+
| ^same donor | | ^same donor |
| | | |
[===E1===] [==E2==] [===E1===] [E2]
AS Code: 1-, 2-
Explanation:
Two different acceptor sites, numbered 1 and 2 by position
Transcript A uses acceptor 1 (proximal) --> code "1-"
Transcript B uses acceptor 2 (distal) --> code "2-"
Both codes are single acceptors --> AltA
4.6 Alternative Position (AltP)
Two introns overlap but differ at BOTH the donor and acceptor sites.
Gene: [===E1===]---[===E2===]---[===E3===]---[===E4===]
Transcript A: Transcript B:
[===E1===]----[==E2==]---[===E3===] [===E1===]---[E2]------[E3]---[===E4===]
^D_A ^A_A ^D_B ^A_B
Detailed view (zoom into variable region):
Transcript A: Transcript B:
^D_A ^D_B
| |
+--.....intron_A.....---+ +--........intron_B..........---+
| | | |
^A_A ^A_B
(same region, different boundaries at both ends)
Overlap diagram:
Transcript A intron: |<------D_A..............A_A------>|
Transcript B intron: |<---D_B...............A_B------------>|
^ ^
different donor different acceptor
AS Code: 1^2-, 3^4-
Explanation:
4 alternative sites total, ordered by position:
1: donor of A (^) 2: acceptor of A (-)
3: donor of B (^) 4: acceptor of B (-)
Transcript A: sites 1 and 2 --> code "1^2-"
Transcript B: sites 3 and 4 --> code "3^4-"
Both codes have 2 sites with mixed ^ and - --> AltP
4.7 Summary of All AS Types
Type Structure AS Code Key Feature
------ -------------------------------- -------- -------------------
ExonS [E1]----------[E3] 0, 1-2^ One side = 0,
[E1]---[E2]--[E3] ends with ^
IntronR [E1]----------[E3] 0, 1^2- One side = 0,
[E1]--[intr]--[E3] ends with -
AltD [E1]-D1--[E2]---[E3] 1^, 2^ Two donors
[E1]-D2----[E2]-[E3]
AltA [E1]---[E2]-A1-[E3] 1-, 2- Two acceptors
[E1]---[E2]---A2-[E3]
AltP [E1]-D1..A1-[E2]-[E3] 1^2-, Both ends differ
[E1]-D2....A2--[E3] 3^4-
Other Complex multi-site variations various >4 sites or
mixed patterns
5. Coordinate Comparison Engine
The coordinate comparison is a fundamental operation used throughout the pipeline. Given two intervals [s1, e1] and [s2, e2], it determines their overlap relationship.
5.1 Six Overlap Types
Type 1: No overlap (B before A) Type 6: No overlap (A before B)
s2----e2 s1----e1 s1----e1 s2----e2
| | | | | | | |
Type 2: B overlaps A's left Type 5: B overlaps A's right
s2----s1----e2----e1 s1----s2----e1----e2
| | | | | | | |
| overlap | | overlap |
Type 3: B contains A Type 4: A contains B
s2----s1----------e1----e2 s1----s2----------e2----e1
| | | | | | | |
| | A inside B | | | B inside A |
Overlap length:
Type 1,6: 0
Type 2: e2 - s1 + 1
Type 3: e1 - s1 + 1 (= length of A)
Type 4: e2 - s2 + 1 (= length of B)
Type 5: e1 - s2 + 1
5.2 Usage in the Pipeline
- Clustering: Check if cDNA range overlaps gene range (any type != 1,6)
- Same-transcript detection: Check intron-by-intron overlap with coverage threshold (type 2-5, compute overlap / intron_length)
- AS Code generation: Check if a cluster range is contained within both transcript ranges (type 3 or 4 means the cluster is internal to a transcript)
- Gene range computation: Find the outermost coordinates across all exons
6. Output Specification
6.1 File List
Output directory structure:
alternative.splice.result/
|
|-- transcript.cluster.list.txt Gene -> cDNA cluster mapping
|-- alternative.splice.list.txt Gene -> AS transcript list
|-- proved.transcript.list.txt Gene transcript -> matched cDNAs
|-- unproved.transcript.list.txt cDNAs with no gene transcript match
|-- unproved.gene.list.txt Genes with no cDNA evidence
|-- novel.gene.list.txt cDNAs matching no known gene
|-- error.orient.list.txt Strand-conflict transcripts
|
|-- acceptor.list.txt Acceptor site sequences (optional)
|-- donor.list.txt Donor site sequences (optional)
|
|-- splice.ascode.list.txt AS Code details (GTF-like format)
|-- splice.ascode.stat.txt AS type statistics (optional)
|
|-- gene.cluster.picture/ SVG visualization (optional)
|-- GENEID.svg
6.2 AS Code Output Format
GTF-like format with custom attributes:
chr1 Undefined as_event 1201 2999 . + . transcript_id T1,T2; gene_id G1; \
structure 2-,1-; splice_chain 2999-,2499-
Fields:
- col 1: chromosome/scaffold
- col 2: "Undefined" (placeholder)
- col 3: "as_event"
- col 4-5: AS event genomic range (start, end)
- col 6: "."
- col 7: strand
- col 8: "."
- col 9: attributes
- transcript_id: the two compared transcripts
- gene_id: parent gene
- structure: AS Code (code_A,code_B)
- splice_chain: coordinate-level code (position_A,position_B)
6.3 AS Type Statistics Format
AS_Number ExonS 1523
AS_Number IntronR 847
AS_Number AltD 2105
AS_Number AltA 1893
AS_Number AltP 672
AS_Number Other 234
###########################
Gene_Number ExonS 892
Gene_Number IntronR 501
Gene_Number AltD 1203
Gene_Number AltA 1067
Gene_Number AltP 398
Gene_Number Other 156
###########################
Code_Number 0,1-2^ 892
Code_Number 0,1^2- 501
Code_Number 1^,2^ 1203
Code_Number 1-,2- 1067
Code_Number 1^2-,3^4- 398
...
- AS_Number: total count of each AS event type
- Gene_Number: number of unique genes involved in each type
- Code_Number: count of each distinct AS Code pattern
7. Complete Algorithm Pseudocode
7.1 Main Pipeline
FUNCTION main(genome_fasta, gene_gtf, cdna_gff3, params):
genome = parse_fasta(genome_fasta) // {chr: sequence}
gene = parse_gtf(gene_gtf, feature='exon') // extract exon coords
gene.introns = derive_introns(gene.exons, min_length=params.min_intron)
gene.ranges = compute_gene_ranges(gene.exons)
gene.splice_seqs = extract_splice_sites(gene.introns, genome)
cdna = parse_gff3(cdna_gff3, feature='cDNA_match')
cdna.introns = derive_introns(cdna.exons, min_length=params.min_intron)
cdna.ranges = compute_gene_ranges(cdna.exons)
cdna.splice_seqs = extract_splice_sites(cdna.introns, genome)
// Step 1: Cluster cDNAs to genes
clusters = {}
FOR each geneID, gene_range IN gene.ranges:
FOR each cdnaID, cdna_range IN cdna.ranges:
IF overlap(gene_range, cdna_range) > 0:
clusters[geneID].add(cdnaID)
clusters[geneID].add(all_transcripts_of(geneID))
// Step 2: Detect same transcripts
FOR each geneID IN clusters:
G = empty_graph()
FOR each pair (ti, tj) IN clusters[geneID]:
IF is_same_transcript(ti, tj, params.coverage):
G.add_edge(ti, tj)
families = connected_components(G)
FOR each family IN families:
representative = longest_transcript(family)
alternative_splice[geneID].add(representative)
// Step 3: Generate AS codes
FOR each geneID IN alternative_splice:
FOR each pair (ti, tj) IN alternative_splice[geneID]:
as_code = compute_as_code(ti, tj, params)
IF as_code is not empty:
store as_code for (ti, tj)
// Step 4: Classify AS types
FOR each as_code IN all_as_codes:
as_type = classify_as_code(as_code)
count[as_type] += 1
genes[as_type].add(geneID)
OUTPUT all results
7.2 Same-Transcript Detection
FUNCTION is_same_transcript(transcript_a, transcript_b, coverage):
IF strand(a) != strand(b): RETURN FALSE
// splice site check: both have records or both do not (presence only,
// NOT canonical/non-canonical comparison)
IF splice_seq_presence_mismatch(a, b): RETURN FALSE
introns_a = sorted(a.introns)
introns_b = sorted(b.introns)
range_a = (a.first_exon.start, a.last_exon.end)
range_b = (b.first_exon.start, b.last_exon.end)
// Nested loop: for each intron in A, scan all introns in B for a match
FOR i IN range(len(introns_a)):
FOR j IN range(len(introns_b)):
overlap = compute_overlap(introns_a[i], introns_b[j])
IF overlap / length(introns_a[i]) > coverage AND
overlap / length(introns_b[j]) > coverage:
// Match found, extend-check remaining introns
RETURN extend_overlap(introns_a[i+1:], introns_b[j+1:],
range_a, range_b)
IF overlap / length(introns_a[i]) > 1-coverage OR
overlap / length(introns_b[j]) > 1-coverage:
RETURN FALSE // too different
IF introns_a[i] ends before introns_b[j] AND
introns_a[i] contained in range_b: RETURN FALSE
IF introns_b[j] ends before introns_a[i] AND
introns_b[j] contained in range_a: RETURN FALSE
RETURN FALSE
FUNCTION extend_overlap(remaining_a, remaining_b, range_a, range_b):
i = 0
WHILE i < min(len(remaining_a), len(remaining_b)):
overlap = compute_overlap(remaining_a[i], remaining_b[i])
IF overlap / length(remaining_a[i]) < coverage: RETURN FALSE
IF overlap / length(remaining_b[i]) < coverage: RETURN FALSE
i += 1
// Check remaining introns (containment in other's range => different)
IF len(remaining_a) > len(remaining_b):
IF intron_contained_in_range(remaining_a[i], range_b): RETURN FALSE
ELSE IF len(remaining_b) > len(remaining_a):
IF intron_contained_in_range(remaining_b[i], range_a): RETURN FALSE
RETURN TRUE
7.3 AS Code Generation
FUNCTION compute_as_code(transcript_a, transcript_b, params):
all_introns = introns_a + introns_b
clusters = cluster_overlapping_introns(all_introns)
code_a = ""
code_b = ""
FOR each cluster IN clusters:
IF cluster NOT internal_to_both(cluster, range_a, range_b): CONTINUE
boundaries = collect_all_boundaries(cluster)
diff_sites = boundaries appearing in only one transcript
diff_sites = filter_small_differences(diff_sites, params.asvaryedge)
IF diff_sites is empty: CONTINUE
SORT diff_sites by position (reverse if minus strand)
site_number = 1
FOR each site IN diff_sites:
symbol = "^" if site is donor else "-"
IF site belongs to transcript_a:
code_a += str(site_number) + symbol
ELSE:
code_b += str(site_number) + symbol
site_number += 1
IF code_a is empty: code_a = "0"
IF code_b is empty: code_b = "0"
RETURN (code_a, code_b)
7.4 AS Type Classification
FUNCTION classify_as_code(code_a, code_b):
IF code_a == "0" OR code_b == "0":
non_zero = code_a if code_b == "0" else code_b
IF non_zero ends with "^": RETURN "ExonS"
IF non_zero ends with "-": RETURN "IntronR"
IF (code_a == "1^" AND code_b == "2^") OR
(code_a == "2^" AND code_b == "1^"):
RETURN "AltD"
IF (code_a == "1-" AND code_b == "2-") OR
(code_a == "2-" AND code_b == "1-"):
RETURN "AltA"
IF both codes match pattern /^[1-4][\^-][1-4][\^-]$/:
RETURN "AltP"
RETURN "Other"
8. Key Parameters and Their Effects
+-------------------+---------+-------------------------------------------+
| Parameter | Default | Effect |
+-------------------+---------+-------------------------------------------+
| coverage | 0.9 | Minimum overlap fraction for two introns |
| | | to be considered "matching". Lower values |
| | | are more permissive (more transcripts |
| | | collapsed as "same"). |
+-------------------+---------+-------------------------------------------+
| minintronlength | 9 | Introns shorter than this are removed and |
| | | flanking exons merged. Prevents spurious |
| | | introns from alignment gaps. |
+-------------------+---------+-------------------------------------------+
| asvaryedge | 3 | Alternative splice sites within this many |
| | | bp of each other are filtered out. |
| | | Reduces noise from imprecise alignments. |
+-------------------+---------+-------------------------------------------+
| canonical | FALSE | If TRUE, only consider introns with |
| | | canonical splice sites (GT/AG or GC/AG). |
| | | Filters out non-canonical events. |
+-------------------+---------+-------------------------------------------+
| type | exon | GTF feature type to parse. "exon" for |
| | | full transcript structure, "CDS" for |
| | | coding-region-only analysis. |
+-------------------+---------+-------------------------------------------+
9. Worked Example
9.1 Input
Gene G1 on chr1, plus strand, with three transcripts:
T1 (gene): [100, 200] [500, 700] [900, 1100]
exon1 exon2 exon3
T2 (gene): [100, 200] [500, 700]
exon1 exon2
T3 (cDNA): [100, 200] [600, 700] [900, 1100]
exon1 exon2 exon3
9.2 Step 1: Derive Introns
T1 introns: [201, 499, +] [701, 899, +]
T2 introns: [201, 499, +]
T3 introns: [201, 599, +] [701, 899, +]
9.3 Step 2: Cluster
All three overlap with gene G1 range [100, 1100], so cluster = {T1, T2, T3}.
9.4 Step 3: Same-Transcript Detection
Compare T1 vs T2:
T1 has 2 introns, T2 has 1 intron
Intron 1 matches (both [201, 499])
T1's second intron [701, 899] vs T2's range [100, 700]:
701 > 700, so the intron is NOT within T2's range (overlap type 1)
--> T1's extra intron is outside the shared region --> SAME isoform
Compare T1 vs T3:
Intron 1: T1=[201,499], T3=[201,599]
overlap = 499-201+1 = 299
coverage_A = 299/299 = 1.0 >= 0.9
coverage_B = 299/399 = 0.75 < 0.9 --> NOT same
Compare T2 vs T3:
T2 has 1 intron, T3 has 2 introns
First intron: T2=[201,499], T3=[201,599]
coverage_T3 = 299/399 = 0.75, which is > (1-0.9)=0.1 but < 0.9
--> too different to be same --> NOT same
Graph: T1 ---- T2 (same), T3 (isolated)
Families: {T1, T2}, {T3}
Representative of {T1, T2}: T1 (longer, 3 exons)
Alternative splice transcripts: T1, T3
9.5 Step 4: AS Code for T1 vs T3
All introns: [201,499,+], [701,899,+], [201,599,+], [701,899,+]
Cluster by overlap:
Cluster [201, 599]: {[201,499], [201,599]} -- T1 intron1 and T3 intron1 overlap
Cluster [701, 899]: {[701,899], [701,899]} -- identical, no variation
Process Cluster [201, 599]:
Boundaries: 201(T1,T3), 499(T1), 599(T3)
Shared: 201
Differential: 499 (T1 only), 599 (T3 only)
Filter: |599-499| = 100 > 3, keep both
Sort by position: 499, 599
Position 499: end of T1's intron --> acceptor --> 1-
Position 599: end of T3's intron --> acceptor --> 2-
T1 code: "1-"
T3 code: "2-"
AS Code: 1-,2- --> AltA (Alternative Acceptor)
This makes biological sense: T1 and T3 share the same donor site at position 201, but T1's intron ends at 499 while T3's intron ends at 599, meaning T3 uses a more distal acceptor site.
T1: [===E1===]---[===E2===]---[===E3===]
^D ^A1
T3: [===E1===]------[==E2==]---[===E3===]
^D ^A2
Same donor (^D), different acceptors (A1 vs A2) --> AltA