Mapping
hgvs.assemblymapper
-
class
hgvs.assemblymapper.
AssemblyMapper
(hdp, assembly_name=u'GRCh38', alt_aln_method=u'splign', normalize=True, prevalidation_level=u'EXTRINSIC', in_par_assume=u'X', replace_reference=True, *args, **kwargs)¶ Bases:
hgvs.variantmapper.VariantMapper
Provides simplified variant mapping for a single assembly and transcript-reference alignment method.
AssemblyMapper is instantiated with an assembly name and alt_aln_method. These enable the following conveniences over VariantMapper:
- The assembly and alignment method are used to automatically select an appropriate chromosomal reference sequence when mapping from a transcript to a genome (i.e., c_to_g(…) and n_to_g(…)).
- A new method, relevant_transcripts(g_variant), returns a list of transcript accessions available for the specified variant. These accessions are candidates for mapping from genomic to transcript coordinates (i.e., g_to_c(…) and g_to_n(…)).
Note: AssemblyMapper supports only chromosomal references (e.g., NC_000006.11). It does not support contigs or other genomic sequences (e.g., NT_167249.1).
Parameters: - hdp (object) – instance of hgvs.dataprovider subclass
- replace_reference (bool) – replace reference (entails additional network access)
- assembly_name (str) – name of assembly (e.g., “GRCh38.p5”)
- alt_aln_method (str) – genome-transcript alignment method (“splign”, “blat”, “genewise”)
- normalize (bool) – normalize variants
- prevalidation_level (str) – level of validation applied before mapping: None, “INTRINSIC”, or “EXTRINSIC”
- in_par_assume (str) – during x_to_g, assume this chromosome name if alignment is ambiguous
Raises: HGVSError subclasses – for a variety of mapping and data lookup failures
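As an illustration of the conveniences listed above, the following is a minimal sketch of a round trip through AssemblyMapper. It assumes that the hgvs package is installed and that a UTA data provider is reachable through hgvs.dataproviders.uta.connect(); the helper name project_c_variant is hypothetical and not part of the library.

```python
def project_c_variant(hgvs_c, assembly_name="GRCh38", alt_aln_method="splign"):
    """Project a c. variant string to g., then onto every overlapping transcript.

    Hypothetical helper for illustration; it needs the hgvs package and
    network access to a UTA instance when it is called.
    """
    # Imports are local so that defining the function has no side effects.
    import hgvs.assemblymapper
    import hgvs.dataproviders.uta
    import hgvs.parser

    hdp = hgvs.dataproviders.uta.connect()
    mapper = hgvs.assemblymapper.AssemblyMapper(
        hdp, assembly_name=assembly_name, alt_aln_method=alt_aln_method
    )

    var_c = hgvs.parser.Parser().parse_hgvs_variant(hgvs_c)

    # The chromosomal reference is chosen from the assembly and alignment method.
    var_g = mapper.c_to_g(var_c)

    # Candidate transcripts are selected by genomic overlap with var_g.
    var_n_by_tx = {
        tx_ac: mapper.g_to_n(var_g, tx_ac)
        for tx_ac in mapper.relevant_transcripts(var_g)
    }
    return var_g, var_n_by_tx
```

A call would pass a c. variant string for a transcript known to the data provider. Any failure in lookup or mapping surfaces as an HGVSError subclass, as noted under Raises.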
-
c_to_g
(var_c)¶
-
c_to_n
(var_c)¶
-
c_to_p
(var_c)¶
-
g_to_c
(var_g, tx_ac)¶
-
g_to_n
(var_g, tx_ac)¶
-
g_to_t
(var_g, tx_ac)¶
-
n_to_c
(var_n)¶
-
n_to_g
(var_n)¶
-
relevant_transcripts
(var_g) Returns a list of transcript accessions (strings) for the given variant, selected by genomic overlap.
-
t_to_g
(var_t)¶
hgvs.variantmapper
Provides VariantMapper and AssemblyMapper to project variants between sequences using TranscriptMapper.
-
class
hgvs.variantmapper.
VariantMapper
(hdp, replace_reference=True, prevalidation_level=u'EXTRINSIC')¶ Bases:
object
Maps SequenceVariant objects between g., n., r., c., and p. representations.
g⟷{c,n,r} projections are similar in that c, n, and r variants may use intronic coordinates. There are two essential differences that distinguish the three types:
- Sequence start: In n and r variants, position 1 is the sequence start; in c variants, position 1 is the translation start site (the A of the ATG start codon).
- Alphabet: In n and c variants, sequences are DNA; in r. variants, sequences are RNA.
These differences are summarized in this diagram:
g ----acgtatgcac--gtctagacgt----    ----acgtatgcac--gtctagacgt----    ----acgtatgcac--gtctagacgt----
      \         \/         /            \         \/         /            \         \/         /
c      acgtATGCACGTCTAGacgt       n      acgtatgcacgtctagacgt       r      acguaugcacgucuagacgu
           1                             1                                 1
p          MetHisValTer
The g excerpt and exon structures are identical in all three panels. The g⟷n transformation, which is the most basic, accounts for the offset of the aligned sequences (shown with “1”) and the exon structure. The g⟷c transformation is akin to the g⟷n transformation, but requires an additional offset to account for the translation start site (c.1); the CDS is shown in uppercase. The g⟷r transformation is akin to the g⟷n transformation with a change of alphabet.
Therefore, this code uses g⟷n as the core transformation between genomic and c, n, and r variants: all c⟷g and r⟷g transformations use n⟷g after accounting for the above differences. For example, c_to_g accounts for the translation start site offset, then calls n_to_g.
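The layering described above can be sketched on the toy sequences from the diagram. This is an illustrative model, not the library's implementation: the exon bounds, the constant names, and the three functions are assumptions made for the example. It handles only plus-strand positions upstream of and within the CDS, and ignores 3' UTR (c.*) and intronic positions.

```python
# Toy data transcribed from the diagram (plus strand, two exons).
GENOME = "----acgtatgcac--gtctagacgt----"
TRANSCRIPT = "acgtATGCACGTCTAGacgt"
EXONS_G = [(5, 14), (17, 26)]  # assumed 1-based inclusive genomic exon bounds
CDS_START_N = 5                # n. position of c.1 (the A of ATG)


def c_to_n(c_pos):
    """Apply the CDS offset; c. numbering has no position 0."""
    if c_pos == 0:
        raise ValueError("c.0 does not exist")
    return c_pos + CDS_START_N - 1 if c_pos > 0 else c_pos + CDS_START_N


def n_to_g(n_pos):
    """Core transformation: walk the exons to find the genomic position."""
    remaining = n_pos
    for g_start, g_end in EXONS_G:
        exon_len = g_end - g_start + 1
        if 1 <= remaining <= exon_len:
            return g_start + remaining - 1
        remaining -= exon_len
    raise ValueError("n. position %d is outside the transcript" % n_pos)


def c_to_g(c_pos):
    """c -> g is the CDS offset followed by the core n -> g step."""
    return n_to_g(c_to_n(c_pos))


# c.1 is the A of ATG; c.7 is the first base of the second exon.
print(c_to_g(1), c_to_g(7))
```

The reverse direction would invert the same two steps in the opposite order, which is why only the n⟷g step needs to know about exon structure.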
All methods require and return objects of type hgvs.sequencevariant.SequenceVariant.
-
c_to_g
(var_c, alt_ac, alt_aln_method=u'splign') Given a parsed c. variant, return a g. variant on the specified reference sequence (alt_ac) using the specified alignment method (default is “splign” from NCBI).
Parameters: - var_c (hgvs.sequencevariant.SequenceVariant) – a variant object
- alt_ac (str) – a reference sequence accession (e.g., NC_000001.11)
- alt_aln_method (str) – the alignment method; valid values depend on data source
Returns: variant object (hgvs.sequencevariant.SequenceVariant)
Raises: HGVSInvalidVariantError – if var_c is not of type “c”
-
c_to_n
(var_c)¶ Given a parsed c. variant, return a n. variant on the specified transcript using the specified alignment method (default is “transcript” indicating a self alignment).
Parameters: var_c (hgvs.sequencevariant.SequenceVariant) – a variant object
Returns: variant object (hgvs.sequencevariant.SequenceVariant)
Raises: HGVSInvalidVariantError – if var_c is not of type “c”
-
c_to_p
(var_c, pro_ac=None) Converts a c. SequenceVariant to a p. SequenceVariant on the specified protein accession. Author: Rudy Rico
Parameters: - var_c (SequenceVariant) – a c. variant object
- pro_ac (str) – protein accession
Return type:
-
g_to_c
(var_g, tx_ac, alt_aln_method=u'splign')¶ Given a parsed g. variant, return a c. variant on the specified transcript using the specified alignment method (default is “splign” from NCBI).
Parameters: - var_g (hgvs.sequencevariant.SequenceVariant) – a variant object
- tx_ac (str) – a transcript accession (e.g., NM_012345.6 or ENST012345678)
- alt_aln_method (str) – the alignment method; valid values depend on data source
Returns: variant object (hgvs.sequencevariant.SequenceVariant) using CDS coordinates
Raises: HGVSInvalidVariantError – if var_g is not of type “g”
-
g_to_n
(var_g, tx_ac, alt_aln_method=u'splign')¶ Given a parsed g. variant, return a n. variant on the specified transcript using the specified alignment method (default is “splign” from NCBI).
Parameters: - var_g (hgvs.sequencevariant.SequenceVariant) – a variant object
- tx_ac (str) – a transcript accession (e.g., NM_012345.6 or ENST012345678)
- alt_aln_method (str) – the alignment method; valid values depend on data source
Returns: variant object (hgvs.sequencevariant.SequenceVariant) using transcript (n.) coordinates
Raises: HGVSInvalidVariantError – if var_g is not of type “g”
-
g_to_t
(var_g, tx_ac, alt_aln_method=u'splign')¶
-
n_to_c
(var_n)¶ Given a parsed n. variant, return a c. variant on the specified transcript using the specified alignment method (default is “transcript” indicating a self alignment).
Parameters: var_n (hgvs.sequencevariant.SequenceVariant) – a variant object
Returns: variant object (hgvs.sequencevariant.SequenceVariant)
Raises: HGVSInvalidVariantError – if var_n is not of type “n”
-
n_to_g
(var_n, alt_ac, alt_aln_method=u'splign') Given a parsed n. variant, return a g. variant on the specified reference sequence (alt_ac) using the specified alignment method (default is “splign” from NCBI).
Parameters: - var_n (hgvs.sequencevariant.SequenceVariant) – a variant object
- alt_ac (str) – a reference sequence accession (e.g., NC_000001.11)
- alt_aln_method (str) – the alignment method; valid values depend on data source
Returns: variant object (hgvs.sequencevariant.SequenceVariant)
Raises: HGVSInvalidVariantError – if var_n is not of type “n”
-
t_to_g
(var_t, alt_ac, alt_aln_method=u'splign')¶
hgvs.intervalmapper
Mapping intervals between pairs of congruent segments
The IntervalMapper class is at the heart of mapping between aligned sequences. An instance of hgvs.intervalmapper.IntervalMapper is constructed with an ordered list of hgvs.intervalmapper.IntervalPair instances, each of which consists of two hgvs.intervalmapper.Interval instances. The IntervalMapper class is unaware of strand/orientation; that issue is handled by the hgvs.transcriptmapper.TranscriptMapper class.
NOTE: Mapping at the boundaries around indels requires a choice. If seq B has an insertion relative to seq A, then the mapped coordinates at the boundaries can be either minimal or maximal for both the start and the end. Consider this alignment:
0          15   20         35         50
|==========|====|==========|==========|
|          |  __/       __/|          |
|          |/          /   |          |
|==========|==========|====|==========|
0          15         30   35         50
    15M     5D    15M   5I     15M
segment 1: [ 0,15] ~ [ 0,15]
segment 2: [15,20] ~ [15,15]
segment 3: [20,35] ~ [15,30]
segment 4: [35,35] ~ [30,35]
segment 5: [35,50] ~ [35,50]
and these intervals around reference position 35:
interval 1: 34,36 -> 29,36 (no ambiguity)
interval 2: 35,35 -> 30,35 (reasonable)
interval 3: 34,35 -> 29,30 (minimal) or 29,35 (maximal)
interval 4: 35,36 -> 35,36 (minimal) or 30,36 (maximal)
So, for interval 3, end_i=35 matches segment 3 and segment 4. Analogously, for interval 4, start_i=35 matches segment 4 and segment 5.
Currently, this code matches an interval <start_i,end_i> using the maximal start_i and minimal end_i.
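The boundary choice can be made concrete with a small sketch that reproduces the four intervals listed above. It is written from the description in this section and is not the library's code: the SEGMENTS table is transcribed from the segment list, the function mirrors only the signature of map_ref_to_tgt, and the special case for a zero-width query that coincides with an insertion is an assumption made so that interval 2 maps as shown.

```python
# (ref, tgt) interbase segments transcribed from the alignment above.
SEGMENTS = [
    ((0, 15), (0, 15)),
    ((15, 20), (15, 15)),
    ((20, 35), (15, 30)),
    ((35, 35), (30, 35)),
    ((35, 50), (35, 50)),
]


def _clip(lo, hi, pos):
    """Keep pos within the closed range [lo, hi]."""
    return max(lo, min(hi, pos))


def map_ref_to_tgt(start_i, end_i, max_extent=False, segments=SEGMENTS):
    """Map a ref interval to tgt, choosing minimal or maximal extent."""
    if start_i > end_i:
        raise ValueError("start_i must not exceed end_i")

    # Assumption of this sketch: a zero-width query sitting exactly on an
    # insertion maps to the inserted target interval.
    if start_i == end_i:
        for (r_start, r_end), tgt in segments:
            if r_start == r_end == start_i:
                return tgt

    starts = [i for i, ((rs, re_), _) in enumerate(segments) if rs <= start_i <= re_]
    ends = [i for i, ((rs, re_), _) in enumerate(segments) if rs <= end_i <= re_]
    if not starts or not ends:
        raise ValueError("interval is outside the aligned region")

    # Minimal extent: latest segment for the start, earliest for the end.
    # Maximal extent: earliest segment for the start, latest for the end.
    si, ei = (starts[0], ends[-1]) if max_extent else (starts[-1], ends[0])

    (r_start, _), (t_start, t_end) = segments[si]
    tgt_start = _clip(t_start, t_end, t_start + (start_i - r_start))

    (_, r_end), (t_start, t_end) = segments[ei]
    tgt_end = _clip(t_start, t_end, t_end - (r_end - end_i))

    return tgt_start, tgt_end


print(map_ref_to_tgt(34, 35), map_ref_to_tgt(34, 35, max_extent=True))
```

With max_extent left at False the sketch picks the maximal start_i and minimal end_i, matching the behavior described above; setting it to True widens the result across the adjacent insertion.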
-
class
hgvs.intervalmapper.
CIGARElement
(len, op)¶ Bases:
object
represents elements of a CIGAR string and provides methods for determining the number of ref and tgt bases consumed by the operation
-
len
¶
-
op
¶
-
ref_len
¶ returns number of nt/aa consumed in reference sequence for this edit
-
tgt_len
¶ returns number of nt/aa consumed in target sequence for this edit
-
-
class
hgvs.intervalmapper.
Interval
(start_i, end_i)¶ Bases:
object
Represents a segment of a sequence in interbase coordinates (0-based, right-open).
-
end_i
¶
-
len
¶
-
start_i
¶
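Because Interval uses interbase coordinates while HGVS positions are 1-based and inclusive, a conversion is needed at the boundary between the two. The following is a minimal sketch of that conversion; the function names are hypothetical and do not correspond to library calls.

```python
def to_interbase(start, end):
    """Convert a 1-based inclusive range to interbase (0-based, right-open)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


def from_interbase(start_i, end_i):
    """Convert an interbase range covering at least one base to 1-based inclusive."""
    if start_i < 0 or end_i <= start_i:
        raise ValueError("a zero-width interbase interval covers no bases")
    return start_i + 1, end_i


# Bases 5..14 inclusive span interbase positions 4..14; the length is end_i - start_i.
start_i, end_i = to_interbase(5, 14)
print(start_i, end_i, end_i - start_i)
```

Interbase coordinates can express a zero-width location between two bases, such as an insertion site, which a 1-based inclusive range cannot.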
-
-
class
hgvs.intervalmapper.
IntervalMapper
(interval_pairs)¶ Bases:
object
Provides mapping between sequence coordinates according to an ordered set of IntervalPairs.
Parameters: interval_pairs (list of IntervalPair instances) – an ordered list of IntervalPair instances
Returns: an IntervalMapper instance
-
static
from_cigar
(cigar)
Parameters: cigar (str) – a Compact Idiosyncratic Gapped Alignment Report string
Returns: an IntervalMapper instance from the CIGAR string
-
interval_pairs
¶
-
map_ref_to_tgt
(start_i, end_i, max_extent=False)¶
-
map_tgt_to_ref
(start_i, end_i, max_extent=False)¶
-
ref_intervals
¶
-
ref_len
¶
-
tgt_intervals
¶
-
tgt_len
¶
-
class
hgvs.intervalmapper.
IntervalPair
(ref, tgt)¶ Bases:
object
Represents a match, insertion, or deletion segment of an alignment. If a match, the lengths must be equal; if an insertion or deletion, the length of the ref or tgt must be zero respectively.
-
ref
¶
-
tgt
¶
-
-
hgvs.intervalmapper.
cigar_to_intervalpairs
(cigar) For a given CIGAR string, return a list of (Interval, Interval) pairs. The length of the returned list equals the number of CIGAR operations.
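The conversion from a CIGAR string to interval pairs can be sketched as follows. This is an illustrative version, not the library function: cigar_to_pairs is a hypothetical name, pairs are plain tuples rather than Interval objects, and the operator convention is the one used in the alignment example earlier in this section (M consumes both sequences, D consumes only the reference, I consumes only the target). The library's own operator set and conventions may differ.

```python
import re


def cigar_to_pairs(cigar):
    """Return one ((ref_start, ref_end), (tgt_start, tgt_end)) pair per CIGAR op."""
    elements = re.findall(r"(\d+)([MID])", cigar)
    # Reject strings with operators or characters this sketch does not handle.
    if "".join(length + op for length, op in elements) != cigar:
        raise ValueError("unsupported CIGAR string: %r" % cigar)

    pairs = []
    ref_pos = tgt_pos = 0
    for length, op in elements:
        n = int(length)
        ref_len = n if op in "MD" else 0  # bases consumed on the reference
        tgt_len = n if op in "MI" else 0  # bases consumed on the target
        pairs.append(((ref_pos, ref_pos + ref_len), (tgt_pos, tgt_pos + tgt_len)))
        ref_pos += ref_len
        tgt_pos += tgt_len
    return pairs


for ref, tgt in cigar_to_pairs("15M5D15M5I15M"):
    print(ref, "~", tgt)
```

Each operation yields exactly one pair, so insertions and deletions appear as pairs in which one side has zero width.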
hgvs.projector
Utility class that projects variants from one transcript to another via a common reference sequence.
-
class
hgvs.projector.
Projector
(hdp, alt_ac, src_ac, dst_ac, src_alt_aln_method=u'splign', dst_alt_aln_method=u'splign')¶ Bases:
object
The Projector class implements liftover between two transcripts via a common reference sequence.
Parameters: - hdp – HGVS Data Provider Interface-compliant instance (see hgvs.dataproviders.interface.Interface)
- alt_ac – string representing the accession of the common reference sequence (e.g., NC_000003.11)
- src_ac – string representing the source transcript accession (e.g., NM_000551.2)
- dst_ac – string representing the destination transcript accession (e.g., NM_000551.3)
- src_alt_aln_method – string representing the source transcript alignment method
- dst_alt_aln_method – string representing the destination transcript alignment method
This class assumes (and verifies) that the transcripts are on the same strand. This assumption obviates some work in flipping sequence variants twice unnecessarily.
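The idea of projecting through a common reference can be sketched with a toy model. Everything here is hypothetical and simplified for illustration: ToyTranscript models a single-exon, plus-strand transcript by the genomic position of n.1 and the n. position of c.1, and ToyProjector mirrors only the forward and backward idea, not the library's interface. Only CDS positions c.1 and above are handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyTranscript:
    """Single-exon transcript on a shared reference (hypothetical model)."""
    g_start: int      # 1-based genomic position of n.1
    cds_start_n: int  # n. position of c.1
    strand: int = 1

    def c_to_g(self, c_pos):
        if c_pos < 1:
            raise ValueError("sketch handles c.1 and above only")
        return self.g_start + self.cds_start_n + c_pos - 2

    def g_to_c(self, g_pos):
        c_pos = g_pos - self.g_start - self.cds_start_n + 2
        if c_pos < 1:
            raise ValueError("position falls upstream of the CDS")
        return c_pos


class ToyProjector:
    """Project c. positions between two transcripts via the reference."""

    def __init__(self, src, dst):
        # Same-strand check, as described for Projector above.
        if src.strand != dst.strand:
            raise ValueError("transcripts must be on the same strand")
        self.src, self.dst = src, dst

    def project_forward(self, c_pos):
        return self.dst.g_to_c(self.src.c_to_g(c_pos))

    def project_backward(self, c_pos):
        return self.src.g_to_c(self.dst.c_to_g(c_pos))


src = ToyTranscript(g_start=1000, cds_start_n=51)
dst = ToyTranscript(g_start=990, cds_start_n=58)
projector = ToyProjector(src, dst)
print(projector.project_forward(1), projector.project_backward(4))
```

Forward and backward projection are inverses wherever both transcripts cover the same genomic position, which is the property the same-strand check protects.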
-
project_interval_backward
(c_interval) Project c_interval on the destination transcript to the source transcript.
Parameters: c_interval – an hgvs.interval.Interval object on the destination transcript
Returns: c_interval: an hgvs.interval.Interval object on the source transcript
-
project_interval_forward
(c_interval) Project c_interval on the source transcript to the destination transcript.
Parameters: c_interval – an hgvs.interval.Interval object on the source transcript
Returns: c_interval: an hgvs.interval.Interval object on the destination transcript
-
project_variant_backward
(c_variant) Project c_variant on the destination transcript onto the source transcript.
Parameters: c_variant – an hgvs.sequencevariant.SequenceVariant object on the destination transcript
Returns: c_variant: an hgvs.sequencevariant.SequenceVariant object on the source transcript
-
project_variant_forward
(c_variant) Project c_variant on the source transcript onto the destination transcript.
Parameters: c_variant – an hgvs.sequencevariant.SequenceVariant object on the source transcript
Returns: c_variant: an hgvs.sequencevariant.SequenceVariant object on the destination transcript
hgvs.transcriptmapper
Provides coordinate (not variant) mapping operations between genomic (g), non-coding (n), cds (c), and protein (p) coordinates.
-
class
hgvs.transcriptmapper.
TranscriptMapper
(hdp, tx_ac, alt_ac, alt_aln_method)¶ Bases:
object
Provides coordinate (not variant) mapping operations between genomic (g), non-coding (n), cds (c), and protein (p) coordinates. All coordinates are 1-based inclusive, per the HGVS recommendations. All methods take
hgvs.location.Interval objects.
Parameters: - hdp – HGVS Data Provider Interface-compliant instance (see hgvs.dataproviders.interface.Interface)
- tx_ac (str) – string representing transcript accession (e.g., NM_000551.2)
- alt_ac (str) – string representing the reference sequence accession (e.g., NC_000003.11)
- alt_aln_method (str) – string representing the alignment method; valid values depend on data source
-
c_to_g
(c_interval)¶ convert a transcript CDS (c.) interval to a genomic (g.) interval
-
c_to_n
(c_interval)¶ convert a transcript CDS (c.) interval to a transcript cDNA (n.) interval
-
g_to_c
(g_interval)¶ convert a genomic (g.) interval to a transcript CDS (c.) interval
-
g_to_n
(g_interval)¶ convert a genomic (g.) interval to a transcript cDNA (n.) interval
-
is_coding_transcript
¶
-
n_to_c
(n_interval)¶ convert a transcript cDNA (n.) interval to a transcript CDS (c.) interval
-
n_to_g
(n_interval)¶ convert a transcript cDNA (n.) interval to a genomic (g.) interval