Mapping

hgvs.assemblymapper

class hgvs.assemblymapper.AssemblyMapper(hdp, assembly_name=u'GRCh38', alt_aln_method=u'splign', normalize=True, in_par_assume=u'X', replace_reference=True, *args, **kwargs)

Bases: hgvs.variantmapper.VariantMapper

Provides simplified variant mapping for a single assembly and transcript-reference alignment method.

AssemblyMapper is instantiated with an assembly name and alt_aln_method. These enable the following conveniences over VariantMapper:

  • The assembly and alignment method are used to automatically select an appropriate chromosomal reference sequence when mapping from a transcript to a genome (i.e., c_to_g(...) and n_to_g(...)).
  • A new method, relevant_transcripts(g_variant), returns a list of transcript accessions available for the specified variant. These accessions are candidates for mapping from genomic to transcript coordinates (i.e., g_to_c(...) and g_to_n(...)).

Note: AssemblyMapper supports only chromosomal references (e.g., NC_000006.11). It does not support contigs or other genomic sequences (e.g., NT_167249.1).

Parameters:
  • hdp (object) – instance of hgvs.dataprovider subclass
  • replace_reference (bool) – replace reference (entails additional network access)
  • assembly_name (str) – name of assembly (e.g., “GRCh38.p5”)
  • alt_aln_method (str) – genome-transcript alignment method (“splign”, “blat”, “genewise”)
  • normalize (bool) – normalize variants
  • in_par_assume (str) – during x_to_g, assume this chromosome name if alignment is ambiguous
Raises:

HGVSError subclasses – for a variety of mapping and data lookup failures
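
As a rough illustration of the constructor parameters above, the following sketch parses a c. variant and projects it onto the configured assembly. The helper name project_c_to_g is hypothetical, and the sketch assumes that the hgvs package is installed and that a UTA data provider is reachable through hgvs.dataproviders.uta.connect(); imports are deferred so that defining the function requires neither.

```python
def project_c_to_g(hgvs_c, assembly_name="GRCh38", alt_aln_method="splign"):
    """Parse a c. variant string and project it to a g. variant.

    Hypothetical helper for illustration only; not part of the hgvs package.
    """
    # Deferred imports: defining this function needs neither the hgvs
    # package nor network access; calling it needs both.
    import hgvs.assemblymapper
    import hgvs.dataproviders.uta
    import hgvs.parser

    # Assumption: the default UTA connection is acceptable for this use.
    hdp = hgvs.dataproviders.uta.connect()
    parser = hgvs.parser.Parser()
    mapper = hgvs.assemblymapper.AssemblyMapper(
        hdp, assembly_name=assembly_name, alt_aln_method=alt_aln_method
    )

    var_c = parser.parse_hgvs_variant(hgvs_c)
    # The chromosomal reference is selected from the assembly and the
    # alignment method, so no alt_ac argument is passed here.
    return mapper.c_to_g(var_c)
```

A call such as project_c_to_g("NM_001637.3:c.1582G>A", assembly_name="GRCh37") would be a typical use. Any HGVSError subclass raised by the mapper propagates to the caller.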

c_to_g(var_c)
c_to_n(var_c)
c_to_p(var_c)
g_to_c(var_g, tx_ac)
g_to_n(var_g, tx_ac)
g_to_t(var_g, tx_ac)
n_to_c(var_n)
n_to_g(var_n)
relevant_transcripts(var_g)

Returns a list of transcript accessions (strings) for the given variant, selected by genomic overlap.

t_to_g(var_t)

hgvs.variantmapper

Provides VariantMapper and AssemblyMapper to project variants between sequences using TranscriptMapper.

class hgvs.variantmapper.VariantMapper(hdp, replace_reference=True)

Bases: object

Maps SequenceVariant objects between g., n., r., c., and p. representations.

g⟷{c,n,r} projections are similar in that c, n, and r variants may use intronic coordinates. There are two essential differences that distinguish the three types:

  • Sequence start: In n and r variants, position 1 is the sequence start; in c variants, 1 is the translation start site.
  • Alphabet: In n and c variants, sequences are DNA; in r variants, sequences are RNA.

These differences are summarized in this diagram:

g ----acgtatgcac--gtctagacgt----      ----acgtatgcac--gtctagacgt----      ----acgtatgcac--gtctagacgt----
      \         \/         /              \         \/         /              \         \/         /
c      acgtATGCACGTCTAGacgt         n      acgtatgcacgtctagacgt         r      acguaugcacgucuagacgu   
           1                               1                                   1
p          MetHisValTer

The g excerpt and exon structures are identical. The g⟷n transformation, which is the most basic, accounts for the offset of the aligned sequences (shown with “1”) and the exon structure. The g⟷c transformation is akin to the g⟷n transformation, but requires an additional offset to account for the translation start site (c.1); the CDS is shown in uppercase. The g⟷r transformation is akin to the g⟷n transformation with a change of alphabet.

Therefore, this code uses g⟷n as the core transformation between genomic and c, n, and r variants: all c⟷g and r⟷g transformations use n⟷g after accounting for the above differences. For example, c_to_g accounts for the translation start site offset, then calls n_to_g.
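
The two differences can be made concrete with the toy transcript from the diagram. The sketch below is illustrative only and does not reproduce the library's code: the names TRANSCRIPT, CDS_START_I, c_to_n_pos, n_to_c_pos, and n_to_r_seq are hypothetical, and only 5' UTR and CDS positions are handled (c.* positions in the 3' UTR are not).

```python
# Toy transcript from the diagram; the CDS is in uppercase.
TRANSCRIPT = "acgtATGCACGTCTAGacgt"
# Assumption: number of transcript bases before the start codon.
CDS_START_I = 4


def c_to_n_pos(c_pos):
    """Convert a 1-based c. position to a 1-based n. position."""
    if c_pos == 0:
        raise ValueError("c.0 does not exist; c.-1 is followed by c.1")
    # Negative c. positions skip over the missing zero.
    return c_pos + CDS_START_I + (1 if c_pos < 0 else 0)


def n_to_c_pos(n_pos):
    """Convert a 1-based n. position to a 1-based c. position."""
    c_pos = n_pos - CDS_START_I
    return c_pos - 1 if c_pos <= 0 else c_pos


def n_to_r_seq(seq):
    """Change alphabet from DNA to RNA (lowercase, t replaced by u)."""
    return seq.lower().replace("t", "u")
```

Here c.1 corresponds to n.5 (the A of ATG) and c.-1 to n.4, while n_to_r_seq(TRANSCRIPT) yields the r sequence shown in the diagram.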

All methods require and return objects of type hgvs.sequencevariant.SequenceVariant.

Parameters: replace_reference (bool) – replace reference (entails additional network access)
c_to_g(var_c, alt_ac, alt_aln_method=u'splign')

Given a parsed c. variant, return a g. variant on the specified reference sequence using the specified alignment method (default is “splign” from NCBI).

Parameters:
  • var_c (hgvs.sequencevariant.SequenceVariant) – a variant object
  • alt_ac (str) – a reference sequence accession (e.g., NC_000001.11)
  • alt_aln_method (str) – the alignment method; valid values depend on data source
Returns:

variant object (hgvs.sequencevariant.SequenceVariant)

Raises:

HGVSInvalidVariantError – if var_c is not of type “c”

c_to_n(var_c)

Given a parsed c. variant, return a n. variant on the specified transcript using the specified alignment method (default is “transcript” indicating a self alignment).

Parameters: var_c (hgvs.sequencevariant.SequenceVariant) – a variant object
Returns: variant object (hgvs.sequencevariant.SequenceVariant)
Raises: HGVSInvalidVariantError – if var_c is not of type “c”
c_to_p(var_c, pro_ac=None)

Converts a c. SequenceVariant to a p. SequenceVariant on the specified protein accession. Author: Rudy Rico

Parameters:
Return type:

hgvs.sequencevariant.SequenceVariant

g_to_c(var_g, tx_ac, alt_aln_method=u'splign')

Given a parsed g. variant, return a c. variant on the specified transcript using the specified alignment method (default is “splign” from NCBI).

Parameters:
  • var_g (hgvs.sequencevariant.SequenceVariant) – a variant object
  • tx_ac (str) – a transcript accession (e.g., NM_012345.6 or ENST012345678)
  • alt_aln_method (str) – the alignment method; valid values depend on data source
Returns:

variant object (hgvs.sequencevariant.SequenceVariant) using CDS coordinates

Raises:

HGVSInvalidVariantError – if var_g is not of type “g”

g_to_n(var_g, tx_ac, alt_aln_method=u'splign')

Given a parsed g. variant, return an n. variant on the specified transcript using the specified alignment method (default is “splign” from NCBI).

Parameters:
  • var_g (hgvs.sequencevariant.SequenceVariant) – a variant object
  • tx_ac (str) – a transcript accession (e.g., NM_012345.6 or ENST012345678)
  • alt_aln_method (str) – the alignment method; valid values depend on data source
Returns:

variant object (hgvs.sequencevariant.SequenceVariant) using transcript (n.) coordinates

Raises:

HGVSInvalidVariantError – if var_g is not of type “g”

g_to_t(var_g, tx_ac, alt_aln_method=u'splign')
n_to_c(var_n)

Given a parsed n. variant, return a c. variant on the specified transcript using the specified alignment method (default is “transcript” indicating a self alignment).

Parameters: var_n (hgvs.sequencevariant.SequenceVariant) – a variant object
Returns: variant object (hgvs.sequencevariant.SequenceVariant)
Raises: HGVSInvalidVariantError – if var_n is not of type “n”
n_to_g(var_n, alt_ac, alt_aln_method=u'splign')

Given a parsed n. variant, return a g. variant on the specified reference sequence using the specified alignment method (default is “splign” from NCBI).

Parameters:
  • var_n (hgvs.sequencevariant.SequenceVariant) – a variant object
  • alt_ac (str) – a reference sequence accession (e.g., NC_000001.11)
  • alt_aln_method (str) – the alignment method; valid values depend on data source
Returns:

variant object (hgvs.sequencevariant.SequenceVariant)

Raises:

HGVSInvalidVariantError – if var_n is not of type “n”

t_to_g(var_t, alt_ac, alt_aln_method=u'splign')

hgvs.intervalmapper

Mapping intervals between pairs of congruent segments

The IntervalMapper class is at the heart of mapping between aligned sequences. An instance of hgvs.intervalmapper.IntervalMapper is constructed with an ordered list of hgvs.intervalmapper.IntervalPair instances, each of which consists of two hgvs.intervalmapper.Interval instances. The IntervalMapper class is unaware of strand/orientation; that issue is handled by the hgvs.transcriptmapper.TranscriptMapper class.

NOTE: Mapping at the boundaries around indels requires a choice. If seq B has an insertion relative to seq A, then the mapped coordinates at the boundaries can be either minimal or maximal for both the start and end. Consider this alignment:

      0         15   20         35         50
      |==========|====|==========|==========|
      |          | __/        __/|          |
      |          |/          /   |          |
      |==========|==========|====|==========|
      0         15         30   35         50
          15M   5D   15M      5I      15M  

segment 1: [ 0,15] ~ [ 0,15]
segment 2: [15,20] ~ [15,15]
segment 3: [20,35] ~ [15,30]
segment 4: [35,35] ~ [30,35]
segment 5: [35,50] ~ [35,50]

and these intervals around reference position 35:

interval 1: 34,36   -> 29,36 (no ambiguity)
interval 2: 35,35   -> 30,35 (reasonable)
interval 3: 34,35   -> 29,30 (minimal) or 29,35 (maximal)
interval 4: 35,36   -> 35,36 (minimal) or 30,36 (maximal)

So, for interval 3, end_i=35 matches segment 3 and segment 4. Analogously, for interval 4, start_i=35 matches segment 4 and segment 5.

Currently, this code matches an interval <start_i,end_i> using the maximal start_i and minimal end_i.
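
The boundary choice can be reproduced with a small standalone sketch over the five segments listed above. This is not the library's implementation: SEGMENTS, tgt_candidates, and this free-standing map_ref_to_tgt are hypothetical names, and the rule used for zero-width intervals (return the whole inserted region) is an assumption chosen to match interval 2.

```python
# Segments from the alignment above:
# ((ref_start, ref_end), (tgt_start, tgt_end)) in interbase coordinates.
SEGMENTS = [
    ((0, 15), (0, 15)),
    ((15, 20), (15, 15)),   # deletion: zero-width on the target
    ((20, 35), (15, 30)),
    ((35, 35), (30, 35)),   # insertion: zero-width on the reference
    ((35, 50), (35, 50)),
]


def tgt_candidates(pos, segments=SEGMENTS):
    """Return the sorted target positions that a reference position may map to."""
    found = set()
    for (r_start, r_end), (t_start, t_end) in segments:
        if not r_start <= pos <= r_end:
            continue
        if r_end - r_start == t_end - t_start:
            # Match segment: positions correspond one to one.
            found.add(t_start + (pos - r_start))
        else:
            # Indel segment: either edge of the target side is plausible.
            found.update((t_start, t_end))
    if not found:
        raise ValueError("position %d is outside the alignment" % pos)
    return sorted(found)


def map_ref_to_tgt(start_i, end_i, max_extent=False, segments=SEGMENTS):
    """Map a reference interval to the target, minimally or maximally."""
    starts = tgt_candidates(start_i, segments)
    ends = tgt_candidates(end_i, segments)
    if max_extent or start_i == end_i:
        # Widest choice; also used for zero-width input (assumption).
        return starts[0], ends[-1]
    # Minimal extent: maximal start and minimal end.
    return starts[-1], ends[0]
```

With these definitions, map_ref_to_tgt(34, 35) gives the minimal result (29, 30), and passing max_extent=True gives the maximal result (29, 35), matching interval 3.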

class hgvs.intervalmapper.CIGARElement(len, op)

Bases: object

Represents an element of a CIGAR string and provides methods for determining the number of ref and tgt bases consumed by the operation.

len
op
ref_len

returns number of nt/aa consumed in reference sequence for this edit

tgt_len

returns number of nt/aa consumed in target sequence for this edit

class hgvs.intervalmapper.Interval(start_i, end_i)

Bases: object

Represents a segment of a sequence in interbase coordinates (0-based, right-open).

end_i
len
start_i
class hgvs.intervalmapper.IntervalMapper(interval_pairs)

Bases: object

Provides mapping between sequence coordinates according to an ordered set of IntervalPairs.

Parameters: interval_pairs (list of IntervalPair instances) – an ordered list of IntervalPair instances
Returns: an IntervalMapper instance
static from_cigar(cigar)
Parameters: cigar (str) – a Compact Idiosyncratic Gapped Alignment Report string
Returns: an IntervalMapper instance from the CIGAR string
interval_pairs
map_ref_to_tgt(start_i, end_i, max_extent=False)
map_tgt_to_ref(start_i, end_i, max_extent=False)
ref_intervals
ref_len
tgt_intervals
tgt_len
class hgvs.intervalmapper.IntervalPair(ref, tgt)

Bases: object

Represents a match, insertion, or deletion segment of an alignment. If a match, the lengths must be equal; if an insertion or deletion, the length of the ref or tgt must be zero respectively.

ref
tgt
hgvs.intervalmapper.cigar_to_intervalpairs(cigar)

For a given CIGAR string, return a list of (Interval,Interval) pairs. The length of the returned list will be equal to the number of CIGAR operations.
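
The conversion from a CIGAR string to interval pairs can be sketched in a few lines. The function cigar_to_pairs below is a hypothetical stand-in that returns plain tuples rather than Interval objects. It assumes the convention of the alignment diagram in this section: D consumes only the reference, I consumes only the target, and M, = and X consume both. Other operations are rejected.

```python
import re


def cigar_to_pairs(cigar):
    """Return one ((ref_start, ref_end), (tgt_start, tgt_end)) pair per CIGAR op."""
    if not re.fullmatch(r"(\d+[MIDX=])+", cigar):
        raise ValueError("unsupported or malformed CIGAR string: %r" % cigar)

    pairs = []
    ref_pos = tgt_pos = 0
    for length, op in re.findall(r"(\d+)([MIDX=])", cigar):
        n = int(length)
        # Assumed convention: D consumes ref only, I consumes tgt only.
        ref_len = n if op in "M=XD" else 0
        tgt_len = n if op in "M=XI" else 0
        pairs.append(
            ((ref_pos, ref_pos + ref_len), (tgt_pos, tgt_pos + tgt_len))
        )
        ref_pos += ref_len
        tgt_pos += tgt_len
    return pairs
```

Applied to "15M5D15M5I15M", this yields the five segments of the alignment example, one per operation.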

hgvs.projector

Utility class that projects variants from one transcript to another via a common reference sequence.

class hgvs.projector.Projector(hdp, alt_ac, src_ac, dst_ac, src_alt_aln_method=u'splign', dst_alt_aln_method=u'splign')

Bases: object

The Projector class implements liftover between two transcripts via a common reference sequence.

Parameters:
  • hdp – HGVS Data Provider Interface-compliant instance (see hgvs.dataproviders.interface.Interface)
  • alt_ac – string representing the accession of the common reference sequence (e.g., NC_000003.11)
  • src_ac – string representing the source transcript accession (e.g., NM_000551.2)
  • dst_ac – string representing the destination transcript accession (e.g., NM_000551.3)
  • src_alt_aln_method – string representing the source transcript alignment method
  • dst_alt_aln_method – string representing the destination transcript alignment method

This class assumes (and verifies) that the transcripts are on the same strand. This assumption obviates some work in flipping sequence variants twice unnecessarily.
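
The idea of projecting through a common reference can be illustrated without the library. The sketch below is a simplification under stated assumptions: both transcripts lie on the plus strand, exons are given as hypothetical (g_start, g_end) interbase pairs, positions are 0-based base indices, and intronic positions are not supported. The names n_to_g, g_to_n, project_forward, and project_backward are illustrative and are not the Projector API.

```python
def n_to_g(n_pos, exons):
    """Map a 0-based transcript index to a 0-based genomic index."""
    offset = n_pos
    for g_start, g_end in exons:
        length = g_end - g_start
        if offset < length:
            return g_start + offset
        offset -= length
    raise ValueError("position %d is beyond the transcript end" % n_pos)


def g_to_n(g_pos, exons):
    """Map a 0-based genomic index to a 0-based transcript index."""
    consumed = 0
    for g_start, g_end in exons:
        if g_start <= g_pos < g_end:
            return consumed + (g_pos - g_start)
        consumed += g_end - g_start
    raise ValueError("position %d is not exonic in this transcript" % g_pos)


def project_forward(n_pos, src_exons, dst_exons):
    """Source transcript -> common reference -> destination transcript."""
    return g_to_n(n_to_g(n_pos, src_exons), dst_exons)


def project_backward(n_pos, src_exons, dst_exons):
    """Destination transcript -> common reference -> source transcript."""
    return g_to_n(n_to_g(n_pos, dst_exons), src_exons)
```

For example, with src_exons = [(100, 110), (120, 130)] and dst_exons = [(95, 110), (120, 130)], the destination has five extra bases at its 5' end, so source index 0 projects forward to destination index 5.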

project_interval_backward(c_interval)

project c_interval on the destination transcript to the source transcript

Parameters: c_interval – an hgvs.interval.Interval object on the destination transcript
Returns: c_interval: an hgvs.interval.Interval object on the source transcript
project_interval_forward(c_interval)

project c_interval on the source transcript to the destination transcript

Parameters: c_interval – an hgvs.interval.Interval object on the source transcript
Returns: c_interval: an hgvs.interval.Interval object on the destination transcript
project_variant_backward(c_variant)

project c_variant on the destination transcript onto the source transcript

Parameters: c_variant – an hgvs.sequencevariant.SequenceVariant object on the destination transcript
Returns: c_variant: an hgvs.sequencevariant.SequenceVariant object on the source transcript
project_variant_forward(c_variant)

project c_variant on the source transcript onto the destination transcript

Parameters: c_variant – an hgvs.sequencevariant.SequenceVariant object on the source transcript
Returns: c_variant: an hgvs.sequencevariant.SequenceVariant object on the destination transcript

hgvs.transcriptmapper

Provides coordinate (not variant) mapping operations between genomic (g), non-coding (n), cds (c), and protein (p) coordinates.

class hgvs.transcriptmapper.TranscriptMapper(hdp, tx_ac, alt_ac, alt_aln_method)

Bases: object

Provides coordinate (not variant) mapping operations between genomic (g), non-coding (n), cds (c), and protein (p) coordinates. All coordinates are 1-based inclusive, per the HGVS recommendations. All methods take hgvs.location.Interval objects.

Parameters:
  • hdp – HGVS Data Provider Interface-compliant instance (see hgvs.dataproviders.interface.Interface)
  • tx_ac (str) – string representing transcript accession (e.g., NM_000551.2)
  • alt_ac (str) – string representing the reference sequence accession (e.g., NC_000003.11)
  • alt_aln_method (str) – string representing the alignment method; valid values depend on data source
c_to_g(c_interval)

convert a transcript CDS (c.) interval to a genomic (g.) interval

c_to_n(c_interval)

convert a transcript CDS (c.) interval to a transcript cDNA (n.) interval

g_to_c(g_interval)

convert a genomic (g.) interval to a transcript CDS (c.) interval

g_to_n(g_interval)

convert a genomic (g.) interval to a transcript cDNA (n.) interval

is_coding_transcript
n_to_c(n_interval)

convert a transcript cDNA (n.) interval to a transcript CDS (c.) interval

n_to_g(n_interval)

convert a transcript cDNA (n.) interval to a genomic (g.) interval