Mapping

hgvs.assemblymapper

class hgvs.assemblymapper.AssemblyMapper(hdp, assembly_name='GRCh38', alt_aln_method='splign', normalize=True, prevalidation_level='EXTRINSIC', in_par_assume='X', replace_reference=True, add_gene_symbol=False, *args, **kwargs)[source]

Bases: hgvs.variantmapper.VariantMapper

Provides simplified variant mapping for a single assembly and transcript-reference alignment method.

AssemblyMapper inherits from VariantMapper, which provides all projection functionality, and adds:

  • Automatic selection of genomic sequence accession
  • Transcript selection from genomic coordinates
  • Normalization after projection
  • Special handling for PAR regions

AssemblyMapper is instantiated with an assembly name and alt_aln_method. These enable the following conveniences over VariantMapper:

  • The assembly and alignment method are used to automatically select an appropriate chromosomal reference sequence when mapping from a transcript to a genome (i.e., c_to_g(…) and n_to_g(…)).
  • A new method, relevant_transcripts(g_variant), returns a list of transcript accessions available for the specified variant. These accessions are candidates for mapping from genomic to transcript coordinates (i.e., g_to_c(…) and g_to_n(…)).

Note: AssemblyMapper supports only chromosomal references (e.g., NC_000006.11). It does not support contigs or other genomic sequences (e.g., NT_167249.1).

Parameters:
  • hdp (object) – instance of hgvs.dataprovider subclass
  • replace_reference (bool) – replace the reference bases stated in the variant with those of the reference sequence (entails additional network access)
  • assembly_name (str) – name of assembly (e.g., “GRCh38.p5”)
  • alt_aln_method (str) – genome-transcript alignment method (e.g., “splign”, “blat”, “genewise”)
  • normalize (bool) – normalize variants after projection
  • prevalidation_level (str) – level of validation to perform before mapping: None, Intrinsic, or Extrinsic
  • in_par_assume (str) – during x_to_g, assume this chromosome name if alignment is ambiguous
Raises:

HGVSError subclasses – for a variety of mapping and data lookup failures
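A typical call sequence might look like the sketch below. It assumes the UTA data provider (hgvs.dataproviders.uta.connect()) and the hgvs parser (hgvs.parser.Parser), neither of which is described in this section, and it needs the hgvs package plus network access to a UTA database. The function name project_c_to_g is a hypothetical helper, not part of hgvs.

```python
def project_c_to_g(hgvs_c, assembly_name="GRCh38"):
    """Parse a c. variant string and project it onto the chosen assembly."""
    # Imported inside the function so that this sketch can be loaded even
    # where hgvs is not installed; calling it requires hgvs and a UTA database.
    import hgvs.assemblymapper
    import hgvs.dataproviders.uta
    import hgvs.parser

    # Assumption: UTA is used as the data provider (the hdp argument).
    hdp = hgvs.dataproviders.uta.connect()
    # Assumption: variants arrive as strings and are parsed with hgvs.parser.
    parser = hgvs.parser.Parser()

    mapper = hgvs.assemblymapper.AssemblyMapper(
        hdp,
        assembly_name=assembly_name,
        alt_aln_method="splign",
        normalize=True,
    )
    var_c = parser.parse_hgvs_variant(hgvs_c)
    # No alt_ac is passed: the mapper selects the chromosomal accession itself.
    return mapper.c_to_g(var_c)
```

The caller supplies only the variant and the assembly name; failures in lookup or mapping surface as HGVSError subclasses, as noted above.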

c_to_g(var_c)[source]

Given a parsed c. variant, return a g. variant on the chromosomal reference sequence selected for the configured assembly, using the alignment method set at instantiation (default is “splign” from NCBI). Unlike VariantMapper.c_to_g, this method takes no alt_ac or alt_aln_method argument.

Parameters:
  • var_c (hgvs.sequencevariant.SequenceVariant) – a variant object
Returns:

variant object (hgvs.sequencevariant.SequenceVariant)

Raises:

HGVSInvalidVariantError – if var_c is not of type “c”

c_to_n(var_c)[source]

Given a parsed c. variant, return a n. variant on the specified transcript using the specified alignment method (default is “transcript” indicating a self alignment).

Parameters: var_c (hgvs.sequencevariant.SequenceVariant) – a variant object
Returns: variant object (hgvs.sequencevariant.SequenceVariant)
Raises: HGVSInvalidVariantError – if var_c is not of type “c”

c_to_p(var_c)[source]

Converts a c. SequenceVariant to a p. SequenceVariant on the specified protein accession.

Author: Rudy Rico

Parameters: var_c (hgvs.sequencevariant.SequenceVariant) – a variant object
Return type:

hgvs.sequencevariant.SequenceVariant

g_to_c(var_g, tx_ac)[source]

Given a parsed g. variant, return a c. variant on the specified transcript using the alignment method set at instantiation (default is “splign” from NCBI). Unlike VariantMapper.g_to_c, this method takes no alt_aln_method argument.

Parameters:
  • var_g (hgvs.sequencevariant.SequenceVariant) – a variant object
  • tx_ac (str) – a transcript accession (e.g., NM_012345.6 or ENST012345678)
Returns:

variant object (hgvs.sequencevariant.SequenceVariant) using CDS coordinates

Raises:

HGVSInvalidVariantError – if var_g is not of type “g”

g_to_n(var_g, tx_ac)[source]

Given a parsed g. variant, return a n. variant on the specified transcript using the alignment method set at instantiation (default is “splign” from NCBI). Unlike VariantMapper.g_to_n, this method takes no alt_aln_method argument.

Parameters:
  • var_g (hgvs.sequencevariant.SequenceVariant) – a variant object
  • tx_ac (str) – a transcript accession (e.g., NM_012345.6 or ENST012345678)
Returns:

variant object (hgvs.sequencevariant.SequenceVariant) using transcript (n.) coordinates

Raises:

HGVSInvalidVariantError – if var_g is not of type “g”

g_to_t(var_g, tx_ac)[source]

n_to_c(var_n)[source]

Given a parsed n. variant, return a c. variant on the specified transcript using the specified alignment method (default is “transcript” indicating a self alignment).

Parameters: var_n (hgvs.sequencevariant.SequenceVariant) – a variant object
Returns: variant object (hgvs.sequencevariant.SequenceVariant)
Raises: HGVSInvalidVariantError – if var_n is not of type “n”

n_to_g(var_n)[source]

Given a parsed n. variant, return a g. variant on the chromosomal reference sequence selected for the configured assembly, using the alignment method set at instantiation (default is “splign” from NCBI). Unlike VariantMapper.n_to_g, this method takes no alt_ac or alt_aln_method argument.

Parameters:
  • var_n (hgvs.sequencevariant.SequenceVariant) – a variant object
Returns:

variant object (hgvs.sequencevariant.SequenceVariant)

Raises:

HGVSInvalidVariantError – if var_n is not of type “n”

relevant_transcripts(var_g)[source]

Return a list of transcript accessions (strings) for the given variant, selected by genomic overlap.

t_to_g(var_t)[source]

t_to_p(var_t)[source]

Return a protein variant, or “non-coding” for non-coding variant types

CAUTION: Unlike other x_to_y methods, which always return SequenceVariant instances, this method returns a string when the variant type is n. This is intended as a convenience, particularly when looping over relevant_transcripts, projecting with g_to_t, and then requesting a protein representation for coding transcripts.
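The loop described in the caution can be sketched as follows. The helper name protein_consequences is hypothetical; it only relies on the relevant_transcripts, g_to_t, and t_to_p methods documented above and expects an already constructed AssemblyMapper and a parsed g. variant.

```python
def protein_consequences(mapper, var_g):
    """Map a g. variant to a protein variant for each overlapping transcript.

    Returns a dict of transcript accession -> p. variant, with None for
    non-coding transcripts.
    """
    results = {}
    for tx_ac in mapper.relevant_transcripts(var_g):
        # g_to_t yields a c. or n. variant, depending on the transcript.
        var_t = mapper.g_to_t(var_g, tx_ac)
        var_p = mapper.t_to_p(var_t)
        # t_to_p returns a plain string for n. variants rather than a
        # SequenceVariant, so the type has to be checked before use.
        results[tx_ac] = None if isinstance(var_p, str) else var_p
    return results
```

Checking the return type with isinstance, rather than comparing against a particular string, keeps the caller independent of the exact wording of the non-coding marker.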

hgvs.variantmapper

Provides VariantMapper and AssemblyMapper to project variants between sequences using AlignmentMapper.

class hgvs.variantmapper.VariantMapper(hdp, replace_reference=True, prevalidation_level='EXTRINSIC', add_gene_symbol=False)[source]

Bases: object

Maps SequenceVariant objects between g., n., r., c., and p. representations.

g⟷{c,n,r} projections are similar in that c, n, and r variants may use intronic coordinates. There are two essential differences that distinguish the three types:

  • Sequence start: In n and r variants, position 1 is the sequence start; in c variants, position 1 is the translation start site (the first base of the start codon).
  • Alphabet: In n and c variants, sequences are DNA; in r. variants, sequences are RNA.

These differences are summarized in this diagram:

g ----acgtatgcac--gtctagacgt----      ----acgtatgcac--gtctagacgt----      ----acgtatgcac--gtctagacgt----
      \         \/         /              \         \/         /              \         \/         /
c      acgtATGCACGTCTAGacgt         n      acgtatgcacgtctagacgt         r      acguaugcacgucuagacgu
           1                               1                                   1
p          MetHisValTer

The g excerpt and exon structure are identical in all three panels. The g⟷n transformation, which is the most basic, accounts for the offset of the aligned sequences (shown with “1”) and the exon structure. The g⟷c transformation is akin to the g⟷n transformation, but requires an additional offset to account for the translation start site (c.1); the CDS is shown in uppercase. The g⟷r transformation is akin to the g⟷n transformation with a change of alphabet.
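The relationships in the diagram can be checked with a few lines of plain Python. This is only a sketch over the toy sequence from the diagram; the constants and helper names (TX, CDS_START_I, CDS_END_I, n_to_c_label, to_rna, translate) are hypothetical and are not part of hgvs.

```python
# Toy transcript from the diagram (n. coordinates, DNA alphabet).
TX = "acgtatgcacgtctagacgt"
CDS_START_I = 4   # 0-based index of the A of ATG, i.e. c.1
CDS_END_I = 16    # 0-based index just past the stop codon

# Only the codons that occur in the toy CDS; a real table has 64 entries.
CODONS = {"ATG": "Met", "CAC": "His", "GTC": "Val", "TAG": "Ter"}


def n_to_c_label(n_pos):
    """Return the c. label for a 1-based n. position (there is no c.0)."""
    if not 1 <= n_pos <= len(TX):
        raise ValueError("n. position outside the transcript")
    if n_pos <= CDS_START_I:       # 5' UTR: counted backward from c.1
        return f"c.{n_pos - CDS_START_I - 1}"
    if n_pos > CDS_END_I:          # 3' UTR: counted from the end of the CDS
        return f"c.*{n_pos - CDS_END_I}"
    return f"c.{n_pos - CDS_START_I}"


def to_rna(dna):
    """Change of alphabet between n. and r. sequences."""
    return dna.replace("t", "u").replace("T", "U")


def translate(cds):
    """Translate the toy CDS codon by codon."""
    cds = cds.upper()
    return "".join(CODONS[cds[i:i + 3]] for i in range(0, len(cds), 3))


print(n_to_c_label(5))    # c.1
print(n_to_c_label(4))    # c.-1
print(n_to_c_label(17))   # c.*1
print(to_rna(TX))         # acguaugcacgucuagacgu
print(translate(TX[CDS_START_I:CDS_END_I]))  # MetHisValTer
```

The n⟷c step is a pure offset and n⟷r is a pure change of alphabet, which is why both can be layered on top of a single g⟷n core.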

Therefore, this code uses g⟷n as the core transformation between genomic and c, n, and r variants: all c⟷g and r⟷g transformations use n⟷g after accounting for the above differences. For example, c_to_g accounts for the translation start site offset, then calls n_to_g.
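The layering can be illustrated on the toy example from the diagram. The sketch below is not the hgvs implementation: EXONS and CDS_START_I are hypothetical values read off the diagram, the transcript is assumed to lie on the plus strand, and intronic offsets and alignment gaps are ignored.

```python
# Exons of the toy transcript on the genomic excerpt, as 1-based inclusive
# g. ranges. Hypothetical values taken from the diagram.
EXONS = [(5, 14), (17, 26)]
CDS_START_I = 4  # number of n. positions that precede c.1


def n_to_g(n_pos):
    """Core transformation: 1-based n. position to 1-based g. position."""
    if n_pos < 1:
        raise ValueError("n. positions start at 1")
    remaining = n_pos
    for g_start, g_end in EXONS:
        exon_len = g_end - g_start + 1
        if remaining <= exon_len:
            return g_start + remaining - 1
        remaining -= exon_len
    raise ValueError("n. position beyond the transcript end")


def g_to_n(g_pos):
    """Inverse of n_to_g for exonic g. positions."""
    consumed = 0
    for g_start, g_end in EXONS:
        if g_start <= g_pos <= g_end:
            return consumed + g_pos - g_start + 1
        consumed += g_end - g_start + 1
    raise ValueError("g. position is not inside an exon")


def c_to_g(c_pos):
    """Apply the CDS offset, then delegate to the n_to_g core."""
    if c_pos < 1:
        raise ValueError("this sketch handles CDS positions only")
    return n_to_g(c_pos + CDS_START_I)


print(n_to_g(10))  # 14, last base of the first exon
print(n_to_g(11))  # 17, first base of the second exon
print(c_to_g(1))   # 9, the A of ATG in the genomic excerpt
print(g_to_n(17))  # 11
```

Only n_to_g and g_to_n know about the exon structure; c_to_g adds nothing but the offset.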

All methods require and return objects of type hgvs.sequencevariant.SequenceVariant.

Parameters:
  • replace_reference (bool) – replace the reference bases stated in the variant with those of the reference sequence (entails additional network access)
  • prevalidation_level (str) – level of validation to perform before mapping: None, Intrinsic, or Extrinsic

c_to_g(var_c, alt_ac, alt_aln_method='splign')[source]

Given a parsed c. variant, return a g. variant on the specified reference sequence (alt_ac) using the specified alignment method (default is “splign” from NCBI).

Parameters:
  • var_c (hgvs.sequencevariant.SequenceVariant) – a variant object
  • alt_ac (str) – a reference sequence accession (e.g., NC_000001.11)
  • alt_aln_method (str) – the alignment method; valid values depend on data source
Returns:

variant object (hgvs.sequencevariant.SequenceVariant)

Raises:

HGVSInvalidVariantError – if var_c is not of type “c”

c_to_n(var_c)[source]

Given a parsed c. variant, return a n. variant on the specified transcript using the specified alignment method (default is “transcript” indicating a self alignment).

Parameters: var_c (hgvs.sequencevariant.SequenceVariant) – a variant object
Returns: variant object (hgvs.sequencevariant.SequenceVariant)
Raises: HGVSInvalidVariantError – if var_c is not of type “c”

c_to_p(var_c, pro_ac=None)[source]

Converts a c. SequenceVariant to a p. SequenceVariant on the specified protein accession.

Author: Rudy Rico

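Parameters:
  • var_c (hgvs.sequencevariant.SequenceVariant) – a variant object
  • pro_ac (str) – a protein accession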
Return type:

hgvs.sequencevariant.SequenceVariant

g_to_c(var_g, tx_ac, alt_aln_method='splign')[source]

Given a parsed g. variant, return a c. variant on the specified transcript using the specified alignment method (default is “splign” from NCBI).

Parameters:
  • var_g (hgvs.sequencevariant.SequenceVariant) – a variant object
  • tx_ac (str) – a transcript accession (e.g., NM_012345.6 or ENST012345678)
  • alt_aln_method (str) – the alignment method; valid values depend on data source
Returns:

variant object (hgvs.sequencevariant.SequenceVariant) using CDS coordinates

Raises:

HGVSInvalidVariantError – if var_g is not of type “g”

g_to_n(var_g, tx_ac, alt_aln_method='splign')[source]

Given a parsed g. variant, return a n. variant on the specified transcript using the specified alignment method (default is “splign” from NCBI).

Parameters:
  • var_g (hgvs.sequencevariant.SequenceVariant) – a variant object
  • tx_ac (str) – a transcript accession (e.g., NM_012345.6 or ENST012345678)
  • alt_aln_method (str) – the alignment method; valid values depend on data source
Returns:

variant object (hgvs.sequencevariant.SequenceVariant) using transcript (n.) coordinates

Raises:

HGVSInvalidVariantError – if var_g is not of type “g”

g_to_t(var_g, tx_ac, alt_aln_method='splign')[source]

n_to_c(var_n)[source]

Given a parsed n. variant, return a c. variant on the specified transcript using the specified alignment method (default is “transcript” indicating a self alignment).

Parameters: var_n (hgvs.sequencevariant.SequenceVariant) – a variant object
Returns: variant object (hgvs.sequencevariant.SequenceVariant)
Raises: HGVSInvalidVariantError – if var_n is not of type “n”

n_to_g(var_n, alt_ac, alt_aln_method='splign')[source]

Given a parsed n. variant, return a g. variant on the specified reference sequence (alt_ac) using the specified alignment method (default is “splign” from NCBI).

Parameters:
  • var_n (hgvs.sequencevariant.SequenceVariant) – a variant object
  • alt_ac (str) – a reference sequence accession (e.g., NC_000001.11)
  • alt_aln_method (str) – the alignment method; valid values depend on data source
Returns:

variant object (hgvs.sequencevariant.SequenceVariant)

Raises:

HGVSInvalidVariantError – if var_n is not of type “n”

t_to_g(var_t, alt_ac, alt_aln_method='splign')[source]

hgvs.projector

Utility class that projects variants from one transcript to another via a common reference sequence.

class hgvs.projector.Projector(hdp, alt_ac, src_ac, dst_ac, src_alt_aln_method='splign', dst_alt_aln_method='splign')[source]

Bases: object

The Projector class implements liftover between two transcripts via a common reference sequence.

Parameters:
  • hdp – HGVS Data Provider Interface-compliant instance (see hgvs.dataproviders.interface.Interface)
  • alt_ac – string representing the accession of the common reference sequence (e.g., NC_000019.10)
  • src_ac – string representing the source transcript accession (e.g., NM_000551.2)
  • dst_ac – string representing the destination transcript accession (e.g., NM_000551.3)
  • src_alt_aln_method – string representing the source transcript alignment method
  • dst_alt_aln_method – string representing the destination transcript alignment method

This class assumes (and verifies) that the transcripts are on the same strand. This assumption avoids the unnecessary work of flipping sequence variants twice.
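The idea of projecting through a common reference can be sketched without hgvs. ToyAlignment and ToyProjector below are hypothetical stand-ins, not the library's classes; each transcript is modeled as a single ungapped exon, and the genomic start coordinates are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyAlignment:
    """Ungapped, single-exon alignment of a transcript to a reference."""
    tx_ac: str
    g_start: int  # reference coordinate aligned to transcript position 1
    strand: int   # +1 or -1

    def t_to_g(self, t_pos):
        return self.g_start + self.strand * (t_pos - 1)

    def g_to_t(self, g_pos):
        return self.strand * (g_pos - self.g_start) + 1


class ToyProjector:
    """Project positions between two transcripts via the shared reference."""

    def __init__(self, src, dst):
        # Same-strand check, mirroring the assumption stated above.
        if src.strand != dst.strand:
            raise ValueError("transcripts must be on the same strand")
        self.src, self.dst = src, dst

    def project_forward(self, t_pos):
        """Source transcript position -> destination transcript position."""
        return self.dst.g_to_t(self.src.t_to_g(t_pos))

    def project_backward(self, t_pos):
        """Destination transcript position -> source transcript position."""
        return self.src.g_to_t(self.dst.t_to_g(t_pos))


# Hypothetical coordinates: the destination starts 10 bases upstream.
src = ToyAlignment("NM_000551.2", g_start=100, strand=1)
dst = ToyAlignment("NM_000551.3", g_start=90, strand=1)
projector = ToyProjector(src, dst)
print(projector.project_forward(1))    # 11
print(projector.project_backward(11))  # 1
```

Forward and backward projection are the same two-step composition with the roles of the two alignments swapped.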

project_interval_backward(c_interval)[source]

project c_interval on the destination transcript to the source transcript

Parameters: c_interval – an hgvs.interval.Interval object on the destination transcript
Returns: c_interval – an hgvs.interval.Interval object on the source transcript

project_interval_forward(c_interval)[source]

project c_interval on the source transcript to the destination transcript

Parameters: c_interval – an hgvs.interval.Interval object on the source transcript
Returns: c_interval – an hgvs.interval.Interval object on the destination transcript

project_variant_backward(c_variant)[source]

project c_variant on the destination transcript onto the source transcript

Parameters: c_variant – an hgvs.sequencevariant.SequenceVariant object on the destination transcript
Returns: c_variant – an hgvs.sequencevariant.SequenceVariant object on the source transcript

project_variant_forward(c_variant)[source]

project c_variant on the source transcript onto the destination transcript

Parameters: c_variant – an hgvs.sequencevariant.SequenceVariant object on the source transcript
Returns: c_variant – an hgvs.sequencevariant.SequenceVariant object on the destination transcript

hgvs.alignmentmapper

Mapping positions between pairs of sequence alignments

The AlignmentMapper class is at the heart of mapping between aligned sequences.

class hgvs.alignmentmapper.AlignmentMapper(hdp, tx_ac, alt_ac, alt_aln_method)[source]

Bases: object

Provides coordinate (not variant) mapping operations between genomic (g), non-coding (n) and cds (c) coordinates according to a CIGAR.

Parameters:
  • hdp – HGVS Data Provider Interface-compliant instance (see hgvs.dataproviders.interface.Interface)
  • tx_ac (str) – string representing transcript accession (e.g., NM_000551.2)
  • alt_ac (str) – string representing the reference sequence accession (e.g., NC_000019.10)
  • alt_aln_method (str) – string representing the alignment method; valid values depend on data source
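To illustrate what coordinate mapping "according to a CIGAR" involves, the sketch below walks a CIGAR string to map a query offset to a reference offset. It is not the AlignmentMapper implementation: the function name query_to_ref is hypothetical, and the operator semantics are the SAM-style ones (=, M, X consume both sequences; I consumes the query only; D and N consume the reference only), which may differ from the convention used by the hgvs data provider.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([=MXIDN])")


def query_to_ref(cigar, q_pos):
    """Map a 0-based query offset to a 0-based reference offset.

    Returns None when q_pos falls inside an insertion, which has no
    counterpart on the reference.
    """
    if q_pos < 0:
        raise ValueError("position must be non-negative")
    q = r = 0  # offsets consumed so far on query and reference
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "=MX":          # aligned block: consumes both sequences
            if q_pos < q + n:
                return r + (q_pos - q)
            q += n
            r += n
        elif op == "I":          # insertion: consumes the query only
            if q_pos < q + n:
                return None
            q += n
        else:                    # D or N: consumes the reference only
            r += n
    raise ValueError("position lies beyond the end of the alignment")


# Two 10-base exons separated by a 2-base intron.
print(query_to_ref("10=2N10=", 9))   # 9
print(query_to_ref("10=2N10=", 10))  # 12
print(query_to_ref("3=1I3=", 3))     # None
print(query_to_ref("3=2D3=", 3))     # 5
```

A full mapper would also handle strand, interval endpoints, and positions that fall in introns; those are left out here.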
alt_ac
alt_aln_method
c_to_g(c_interval)[source]

convert a transcript CDS (c.) interval to a genomic (g.) interval

c_to_n(c_interval)[source]

convert a transcript CDS (c.) interval to a transcript cDNA (n.) interval

cds_end_i
cds_start_i
cigar
cigar_op
g_to_c(g_interval)[source]

convert a genomic (g.) interval to a transcript CDS (c.) interval

g_to_n(g_interval)[source]

convert a genomic (g.) interval to a transcript cDNA (n.) interval

gc_offset
is_coding_transcript
n_to_c(n_interval)[source]

convert a transcript cDNA (n.) interval to a transcript CDS (c.) interval

n_to_g(n_interval)[source]

convert a transcript (n.) interval to a genomic (g.) interval

ref_pos
strand
tgt_len
tgt_pos
tx_ac