This section is intended for all users and provides an understanding of key concepts and components of the hgvs package.
Reference Sequence Types
The HGVS Recommendations provide for six types of reference sequences. Because the type influences the syntax and object representation in the hgvs package, it is important to understand these distinctions. A summary of the types follows:
Datum refers to the definition for position 1 in the sequence. “Sequence start” means the first position of the sequence. “Translation start” means the position of the ATG that typically starts translation (only for coding transcripts).
Continuous coordinates are the familiar ordinal counting (1, 2, 3, ...). There are no breaks for intervening sequence.
Base-Offset coordinates use a base position, which is an index in the specified sequence, and an optional offset from that base position. Non-zero offsets refer to non-coding sequence, such as 5’ UTR, 3’ UTR, or intronic position. Examples are 22 (with a zero offset), 22+6, and *6.
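The three examples above can be split into their parts with a short sketch. This is only an illustration of the base/offset idea: the regular expression, the `ToyBaseOffsetPosition` class, and the `parse_position` function are assumptions made for this example, not the grammar or classes of the hgvs package.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern for this sketch only: an optional '*' or '-' prefix,
# a base number, and an optional signed offset.
_POSITION_RE = re.compile(r"^(?P<prefix>[*-])?(?P<base>\d+)(?P<offset>[+-]\d+)?$")


@dataclass(frozen=True)
class ToyBaseOffsetPosition:
    base: int         # index in the specified sequence (negative for '-N')
    offset: int       # 0 when the position lies on the base itself
    after_stop: bool  # True when written with a leading '*'


def parse_position(text: str) -> ToyBaseOffsetPosition:
    """Split a position such as '22', '22+6', or '*6' into base and offset."""
    match = _POSITION_RE.match(text)
    if match is None:
        raise ValueError(f"not a base-offset position: {text!r}")
    base = int(match.group("base"))
    if match.group("prefix") == "-":
        base = -base
    return ToyBaseOffsetPosition(
        base=base,
        offset=int(match.group("offset") or 0),
        after_stop=match.group("prefix") == "*",
    )


for text in ("22", "22+6", "*6"):
    print(text, "->", parse_position(text))
```

Keeping the base and the offset as separate integers, rather than one number, is what lets a position point into sequence that is absent from the spliced transcript.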
Variant Object Representation
HGVS variants are represented using classes that model the elemental concepts of an HGVS sequence variant. Each object holds references to the data that define it; those data may be Python built-in types such as integers (int) or strings (unicode), or they may be other classes in the hgvs package.
For example, a variant parsed like this:
>>> import hgvs.parser
>>> hgvsparser = hgvs.parser.Parser()
>>> var = hgvsparser.parse_hgvs_variant('NM_001197320.1:c.281C>T')
will generate an object tree like the following:
For that variant, the properties may be obtained easily by dot lookup:
>>> var.ac
'NM_001197320.1'
>>> var.type
'c'
>>> var.posedit
PosEdit(pos=281, edit=C>T, uncertain=False)
>>> var.posedit.pos
BaseOffsetInterval(start=281, end=281, uncertain=False)
>>> var.posedit.pos.start, var.posedit.pos.end
(BaseOffsetPosition(base=281, offset=0, datum=Datum.CDS_START, uncertain=False), BaseOffsetPosition(base=281, offset=0, datum=Datum.CDS_START, uncertain=False))
>>> var.posedit.edit
NARefAlt(ref='C', alt='T', uncertain=False)
The object representation makes it easy to modify variants conceptually rather than textually. For example, if the previous variant was inferred rather than sequenced, we might wish to declare that it is uncertain, which then causes the stringified version to contain the edit in parentheses:
>>> var.posedit.uncertain = True
>>> str(var)
'NM_001197320.1:c.(281C>T)'
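To see why editing an attribute changes the string form, it may help to look at a stripped-down model of such an object tree. The `ToyVariant`, `ToyPosEdit`, and `ToyEdit` classes below are simplified stand-ins written for this example; they are not the hgvs package's classes and omit intervals, datums, and most edit types.

```python
from dataclasses import dataclass


# Simplified stand-ins for illustration only; NOT the hgvs package's classes.
@dataclass
class ToyEdit:
    ref: str
    alt: str

    def __str__(self) -> str:
        return f"{self.ref}>{self.alt}"


@dataclass
class ToyPosEdit:
    pos: int
    edit: ToyEdit
    uncertain: bool = False

    def __str__(self) -> str:
        text = f"{self.pos}{self.edit}"
        # An uncertain position/edit pair is written in parentheses.
        return f"({text})" if self.uncertain else text


@dataclass
class ToyVariant:
    ac: str
    type: str
    posedit: ToyPosEdit

    def __str__(self) -> str:
        # The string is rebuilt from the parts each time it is requested.
        return f"{self.ac}:{self.type}.{self.posedit}"


var = ToyVariant("NM_001197320.1", "c", ToyPosEdit(281, ToyEdit("C", "T")))
before = str(var)
var.posedit.uncertain = True
after = str(var)
print(before)
print(after)
```

Because the string is derived from the tree on demand, a change to any node shows up in the output without any text manipulation.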
Variant Mapping Tools
Variant mapping is supported by several modules. Most users will
likely be content with AssemblyMapper. For
completeness, it may help to understand how all of the mappers relate
to each other.
The IntervalMapper maps pairs of contiguous sequence intervals to each other. It is the “lowest” component of the mapping hierarchy and “knows” nothing about biological sequences.

The TranscriptMapper uses IntervalMapper to map pairs of exon segments (typically exons in the transcript and genomic sequences). It must be instantiated with a transcript accession, reference accession, and alignment method, and provides functions to map sequence intervals (not variants) for the specified alignment. It also accommodates strand orientation.

The VariantMapper uses hgvs.transcriptmapper.TranscriptMapper to provide g<->r, r<->c, g<->c, and c->p transformations for SequenceVariant objects. As with the TranscriptMapper, it must be instantiated with an appropriate transcript, reference, and alignment method.

VariantMapper requires that the caller provide a transcript accession and an appropriate reference sequence, so the caller must already know which reference sequence is correct. The alignment method is also required. While the VariantMapper interface serves the general case of mapping to any sequence (including patch sequences), it is burdensome for the most common case. AssemblyMapper wraps VariantMapper to provide identical mapping functionality that is tailored for mapping between a transcript and a primary assembly.

Projector maps variants between transcripts using a common reference and alignment method. For example, this tool can transfer a variant from one RefSeq to another, or even from an Ensembl transcript to a RefSeq.
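The lowest layer of this hierarchy, mapping between paired intervals, can be illustrated with a minimal sketch. `SegmentPair` and `ToyIntervalMapper` are hypothetical names invented for this example, the coordinates are made up, and the sketch assumes ungapped, equal-length segment pairs with 1-based inclusive coordinates; it is not the hgvs IntervalMapper and does not handle alignment gaps.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SegmentPair:
    """One pair of equal-length, contiguous intervals (1-based, inclusive)."""
    src_start: int
    src_end: int
    dst_start: int
    dst_end: int


class ToyIntervalMapper:
    """Maps positions between paired segments; knows nothing about biology."""

    def __init__(self, pairs: List[SegmentPair], reverse_strand: bool = False):
        for p in pairs:
            if p.src_end - p.src_start != p.dst_end - p.dst_start:
                raise ValueError("paired segments must have equal length")
        self.pairs = pairs
        self.reverse_strand = reverse_strand

    def map_position(self, src_pos: int) -> int:
        for p in self.pairs:
            if p.src_start <= src_pos <= p.src_end:
                delta = src_pos - p.src_start
                # On the reverse strand, source positions count down the target.
                if self.reverse_strand:
                    return p.dst_end - delta
                return p.dst_start + delta
        raise ValueError(f"position {src_pos} is not covered by any segment")


# Two made-up exons: transcript 1-100 and 101-250, separated in the genome.
forward = ToyIntervalMapper([
    SegmentPair(1, 100, 1001, 1100),
    SegmentPair(101, 250, 2001, 2150),
])
# The same idea for a made-up transcript on the reverse strand.
reverse = ToyIntervalMapper([
    SegmentPair(1, 100, 2051, 2150),
    SegmentPair(101, 150, 1001, 1050),
], reverse_strand=True)

print(forward.map_position(50), forward.map_position(101))
print(reverse.map_position(1), reverse.map_position(150))
```

A transcript-level mapper would build the segment pairs from exon alignments, and a variant-level mapper would then apply the position mapping to the start and end of a variant's interval.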
External Data Sources
Variant mapping and validation require access to external data,
specifically exon structures, transcript alignments, accessions, and
sequences. In order to isolate the hgvs package from the myriad
choices and tradeoffs, these data are provided through an
implementation of the (abstract) Data Provider Interface
(hgvs.dataproviders.interface). Currently, the only concrete
implementation of the data provider interface uses UTA, an archive
of transcripts, transcript sequences, and transcript-reference
alignments. Invitae provides a public UTA instance
(PostgreSQL). hgvs uses this public UTA instance by default, so
most users won’t need to worry about this aspect of the hgvs package.
However, a docker image of UTA is also available; see
Installing hgvs for details.
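The separation between mapping code and its data source can be sketched with an abstract base class. Everything in this example is hypothetical: `ToyDataProvider`, its `fetch_sequence` and `fetch_exons` methods, the in-memory implementation, and the accession `TOY_0001.1` are invented for illustration and are not the methods of hgvs.dataproviders.interface.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive; a convention of this sketch


class ToyDataProvider(ABC):
    """Hypothetical interface: the only thing calling code depends on."""

    @abstractmethod
    def fetch_sequence(self, accession: str) -> str:
        """Return the sequence for an accession."""

    @abstractmethod
    def fetch_exons(self, accession: str) -> List[Exon]:
        """Return the exon intervals for a transcript accession."""


class InMemoryProvider(ToyDataProvider):
    """One concrete implementation, backed by plain dictionaries."""

    def __init__(self, sequences: Dict[str, str], exons: Dict[str, List[Exon]]):
        self._sequences = sequences
        self._exons = exons

    def fetch_sequence(self, accession: str) -> str:
        return self._sequences[accession]

    def fetch_exons(self, accession: str) -> List[Exon]:
        return self._exons[accession]


def spliced_length(provider: ToyDataProvider, accession: str) -> int:
    """Works with any provider because it uses only the abstract interface."""
    return sum(end - start + 1 for start, end in provider.fetch_exons(accession))


provider = InMemoryProvider(
    sequences={"TOY_0001.1": "ATGGCCTAA"},
    exons={"TOY_0001.1": [(1, 4), (5, 9)]},
)
print(spliced_length(provider, "TOY_0001.1"))
```

Swapping the dictionary-backed class for one that queries a database would leave `spliced_length` unchanged, which is the kind of isolation the interface is meant to provide.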
Alternatively, users may implement their own providers that conform to
the data providers interface. See
hgvs.dataproviders.uta for an