Quick Start¶
This tutorial provides a comprehensive example of how to use the HGVS package. Specifically, we’ll:
- install hgvs
- parse a genomic variant
- project the genomic variant to all transcripts
- infer the amino acid changes for coding transcripts
We’ll use rs397509113 in BRCA1. This variant is coincident with an exon in 3 coding transcripts, an intron in 2 other coding transcripts, and a non-coding transcript.
transcript (c.) | protein (p.) | comment |
---|---|---|
NM_007294.3:c.3844del | NP_009225.1:p.(Glu1282AsnfsTer25) | |
NM_007297.3:c.3703del | NP_009228.2:p.(Glu1235AsnfsTer25) | |
NM_007300.3:c.3844del | NP_009231.2:p.(Glu1282AsnfsTer25) | |
NM_007298.3:c.788-655del | NP_009229.2:p.? | intronic variant |
NM_007299.3:c.788-655del | NP_009230.2:p.? | intronic variant |
NR_027676.1:n.3980del | non-coding | non-coding transcript |
Install hgvs¶
For this demo, you’ll obviously need hgvs. In a reasonably modern environment, the following should suffice:
$ pip install hgvs
More detailed installation instructions are in Installing hgvs.
Start hgvs-shell¶
The hgvs
package includes an executable called hgvs-shell
,
which sets up hgvs
for you. On the command line, type:
$ hgvs-shell
This is approximately the same thing as:
$ IPython
>>> from hgvs.easy import *
hgvs.easy
connects to data sources and initializes commonly used
objects that provide most functionality.
Note
Variant validation, normalization, and projection require
access to external data, specifically exon structures,
transcript alignments, and protein accessions. Right now,
the only source of this data is via the UTA sister projects.
When you import hgvs.easy
, you will connect to
publicly available data sources. If you want more
information on the architecture of hgvs
and UTA, see
Introduction. See Installing hgvs for information about
installing data sources locally for speed and privacy.
Parse the genomic variant¶
In the hgvs-shell
, do:
>>> var_g = parse("NC_000017.11:g.43091687delC")
Note
All functionality in hgvs
is provided by Python
classes. hgvs.easy
exposes common methods with
functional forms also, which are used in this quick start
guide. For example, parse(...)
above actually calls
`parser.parse(...)
, where parser
is an instance of
the hgvs.parser.Parser
class.
Parsing a variant results in objects that represent the variant. A
SequenceVariant object is comprised of an accession (ac
), an HGVS
sequence type
(c,g,m,n,r,p), and 0 or more specific sequence
changes (posedit
– a POSition and EDIt).:
>>> var_g
SequenceVariant(ac=NC_000017.11, type=g, posedit=43091687del, gene=None)
The posedit
is itself an object of the hgvs.posedit.PosEdit
class:
>>> var_g.posedit
PosEdit(pos=43091687, edit=del, uncertain=False)
The pos
(position) and edit
attributes are also objects that
can represent intervals and more complex edit operations like indels.
The uncertain
flag enables representation of HGVS uncertainty
(typically with parentheses around the uncertain
component). “stringifying” a variant regenerates an HGVS variant:
>>> str(var_g)
'NC_000017.11:g.43091687del'
>>> "This is a variant: {v}".format(v=var_g)
'This is a variant: NC_000017.11:g.43091687del'
And, in Python 3, stringification works in f-strings, like so:
> >> f"{var_g}"
'NC_000017.11:g.43091687del'
Validating and Normalizing Variants¶
hgvs provides functionality to validate and normalize variants:
>>> normalize(var_g)
SequenceVariant(ac=NC_000017.11, type=g, posedit=43091688del, gene=None)
>>> validate(var_g)
True
Projecting variants between sequences¶
When two sequences have alignments available in , a variant may be
“projected” from one sequence to the other. hgvs
supports
projecting variants
- from g to c, n
- from c to g, n, p
- from n to c, g
The hgvs.assemblymapper.AssemblyMapper
class provides a
high-level interface to variant projection. hgvs.easy
initializes AssemblyMapper instances for GRCh37 and GRCh37 as am37
and am38
respectively. For example:
>>> transcripts = am38.relevant_transcripts(var_g)
>>> sorted(transcripts)
['NM_007294.3', 'NM_007297.3', 'NM_007298.3', 'NM_007299.3', 'NM_007300.3', 'NR_027676.1']
We can now project the genomic variant, var_g
, to each of these
transcripts using the g_to_t
function, and the transcript variant
to a protein sequnce using the t_to_p
function.
>>> for ac in sorted(get_relevant_transcripts(var_g)):
... var_t = g_to_t(var_g, ac)
... var_p = t_to_p(var_t)
... print("-> " + str(var_t) + " (" + str(var_p) + ") ")
...
-> NM_007294.3:c.3844del (NP_009225.1:p.(Glu1282AsnfsTer25))
-> NM_007297.3:c.3703del (NP_009228.2:p.(Glu1235AsnfsTer25))
-> NM_007298.3:c.788-655del (NP_009229.2:p.?)
-> NM_007299.3:c.788-655del (NP_009230.2:p.?)
-> NM_007300.3:c.3844del (NP_009231.2:p.(Glu1282AsnfsTer25))
-> NR_027676.1:n.3980del (non-coding)
In hgvs
, the t
type can be either c
or n
. Only
variants on coding sequences (c.
) can be projected to a protein
sequence. As a special case, t_to_p
returns “non-coding” when the
input variant is on a non-coding sequence.