Parsing and Formatting


Provides a parser for HGVS strings and HGVS-related conceptual components, such as intronic-offset coordinates.

class hgvs.parser.Parser(grammar_fn=u'/home/docs/.cache/Python-Eggs/hgvs-1.3.1.dev2+g33e0634-py2.7.egg-tmp/hgvs/_data/hgvs.pymeta', expose_all_rules=False)

Bases: object

Provides comprehensive parsing of HGVS variant strings (i.e., variants represented according to the Human Genome Variation Society recommendations) into Python representations. The class wraps a Parsing Expression Grammar, exposing rules of that grammar as methods (prefixed with parse_) that parse an input string according to the rule. The class exposes all rules, so that it’s possible to parse both full variant representations and their components, like so:

>>> hp = Parser()
>>> v = hp.parse_hgvs_variant("NM_01234.5:c.22+1A>T")
>>> v
SequenceVariant(ac=NM_01234.5, type=c, posedit=22+1A>T, gene=None)
>>> v.posedit.pos
BaseOffsetInterval(start=22+1, end=22+1, uncertain=False)
>>> i = hp.parse_c_interval("22+1")
>>> i
BaseOffsetInterval(start=22+1, end=22+1, uncertain=False)

The parse_hgvs_variant and parse_c_interval methods correspond to the hgvs_variant and c_interval rules in the grammar, respectively.

As a convenience, the Parser provides the parse method as a shorthand for parse_hgvs_variant:

>>> v = hp.parse("NM_01234.5:c.22+1A>T")
>>> v
SequenceVariant(ac=NM_01234.5, type=c, posedit=22+1A>T, gene=None)

Because the methods are generated on-the-fly and depend on the grammar that is loaded at runtime, a full list of methods is not available in the documentation. However, the list of rules/methods is available via the rules instance variable.
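The idea of generating parse_ methods from a rule table at runtime can be illustrated with a minimal, standard-library-only sketch. This is not the hgvs implementation: ToyParser, its _RULES table, and the regular expressions are hypothetical stand-ins for a real grammar, chosen only to show how a rules list and the generated methods can stay in sync.

```python
import re


class ToyParser:
    """Toy stand-in for a grammar-driven parser (not the hgvs Parser)."""

    # Hypothetical rule table: rule name -> pattern that must match the whole input.
    _RULES = {
        "c_interval": re.compile(r"(?P<base>\d+)(?P<offset>[+-]\d+)?"),
        "accession": re.compile(r"(?P<prefix>[A-Z]{2})_(?P<number>\d+)\.(?P<version>\d+)"),
    }

    def __init__(self):
        # Publish the rule names, then attach one parse_<rule> method per rule.
        self.rules = sorted(self._RULES)
        for name in self.rules:
            setattr(self, "parse_" + name, self._make_method(name))

    def _make_method(self, name):
        pattern = self._RULES[name]

        def parse(s):
            m = pattern.fullmatch(s)
            if m is None:
                raise ValueError(f"{s!r} does not match rule {name!r}")
            return m.groupdict()

        parse.__name__ = "parse_" + name
        return parse


tp = ToyParser()
print(tp.rules)                      # names of the available rules
print(tp.parse_c_interval("22+1"))   # parsed components as a dict
```

Because the methods are attached in the constructor, they are discoverable only on an instance, which is why a rules-style attribute is the practical way to enumerate them.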

A few notable methods are listed below:

parse_hgvs_variant() parses any valid HGVS string supported by the grammar.

>>> hp.parse_hgvs_variant("NM_01234.5:c.22+1A>T")
SequenceVariant(ac=NM_01234.5, type=c, posedit=22+1A>T, gene=None)
>>> hp.parse_hgvs_variant("NP_012345.6:p.Ala22Trp")
SequenceVariant(ac=NP_012345.6, type=p, posedit=Ala22Trp, gene=None)

The hgvs_variant rule iteratively attempts parsing using the major classes of HGVS variants. For slight improvements in efficiency, those rules may be invoked directly:

>>> hp.parse_p_variant("NP_012345.6:p.Ala22Trp")
SequenceVariant(ac=NP_012345.6, type=p, posedit=Ala22Trp, gene=None)
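Why calling a specific rule can be slightly cheaper than the general one is easier to see in a toy model of ordered choice. The sketch below uses only the standard library and is an assumption-laden simplification, not the hgvs grammar: the patterns, the rule functions, and the dict return value are all hypothetical.

```python
import re


def _make_rule(type_code):
    """Build a toy rule for '<accession>:<type_code>.<posedit>' strings."""
    pattern = re.compile(r"(?P<ac>[^:]+):" + re.escape(type_code) + r"\.(?P<posedit>.+)")

    def rule(s):
        m = pattern.fullmatch(s)
        if m is None:
            raise ValueError(f"{s!r} is not a {type_code}. variant")
        return {"ac": m["ac"], "type": type_code, "posedit": m["posedit"]}

    return rule


# Hypothetical per-class rules, tried in this order by the general rule.
parse_toy_c_variant = _make_rule("c")
parse_toy_p_variant = _make_rule("p")


def parse_toy_variant(s):
    """Ordered choice: return the result of the first rule that accepts s."""
    for rule in (parse_toy_c_variant, parse_toy_p_variant):
        try:
            return rule(s)
        except ValueError:
            continue  # this alternative failed; try the next one
    raise ValueError(f"{s!r} matches no known variant class")


print(parse_toy_variant("NP_012345.6:p.Ala22Trp"))
```

In this model, a protein variant passed to parse_toy_variant first fails the c. alternative before succeeding on the p. one, whereas parse_toy_p_variant skips that failed attempt.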

Similarly, components of the underlying structure may be parsed directly as well:

>>> hp.parse_c_posedit("22+1A>T")
PosEdit(pos=22+1, edit=A>T, uncertain=False)
>>> hp.parse_c_interval("22+1")
BaseOffsetInterval(start=22+1, end=22+1, uncertain=False)
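To make the structure of a component such as 22+1A>T concrete, the following standard-library sketch splits it into a base position, an intronic offset, and the substituted bases. It is only an illustration under stated assumptions: ToyPosEdit and parse_toy_c_posedit are hypothetical names, the pattern handles single-base substitutions only, and the result is not the PosEdit object that the hgvs parser returns.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyPosEdit:
    base: int    # position in the coding sequence
    offset: int  # signed intronic offset; 0 when the position is exonic
    ref: str     # reference base
    alt: str     # alternate base


# Hypothetical pattern: <base>[<+/-offset>]<ref>><alt>, single bases only.
_SUBSTITUTION = re.compile(
    r"(?P<base>\d+)(?P<offset>[+-]\d+)?(?P<ref>[ACGT])>(?P<alt>[ACGT])"
)


def parse_toy_c_posedit(s: str) -> ToyPosEdit:
    m = _SUBSTITUTION.fullmatch(s)
    if m is None:
        raise ValueError(f"{s!r} is not a simple c. substitution")
    # A missing offset group is None, which is treated as offset 0.
    return ToyPosEdit(int(m["base"]), int(m["offset"] or 0), m["ref"], m["alt"])


print(parse_toy_c_posedit("22+1A>T"))
```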

Parse the HGVS variant v, returning a SequenceVariant.

Parameters: v (str) – an HGVS-formatted variant as a string
Return type: SequenceVariant