External Data Providers

hgvs.dataproviders.interface

Defines the abstract data provider interface

class hgvs.dataproviders.interface.Interface(mode=None, cache=None)

Bases: object

Variant mapping and validation requires access to external data, specifically exon structures, transcript alignments, and protein accessions. In order to isolate the hgvs package from the myriad choices and tradeoffs, these data are provided through an implementation of the (abstract) HGVS Data Provider Interface.

As of June 2014, the only available data provider implementation uses the Universal Transcript Archive (UTA), a sister project that provides access to transcripts and genome-transcript alignments. Invitae provides a public UTA database instance that is used by default; see the UTA page for instructions on installing your own PostgreSQL or SQLite version. In the future, other implementations may be available for other data sources.

Pure virtual class for the HGVS Data Provider Interface. Every data provider implementation should be a subclass (possibly indirect) of this class.

Parameters:
  • mode (str) – cache mode (None [default; LRU cache], ‘learn’, ‘run’, ‘verify’)
  • cache (str) – local cache file name
data_version()
get_acs_for_protein_seq()
get_assembly_map(assembly_name)
get_gene_info(gene)
get_pro_ac_for_tx_ac(tx_ac)
get_seq(ac, start_i=None, end_i=None)
get_similar_transcripts(tx_ac)
get_tx_exons(tx_ac, alt_ac, alt_aln_method)
get_tx_for_gene(gene)
get_tx_for_region(alt_ac, alt_aln_method, start_i, end_i)
get_tx_identity_info(tx_ac)
get_tx_info(tx_ac, alt_ac, alt_aln_method)
get_tx_mapping_options(tx_ac)
interface_version()
required_version = None
schema_version()
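
The subclassing pattern described above can be sketched as follows. This is an illustration only and does not import hgvs: SketchInterface and InMemoryProvider are hypothetical names, the accessions are placeholders, and only two of the listed methods are shown.

```python
# Hypothetical stand-ins for illustration; not the hgvs classes themselves.
class SketchInterface(object):
    """Toy abstract base: every data method must be overridden."""
    required_version = None

    def get_pro_ac_for_tx_ac(self, tx_ac):
        raise NotImplementedError

    def get_tx_mapping_options(self, tx_ac):
        raise NotImplementedError


class InMemoryProvider(SketchInterface):
    """Toy provider backed by plain dicts instead of a database."""

    def __init__(self, pro_acs, mapping_options):
        self._pro_acs = pro_acs
        self._mapping_options = mapping_options

    def get_pro_ac_for_tx_ac(self, tx_ac):
        # None when the transcript is unknown
        return self._pro_acs.get(tx_ac)

    def get_tx_mapping_options(self, tx_ac):
        # empty list when the transcript is unknown
        return list(self._mapping_options.get(tx_ac, []))


# Placeholder accessions, invented for this sketch
provider = InMemoryProvider(
    pro_acs={"TX_EXAMPLE.1": "PRO_EXAMPLE.1"},
    mapping_options={"TX_EXAMPLE.1": [{"alt_ac": "REF_EXAMPLE.1"}]},
)
print(provider.get_pro_ac_for_tx_ac("TX_EXAMPLE.1"))
```

Code written against the base class can then be exercised with such a stand-in, for example in tests that should not touch a database.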

hgvs.dataproviders.uta

implements an hgvs data provider interface using UTA (https://github.com/biocommons/uta)

class hgvs.dataproviders.uta.ParseResult

Bases: urlparse.ParseResult

Subclass of urlparse.ParseResult that adds database and schema methods, and provides stringification.

database
schema
class hgvs.dataproviders.uta.UTABase(url, mode=None, cache=None)

Bases: hgvs.dataproviders.interface.Interface

data_version()
get_acs_for_protein_seq(seq)

returns a list of protein accessions for a given sequence. The list is guaranteed to contain at least one element with the MD5-based accession (MD5_01234abc…def56789) at the end of the list.
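
A minimal sketch of how such a guaranteed last element could be built. It assumes the accession is "MD5_" followed by the hexadecimal MD5 digest of the sequence, which is what the example format suggests; any normalization the library applies before hashing is not shown here. The helper names and the peptide are made up for illustration.

```python
import hashlib


def md5_accession(seq):
    """Return 'MD5_' plus the hex MD5 digest of seq (assumed format)."""
    return "MD5_" + hashlib.md5(seq.encode("ascii")).hexdigest()


def acs_with_md5_fallback(known_acs, seq):
    """Append the MD5-based accession so the result is never empty."""
    return list(known_acs) + [md5_accession(seq)]


# Arbitrary peptide with no known accessions: only the MD5 entry remains
acs = acs_with_md5_fallback([], "MESRETLSSS")
print(acs[-1])
```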

get_assembly_map(assembly_name)

return a list of accessions for the specified assembly name (e.g., GRCh38.p5)

get_gene_info(gene)

returns basic information about the gene.

Parameters:
  • gene (str) – HGNC gene name

# database results
hgnc    | ATM
maploc  | 11q22-q23
descr   | ataxia telangiectasia mutated
summary | The protein encoded by this gene belongs to the PI3/PI4-kinase family. This…
aliases | AT1,ATA,ATC,ATD,ATE,ATDC,TEL1,TELO1
added   | 2014-02-04 21:39:32.57125

get_pro_ac_for_tx_ac(tx_ac)

Return the (single) associated protein accession for a given transcript accession, or None if not found.

get_seq(ac, start_i=None, end_i=None)
get_similar_transcripts(tx_ac)

Return a list of transcripts that are similar to the given transcript, with relevant similarity criteria.

>>> sim_tx = hdp.get_similar_transcripts('NM_001285829.1')
>>> dict(sim_tx[0])
{ 'cds_eq': False,
  'cds_es_fp_eq': False,
  'es_fp_eq': True,
  'tx_ac1': 'NM_001285829.1',
  'tx_ac2': 'ENST00000498907' }

where:

  • cds_eq means that the CDS sequences are identical
  • es_fp_eq means that the full exon structures are identical (i.e., incl. UTR)
  • cds_es_fp_eq means that the cds-clipped portions of the exon structures are identical (i.e., excluding UTR)
  • Hint: “es” = “exon set”, “fp” = “fingerprint”, “eq” = “equal”

“exon structure” refers to the start and end coordinates on a specified reference sequence. Thus, having the same exon structure means that the transcripts are defined on the same reference sequence and have the same exon spans on that sequence.
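
The difference between comparing full and CDS-clipped exon structures can be sketched as below. This is not how UTA computes its fingerprints, which is not described here; the function names, the reference name "REF1", and all coordinates are invented, and spans are assumed to be half-open (start, end) pairs on the reference.

```python
def clip_to_cds(exons, cds_start, cds_end):
    """Keep only the part of each (start, end) span inside [cds_start, cds_end)."""
    clipped = []
    for start, end in exons:
        s, e = max(start, cds_start), min(end, cds_end)
        if s < e:
            clipped.append((s, e))
    return clipped


def same_structure(ref_a, exons_a, ref_b, exons_b):
    """Same reference sequence and the same exon spans on it."""
    return ref_a == ref_b and sorted(exons_a) == sorted(exons_b)


# Two invented transcripts on the same reference, differing only in UTR extent
tx1 = [(100, 200), (300, 400), (500, 650)]
tx2 = [(150, 200), (300, 400), (500, 600)]
cds_start, cds_end = 160, 580  # same CDS bounds for both

full_structure_equal = same_structure("REF1", tx1, "REF1", tx2)
cds_clipped_structure_equal = same_structure(
    "REF1", clip_to_cds(tx1, cds_start, cds_end),
    "REF1", clip_to_cds(tx2, cds_start, cds_end),
)
print(full_structure_equal, cds_clipped_structure_equal)
```

Here the full structures differ while the CDS-clipped structures agree, which corresponds to es_fp_eq being False and cds_es_fp_eq being True.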

get_tx_exons(tx_ac, alt_ac, alt_aln_method)

return transcript exon info for supplied accession (tx_ac, alt_ac, alt_aln_method), or None if not found

Parameters:
  • tx_ac (str) – transcript accession with version (e.g., ‘NM_000051.3’)
  • alt_ac (str) – specific genomic sequence (e.g., NC_000011.4)
  • alt_aln_method (str) – sequence alignment method (e.g., splign, blat)

# tx_exons = db.get_tx_exons('NM_199425.2', 'NC_000020.10', 'splign')
# len(tx_exons)
3

Each record in tx_exons has the following attributes:

{
    'tes_exon_set_id' : 98390,
    'aes_exon_set_id' : 298679,
    'tx_ac'           : 'NM_199425.2',
    'alt_ac'          : 'NC_000020.10',
    'alt_strand'      : -1,
    'alt_aln_method'  : 'splign',
    'ord'             : 2,
    'tx_exon_id'      : 936834,
    'alt_exon_id'     : 2999028,
    'tx_start_i'      : 786,
    'tx_end_i'        : 1196,
    'alt_start_i'     : 25059178,
    'alt_end_i'       : 25059588,
    'cigar'           : '410=',
}

For example:

# tx_exons[0]['tx_ac']
'NM_199425.2'
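
As a sketch of how the coordinate fields of one record relate to each other, the snippet below recomputes the exon length from the values shown above. It assumes the *_i fields are interbase (0-based, end-exclusive) coordinates, which is consistent with the numbers but is an inference; cigar_ops is a hypothetical helper, not part of hgvs.

```python
import re

# Values copied from the record shown above
exon = {
    "tx_start_i": 786, "tx_end_i": 1196,
    "alt_start_i": 25059178, "alt_end_i": 25059588,
    "alt_strand": -1, "cigar": "410=",
}


def cigar_ops(cigar):
    """Split a CIGAR string such as '410=' into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([=XIDNM])", cigar)]


tx_len = exon["tx_end_i"] - exon["tx_start_i"]      # span on the transcript
alt_len = exon["alt_end_i"] - exon["alt_start_i"]   # span on the reference
ops = cigar_ops(exon["cigar"])
print(tx_len, alt_len, ops)
```

For this exon both spans are 410 bases long and the CIGAR is a single run of 410 matches, so the three values agree.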

get_tx_for_gene(gene)

return transcript info records for supplied gene, in order of decreasing length

Parameters:
  • gene (str) – HGNC gene name
get_tx_for_region(alt_ac, alt_aln_method, start_i, end_i)

return transcripts that overlap given region

Parameters:
  • alt_ac (str) – reference sequence (e.g., NC_000007.13)
  • alt_aln_method (str) – alignment method (e.g., splign)
  • start_i (int) – 5’ bound of region
  • end_i (int) – 3’ bound of region
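
A minimal sketch of an overlap test between a query region and an aligned span. It assumes half-open interbase coordinates, so regions that merely touch do not overlap; how UTA treats such boundary cases is not stated here. The overlaps helper is hypothetical and the span is the exon from the get_tx_exons example.

```python
def overlaps(start_a, end_a, start_b, end_b):
    """True if half-open intervals [start_a, end_a) and [start_b, end_b) intersect."""
    return start_a < end_b and start_b < end_a


# Exon span from the get_tx_exons example
exon_start_i, exon_end_i = 25059178, 25059588

inside = overlaps(25059000, 25059200, exon_start_i, exon_end_i)
touching = overlaps(25059588, 25060000, exon_start_i, exon_end_i)
print(inside, touching)
```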
get_tx_identity_info(tx_ac)

returns features associated with a single transcript.

Parameters:
  • tx_ac (str) – transcript accession with version (e.g., ‘NM_199425.2’)

# database output
-[ RECORD 1 ]--+-------------
tx_ac          | NM_199425.2
alt_ac         | NM_199425.2
alt_aln_method | transcript
cds_start_i    | 283
cds_end_i      | 1003
lengths        | {707,79,410}
hgnc           | VSX1
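
The fields of this record can be combined as in the sketch below, which derives exon spans on the transcript from the lengths and computes the CDS length. It assumes interbase coordinates and that the lengths are listed in transcript order; both are inferences from the values, not documented guarantees.

```python
from itertools import accumulate

# Values copied from the record shown above
lengths = [707, 79, 410]
cds_start_i, cds_end_i = 283, 1003

ends = list(accumulate(lengths))        # cumulative exon ends
starts = [0] + ends[:-1]                # each exon starts where the last ended
exon_spans = list(zip(starts, ends))

tx_len = ends[-1]                       # total transcript length
cds_len = cds_end_i - cds_start_i       # coding sequence length
print(exon_spans, tx_len, cds_len)
```

The last span, (786, 1196), matches tx_start_i and tx_end_i of the exon record shown under get_tx_exons, and the CDS length of 720 is a multiple of three.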

get_tx_info(tx_ac, alt_ac, alt_aln_method)

return a single transcript info for supplied accession (tx_ac, alt_ac, alt_aln_method), or None if not found

Parameters:
  • tx_ac (str) – transcript accession with version (e.g., ‘NM_000051.3’)
  • alt_ac (str) – specific genomic sequence (e.g., NC_000011.4)
  • alt_aln_method (str) – sequence alignment method (e.g., splign, blat)

# database output
-[ RECORD 1 ]--+-------------
hgnc           | ATM
cds_start_i    | 385
cds_end_i      | 9556
tx_ac          | NM_000051.3
alt_ac         | AC_000143.1
alt_aln_method | splign

get_tx_mapping_options(tx_ac)

Return all transcript alignment sets for a given transcript accession (tx_ac); returns an empty list if the transcript does not exist. Use this method to discover the mapping options supported in the database.

Parameters:
  • tx_ac (str) – transcript accession with version (e.g., ‘NM_000051.3’)

# database output
-[ RECORD 1 ]--+-------------
hgnc           | ATM
cds_start_i    | 385
cds_end_i      | 9556
tx_ac          | NM_000051.3
alt_ac         | AC_000143.1
alt_aln_method | splign
-[ RECORD 2 ]--+-------------
hgnc           | ATM
cds_start_i    | 385
cds_end_i      | 9556
tx_ac          | NM_000051.3
alt_ac         | NC_000011.9
alt_aln_method | blat
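
One way to choose among the returned options is sketched below, using plain dicts that mirror the two records above. The pick_option helper and its selection rule (accession prefix and optional alignment method) are hypothetical; real records may expose fields differently.

```python
# Dicts mirroring the two records shown above
options = [
    {"tx_ac": "NM_000051.3", "alt_ac": "AC_000143.1", "alt_aln_method": "splign"},
    {"tx_ac": "NM_000051.3", "alt_ac": "NC_000011.9", "alt_aln_method": "blat"},
]


def pick_option(options, alt_ac_prefix=None, alt_aln_method=None):
    """Return the first option matching the given criteria, or None."""
    for opt in options:
        if alt_ac_prefix and not opt["alt_ac"].startswith(alt_ac_prefix):
            continue
        if alt_aln_method and opt["alt_aln_method"] != alt_aln_method:
            continue
        return opt
    return None


chromosome_option = pick_option(options, alt_ac_prefix="NC_")
splign_option = pick_option(options, alt_aln_method="splign")
print(chromosome_option["alt_ac"], splign_option["alt_ac"])
```

The selected alt_ac and alt_aln_method can then be passed to get_tx_info or get_tx_exons together with tx_ac.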

required_version = u'1.1'
schema_version()
class hgvs.dataproviders.uta.UTA_postgresql(url, pooling=False, application_name=None, mode=None, cache=None)

Bases: hgvs.dataproviders.uta.UTABase

close()
hgvs.dataproviders.uta.connect(db_url=None, pooling=False, application_name=None, mode=None, cache=None)

Connect to a UTA database instance and return a UTA interface instance.

Parameters:
  • db_url (string) – URL for database connection
  • pooling (bool) – whether to use connection pooling (PostgreSQL only)
  • application_name (str) – log application name in connection (useful for debugging; PostgreSQL only)

When called with an explicit db_url argument, that db_url is used for connecting.

When called without an explicit argument, the function default is determined by the environment variable UTA_DB_URL if it exists, or hgvs.datainterface.uta.public_db_url otherwise.

>>> hdp = connect()
>>> hdp.schema_version()
'1.1'
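
The precedence described above (explicit argument, then UTA_DB_URL, then the public default) can be sketched as follows. resolve_db_url is a hypothetical helper rather than the library's code, and PUBLIC_DB_URL is a placeholder taken from the URL examples below; the default actually configured in the package may differ.

```python
import os

# Placeholder default, copied from the URL examples; an assumption
PUBLIC_DB_URL = "postgresql://anonymous:anonymous@uta.biocommons.org/uta"


def resolve_db_url(db_url=None, environ=None):
    """Pick the URL: explicit argument, else UTA_DB_URL, else the default."""
    environ = os.environ if environ is None else environ
    if db_url is not None:
        return db_url
    return environ.get("UTA_DB_URL", PUBLIC_DB_URL)


# A plain dict stands in for the process environment here
print(resolve_db_url(environ={"UTA_DB_URL": "postgresql://localhost/uta"}))
```

Passing the environment as a parameter keeps the sketch easy to test without modifying os.environ.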

The format of the db_url is driver://user:pass@host/database (the same as that used by SQLAlchemy). Examples:

A remote public postgresql database:
postgresql://anonymous:anonymous@uta.biocommons.org/uta
A local postgresql database:
postgresql://localhost/uta
A local SQLite database:
sqlite:////tmp/uta-0.0.6.db
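
To illustrate how the parts of a db_url map onto driver, credentials, host, and database, here is a small sketch using Python 3's urllib.parse (the successor of the urlparse module named above). split_db_url is a hypothetical helper; it is not the ParseResult subclass documented earlier and does not handle a schema component.

```python
from urllib.parse import urlparse


def split_db_url(db_url):
    """Break driver://user:pass@host/database into its named parts."""
    parts = urlparse(db_url)
    return {
        "driver": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        # drop only the separator slash, so sqlite:////tmp/x keeps /tmp/x
        "database": parts.path[1:],
    }


for url in (
    "postgresql://anonymous:anonymous@uta.biocommons.org/uta",
    "postgresql://localhost/uta",
    "sqlite:////tmp/uta-0.0.6.db",
):
    print(split_db_url(url))
```

Note that the SQLite URL has no host, so the four slashes leave an absolute file path as the database part.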

For postgresql db_urls, pooling=True causes connect to use a psycopg2.pool.ThreadedConnectionPool.