External Data Providers
hgvs.dataproviders.interface
Defines the abstract data provider interface
-
class hgvs.dataproviders.interface.Interface(mode=None, cache=None)
Bases: object
Variant mapping and validation requires access to external data, specifically exon structures, transcript alignments, and protein accessions. In order to isolate the hgvs package from the myriad choices and tradeoffs, these data are provided through an implementation of the (abstract) HGVS Data Provider Interface.
As of June 2014, the only available data provider implementation uses the Universal Transcript Archive (UTA), a sister project that provides access to transcripts and genome-transcript alignments. Invitae provides a public UTA database instance that is used by default; see the UTA page for instructions on installing your own PostgreSQL or SQLite version. In the future, other implementations may be available for other data sources.
Pure virtual class for the HGVS Data Provider Interface. Every data provider implementation should be a subclass (possibly indirect) of this class.
-
required_version = None
-
hgvs.dataproviders.uta
Implements an hgvs data provider interface using UTA (https://github.com/biocommons/uta)
-
class hgvs.dataproviders.uta.ParseResult
Bases: urllib.parse.ParseResult
Subclass of urllib.parse.ParseResult that adds database and schema methods, and provides stringification.
-
database
-
schema
-
-
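The idea of extending urllib.parse.ParseResult with database and schema accessors can be sketched as follows. This is a minimal illustration, not the hgvs implementation: the class name DbUrl is hypothetical, the accessors are written as properties, and the sketch assumes the database and schema are the first and second path components of the URL, as in the db_url format described for connect.

```python
from urllib.parse import ParseResult, urlparse


class DbUrl(ParseResult):
    """Hypothetical ParseResult subclass exposing database and schema."""

    def __new__(cls, parsed):
        # ParseResult is a six-field namedtuple; re-pack the parsed fields.
        return super().__new__(cls, *parsed)

    @property
    def database(self):
        # Assumption: the path looks like /database/schema
        parts = self.path.split("/")
        return parts[1] if len(parts) > 1 else None

    @property
    def schema(self):
        parts = self.path.split("/")
        return parts[2] if len(parts) > 2 else None

    def __str__(self):
        # Stringification returns the original URL.
        return self.geturl()


url = DbUrl(urlparse("postgresql://localhost/uta_dev/uta_20170707"))
print(url.database, url.schema, str(url))
```

A URL without a path yields None for both accessors in this sketch.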
class hgvs.dataproviders.uta.UTABase(url, mode=None, cache=None)
Bases: hgvs.dataproviders.interface.Interface
-
get_acs_for_protein_seq(seq)
Returns a list of protein accessions for a given sequence. The list is guaranteed to contain at least one element, with the MD5-based accession (MD5_01234abc…def56789) at the end of the list.
-
get_assembly_map(assembly_name)
Returns a list of accessions for the specified assembly name (e.g., GRCh38.p5).
-
get_gene_info(gene)
Returns basic information about the gene.
Parameters: gene (str) – HGNC gene name
# database results
hgnc    | ATM
maploc  | 11q22-q23
descr   | ataxia telangiectasia mutated
summary | The protein encoded by this gene belongs to the PI3/PI4-kinase family. This…
aliases | AT1,ATA,ATC,ATD,ATE,ATDC,TEL1,TELO1
added   | 2014-02-04 21:39:32.57125
-
get_pro_ac_for_tx_ac(tx_ac)
Return the (single) associated protein accession for a given transcript accession, or None if not found.
-
get_similar_transcripts(tx_ac)
Return a list of transcripts that are similar to the given transcript, with relevant similarity criteria.
>> sim_tx = hdp.get_similar_transcripts('NM_001285829.1')
>> dict(sim_tx[0])
{ 'cds_eq': False, 'cds_es_fp_eq': False, 'es_fp_eq': True, 'tx_ac1': 'NM_001285829.1', 'tx_ac2': 'ENST00000498907' }
where:
- cds_eq means that the CDS sequences are identical
- es_fp_eq means that the full exon structures are identical (i.e., incl. UTR)
- cds_es_fp_eq means that the CDS-clipped portions of the exon structures are identical (i.e., excluding UTR)
- Hint: “es” = “exon set”, “fp” = “fingerprint”, “eq” = “equal”
“exon structure” refers to the start and end coordinates on a specified reference sequence. Thus, having the same exon structure means that the transcripts are defined on the same reference sequence and have the same exon spans on that sequence.
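The distinction between full and CDS-clipped exon structure equality can be illustrated with a small sketch. It is only a model of the definitions above, not how UTA computes its fingerprints: the helper names, the coordinates, and the accession NC_EXAMPLE.1 are invented for illustration, and exon spans are assumed to be (start, end) pairs on a shared reference sequence.

```python
def exon_structure(alt_ac, exons):
    """Comparable key: the reference accession plus sorted exon spans."""
    return (alt_ac, tuple(sorted(exons)))


def clip_to_cds(exons, cds_start, cds_end):
    """Trim exon spans to the CDS interval, dropping UTR-only exons."""
    clipped = []
    for start, end in exons:
        s, e = max(start, cds_start), min(end, cds_end)
        if s < e:
            clipped.append((s, e))
    return clipped


# Two hypothetical transcripts on the same reference, differing only in UTR.
ref = "NC_EXAMPLE.1"  # made-up accession
tx_a = [(100, 200), (300, 400), (500, 650)]
tx_b = [(120, 200), (300, 400), (500, 620)]
cds = (150, 600)  # CDS bounds in reference coordinates

es_eq = exon_structure(ref, tx_a) == exon_structure(ref, tx_b)
cds_es_eq = (exon_structure(ref, clip_to_cds(tx_a, *cds))
             == exon_structure(ref, clip_to_cds(tx_b, *cds)))
print(es_eq, cds_es_eq)
```

Here the full exon structures differ while the CDS-clipped structures match, which corresponds to es_fp_eq being False and cds_es_fp_eq being True.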
-
get_tx_exons(tx_ac, alt_ac, alt_aln_method)
Return transcript exon info for the supplied accession (tx_ac, alt_ac, alt_aln_method), or None if not found.
# tx_exons = db.get_tx_exons('NM_199425.2', 'NC_000020.10', 'splign')
# len(tx_exons)
3
tx_exons have the following attributes:
{
  'tes_exon_set_id' : 98390
  'aes_exon_set_id' : 298679
  'tx_ac'           : 'NM_199425.2'
  'alt_ac'          : 'NC_000020.10'
  'alt_strand'      : -1
  'alt_aln_method'  : 'splign'
  'ord'             : 2
  'tx_exon_id'      : 936834
  'alt_exon_id'     : 2999028
  'tx_start_i'      : 786
  'tx_end_i'        : 1196
  'alt_start_i'     : 25059178
  'alt_end_i'       : 25059588
  'cigar'           : '410='
}
For example:
# tx_exons[0]['tx_ac']
'NM_199425.2'
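As a rough illustration of how the coordinate and cigar attributes relate, the sketch below uses a plain dict carrying the values from the record above. It assumes the _i coordinates are zero-based and end-exclusive (interbase), so a span length is simply end minus start; the helper names are hypothetical.

```python
import re

# Selected attributes copied from the exon record shown above.
exon = {
    "tx_start_i": 786,
    "tx_end_i": 1196,
    "alt_start_i": 25059178,
    "alt_end_i": 25059588,
    "alt_strand": -1,
    "cigar": "410=",
}


def span_length(start_i, end_i):
    """Length of an interbase (zero-based, end-exclusive) span."""
    return end_i - start_i


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([=XIDMN])", cigar)]


tx_len = span_length(exon["tx_start_i"], exon["tx_end_i"])
alt_len = span_length(exon["alt_start_i"], exon["alt_end_i"])
print(tx_len, alt_len, parse_cigar(exon["cigar"]))
```

For this record the transcript span, the reference span, and the single '=' operation all describe the same 410 bases.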
-
get_tx_for_gene(gene)
Return transcript info records for the supplied gene, in order of decreasing length.
Parameters: gene (str) – HGNC gene name
-
get_tx_for_region(alt_ac, alt_aln_method, start_i, end_i)
Return transcripts that overlap the given region.
-
get_tx_identity_info(tx_ac)
Returns features associated with a single transcript.
Parameters: tx_ac (str) – transcript accession with version (e.g., 'NM_199425.2')
# database output
-[ RECORD 1 ]--+-------------
tx_ac          | NM_199425.2
alt_ac         | NM_199425.2
alt_aln_method | transcript
cds_start_i    | 283
cds_end_i      | 1003
lengths        | {707,79,410}
hgnc           | VSX1
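The fields in this record can be combined to derive exon boundaries and the CDS length in transcript coordinates. The sketch below copies the values from the record above into a plain dict; it assumes that lengths lists exon lengths in transcript order and that the _i coordinates are interbase (zero-based, end-exclusive).

```python
from itertools import accumulate

# Values copied from the database output above.
tx = {
    "tx_ac": "NM_199425.2",
    "cds_start_i": 283,
    "cds_end_i": 1003,
    "lengths": [707, 79, 410],
}

# Cumulative sums of exon lengths give exon ends in transcript coordinates.
ends = list(accumulate(tx["lengths"]))
starts = [0] + ends[:-1]
exon_bounds = list(zip(starts, ends))

tx_length = ends[-1]
cds_length = tx["cds_end_i"] - tx["cds_start_i"]

print(exon_bounds)
print(tx_length, cds_length, cds_length % 3 == 0)
```

Under these assumptions the last exon spans 786 to 1196, which agrees with the tx_start_i and tx_end_i values shown for get_tx_exons.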
-
get_tx_info(tx_ac, alt_ac, alt_aln_method)
Return a single transcript info record for the supplied accession (tx_ac, alt_ac, alt_aln_method), or None if not found.
# database output
-[ RECORD 1 ]--+------------
hgnc           | ATM
cds_start_i    | 385
cds_end_i      | 9556
tx_ac          | NM_000051.3
alt_ac         | AC_000143.1
alt_aln_method | splign
-
get_tx_mapping_options(tx_ac)
Return all transcript alignment sets for a given transcript accession (tx_ac); returns an empty list if the transcript does not exist. Use this method to discover the mapping options supported in the database.
Parameters: tx_ac (str) – transcript accession with version (e.g., 'NM_000051.3')
# database output
-[ RECORD 1 ]--+------------
hgnc           | ATM
cds_start_i    | 385
cds_end_i      | 9556
tx_ac          | NM_000051.3
alt_ac         | AC_000143.1
alt_aln_method | splign
-[ RECORD 2 ]--+------------
hgnc           | ATM
cds_start_i    | 385
cds_end_i      | 9556
tx_ac          | NM_000051.3
alt_ac         | NC_000011.9
alt_aln_method | blat
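Once the mapping options are known, a caller typically has to pick one alignment to work with. The following sketch shows one possible selection policy over records shaped like the output above; the function name, the NC_ prefix filter, and the method preference order are assumptions made for illustration, not behavior of the hgvs package.

```python
# Records mirroring the two rows of database output above.
options = [
    {"tx_ac": "NM_000051.3", "alt_ac": "AC_000143.1", "alt_aln_method": "splign"},
    {"tx_ac": "NM_000051.3", "alt_ac": "NC_000011.9", "alt_aln_method": "blat"},
]


def choose_alignment(options, alt_ac_prefix="NC_",
                     preferred_methods=("splign", "blat")):
    """Pick the first option on a matching reference, by method preference."""
    candidates = [o for o in options if o["alt_ac"].startswith(alt_ac_prefix)]
    for method in preferred_methods:
        for option in candidates:
            if option["alt_aln_method"] == method:
                return option
    return None  # nothing matched, including the empty-list case


chosen = choose_alignment(options)
print(chosen["alt_ac"], chosen["alt_aln_method"])
```

Returning None for an empty list keeps the caller's handling of an unknown transcript explicit.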
-
required_version = '1.1'
-
-
class hgvs.dataproviders.uta.UTA_postgresql(url, pooling=False, application_name=None, mode=None, cache=None)
-
hgvs.dataproviders.uta.connect(db_url=None, pooling=False, application_name=None, mode=None, cache=None)
Connect to a UTA database instance and return a UTA interface instance.
When called with an explicit db_url argument, that db_url is used for connecting.
When called without an explicit argument, the function default is determined by the environment variable UTA_DB_URL if it exists, or hgvs.datainterface.uta.public_db_url otherwise.
>>> hdp = connect()
>>> hdp.schema_version()
'1.1'
The format of the db_url is driver://user:pass@host/database/schema (the same as that used by SQLAlchemy). Examples:
- A remote public postgresql database:
- postgresql://anonymous:anonymous@uta.biocommons.org/uta/uta_20170707
- A local postgresql database:
- postgresql://localhost/uta_dev/uta_20170707
For postgresql db_urls, pooling=True causes connect to use a psycopg2.pool.ThreadedConnectionPool.
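The precedence described for db_url (explicit argument, then the UTA_DB_URL environment variable, then a public default) can be sketched without a database connection. The function name resolve_db_url is hypothetical, and the default used here is simply the public example URL from the list above, taken as an assumption rather than the package's configured value.

```python
import os

# Assumption: the public example URL above stands in for the default.
ASSUMED_PUBLIC_DB_URL = (
    "postgresql://anonymous:anonymous@uta.biocommons.org/uta/uta_20170707"
)


def resolve_db_url(db_url=None, environ=None):
    """Return the URL to connect to, following the documented precedence."""
    environ = os.environ if environ is None else environ
    if db_url is not None:
        return db_url  # an explicit argument always wins
    # Otherwise use UTA_DB_URL if set, else the assumed public default.
    return environ.get("UTA_DB_URL", ASSUMED_PUBLIC_DB_URL)


local = "postgresql://localhost/uta_dev/uta_20170707"
print(resolve_db_url(local, environ={}))
print(resolve_db_url(environ={"UTA_DB_URL": local}))
print(resolve_db_url(environ={}))
```

Passing environ explicitly keeps the sketch testable without modifying the process environment; with hgvs installed, the resolved value would be what connect(db_url=...) receives.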