ensembl-rest
A Python interface to the Ensembl REST APIs. A whole world of biological data at your fingertips.
Examples
The library exports methods that point to each endpoint of the API, such as:
>>> import ensembl_rest >>> ensembl_rest . symbol_lookup ( species = 'homo sapiens' , symbol = 'BRCA2' )
{ 'species': 'human', 'object_type': 'Gene', 'description': 'BRCA2, DNA repair associated [Source:HGNC Symbol;Acc:HGNC:1101]', 'assembly_name': 'GRCh38', 'end': 32400266, ... ... ... 'seq_region_name': '13', 'strand': 1, 'id': 'ENSG00000139618', 'start': 32315474}
All the endpoints are listed on the API website. A quick lookup of the methods can be obtained by calling help on the module:
>>> help ( ensembl_rest )
If you want to use an endpoint from the ones enlisted in the API website, say GET lookup/symbol/:species/:symbol , then the name of the corresponding method is in the endpoint documentation URL, in this case, the documentation links to http://rest.ensembl.org/documentation/info/symbol_lookup so the corresponding method name is symbol_lookup.
>>> help ( ensembl_rest . symbol_lookup )
Help on function symbol_lookup in module ensembl_rest: symbol_lookup(*args, **kwargs) Lookup ``GET lookup/symbol/:species/:symbol`` Find the species and database for a symbol in a linked external database **Parameters** - Required: + **Name**: species + *Type*: String + *Description*: Species name/alias + *Default*: - + *Example Values*: homo_sapiens, human ... ... - Optional: + **Name**: expand + *Type*: Boolean(0,1) + *Description*: Expands the search to include any connected features. e.g. If the object is a gene, its transcripts, translations and exons will be returned as well. ... ... **Resource info** - **Methods**: GET - **Response formats**: json, xml, jsonp **More info** https://rest.ensembl.org/documentation/info/symbol_lookup
We can see from the resource string GET lookup/symbol/:species/:symbol that this method contains 2 parameters called species and symbol, so we can call the method in the following way:
>>> ensembl_rest . symbol_lookup ( species = 'homo sapiens' , symbol = 'TP53' ) # Or like this... >>> ensembl_rest . symbol_lookup ( 'homo sapiens' , 'TP53' )
{'source': 'ensembl_havana', 'object_type': 'Gene', 'logic_name': 'ensembl_havana_gene', ... ... ... 'start': 32315474}
One can provide optional parameters with the params keyword (the specific parameters to pass depend on the specific endpoint, the official endpoints documentation can be found here)_:
# Fetch also exons, transcripts, etc... >>> ensembl_rest . symbol_lookup ( 'human' , 'BRCA2' , params = { 'expand' : True })
{'source': 'ensembl_havana', 'seq_region_name': '13', 'Transcript': [{'source': 'ensembl_havana', 'object_type': 'Transcript', 'logic_name': 'ensembl_havana_transcript', 'Exon': [{'object_type': 'Exon', 'version': 4, 'species': 'human', 'assembly_name': 'GRCh38', ... ... ... 'biotype': 'protein_coding', 'start': 32315474}
The parameters for the POST endpoints are also provided via the params keyword , such as in the next example:
>>> ensembl_rest . symbol_post ( species = 'human' , params = { 'symbols' : [ "BRCA2" , "TP53" , "BRAF" ]})
{ "BRCA2": { "source": "ensembl_havana", "object_type": "Gene", "logic_name": "ensembl_havana_gene", "description": "BRCA2, DNA repair associated [Source:HGNC Symbol;Acc:HGNC:1101]", ... ... }, "TP53": { ... ... }. "BRAF": { ... ... "strand": -1, "id": "ENSG00000157764", "start": 140719327 } }
Another common usage is to fetch sequences of known genes:
>>> ensembl_rest . sequence_id ( 'ENSG00000157764' )
{'desc': 'chromosome:GRCh38:7:140719327:140924928:-1', 'query': 'ENSG00000157764', 'version': 13, 'id': 'ENSG00000157764', 'seq': 'TTCCCCCAATCCCCTCAGGCTCGG...ATTGACTGCATGGAGAAGTCTTCA', 'molecule': 'dna'}
if you want it in FASTA, you can modify the headers:
>>> ensembl_rest . sequence_id ( 'ENSG00000157764' , headers = { 'content-type' : 'text/x-fasta' })
>ENSG00000157764.13 chromosome:GRCh38:7:140719327:140924928:-1 TTCCCCCAATCCCCTCAGGCTCGGCTGCGCCCGGGGCCGCGGGCCGGTACCTGAGGTGGC CCAGGCGCCCTCCGCCCGCGGCGCCGCCCGGGCCGCTCCTCCCCGCGCCCCCCGCGCCCC CCGCTCCTCCGCCTCCGCCTCCGCCTCCGCCTCCCCCAGCTCTCCGCCTCCCTTCCCCCT ...
Notice that, if left unchanged, the methods ask for data in dictionary (JSON) format so that they are easy to use. If the response cannot be decoded as such, then it is returned as plain text, such as the above.
You can also map betweeen assemblies
>>> ensembl_rest . assembly_map ( species = 'human' , asm_one = 'GRCh37' , region = 'X:1000000..1000100:1' , asm_two = 'GRCh38' ) # Or... >>> region_str = ensembl_rest . region_str ( chrom = 'X' , start = 1000000 , end = 1000100 ) >>> ensembl_rest . assembly_map ( species = 'human' , asm_one = 'GRCh37' , region = region_str , asm_two = 'GRCh38' )
{'mappings': [{'original': {'seq_region_name': 'X', 'strand': 1, 'coord_system': 'chromosome', 'end': 1000100, 'start': 1000000, 'assembly': 'GRCh37'}, 'mapped': {'seq_region_name': 'X', 'strand': 1, 'coord_system': 'chromosome', 'end': 1039365, 'start': 1039265, 'assembly': 'GRCh38'}}]}
The above problem (mapping from one assembly to another) is so frequent that the library provides a specialized class AssemblyMapper to efficiently mapping large amounts of regions between assemblies. This class avoids the time-consuming task of making a web request every time a mapping is needed by fetching the mapping of the whole assembly right from the instantiation. This is a time-consuming operation by itself, but it pays off when one has to transform repeatedly betweeen assemblies.:
>>> mapper = ensembl_rest.AssemblyMapper( species='human', from_assembly='GRCh37', to_assembly='GRCh38' ) >>> mapper.map(chrom='1', pos=1000000) 1064620
You can also find orthologs, paralogs and gene tree information, along with variation data and basically everything Ensembl has to offer.
If you want to instantiate your own client, you can do it by using the ensembl_rest.EnsemblClient class, this class is the one that contains all the endpoint methods.
>>> client = ensembl_rest . EnsemblClient () >>> client . symbol_lookup ( 'homo sapiens' , 'TP53' )
{'source': 'ensembl_havana', 'object_type': 'Gene', 'logic_name': 'ensembl_havana_gene', 'version': 14, 'species': 'human', ... ... ...}
Finally, the library exposes the class ensembl_rest.HTTPError that allows to handle errors in the requests. An example of its utility is when using the GET genetree/member/symbol/:species/:symbol endpoint to query for gene trees in order to find ortholog and paralog proteins and genes. This endpoint returns an HTTP error when a gene tree is not found with code 400 and the error message Lookup found nothing. We can use this information to detect the error and handle it, or to simply ignore it if we expected it:
Examples
The library exports methods that point to each endpoint of the API, such as:
>>> import ensembl_rest >>> ensembl_rest . symbol_lookup ( species = 'homo sapiens' , symbol = 'BRCA2' )
{ 'species': 'human', 'object_type': 'Gene', 'description': 'BRCA2, DNA repair associated [Source:HGNC Symbol;Acc:HGNC:1101]', 'assembly_name': 'GRCh38', 'end': 32400266, ... ... ... 'seq_region_name': '13', 'strand': 1, 'id': 'ENSG00000139618', 'start': 32315474}
All the endpoints are listed on the API website. A quick lookup of the methods can be obtained by calling help on the module:
>>> help ( ensembl_rest )
If you want to use an endpoint from the ones enlisted in the API website, say GET lookup/symbol/:species/:symbol , then the name of the corresponding method is in the endpoint documentation URL, in this case, the documentation links to http://rest.ensembl.org/documentation/info/symbol_lookup so the corresponding method name is symbol_lookup.
>>> help ( ensembl_rest . symbol_lookup )
Help on function symbol_lookup in module ensembl_rest: symbol_lookup(*args, **kwargs) Lookup ``GET lookup/symbol/:species/:symbol`` Find the species and database for a symbol in a linked external database **Parameters** - Required: + **Name**: species + *Type*: String + *Description*: Species name/alias + *Default*: - + *Example Values*: homo_sapiens, human ... ... - Optional: + **Name**: expand + *Type*: Boolean(0,1) + *Description*: Expands the search to include any connected features. e.g. If the object is a gene, its transcripts, translations and exons will be returned as well. ... ... **Resource info** - **Methods**: GET - **Response formats**: json, xml, jsonp **More info** https://rest.ensembl.org/documentation/info/symbol_lookup
We can see from the resource string GET lookup/symbol/:species/:symbol that this method contains 2 parameters called species and symbol, so we can call the method in the following way:
>>> ensembl_rest . symbol_lookup ( species = 'homo sapiens' , symbol = 'TP53' ) # Or like this... >>> ensembl_rest . symbol_lookup ( 'homo sapiens' , 'TP53' )
{'source': 'ensembl_havana', 'object_type': 'Gene', 'logic_name': 'ensembl_havana_gene', ... ... ... 'start': 32315474}
One can provide optional parameters with the params keyword (the specific parameters to pass depend on the specific endpoint, the official endpoints documentation can be found here)_:
# Fetch also exons, transcripts, etc... >>> ensembl_rest . symbol_lookup ( 'human' , 'BRCA2' , params = { 'expand' : True })
{'source': 'ensembl_havana', 'seq_region_name': '13', 'Transcript': [{'source': 'ensembl_havana', 'object_type': 'Transcript', 'logic_name': 'ensembl_havana_transcript', 'Exon': [{'object_type': 'Exon', 'version': 4, 'species': 'human', 'assembly_name': 'GRCh38', ... ... ... 'biotype': 'protein_coding', 'start': 32315474}
The parameters for the POST endpoints are also provided via the params keyword , such as in the next example:
>>> ensembl_rest . symbol_post ( species = 'human' , params = { 'symbols' : [ "BRCA2" , "TP53" , "BRAF" ]})
{ "BRCA2": { "source": "ensembl_havana", "object_type": "Gene", "logic_name": "ensembl_havana_gene", "description": "BRCA2, DNA repair associated [Source:HGNC Symbol;Acc:HGNC:1101]", ... ... }, "TP53": { ... ... }. "BRAF": { ... ... "strand": -1, "id": "ENSG00000157764", "start": 140719327 } }
Another common usage is to fetch sequences of known genes:
>>> ensembl_rest . sequence_id ( 'ENSG00000157764' )
{'desc': 'chromosome:GRCh38:7:140719327:140924928:-1', 'query': 'ENSG00000157764', 'version': 13, 'id': 'ENSG00000157764', 'seq': 'TTCCCCCAATCCCCTCAGGCTCGG...ATTGACTGCATGGAGAAGTCTTCA', 'molecule': 'dna'}
if you want it in FASTA, you can modify the headers:
>>> ensembl_rest . sequence_id ( 'ENSG00000157764' , headers = { 'content-type' : 'text/x-fasta' })
>ENSG00000157764.13 chromosome:GRCh38:7:140719327:140924928:-1 TTCCCCCAATCCCCTCAGGCTCGGCTGCGCCCGGGGCCGCGGGCCGGTACCTGAGGTGGC CCAGGCGCCCTCCGCCCGCGGCGCCGCCCGGGCCGCTCCTCCCCGCGCCCCCCGCGCCCC CCGCTCCTCCGCCTCCGCCTCCGCCTCCGCCTCCCCCAGCTCTCCGCCTCCCTTCCCCCT ...
Notice that, if left unchanged, the methods ask for data in dictionary (JSON) format so that they are easy to use. If the response cannot be decoded as such, then it is returned as plain text, such as the above.
You can also map betweeen assemblies
>>> ensembl_rest . assembly_map ( species = 'human' , asm_one = 'GRCh37' , region = 'X:1000000..1000100:1' , asm_two = 'GRCh38' ) # Or... >>> region_str = ensembl_rest . region_str ( chrom = 'X' , start = 1000000 , end = 1000100 ) >>> ensembl_rest . assembly_map ( species = 'human' , asm_one = 'GRCh37' , region = region_str , asm_two = 'GRCh38' )
{'mappings': [{'original': {'seq_region_name': 'X', 'strand': 1, 'coord_system': 'chromosome', 'end': 1000100, 'start': 1000000, 'assembly': 'GRCh37'}, 'mapped': {'seq_region_name': 'X', 'strand': 1, 'coord_system': 'chromosome', 'end': 1039365, 'start': 1039265, 'assembly': 'GRCh38'}}]}
The above problem (mapping from one assembly to another) is so frequent that the library provides a specialized class AssemblyMapper to efficiently mapping large amounts of regions between assemblies. This class avoids the time-consuming task of making a web request every time a mapping is needed by fetching the mapping of the whole assembly right from the instantiation. This is a time-consuming operation by itself, but it pays off when one has to transform repeatedly betweeen assemblies.:
>>> mapper = ensembl_rest.AssemblyMapper( species='human', from_assembly='GRCh37', to_assembly='GRCh38' ) >>> mapper.map(chrom='1', pos=1000000) 1064620
You can also find orthologs, paralogs and gene tree information, along with variation data and basically everything Ensembl has to offer.
If you want to instantiate your own client, you can do it by using the ensembl_rest.EnsemblClient class, this class is the one that contains all the endpoint methods.
>>> client = ensembl_rest . EnsemblClient () >>> client . symbol_lookup ( 'homo sapiens' , 'TP53' )
{'source': 'ensembl_havana', 'object_type': 'Gene', 'logic_name': 'ensembl_havana_gene', 'version': 14, 'species': 'human', ... ... ...}
Finally, the library exposes the class ensembl_rest.HTTPError that allows to handle errors in the requests. An example of its utility is when using the GET genetree/member/symbol/:species/:symbol endpoint to query for gene trees in order to find ortholog and paralog proteins and genes. This endpoint returns an HTTP error when a gene tree is not found with code 400 and the error message Lookup found nothing. We can use this information to detect the error and handle it, or to simply ignore it if we expected it:
Comentarios
Publicar un comentario