Retrieving genome ID from protein object throws 414

I am iterating over 2.5K protein objects and somehow retrieving the genome identifier throws a 414 error:

----> 9                             protein.genomes().results()[0].id(),
    
     75         # TODO: OmxResponse should be able to handle the response.status_code as well
     76         if response.status_code < 200 or response.status_code >= 300:
---> 77             raise self._process_http_response(response)
     78 
     79         r = OmxResponse.OmxResponse(self, response)

ServiceException: <html>
<head><title>414 Request-URI Too Large</title></head>
<body>
<center><h1>414 Request-URI Too Large</h1></center>
<hr><center>nginx</center>
</body>
</html>

I was trying to write the genome ID that I can see in the ‘genome’ column of my DataFrame object to a CSV file. Perhaps I’m using the wrong syntax to do this?

Hi @isaphan,

Please update your version of the omxware Python SDK to 0.1.38. We have fixed the issue. Thank you for using our platform.

Regards,

Ed

Hello @eseabolt

Thank you for the quick fix. It almost worked, but adding the genome ID to the result seems to have slowed things down dramatically: I only got through 7% of the results in over 20 min, whereas omitting the genome ID brings the download time down to < 5 s.

stack trace:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
in
7 filewriter.writerow( [protein.id(),
8 protein.name(),
----> 9 protein.genomes().results()[0].id(),
10 protein.sequence()] )

IndexError: list index out of range

I am confused: do not all proteins have a genome ID?
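
For context on the traceback above: the IndexError means that `results()` returned an empty list for at least one protein, so `[0]` had nothing to index. A minimal defensive sketch follows. The helper name `first_or_default` is hypothetical (not part of the omxware SDK), and the lists are stand-in data rather than real query results.

```python
def first_or_default(items, default=None):
    """Return the first element of a sequence, or `default` if it is empty."""
    return items[0] if items else default


# Stand-in data: the second "protein" has no associated genomes.
genome_lists = [["EPI_ISL_417319"], []]

# An empty list yields "" instead of raising IndexError.
first_ids = [first_or_default(ids, default="") for ids in genome_lists]
print(first_ids)
```

This only avoids the crash; it does not explain why a given protein comes back without genomes, and it does not address the slowdown.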

@isaphan would you mind sharing a code snippet – to reproduce the error ?

@harsha see below. Output is immediate < 1s if the genome ID call is left out. I think I am missing a simple to_s() method for retrieving the first genome ID from the results, that does not require doing an actual call to the underlying datastore.

results_list = []
for counter in range(1, pages + 1):
    query = omx.proteins(protein_name=all,
                           page_size=25,
                           classification='virus',
                           collection = ['covid19'],
                           page_number=counter)
    results = query.results(type='list')
    for item in results:
        results_list.append(item)

import sys
import csv
with open('sc2_proteins.csv', 'w', newline='\n') as csvfile:
    filewriter = csv.writer(csvfile, delimiter=',')
    filewriter.writerow(['id', 'name',  'genome', 'sequence'])
    for protein in results_list:
        filewriter.writerow( [protein.id(), 
                            protein.name(), 
                            protein.genomes().results()[0].id(),
                            protein.sequence()] )

@isaphan – did you really mean protein_name = ‘all’ ?

Depending on which proteins you are querying, the execution time may vary. Some proteins are found in more than 35,000 genomes.

The genome object in the proteins object appears in the results dataframe as one single (random?) genome ID, e.g. [‘EPI_ISL_417319’]. Is there a way to retrieve just that one ID?

If genome.id() is all that you need, then try this:

genes = omx.genes(ids=['669ade328df11dfffc3f082dd3811aea','d81d5314d04c0f44c2319d7fdb0adc83'])
genes = genes.results()

# Print every genome ID attached to the first gene.
gene = genes[0]
for genome in gene._genomes:
    print(genome.id())

# Same for the second gene, prefixed with its index.
gene = genes[1]
for idx, genome in enumerate(gene._genomes):
    print(str(idx) + ": " + genome.id())

you can do the same for proteins as well…
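
As a rough sketch of what that could look like for the CSV export earlier in the thread: this assumes protein objects expose the same `_genomes` attribute as genes, which is not confirmed here beyond the remark above. `FakeGenome`, `FakeProtein`, and `write_proteins` are hypothetical stand-ins so the example runs without the SDK. With real data you would pass in `results_list` instead.

```python
import csv
import io


class FakeGenome:
    """Stand-in for an SDK genome object; only id() is modelled."""

    def __init__(self, genome_id):
        self._id = genome_id

    def id(self):
        return self._id


class FakeProtein:
    """Stand-in for an SDK protein object (assumed to carry `_genomes`)."""

    def __init__(self, protein_id, name, sequence, genomes):
        self._id = protein_id
        self._name = name
        self._sequence = sequence
        self._genomes = genomes

    def id(self):
        return self._id

    def name(self):
        return self._name

    def sequence(self):
        return self._sequence


def write_proteins(proteins, handle):
    """Write one CSV row per protein, using the first attached genome ID."""
    writer = csv.writer(handle)
    writer.writerow(["id", "name", "genome", "sequence"])
    for protein in proteins:
        genomes = protein._genomes
        # Leave the column empty when no genome is attached.
        genome_id = genomes[0].id() if genomes else ""
        writer.writerow([protein.id(), protein.name(), genome_id, protein.sequence()])


proteins = [
    FakeProtein("p1", "spike", "MFVF", [FakeGenome("EPI_ISL_417319")]),
    FakeProtein("p2", "orphan", "MKTA", []),
]
buffer = io.StringIO()
write_proteins(proteins, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

When writing to a real file, the `csv` module documentation recommends opening it with `newline=''`.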

Thanks
Harsha

Yes, that did the trick. Thanks a million @harsha !

Happy to help! Good luck.

-Harsha