How to filter by host?

How do I filter protein results by host?

The covid19 collection contains items like
protein id 5f90d25737b98547b6fb9f393a028a27

A blastp search against NCBI shows this protein comes from a pangolin CoV, not from a human SARS-CoV-2 isolate. According to NCBI, no human isolate is identical to this protein: 298359177 - Identical Protein Groups - NCBI

Hi @isaphan,

Unfortunately, we don’t have a filter option for that, as our platform focuses on microbes regardless of their source. From a cursory glance, the SARS-CoV-2 genomes we processed from GISAID seem to be mostly human. I don’t have an exact count, but I would assume that non-human genomes from GISAID EpiCoV are far fewer in number. For the sequence you provided, the genomes are in fact from pangolin, as you say:

EPI_ISL_410541
EPI_ISL_410540
EPI_ISL_410542
EPI_ISL_410538

This paper also provides organism information on some of the genomes from GISAID: https://arxiv.org/ftp/arxiv/papers/2003/2003.05580.pdf. If it would be helpful, and you are mostly working with the GISAID genomes we have on hand, I could provide a list of accessions that are not human.

Regards,

Ed

Thanks for the offer Ed, that would be great. I’m clustering right now to filter out the ‘noise’, but it’s a best guess and a list of IDs would be most helpful to check that I am not losing any real hCoV data.

On a related note, can you confirm that sequences that contain a ‘*’ are pseudogenes?

Just went on GISAID and extracted all non-human genome IDs.

Hi @isaphan,

You beat me to it then. I just got them downloaded and was about to prepare the list. Do you have what you need now?

Regards,

Ed

I have a list of 19 genome IDs that are from non-human CoVs. Is that what you have?

My list has 15, but I downloaded them last night. I also filter by complete genome.

OK, can you send me your list?

Sorry for the delay. Here’s the list:

EPI_ISL_402131
EPI_ISL_408514
EPI_ISL_408515
EPI_ISL_410538
EPI_ISL_410539
EPI_ISL_410540
EPI_ISL_410541
EPI_ISL_410542
EPI_ISL_410543
EPI_ISL_410544
EPI_ISL_410721
EPI_ISL_422402
EPI_ISL_422403
EPI_ISL_422404
EPI_ISL_422405

I am confused, my list does not match. How are you doing these searches?

When I search GISAID through their web portal, the last 4 genomes come up as host = human, location = Africa / Ghana:
EPI_ISL_422402
EPI_ISL_422403
EPI_ISL_422404
EPI_ISL_422405
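
For anyone following along, a quick way to take the accessions flagged as human out of a candidate list is a set difference. This is only a minimal sketch: the variable names are made up for illustration, and the IDs are the ones posted in this thread.

```python
# Accessions from the 15-entry list posted above.
reported_non_human = {
    "EPI_ISL_402131", "EPI_ISL_408514", "EPI_ISL_408515", "EPI_ISL_410538",
    "EPI_ISL_410539", "EPI_ISL_410540", "EPI_ISL_410541", "EPI_ISL_410542",
    "EPI_ISL_410543", "EPI_ISL_410544", "EPI_ISL_410721", "EPI_ISL_422402",
    "EPI_ISL_422403", "EPI_ISL_422404", "EPI_ISL_422405",
}

# Accessions the GISAID web portal shows as host = human (Ghana).
flagged_human = {
    "EPI_ISL_422402", "EPI_ISL_422403", "EPI_ISL_422404", "EPI_ISL_422405",
}

# Set difference: keep only the entries that were not flagged as human.
remaining = reported_non_human - flagged_human

for accession in sorted(remaining):
    print(accession)
```

Sorting the result just makes the output easier to compare by eye against another list.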

Not sure what happened. Last night I downloaded the all-virus set and the human-only set and did a simple set difference between the two. The result was what I reported above. I downloaded the data again just now and repeated the diff; here are the results:

These are all non-human. Apologies for the confusion, not sure what happened. Do these match yours now?

All viruses (human and non-human) - 5,701 genomes
Human - 5,690 genomes

hCoV-19/pangolin/Guangxi/P5E/2017|EPI_ISL_410541|2017
hCoV-19/pangolin/Guangxi/P1E/2017|EPI_ISL_410539|2017
hCoV-19/pangolin/Guangxi/P3B/2017|EPI_ISL_410543|2017
hCoV-19/pangolin/Guangdong/P2S/2019|EPI_ISL_410544|2019
hCoV-19/pangolin/Guangxi/P2V/2017|EPI_ISL_410542|2017
hCoV-19/Wuhan/IVDC-HB-envF13-21/2020|EPI_ISL_408515|2020-01-01
hCoV-19/pangolin/Guangxi/P5L/2017|EPI_ISL_410540|2017
hCoV-19/Wuhan/IVDC-HB-envF13-20/2020|EPI_ISL_408514|2020-01-01
hCoV-19/bat/Yunnan/RaTG13/2013|EPI_ISL_402131|2013-07-24
hCoV-19/pangolin/Guangdong/1/2019|EPI_ISL_410721|2019
hCoV-19/pangolin/Guangxi/P4L/2017|EPI_ISL_410538|2017
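
The set diff described above could be sketched roughly as follows. This assumes both downloads are FASTA files whose header lines carry the accession in the `name|EPI_ISL_...|date` form shown in the list; the function names and the human placeholder record are hypothetical, and this is not necessarily how the original diff was done.

```python
import re

# Matches GISAID accessions such as EPI_ISL_410541.
ACCESSION_RE = re.compile(r"EPI_ISL_\d+")


def accessions_from_headers(lines):
    """Collect accessions from FASTA header lines (lines starting with '>')."""
    found = set()
    for line in lines:
        if line.startswith(">"):
            match = ACCESSION_RE.search(line)
            if match:
                found.add(match.group(0))
    return found


def non_human_accessions(all_lines, human_lines):
    """Accessions in the all-host download that are absent from the human-only one."""
    return accessions_from_headers(all_lines) - accessions_from_headers(human_lines)


# Tiny in-memory stand-ins for the two downloads. The human record is a
# made-up placeholder (EPI_ISL_000001), not a real GISAID entry.
all_hosts = [
    ">hCoV-19/pangolin/Guangxi/P5E/2017|EPI_ISL_410541|2017",
    "ATGC",
    ">hCoV-19/bat/Yunnan/RaTG13/2013|EPI_ISL_402131|2013-07-24",
    "ATGC",
    ">hCoV-19/Placeholder/human-1/2020|EPI_ISL_000001|2020",
    "ATGC",
]
human_only = [
    ">hCoV-19/Placeholder/human-1/2020|EPI_ISL_000001|2020",
    "ATGC",
]

non_human = non_human_accessions(all_hosts, human_only)
print(sorted(non_human))
```

With real files, the two lists would be replaced by open file handles, for example `open("all.fasta")` and `open("human.fasta")` (file names assumed). Downloading both sets in the same session avoids mismatches caused by records added between downloads.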

Ahhh, much better :grinning: