SARS-CoV-2 Spike protein start site issue

Hello omx Team,

It is critical for my study that the protein start sites are accurate.
How are you extracting the CDSs from GISAID?

I am currently stuck in my analysis because the start sites of the Spike glycoproteins appear to be systematically upstream of the manually curated UniProt reference sequence:
https://viralzone.expasy.org/resources/Coronav/P0DTC2.txt

I also checked against the NCBI sequences for Spike: they start at the next Met and are shorter than the sequence I’m getting back from omx:

IBM_7881             MFLLTTKRTMFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHST    60
QIS30525             ---------MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHST    51
sp|P0DTC2|SPIKE_WCPV ---------MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHST    51

IBM_7881 corresponds to omx record:

  • MD5/protein id = 99305308aeb13f990ac638375ba95975
  • genome = EPI_ISL_418065
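
The offset visible in the alignment above can also be checked programmatically. This is only a minimal sketch, not part of any omx or UniProt tooling: the function name `n_terminal_extension` and the 20-residue probe length are illustrative assumptions, and the sequences are just the first alignment block shown above.

```python
def n_terminal_extension(query: str, reference: str, probe_len: int = 20) -> int:
    """Return how many residues precede the reference N-terminus in `query`.

    Looks for the first `probe_len` residues of the reference inside the
    query. Returns 0 if the start sites agree, a positive number if the
    query starts upstream, and -1 if the reference N-terminus is not found.
    """
    probe = reference[:probe_len]
    return query.find(probe)


# First alignment block from the post above.
omx_spike = "MFLLTTKRTMFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHST"
ref_spike = "MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHST"

extension = n_terminal_extension(omx_spike, ref_spike)
print(extension)  # 9 extra residues upstream of the curated start
```

An exact-match probe is deliberately strict; a sequence with a substitution in its first 20 residues would return -1 and need a proper alignment instead.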

Please can you comment?
Thanks!

Hi Isabelle,
The genes and proteins are called using Prokka. Detailed methods for our annotation process relating to bacterial sequences can be found in our pre-print on arXiv ([1911.02095] IBM Functional Genomics Platform, A Cloud-Based Platform for Studying Microbial Life at Scale), but I’ll also add relevant details here too.

As a modification to Prokka’s base capability, we update the default reference more frequently than their release cycle so that we always use the latest data available. We still apply the same QC requirements to the reference data: we restrict it to UniProtKB (SwissProt) entries, remove fragment entries, and require experimental evidence at the mRNA or proteomics level. Prokka’s default reference does not yet include any COVID-19 data, so we pull the COVID-19 pre-release data directly from UniProt (last retrieved April 1st) and use it to supplement our gene and protein calling.

We’re also actively updating the COVID-19 reference data as it becomes available. As I’m sure you understand, this data grows very quickly. If there are other, more up-to-date references that we should consider, we’d be happy to take a look and possibly integrate them into our pipelines.

- Kristen (aka Superman)

I’m also looping in @eseabolt here too and together we can fill in any additional details you need :slight_smile:

Thanks Kristen @superman

full disclosure: I worked 7 years at Swiss-Prot.
Viral sequences are a different beast to bacteria. I would urge you to correct your SARS-CoV-2 protein boundaries against the Swiss-Prot references: SARS coronavirus 2 ~ ViralZone page

It would be an immense service to people like me to be able to download the individual proteins (non-structural and others) produced from the polyproteins after cleavage. No database to my knowledge is offering that.

But as a start, if we could just get the accurate proteins with start sites consistent with the latest Swiss-Prot annotation (Philippe LeMercier who heads ViralZone is at Swiss-Prot), that would be most helpful.

Isabelle

Nice! Glad to know an insider :slight_smile: I’ll take a look at this to see if we can update our sequences or even provide the cleaved polyprotein products.

Hello Kristen @superman,

do you have a plan to correct the protein sequences? I just checked again and the erroneous sequences are still there, e.g.
ORF1ab 5e333f345430611f1a8c347f64ff1d1d is truncated by 4564 AAs and starts at position 4565 instead of 1. I verified that the corresponding GISAID ORF1ab sequence is complete and translates correctly at full length, with a single AA mutation compared to the UniProt reference R1AB_SARS2.

I spent a week clustering and cleaning the omx protein sequences only to realize that it was a waste of time because the protein sequences are truncated, contain erroneous alternate start codons (biological nonsense for ORF1ab non-structural protein), include anti-sense translation of ORF1ab, frameshift translation of nucleoprotein and more.

I would not advertise the number of proteins; the counts are misleading.
The same goes for mutations: thorough analyses are now available that indicate that most mutations occur at a rate close to the error rate of the sequencing process.

My own simple analysis shows very clearly that the omx protein sequences are plagued with erroneous mutations due to frameshift-induced truncations: these are very clearly not real and are products of sequencing artefacts, e.g. 38376c0a518d5a6fafd10f6b55226e62. The C-term is a sequencing error. Coronavirus would not exist with a 2/3 truncation of Spike :wink:

Curating the GISAID data is a huge task, but I would venture that for IBM, performing massive alignments to MN908947 and extracting the ORFs by position from the NCBI annotation would be trivial? You could then only keep CDSs that translate to something like 95% of expected length and are > 95% identical to the ref.
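
To make the suggestion concrete, here is a rough sketch of the extract-and-filter step. It is not the omx pipeline and makes several simplifying assumptions: the genome has already been aligned to MN908947 and is given in reference coordinates (insertions dropped, deletions as `-`), the CDS coordinates come from the NCBI annotation as 1-based inclusive segments (ORF1ab would need two segments joined because of the ribosomal frameshift), and identity is counted position by position without realigning the proteins. All function names and the toy sequences are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: aa for (a, b, c), aa in zip(product(BASES, repeat=3), AMINO)
}


def extract_cds(aligned_genome, segments):
    """Cut a CDS out of a genome given in reference coordinates.

    `segments` is a list of 1-based inclusive (start, end) pairs.
    Gap characters from the alignment are removed before translation.
    """
    pieces = [aligned_genome[start - 1:end] for start, end in segments]
    return "".join(pieces).replace("-", "")


def translate(nt):
    """Translate up to the first stop codon; unknown codons become 'X'."""
    protein = []
    for i in range(0, len(nt) - 2, 3):
        aa = CODON_TABLE.get(nt[i:i + 3].upper(), "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def passes_qc(protein, ref_protein, min_len_frac=0.95, min_identity=0.95):
    """Keep proteins that are long enough and similar enough to the reference."""
    if len(protein) < min_len_frac * len(ref_protein):
        return False
    matches = sum(a == b for a, b in zip(protein, ref_protein))
    return matches / len(ref_protein) >= min_identity


# Toy example: a 3-codon "gene" at positions 4..15 of a 17-nt "genome".
toy_ref = "GGGATGTTTGTTTAACC"
toy_truncated = "GGGATGTAAGTTTAACC"  # premature stop after the start codon
toy_segments = [(4, 15)]

ref_protein = translate(extract_cds(toy_ref, toy_segments))
bad_protein = translate(extract_cds(toy_truncated, toy_segments))
```

With real data the position-wise identity would be too strict for in-frame indels, so a pairwise protein alignment would be the more careful choice for that step.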

I still think you are offering a very valuable service and would be happy to help (say, if you were to offer me extended access :wink: ).

Cheers,
Isabelle

Hi Isabelle,
Thank you so much for raising this. We most certainly want to generate the most accurate data possible. I will reach out to you directly so we can look into these issues further. It would be great to learn more of your findings and how we can correct these truncations due to sequencing errors or other causes. We’re most certainly open to collaborating further :slight_smile:

And yes, you’re right. We most certainly could complete an alignment to the reference sequence and provide the resulting CDS as well.

Best regards and I look forward to talking with you soon,
Kristen

Hi Isabelle,

We wanted to update you on our progress since we spoke last. We have identified the sources causing the truncations you observed. Some of it was related to input genome quality. We have made a few modifications in the inclusion criteria for our genome quality and our results related to truncated products have significantly improved.

Also, while working on these improvements, we added two updated capabilities to our release: the latest UniProt SARS-CoV-2 reference proteins, updated as of 4/29/2020, and a special release of InterProScan for SARS-CoV-2. All of the data on the platform is being re-processed with these improvements, and the release will be rolled out within a week. Thank you for your helpful feedback. We will follow up with full release notes with additional details once these changes are live on the site.

- The Functional Genomics Team

Hello Karen,
thank you for the updates and apologies for the delay in responding while I was working with the team on getting our manuscript submitted. Is there anything in particular that you would like me to look at?
Best regards,
Isabelle