Hello Kristen @superman,
do you have a plan to correct the protein sequences? I just checked again and the erroneous sequences are still there, e.g.
ORF1ab 5e333f345430611f1a8c347f64ff1d1d is truncated by 4564 AAs and starts at position 4565 instead of 1. I verified that the corresponding GISAID ORF1ab sequence is complete and translates correctly at full length, with a single AA mutation compared to the UniProt reference R1AB_SARS2.
I spent a week clustering and cleaning the omx protein sequences, only to realize that it was a waste of time: the protein sequences are truncated, contain erroneous alternate start codons (biological nonsense for the ORF1ab non-structural protein), and include anti-sense translations of ORF1ab, frameshifted translations of the nucleoprotein, and more.
I would not advertise the number of proteins; it is misleading.
The same goes for mutations: thorough analyses are now available indicating that most mutations occur at frequencies close to the error rate of the sequencing process.
My own simple analysis shows very clearly that the omx protein sequences are plagued with erroneous mutations due to frameshift-induced truncations. These are clearly not real but products of sequencing artefacts, e.g. 38376c0a518d5a6fafd10f6b55226e62: its C-terminus is a sequencing error. A coronavirus could not exist with a 2/3 truncation of Spike.
Curating the GISAID data is a huge task, but I would venture that for IBM, performing massive alignments to MN908947 and extracting the ORFs by position from the NCBI annotation would be trivial. You could then keep only the CDSs that translate to something like 95% of the expected length and are > 95% identical to the reference.
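To make the suggested filter concrete, here is a minimal sketch of what I have in mind. It is not a description of any existing pipeline. It assumes each genome has already been aligned to MN908947 so that reference coordinates apply directly, and it compares proteins position by position (no indel handling). The function names, the `segments` argument and the default thresholds are hypothetical, chosen only for illustration.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def extract_cds(genome, segments):
    """Cut a CDS out of a reference-aligned genome.

    `segments` is a list of 1-based, inclusive (start, end) pairs, so a CDS
    annotated as a join of several ranges can be passed as several segments.
    """
    return "".join(genome[start - 1:end] for start, end in segments)


def translate(cds):
    """Translate in frame 1 up to (not including) the first stop codon.

    Codons containing ambiguous bases (e.g. N) are reported as 'X'.
    """
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3].upper(), "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def identity(candidate, reference):
    """Fraction of reference positions matched, compared position by position.

    Assumption: no insertions or deletions relative to the reference.
    """
    matches = sum(a == b for a, b in zip(candidate, reference))
    return matches / len(reference)


def passes_qc(candidate, reference, min_len_frac=0.95, min_identity=0.95):
    """Keep a protein only if it is long enough and close enough to the reference."""
    if len(candidate) < min_len_frac * len(reference):
        return False
    return identity(candidate, reference) > min_identity
```

With this kind of filter, a translation that stops after a third of the expected length (such as the truncated Spike above) is rejected on the length criterion alone, while a full-length sequence with a single AA substitution is kept. ORF1ab would need its two annotated ranges passed as separate segments because of the ribosomal frameshift.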
I still think you are offering a very valuable service, and I would be happy to help (say, if you were to offer me extended access).