Keeping IBM Functional Genomics Platform fresh and up-to-date

Keeping pace with the vast amount of genomic information submitted to the
National Center for Biotechnology Information (NCBI) every day is a daunting
task. Depending on requirements, genome sequence data may need a number of
preprocessing steps before it can even be used for analysis. That is why the
IBM Functional Genomics Platform team has devised a new monitoring service that
continually updates our database with the latest genome assemblies and sequence
data from NCBI. This keeps IBM Functional Genomics Platform’s database fresh and
lets our users focus on their research rather than on the preprocessing steps
required for assembly and annotation.

IBM Functional Genomics Platform monitors seven key NCBI databases:

  1. BioProject
  2. BioSample
  3. GenBank
  4. Pathogen
  5. RefSeq
  6. Sequence Read Archive (SRA)
  7. Taxonomy

IBM Functional Genomics Platform uses the information in BioProject and BioSample
to supplement and enrich our search capabilities. Every genome in IBM Functional
Genomics Platform is linked to its corresponding metadata records in these
databases, which helps our users discover data for their research. GenBank,
Pathogen, RefSeq, and SRA are used to continually update IBM Functional Genomics
Platform’s microbial genomic sequence database, including genes, proteins, and
their domains. We use the latest taxonomy tree to ensure that genomic data is
associated with the most accurate taxonomic identifiers known to date.

What criteria do we use to decide whether a genome should be added or
updated in IBM Functional Genomics Platform? For genomes that are part of
GenBank, Pathogen, and RefSeq, we use the following conditions:

  1. Must be bacterial. Currently, IBM Functional Genomics Platform’s database is bacterial
    only. We use the taxonomic tree to identify bacterial lineages.
  2. Must have an assembly level of “Complete Genome”. Per NCBI, a complete genome
    provides the highest-quality data because all of the genomic data in the
    assembly is represented.

If these two conditions are satisfied, the monitor will schedule the annotation
pipeline to run for the genome. The pipeline pulls the assembled genome data down
from NCBI, determines the genes and proteins, and locates domains, phenotypes,
pathways, gene ontology terms, and so on.
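
The two conditions above can be sketched as a small predicate. This is only an
illustration, not the platform's actual code: the `AssemblyRecord` fields, the
`parents` mapping (child taxon ID to parent taxon ID), and the function names are
all hypothetical. The one fixed fact it relies on is that NCBI Taxonomy assigns
ID 2 to Bacteria.

```python
from dataclasses import dataclass

BACTERIA_TAXID = 2  # NCBI Taxonomy ID for Bacteria


@dataclass
class AssemblyRecord:
    """Hypothetical, minimal view of an assembly's metadata."""
    accession: str
    taxid: int
    assembly_level: str


def is_bacterial(taxid, parents):
    """Walk up the taxonomy tree; True if Bacteria is on the lineage.

    `parents` maps a taxon ID to its parent taxon ID (assumed layout).
    The `seen` set stops the walk at a self-parented root or a cycle.
    """
    seen = set()
    while taxid in parents and taxid not in seen:
        if taxid == BACTERIA_TAXID:
            return True
        seen.add(taxid)
        taxid = parents[taxid]
    return taxid == BACTERIA_TAXID


def should_annotate(record, parents):
    """Apply both conditions: bacterial lineage and Complete Genome level."""
    level = record.assembly_level.strip().lower()
    return is_bacterial(record.taxid, parents) and level == "complete genome"


# Toy, heavily abbreviated tree: real lineages have many intermediate ranks.
# 562 = Escherichia coli, 9606 = Homo sapiens, 1 = root (its own parent).
parents = {562: 2, 2: 1, 9606: 1, 1: 1}

# Accessions below are placeholders, not real NCBI identifiers.
print(should_annotate(AssemblyRecord("EXAMPLE_A", 562, "Complete Genome"), parents))  # True
print(should_annotate(AssemblyRecord("EXAMPLE_B", 562, "Scaffold"), parents))         # False
print(should_annotate(AssemblyRecord("EXAMPLE_C", 9606, "Complete Genome"), parents)) # False
```

Walking parent links keeps the check independent of how deep a strain sits in
the tree; a production monitor would more likely cache each taxon's lineage
than re-walk it for every record.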

For SRA, the process is a bit more complicated, as the reads from the NCBI
database must go through an assembly process first. The conditions for adding
or updating SRA data in IBM Functional Genomics Platform are as follows:

  1. Must be bacterial. The taxonomic tree is again consulted to determine this.
  2. Library strategy must be Whole Genome Sequencing (WGS).
  3. Library source must be Genomic.
  4. Library descriptor must indicate paired-end reads.
  5. Library platform must be Illumina.

If these conditions are satisfied, the monitor will schedule the assembly and
annotation pipelines to run for the genome.
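
As a rough sketch, the five SRA conditions could be checked like this. The
`SraRun` class, its field names, and both functions are hypothetical and are not
taken from the platform's code. The uppercase values (`WGS`, `GENOMIC`, `PAIRED`,
`ILLUMINA`) follow the way SRA run metadata commonly spells them, but treat the
exact strings as assumptions too.

```python
from dataclasses import dataclass


@dataclass
class SraRun:
    """Hypothetical, minimal view of an SRA run's library metadata."""
    accession: str
    is_bacterial: bool      # assumed to be resolved beforehand from the taxonomy tree
    library_strategy: str
    library_source: str
    library_layout: str
    platform: str


def failed_criteria(run):
    """Return the names of the criteria a run fails (empty list = eligible)."""
    checks = {
        "bacterial": run.is_bacterial,
        "strategy is WGS": run.library_strategy.upper() == "WGS",
        "source is genomic": run.library_source.upper() == "GENOMIC",
        "layout is paired": run.library_layout.upper() == "PAIRED",
        "platform is Illumina": run.platform.upper() == "ILLUMINA",
    }
    return [name for name, ok in checks.items() if not ok]


def should_assemble_and_annotate(run):
    """True only when every one of the five conditions holds."""
    return not failed_criteria(run)


# Accessions below are placeholders, not real SRA identifiers.
good = SraRun("EXAMPLE_RUN_1", True, "WGS", "GENOMIC", "PAIRED", "ILLUMINA")
bad = SraRun("EXAMPLE_RUN_2", True, "WGS", "GENOMIC", "SINGLE", "PACBIO_SMRT")

print(should_assemble_and_annotate(good))  # True
print(failed_criteria(bad))  # ['layout is paired', 'platform is Illumina']
```

Returning the list of failed criteria, rather than a bare boolean, makes it easy
to log why a given run was skipped.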

We hope this gives you better insight into how IBM Functional Genomics Platform’s database is continually
updated and refreshed to give you access to the best possible bacterial genomic
information for your research needs.

Feel free to post below with any questions or comments.


Excellent description @eseabolt. So excited to be able to bring in the latest data to IBM Functional Genomics Platform and watch this repository grow even bigger!