(141g) A Naive Bayesian Classifier of Gram Stain Phenotypes From Genotype Functional Roles | AIChE

A naive Bayesian classifier of Gram stain phenotypes from
genotype functional roles

Ricardo L Colasanti1,∗, Janaka N Edirisinghe1, Christopher S Henry1
1 Mathematics and Computer Science Division, Argonne National Laboratory, 9700 South
Cass Avenue, Argonne, IL 60439 USA
∗ E-mail: ric.colasanti@gmail.com

Abstract

Recent advances in genome sequencing have given us access to such vast quantities of data on the
genetic codes of bacteria that it is increasingly difficult to analyse them and derive hypotheses [3]. This,
together with the increased scope of 'omics' technologies (genomics, transcriptomics, proteomics,
metabolomics, epigenomics and metagenomics) [4], means that biological hypothesis formation must become
more automated. Metagenomics is an interesting example: much of modern microbial ecology has
moved from the Petri dish and growth media to the sequencer as its means of identification. In fact,
the term "ecosystomics" has regrettably been coined [1]. Microbial ecology is also at the heart of
modern science, with the Human Microbiome Project [5] now an important research area in human
health.

We are exploring the use of naive Bayesian classifiers as a small step in the quest for automated hypothesis
formation. We set out to predict the phenotypic behaviour of a bacterium from a list of its annotated biological
functions. As a test of our approach, we selected the prediction of a bacterium's Gram stain classification.
Because the Gram stain is a marker for peptidoglycan, which is present in the thick-layered walls of Gram+ve
bacteria, and because the cell wall defines the biochemistry of the cell, we expect to see clear functional
signatures of Gram stain classification in genome annotations.

There has been previous work applying machine learning to genomes [2]; our work differs in that we use
annotated functions as predictors instead of protein sequences. A large amount of work has been done on
encoding biological functions, giving descriptions of bacteria in terms of the roles that their proteins
perform [6]. The use of functional roles allows us to conflate numerous genome sequences into
pre-classified classes. Genomic databases, such as SEED or the DOE Knowledgebase, contain a large set of
functional roles and the sets of genomes that give rise to them. Some functional roles are common to all bacteria;
others are specific to certain types. We are using machine learning methods to find underlying patterns in the
presence or absence of functional roles, with the aim of being able to classify bacteria based on their
annotations.

For bacterial Gram phenotype classification, we have taken the same machine learning approach as that
taken by document filters. Each bacterium is represented as a document containing functional role words. Our
training data consist of bacteria classified as Gram+ve or Gram−ve. We used the relative frequency of each functional role in
Gram+ve and Gram−ve bacteria to classify unseen bacteria.
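The document analogy can be made concrete with a small sketch. The functional role names and labels below are invented for illustration and are not taken from SEED or from the training data. The sketch also assumes that each role is recorded as present or absent in a genome, which is only one possible reading of "relative frequency".

```python
from collections import Counter

# Hypothetical toy data: each genome is a "document" whose words are
# functional roles. Role names and labels are illustrative only.
training_set = [
    ({"teichoic acid biosynthesis", "dna polymerase", "sortase"}, "Gram+ve"),
    ({"dna polymerase", "sortase"}, "Gram+ve"),
    ({"lipid a biosynthesis", "dna polymerase"}, "Gram-ve"),
    ({"lipid a biosynthesis", "outer membrane assembly", "dna polymerase"}, "Gram-ve"),
]

def relative_frequencies(examples):
    """Return {label: {role: fraction of that label's genomes containing the role}}."""
    genomes_per_label = Counter(label for _, label in examples)
    role_counts = {label: Counter() for label in genomes_per_label}
    for roles, label in examples:
        # roles is a set, so each role is counted at most once per genome
        role_counts[label].update(roles)
    return {
        label: {role: n / genomes_per_label[label] for role, n in counts.items()}
        for label, counts in role_counts.items()
    }

freqs = relative_frequencies(training_set)
```

In this toy data the role "sortase" appears in every Gram+ve genome and in no Gram-ve genome, so it is the kind of word a document filter would weight heavily, whereas "dna polymerase" appears everywhere and carries no information about the class.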

The classifier was trained by first reading through the training set and creating a list of functional role
attributes. The number of times each functional role occurs in Gram+ve and in Gram−ve bacteria was counted.
These counts were used to calculate the conditional probability of each functional role given either Gram+ve or Gram−ve.
These probabilities are then used to calculate the probability of an unseen bacterium being Gram+ve or
Gram−ve. This is accomplished by combining, under Bayes' rule, the conditional probabilities of the functional
roles of the unseen bacterium to produce a relative probability of that bacterium being Gram+ve or
Gram−ve.
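The training and classification steps described above might be sketched as follows. This is a minimal illustration and not the authors' implementation. The add-one smoothing, the log-space sum, the class priors taken from training-set proportions, the decision to score only the roles present in the query genome, and all names and toy data are assumptions that the abstract does not specify.

```python
import math
from collections import Counter

def train(examples):
    """examples: list of (set_of_roles, label). Returns the counts the model needs."""
    n_genomes = Counter()   # genomes seen per label
    role_counts = {}        # label -> Counter of genomes containing each role
    vocabulary = set()      # every functional role seen in training
    for roles, label in examples:
        n_genomes[label] += 1
        role_counts.setdefault(label, Counter()).update(roles)
        vocabulary |= roles
    return n_genomes, role_counts, vocabulary

def log_scores(roles, model):
    """Unnormalised log P(label) + sum of log P(role | label) over the query's roles."""
    n_genomes, role_counts, vocabulary = model
    total = sum(n_genomes.values())
    scores = {}
    for label, n in n_genomes.items():
        score = math.log(n / total)          # prior (assumed: training proportions)
        for role in roles & vocabulary:      # roles unseen in training are ignored
            # Add-one smoothing (an assumption) avoids log(0) for roles
            # never observed with this label.
            p = (role_counts[label][role] + 1) / (n + 2)
            score += math.log(p)
        scores[label] = score
    return scores

def classify(roles, model):
    scores = log_scores(roles, model)
    return max(scores, key=scores.get)

# Hypothetical toy training set; role names are illustrative only.
toy = [
    ({"teichoic acid biosynthesis", "dna polymerase", "sortase"}, "Gram+ve"),
    ({"dna polymerase", "sortase"}, "Gram+ve"),
    ({"lipid a biosynthesis", "dna polymerase"}, "Gram-ve"),
    ({"lipid a biosynthesis", "outer membrane assembly", "dna polymerase"}, "Gram-ve"),
]
model = train(toy)
prediction = classify({"lipid a biosynthesis", "dna polymerase"}, model)
```

Summing logarithms rather than multiplying probabilities is a common choice when a genome carries many functional roles, since the product of many small probabilities can underflow floating-point precision.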

The results indicate the power of applying naive predictors to automatically classify genomes with any set of
phenotypes for which a diverse training set of genomes is available. The classifier was trained on 86 known
Gram+ve bacteria and 252 Gram−ve bacteria. After training, it was tested on 112 unseen bacteria; the classifier
exhibited an average accuracy of 0.97 and a balanced accuracy of 0.96. On average the classifier
misclassified only 3 bacteria on any single test. An examination of the list of all the bacteria that were
misclassified provided a couple of interesting insights. Members of the genus Deinococcus are very difficult
to stain with crystal violet and as such are difficult to place within the Gram classification.
Syntrophomonas wolfei was the only Gram−ve bacterium that was misclassified as Gram+ve. Interestingly,
this bacterium is unique in having a multilayered cell wall and in lacking internal membrane-bound
organelles. This is perhaps why the classifier identifies it as a Gram+ve bacterium: as we
have noted, the cell wall defines the biochemistry of the cell, and the biochemistry is a function of
the collective functional roles of the enzymes. The biochemistry of this stout-walled Gram−ve bacterium is
probably closer to that of the thick-walled Gram+ve bacteria. The reasons for the other misclassifications
are more difficult to ascertain, but may be due to imperfect annotation of the
genomes concerned. Nevertheless, we emphasize that our naive classifier, built only from an input training set of
preclassified genomes (and a database of consistent genome annotations), exposes new biology for deeper
analysis.
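For reference, the accuracy and balanced accuracy figures reported above can be computed from true and predicted labels as in the sketch below. It assumes the common definition of balanced accuracy as the mean of the per-class recalls; the abstract does not state which definition was used, and the labels in the example are made up.

```python
def accuracy(y_true, y_pred):
    """Fraction of genomes whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (assumed definition)."""
    recalls = []
    for label in set(y_true):
        indices = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(y_pred[i] == label for i in indices)
        recalls.append(hits / len(indices))
    return sum(recalls) / len(recalls)

# Made-up example with an imbalanced test set: 2 Gram+ve and 6 Gram-ve genomes,
# with one Gram+ve genome predicted as Gram-ve.
y_true = ["Gram+ve"] * 2 + ["Gram-ve"] * 6
y_pred = ["Gram+ve", "Gram-ve"] + ["Gram-ve"] * 6

acc = accuracy(y_true, y_pred)
bal = balanced_accuracy(y_true, y_pred)
```

Because the training data contain roughly three times as many Gram−ve as Gram+ve bacteria, balanced accuracy is a useful companion to plain accuracy: a classifier that favoured the majority class would score well on the latter but poorly on the former.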

The tools used to produce our classifier from a training set, and the tools used to apply our classifier
to any new genome based only on its genome sequence, have been built into the DOE Systems
Biology Knowledgebase, enabling their application in new studies by the scientific community. We
will highlight and demonstrate these tools in the context of the analysis of our new Gram stain
classifier.

References

1. Anthony M Poole, Daniel B Stouffer, and Jason M Tylianakis. Ecosystomics: ecology by sequencer. Trends in Ecology and Evolution, 27(6):309, 2012.

2. Margaret A Shipp, Ken N Ross, Pablo Tamayo, Andrew P Weng, Jeffery L Kutok, Ricardo CT Aguiar, Michelle Gaasenbeek, Michael Angelo, Michael Reich, Geraldine S Pinkus, et al. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature Medicine, 8(1):68–74, 2002.

3. A.L. Tarca, V.J. Carey, X. Chen, R. Romero, and S. Drăghici. Machine learning and its applications to biology. PLoS Computational Biology, 3(6):e116, 2007.

4. Wouter G Touw, Jumamurat R Bayjanov, Lex Overmars, Lennart Backus, Jos Boekhorst, Michiel Wels, and Sacha AFT van Hijum. Data mining in the life sciences with random forest: a walk in the park or lost in the jungle? Briefings in Bioinformatics, 2012.

5. Peter J Turnbaugh, Ruth E Ley, Micah Hamady, Claire M Fraser-Liggett, Rob Knight, and Jeffrey I Gordon. The human microbiome project. Nature, 449(7164):804–810, 2007.

6. T.Y. Wong, L.A. Preston, and N.L. Schiller. Alginate lyase: review of major sources and enzyme characteristics, structure-function analysis, biological roles, and applications. Annual Reviews in Microbiology, 54(1):289–340, 2000.

This paper has an Extended Abstract file available; you must purchase the conference proceedings to access it.
