Publication

CAZymes Analysis Toolkit (CAT): Web service for searching and analyzing carbohydrate-active enzymes in a newly sequenced organism using CAZy database

Byung H Park, Tatiana V Karpinets, Mustafa H Syed, Michael R Leuze, and Edward C Uberbacher
2010 August 09
Glycobiology (2010) 20 (12): 1574-1584
Download Publication: Glycobiology-2010-Park-1574-84.pdf

Abstract

The Carbohydrate-Active Enzyme (CAZy) database provides a rich set of manually annotated enzymes that degrade, modify, or create glycosidic bonds. Despite rich and invaluable information stored in the database, software tools utilizing this information for annotation of newly sequenced genomes by CAZy families are limited. We have employed two annotation approaches to fill the gap between manually curated high-quality protein sequences collected in the CAZy database and the growing number of other protein sequences produced by genome or metagenome sequencing projects. The first approach is based on a similarity search against the entire nonredundant sequences of the CAZy database. The second approach performs annotation using links or correspondences between the CAZy families and protein family domains. The links were discovered using the association rule learning algorithm applied to sequences from the CAZy database. The approaches complement each other and in combination achieved high specificity and sensitivity when cross-evaluated with the manually curated genomes of Clostridium thermocellum ATCC 27405 and Saccharophagus degradans 2-40. The capability of the proposed framework to predict the function of unknown protein domains and of hypothetical proteins in the genome of Neurospora crassa is demonstrated. The framework is implemented as a Web service, the CAZymes Analysis Toolkit, and is available at http://cricket.ornl.gov/cgi-bin/cat.cgi.

Highlights

Fig1
Evaluations of the similarity search algorithm using CAZy annotation of C. thermocellum ATCC 27405 (left) and S. degradans 2-40 (right) as benchmarks. In each case, circular- and triangular-shaped plots denote specificities with and without additional evaluation of Pfam domains in the compared proteins, respectively. Likewise, rectangular- and diamond-shaped plots denote sensitivities with and without Pfam domain evaluation, respectively. For the analyzed organisms, no difference in the outputs was found between uni-directional and bi-directional BLAST searches.
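The specificity and sensitivity plotted in this figure can be illustrated with a small calculation. The sketch below assumes the standard definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)) computed over the set of all proteins in a genome; the paper may define the measures differently, and the function name and protein IDs are hypothetical.

```python
def sensitivity_specificity(predicted, benchmark, all_proteins):
    """Return (sensitivity, specificity) for a set of predicted CAZyme IDs.

    Assumes the standard definitions; the benchmark is the manually
    curated set of CAZymes for the genome.
    """
    predicted = set(predicted)
    benchmark = set(benchmark)
    all_proteins = set(all_proteins)

    tp = len(predicted & benchmark)                 # correctly predicted CAZymes
    fp = len(predicted - benchmark)                 # predicted but not in benchmark
    fn = len(benchmark - predicted)                 # benchmark CAZymes that were missed
    tn = len(all_proteins - predicted - benchmark)  # correctly left unannotated

    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


# Toy genome of ten proteins (hypothetical IDs).
genome = {f"p{i}" for i in range(1, 11)}
curated = {"p1", "p2", "p3", "p4"}
hits = {"p1", "p2", "p3", "p5"}
sens, spec = sensitivity_specificity(hits, curated, genome)
```

In this toy case three of the four curated CAZymes are recovered and one of the six non-CAZymes is wrongly annotated.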

Fig2
Evaluation of the CAZy family annotations based on the association rules between CAZy families and Pfam domains using CAZy annotation of C. thermocellum ATCC 27405 (left) and S. degradans 2-40 (right) as benchmarks. In each case, diamond- and rectangular-shaped plots denote specificity and sensitivity, respectively. In each case, rules with a minimum support of 10 are used.
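To make the idea of a rule with a minimum support concrete, here is a minimal sketch of mining single-domain rules of the form "Pfam domain -> CAZy family". It is not the authors' implementation: it assumes support is the number of sequences carrying both the domain and the family, and confidence is that count divided by the number of sequences carrying the domain. The function name, the `min_confidence` parameter, and all identifiers in the example are hypothetical.

```python
from collections import Counter


def mine_rules(records, min_support=10, min_confidence=0.0):
    """Mine 'Pfam domain -> CAZy family' rules from annotated sequences.

    records: iterable of (pfam_domains, cazy_families) pairs, one per protein.
    Returns (domain, family, support, confidence) tuples, highest support first.
    """
    domain_counts = Counter()  # sequences containing each domain
    pair_counts = Counter()    # sequences containing each (domain, family) pair
    for domains, families in records:
        domains, families = set(domains), set(families)
        domain_counts.update(domains)
        pair_counts.update((d, f) for d in domains for f in families)

    rules = []
    for (domain, family), support in pair_counts.items():
        confidence = support / domain_counts[domain]
        if support >= min_support and confidence >= min_confidence:
            rules.append((domain, family, support, confidence))
    return sorted(rules, key=lambda r: (-r[2], r[0], r[1]))


# Toy data with placeholder identifiers (not real Pfam or CAZy entries).
records = (
    [({"DOM_A"}, {"FAM_1"})] * 12
    + [({"DOM_A"}, {"FAM_2"})] * 3
    + [({"DOM_B"}, {"FAM_3"})] * 4
)
rules = mine_rules(records, min_support=10)
```

With a minimum support of 10, only the DOM_A -> FAM_1 link survives here; the other two pairs occur in too few sequences to be reported.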

Fig3
Visualization of the domain architecture in CAT for selected glycosyltransferases family 8 (A), glycoside hydrolases family 114 (B), glycoside hydrolases family 108 (C), and carbohydrate esterases family 6 (D), confirming links of the CAZy families with domains of unknown function (DUF).

Citation

Park BH, Karpinets TV, Syed MH, Leuze MR, Uberbacher EC. CAZymes Analysis Toolkit (CAT): web service for searching and analyzing carbohydrate-active enzymes in a newly sequenced organism using CAZy database. Glycobiology. 2010 Dec;20(12):1574-84. Epub 2010 Aug 9. PubMed PMID: 20696711.


Comments

The protein sequences of genes in the CAZy database were downloaded in batch through the eUtils Web service of GenBank (http://eutils.ncbi.nlm.nih.gov) and stored in FASTA format with GenBank accessions as IDs. The sequences were then processed to remove redundancies by keeping only the latest submission of each sequence to GenBank. This set of unique sequences is used as the reference set for the similarity search and is hereafter referred to as the CAZy reference set. Supplementary data for this article are available online at http://glycob.oxfordjournals.org/.
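The redundancy-removal step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: it assumes FASTA headers begin with a versioned GenBank accession (for example `ABC12345.2`) and that "latest submission" means the highest version number for a given base accession. The function names and the sample accessions are hypothetical.

```python
def parse_fasta(text):
    """Yield (header_id, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            parts = line[1:].split()
            header, chunks = (parts[0] if parts else ""), []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_latest_versions(records):
    """Keep one record per base accession: the one with the highest version.

    Assumption: IDs look like 'ACCESSION.VERSION'; IDs without a numeric
    version suffix are treated as version 0.
    """
    latest = {}
    for acc_ver, seq in records:
        base, _, version = acc_ver.partition(".")
        version = int(version) if version.isdigit() else 0
        if base not in latest or version > latest[base][0]:
            latest[base] = (version, acc_ver, seq)
    return {acc_ver: seq for _, acc_ver, seq in latest.values()}


# Made-up accessions and sequences for illustration only.
sample = """>ABC12345.1 older submission
MKT
LLV
>ABC12345.2 newer submission
MKTA
>XYZ00001.1
GGG
"""
reference_set = keep_latest_versions(parse_fasta(sample))
```

Here `reference_set` retains `ABC12345.2` and `XYZ00001.1` and drops the superseded `ABC12345.1` record.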