Genetics
In 1995, scientists at Kyoto University initiated the KEGG (Kyoto Encyclopedia of Genes and Genomes) database project under the Japan Human Genome Program for the biological interpretation of genome sequence data. The main goal of KEGG was to establish links between sets of genes in the genome and high-level functions of the cell and organism. To that end, the project developed, among other resources, the KEGG PATHWAY database as a representation of high-level functions, the KEGG GENES database as a collection of gene catalogs from fully sequenced genomes, and the KO (KEGG Orthology) database for linking genes to high-level functions.
As of July 19, 2022, KEGG Orthology contains over 25,000 entries, and the number of genes described in the database across all organisms is close to 42 million. KEGG covers 8,234 organisms: 770 eukaryotes, 7,072 bacteria, and 392 archaea.
KEGG database statistics. Updated daily. https://www.kegg.jp/kegg/docs/statistics.html
Genome annotation is done differently in KEGG than in most other databases.
First, molecular functions are stored in the KO database and linked to ortholog groups, so that experimental evidence from one organism can be extended to other organisms. Annotating an individual gene in the GENES database consists simply of linking it to the KO database by assigning a KO entry identifier, called a K number.
Second, ortholog groups are defined in the context of KEGG pathway maps and other molecular networks whose nodes are KO entries. As a result, the annotation procedure that converts the set of genes in a genome into a set of K numbers automatically reconstructs the KEGG pathways and other networks, which in turn allows high-level functions to be interpreted.
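The step from K numbers to reconstructed pathways can be sketched as follows. This is a minimal illustration, not KEGG's implementation: the gene names, the pathway names, and the assignment of K numbers to them are hypothetical toy data, and `reconstruct_pathways` is a name introduced here only for the example.

```python
from collections import defaultdict

# Hypothetical toy data: gene -> K number assigned during annotation.
gene_to_ko = {
    "geneA": "K00001",
    "geneB": "K00002",
    "geneC": None,  # no K number could be assigned
}

# Hypothetical toy networks: pathway -> K numbers that label its nodes.
pathway_to_kos = {
    "pathway_1": {"K00001", "K00002", "K00003"},
    "pathway_2": {"K00004"},
}


def reconstruct_pathways(gene_to_ko, pathway_to_kos):
    """For each pathway, list the nodes (K numbers) this genome covers
    and the genes that carry each K number."""
    ko_to_genes = defaultdict(list)
    for gene, ko in gene_to_ko.items():
        if ko is not None:
            ko_to_genes[ko].append(gene)

    result = {}
    for pathway, kos in pathway_to_kos.items():
        present = {
            ko: sorted(ko_to_genes[ko])
            for ko in sorted(kos)
            if ko in ko_to_genes
        }
        if present:  # keep only pathways with at least one covered node
            result[pathway] = present
    return result


reconstructed = reconstruct_pathways(gene_to_ko, pathway_to_kos)
```

With the toy data, only `pathway_1` is reconstructed, with two of its three nodes covered; the unannotated `geneC` contributes nothing, which is why the annotation rate directly limits what can be interpreted.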
In early 2015, categories for viruses and plasmids were added; they are important for metagenome analysis and antimicrobial resistance, respectively. The addendum category was introduced next, and with it KEGG began, for the first time, to collect protein sequence data from the published literature rather than only importing complete genome sequences from RefSeq or GenBank. This is necessary because pathway maps built from the literature sometimes contain genes and proteins from organisms whose genome sequences are not known.
The genomic information category contains the GENOME and GENES databases for collections of organisms with complete genomes and their gene catalogs, which are mainly taken from the RefSeq and GenBank databases.
The KO database, which contains ortholog groups associated with molecular functions, is the hub that links genomic information to systems information through the KEGG mapping procedure, and to chemical information through the metabolic network.
The COMPOUND, GLYCAN, REACTION, RPAIR, RCLASS and ENZYME databases contain chemicals and reactions in the category of chemical information and are called KEGG LIGAND for historical reasons. The ENZYME database is derived from the Enzyme Nomenclature database. There is also a small dataset of reaction modules that can be used to annotate enzyme genes.
The health information category consists of the DISEASE, DRUG, DGROUP and ENVIRON databases for disease and drug information. DGROUP is a relatively recent addition that is being developed to group functionally identical or similar drugs in drug interaction networks. KEGG MEDICUS is a public interface that links these internally developed databases with the package inserts of all medicines marketed in Japan and the United States. The Japanese version of KEGG MEDICUS is particularly advanced in this integration and is mainly accessed through search engines.
The development of the KO database is closely tied to the development of KEGG molecular networks, including KEGG pathway maps, BRITE functional hierarchies, and KEGG modules. Ideally, a KO would represent a single sequence similarity group with an appropriate level of similarity. In practice there are a number of difficulties: one KO may consist of several sequence similarity groups. As long as the constituent sequence similarity groups are well defined, the KOALA (KEGG Orthology and Links Annotation) program for computationally assigning K numbers works well. However, a small number of older KOs are still associated with sequence data that are not well defined.
Internally, the KO grouping is continuously updated by manually checking the results of the KOALA annotation procedure. For third-party users, the basis of each KO grouping and its correspondence to molecular function must be made clear by experimental evidence. Serious efforts have therefore been made to annotate individual KOs with references that report experiments characterizing gene and protein function and, when possible, with the protein sequence data used in those experiments, such as the data provided in INSDC (DDBJ/ENA/GenBank).
Eukaryotes and prokaryotes with complete genomes make up the KEGG organisms, identified by three- or four-letter organism codes. As shown in the second table, there are three additional categories: viruses, plasmids, and addendum, with the two-letter codes vg, pg, and ag, respectively. The virus and plasmid categories are taken from the RefSeq collections. The annotation rate (K number assignment) is very low for viruses, around 7% compared to 46% for KEGG organisms, but this category is useful in metagenome annotation. Many plasmids are included in the complete genomes of KEGG organisms; the rest are selected and stored in the plasmid category.
The addendum category is a set of manually created protein sequence entries. In the KEGG pathway maps, there were cases in the past where no corresponding genes could be found in KEGG organisms and only links to UniProt were given. To associate these with sequence data and K numbers, entries are created in the addendum category from the original sequence data, using the protein accession numbers of the International Nucleotide Sequence Database Collaboration (INSDC). Sequence entries are also created in certain focus areas. One of them is enzyme nomenclature. Another is antimicrobial resistance (AMR). AMR is a significant problem in the treatment of infectious diseases and their complications. The KEGG database has traditionally held a variety of content on infectious diseases and antimicrobials, including KEGG pathway maps for infectious diseases, KEGG metabolic pathway maps for antibiotic biosynthesis, KEGG drug structure maps for the history of antimicrobial development, and KEGG DRUG entries for all drugs currently in use.
Thanks to the genome annotation procedure in KEGG, the GENES database is structured in terms of KO groups. This simplifies the processing of sequence similarity search results against the GENES database: it reduces to assigning the most appropriate K numbers, as implemented in the automatic annotation service KAAS and in the recently released BlastKOALA and GhostKOALA. BlastKOALA is suitable for annotating fully sequenced genomes, while GhostKOALA, which uses GHOSTX and is 100 times faster, is suitable for annotating large datasets such as metagenomes. Both assign K numbers to query amino acid sequences and allow KEGG mapping for the interpretation of high-level functions. In BlastKOALA, the most appropriate K numbers are determined in a manner similar to the KOALA program used internally to annotate KEGG organisms. In GhostKOALA, only the top-scoring hits are examined for a K number. Another feature of GhostKOALA is the assignment of taxonomic composition. To do this, the GhostKOALA pangenome data set is supplemented with sequences selected from CD-HIT clusters,
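The idea of examining only the top-scoring hit for a K number can be sketched as below. This is a deliberately simplified illustration and not the actual GhostKOALA scoring procedure: the hit tuples, the scores, and the function name `assign_k_numbers_top_hit` are assumptions made for the example.

```python
def assign_k_numbers_top_hit(hits):
    """Assign to each query the K number of its best-scoring hit.

    hits: iterable of (query_id, k_number_or_None, score) tuples, where
    k_number is None when the matched GENES entry has no K number.
    A query whose best hit has no K number stays unassigned (None).
    """
    best = {}
    for query, k_number, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (k_number, score)
    return {query: k_number for query, (k_number, _) in best.items()}


# Hypothetical search results for two query sequences.
example_hits = [
    ("q1", "K00001", 250.0),
    ("q1", "K00002", 310.5),
    ("q2", None, 120.0),
    ("q2", "K00003", 90.0),
]
assignments = assign_k_numbers_top_hit(example_hits)
```

In this toy run `q1` receives the K number of its higher-scoring hit, while `q2` stays unassigned because its best hit carries no K number. A fuller procedure that weighs several hits, as described for BlastKOALA, would need more than this single comparison.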
The following describes some of the protocols for working with KEGG.
This protocol is an introduction to the KEGG database resource. KEGG consists of fifteen main databases, shown in Table 1.12.1 (Kanehisa et al., 2012). Each database entry, with the exception of entries in KEGG GENES and KEGG ENZYME, is identified by a unique identifier, called the entry number, consisting of a database-specific prefix and a five-digit number. KEGG GENES and KEGG ENZYME are derived from RefSeq (Pruitt et al., 2012) and ExplorEnz (McDonald et al., 2009), respectively, and use the identifiers of the source databases: the locus tag or NCBI Gene ID for GENES, and the EC number for ENZYME.
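The prefix-plus-five-digits convention can be checked mechanically. The sketch below is only an illustration of that convention: the regular expression and the function name `split_entry_id` are assumptions made here, and the pattern deliberately does not match GENES or ENZYME identifiers, which follow the source databases' own formats.

```python
import re

# Assumed pattern: an alphabetic prefix followed by exactly five digits.
_ENTRY_ID = re.compile(r"^([A-Za-z]+)(\d{5})$")


def split_entry_id(identifier):
    """Split an identifier into (prefix, five-digit number).

    Returns None when the identifier does not follow the
    prefix-plus-five-digits convention.
    """
    match = _ENTRY_ID.match(identifier)
    if match is None:
        return None
    return match.group(1), match.group(2)


ko_parts = split_entry_id("K00001")         # a K number
pathway_parts = split_entry_id("hsa00010")  # organism code + map number
gene_parts = split_entry_id("1080")         # NCBI Gene ID: no prefix
enzyme_parts = split_entry_id("2.3.3.1")    # EC number: different format
```

The last two calls return `None`, which mirrors the exception stated in the text for GENES and ENZYME entries.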
The pathway maps are divided into four types.
a. Metabolic pathway maps (Categories 0. Global Map and 1. Metabolism) described in Core Protocols 2 and 3.
b. Regulatory pathway maps (Categories 2. Genetic Information Processing, 3. Environmental Information Processing, 4. Cellular Processes, and 5. Organismal Systems) described in Core Protocol 4.
c. Disease pathway maps (Category 6. Human Diseases) described in Core Protocol 5.
d. Drug structure maps (Category 7. Drug Development) described in Core Protocol 6.
Each genome is identified by a three-letter organism code (in addition to the T number shown in Table 1.12.1), such as “hsa” for Homo sapiens (human).
This protocol is an introduction to the KEGG PATHWAY database. KEGG PATHWAY is a collection of manually drawn reference diagrams, or maps, each corresponding to a known biological pathway of functional significance. In addition, organism-specific pathways are generated computationally from the manually drawn reference pathways.
a. Rectangles are gene products (proteins), linked to KEGG ORTHOLOGY (KO) entries in reference pathways and to KEGG GENES entries in organism-specific pathways.
b. Small circles represent chemical compounds, glycans and other molecules, linked to KEGG COMPOUND, KEGG GLYCAN and other entries.
c. Large ovals are links to other pathway maps.
d. You can click “Help” for an explanation of the various symbols.
a. There are four types of reference pathways in metabolic maps. Pathways prefixed with ko, ec, and rn are linked to KO, ENZYME, and REACTION entries, respectively, and show the linked rectangles in blue. With “Reference pathway (EC)” selected, click the box labeled 2.3.3.1 to see the information for that ENZYME entry. Click the circle labeled citrate to see the information for that entry, and so on.
b. The organism drop-down menu can be used to color the parts of the pathway that are known to exist in a given organism. For example, select “Homo sapiens (human)” to display the pathway with the nodes that have human genes colored green. Then click the same rectangle again to see that it is now linked to the corresponding GENES entry.
c. The drop-down menu also includes “Homo sapiens (human) + Disease/drug”, which displays known disease genes in pink and drug targets in blue.
d. Because the number of complete genomes is increasing rapidly, the organism drop-down menu has become very long. It may be easier to choose the pathway for a specific organism from the “Organism menu” link.
a. There are no rectangles on the global map; instead, edges are associated with KO, ENZYME, REACTION, and GENES records.
b. The reference global map is colored according to the metabolism classification (from 1.1 to 1.11 on the KEGG PATHWAY database page); maps for specific organisms are created by decolorizing the parts that have no corresponding genes.
Select, for example, “Homo sapiens (human)” and then “Arabidopsis thaliana (thale cress)” to see the difference between animal and plant metabolism.
KEGG metabolic pathway maps, especially global maps, are widely used to study metabolic abilities inferred from genomic, transcriptomic, metagenomic, and other data, and to compare or combine the metabolic abilities of multiple organisms. This protocol presents methods for direct access to organism-specific pathways.
a. To do this, go to the KEGG homepage. Enter an organism code, such as hsa, into the small search box under “Organism-specific entry points” and click “Go”.
b. Or you can go to the KEGG2 page. Enter the organism code in the “KEGG for specific organisms” section and click “Go”.
c. The summary page for that organism opens. Click the Pathway link in the navigation bar to see the full set of pathway maps available for that organism.
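The same organism-specific entry points can also be addressed programmatically. The sketch below only builds request URLs and performs no network access. It assumes the REST-style service at `rest.kegg.jp` with its `list` and `get` operations, which is not described in this text and should be checked against the current KEGG API documentation; the helper names are introduced here for illustration.

```python
# Assumed base address of the KEGG REST-style service (not part of this text).
KEGG_REST = "https://rest.kegg.jp"


def pathway_list_url(organism_code):
    """URL that lists the pathway maps available for one organism."""
    return f"{KEGG_REST}/list/pathway/{organism_code}"


def pathway_entry_url(organism_code, map_number):
    """URL for one organism-specific pathway entry, e.g. hsa + 00010."""
    return f"{KEGG_REST}/get/{organism_code}{map_number}"


human_pathways = pathway_list_url("hsa")
human_glycolysis = pathway_entry_url("hsa", "00010")
```

Building the organism-specific identifier by prepending the organism code to the five-digit map number follows the identifier convention described earlier for pathway maps.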
a. Return to the KEGG home page. This time, enter two organism codes separated by a space or joined by a plus sign, such as “hsa ath” or “hsa+ath”.
b. Alternatively, do the same as in the first example on the KEGG2 page, in the “KEGG mapping for genome comparison and combination” section, and click “Go”.
c. The summary page for that set of organisms opens. Click “Pathway maps” in the navigation bar to see the list of available pathways. Select the global metabolism map 01100.
d. The global map is now displayed in three colors: green for pathway elements specific to the first organism (hsa, Homo sapiens), red for elements specific to the second organism (ath, Arabidopsis thaliana), and blue for elements common to the two organisms (Fig. 1.12.3).
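The three-color scheme in the last step is a set comparison, which can be sketched as follows. The element identifiers below are hypothetical placeholders rather than real map contents, and `color_elements` is a name chosen for this example.

```python
def color_elements(first, second):
    """Color each pathway element by which organism(s) it occurs in.

    green: only in the first organism
    red:   only in the second organism
    blue:  in both organisms
    """
    first, second = set(first), set(second)
    colors = {}
    for element in first | second:
        if element in first and element in second:
            colors[element] = "blue"
        elif element in first:
            colors[element] = "green"
        else:
            colors[element] = "red"
    return colors


# Hypothetical element sets for two organisms.
first_organism = {"K00001", "K00002", "K00003"}
second_organism = {"K00002", "K00003", "K00004"}
coloring = color_elements(first_organism, second_organism)
```

Elements that occur in neither organism are absent from the result, which corresponds to the uncolored parts of the map.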
a. Open the KEGG homepage. Go to “KEGG Organisms” under “Organism-specific entry points”. In the “KEGG Organisms: Complete Genomes” table that is displayed, click any of the category names, such as “Vertebrates”.
b. To view the pathway maps for the group, open the “Pathway maps” link in the navigation bar. Select, for example, the metabolic map 00010 Glycolysis/Gluconeogenesis. The number of genes corresponding to each node (rectangle) is shown by color gradation.
c. Return to the “KEGG Organisms: Complete Genomes” table. There is a Pan link in the top right corner. Click this link to view the list of KEGG pangenomes. Here you can select any species name, such as “Escherichia coli”, to get a collection of its different strains.
d. Alternatively, on the KEGG2 page, enter the name of a category, a group of organisms, or a pangenome species in the “KEGG mapping for genome comparison and combination” section and click “Go”.
e. Color gradation is also used in each of these maps.
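The color gradation by gene count can be sketched as a count followed by a binning step. This is a minimal illustration under stated assumptions: the number of levels and the linear scaling are choices made here, not KEGG's actual color scale, and the organism names and K numbers are toy data.

```python
from collections import Counter


def node_gene_counts(organism_to_kos):
    """Count, across a group of organisms, the genes assigned to each node.

    organism_to_kos maps an organism to a list with one K number per gene.
    """
    counts = Counter()
    for kos in organism_to_kos.values():
        counts.update(kos)
    return counts


def gradation(counts, levels=4):
    """Map each count to a level from 1 to `levels`, scaled to the maximum."""
    if not counts:
        return {}
    top = max(counts.values())
    # Ceiling division keeps every node that has genes at level 1 or above.
    return {ko: -(-count * levels // top) for ko, count in counts.items()}


# Hypothetical group of three organisms.
group = {
    "org1": ["K00001", "K00001", "K00002"],
    "org2": ["K00001", "K00003"],
    "org3": ["K00001", "K00003"],
}
counts = node_gene_counts(group)
levels = gradation(counts)
```

Here `K00001` has four genes across the group and gets the darkest level, while `K00002`, with a single gene, gets the lightest.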
The KEGG GENES database is a collection of full genome catalogs with high quality sequence data. For prokaryotes, all genomes available from the NCBI RefSeq ftp site ( ftp://ftp.ncbi.nih.gov/genomes/Bacteria/ ) are included in KEGG GENES. For eukaryotes, most of the data comes from the RefSeq release ( ftp://ftp.ncbi.nih.gov/refseq/release/ ).
a. The Entry field contains the number 1080, which in this case corresponds to the NCBI Gene ID.
b. The gene names and definitions in the next two fields are taken from RefSeq without any changes.
c. The Orthology field contains the annotation given by KEGG, that is, the assignment of a KEGG Orthology (KO) group, identified by its K number.
d. The following fields contain links to other KEGG databases with information on the pathways in which the gene product is involved, diseases associated with the gene, drugs targeting the gene product, and the BRITE hierarchies for classifying genes and proteins.
e. The SSDB and Motif fields contain search tools for the KEGG SSDB database.
f. The “Other DBs” field provides links to external databases that contain related information. The “All links” field on the right is a summary of links through the GenomeNet LinkDB system. The PDB field contains links to 3D structure data, if any.
g. The AA seq and NT seq fields can be used to extract sequence data for further analysis, such as a sequence similarity search with BLAST or FASTA.
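Sequence text copied from the AA seq or NT seq field usually has to be put into FASTA format before it is passed to a similarity search. The helper below is a minimal sketch of that step: the function name `to_fasta`, the 60-character line width, and the placeholder sequence are assumptions for illustration, and the sequence shown is not the real sequence of entry 1080.

```python
def to_fasta(entry_id, description, sequence, width=60):
    """Format a sequence as a FASTA record.

    Whitespace and line breaks in the pasted sequence are removed,
    and the residues are re-wrapped to `width` characters per line.
    """
    residues = "".join(sequence.split()).upper()
    header = f">{entry_id} {description}".rstrip()
    lines = [residues[i:i + width] for i in range(0, len(residues), width)]
    return "\n".join([header] + lines) + "\n"


# Placeholder residues only; paste the content of the AA seq field here.
record = to_fasta("hsa:1080", "example entry", "acdef ghikl\nmn", width=5)
```

The resulting record has one header line starting with `>` followed by the wrapped sequence lines, which is the input form that BLAST and FASTA search tools accept.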