Yana Slesarenko

How KEGG works

Genetics

In 1995, scientists at Kyoto University initiated the KEGG (Kyoto Encyclopedia of Genes and Genomes) database project under the Japan Human Genome Program to support the biological interpretation of genome sequence data. The main goal of KEGG was to establish links between sets of genes in the genome and high-level functions of the cell and organism. To that end, the developers created the KEGG PATHWAY database as a representation of high-level functions, the KEGG GENES database as a collection of gene catalogs from fully sequenced genomes, and the KO (KEGG Orthology) database for linking genes to high-level functions.

As of July 19, 2022, KEGG Orthology contains over 25,000 entries, and the number of genes described in the database across all organisms is close to 42 million. KEGG covers 8,234 organisms: 770 eukaryotes, 7,072 bacteria, and 392 archaea.

KEGG database statistics. Updated daily. https://www.kegg.jp/kegg/docs/statistics.html

Genome annotation is done differently in KEGG than in most other databases.

First, molecular functions are stored in the KO database and associated with ortholog groups, so that experimental evidence from one organism can be extended to other organisms. Annotating an individual gene in the GENES database then simply means linking it to the KO database by assigning a KO entry identifier, called a K number.

Second, ortholog groups are defined in the context of KEGG pathway maps and other molecular networks, which are built from K number nodes. Converting the set of genes in a genome into a set of K numbers therefore automatically reconstructs KEGG pathways and other networks, which in turn allows high-level functions to be interpreted.
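
The two steps above can be illustrated with a minimal sketch. It assumes two lookup tables, gene to K number and K number to pathway map. All identifiers and associations below are placeholders written in KEGG style for illustration; they were not retrieved from KEGG.

```python
from collections import defaultdict

# Toy data, invented for illustration only (not retrieved from KEGG).
# Gene identifier -> assigned K number (None means no KO assignment).
gene_to_ko = {
    "org:gene001": "K00001",
    "org:gene002": "K00002",
    "org:gene003": None,
    "org:gene004": "K00003",
}

# K number -> pathway maps in which that K number appears as a node.
ko_to_pathways = {
    "K00001": ["map00010", "map00020"],
    "K00002": ["map00020"],
    "K00003": ["map00030"],
}


def reconstruct_pathways(gene_to_ko, ko_to_pathways):
    """Group annotated genes by the pathway maps their K numbers belong to."""
    pathways = defaultdict(set)
    for gene, ko in gene_to_ko.items():
        if ko is None:
            continue  # genes without a K number cannot be placed on a map
        for pathway in ko_to_pathways.get(ko, []):
            pathways[pathway].add(gene)
    return {pathway: sorted(genes) for pathway, genes in pathways.items()}


reconstructed = reconstruct_pathways(gene_to_ko, ko_to_pathways)
for pathway in sorted(reconstructed):
    print(pathway, reconstructed[pathway])
```

Once every gene carries a K number, no further per-organism work is needed: the pathway view falls out of the shared K number to map table.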

In early 2015, the KEGG developers added virus and plasmid categories, which are important for metagenome analysis and antimicrobial resistance, respectively. They then introduced the addendum category, in which protein sequence data were collected from the published literature for the first time, rather than only importing complete genome sequences from RefSeq or GenBank. This is necessary because pathway maps drawn from the literature sometimes contain genes and proteins from organisms whose genome sequences are unknown.

The genomic information category contains the GENOME and GENES databases for collections of organisms with complete genomes and their gene catalogs, which are mainly taken from the RefSeq and GenBank databases.

The KO database, which contains ortholog groups associated with molecular functions, is the hub that links genomic information to systems information through the KEGG mapping procedure, and to chemical information through the metabolic network.

The COMPOUND, GLYCAN, REACTION, RPAIR, RCLASS and ENZYME databases contain chemicals and reactions in the category of chemical information and are called KEGG LIGAND for historical reasons. The ENZYME database is derived from the Enzyme Nomenclature database. There is also a small dataset of reaction modules that can be used to annotate enzyme genes.

The health information category consists of the DISEASE, DRUG, DGROUP and ENVIRON databases for disease and drug information. DGROUP is a relatively recent addition that is being developed to group functionally identical or similar drugs in drug interaction networks. KEGG MEDICUS is a public interface that links these internally developed databases with the package inserts (drug labels) of all medicines marketed in Japan and the United States. The Japanese version of KEGG MEDICUS is particularly advanced in this integration and is mainly accessed through search engines.

Experimental evidence

The development of the KO database is closely tied to the development of KEGG molecular networks, including KEGG pathway maps, BRITE functional hierarchies, and KEGG modules. Ideally, a KO corresponds to a single sequence similarity group with an appropriate level of similarity. In practice, there are a number of difficulties: one KO may consist of several sequence similarity groups. As long as the constituent sequence similarity groups are well defined, the KOALA (KEGG Orthology and Links Annotation) program for computationally assigning K numbers works well. However, there is still a small number of older KOs whose sequence data are not well defined.

Internally, the KO grouping is continuously updated by manually checking the results of the KOALA annotation procedure. For outside users, the basis of the KO grouping and its correspondence to molecular function should be made clear by experimental evidence. Considerable effort has therefore gone into annotating individual KOs with references that report experiments characterizing gene and protein function and, where possible, with the protein sequence data used in those experiments, such as those available from INSDC (DDBJ/ENA/GenBank).

KEGG organisms are eukaryotes and prokaryotes with complete genomes, identified by three- or four-letter organism codes. As shown in the second table, there are three additional categories: viruses, plasmids, and addendum, with the two-letter codes vg, pg, and ag, respectively. The virus and plasmid categories are taken from the RefSeq collections. The annotation rate (the proportion of genes assigned a K number) is very low for viruses, around 7% compared to 46% for KEGG organisms, but this category is useful for metagenome annotation. Many plasmids are already included in the complete genomes of KEGG organisms; the rest are selected and stored in the plasmid category.
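
The prefix convention described above can be sketched as a small classifier. It assumes identifiers written as prefix:gene and organism codes made only of lowercase letters. The function name and the sample gene numbers other than the organism codes are hypothetical.

```python
import re

# Two-letter codes for the three additional categories named in the text.
EXTRA_CATEGORIES = {"vg": "viruses", "pg": "plasmids", "ag": "addendum"}

# Assumption: organism codes are three or four lowercase letters.
ORGANISM_CODE = re.compile(r"[a-z]{3,4}")


def genes_category(identifier):
    """Return the GENES category for an identifier of the form 'prefix:gene'."""
    prefix, separator, gene = identifier.partition(":")
    if not separator or not gene:
        raise ValueError(f"expected 'prefix:gene', got {identifier!r}")
    if prefix in EXTRA_CATEGORIES:
        return EXTRA_CATEGORIES[prefix]
    if ORGANISM_CODE.fullmatch(prefix):
        return "KEGG organism"
    raise ValueError(f"unrecognized prefix {prefix!r}")


for example in ("hsa:1956", "vg:0001", "pg:0001", "ag:0001"):
    print(example, "->", genes_category(example))
```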

The addendum category is a set of manually created protein sequence entries. In the past, some elements of KEGG pathway maps had no corresponding genes in KEGG organisms, and only links to UniProt were given. To associate them with sequence data and K numbers, entries are created in the addendum category from the original sequence data, using International Nucleotide Sequence Database Collaboration (INSDC) protein accession numbers. Sequence entries are also being created in specific focus areas. One is enzyme nomenclature. Another is antimicrobial resistance (AMR), a significant problem in the treatment of infectious diseases and their complications. KEGG has traditionally offered a range of content on infectious diseases and antimicrobials, including KEGG pathway maps for infectious diseases, KEGG metabolic pathway maps for antibiotic biosynthesis, KEGG drug structure maps showing the history of antimicrobial development, and KEGG DRUG entries for all drugs currently in use.

BlastKOALA and GhostKOALA

Because of the way genomes are annotated in KEGG, the GENES database is structured in terms of KO groups. This simplifies the processing of sequence similarity search results against the GENES database: it reduces to assigning the most appropriate K numbers, as implemented in the automatic annotation service KAAS and in the recently released BlastKOALA and GhostKOALA. BlastKOALA is suited to annotating fully sequenced genomes, while GhostKOALA, which uses GHOSTX and is 100 times faster, is suited to annotating large datasets such as metagenomes. Both assign K numbers to query amino acid sequences and enable KEGG mapping for interpreting high-level functions. In BlastKOALA, the most appropriate K numbers are determined in a manner similar to the KOALA program used internally to annotate KEGG organisms. In GhostKOALA, only the top-scoring hits are examined when assigning a K number. Another feature of GhostKOALA is the assignment of taxonomic composition. To do this, the GhostKOALA pangenome data set is supplemented with sequences selected from CD-HIT clusters,
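
The top-hit rule mentioned for GhostKOALA can be sketched roughly as follows. This is a simplified reading of the description above, not the actual GhostKOALA scoring, which the text does not spell out. The hit tuples, scores, and function name are hypothetical.

```python
def assign_k_numbers_top_hit(hits):
    """Assign each query the K number of its single best-scoring hit.

    `hits` is an iterable of (query_id, k_number_or_None, score) tuples.
    A query whose best hit has no K number stays unannotated (None).
    """
    best = {}
    for query, k_number, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (k_number, score)
    return {query: k_number for query, (k_number, _) in best.items()}


# Toy similarity search results, invented for illustration.
toy_hits = [
    ("query1", "K00001", 250.0),
    ("query1", "K00002", 180.0),
    ("query2", None, 90.0),       # best hit belongs to no KO group
    ("query2", "K00003", 75.0),
]

assignments = assign_k_numbers_top_hit(toy_hits)
print(assignments)
```

Looking only at the best hit keeps the work per query small, which matters for large inputs such as metagenomes. The price is that a lower-ranked hit with a K number is ignored, as for query2 here.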

The following describes some of the protocols for working with KEGG.

Protocol 1

KEGG DATABASE RESOURCE: GETTING STARTED

This protocol is an introduction to the KEGG database resource. KEGG consists of fifteen main databases, shown in Table 1.12.1 (Kanehisa et al., 2012). Each database entry, except those in KEGG GENES and KEGG ENZYME, is identified by a unique identifier consisting of a database-specific prefix and a five-digit number (the entry number). KEGG GENES and KEGG ENZYME are derived from RefSeq (Pruitt et al., 2012) and ExplorEnz (McDonald et al., 2009), respectively, and use the identifiers of the source databases: the locus tag or NCBI Gene ID for GENES, and the EC number for ENZYME.

  1. Open the KEGG website home page at http://www.kegg.jp/ . The home page contains entry points to the most widely used databases and analysis tools.
  2. Select the KEGG2 link on the home page, designated as the main entry point to KEGG. It opens the KEGG table of contents, which lists all available databases and computational tools (Figure 1.12.1).
  3. Return to the home page and open the KEGG PATHWAY link, or click KEGG PATHWAY on the KEGG2 page. The KEGG PATHWAY database page opens.
    The KEGG2 and PATHWAY links (as well as the BRITE and MODULE links) are always present in the navigation bar, which is color-coded yellow for the top level and purple, red, or blue for the sublevels.
  4. This page lists all available KEGG pathway maps, which are divided into four types.

a. Metabolic pathway maps (Categories 0. Global Map and 1. Metabolism), described in Protocols 2 and 3.

b. Regulatory pathway maps (Categories 2. Genetic Information Processing, 3. Environmental Information Processing, 4. Cellular Processes, and 5. Organismal Systems), described in Protocol 4.

c. Disease pathway maps (Category 6. Human Diseases), described in Protocol 5.

d. Drug structure maps (Category 7. Drug Development), described in Protocol 6.

  5. Return to the home page and click KEGG Organisms to open a table of all genomes available in KEGG.

Each genome is identified by a three- or four-letter organism code (in addition to the T number shown in Table 1.12.1), such as “hsa” for Homo sapiens (human).

  6. The left sidebar of the KEGG home page contains links to useful information and documentation. For example, “Current statistics” shows the number of entries in the individual KEGG databases, most of which are updated daily.
  7. The search box at the top can be used for keyword searches in KEGG. Type alzheimer, for example, to view KEGG entries related to Alzheimer’s disease.
  8. The same search box can also be used to retrieve a specific KEGG entry directly by entering its unique identifier (prefix plus five-digit number), an EC number, or a gene identifier in the form org:gene, where org is the organism code and gene is the locus tag or NCBI Gene ID. Try entering, for example, map00020, 2.3.3.1 or hsa:1956.
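
The kinds of input accepted by the search box in steps 7 and 8 can be told apart with a few regular expressions. The patterns below are assumptions derived from the description above, not KEGG's own parsing rules, and the function name is hypothetical. Only fully specified four-part EC numbers are recognized.

```python
import re

ENTRY_ID = re.compile(r"[A-Za-z]+\d{5}")       # prefix + five-digit number
EC_NUMBER = re.compile(r"\d+\.\d+\.\d+\.\d+")  # four numeric parts, e.g. 2.3.3.1
GENE_ID = re.compile(r"[a-z]{2,4}:\S+")        # org:gene


def classify_query(text):
    """Guess which kind of search-box input a string represents."""
    query = text.strip()
    if ENTRY_ID.fullmatch(query):
        return "entry identifier"
    if EC_NUMBER.fullmatch(query):
        return "EC number"
    if GENE_ID.fullmatch(query):
        return "gene identifier"
    return "keyword"


for example in ("map00020", "2.3.3.1", "hsa:1956", "alzheimer"):
    print(f"{example!r}: {classify_query(example)}")
```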

Protocol 2

KEGG PATHWAY: MAP OF THE METABOLIC PATHWAY

This protocol is an introduction to the KEGG PATHWAY database. KEGG PATHWAY is a collection of manually drawn reference diagrams, or maps, each corresponding to a known biological pathway of functional significance. In addition to the manually drawn reference pathways, there are computationally generated organism-specific pathways.

  1. Access the KEGG PATHWAY database by opening the KEGG PATHWAY link on the KEGG home page or the KEGG2 page.
  2. There are two types of metabolic maps: global maps and regular (or traditional) maps. Click “Citrate cycle (TCA cycle)” in the “1.1 Carbohydrate Metabolism” category to view the regular metabolic map (map00020) shown in Figure 1.12.2.
  3. Elements on a pathway map are represented by different symbols, which can have slightly different meanings in different types of pathway maps.

    a. Rectangles are gene products (proteins), linked to KEGG ORTHOLOGY (KO) entries in reference pathways and to KEGG GENES entries in organism-specific pathways.

    b. Small circles are chemical compounds, glycans, and other molecules, linked to KEGG COMPOUND, KEGG GLYCAN, and other entries.

    c. Large ovals are links to other pathway maps.

    d. Click “Help” for an explanation of the various symbols.

  4. The drop-down menu in the upper left corner is used to select reference pathways and organism names. They are distinguished by a prefix, as in map00020, ko00020, ec00020, rn00020, and hsa00020, and by the color of the rectangles and of the links from the rectangles.

    a. There are four types of reference pathways for metabolic maps, prefixed with map, ko, ec, and rn. Pathways prefixed with ko, ec, and rn are linked to KO, ENZYME, and REACTION entries, respectively, and their rectangles are colored blue. With “Reference pathway (EC)” selected, click the rectangle labeled 2.3.3.1 to see the information for that ENZYME entry. Click the circle labeled citrate to see the information for that compound entry.

    b. The organism drop-down menu can be used to color the parts of the pathway that are known to exist in a given organism. Select, for example, “Homo sapiens (human)” to show in green the parts of the pathway in which human genes are involved. Then click the same rectangle again to see that it is now linked to the corresponding GENES entry.

    c. The drop-down menu also includes “Homo sapiens (human) + Disease/drug”, which shows known disease genes in pink and drug targets in blue.

    d. Because the number of complete genomes is growing rapidly, the organism drop-down menu has become very long. It may be easier to choose the pathway for a specific organism from the “Organism menu” link.

  5. Return to the KEGG PATHWAY database page and click “Metabolic pathways [zoom out]” under the category “0. Global Map”. Global maps are generated by manually combining regular maps to give an overall picture of both primary and secondary metabolism.

    a. There are no rectangles on the global map; instead, edges are linked to KO, ENZYME, REACTION, and GENES entries.

    b. The reference global map is colored according to the metabolism classification (1.1 to 1.11 on the KEGG PATHWAY database page), and organism-specific maps are generated by removing the color from parts that have no corresponding genes.

    c. Select, for example, “Homo sapiens (human)” and then “Arabidopsis thaliana (thale cress)” to see the difference between animal and plant metabolism.
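
The prefix scheme from step 4 (map00020, ko00020, ec00020, rn00020, hsa00020) lends itself to a short helper. This is a sketch that assumes every pathway identifier is a lowercase prefix followed by five digits, and that any prefix other than map, ko, ec, or rn is an organism code. Both function names are hypothetical.

```python
import re

# Reference pathway prefixes and what their nodes link to, as described above.
REFERENCE_PREFIXES = {
    "map": "reference pathway",
    "ko": "reference pathway linked to KO entries",
    "ec": "reference pathway linked to ENZYME entries",
    "rn": "reference pathway linked to REACTION entries",
}

PATHWAY_ID = re.compile(r"([a-z]{2,4})(\d{5})")


def describe_pathway(pathway_id):
    """Split a pathway identifier into its map number and a description."""
    match = PATHWAY_ID.fullmatch(pathway_id)
    if match is None:
        raise ValueError(f"not a pathway identifier: {pathway_id!r}")
    prefix, number = match.groups()
    default = f"organism-specific pathway ({prefix}) linked to GENES entries"
    return number, REFERENCE_PREFIXES.get(prefix, default)


def with_prefix(pathway_id, new_prefix):
    """Return the same map under another prefix, e.g. map00020 -> hsa00020."""
    number, _ = describe_pathway(pathway_id)
    return f"{new_prefix}{number}"


for pathway_id in ("map00020", "ko00020", "ec00020", "rn00020", "hsa00020"):
    print(pathway_id, "->", describe_pathway(pathway_id)[1])
print(with_prefix("map00020", "hsa"))
```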

KEGG PATHWAY: COMPARISON AND COMBINATION OF GENOMES

KEGG metabolic pathway maps, especially global maps, are widely used to study metabolic capabilities inferred from genomic, transcriptomic, metagenomic, and other data, and to compare or combine the metabolic capabilities of multiple organisms. This protocol presents ways to access organism-specific pathways directly.

  1. Three examples are given here. The first is for a single organism.

a. Go to the KEGG home page. Enter an organism code, such as hsa, in the small search box under “Organism-specific entry points” and click “Go”.

b. Alternatively, go to the KEGG2 page, enter the organism code in the “KEGG for specific organisms” section, and click “Go”.

c. The summary page for that organism opens. Click the Pathway link in the navigation bar to see the full set of pathway maps available for that organism.

  2. The second example concerns the comparison or combination of several organisms.

a. Return to the KEGG home page. This time, enter two organism codes separated by a space or joined by a plus sign, such as “hsa ath” or “hsa+ath”.

b. Alternatively, do the same as in the first example on the KEGG2 page, in the “KEGG mapping for genome comparison and combination” section, and click “Go”.

c. The summary page for that set of organisms opens. Click “Pathway maps” in the navigation bar to see the list of available pathways. Select the global metabolism map 01100.

d. The global map is now displayed in three colors: green for pathway elements specific to the first organism (hsa, Homo sapiens), red for elements specific to the second organism (ath, Arabidopsis thaliana), and blue for elements common to the two organisms (Fig. 1.12.3).

  3. The third example concerns a group of organisms, including pangenomes.

a. Open the KEGG home page. Go to “KEGG Organisms” under “Organism-specific entry points”. In the “KEGG Organisms: Complete Genomes” table that appears, click any of the category names, such as “Vertebrates”.

b. To view the pathway maps for the group, open the “Pathway maps” link in the navigation bar. Select, for example, the metabolic map 00010 Glycolysis/Gluconeogenesis. The number of genes corresponding to each node (rectangle) is shown by color gradation.

c. Return to the “KEGG Organisms: Complete Genomes” table. There is a Pan link in the top right corner. Click this link to view the list of KEGG pangenomes. Here you can select any species name, such as “Escherichia coli”, to get a collection of different strains.

d. Alternatively, on the KEGG2 page, enter the name of the category, the group of organisms, or the pangenome species in the “KEGG mapping for genome comparison and combination” section and click “Go”.

e. Color gradation is also used in each of these maps.
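
The three-color scheme from the second example (green for the first organism only, red for the second only, blue for shared elements) amounts to a set comparison. A minimal sketch follows. The element sets are invented placeholders rather than real hsa or ath content, and the function name is hypothetical.

```python
def color_elements(first, second):
    """Color pathway elements by which of two organisms they occur in."""
    first, second = set(first), set(second)
    colors = {}
    for element in first | second:
        if element in first and element in second:
            colors[element] = "blue"    # common to both organisms
        elif element in first:
            colors[element] = "green"   # specific to the first organism
        else:
            colors[element] = "red"     # specific to the second organism
    return colors


# Toy K number sets for two organisms, invented for illustration.
first_organism = {"K00001", "K00002", "K00003"}
second_organism = {"K00002", "K00003", "K00004"}

colors = color_elements(first_organism, second_organism)
for element in sorted(colors):
    print(element, colors[element])
```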

KEGG GENES: GENE CATALOGS OF COMPLETE GENOMES

The KEGG GENES database is a collection of gene catalogs for complete genomes with high-quality sequence data. For prokaryotes, all genomes available from the NCBI RefSeq FTP site ( ftp://ftp.ncbi.nih.gov/genomes/Bacteria/ ) are included in KEGG GENES. For eukaryotes, most of the data come from the RefSeq release ( ftp://ftp.ncbi.nih.gov/refseq/release/ ).

  1. The KEGG GENES database can be accessed from the KEGG home page or the KEGG2 page.
  2. The search box at the top is used to retrieve a GENES entry or to search SSDB (Sequence Similarity Database). Take, for example, the entry for the human cystic fibrosis transmembrane conductance regulator (CFTR) (Fig. 1.12.11).

a. The Entry field contains the number 1080, which in this case is the NCBI Gene ID.

b. The gene names and definition in the next two fields are taken from RefSeq without modification.

c. The Orthology field contains the annotation made by KEGG: the assignment of a KEGG Orthology (KO) group, identified by its K number.

d. The following fields contain links to other KEGG databases with information on the pathways in which the gene product is involved, diseases associated with the gene, drugs that target the gene product, and the BRITE hierarchies used to classify genes and proteins.

e. The SSDB and Motif fields provide search tools for the KEGG SSDB database.

f. The “Other DBs” field provides links to external databases that contain related information. The “All links” field on the right is a summary of links through the GenomeNet LinkDB system. The PDB field contains links to 3D structure data, if any.

g. The “Position” field gives the location of the gene in the genome, and the “Genome map” button, if available, displays the position of the gene on the chromosome map.

h. The AA seq and NT seq fields can be used to extract sequence data for further analysis, such as sequence similarity searches with BLAST or FASTA.

  3. Return to the KEGG GENES database page. The two search fields in the first section of the gene catalogs are used for keyword searches, one across the entire GENES database and the other within a specific organism.
  4. Additional gene catalogs exist: DGENES for draft genomes, EGENES for EST contigs, and MGENES for metagenomes, all of which are annotated automatically, and VGENES, which has no annotation. They are intended to supplement the KEGG set of organisms with complete genomes.
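
The fields of a GENES entry described in step 2 can also be handled programmatically. The sketch below assumes a plain-text layout in which the field keyword occupies the first 12 columns, continuation lines begin with spaces, and a record ends with `///`. This layout is not described in the protocol above. The sample record is hand-written, with placeholder values throughout, and is not actual KEGG output.

```python
# Hand-written sample in the assumed layout; every value is a placeholder.
SAMPLE_ENTRY = """\
ENTRY       gene001           CDS       T00000
ORTHOLOGY   K00000  placeholder orthology definition
POSITION    1
AASEQ       16
            MKTAYIAKQR
            QISFVK
///
"""


def parse_flat_entry(text):
    """Parse one record into {field keyword: [value lines]}."""
    fields = {}
    current = None
    for line in text.splitlines():
        if line.startswith("///"):
            break  # end-of-record marker
        if not line.strip():
            continue
        keyword, value = line[:12].strip(), line[12:].strip()
        if keyword:
            current = keyword  # a new field starts on this line
            fields.setdefault(current, [])
        if current is not None and value:
            fields[current].append(value)
    return fields


def amino_acid_sequence(fields):
    """Join the sequence lines of the AASEQ field (first line is the length)."""
    return "".join(fields.get("AASEQ", [])[1:])


fields = parse_flat_entry(SAMPLE_ENTRY)
sequence = amino_acid_sequence(fields)
print(fields["ORTHOLOGY"][0].split()[0], len(sequence), sequence)
```

The extracted sequence string could then be passed to a similarity search tool, as step 2h suggests.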