Yana Slesarenko

Gene Ontology​

The mission of the GO Consortium is to develop a comprehensive computational model of biological systems, from the molecular to the organismal level, for many species on the tree of life.


How GO works

The PubMed literature database contains more than 15 million articles, and no one can read that much information. One way bioinformaticians have addressed this problem is through the discipline of ontology. An ontology lets you store experimental data so that it forms a formal, structured, stable representation of the reality that underlies biological science. From a biologist’s point of view, the development of bio-ontologies makes it easier to analyze very large datasets.

Model organism database curators and genome annotation centers use an ontology such as GO to create annotations that capture information about the contribution of gene products to biological systems in a form accessible to computational algorithms. Since such annotations are integral to the use of bio-ontologies, it is important to understand how the curation process works. We therefore describe how GO annotations illustrate important aspects of this process.

Each term in the Gene Ontology has a number of attributes: a unique numeric identifier, a name, the ontology (namespace) to which the term belongs, and a definition. Terms can have synonyms, which are classified as exact, broad, narrow, or related to the term. Attributes such as references to sources, cross-references to other databases, and comments on the meaning and usage of the term may also be present. The ontology is structured as a directed acyclic graph: each term is linked to one or more other terms through relationships of various types:

  • “A is a B” – A is a special case of B,
  • “A part of B” – A is part of B,
  • “B has part A” – B includes A,
  • “A regulates B” – A regulates B,
  • “A positively regulates B” – A positively regulates B,
  • “A negatively regulates B” – A negatively regulates B,
  • “A occurs in B” – A occurs at B.
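
The graph structure described above can be sketched in a few lines of Python. This is only an illustration: the edge list, the term names and the `ancestors` helper are assumptions made for the example, not part of any GO release or GO software.

```python
from collections import defaultdict

# Toy edges (child, relation, parent). The term names are illustrative
# assumptions, not an extract from a GO release.
EDGES = [
    ("nucleolus", "part_of", "nucleus"),
    ("nucleus", "is_a", "organelle"),
    ("organelle", "part_of", "cell"),
    ("mitochondrion", "is_a", "organelle"),
]


def ancestors(term, edges, relations=("is_a", "part_of")):
    """Return every term reachable from `term` by following the given relations upward."""
    parents = defaultdict(set)
    for child, relation, parent in edges:
        if relation in relations:
            parents[child].add(parent)

    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in found:  # a DAG node can be reached by several paths
                found.add(parent)
                stack.append(parent)
    return found


print(sorted(ancestors("nucleolus", EDGES)))
print(sorted(ancestors("nucleolus", EDGES, relations=("is_a",))))
```

Restricting the `relations` argument shows why the relationship type matters: following only “is a” edges gives a different set of ancestors than following “is a” and “part of” together.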

Briefly, the annotation process unfolds in several stages. First, experiments documented in the biomedical literature are identified as relevant to the curation responsibilities of a particular curator. Second, the curator applies expert knowledge to document the results of each selected experiment. This entails determining which gene products are studied in the experiment, the nature of the experiment itself, and the molecular functions, biological processes, and cellular components that the experiment identifies as being correlated with the gene product. The curator then creates an annotation that captures the appropriate relationships between the corresponding ontology types. Finally, annotation quality control processes are used to ensure that the annotation has the correct formal structure, to evaluate the consistency of annotations among curators and curation teams, and to collect the knowledge gained from annotation activity so that it can contribute to the refinement and extension of GO itself and, increasingly, of other ontologies as well.

The main goal of the GO annotation effort is to create genome-specific annotations supported by evidence from experiments performed on the annotated organism. However, many annotations are derived from experiments done on other organisms, or are not derived from experiments at all but from knowledge of the sequence features of the gene in question. Such information is also captured in GO annotations using the appropriate evidence codes. It is therefore important for users of such annotations to understand what these codes reflect: either the annotation is based on experimental data supporting the statement, or it is a prediction based on structural similarity. Experimentally verified and computationally predicted GO annotations can be told apart by the evidence code recorded in the annotation file.

The decision on which GO term to use in an annotation depends on several factors. The experiment itself limits the resolution of what can be learned from its results. For example, cell fractionation can localize protein molecules to the cell nucleus, whereas immunolocalization experiments can localize molecules of the same protein type to the nucleolus. As a result, the same gene can have annotations to different terms in the same ontology, since the annotations are based on different experiments. Regular consistency checks are carried out to keep annotations consistent. Where inconsistencies are identified, the GOC takes steps to address them, working with the relevant curators and, where appropriate, subject matter experts. Limitations of experimental methods may lead curators to draw on their own scientific background when choosing a term. It is important to keep in mind that the choice of GO term is sometimes made by a curator’s inference based on his or her prior knowledge.

Molecular Function Annotation


Example: the Adh1 gene (alcohol dehydrogenase 1) → its gene product, alcohol dehydrogenase 1 (class I) → the molecular function “alcohol dehydrogenase activity”.

The term “activity” is used here in its biochemical sense and is more appropriately read as “potential activity”.

Note that although the same string “alcohol dehydrogenase” is used in both the name of the gene and the molecular function, the string itself refers to different entities: in the former case, the type of molecule; in the latter, to the type of function that the molecule tends to perform. This ambiguity is rooted in the tendency to name molecules based on the functions they perform, and it is important to understand this distinction because the name of a molecule and the molecular function to which the molecule is assigned do not necessarily match.

If we say that the product of a gene can potentially perform a specific function, this does not mean that it will actually perform it. For example, molecules of the mouse Zp2 gene product are found in the oocyte and tend to bind molecules of the Acr gene product during fertilization. If the egg is never fertilized, the molecules still exist and still have the tendency to bind, but the binding function is never performed.

Biological Process Annotation

Molecular function is the enduring potential of a gene product to act in a certain way. A biological process is the execution of one or more molecular function instances working together to achieve a specific biological goal. Molecular functions and biological processes are therefore connected.

From the point of view of gene annotation, what interests us is that the molecules of a gene product can be associated with instances of a molecular function (known or unknown) whose execution contributes to the occurrence of a biological process. Such type-to-type relationships can be inferred because experiments are designed to test what happens when certain biological conditions are satisfied under typical circumstances, that is, circumstances in which, thanks to the experimenter’s efforts, perturbing events do not intervene. Experiments are designed to be reproducible and predictive, describing cases that one would expect to find in biological systems that meet certain conditions. If future experiments show

Such annotations sometimes reveal errors in the type-to-type relationships described in the ontology. One example is the removal of serotonin secretion as an is_a child of neurotransmitter secretion in the GO Biological Process ontology. This change was made as a result of an annotation from a paper showing that serotonin can be secreted by cells of the immune system, where it does not act as a neurotransmitter.

Cellular Component Annotation

In the overwhelming majority of cases, annotations linking a gene product to cellular component types are made on the basis of direct observation of a cellular component instance under a microscope. For example, an experiment is reported in which an antibody that recognizes products of the Atp1a1 gene is used to mark the location of instances of those products in mouse preimplantation embryos. Fluorescent staining shows that the gene products are located at the plasma membrane of the cells of these embryos. In this case, the gene product instances are the molecules bound by the fluorescent antibodies, and the cellular component instance is the plasma membrane observed under the microscope. Accordingly, the curator used the results of this experiment to annotate the Atp1a1 gene product to the GO cellular component term “plasma membrane”. As with molecular function and biological process, there is also a relationship between molecular function and cellular component. It is easy to assume that if a gene product molecule is found in an instance of a given cellular component, then this gene product can potentially perform its function in that cellular component as well. If the execution of the function is indeed detected there, then we can generalize about the type of molecular function and the type of cellular component.

As with molecular function and biological process, experimental evidence for annotations of molecular function and cellular component can often be separated. Therefore, from a practical point of view, these ontologies are also developed separately.

Experimental Annotations​

Evidence codes indicate the kind of support behind an annotation, and a dedicated ontology within the OBO project describes them. It covers various annotation methods, both manual and automatic. For example:

  • IDA (Inferred from Direct Assay) – based on a direct experimental assay.
  • TAS (Traceable Author Statement) – based on an author statement that can be traced to its source, for example in a review article.
  • IMP (Inferred from Mutant Phenotype) – based on a mutant phenotype.
  • IGI (Inferred from Genetic Interaction) – based on a genetic interaction.
  • IPI (Inferred from Physical Interaction) – based on a physical interaction.
  • RCA (Inferred from Reviewed Computational Analysis) – based on a reviewed computational analysis.
  • ISS (Inferred from Sequence or Structural Similarity) – based on sequence or structural similarity.
  • IGC (Inferred from Genomic Context) – based on the genomic context.
  • IEP (Inferred from Expression Pattern) – based on an expression pattern.
  • NAS (Non-traceable Author Statement) – based on an author statement that cannot be traced to another publication.
  • IEA (Inferred from Electronic Annotation) – assigned automatically, without review by a curator.
  • IC (Inferred by Curator) – inferred by the curator.
  • ND (No biological Data available) – no biological data are available.
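
As a rough sketch, the codes in this list can be grouped the way the following sections of this article group them: experimental, curated non-experimental, and automatically assigned. The set names and the `evidence_category` function are assumptions made for this example and cover only the codes listed above.

```python
# Grouping of the evidence codes listed above. The three group names are an
# assumption of this sketch and only cover the codes mentioned in the list.
EXPERIMENTAL = {"IDA", "IMP", "IGI", "IPI", "IEP"}
CURATED_NON_EXPERIMENTAL = {"ISS", "IGC", "RCA", "TAS", "NAS", "IC", "ND"}
AUTOMATIC = {"IEA"}


def evidence_category(code):
    """Map an evidence code to a coarse category, or 'unknown' if not listed."""
    code = code.strip().upper()
    if code in EXPERIMENTAL:
        return "experimental"
    if code in CURATED_NON_EXPERIMENTAL:
        return "curated non-experimental"
    if code in AUTOMATIC:
        return "automatic"
    return "unknown"


for code in ("IDA", "ISS", "IEA", "XYZ"):
    print(code, "->", evidence_category(code))
```

A helper like this makes it easy to separate experimentally supported annotations from predicted ones before an analysis.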

Curated non-experimental annotations​

Part of the evidence comes from manually curated, non-experimental annotations. Each of these annotations is reviewed by a curator, but they are not experimental in the sense that no direct experimental evidence in the primary literature underlies them; instead, curators derive them from various kinds of analyses.

ISS (Inferred from Sequence or Structural Similarity) is a superclass (i.e. parent) of the evidence codes ISA (Inferred from Sequence Alignment), ISO (Inferred from Sequence Orthology) and ISM (Inferred from Sequence Model). Each of the three ISS subcategories should be used when only that one method was used for the inference. For example, to increase the accuracy of function transfer by sequence similarity, many methods take into account the evolutionary relationships between genes. Most of these methods rely on orthology (ISO evidence code), because the function of orthologs tends to be more conserved across species than that of paralogs.

Another approach to function prediction involves supervised machine learning based on features derived from a protein sequence (ISM evidence code). This approach uses a training set of classified sequences to learn features that can be used to infer the functions of genes.

IGC (Inferred from Genomic Context) covers, among other things, the identity of the genes neighboring the gene in question (i.e. synteny), operon structure, and genome-wide phylogenetic or other analyses.

Relatively new are four evidence codes associated with phylogenetic analysis. IBA (Inferred from Biological aspect of Ancestor) and IBD (Inferred from Biological aspect of Descendant) indicate annotations that are propagated along a gene tree. The loss of an active site, a binding site, or a domain critical for a particular function can be annotated with the IKR (Inferred from Key Residues) evidence code. Finally, negative annotations can be assigned to highly divergent sequences using the IRD (Inferred from Rapid Divergence) code.

RCA (Inferred from Reviewed Computational Analysis) captures annotations derived from predictions based on computational analyses of large-scale experimental datasets, or on computational analyses that combine multiple types of data, including experimental data (e.g., expression data, protein-protein interaction data, genetic interaction data), sequence data (e.g., promoter sequences, sequence-based structural predictions), or mathematical models.

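Further, there are two types of annotations derived from author statements. A Traceable Author Statement (TAS) refers to a paper that cites the result but not the original evidence itself, such as a review article. A Non-traceable Author Statement (NAS) refers to a statement in a paper that cannot be traced to another publication.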

The last two evidence codes for curated non-experimental annotations are IC (Inferred by Curator) and ND (No biological Data available). If a GO term is assigned using the curator’s expertise, by inference from the context of the available data but without any direct evidence, the IC evidence code is used.

The ND evidence code indicates that the function is currently unknown (i.e., that no characterization of the gene is currently available). Such an annotation is made to the root of the respective ontology to indicate which functional aspect is unknown.

Auto-assigned annotations

The IEA evidence code (Inferred from Electronic Annotation) is used for all inferences made without human supervision, regardless of the method used. IEA is by far the most widely used evidence code. The guiding idea behind computational function annotation is that genes with similar sequences or structures are likely to be evolutionarily related and thus, assuming they have largely retained their ancestral function, may still perform similar functional roles today.

Changes over time in the number of annotations to the GO term “ATPase activity” (GO:0016887). Use an up-to-date version of the ontology and annotations, and make sure that the conclusions drawn still hold with the latest data. Graph obtained from GOTrack (http://www.chibi.ubc.ca/gotrack).

Online interfaces for data access

The following are online interfaces for accessing and interacting with data using standard web browsers. Most GO users can use data browsers such as AmiGO, QuickGO, and data browsers built into more specific databases.

AmiGO (http://amigo.geneontology.org) is the official open-source web tool for querying, browsing and visualizing the Gene Ontology and the annotations collected from the MODs (model organism databases), UniProtKB and other sources (for a complete list of member organizations that currently contribute to the GOC, see http://geneontology.org/page/go-consortium-contributors-list). Notable features include basic search, browsing, the ability to upload custom datasets, and more.

The Gene Ontology Annotation (GOA) project at the European Bioinformatics Institute of the European Molecular Biology Laboratory (EMBL-EBI) also provides the QuickGO browser (http://www.ebi.ac.uk/QuickGO), a web tool that lets you easily browse the Gene Ontology (GO) and all associated electronic and manual GO annotations provided by the GO Consortium annotation groups.

AmiGO and QuickGO use the same GO datasets with slightly different implementations depending on the requirements of funding sources and respective users. AmiGO as a whole is a product of the GO Consortium and the official channel for the distribution of GO datasets in accordance with the NHGRI-NIH funding guidelines. QuickGO is produced, operated and financed by EMBL-EBI; members of the QuickGO leadership team are also members of the GOC.

AmiGO (A) and QuickGO (B) browser pages

Term Enrichment Tool

  1. Open “Inferred annotation” and select [+] next to “epithelial cell differentiation”

  2. Remove the text filter by clicking the [x] next to the text entry.

This will leave the user with all GO annotations directly or indirectly annotated with “epithelial cell differentiation” (GO:0009913) that are not human data and have some kind of experimental data associated with them.

The annotation process captures the activity and localization of a gene product using GO terms, providing a reference and indicating, by means of evidence codes, the type of evidence available to support the assignment of each term. Currently, the main format for annotation information in GO is the gene association file (GAF, http://geneontology.org/page/go-annotation-file-formats). This is a standardized file format that members of the Consortium use to submit data. Annotation data are stored in simple tab-delimited text files, where each line represents a single association between a gene product and a GO term, together with an evidence code, a reference supporting the association, and other information. The GAF format exists in several “flavors”, the most recent version being 2.1. More recently, GPAD/GPI files have been developed that are essentially a normalized version of the GAF information. They are expected to become more popular in the future, and more information about them can be found on the GO website (http://geneontology.org/page/go-annotation-file-formats).
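
To make the tab-delimited layout concrete, here is a minimal sketch of reading GAF 2.1 lines in Python. It assumes the 17-column GAF 2.1 layout and keeps only a few of the columns; the `Annotation` type, the `parse_gaf` function and the sample record (database name, identifiers, reference) are all invented for illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class Annotation(NamedTuple):
    """A subset of the GAF 2.1 columns (column numbers are 1-based in the format)."""
    db: str          # column 1
    object_id: str   # column 2
    symbol: str      # column 3
    qualifier: str   # column 4
    go_id: str       # column 5
    reference: str   # column 6
    evidence: str    # column 7
    aspect: str      # column 9: F, P or C
    taxon: str       # column 13


def parse_gaf(lines: Iterable[str]) -> Iterator[Annotation]:
    """Yield one Annotation per data line, skipping blank lines and '!' comment lines."""
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 15:  # too short to hold the columns read below
            continue
        yield Annotation(cols[0], cols[1], cols[2], cols[3], cols[4],
                         cols[5], cols[6], cols[8], cols[12])


# Invented sample record: every identifier below is a placeholder.
SAMPLE = [
    "!gaf-version: 2.1",
    "\t".join([
        "EXAMPLEDB", "EX:0001", "geneX", "", "GO:0016887", "PMID:0000000",
        "IDA", "", "F", "hypothetical protein X", "", "protein",
        "taxon:10090", "20200101", "EXAMPLEDB", "", "",
    ]),
]

for annotation in parse_gaf(SAMPLE):
    print(annotation.symbol, annotation.go_id, annotation.evidence, annotation.aspect)
```

For real work, an established GAF parser is a better choice than hand-written splitting; the sketch only shows that each line is one gene product to GO term association with its evidence code and reference.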

Gene-category analysis is a very well-known use case for the Gene Ontology. Not surprisingly, users can choose from a variety of software implementations that perform this kind of analysis. For example, the current version of the Gene Ontology Consortium website (geneontology.org) provides access to an analysis based on Fisher’s exact test right on its front page. There are also graphical tools that integrate into existing frameworks, such as BiNGO; standalone graphical clients, such as Ontologizer; and Bioconductor packages, such as topGO, mgsa or gCMAP.
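
The idea behind a Fisher’s exact test for one GO term can be sketched as a one-sided hypergeometric tail probability. This is a minimal illustration, not the implementation used by any of the tools named above; the function name, its parameters and the numbers in the usage example are assumptions.

```python
from math import comb


def enrichment_p_value(study_hits, study_size, population_hits, population_size):
    """One-sided Fisher's exact test for over-representation of a single term.

    Returns the probability of seeing at least `study_hits` annotated genes in a
    random sample of `study_size` genes drawn from a population of
    `population_size` genes, of which `population_hits` carry the term.
    """
    max_hits = min(study_size, population_hits)
    total = comb(population_size, study_size)
    tail = sum(
        comb(population_hits, k)
        * comb(population_size - population_hits, study_size - k)
        for k in range(study_hits, max_hits + 1)
    )
    return tail / total


# Made-up numbers: 20 genes in the population, 5 of them annotated to the term,
# and a study set of 5 genes that are all annotated to it.
print(enrichment_p_value(5, 5, 5, 20))
```

In practice the test is repeated for many terms, so the resulting p-values need a multiple-testing correction, which this sketch leaves out.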

To enable a structured description of the experimental, computational and other types of evidence that support claims recorded in scientific databases, the Evidence and Conclusion Ontology (ECO) was created (http://evidenceontology.org). ECO describes several types of evidence, including evidence from experimental (i.e., wet lab) methods, evidence derived from computational methods, claims made by authors (whether or not supported by evidence), and conclusions drawn by researchers curating the literature. In addition to summarizing the evidence supporting a particular claim, ECO also offers a means to document whether a computer or a human performed the annotation. Including ECO in an annotation system allows the structure of the ontology to be exploited: related data can be grouped hierarchically, users can select data associated with specific types of evidence, and quality control pipelines can be optimized. Today, more than 30 resources, including the Gene Ontology, use ECO to represent both the evidence and the way annotations are made.

A simplified representation of ECO with a general structure. ECO includes two root classes along with their respective hierarchies, evidence (black terms) and assertion method (pink terms)

ECO also includes evidence types such as “curator inference” and “author statement”.

In addition to describing evidence, ECO can also describe the means by which claims are made, i.e. by a human or by a machine. ECO calls this the “assertion method” and defines it as “the means by which an assertion is made about an object”. For example, if a curator makes an annotation after reading an experimental result in a scientific paper, or after manually evaluating the results of a pairwise sequence alignment, ECO can indicate that a manual curation method was used. Conversely, if an algorithm was used to assign a predicted function to a protein, ECO can indicate that an automated computational method was used. Thus, “assertion method” forms a second root class with two branches: “manual assertion” and “automatic assertion”.
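
The pairing of an evidence type with an assertion method can be modeled as a small data structure. The class below is a hypothetical sketch, not an ECO API; the label pattern follows ECO term names of the form “traceable author statement used in manual assertion”.

```python
from dataclasses import dataclass

ASSERTION_METHODS = ("manual", "automatic")


@dataclass(frozen=True)
class EvidenceAssertion:
    """Hypothetical pairing of an evidence type with an assertion method."""
    evidence: str
    assertion_method: str

    def __post_init__(self):
        # Only the two branches of the assertion method class are accepted.
        if self.assertion_method not in ASSERTION_METHODS:
            raise ValueError(f"unknown assertion method: {self.assertion_method!r}")

    def label(self):
        """Build a label in the style of ECO's combined term names."""
        return f"{self.evidence} used in {self.assertion_method} assertion"


term = EvidenceAssertion("traceable author statement", "manual")
print(term.label())
```

Keeping the two parts separate lets a user filter annotations either by the kind of evidence or by who (or what) made the assertion.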

The current version of ECO includes 630 terms, each describing an “evidence” type, an “assertion method”, or a cross product “evidence × assertion method”.

For example, the user enters the word “proteolysis” in the query field (Fig. 1a) and sees the number of matches (Fig. 1b). Then, after clicking on “Annotations” in the blue box, the user sees all the terms associated with annotations that matched “proteolysis” (Fig. 2a, b). Clicking on “Evidence” in the filter box (Fig. 2a) expands it to display all the constituent types of evidence (Fig. 3).

Clicking on “traceable author statement used in manual assertion” opens the subset of results that match this more stringent filter (Fig. 4). The evidence filter field now says “Nothing to filter” (Fig. 5).

Summary. The aim of the Gene Ontology (GO) project is to provide a unified way to describe the functions of the gene products of organisms in all kingdoms of life and thus make it possible to analyze genomic data. This is an ongoing process as our understanding of biology grows and improves. GO is a computational model of biological reality, and we hope that every researcher will be happy to contribute to it and will consider it the best means of sharing the knowledge gained in the course of their own research.
