Part 1 (HGNC)
In the absence of a universally agreed alternative, the HGNC maintains the definition of a gene as "a DNA segment that contributes to phenotype/function. In the absence of demonstrated function a gene may be characterized by sequence, transcription or homology".
Each gene is assigned only one symbol; the HGNC does not routinely name isoforms (i.e. alternate transcripts or splice variants). This means no separate symbols for protein-coding or non-coding RNA isoforms of a protein-coding locus or alternative transcripts from a non-coding RNA locus. In exceptional circumstances, and following community demand, separate symbols have been approved for gene segments in complex loci, e.g. the UGT1 locus. Putative bicistronic loci may be assigned separate symbols to represent the distinct gene products.
Every gene that we name is assigned a unique symbol, HGNC ID (in the format HGNC:#) and descriptive name. Symbols contain only uppercase Latin letters and Arabic numerals, and punctuation is avoided, with an exception for hyphens in specific groups. Symbols should not be the same as commonly used abbreviations, to facilitate data retrieval. Nomenclature should not contain references to any species or 'G' for gene, nor should it be offensive or pejorative.
Protein coding genes
We aim to name protein-coding genes based on a key normal function of the gene product.
In the absence of functional data, protein-coding genes may be named in the following ways:
Where possible, related genes are named using a common root symbol to enable grouping, typically based on sequence homology, shared function or membership of protein complexes.
For genes involved in specific immune processes, or encoding an enzyme, receptor or ion channel, we consult with specialist nomenclature groups (please see supplementary note at https://www.readcube.com/articles/supplement?doi=10.1038%2Fs41588-020-0669-3&index=0). For other major gene groups we consult a panel of advisors when naming new members and discussing proposed nomenclature updates.
Pseudogenes
We define a pseudogene as a sequence that is incapable of producing a functional protein product but has a high level of homology to a functional gene. In general, we only name pseudogenes that retain homology to a significant proportion of the functional ancestral gene.
Processed pseudogenes are named based on the specific parent gene, with a P and number appended to the parent gene symbol (e.g. NACAP10, "NACA pseudogene 10"). The numbering is usually species-specific.
Pseudogenes that retain most of the coding sequence compared to other family members (and are usually unprocessed) are named as a new family member with a "P" suffix, e.g. DDX12P, "DEAD/H-box helicase 12, pseudogene". This naming format is also used for genes that are pseudogenized relative to their functional ortholog in another species. Note, rarely such pseudogenes do not include the "P" if the symbol is well established, e.g. MMP23A; "matrix metallopeptidase 23A (pseudogene)".
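The two pseudogene formats described above can be sketched in a few lines of Python. This is a minimal illustration, not an HGNC tool: the function names and arguments are hypothetical, and the sketch only assembles the symbol and name strings in the patterns shown by the NACAP10 and DDX12P examples.

```python
from typing import Tuple


def processed_pseudogene(parent_symbol: str, number: int) -> Tuple[str, str]:
    """Symbol and name for a processed pseudogene: parent symbol + 'P' + number."""
    if number < 1:
        raise ValueError("pseudogene number must be a positive integer")
    symbol = f"{parent_symbol}P{number}"
    name = f"{parent_symbol} pseudogene {number}"
    return symbol, name


def family_member_pseudogene(root: str, member: int, family_name: str) -> Tuple[str, str]:
    """Symbol and name for a pseudogene named as a new family member with a 'P' suffix."""
    if member < 1:
        raise ValueError("family member number must be a positive integer")
    symbol = f"{root}{member}P"
    name = f"{family_name} {member}, pseudogene"
    return symbol, name


print(processed_pseudogene("NACA", 10))
print(family_member_pseudogene("DDX", 12, "DEAD/H-box helicase"))
```

The sketch does not cover the rare established symbols that omit the "P", such as MMP23A.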
We name non-coding RNA (ncRNA) genes according to their RNA type; please see our recent review (https://www.embopress.org/doi/full/10.15252/embj.2019103777) for a full description.
For small RNAs where an expert resource exists, we follow their naming conventions as follows:
Other classes of small ncRNAs are named in collaboration with specialist advisors. Major classes of small ncRNA include:
Long non-coding RNAs (lncRNAs) are preferentially given unique symbols based on published function akin to protein-coding genes. LncRNA genes that have been annotated by the RefSeq and GENCODE projects for which no suitable published information on which to base a symbol exists are named in the following systematic way:
Readthrough transcripts are normally produced from adjacent loci and include coding and/or non-coding parts of two (or more) genes. The HGNC only names readthrough transcripts that are consistently annotated by both the RefSeq annotators at NCBI and the GENCODE annotators at Ensembl. These transcripts have the locus type "readthrough transcript" and are symbolized using the two (or more) symbols from the parent genes, separated by a hyphen, e.g. ZNF511-PRAP1, and the name "[symbol] readthrough", e.g. "ZNF511-PRAP1 readthrough". The name may also include additional information about the potential coding status of the transcript, such as "(NMD candidate)".
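As an illustration of the readthrough format, the following Python sketch joins parent symbols with hyphens and builds the corresponding name. The function name is hypothetical, and placing the optional note in parentheses at the end of the name is an assumption based on the "(NMD candidate)" example.

```python
from typing import Optional, Sequence, Tuple


def readthrough(parent_symbols: Sequence[str], note: Optional[str] = None) -> Tuple[str, str]:
    """Build the symbol and name of a readthrough transcript from its parent genes."""
    if len(parent_symbols) < 2:
        raise ValueError("a readthrough transcript spans two or more genes")
    # Parent symbols are joined with hyphens, in the order given.
    symbol = "-".join(parent_symbols)
    name = f"{symbol} readthrough"
    if note:
        # Assumed placement: the note follows the name in parentheses.
        name += f" ({note})"
    return symbol, name


print(readthrough(["ZNF511", "PRAP1"]))
print(readthrough(["ZNF511", "PRAP1"], note="NMD candidate"))
```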
Historically, the HGNC has only approved symbols for genes that are on the human reference genome. Rare exceptions have been made when requested by particular communities with dedicated nomenclature committees, such as the HLA community. Future naming of structural variants will be restricted to those on alternate loci that have been incorporated into the human reference genome by the Genome Reference Consortium (GRC). The underscore character is reserved for genes annotated on alternate reference loci, e.g. C4B_2 is a second copy of C4B on a 6p21.3 alternate reference locus.
Note: The HGNC no longer names phenotypes (please contact OMIM) or genomic regions, nor do we name transposable-element insertions in the human genome. For products of gene translocations or fusions, we recommend the format SYMBOL1/SYMBOL2, to avoid confusion with the SYMBOL1-SYMBOL2 format we approve for readthrough transcripts. Sequence variant nomenclature is the remit of the HGVS. For protein nomenclature, please see the International Protein Nomenclature Guidelines, which were written with the involvement of the HGNC. In agreement with these guidelines, we recommend that "protein and gene symbols should use the same abbreviation", with proteins using non-italicised symbols to differentiate them from genes.
Naming orthologs across species
We recommend that orthologous genes across vertebrate (and where appropriate, non-vertebrate) species should have the same gene symbol. To distinguish the species of origin for homologous genes with the same gene symbol, we recommend citing the NCBI taxonomy ID, as well as the species name or the GenBank common name, e.g. Taxonomy ID: 9598 and either Pan troglodytes or chimpanzee.
The Vertebrate Gene Nomenclature Committee
The Vertebrate Gene Nomenclature Committee (VGNC, https://vertebrate.genenames.org/) is an extension of the HGNC responsible for assigning standardized nomenclature to genes in vertebrate species that currently lack their own nomenclature committee. The VGNC coordinates with the five existing vertebrate nomenclature committees, MGNC (mouse), RGNC (rat), CGNC (chicken), XNC (Xenopus frog) and ZNC (zebrafish), to ensure vertebrate genes are named in line with their human homologs.
Vertebrate orthologs of human C#orf# genes are assigned the human symbol with the other species chromosome number as a prefix and an H denoting human. For example, as the ortholog of human C1orf100 is on cow chromosome 16, the cow symbol is C16H1orf100 with the corresponding gene name "chromosome 16 C1orf100 homolog".
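The C#orf# ortholog rule can be written as a short Python sketch. The function name and the regular expression for human C#orf# symbols are assumptions made for illustration. The output pattern follows the cow example given above.

```python
import re
from typing import Tuple

# Assumed pattern for human placeholder symbols such as C1orf100 or CXorf...
_HUMAN_ORF = re.compile(r"^C(\d{1,2}|X|Y)orf(\d+)$")


def ortholog_orf_symbol(human_symbol: str, species_chromosome: str) -> Tuple[str, str]:
    """Return (symbol, name) for the ortholog of a human C#orf# gene in another species."""
    match = _HUMAN_ORF.match(human_symbol)
    if match is None:
        raise ValueError(f"{human_symbol!r} is not a C#orf# symbol")
    human_chromosome, orf_number = match.groups()
    # Species chromosome as prefix, then 'H' for human and the human designation.
    symbol = f"C{species_chromosome}H{human_chromosome}orf{orf_number}"
    name = f"chromosome {species_chromosome} {human_symbol} homolog"
    return symbol, name


print(ortholog_orf_symbol("C1orf100", "16"))
```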
Gene families with a complex evolutionary history should ideally be named with the help of an expert in the field, as has already been implemented for the olfactory receptor and cytochrome P450 gene families.
Reference: https://rdcu.be/b53pu (PMID 32747822, doi: 10.1038/s41588-020-0669-3).
Gene Nomenclature Part 2 (HGNC)
Definition: A gene is a DNA segment that contributes to phenotype/function. In the absence of demonstrated function a gene may be characterized by sequence, transcription or homology.
1. General Rules for Gene Nomenclature
1.1. Requirements for designation by gene symbol
1.2.1. Gene symbols are designated by upper-case Latin letters or by a combination of upper-case letters and Arabic numbers. Symbols should be short in order to be useful, and should not attempt to represent all known information about a gene. Ideally symbols should be no longer than six characters in length. Based on classical genetic guidelines, gene symbols are always either underlined or italicized when referring to genotypic information (phenotypic information is represented in standard fonts). Exceptions to this rule are in catalogs of known genes, and when fragments or synthesized segments of genes are referred to. New symbols must not duplicate existing gene symbols (check the Genome Database, or the HUGO/GDB Nomenclature Committee list of approved gene symbols).
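A proposed symbol can be screened against the requirements in 1.2.1 with a small check like the one below. This is only a sketch: the function name, the problem labels and the `existing_symbols` set are hypothetical, and a real check would query the approved symbol list rather than an in-memory set.

```python
import re
from typing import Iterable, List

# Upper-case Latin letters, optionally combined with Arabic numbers (at least one letter).
_ALLOWED = re.compile(r"^(?=.*[A-Z])[A-Z0-9]+$")
IDEAL_MAX_LENGTH = 6  # "ideally no longer than six characters"


def check_symbol(symbol: str, existing_symbols: Iterable[str]) -> List[str]:
    """Return a list of problems found in a proposed gene symbol (empty if none)."""
    problems = []
    if not _ALLOWED.match(symbol):
        problems.append("format")
    if len(symbol) > IDEAL_MAX_LENGTH:
        problems.append("length")
    if symbol in set(existing_symbols):
        problems.append("duplicate")
    return problems


print(check_symbol("PGM1", {"ACH"}))
print(check_symbol("ACH", {"ACH"}))
```

The length check is advisory, since the six-character limit is stated as an ideal rather than a requirement.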
1.3.1. Gene names should be brief and specific and should convey the character or function of the gene.
1.4. DNA segments
Part I: D for DNA
Part II: 0,1,2,...22,X,Y,XY for the chromosomal assignment, where XY is for segments homologous on the X and Y chromosomes, and 0 is for unknown chromosomal assignment.
Part III: A symbol indicating the complexity of the DNA segment detected by the probe, with S for a unique DNA segment, and Z for repetitive DNA segments found at a single chromosome site or F for small undefined families of homologous sequences found on multiple chromosomes.
Part IV: 1,2,3,..., a sequential number to give uniqueness to the above concatenated characters.
Part V: When the DNA segment is known to be an expressed sequence the suffix E can be added to indicate this fact.
These numbers can now be generated automatically in the Genome Database, following entry of clone details.
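The five parts above concatenate into a single designation. The following Python sketch shows that assembly. The function name and the input values are illustrative only, and it is not the procedure used by the Genome Database.

```python
VALID_CHROMOSOMES = {str(n) for n in range(0, 23)} | {"X", "Y", "XY"}
VALID_COMPLEXITY = {"S", "Z", "F"}


def dna_segment_symbol(chromosome: str, complexity: str, serial: int,
                       expressed: bool = False) -> str:
    """Concatenate Parts I-V of a DNA segment designation."""
    if chromosome not in VALID_CHROMOSOMES:
        raise ValueError(f"unknown chromosomal assignment: {chromosome!r}")
    if complexity not in VALID_COMPLEXITY:
        raise ValueError(f"complexity must be one of S, Z or F, got {complexity!r}")
    if serial < 1:
        raise ValueError("sequential number must be a positive integer")
    suffix = "E" if expressed else ""  # Part V: expressed sequence
    return f"D{chromosome}{complexity}{serial}{suffix}"


print(dna_segment_symbol("X", "S", 7))
print(dna_segment_symbol("0", "F", 3, expressed=True))
```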
2. Recommendations for symbol construction
2.1. Hierarchical symbols, gene families and series
2.1.1. Every attempt should be made to represent information in a hierarchical form to facilitate retrieval of sets of related genes from computerized databases.
2.2.1. Homologous genes in different species (orthologs) should where possible have the same gene nomenclature.
2.3.1. Predicted genes.
Genes predicted from EST clusters or from genomic sequence alone are regarded as putative, and are designated by the chromosome of origin and arbitrary number. Example: C2ORF1
2.3.2. Pseudogenes
Molecular technology has identified sequences (generally not transcribed) that bear striking homologies to structural gene sequences. These sequences are termed pseudogenes. In order to show the relatedness of pseudogenes to functional genes, pseudogenes will be identified with the gene symbol of the structural gene followed by a P for pseudogene. In order to reserve P for pseudogenes, the use of P as the last character of a structural gene symbol should be avoided where possible. Examples: HBBP1 (hemoglobin, beta pseudogene 1); ACTBP1 (actin, beta pseudogene 1); ACTBP2 (actin, beta pseudogene 2), etc. Pseudogenes may be on different chromosomes or closely linked to the functional gene and occur in varying numbers.
2.3.3. Related sequences
Related sequences identified by cross-hybridisation, and/or by computer searching of sequence databases (BLAST, FASTA), where no other functional information is available for the construction of a symbol, are designated with the symbol of the known gene followed by an L for like (see also homology section 2.3).
2.4. Enzymes and proteins
2.5.1. Inherited clinical disorders (monogenic Mendelian inheritance).
The first gene symbol allocated to an inherited clinical phenotype may be based on an acronym which has been established as a name for the disorder, whilst following the rules described in section 1. Example: ACH for achondroplasia. However, it is usual for this symbol to change when the gene product or function is identified. In some cases a gene symbol based on product or function will already exist, and this will take precedence over the symbol derived from the clinical disorder when the gene descriptions are merged. For example, in the case of achondroplasia the symbol changed to FGFR3 and the name to fibroblast growth factor receptor 3 (achondroplasia, thanatophoric dwarfism).
2.5.2. Complex/polygenic traits
Genome searches may suggest a contributing locus in a complex trait, which may for convenience be given a gene symbol, although a proportion of these will disappear in time. A symbol allocated to such a gene will not be re-used.
2.5.3. Contiguous gene syndromes.
Syndromes clearly associated with multiple loci should not be given gene symbols. Syndromes associated with a regional deletion or duplication may be assigned the letters CR (for chromosome region), in place of S for syndrome. Examples: ANCR (Angelman syndrome chromosome region), DCR (Down syndrome chromosome region). However, as advances in database design have now increased the possible ways of representing this type of information, we recommend that such symbols are now classified as syndromic region symbols and not gene symbols.
2.5.4. Loss of heterozygosity.
A chromosomal region in which the existence of genes may be inferred by loss of heterozygosity can be designated by a symbol consisting of the letters LOH, the chromosome number, CR (for chromosomal region) and then an arbitrary number.
2.6. Letters reserved for specific usage
2.6.1. Certain letters, or combinations of letters, are used as the last letter in a symbol to represent a specific meaning: P for pseudogene (but note also BP for binding protein), L for like (see 2.1.), R for receptor or regulator, and N or NH for inhibitor. The use of these for other meanings should be avoided where possible.
Allele terminology is now the responsibility of the Mutation Database (ref/URL).
4. Printing Gene and Allele Symbols
Gene and allele symbols are underlined in manuscript and italicized in print. Italics need not be used in catalogs. It may be convenient in manuscripts, computer printouts and in printed text to designate a gene symbol by following it with an asterisk (e.g. PGM1*). When only allele symbols are displayed they can be preceded by an asterisk. For example, for PGM1*1, the allele is printed as *1.
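The asterisk convention lends itself to a simple split, sketched below in Python. The function name is hypothetical, and the sketch assumes a designation contains the gene symbol, an asterisk, and optionally the allele.

```python
from typing import Tuple


def split_allele(designation: str) -> Tuple[str, str]:
    """Split a designation such as 'PGM1*1' into the gene symbol and the printed allele."""
    gene, asterisk, allele = designation.partition("*")
    if not asterisk or not gene:
        raise ValueError(f"{designation!r} has no gene symbol followed by an asterisk")
    # A bare 'PGM1*' marks a gene symbol only, so there is no allele to print.
    printed_allele = f"*{allele}" if allele else ""
    return gene, printed_allele


print(split_allele("PGM1*1"))
print(split_allele("PGM1*"))
```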
Table 1: Species Abbreviations
Table 2: Greek-to-Latin alphabet conversion
Table 3: Single-letter amino acid symbols
5. Acknowledgements: The Nomenclature meeting held in Toronto on 5th March 1997 was made possible by the support of the EU, through a contract to HUGO.
6. Reference: J.A. White, P.J. McAlpine, S. Antonarakis, H. Cann, K. Frazer, J. Frezal, D. Lancet, J. Nahmias, P. Pearson, J. Peters, A. Scott, H. and the attendees at the nomenclature meeting 5th of March 1997.