Gene Nomenclature

Part 1 (HGNC)

In the absence of a universally agreed alternative, the HGNC maintains the definition of a gene as "a DNA segment that contributes to phenotype/function. In the absence of demonstrated function a gene may be characterized by sequence, transcription or homology".

Each gene is assigned only one symbol; the HGNC does not routinely name isoforms (i.e. alternate transcripts or splice variants). This means no separate symbols for protein-coding or non-coding RNA isoforms of a protein-coding locus or alternative transcripts from a non-coding RNA locus. In exceptional circumstances, and following community demand, separate symbols have been approved for gene segments in complex loci, e.g. the UGT1 locus. Putative bicistronic loci may be assigned separate symbols to represent the distinct gene products.

Every gene that we name is assigned a unique symbol, HGNC ID (in the format HGNC:#) and descriptive name. Symbols contain only uppercase Latin letters and Arabic numerals, and punctuation is avoided, with an exception for hyphens in specific groups. Symbols should not be the same as commonly used abbreviations, to facilitate data retrieval. Nomenclature should not contain references to any species or 'G' for gene, nor should it be offensive or pejorative.
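As a rough illustration of the format rules in the paragraph above, the following Python sketch checks whether a string has the general shape of a gene symbol or an HGNC ID. The regular expressions and function names are assumptions made for this example, not an official HGNC validator. The lowercase "orf" placeholder symbols (e.g. C17orf50) are treated as a separate case, and symbols containing an underscore for alternate reference loci are not covered.

```python
import re

# Hypothetical patterns built only from the rules stated in the text above.
# Uppercase Latin letters and Arabic numerals, with optional hyphen-separated parts.
GENERAL_SYMBOL = re.compile(r"[A-Z0-9]+(?:-[A-Z0-9]+)*")
# Open-reading-frame placeholders such as C17orf50 use a lowercase "orf".
# Allowing X and Y as the chromosome is an assumption of this sketch.
ORF_SYMBOL = re.compile(r"C(?:\d+|X|Y)orf\d+")
# HGNC IDs have the format HGNC:# where # is a number.
HGNC_ID = re.compile(r"HGNC:\d+")


def is_plausible_symbol(symbol: str) -> bool:
    """Return True if the string has the overall shape of a gene symbol."""
    return bool(GENERAL_SYMBOL.fullmatch(symbol) or ORF_SYMBOL.fullmatch(symbol))


def is_plausible_hgnc_id(identifier: str) -> bool:
    """Return True if the string has the format HGNC:<number>."""
    return bool(HGNC_ID.fullmatch(identifier))
```

A shape check of this kind cannot tell whether a symbol is actually approved; that still requires a lookup against the HGNC database.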

Protein coding genes

We aim to name protein-coding genes based on a key normal function of the gene product.

In the absence of functional data, protein-coding genes may be named in the following ways:

Based on recognized structural domains and motifs encoded by the gene (e.g. BEND7, "BEN domain containing 7")
Based on homologous genes within the human genome (e.g. GPRIN3, "GPRIN family member 3")
Based on homologous genes from another species (e.g. FEM1A, "fem-1 homolog A")
Based only on the presence of an open reading frame (e.g. C17orf50, "chromosome 17 open reading frame 50")

Where possible, related genes are named using a common root symbol to enable grouping, typically based on sequence homology, shared function or membership of protein complexes.

For genes involved in specific immune processes, or encoding an enzyme, receptor or ion channel, we consult with specialist nomenclature groups (please see supplementary note at https://www.readcube.com/articles/supplement?doi=10.1038%2Fs41588-020-0669-3&index=0). For other major gene groups we consult a panel of advisors when naming new members and discussing proposed nomenclature updates.

Pseudogenes

We define a pseudogene as a sequence that is incapable of producing a functional protein product but has a high level of homology to a functional gene. In general, we only name pseudogenes that retain homology to a significant proportion of the functional ancestral gene.

Processed pseudogenes are named based on the specific parent gene, with a P and number appended to the parent gene symbol (e.g. NACAP10, "NACA pseudogene 10"). The numbering is usually species-specific.

Pseudogenes that retain most of the coding sequence compared to other family members (and are usually unprocessed) are named as a new family member with a "P" suffix, e.g. DDX12P, "DEAD/H-box helicase 12, pseudogene". This naming format is also used for genes that are pseudogenized relative to their functional ortholog in another species. Note, rarely such pseudogenes do not include the "P" if the symbol is well established, e.g. MMP23A; "matrix metallopeptidase 23A (pseudogene)".
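The two pseudogene formats described above can be sketched in Python as follows. The function names are hypothetical, and the sketch only assembles strings; it does not decide whether a sequence qualifies as a pseudogene or whether an established symbol should keep its form without the "P".

```python
from typing import Tuple


def processed_pseudogene(parent_symbol: str, number: int) -> Tuple[str, str]:
    """Symbol and name for a processed pseudogene of a given parent gene."""
    symbol = f"{parent_symbol}P{number}"
    name = f"{parent_symbol} pseudogene {number}"
    return symbol, name


def family_member_pseudogene(member_symbol: str, member_name: str) -> Tuple[str, str]:
    """Symbol and name for a pseudogene named as a new family member."""
    return f"{member_symbol}P", f"{member_name}, pseudogene"


# Examples taken from the text above.
print(processed_pseudogene("NACA", 10))
print(family_member_pseudogene("DDX12", "DEAD/H-box helicase 12"))
```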

Non-coding RNA genes

We name non-coding RNA (ncRNA) genes according to their RNA type; please see our recent review (https://www.embopress.org/doi/full/10.15252/embj.2019103777) for a full description.

For small RNAs where an expert resource exists, we follow their naming conventions as follows:

MicroRNAs: miRBase assigns each microRNA stem-loop sequence a symbol in the format "mir-#" and each mature miRNA a symbol in the format "miR-#" followed by a unique sequential number that reflects order of submission to the database. The HGNC then approves a gene symbol for human miRNA genes in the format MIR#; for example, MIR17 represents the miRNA gene, mir-17 represents the stem-loop, and miR-17 represents the mature miRNA.
Transfer RNAs (tRNAs): The genomic tRNA database (GtRNAdb) (http://gtrnadb.ucsc.edu/) assigns a unique ID to each tRNA gene in the format tRNA-[three letter amino acid code]-[anticodon]-[GtRNAdb gene identifier], e.g. tRNA-Ala-AGC-1-1. The HGNC assigns a slightly condensed but equivalent tRNA gene symbol in the format TR[one letter amino acid code]-[anticodon][GtRNAdb gene identifier], e.g. TRA-AGC1-1.
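The mapping from expert-resource identifiers to HGNC symbols described above can be sketched in Python. This is a minimal illustration, assuming that the GtRNAdb ID has exactly five hyphen-separated parts and that the miRBase identifier is purely numeric after the prefix; the function names are hypothetical, and special cases (for example identifiers with letter suffixes or non-standard amino acid codes) are not handled.

```python
import re

# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def trna_symbol(gtrnadb_id: str) -> str:
    """Condense an ID such as tRNA-Ala-AGC-1-1 into the TR... symbol format."""
    parts = gtrnadb_id.split("-")
    if len(parts) != 5 or parts[0] != "tRNA":
        raise ValueError(f"unexpected GtRNAdb ID format: {gtrnadb_id!r}")
    _, amino_acid, anticodon, id_first, id_second = parts
    if amino_acid not in THREE_TO_ONE:
        raise ValueError(f"unsupported amino acid code: {amino_acid!r}")
    return f"TR{THREE_TO_ONE[amino_acid]}-{anticodon}{id_first}-{id_second}"


def mirna_gene_symbol(mirbase_id: str) -> str:
    """Turn a stem-loop (mir-#) or mature (miR-#) identifier into MIR#."""
    match = re.fullmatch(r"mi[rR]-(\d+)", mirbase_id)
    if match is None:
        raise ValueError(f"unexpected miRBase identifier: {mirbase_id!r}")
    return f"MIR{match.group(1)}"


# Examples taken from the text above.
print(trna_symbol("tRNA-Ala-AGC-1-1"))
print(mirna_gene_symbol("mir-17"))
```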

Other classes of small ncRNAs are named in collaboration with specialist advisors. Major classes of small ncRNA include:

Small nuclear RNAs: Named with the root symbol "RNU" for "RNA, U# small nuclear"
Small nucleolar RNAs: Named with root symbols SNORD# for "small nucleolar RNA, C/D box" genes; SNORA# for "small nucleolar RNA, H/ACA box" genes; and SCARNA# for "small Cajal body-specific RNA" genes
Ribosomal RNAs: Named with the root symbols RNA45S, RNA28S, RNA18S, RNA5S, RNA5-8S

Long non-coding RNAs (lncRNAs) are preferentially given unique symbols based on published function akin to protein-coding genes. LncRNA genes that have been annotated by the RefSeq and GENCODE projects for which no suitable published information on which to base a symbol exists are named in the following systematic way:

LncRNAs that are intergenic with respect to protein coding genes are assigned the root symbol LINC followed by a 5-digit number, e.g. LINC01018
LncRNAs that are antisense to the genomic span of a protein coding gene are assigned the symbol format [protein coding gene symbol]-AS# e.g. FAS-AS1
LncRNAs that are divergent to (share a bidirectional promoter with) a protein coding gene are assigned the symbol format [protein coding gene symbol]-DT e.g. ABCF1-DT
LncRNAs that are contained within an intron of a protein coding gene on the same strand are assigned the symbol format [protein coding gene symbol]-IT# e.g. AOAH-IT1
LncRNAs that overlap a protein coding gene on the same strand are assigned the symbol format [protein coding gene symbol]-OT# e.g. C5-OT1
LncRNAs that contain microRNA or snoRNA genes within introns or exons are named as host genes e.g. MIR17HG, SNHG7
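The systematic lncRNA formats listed above lend themselves to simple pattern matching. The Python sketch below guesses the category of a systematically named lncRNA from its symbol alone; the patterns, category labels and function name are assumptions made for this example. Symbols based on published function will not match any pattern and return None.

```python
import re
from typing import Optional

# Hypothetical patterns derived from the symbol formats listed above.
LNCRNA_PATTERNS = [
    ("intergenic", re.compile(r"LINC\d{5}")),
    ("antisense", re.compile(r".+-AS\d+")),
    ("divergent", re.compile(r".+-DT")),
    ("intronic", re.compile(r".+-IT\d+")),
    ("overlapping", re.compile(r".+-OT\d+")),
    ("host gene", re.compile(r"MIR\w+HG|SNHG\d+")),
]


def classify_lncrna(symbol: str) -> Optional[str]:
    """Return the systematic lncRNA category suggested by the symbol, if any."""
    for category, pattern in LNCRNA_PATTERNS:
        if pattern.fullmatch(symbol):
            return category
    return None


# Examples taken from the list above.
for example in ["LINC01018", "FAS-AS1", "ABCF1-DT", "AOAH-IT1", "C5-OT1", "MIR17HG", "SNHG7"]:
    print(example, classify_lncrna(example))
```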

Readthrough transcripts

Readthrough transcripts are normally produced from adjacent loci and include coding and/or non-coding parts of two (or more) genes. The HGNC only names readthrough transcripts that are consistently annotated by both the RefSeq annotators at NCBI and the GENCODE annotators at Ensembl. These transcripts have the locus type "readthrough transcript" and are symbolized using the two (or more) symbols from the parent genes, separated by a hyphen, e.g. ZNF511-PRAP1, and the name "[symbol] readthrough", e.g. "ZNF511-PRAP1 readthrough". The name may also include additional information about the potential coding status of the transcript, such as "(NMD candidate)".

Historically, the HGNC has only approved symbols for genes that are on the human reference genome. Rare exceptions have been made when requested by particular communities with dedicated nomenclature committees, such as the HLA community. Future naming of structural variants will be restricted to those on alternate loci that have been incorporated into the human reference genome by the Genome Reference Consortium (GRC). The underscore character is reserved for genes annotated on alternate reference loci, e.g. C4B_2 is a second copy of C4B on a 6p21.3 alternate reference locus.

Note: HGNC no longer name phenotypes (please contact OMIM) or genomic regions, nor do we name transposable-element insertions in the human genome. For products of gene translocations or fusions, we recommend the format SYMBOL1/SYMBOL2, to avoid confusion with the SYMBOL1-SYMBOL2 format we approve for readthrough transcripts. Sequence variant nomenclature is the remit of the HGVS. For protein nomenclature, please see the International Protein Nomenclature Guidelines, which were written with the involvement of the HGNC. In agreement with these guidelines, we recommend that "protein and gene symbols should use the same abbreviation", with proteins using non-italicised symbols to differentiate them from genes.
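To make the contrast between the readthrough and fusion formats concrete, here is a minimal Python sketch. The function names are hypothetical and the code only assembles strings from parent symbols that are assumed to be approved already.

```python
from typing import Optional


def readthrough_symbol(*parent_symbols: str) -> str:
    """Join two or more parent gene symbols with hyphens."""
    if len(parent_symbols) < 2:
        raise ValueError("a readthrough transcript needs at least two parent genes")
    return "-".join(parent_symbols)


def readthrough_name(symbol: str, note: Optional[str] = None) -> str:
    """Build the name "[symbol] readthrough", with an optional bracketed note."""
    name = f"{symbol} readthrough"
    return f"{name} ({note})" if note else name


def fusion_label(symbol1: str, symbol2: str) -> str:
    """Recommended format for products of gene translocations or fusions."""
    return f"{symbol1}/{symbol2}"


symbol = readthrough_symbol("ZNF511", "PRAP1")
print(symbol)
print(readthrough_name(symbol))
print(readthrough_name(symbol, "NMD candidate"))
print(fusion_label("SYMBOL1", "SYMBOL2"))
```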

Naming orthologs across species

We recommend that orthologous genes across vertebrate (and where appropriate, non-vertebrate) species should have the same gene symbol. To distinguish the species of origin for homologous genes with the same gene symbol, we recommend citing the NCBI taxonomy ID, as well as the species name or the GenBank common name, e.g. Taxonomy ID: 9598 and either Pan troglodytes or chimpanzee.

The Vertebrate Gene Nomenclature Committee

The Vertebrate Gene Nomenclature Committee (VGNC, https://vertebrate.genenames.org/) is an extension of the HGNC responsible for assigning standardized nomenclature to genes in vertebrate species that currently lack their own nomenclature committee. The VGNC coordinates with the five existing vertebrate nomenclature committees, MGNC (mouse), RGNC (rat), CGNC (chicken), XNC (Xenopus frog) and ZNC (zebrafish), to ensure vertebrate genes are named in line with their human homologs.

Vertebrate orthologs of human C#orf# genes are assigned the human symbol with the other species chromosome number as a prefix and an H denoting human. For example, as the ortholog of human C1orf100 is on cow chromosome 16, the cow symbol is C16H1orf100 with the corresponding gene name "chromosome 16 C1orf100 homolog".
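The construction of ortholog symbols for C#orf# genes can be sketched in Python as below. The function names and the regular expression are assumptions made for this illustration (including the acceptance of X and Y as human chromosomes); the sketch does not establish orthology, it only formats the result.

```python
import re
from typing import Union

# Human placeholder symbols of the form C<chromosome>orf<number>.
HUMAN_ORF = re.compile(r"C(\d+|X|Y)orf(\d+)")


def ortholog_orf_symbol(human_symbol: str, species_chromosome: Union[int, str]) -> str:
    """Prefix the other species' chromosome and insert H to denote human."""
    match = HUMAN_ORF.fullmatch(human_symbol)
    if match is None:
        raise ValueError(f"not a C#orf# symbol: {human_symbol!r}")
    human_chromosome, orf_number = match.groups()
    return f"C{species_chromosome}H{human_chromosome}orf{orf_number}"


def ortholog_orf_name(human_symbol: str, species_chromosome: Union[int, str]) -> str:
    """Build the corresponding gene name."""
    return f"chromosome {species_chromosome} {human_symbol} homolog"


# Cow example from the text above: the ortholog of C1orf100 is on chromosome 16.
print(ortholog_orf_symbol("C1orf100", 16))
print(ortholog_orf_name("C1orf100", 16))
```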

Gene families with a complex evolutionary history should ideally be named with the help of an expert in the field, as has already been implemented for the olfactory receptor and cytochrome P450 gene families.

Reference: https://rdcu.be/b53pu (PMID 32747822, doi: 10.1038/s41588-020-0669-3).

Gene Nomenclature Part 2 (HGNC)

TABLE OF CONTENTS

1. General Rules for Gene Nomenclature
1.1. Requirements for designation by gene symbol
1.2. Gene symbols
1.3. Gene names
1.4. DNA segments
2. Recommendations for symbol construction
2.1. Hierarchical symbols, gene families and series
2.2. Homologies with other species
2.3. Genes identified from sequence information
2.4. Enzymes and proteins
2.5. Clinical disorders
2.6. Letters reserved for specific usage
3. Allele terminology
4. Printing Gene and Allele Symbols
5. Acknowledgements
6. Reference

Definition: A gene is a DNA segment that contributes to phenotype/function. In the absence of demonstrated function a gene may be characterized by sequence, transcription or homology.

1. General Rules for Gene Nomenclature

1.1. Requirements for designation by gene symbol

1.1.1. A gene symbol may be used to designate a clearly defined phenotype shown to be inherited as a monogenic Mendelian trait. (Example: TSC1).
1.1.2. Gene symbols may be allocated to as yet unidentified genes contributing to a complex trait shown by linkage or association with a known marker (for example IDDM6).
1.1.3. A gene symbol may be used to designate a cloned segment of DNA with sufficient structural, functional, and expression data to identify it as a transcribed entity. However, alternate transcripts from the same gene should not in general be given different gene symbols.
1.1.4. Gene symbols are also allocated to non-functional copies of genes (pseudogenes).
1.1.5. Genes encoded by the opposite (anti-sense) strand of a known gene will be given their own symbols.
1.1.6. A gene symbol may be given to a transcribed but untranslated DNA segment, e.g. XIST.
1.1.7. A cellular phenotype from which the existence of a gene or genes can be inferred may have its own designation. Example: LOH#CR#.
1.1.8. If insufficient data are available to allocate a unique and meaningful gene symbol, a putative gene may be designated by the symbol C#ORF#. This symbol will also be used for EST clusters. Other fragments of expressed sequence will be designated by a D-number.

1.2. Gene symbols

1.2.1. Gene symbols are designated by upper case Latin letters or by a combination of upper-case letters and Arabic numbers. Symbols should be short in order to be useful, and should not attempt to represent all known information about a gene. Ideally symbols should be no longer than six characters in length. Based on classical genetic guidelines, gene symbols are always either underlined or italicized when referring to genotypic information (phenotypic information is represented in standard fonts). Exceptions to this rule are in catalogs of known genes, and when fragments or synthesized segments of genes are referred to. New symbols must not duplicate existing gene symbols (check the Genome Database, or the HUGO/GDB Nomenclature Committee list of approved gene symbols).
1.2.2. The initial character of the symbol should always be a letter. Subsequent characters may be other letters, or if necessary, Arabic numerals.
1.2.3. All characters of the symbol should be written on the same line; no superscripts or subscripts may be used.
1.2.4. No Roman numerals may be used. Roman numbers in previously used symbols should be changed to their Arabic equivalents.
1.2.5. Greek letters are not used in gene symbols. All Greek letters should be changed to letters in the Latin alphabet (Table 2).
1.2.6. A Greek letter prefixing a gene name must be changed to its Latin alphabet equivalent and placed at the end of the gene symbol. This permits alphabetical ordering of the gene in listings with similar properties, such as substrate specificities. Examples: GLA (galactosidase, alpha); GLB (galactosidase, beta).
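Rules 1.2.5 and 1.2.6 can be illustrated with a short Python sketch that moves a Greek-letter prefix to the end of the name and appends its Latin equivalent to the symbol. The mapping reproduces Table 2 of this document; the function names and the idea of passing a separate root symbol are assumptions of this example.

```python
# Greek letter names mapped to Latin upper case letters, as listed in Table 2.
GREEK_TO_LATIN = {
    "alpha": "A", "beta": "B", "gamma": "G", "delta": "D", "epsilon": "E",
    "zeta": "Z", "eta": "H", "theta": "Q", "iota": "I", "kappa": "K",
    "lambda": "L", "mu": "M", "nu": "N", "xi": "X", "omicron": "O",
    "pi": "P", "rho": "R", "sigma": "S", "tau": "T", "upsilon": "Y",
    "phi": "F", "chi": "C", "psi": "U", "omega": "W",
}


def symbol_with_greek_suffix(root: str, greek_name: str) -> str:
    """Append the Latin equivalent of a Greek letter to a root symbol."""
    return root.upper() + GREEK_TO_LATIN[greek_name.lower()]


def name_with_greek_last(name: str) -> str:
    """Move a leading Greek letter name to the end of the gene name."""
    first, _, rest = name.partition(" ")
    if first.lower() in GREEK_TO_LATIN and rest:
        return f"{rest}, {first.lower()}"
    return name


# Examples matching GLA and GLB from rule 1.2.6.
print(symbol_with_greek_suffix("GL", "alpha"), name_with_greek_last("alpha galactosidase"))
print(symbol_with_greek_suffix("GL", "beta"), name_with_greek_last("beta galactosidase"))
```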

1.3. Gene names

1.3.1. Gene names should be brief and specific and should convey the character or function of the gene.
1.3.2. The first letter of the symbol should be the same as that of the name in order to facilitate alphabetical listing and grouping, with the exception of the abbreviations noted in 2.6.2.

1.4. DNA segments

The following guidelines determine each part of the symbol:

Part I: D for DNA

Part II: 0,1,2,...22,X,Y,XY for the chromosomal assignment, where XY is for segments homologous on the X and Y chromosomes, and 0 is for unknown chromosomal assignment.

Part III: A symbol indicating the complexity of the DNA segment detected by the probe, with S for a unique DNA segment, and Z for repetitive DNA segments found at a single chromosome site or F for small undefined families of homologous sequences found on multiple chromosomes.

Part IV: 1,2,3,..., a sequential number to give uniqueness to the above concatenated characters.

Part V: When the DNA segment is known to be an expressed sequence the suffix E can be added to indicate this fact.

These numbers can now be generated automatically in the Genome Database, following entry of clone details.
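The five-part structure of a DNA segment symbol can be decoded with a short Python sketch. The regular expression, the class and the function name are assumptions made for illustration, and the example symbols in the usage lines are hypothetical inputs chosen only to exercise the format.

```python
import re
from typing import NamedTuple


class DnaSegment(NamedTuple):
    chromosome: str   # Part II: 0-22, X, Y or XY
    complexity: str   # Part III: meaning of S, Z or F
    number: int       # Part IV: sequential number
    expressed: bool   # Part V: True if the E suffix is present


COMPLEXITY = {
    "S": "unique DNA segment",
    "Z": "repetitive DNA segment at a single chromosome site",
    "F": "small undefined family found on multiple chromosomes",
}

# Part I is the leading D; XY is tried before X and Y so that it is not split.
D_NUMBER = re.compile(r"D(XY|X|Y|\d{1,2})([SZF])(\d+)(E?)")


def parse_dna_segment(symbol: str) -> DnaSegment:
    """Split a D-number into the parts described above."""
    match = D_NUMBER.fullmatch(symbol)
    if match is None:
        raise ValueError(f"not a DNA segment symbol: {symbol!r}")
    chromosome, code, number, suffix = match.groups()
    if chromosome.isdigit() and int(chromosome) > 22:
        raise ValueError(f"chromosome out of range: {chromosome}")
    return DnaSegment(chromosome, COMPLEXITY[code], int(number), suffix == "E")


# Hypothetical inputs.
print(parse_dna_segment("D1S80"))
print(parse_dna_segment("DXYS1E"))
```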

2. Recommendations for symbol construction

2.1. Hierarchical symbols, gene families and series

2.1.1. Every attempt should be made to represent information in a hierarchical form to facilitate retrieval of sets of related genes from computerized databases.
2.1.2. Where gene products of similar function are encoded by different genes, the corresponding loci are designated by Arabic numerals placed immediately after the gene symbol, without any space between the letters and numbers used. Examples: PGM1, PGM2, PGM3 (three loci for phosphoglucomutase activity); ADH1, ADH2, ADH3 (three alcohol dehydrogenase loci); HBA1, HBA2 (duplicated forms of the alpha-hemoglobin gene). However, if they exist historically, single-letter suffixes may be used to designate these different loci. Example: LDHA, LDHB, LDHC (three lactate dehydrogenase loci).
2.1.3. A final character in the gene symbol may be used to specify a characteristic of the gene. While letters to specify tissue distribution have been used historically, Arabic numbers are now preferred as experience has shown that tissue specificity may not be as restricted as described initially.
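The retrieval benefit of hierarchical symbols mentioned in 2.1.1 can be shown with a small Python sketch that groups numbered symbols by their shared root. The function name is hypothetical, and the approach is deliberately naive: it only strips a trailing number, so series that use single-letter suffixes (such as LDHA, LDHB, LDHC) are not grouped.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# A root followed by a trailing Arabic number, e.g. PGM + 1.
TRAILING_NUMBER = re.compile(r"(.+?)(\d+)")


def group_by_root(symbols: Iterable[str]) -> Dict[str, List[str]]:
    """Group symbols that share the same root before a trailing number."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for symbol in symbols:
        match = TRAILING_NUMBER.fullmatch(symbol)
        root = match.group(1) if match else symbol
        groups[root].append(symbol)
    return {root: sorted(members) for root, members in groups.items()}


# Examples taken from rule 2.1.2.
print(group_by_root(["PGM1", "ADH1", "PGM2", "HBA1", "PGM3", "HBA2", "ADH2", "ADH3"]))
```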

2.2. Homologies with other species

2.2.1. Homologous genes in different species (orthologs) should where possible have the same gene nomenclature.
2.2.2. Human homologs of genes first identified in other species should not be designated by a symbol beginning with H for human.
2.2.3. When a locus or series of genes has been defined in one species, and it is reasonable to expect that in the future a homologous gene will be identified in man, we recommend that the designated symbol be reserved for the human loci. We recommend that this should be done in other species, for genes first identified in human.
2.2.4. When necessary to distinguish the species of origin for homologous genes with the same gene symbol, the three-letter code for different species already established by the Committee on Standardization in Human Cytogenetics (see Table 1), is recommended. The code is for use in publications only and not incorporated as part of the gene symbol. The species designation is added as a prefix to the gene symbol. For example HSA signifies Homo sapiens and MMU stands for Mus musculus. Examples of using the species designation with the gene symbol; human loci: (HSA)G6PD; (HSA)HBB; (HSA)ALB; homologous mouse loci: (MMU)G6pd; (MMU)Hbb; (MMU)Alb.
2.2.5. The agreement between human and mouse gene nomenclature for many homologous gene loci should be continued and extended to other species where possible.
2.2.6. Human homologs of genes in invertebrate, or prokaryote species, may be represented by the symbol used in the other species followed by an L to represent like. The use of H to represent homolog is no longer recommended, and will be discontinued.

2.3. Genes identified from sequence information

2.3.1. Predicted genes.

Genes predicted from EST clusters or from genomic sequence alone are regarded as putative, and are designated by the chromosome of origin and arbitrary number. Example: C2ORF1

2.3.2. Pseudogenes

Molecular technology has identified sequences (generally not transcribed) that bear striking homologies to structural gene sequences. These sequences are termed pseudogenes. In order to show the relatedness of pseudogenes to functional genes, pseudogenes will be identified with the gene symbol of the structural gene followed by a P for pseudogene. In order to reserve P for pseudogenes, the use of P as the last character of a structural gene symbol should be avoided where possible. Examples: HBBP1 (hemoglobin, beta pseudogene 1); ACTBP1 (actin, beta pseudogene 1); ACTBP2 (actin, beta pseudogene 2), etc. Pseudogenes may be on different chromosomes or closely linked to the functional gene and occur in varying numbers.

2.3.3. Related sequences

Related sequences identified by cross-hybridisation and/or by computer searching of sequence databases (BLAST, FASTA), where no other functional information is available for the construction of a symbol, are designated with the symbol of the known gene followed by an L for like (see also homology section 2.2).

2.4. Enzymes and proteins

2.4.1. Names of genes coding for enzymes are based on those recommended by the Nomenclature Committee of the International Union of Biochemistry. Names of plasma proteins, hemoglobins, and specialized proteins are based on standard names and those recommended by their respective committees (??refs).

2.5. Clinical disorders

2.5.1. Inherited clinical disorders (monogenic Mendelian inheritance).

The first gene symbol allocated to an inherited clinical phenotype may be based on an acronym which has been established as a name for the disorder, whilst following the rules described in section 1. Example: ACH for achondroplasia. However, it is usual for this symbol to change when the gene product or function is identified. In some cases a gene symbol based on product or function will already exist, and this will take precedence over the symbol derived from the clinical disorder when the gene descriptions are merged. For example, in the case of achondroplasia the symbol changed to FGFR3 and the name to fibroblast growth factor receptor 3 (achondroplasia, thanatophoric dwarfism).

2.5.2. Complex/polygenic traits

Genome searches may suggest a contributing locus in a complex trait, which may for convenience be given a gene symbol, although a proportion of these will disappear in time. A symbol allocated to such a gene will not be re-used.

2.5.3. Contiguous gene syndromes.

Syndromes clearly associated with multiple loci should not be given gene symbols. Syndromes associated with a regional deletion or duplication may be assigned the letters CR (for chromosome region), in place of S for syndrome. Examples: ANCR (Angelman syndrome chromosome region), DCR (Down syndrome chromosome region). However, as advances in database design have now increased the possible ways of representing this type of information, we recommend that such symbols are now classified as syndromic region symbols and not gene symbols.

2.5.4. Loss of heterozygosity.

A chromosomal region in which the existence of genes may be inferred by loss of heterozygosity can be designated by a symbol consisting of the letters LOH, the chromosome number, CR (for chromosomal region) and then an arbitrary number.

2.6. Letters reserved for specific usage

2.6.1. Certain letters, or combinations of letters, are used at the end of a symbol to represent a specific meaning: P for pseudogene (but note also BP for binding protein), L for like (see 2.2.6 and 2.3.3), R for receptor or regulator, N or NH for inhibitor. The use of these for other meanings should be avoided where possible.
2.6.2. If the name of a gene contains a character or property for which there is a recognized abbreviation, the abbreviation should be used. Example: the single-letter abbreviation for amino acids (Table 3) used in aminoacyl residues, or approved biochemical abbreviations such as GLC for glucose and GSH for glutathione.

3. Allele terminology

Allele terminology is now the responsibility of the Mutation Database (ref/URL)

4. Printing Gene and Allele Symbols

Gene and allele symbols are underlined in manuscript and italicized in print. Italics need not be used in catalogs. It may be convenient in manuscripts, computer printouts and in printed text to designate a gene symbol by following it with an asterisk (e.g. PGM1*). When only allele symbols are displayed they can be preceded by an asterisk. For example, for PGM1*1, the allele is printed as *1.
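The asterisk convention described above can be sketched in Python as follows. The function names are hypothetical; the sketch assumes a single asterisk separates the gene symbol from an optional allele symbol.

```python
from typing import Optional, Tuple


def split_designation(text: str) -> Tuple[str, Optional[str]]:
    """Split GENE* or GENE*ALLELE into its gene and allele parts."""
    gene, separator, allele = text.partition("*")
    if not separator or not gene:
        raise ValueError(f"expected GENE* or GENE*ALLELE, got {text!r}")
    return gene, (allele or None)


def allele_only(text: str) -> str:
    """Display only the allele symbol, preceded by an asterisk."""
    _, allele = split_designation(text)
    if allele is None:
        raise ValueError(f"no allele symbol in {text!r}")
    return f"*{allele}"


# Examples taken from the text above.
print(split_designation("PGM1*"))
print(split_designation("PGM1*1"))
print(allele_only("PGM1*1"))
```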

Table 1: Species Abbreviations

	Abbreviation	Species
	HSA	Homo sapiens
	PTR	Pan troglodytes (chimpanzee)
	GGO	Gorilla gorilla
	PPY	Pongo pygmaeus (orangutan)
	MMU	Mus musculus
	RNO	Rattus norvegicus
	MML	Macaca mulatta (Rhesus monkey)
	CAE	Cercopithecus aethiops (African green monkey)
	PPA	Papio papio (baboon)
	FCA	Felis catus (cat)
	CGR	Cricetulus griseus (hamster)
	OOV	Ovis aries (sheep)
	BBO	Bos taurus (cattle)
	SSC	Sus scrofa (pig)
	OCU	Oryctolagus cuniculus (rabbit)
	MRU	Macropus rufus (red kangaroo)

Table 2: Greek-to-Latin alphabet conversion

Greek letter	Name	Latin upper case conversion
α	alpha	A
β	beta	B
γ	gamma	G
δ	delta	D
ε	epsilon	E
ζ	zeta	Z
η	eta	H
θ	theta	Q
ι	iota	I
κ	kappa	K
λ	lambda	L
μ	mu	M
ν	nu	N
ξ	xi	X
ο	omicron	O
π	pi	P
ρ	rho	R
σ	sigma	S
τ	tau	T
υ	upsilon	Y
φ	phi	F
χ	chi	C
ψ	psi	U
ω	omega	W

Table 3: Single-letter amino acid symbols

Amino acid	Three-letter symbol	One-letter symbol
Alanine	Ala	A
Arginine	Arg	R
Asparagine	Asn	N
Aspartic acid	Asp	D
Asn + Asp	Asx	B
Cysteine	Cys	C
Glutamine	Gln	Q
Glutamic acid	Glu	E
Gln + Glu	Glx	Z
Glycine	Gly	G
Histidine	His	H
Isoleucine	Ile	I
Leucine	Leu	L
Lysine	Lys	K
Methionine	Met	M
Phenylalanine	Phe	F
Proline	Pro	P
Serine	Ser	S
Threonine	Thr	T
Tryptophan	Trp	W
Tyrosine	Tyr	Y
Valine	Val	V

5. Acknowledgements: The Nomenclature meeting held in Toronto on 5th March 1997 was made possible by the support of the EU, through a contract to HUGO.

6. Reference: J.A. White, P.J. McAlpine, S. Antonarakis, H. Cann, K. Frazer, J. Frezal, D. Lancet, J. Nahmias, P. Pearson, J. Peters, A. Scott, H. and the attendees at the nomenclature meeting 5th of March 1997.