Gene Nomenclature

[raw]

Part 1 (HGNC)

In the absence of a universally agreed alternative, the HGNC maintains the definition of a gene as “a DNA segment that contributes to phenotype/function. In the absence of demonstrated function a gene may be characterized by sequence, transcription or homology”.

Each gene is assigned only one symbol; the HGNC does not routinely name isoforms (i.e. alternate transcripts or splice variants). This means no separate symbols for protein-coding or non-coding RNA isoforms of a protein-coding locus or alternative transcripts from a non-coding RNA locus. In exceptional circumstances, and following community demand, separate symbols have been approved for gene segments in complex loci, e.g. the UGT1 locus. Putative bicistronic loci may be assigned separate symbols to represent the distinct gene products.

Every gene that we name is assigned a unique symbol, HGNC ID (in the format HGNC:#) and descriptive name. Symbols contain only uppercase Latin letters and Arabic numerals, and punctuation is avoided, with an exception for hyphens in specific groups. Symbols should not be the same as commonly used abbreviations, to facilitate data retrieval. Nomenclature should not contain references to any species or ‘G’ for gene, nor should it be offensive or pejorative.

Protein coding genes

We aim to name protein-coding genes based on a key normal function of the gene product.

In the absence of functional data, protein-coding genes may be named in the following ways:

  1. Based on recognized structural domains and motifs encoded by the gene (e.g. BEND7, “BEN domain containing 7”)
  2. Based on homologous genes within the human genome (e.g. GPRIN3, “GPRIN family member 3”)
  3. Based on homologous genes from another species (e.g. FEM1A, “fem-1 homolog A”)
  4. Based only on the presence of an open reading frame (e.g. C17orf50, “chromosome 17 open reading frame 50”)

Where possible, related genes are named using a common root symbol to enable grouping, typically based on sequence homology, shared function or membership of protein complexes.

For genes involved in specific immune processes, or encoding an enzyme, receptor or ion channel, we consult with specialist nomenclature groups (please see supplementary note at https://www.readcube.com/articles/supplement?doi=10.1038%2Fs41588-020-0669-3&index=0). For other major gene groups we consult a panel of advisors when naming new members and discussing proposed nomenclature updates.

Pseudogenes

We define a pseudogene as a sequence that is incapable of producing a functional protein product but has a high level of homology to a functional gene. In general, we only name pseudogenes that retain homology to a significant proportion of the functional ancestral gene.

Processed pseudogenes are named based on the specific parent gene, with a P and number appended to the parent gene symbol (e.g. NACAP10, “NACA pseudogene 10”). The numbering is usually species-specific.

Pseudogenes that retain most of the coding sequence compared to other family members (and are usually unprocessed) are named as a new family member with a “P” suffix, e.g. DDX12P, “DEAD/H-box helicase 12, pseudogene”. This naming format is also used for genes that are pseudogenized relative to their functional ortholog in another species. Note that, in rare cases, such pseudogenes do not include the “P” if the symbol is well established, e.g. MMP23A, “matrix metallopeptidase 23A (pseudogene)”.

Non-coding RNA genes

We name non-coding RNA (ncRNA) genes according to their RNA type; please see our recent review (https://www.embopress.org/doi/full/10.15252/embj.2019103777) for a full description.

For small RNAs where an expert resource exists, we follow their naming conventions as follows:

MicroRNAs
miRBase assigns each microRNA stem-loop sequence a symbol in the format “mir-#” and each mature miRNA a symbol in the format “miR-#” followed by a unique sequential number that reflects order of submission to the database. The HGNC then approves a gene symbol for human miRNA genes in the format MIR#; for example, MIR17 represents the miRNA gene, mir-17 represents the stem-loop, and miR-17 represents the mature miRNA.
Transfer RNAs (tRNAs)
The genomic tRNA database (GtRNAdb, http://gtrnadb.ucsc.edu/) assigns a unique ID to each tRNA gene in the format tRNA-[three letter amino acid code]-[anticodon]-[GtRNAdb gene identifier], e.g. tRNA-Ala-AGC-1-1. The HGNC assigns a slightly condensed but equivalent tRNA gene symbol in the format TR[one letter amino acid code]-[anticodon][GtRNAdb gene identifier], e.g. TRA-AGC1-1.
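
The tRNA mapping described above can be sketched in a few lines of Python. This is only an illustration: the function name, the regular expression and the shape assumed for the GtRNAdb gene identifier (two numbers joined by a hyphen, as in the example) are assumptions, and only the 20 standard three-letter amino acid codes are handled.

```python
import re

# Three-letter to one-letter codes for the 20 standard amino acids only.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# Assumed ID shape: tRNA-[three letter code]-[anticodon]-[number]-[number]
GTRNADB_ID = re.compile(r"^tRNA-([A-Za-z]{3})-([ACGT]{3})-(\d+-\d+)$")


def gtrnadb_to_hgnc(gtrnadb_id: str) -> str:
    """Condense a GtRNAdb-style ID into the TR[aa]-[anticodon][id] format."""
    match = GTRNADB_ID.match(gtrnadb_id)
    if match is None:
        raise ValueError(f"unrecognised GtRNAdb ID: {gtrnadb_id!r}")
    aa3, anticodon, gene_id = match.groups()
    if aa3 not in THREE_TO_ONE:
        raise ValueError(f"amino acid code not handled in this sketch: {aa3!r}")
    return f"TR{THREE_TO_ONE[aa3]}-{anticodon}{gene_id}"


print(gtrnadb_to_hgnc("tRNA-Ala-AGC-1-1"))  # TRA-AGC1-1
```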

Other classes of small ncRNAs are named in collaboration with specialist advisors. Major classes of small ncRNA include:

Small nuclear RNAs
Named with the root symbol “RNU” for “RNA, U# small nuclear”
Small nucleolar RNAs
Named with root symbols SNORD# for “small nucleolar RNA, C/D box” genes; SNORA# for “small nucleolar RNA, H/ACA box” genes; and SCARNA# for “small Cajal body-specific RNA” genes
Ribosomal RNAs
Named with the root symbols RNA45S, RNA28S, RNA18S, RNA5S, RNA5-8S

Long non-coding RNAs (lncRNAs) are preferentially given unique symbols based on published function akin to protein-coding genes. LncRNA genes that have been annotated by the RefSeq and GENCODE projects for which no suitable published information on which to base a symbol exists are named in the following systematic way:

  • LncRNAs that are intergenic with respect to protein coding genes are assigned the root symbol LINC followed by a 5-digit number, e.g. LINC01018
  • LncRNAs that are antisense to the genomic span of a protein coding gene are assigned the symbol format [protein coding gene symbol]-AS# e.g. FAS-AS1
  • LncRNAs that are divergent to (share a bidirectional promoter with) a protein coding gene are assigned the symbol format [protein coding gene symbol]-DT e.g. ABCF1-DT
  • LncRNAs that are contained within an intron of a protein coding gene on the same strand are assigned the symbol format [protein coding gene symbol]-IT# e.g. AOAH-IT1
  • LncRNAs that overlap a protein coding gene on the same strand are assigned the symbol format [protein coding gene symbol]-OT# e.g. C5-OT1
  • LncRNAs that contain microRNA or snoRNA genes within introns or exons are named as host genes e.g. MIR17HG, SNHG7
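
As a rough illustration of the systematic lncRNA formats listed above, the following Python sketch guesses the category from the shape of a symbol. The patterns and category labels are assumptions made for this example; in particular, the host gene test (a trailing HG, or SNHG plus a number) is a heuristic based only on the two examples given, and a pattern match says nothing about whether a symbol is actually approved.

```python
import re

# (label, pattern) pairs, checked in order; all patterns are illustrative.
LNCRNA_PATTERNS = [
    ("intergenic", re.compile(r"^LINC\d{5}$")),
    ("antisense", re.compile(r"^.+-AS\d+$")),
    ("divergent", re.compile(r"^.+-DT$")),
    ("intronic", re.compile(r"^.+-IT\d+$")),
    ("overlapping", re.compile(r"^.+-OT\d+$")),
    ("host gene", re.compile(r"^(?:.+HG|SNHG\d+)$")),
]


def classify_lncrna_symbol(symbol: str):
    """Return the first matching category label, or None if nothing matches."""
    for label, pattern in LNCRNA_PATTERNS:
        if pattern.match(symbol):
            return label
    return None


for example in ["LINC01018", "FAS-AS1", "ABCF1-DT", "AOAH-IT1", "C5-OT1", "MIR17HG", "SNHG7"]:
    print(example, classify_lncrna_symbol(example))
```

Symbols based on published function do not follow any of these formats, so the function returns None for them.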

Readthrough transcripts

Readthrough transcripts are normally produced from adjacent loci and include coding and/or non-coding parts of two (or more) genes. The HGNC only names readthrough transcripts that are consistently annotated by both the RefSeq annotators at NCBI and the GENCODE annotators at Ensembl. These transcripts have the locus type “readthrough transcript” and are symbolized using the two (or more) symbols from the parent genes, separated by a hyphen, e.g. ZNF511-PRAP1, and the name “[symbol] readthrough”, e.g. “ZNF511-PRAP1 readthrough”. The name may also include additional information about the potential coding status of the transcript, such as “(NMD candidate)”.
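
The readthrough format can be shown with a short Python sketch. The function names are hypothetical, and the code only assembles strings in the format described above; it does not check that the parent symbols are approved or that the transcript is annotated by RefSeq and GENCODE.

```python
def readthrough_symbol(parent_symbols):
    """Join two or more parent gene symbols with hyphens."""
    if len(parent_symbols) < 2:
        raise ValueError("a readthrough transcript needs at least two parent genes")
    return "-".join(parent_symbols)


def readthrough_name(parent_symbols, note=None):
    """Build the '[symbol] readthrough' name, with an optional bracketed note."""
    name = f"{readthrough_symbol(parent_symbols)} readthrough"
    return f"{name} ({note})" if note else name


print(readthrough_symbol(["ZNF511", "PRAP1"]))  # ZNF511-PRAP1
print(readthrough_name(["ZNF511", "PRAP1"]))    # ZNF511-PRAP1 readthrough
print(readthrough_name(["ZNF511", "PRAP1"], note="NMD candidate"))
```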

Historically, the HGNC has only approved symbols for genes that are on the human reference genome. Rare exceptions have been made when requested by particular communities with dedicated nomenclature committees, such as the HLA community. Future naming of structural variants will be restricted to those on alternate loci that have been incorporated into the human reference genome by the Genome Reference Consortium (GRC). The underscore character is reserved for genes annotated on alternate reference loci, e.g. C4B_2 is a second copy of C4B on a 6p21.3 alternate reference locus.

Note: The HGNC no longer names phenotypes (please contact OMIM) or genomic regions, nor do we name transposable-element insertions in the human genome. For products of gene translocations or fusions, we recommend the format SYMBOL1/SYMBOL2, to avoid confusion with the SYMBOL1-SYMBOL2 format we approve for readthrough transcripts. Sequence variant nomenclature is the remit of the HGVS. For protein nomenclature, please see the International Protein Nomenclature Guidelines, which were written with the involvement of the HGNC. In agreement with these guidelines, we recommend that “protein and gene symbols should use the same abbreviation”, with proteins using non-italicised symbols to differentiate them from genes.

Naming orthologs across species

We recommend that orthologous genes across vertebrate (and where appropriate, non-vertebrate) species should have the same gene symbol. To distinguish the species of origin for homologous genes with the same gene symbol, we recommend citing the NCBI taxonomy ID, as well as the species name or the GenBank common name, e.g. Taxonomy ID: 9598 and either Pan troglodytes or chimpanzee.

The Vertebrate Gene Nomenclature Committee

The Vertebrate Gene Nomenclature Committee (VGNC, https://vertebrate.genenames.org/) is an extension of the HGNC responsible for assigning standardized nomenclature to genes in vertebrate species that currently lack their own nomenclature committee. The VGNC coordinates with the five existing vertebrate nomenclature committees, MGNC (mouse), RGNC (rat), CGNC (chicken), XNC (Xenopus frog) and ZNC (zebrafish), to ensure vertebrate genes are named in line with their human homologs.

Vertebrate orthologs of human C#orf# genes are assigned the human symbol with the other species' chromosome number as a prefix and an H denoting human. For example, as the ortholog of human C1orf100 is on cow chromosome 16, the cow symbol is C16H1orf100 with the corresponding gene name “chromosome 16 C1orf100 homolog”.
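
The C#orf# ortholog rule can be expressed as a small Python sketch. The function name and the regular expression for human C#orf# symbols are assumptions made for this illustration.

```python
import re

# Assumed shape of a human placeholder symbol: C[chromosome]orf[number]
HUMAN_ORF = re.compile(r"^C(\d+|X|Y)orf(\d+)$")


def orf_ortholog(human_symbol: str, species_chromosome: str):
    """Return (symbol, name) for the ortholog of a human C#orf# gene."""
    match = HUMAN_ORF.match(human_symbol)
    if match is None:
        raise ValueError(f"not a C#orf# symbol: {human_symbol!r}")
    human_chromosome, orf_number = match.groups()
    # Other species' chromosome first, then H for human, then the human part.
    symbol = f"C{species_chromosome}H{human_chromosome}orf{orf_number}"
    name = f"chromosome {species_chromosome} {human_symbol} homolog"
    return symbol, name


print(orf_ortholog("C1orf100", "16"))
# ('C16H1orf100', 'chromosome 16 C1orf100 homolog')
```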

Gene families with a complex evolutionary history should ideally be named with the help of an expert in the field, as has already been implemented for the olfactory receptor and cytochrome P450 gene families.

Reference: https://rdcu.be/b53pu (PMID 32747822, doi: 10.1038/s41588-020-0669-3).

Gene Nomenclature Part 2 (HGNC)

TABLE OF CONTENTS

1. General Rules for Gene Nomenclature

1.1. Requirements for designation by gene symbol

1.2. Gene symbols

1.3. Gene names

1.4. DNA segments

2. Recommendations for symbol construction

2.1. Hierarchical symbols, gene families and series

2.2. Homologies with other species

2.3. Genes identified from sequence information

2.4. Enzymes and proteins

2.5. Clinical disorders

2.6. Letters reserved for specific usage


3. Allele terminology

4. Printing Gene and Allele Symbols

5. Acknowledgements

6. Reference

Definition: A gene is a DNA segment that contributes to
phenotype/function. In the absence of demonstrated function a gene may be
characterized by sequence, transcription or homology.

1. General Rules for Gene Nomenclature
1.1. Requirements for designation by gene symbol

1.1.1. A gene symbol may be used to designate a clearly defined phenotype shown to be inherited as a monogenic Mendelian trait. (Example: TSC1).
1.1.2. Gene symbols may be allocated to as yet unidentified genes contributing
to a complex trait shown by linkage or association with a known marker (for
example IDDM6).
1.1.3. A gene symbol may be used to designate a cloned segment of DNA with
sufficient structural, functional, and expression data to identify it as a
transcribed entity. However, alternate transcripts from the same gene should not in general be given different gene symbols.
1.1.4. Gene symbols are also allocated to non-functional copies of genes (pseudogenes).
1.1.5. Genes encoded by the opposite (anti-sense) strand of a known gene will be given their own symbols.
1.1.6. A gene symbol may be given to a transcribed but untranslated DNA segment, e.g. XIST.
1.1.7. A cellular phenotype from which the existence of a gene or genes can be inferred may have its own designation. Example: LOH#CR#.
1.1.8. If insufficient data are available to allocate a unique and meaningful
gene symbol, a putative gene may be designated by the symbol C#ORF#. This
symbol will also be used for EST clusters. Other fragments of expressed
sequence will be designated by a D-number.

1.2. Gene symbols

1.2.1. Gene symbols are designated by upper case Latin letters or by a
combination of upper-case letters and Arabic numbers. Symbols should be short
in order to be useful, and should not attempt to represent all known
information about a gene. Ideally symbols should be no longer than six
characters in length. Based on classical genetic guidelines, gene symbols are
always either underlined or italicized when referring to genotypic information (phenotypic information is represented in standard fonts). Exceptions to this rule are in catalogs
of known genes, and when fragments or synthesized segments of genes are
referred to. New symbols must not duplicate existing gene symbols (check the
Genome Database, or the HUGO/GDB Nomenclature Committee list of approved gene
symbols).
1.2.2. The initial character of the symbol should always be a letter. Subsequent
characters may be other letters, or if necessary, Arabic numerals.
1.2.3. All characters of the symbol should be written on the same line; no
superscripts or subscripts may be used.
1.2.4. No Roman numerals may be used. Roman numbers in previously used symbols
should be changed to their Arabic equivalents.
1.2.5. Greek letters are not used in gene symbols. All Greek letters should be
changed to letters in the Latin alphabet (Table 2).
1.2.6. A Greek letter prefixing a gene name must be changed to its Latin
alphabet equivalent and placed at the end of the gene symbol. This permits
alphabetical ordering of the gene in listings with similar properties, such as
substrate specificities. Examples: GLA (galactosidase, alpha); GLB
(galactosidase, beta).
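
The symbol rules in section 1.2 lend themselves to a simple check. The Python sketch below is only an illustration: the function names and messages are invented for this example, the six-character limit is treated as advice rather than an error, and Roman numerals are not detected because they cannot be told apart from ordinary letters. The Greek-to-Latin mapping follows Table 2, and the root GL for galactosidase is an assumed abbreviation inferred from the GLA and GLB examples.

```python
import re

# Greek letter names mapped to their Latin conversions (as listed in Table 2).
GREEK_TO_LATIN = {
    "alpha": "A", "beta": "B", "gamma": "G", "delta": "D", "epsilon": "E",
    "zeta": "Z", "eta": "H", "theta": "Q", "iota": "I", "kappa": "K",
    "lambda": "L", "mu": "M", "nu": "N", "xi": "X", "omicron": "O",
    "pi": "P", "rho": "R", "sigma": "S", "tau": "T", "upsilon": "Y",
    "phi": "F", "chi": "C", "psi": "U", "omega": "W",
}


def check_symbol(symbol: str):
    """Return a list of rule violations (an empty list means none found)."""
    problems = []
    if not re.fullmatch(r"[A-Z0-9]+", symbol):
        problems.append("use only upper-case Latin letters and Arabic numerals")
    if not symbol[:1].isalpha():
        problems.append("the first character should be a letter")
    if len(symbol) > 6:
        problems.append("ideally no longer than six characters")
    return problems


def append_greek(root: str, greek_name: str) -> str:
    """Place the Latin equivalent of a Greek prefix at the end of the symbol."""
    return root + GREEK_TO_LATIN[greek_name.lower()]


print(check_symbol("PGM1"))         # []
print(check_symbol("1PGM"))         # ['the first character should be a letter']
print(append_greek("GL", "alpha"))  # GLA
print(append_greek("GL", "beta"))   # GLB
```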


1.3. Gene names

1.3.1. Gene names should be brief and specific and should convey the character or function of the gene.
1.3.2. The first letter of the symbol should be the same as that of the name in order to facilitate alphabetical listing and grouping, with the exception of the abbreviations noted in 2.6.2.


1.4. DNA segments

The following guidelines determine each part of the symbol:

Part I: D for DNA

Part II: 0,1,2,…22,X,Y,XY for the chromosomal assignment, where XY is for
segments homologous on the X and Y chromosomes, and 0 is for unknown
chromosomal assignment.

Part III: A symbol indicating the complexity of the DNA segment detected by the
probe: S for a unique DNA segment, Z for repetitive DNA segments found at a
single chromosome site, or F for small undefined families of homologous
sequences found on multiple chromosomes.

Part IV: 1,2,3,…, a sequential number to give uniqueness to the above
concatenated characters.

Part V: When the DNA segment is known to be an expressed sequence the suffix E
can be added to indicate this fact.

These numbers can now be generated automatically in the Genome Database,
following entry of clone details.
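
The five-part structure of a DNA segment symbol can be illustrated with a small parser. This Python sketch is an assumption-laden example rather than the Genome Database's own procedure: the function name, the regular expression and the returned dictionary keys are invented here, and the symbols passed in are illustrative inputs only.

```python
import re

# D, chromosome (0-22, X, Y or XY), complexity code, sequential number, optional E
D_SEGMENT = re.compile(r"^D(0|[1-9]\d?|XY|X|Y)([SZF])([1-9]\d*)(E?)$")

COMPLEXITY = {
    "S": "unique DNA segment",
    "Z": "repetitive DNA segment at a single chromosome site",
    "F": "small family of homologous sequences on multiple chromosomes",
}


def parse_d_segment(symbol: str) -> dict:
    """Split a D-segment symbol into its parts (Parts II to V)."""
    match = D_SEGMENT.match(symbol)
    if match is None:
        raise ValueError(f"not a D-segment symbol: {symbol!r}")
    chromosome, complexity, number, expressed = match.groups()
    if chromosome.isdigit() and int(chromosome) > 22:
        raise ValueError(f"chromosome out of range: {chromosome}")
    return {
        "chromosome": chromosome,  # "0" means unknown assignment
        "complexity": COMPLEXITY[complexity],
        "number": int(number),
        "expressed": expressed == "E",
    }


print(parse_d_segment("D21S11"))
print(parse_d_segment("DXYS1E"))
```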


2. Recommendations for symbol construction


2.1. Hierarchical symbols, gene families and series

2.1.1. Every attempt should be made to represent information in a hierarchical form to facilitate retrieval of sets of related genes from computerized databases.
2.1.2. Where gene products of similar function are encoded by different genes,
the corresponding loci are designated by Arabic numerals placed immediately
after the gene symbol, without any space between the letters and numbers used.
Examples: PGM1, PGM2, PGM3 (three loci for phosphoglucomutase activity); ADH1,
ADH2, ADH3 (three alcohol dehydrogenase loci); HBA1, HBA2 (duplicated forms of
the alpha-hemoglobin gene). However, if they exist historically, single-letter
suffixes may be used to designate these different loci. Example: LDHA, LDHB,
LDHC (three lactate dehydrogenase loci).
2.1.3. A final character in the gene symbol may be used to specify a
characteristic of the gene. While letters to specify tissue distribution have
been used historically, Arabic numbers are now preferred as experience has
shown that tissue specificity may not be as restricted as described
initially.


2.2. Homologies with other species

2.2.1. Homologous genes in different species (orthologs) should where possible have the same gene nomenclature.
2.2.2. Human homologs of genes first identified in other species should not be designated by a symbol beginning with H for human.
2.2.3. When a locus or series of genes has been defined in one species, and it
is reasonable to expect that in the future a homologous gene will be identified
in man, we recommend that the designated symbol be reserved for the human loci.
We recommend that this should be done in other species, for genes first
identified in human.
2.2.4. When necessary to distinguish the species of origin for homologous genes
with the same gene symbol, the three-letter code for different species already
established by the Committee on Standardization in Human Cytogenetics (see
Table 1), is recommended. The code is for use in publications only and not
incorporated as part of the gene symbol. The species designation is added as a
prefix to the gene symbol. For example HSA signifies Homo sapiens and MMU
stands for Mus musculus. Examples of using the species designation with the
gene symbol; human loci: (HSA)G6PD; (HSA)HBB; (HSA)ALB; homologous mouse loci:
(MMU)G6pd; (MMU)Hbb; (MMU)Alb.
2.2.5. The agreement between human and mouse gene nomenclature for many homologous gene loci should be continued and extended to other species where possible.
2.2.6. Human homologs of genes in invertebrate or prokaryote species may be represented by the symbol used in the other species followed by an L to represent like. The use of H to represent homolog is no longer recommended, and will be discontinued.
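
The species prefix convention in 2.2.4 can be sketched as follows. The function name is hypothetical and the dictionary holds only a two-entry subset of Table 1; the prefix is added for display in publications and is not stored as part of the gene symbol.

```python
# Subset of the three-letter species codes from Table 1.
SPECIES_CODES = {
    "Homo sapiens": "HSA",
    "Mus musculus": "MMU",
}


def with_species_prefix(symbol: str, species: str) -> str:
    """Prefix a gene symbol with its bracketed species code, for publications."""
    return f"({SPECIES_CODES[species]}){symbol}"


print(with_species_prefix("G6PD", "Homo sapiens"))  # (HSA)G6PD
print(with_species_prefix("G6pd", "Mus musculus"))  # (MMU)G6pd
```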


2.3. Genes identified from sequence information

2.3.1. Predicted genes.

Genes predicted from EST clusters or from genomic sequence alone are regarded
as putative, and are designated by the chromosome of origin and an arbitrary
number. Example: C2ORF1.

2.3.2. Pseudogenes

Molecular technology has identified sequences (generally not transcribed) that
bear striking homologies to structural gene sequences. These sequences are
termed pseudogenes. In order to show the relatedness of pseudogenes to
functional genes, pseudogenes will be identified with the gene symbol of the
structural gene followed by a P for pseudogene. In order to reserve P for
pseudogenes, the use of P as the last character of a structural gene symbol
should be avoided where possible. Examples: HBBP1 (hemoglobin, beta pseudogene
1); ACTBP1 (actin, beta pseudogene 1); ACTBP2 (actin, beta pseudogene 2), etc.
Pseudogenes may be on different chromosomes or closely linked to the
functional gene and occur in varying numbers.
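
The pseudogene convention above amounts to appending P and a number to the symbol of the functional gene. A minimal Python sketch, with hypothetical function names, might look like this:

```python
def pseudogene_symbol(parent_symbol: str, number: int) -> str:
    """Append P and a sequential number to the functional gene's symbol."""
    if number < 1:
        raise ValueError("pseudogene numbers start at 1")
    return f"{parent_symbol}P{number}"


def pseudogene_name(parent_name: str, number: int) -> str:
    """Build the matching descriptive name."""
    return f"{parent_name} pseudogene {number}"


print(pseudogene_symbol("HBB", 1), "-", pseudogene_name("hemoglobin, beta", 1))
print(pseudogene_symbol("ACTB", 2), "-", pseudogene_name("actin, beta", 2))
```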

2.3.3. Related sequences

Related sequences identified by cross-hybridisation, and/or by computer
searching of sequence databases (BLAST, FASTA), where no other functional
information is available for the construction of a symbol, are designated with
the symbol of the known gene followed by an L for like (see also homology
section 2.2).


2.4. Enzymes and proteins

2.4.1. Names of genes coding for enzymes are based on those recommended by the
Nomenclature Committee of the International Union of Biochemistry. Names of
plasma proteins, hemoglobins, and specialized proteins are based on standard
names and those recommended by their respective committees (??refs).


2.5. Clinical disorders

2.5.1. Inherited clinical disorders (monogenic Mendelian inheritance).

The first gene symbol allocated to an inherited clinical phenotype may be based
on an acronym which has been established as a name for the disorder, whilst
following the rules described in section 1. Example: ACH for achondroplasia.
However, it is usual for this symbol to change when the gene product or function
is identified. In some cases a gene symbol based on product or function will
already exist, and this will take precedence over the symbol derived from the
clinical disorder when the gene descriptions are merged. For example, in the
case of achondroplasia the symbol changed to FGFR3 and the name to fibroblast
growth factor receptor 3 (achondroplasia, thanatophoric dwarfism).

2.5.2. Complex/polygenic traits

Genome searches may suggest a contributing locus in a complex trait, which may
for convenience be given a gene symbol, although a proportion of these will
disappear in time. A symbol allocated to such a gene will not be re-used.

2.5.3. Contiguous gene syndromes.

Syndromes clearly associated with multiple loci should not be given gene
symbols. Syndromes associated with a regional deletion or duplication may be
assigned the letters CR (for chromosome region), in place of S for syndrome.
Examples: ANCR (Angelman syndrome chromosome region), DCR (Down syndrome chromosome region). However, as advances in database design have now increased
the possible ways of representing this type of information, we recommend that
such symbols are now classified as syndromic region symbols and not gene
symbols.

2.5.4. Loss of heterozygosity.

A chromosomal region in which the existence of genes may be inferred by loss of
heterozygosity can be designated by a symbol consisting of the letters LOH, the
chromosome number, CR (for chromosomal region) and then an arbitrary number.

2.6. Letters reserved for specific usage

2.6.1. Certain letters, or combinations of letters, are used at the end of a
symbol to represent a specific meaning: P for pseudogene (but note also BP for
binding protein), L for like (see 2.2.6 and 2.3.3), R for receptor or
regulator, N or NH for inhibitor. The use of these for other meanings should be
avoided where possible.
2.6.2. If the name of a gene contains a character or property for which there is
a recognized abbreviation, the abbreviation should be used. Example: the
single-letter abbreviation for amino acids (Table 3) used in aminoacyl
residues, or approved biochemical abbreviations such as GLC for glucose and
GSH for glutathione.


3. Allele terminology

Allele terminology is now the responsibility of the Mutation Database
(ref/URL)

4. Printing Gene and Allele Symbols

Gene and allele symbols are underlined in manuscript and italicized in
print. Italics need not be used in catalogs. It may be convenient in
manuscripts, computer printouts and in printed text to designate a gene symbol
by following it with an asterisk (e.g. PGM1*). When only allele symbols are
displayed they can be preceded by an asterisk. For example, for PGM1*1, the
allele is printed as *1.
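
The asterisk convention can be shown with a short Python sketch. The function name is hypothetical, and the code simply splits a designation at the first asterisk.

```python
def split_allele(designation: str):
    """Split 'GENE*allele' into the gene symbol and the '*allele' part."""
    gene, separator, allele = designation.partition("*")
    if not gene or not separator:
        raise ValueError(f"expected GENE*allele or GENE*, got {designation!r}")
    # A bare trailing asterisk marks a gene symbol with no allele given.
    return gene, ("*" + allele if allele else None)


print(split_allele("PGM1*1"))  # ('PGM1', '*1')
print(split_allele("PGM1*"))   # ('PGM1', None)
```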

Table 1: Species Abbreviations

Abbreviation  Species
HSA           Homo sapiens
PTR           Pan troglodytes (chimpanzee)
GGO           Gorilla gorilla
PPY           Pongo pygmaeus (orangutan)
MMU           Mus musculus
RNO           Rattus norvegicus
MML           Macaca mulatta (Rhesus monkey)
CAE           Cercopithecus aethiops (African green monkey)
PPA           Papio papio (baboon)
FCA           Felis catus (cat)
CGR           Cricetulus griseus (hamster)
OOV           Ovies ovies (sheep)
BBO           Bos bovinus (cattle)
SSC           Sus scrofa (pig)
OCU           Oryctolagus cuniculus (rabbit)
MRU           Macropus rufus (red kangaroo)



Table 2: Greek-to-Latin alphabet conversion

Greek letter  Name (lower case)  Latin upper case conversion
α             alpha              A
β             beta               B
γ             gamma              G
δ             delta              D
ε             epsilon            E
ζ             zeta               Z
η             eta                H
θ             theta              Q
ι             iota               I
κ             kappa              K
λ             lambda             L
μ             mu                 M
ν             nu                 N
ξ             xi                 X
ο             omicron            O
π             pi                 P
ρ             rho                R
σ             sigma              S
τ             tau                T
υ             upsilon            Y
φ             phi                F
χ             chi                C
ψ             psi                U
ω             omega              W


Table 3: Single-letter amino acid symbols

Amino acid     Three-letter symbol  One-letter symbol
Alanine        Ala                  A
Arginine       Arg                  R
Asparagine     Asn                  N
Aspartic acid  Asp                  D
Asn + Asp      Asx                  B
Cysteine       Cys                  C
Glutamine      Gln                  Q
Glutamic acid  Glu                  E
Gln + Glu      Glx                  Z
Glycine        Gly                  G
Histidine      His                  H
Isoleucine     Ile                  I
Leucine        Leu                  L
Lysine         Lys                  K
Methionine     Met                  M
Phenylalanine  Phe                  F
Proline        Pro                  P
Serine         Ser                  S
Threonine      Thr                  T
Tryptophan     Trp                  W
Tyrosine       Tyr                  Y
Valine         Val                  V

5. Acknowledgements:
The Nomenclature meeting held in Toronto on 5th March 1997 was made possible
by the support of the EU, through a contract to HUGO.

6. Reference: J.A. White, P.J. McAlpine, S. Antonarakis, H. Cann, K.
Frazer, J. Frezal, D. Lancet, J. Nahmias, P. Pearson, J. Peters, A. Scott, H. and the attendees at the nomenclature meeting 5th of March 1997.

[/raw]