‘Gold mine of unexplored biology’: Short protein sequences could dramatically expand human genome | Science

The relatively small universe of human genes could grow by up to one-third, if a concerted effort to search for new genes that encode short proteins is successful. Many known miniproteins have already been shown to play key roles in cellular metabolism and disease, so the international effort to catalog new ones and determine their functions, announced last week in Nature Biotechnology, could shed light on a vast array of biochemical processes and provide targets for novel medicines.

“The microproteome is a potential gold mine of unexplored biology,” says Eric Olson, a molecular biologist at the University of Texas Southwestern Medical Center who is not involved with the new consortium. Anne O’Donnell-Luria, an expert in the genetics of rare diseases at Boston Children’s Hospital, adds that the expanded catalog could be a rich source of clues to genetic links to disease. “Everyone will be able to use this data set to make progress in their area.”

Only 19,370 human genes are known to code for proteins. But current catalogs include only genes for proteins containing at least 100 amino acids each, a cutoff chosen in part because longer DNA sequences make it easier for geneticists to look for commonalities between species. Many smaller proteins are known to exist, but they’ve largely flown under the radar even though some have been shown to play crucial roles in regulating the immune system, blocking other proteins, and destroying faulty RNAs. “The fact that these have been excluded represents a large hole in genetics and developmental biology,” says consortium member John Prensner, a pediatric oncologist at Boston Children’s.

Before a gene can be translated into a protein, it is first transcribed into a strand of messenger RNA (mRNA). Cellular organelles called ribosomes then read those mRNA sequences and follow their instructions to string together amino acids into proteins. When scientists scan for genes, they typically look for distinctive DNA sequences flanked by start and stop signals for the protein assembly process, so-called open reading frames (ORFs).
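The idea of scanning for ORFs can be illustrated with a minimal Python sketch. It is a simplification for illustration, not the consortium's method: it assumes the standard ATG start codon and the three standard stop codons, reads only the forward strand, and uses a hypothetical function name (`find_orfs`) and a hypothetical `max_aa` parameter to mimic the 100-amino-acid cutoff that leaves small ORFs out of current catalogs.

```python
START = "ATG"                    # standard start codon (assumption: no alternative starts)
STOPS = {"TAA", "TAG", "TGA"}    # standard stop codons


def find_orfs(dna, min_aa=1, max_aa=99):
    """Return (start, end, sequence) for each forward-strand ORF.

    An ORF here is an ATG followed in the same reading frame by a stop
    codon. Its length in amino acids counts the start codon but not the
    stop. `max_aa=99` keeps only ORFs under 100 amino acids (hypothetical
    default chosen to mirror the cutoff described in the article).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):                      # three forward reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i                       # open a candidate ORF
            elif start is not None and codon in STOPS:
                n_aa = (i - start) // 3         # codons before the stop
                if min_aa <= n_aa <= max_aa:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None                    # close it and keep scanning
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATTTTAGGG"):
        print(orf)
```

A real gene finder would also scan the reverse strand and account for splicing, which this sketch deliberately ignores.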

In recent years, researchers have come up with other ways to identify protein-coding sequences. One called Ribo-seq uses high-throughput sequencing technology to catalog all the RNAs in a sample that are bound to a ribosome at a given time. Those RNA sequences point to likely genes, although the technique can’t prove that any one sequence makes a stable, functional protein. Ribo-seq databases now contain thousands of ORFs, many of which don’t code for known proteins and therefore may represent new ones.

In the consortium’s first phase, members scanned seven Ribo-seq databases for candidate ORFs that might correspond to small proteins. After weeding out redundant entries, they came up with 7264 candidates. Next, the group will try to identify which of those yield proteins with actual cellular functions. Techniques such as mass spectrometry can help determine whether particular RNAs are translated into stable proteins. Others, such as epitope tagging, use antibodies to track marked proteins, revealing their location and abundance in cells and providing hints about their function.
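Weeding out redundant entries across several databases can be pictured with a short Python sketch. The article does not describe how the consortium actually matched entries, so everything here is an assumption: candidates are treated as identical when they share the same chromosome, strand, start, and end, and the function name `merge_candidates` and the database names are hypothetical.

```python
from collections import defaultdict


def merge_candidates(databases):
    """Collapse ORF candidates that appear in more than one database.

    `databases` maps a database name to an iterable of ORF records, each a
    (chromosome, strand, start, end) tuple. Records with identical
    coordinates are treated as the same candidate (an assumption made for
    this sketch). Returns a dict mapping each unique candidate to the set
    of database names that reported it.
    """
    sources = defaultdict(set)
    for name, orfs in databases.items():
        for orf in orfs:
            sources[orf].add(name)
    return dict(sources)


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    example = {
        "db_a": [("chr1", "+", 100, 190), ("chr2", "-", 500, 560)],
        "db_b": [("chr1", "+", 100, 190)],
    }
    merged = merge_candidates(example)
    print(len(merged), "unique candidates")
```

Keeping track of which databases reported each candidate, rather than just discarding duplicates, makes it possible to see how often independent data sets agree.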

For now, the 35 investigators involved are funding the effort from their own lab budgets, and don’t have immediate plans to seek dedicated funding. “There is so much there, this just needs to be done,” says consortium member Sebastiaan van Heesch, a systems biologist at the Princess Máxima Center for Pediatric Oncology in the Netherlands.