Unlocking Nature’s Secrets: Biotech Firm Amasses Vast Genetic Database for “ChatGPT of Biology”
A UK-based biotech corporation, Basecamp Research, has dedicated recent years to amassing a vast collection of genetic information from microorganisms thriving in extreme environments globally. This effort has led to the identification of over a million species and nearly 10 billion genes previously unknown to science. The company asserts that this extensive database of planetary biodiversity will serve as the foundation for training a “ChatGPT of biology,” capable of addressing inquiries about life on Earth, although the success of this endeavor remains uncertain.
The Value of Expanding Genetic Knowledge
Jörg Overmann of the Leibniz Institute DSMZ in Germany, which maintains one of the world’s most diverse repositories of microbial cultures, acknowledges the value of expanding known genetic sequences. However, he questions whether such expansion will reliably yield practical applications, such as drug discovery or novel chemistry, without more comprehensive information regarding the organisms from which those sequences were derived. “I’m not convinced that in the end, the understanding of truly novel functions will be accelerated by this brute-force increase in the sequence space,” he stated.
Advancements in Machine Learning Models
Recent years have witnessed the development of multiple machine learning models designed to discern patterns and forecast relationships within large biological datasets. The most notable example is AlphaFold, which can predict the three-dimensional structure of a protein from its amino acid sequence, the information encoded in its gene. This achievement earned its creators at Google DeepMind a share of the 2024 Nobel Prize in Chemistry.

Addressing the Biodiversity Gap
Despite the increasing sophistication of these “generative biology” models, their performance has not improved significantly, according to Frances Ding at the University of California, Berkeley. She suggests that this plateau may be due to a lack of biodiverse data. “Current models in biology are trained on datasets that disproportionately represent well-studied species (e.g., E. coli, mice, humans), and these models are worse at predicting properties about sequences from other parts of the tree of life,” she explains.
Basecamp Research’s Approach
Researchers at Basecamp aimed to resolve this biodiversity deficiency. The company’s expanding database now encompasses samples from over 120 locations across 26 countries, as detailed in a company report. Jonathan Finn, the company’s chief science officer, has indicated that their collection strategy emphasized underexplored extreme environments, ranging from the waters beneath Arctic sea ice to tropical hot springs. “Most of the samples that we’ve been going after are prokaryotic samples: bacteria, microbes and their viruses,” says Finn. “I know we’ve got some fungi in there.”
Key Findings from Genetic Analysis
Genetic analysis of these specimens revealed variations in genes that are nearly universally shared across the tree of life. Based on these differences, the company estimates that its data encompasses information from over 1 million species not represented in public genomic datasets used to train AI biology models. Collectively, these species contain approximately 9.8 billion newly identified genes, each encoding a potentially valuable protein. According to the researchers, that represents a ten-fold increase in the total number of known genes.
The Vision: A “ChatGPT of Biology”
“By showing these models a large piece of nature, they should have a better understanding of how biology works,” says Finn. “We’re trying to build a ChatGPT of biology.”
The Scale of Microbial Diversity
Some estimates suggest that Earth harbors as many as a trillion microbial species, with only a small fraction having been thoroughly characterized. Thus, the sheer volume of new life identified by Basecamp is not entirely unexpected. “It’s almost inevitable that if you explore more you get more different gene variants,” says Leopold Parts at the Wellcome Sanger Institute, UK.
Potential and Skepticism
Basecamp is betting on the potential value of this new material, and it’s not alone. “This is one of the most exciting things I’ve seen in a long time,” says Nathan Frey, a machine learning researcher at Genentech, a biotech firm in the US. According to him, much of the existing work on AI models for biology has focused on refining algorithms or generating more data in laboratories, rather than venturing into the field to collect samples.
However, some researchers doubt that the database will yield dramatic improvements in the models. It remains uncertain whether the newly discovered diversity of proteins represents valuable new functions, such as enzymes that degrade plastic or proteins that can be adapted for gene editing. “They have to show that this novelty is useful somehow,” says Parts.
Challenges in Predicting Gene Function
Furthermore, if the new genes are significantly different from those already known, Overmann questions the ability of current tools to accurately predict their functions or to effectively use the data for training novel models. “You don’t have any clue what the majority of the genes do,” he notes. While the company might have assembled a treasure trove of biological discoveries, further in-depth laboratory investigation is needed to decipher and harness its potential, even with the aid of sophisticated AI.