Reasoning With Cells
Learning from virtual cell simulations, rBio takes scientists beyond the bounds of published research, with the ease of use of an LLM.
In recent years, large language models (LLMs), such as OpenAI’s ChatGPT, have taken the world by storm, offering user-friendly, conversational access to virtually all the information found on the internet. LLMs have proved to be great time-savers for scientists, because they can rapidly answer researchers’ questions and summarize scientific research. But as a tool for doing serious science, LLMs have their limits: ask them an open-ended question involving complex raw data and they quickly hit a wall.
Enter “reasoning models,” a recently developed subcategory of language models that incorporate logic and problem-solving. Though generally not as fast as standard LLMs, reasoning models like OpenAI’s o-series are superior at tasks where context and deduction are especially important.
A key focus for the Chan Zuckerberg Initiative is building AI-based virtual cell models: digital representations of cells that can be used in computer-simulated experiments to predict how actual cells will behave in the laboratory. Such models promise to greatly accelerate the pace of research, yet they are complex and often require specialized expertise to use. In addition, a given task may have many different virtual cell models associated with it, with no established way to coherently incorporate them into a single workflow. Reasoning models, for their part, typically require vast amounts of experimental data to train, and oftentimes a model trained on a dataset is publicly available while the underlying dataset is not.
To bridge this usability gap, CZI today announces rBio, the first reasoning model trained on simulations from one or many virtual cell models. During training, rBio distills the knowledge embedded in virtual cell models into natural language, allowing users to apply sophisticated step-by-step reasoning to complex biological problems. This effectively turns virtual cell models into biology teachers for reasoning models, sidestepping the need for experimental data as the only teacher and resulting in more capable reasoning LLMs for biology. Combining the power of one or many virtual cell models with the chat-style interface of LLMs could empower many more scientists to study biological questions grounded in rich foundation models of biology, all from within a familiar interface.
An LLM That Taps the Power of Virtual Cells
When presented with completely new data and hypotheses, rBio saves scientists time by predicting research outcomes before they commit resources to testing and discarding multiple hypotheses in costly laboratory experiments.
While rBio has the potential to learn from many approaches to cell biology, the model has first been trained on perturbation models, gene co-expression patterns, and gene-regulatory-pathway information extracted from TranscriptFormer, one of CZI's virtual cell models. This versatile model is able to classify a variety of cell types and states across different species and stages of development. Scientists can ask rBio questions such as, "Would suppressing the actions of gene A result in an increase in activity of gene B?" In response, the model provides information about the resulting changes to cells, such as a shift from a healthy to a diseased state.
Answers to these questions can shape our understanding of the gene interactions contributing to neurodegenerative diseases like Alzheimer’s. Such knowledge could lead to earlier intervention, perhaps halting these diseases altogether someday.
rBio is another important step in CZI’s vision to build AI systems that can think like scientists — producing new knowledge by learning from virtual cell models and data. In building rBio, CZI has also constructed a broader framework for funneling the vast knowledge of virtual cell models, which will help build a range of widely accessible AI tools for biology in the coming years.
What’s Inside the rBio Box
While designing rBio, the research and engineering teams, led by CZI Senior Director of AI Theofanis Karaletsos and Senior Research Scientist Ana-Maria Istrate, overcame a fundamental challenge in teaching biology to LLMs. Language models are designed to learn from questions with unambiguous answers, such as "2 + 2 = ?" or whether water consists of hydrogen and oxygen. But biological questions must incorporate varying levels of uncertainty, such as whether a new drug is likely to cure a specific form of cancer.
With TranscriptFormer as a foundation, CZI engineers applied new methods to how LLMs are trained. Using an off-the-shelf language model as a scaffold, the team trained rBio with reinforcement learning, a common technique in which the model is rewarded for correct answers. But instead of rewarding answers to yes/no questions as simply right or wrong, the researchers scaled the rewards in proportion to the likelihood that the model's answers were correct. This novel approach means that rBio learned to propose hypotheses that align with biological reality, improving accuracy, coherence and scientific value.
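To make the idea concrete, here is a minimal sketch of a likelihood-proportional reward, assuming the teacher (a virtual cell model) supplies the probability that the correct answer to a yes/no question is "yes." The function name and interface are illustrative, not taken from the rBio codebase:

```python
def soft_reward(answer: str, p_yes: float) -> float:
    """Reward a yes/no answer in proportion to the teacher's confidence.

    Instead of a hard 0/1 reward for matching a ground-truth label, the
    reward equals the probability the teacher model assigns to the answer
    being correct, so partially supported hypotheses earn partial credit.

    answer -- the policy model's answer, "yes" or "no"
    p_yes  -- teacher's probability that "yes" is correct (0.0 to 1.0)
    """
    is_yes = answer.strip().lower() == "yes"
    return p_yes if is_yes else 1.0 - p_yes

# When the teacher is 75% confident two genes are co-expressed,
# "yes" earns a reward of 0.75 and "no" earns 0.25.
print(soft_reward("yes", 0.75))
print(soft_reward("No", 0.75))
```

In a standard reinforcement-learning setup, both answers would be scored 1 or 0; scaling the reward this way keeps the training signal consistent with the teacher's uncertainty.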
While TranscriptFormer takes its instructions and returns results as complex data, rBio allows users to interact in plain language. For example, during training, rBio was asked questions with the following structure: Are gene A and gene B likely to be co-expressed? Give a binary yes/no answer only. As a reasoning model, rBio can now answer these same questions worded differently.
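Training questions with that structure can be generated from gene pairs with a simple template. This is a hypothetical sketch for illustration; the exact wording and tooling behind rBio's training prompts may differ:

```python
def coexpression_prompt(gene_a: str, gene_b: str) -> str:
    """Format a binary co-expression question in the style described above."""
    return (
        f"Are gene {gene_a} and gene {gene_b} likely to be co-expressed? "
        "Give a binary yes/no answer only."
    )

# Example prompt for a pair of (illustrative) gene symbols:
print(coexpression_prompt("SNCA", "MAPT"))
```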
But training an LLM using "soft" outcomes runs the risk of producing an inaccurate model. To ensure that rBio was not being misled, the team compared the model's performance with that of multiple baseline LLMs. Across a variety of cell-labeling and perturbation prediction challenges, rBio outperformed the baseline models, showing that virtual cell models can be used to train a reliable LLM to reason about biology. Specifically, this first version of rBio outperforms previously published models such as SUMMER (ICLR 2025) on the PerturbQA benchmark, beats baseline LLMs like Qwen2.5, and, when using chain-of-thought, matches a strongly performant rBio ablation trained directly on experimental data. In other results, rBio improved zero-shot performance on perturbation tasks (activating or deactivating particular genes) by learning general biology, such as gene co-expression patterns, from TranscriptFormer, a model and task unrelated to perturbation. This shows promise for further inquiry into transferable knowledge from virtual cell models.
rBio in Action
Now available on CZI’s virtual cell platform, rBio can help accelerate the work of scientists studying gene perturbation in the laboratory. Machine learning practitioners can also use the rBio framework to train their own LLMs, or use rBio itself to benchmark their models.
rBio’s expertise is currently limited to gene perturbation, but every domain of cell biology covered by TranscriptFormer could be taught to rBio. In the future, the family of virtual cell models on the virtual cell platform could all be used to train similar reasoning models, or combined into comprehensive language models that understand cells from the smallest molecule to the largest system.
While rBio is ready for research, the model’s engineering team is continuing to improve the user experience, because the flexible problem-solving that makes reasoning models conversational also poses a number of challenges. One of these is ensuring the model has appropriate guardrails to keep rBio from providing answers to questions outside its expertise. These safety measures are a common step in the responsible development of all LLMs, such as when models decline to provide medical advice to users.
AI is already accelerating the pace of biological research. Virtual cell models are saving researchers from fruitless laboratory experiments, while LLMs provide a convenient way to build knowledge conversationally. rBio combines these strengths, demonstrating a framework for harnessing AI to answer biology's toughest questions without the need for domain expertise. As CZI's family of virtual cell models grows, reasoning models like rBio will help scientists study disease, develop treatments, and ultimately work to prevent illness altogether.
Researchers can access rBio on the virtual cell platform, which includes a quick start guide, a tutorial, the codebase on GitHub, and the preprint on bioRxiv.