AI in Science and Discovery
Science advances by forming hypotheses, gathering evidence, and iterating. That process has always been bottlenecked by two things: the ability to search through possible explanations, and the ability to measure the world precisely. AI is accelerating both. It is not replacing scientists — it is compressing the distance between a question and a testable answer, sometimes by years.
Protein Structure and the Biology Revolution
Proteins are the molecular machines of life. Every enzyme, antibody, and structural component of a cell is a protein: a chain of amino acids that folds into a precise three-dimensional shape, and that shape determines its function. For decades, determining the shape experimentally (via X-ray crystallography or cryo-electron microscopy) was slow and expensive and required specialized equipment; a single protein could take years. The protein folding problem, predicting shape from sequence alone, was considered one of biology's hardest open problems.
DeepMind's AlphaFold2, unveiled at the 14th Critical Assessment of protein Structure Prediction (CASP14) in late 2020, predicted structures with accuracy approaching that of experimental methods for most of the proteins it was tested on. The structural biology community widely described it as a breakthrough of historic magnitude.
AlphaFold2 combines multiple sequence alignment (comparing a protein's sequence to its evolutionary relatives to infer which positions are structurally constrained) with a transformer-style attention architecture that reasons about pairwise relationships between amino acid residues across the chain. It was trained on the Protein Data Bank, a public archive of well over 100,000 experimentally determined structures, and it generalizes to proteins never seen in training.
By mid-2022, DeepMind and EMBL's European Bioinformatics Institute had released predicted structures for over 200 million proteins, covering nearly every protein sequence cataloged at the time. Researchers can now look up a predicted structure, with per-residue confidence scores, for almost any known protein in minutes rather than designing a multi-year experimental campaign. Drug discovery, vaccine design, and fundamental biology all now draw on these predictions.
Evolutionary pressure leaves a fingerprint in protein sequences: positions that co-evolve are often spatially close in the folded structure. AlphaFold2 extracts this signal using multiple sequence alignment combined with attention mechanisms that model pairwise relationships between residues — letting the model infer structural constraints that are implicit in the evolutionary record.
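The co-evolution signal described above can be illustrated with a much simpler statistic than anything AlphaFold2 uses. The sketch below scores every pair of alignment columns by mutual information, a classic pre-deep-learning way to flag positions that vary together. This is not AlphaFold2's method: the model learns these relationships with attention rather than computing a fixed statistic. The alignment `TOY_MSA` and all function names are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter
from itertools import combinations

# Toy alignment (hypothetical data): each row is one homologous sequence,
# each column is one aligned residue position.
TOY_MSA = [
    "AKLE",
    "AKLD",
    "GRLE",
    "GRLD",
    "AKLE",
    "GRLD",
]


def column(msa, i):
    """Return the residues found at alignment position i."""
    return [seq[i] for seq in msa]


def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    col_i, col_j = column(msa, i), column(msa, j)
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in count_ij.items():
        p_ab = count / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


def rank_coevolving_pairs(msa):
    """Score every column pair and sort from most to least coupled."""
    length = len(msa[0])
    scores = {
        (i, j): mutual_information(msa, i, j)
        for i, j in combinations(range(length), 2)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for (i, j), score in rank_coevolving_pairs(TOY_MSA):
        print(f"positions {i} and {j}: {score:.3f} bits")
```

In the toy alignment, positions 0 and 1 change in lockstep (A with K, G with R), so that pair ranks first, while the fully conserved position 2 carries no co-variation signal at all. Real alignments contain thousands of sequences and need corrections for shared ancestry that this sketch ignores.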
Materials science is seeing parallel developments. The space of possible materials (combinations of elements, crystal structures, and synthesis conditions) is too vast to explore exhaustively by experiment. ML models trained on databases of known materials, such as the Materials Project, which catalogs computed properties for over 150,000 inorganic compounds, can predict properties like electrical conductivity, hardness, or thermal stability for candidate materials before any lab work begins. Google DeepMind released GNoME (Graph Networks for Materials Exploration) in 2023, identifying 2.2 million new candidate crystal structures, about 380,000 of which were predicted to be stable and are therefore the most promising targets for synthesis. This is a potential pipeline of new materials for batteries, solar cells, and semiconductors that would take centuries to screen by experiment alone.
Weather forecasting is another domain where AI is changing what is computationally feasible. Traditional numerical weather prediction solves differential equations describing fluid dynamics on a global grid, which is computationally expensive and limited in resolution. ML models trained on decades of atmospheric reanalysis data (historical reconstructions of global weather) can produce 10-day global forecasts in under a minute rather than hours, with accuracy that rivals the best traditional systems for many variables. Google DeepMind's GraphCast and ECMWF's AI-augmented forecasts represent the leading edge of this approach.
Climate modeling, which projects decades-long trends rather than days-long weather, is harder because small errors compound over such long simulated time spans. AI is used here primarily as an emulator (approximating expensive physics simulations) and as a downscaling tool (adding fine-grained regional detail to coarse global model output), rather than as a direct replacement for the underlying physics.
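The emulator idea can be sketched in a few lines: run the expensive simulation at a handful of inputs, fit a cheap surrogate to those runs, and query the surrogate everywhere else. The example below is a minimal illustration under stated assumptions, not how any production climate emulator is built. It uses piecewise-linear interpolation in place of the neural networks used in practice, and `expensive_simulation` is a hypothetical toy function standing in for a physics model.

```python
import bisect
import math


def expensive_simulation(forcing):
    """Stand-in for a costly physics run (hypothetical toy function)."""
    return 1.5 * math.log1p(forcing) + 0.1 * forcing


class InterpolatingEmulator:
    """Piecewise-linear surrogate fit to a small set of simulator runs."""

    def __init__(self, simulator, training_inputs):
        # Each training point costs one full simulator run.
        self.xs = sorted(training_inputs)
        self.ys = [simulator(x) for x in self.xs]

    def predict(self, x):
        # Refuse to extrapolate: the surrogate has no physics to fall back on.
        if not self.xs[0] <= x <= self.xs[-1]:
            raise ValueError("input outside the range the emulator was trained on")
        hi = bisect.bisect_left(self.xs, x)
        if self.xs[hi] == x:
            return self.ys[hi]
        lo = hi - 1
        weight = (x - self.xs[lo]) / (self.xs[hi] - self.xs[lo])
        return self.ys[lo] + weight * (self.ys[hi] - self.ys[lo])


if __name__ == "__main__":
    emulator = InterpolatingEmulator(expensive_simulation, [0.0, 1.0, 2.0, 3.0, 4.0])
    query = 2.5
    print("emulator: ", emulator.predict(query))
    print("simulator:", expensive_simulation(query))
```

The explicit range check reflects the main design constraint of any emulator: it is only trustworthy inside the region covered by the simulator runs it was fit to, which is one reason emulators complement physics-based models rather than replace them.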
AlphaFold2 needed the Protein Data Bank. GNoME needed the Materials Project. GraphCast needed ECMWF's ERA5 reanalysis. Scientific AI depends on high-quality, curated, open scientific databases. Building and maintaining those databases is itself a major scientific contribution — and often an unglamorous one.
Flashcards
Why is the protein folding problem considered solved 'to near-experimental accuracy' rather than 'perfectly'?
Why is AI used as an emulator in climate modeling rather than as a direct replacement for physics-based simulation?
Trace a Scientific AI Pipeline
- Pick one of the three scientific AI applications discussed: AlphaFold2, GNoME, or GraphCast.
- Draw or outline the pipeline from raw data to usable result. For each stage, identify: What is the input? What transformation happens? What is the output?
- For AlphaFold2: input is amino acid sequence; intermediate steps include multiple sequence alignment and attention over residue pairs; output is a predicted 3D coordinate set with per-residue confidence scores.
- For GNoME: input is candidate crystal structures represented as graphs; the intermediate step is a graph network that estimates each candidate's energy; output is a stability prediction.
- For GraphCast: input is a global atmospheric state; output is a predicted atmospheric state 6 hours later (iterated to 10 days).
- Now ask: where in your chosen pipeline could errors enter? Where is the model likely to be least reliable?
- Write a paragraph explaining one specific limitation you identified and why it matters.
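As a starting point for the error question, the GraphCast-style pipeline above can be sketched as an autoregressive rollout: a one-step model is applied repeatedly, each output becoming the next input. The code below is a toy under loud assumptions. A single number stands in for the full gridded atmospheric state, `true_step` and `learned_step` are hypothetical functions, and the growth rates are invented purely to show how a small per-step error accumulates with lead time.

```python
STEP_HOURS = 6
FORECAST_DAYS = 10
N_STEPS = FORECAST_DAYS * 24 // STEP_HOURS  # 40 six-hour steps


def rollout(step_fn, initial_state, n_steps):
    """Apply step_fn repeatedly, feeding each output back in as the next input."""
    states = [initial_state]
    for _ in range(n_steps):
        states.append(step_fn(states[-1]))
    return states


def true_step(state):
    """Hypothetical 'real' six-hour evolution of the toy state."""
    return state * 1.010


def learned_step(state):
    """Hypothetical learned model with a small error on every step."""
    return state * 1.012


truth = rollout(true_step, 1.0, N_STEPS)
forecast = rollout(learned_step, 1.0, N_STEPS)
errors = [abs(f - t) for f, t in zip(forecast, truth)]

if __name__ == "__main__":
    for day in (1, 5, 10):
        step = day * 24 // STEP_HOURS
        print(f"day {day:2d}: absolute error = {errors[step]:.4f}")
```

Because every step starts from the previous prediction rather than from an observation, the forecast error grows with lead time. That makes the later days of a rollout a natural place to look when deciding where a model of this kind is least reliable.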