Genomic Surveillance with GISAID and Nextstrain

On 11 November 2021, a routine sequencing run at the Network for Genomic Surveillance in South Africa generated a SARS-CoV-2 genome with an unusual cluster of mutations in the spike protein. Within hours the sequence was uploaded to GISAID. Within days, scientists in Botswana, Hong Kong, and Israel had matched it to similar sequences. On 24 November, South Africa formally notified WHO. Two days after that, the variant got a Greek letter: Omicron. From sequencing to WHO notification took 13 days, and the global designation followed on day 15. Without genomic surveillance and an open data-sharing platform, that timeline would have been measured in months.

Genomic surveillance has become one of the most consequential tools in modern outbreak response. It lets scientists watch pathogens evolve in near real time, identify variants of concern before they dominate, track origin and spread, and design countermeasures targeted at the actual circulating strain. Two pieces of infrastructure carry most of this work: GISAID, the global database where most SARS-CoV-2 and influenza sequences live, and Nextstrain, the open-source platform that visualizes how those sequences relate. This post is part of pandemic preparedness 101, focused on how to read what these systems show.

Key Takeaways

Genomic surveillance is the systematic sequencing of pathogen genomes to track evolution, identify variants, and trace transmission chains in real time.
GISAID hosts more than 17 million SARS-CoV-2 sequences and millions of influenza sequences contributed by labs in over 200 countries. It is the de facto global database for these pathogens.
Nextstrain is an open-source platform that pulls public sequence data and visualizes evolutionary trees, geographic spread, and variant frequencies in interactive dashboards.
The infrastructure took Omicron from first sequence to WHO notification in 13 days, allowed identification of mpox clade Ib in 2024 within weeks of clustering, and tracked H5N1 from waterfowl into US dairy cattle within months.
Surveillance gaps remain large in low-income countries, in animal sequences, and in pathogens other than SARS-CoV-2 and flu. These gaps are among the most frequently cited under-resourced areas in pandemic preparedness.

What is genomic surveillance?

Genomic surveillance is the practice of sequencing the complete or partial genomes of pathogens collected through clinical, environmental, or animal samples, and analyzing those sequences for what they reveal about the pathogen population: variants, lineages, transmission chains, and emerging mutations of concern. The data are most useful when they are timely (sequenced within days of sample collection), well-annotated (linked to date, location, and metadata), and openly shared.
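To make "timely" and "well-annotated" concrete, here is a minimal sketch of how a surveillance team might represent one sequence record and check both properties. The field names, the example strain label, and the 14-day cutoff are assumptions for illustration, not a GISAID or GenBank schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class SequenceRecord:
    """Hypothetical minimal record; field names are illustrative only."""
    strain: str
    collection_date: Optional[date]
    submission_date: date
    country: Optional[str]
    host: Optional[str] = None

    def turnaround_days(self) -> Optional[int]:
        """Days between sample collection and sequence submission."""
        if self.collection_date is None:
            return None
        return (self.submission_date - self.collection_date).days

    def is_well_annotated(self) -> bool:
        """True when the record carries both a collection date and a location."""
        return self.collection_date is not None and bool(self.country)


def is_timely(record: SequenceRecord, max_days: int = 14) -> bool:
    """Assumed cutoff: a record counts as timely if shared within max_days."""
    days = record.turnaround_days()
    return days is not None and 0 <= days <= max_days


# Hypothetical record, not a real submission
rec = SequenceRecord(
    strain="example/0001/2021",
    collection_date=date(2021, 11, 11),
    submission_date=date(2021, 11, 22),
    country="South Africa",
    host="Human",
)
print(rec.turnaround_days(), rec.is_well_annotated(), is_timely(rec))
```

A record with no collection date fails both checks, which mirrors why undated sequences are of limited use for tracking growth over time.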

The concept is older than COVID-19, but the scale changed dramatically with the pandemic. Pre-2020, sequence-based outbreak investigations were mostly retrospective. Most labs sequenced fewer than 100 samples a year. By 2022, the United Kingdom alone was sequencing more than 30,000 SARS-CoV-2 samples a week. The infrastructure built during COVID-19 is now being extended to influenza, mpox, RSV, and other priority pathogens, though investment has dropped substantially since 2023.

What is GISAID?

GISAID (the Global Initiative on Sharing All Influenza Data) was launched in 2008 to remove barriers to sharing influenza sequences. The platform asks contributors to register, agree to terms of use, and credit data providers when using sequences in publications. GISAID has been controversial precisely because of these access conditions, which differ from the fully open NCBI GenBank model. The trade-off has been pragmatic: many countries that would not contribute to fully open databases (because of perceived risks of being scooped or having data used without attribution) do contribute to GISAID.

By volume, GISAID is the dominant global repository for SARS-CoV-2 and influenza sequences. As of 2025 it hosts more than 17 million SARS-CoV-2 sequences and millions of influenza A and B sequences. It is the system WHO uses to formally track variants of concern. One consequence of its access conditions is that some genomic surveillance work, particularly in academic settings that prioritize fully open data, runs in parallel on NCBI GenBank or specialized databases. The two ecosystems are complementary rather than substitutable.

What does Nextstrain do?

Nextstrain is an open-source platform that pulls public sequence data, builds phylogenetic trees showing how genomes relate to each other, overlays geographic and temporal information, and presents the result as interactive dashboards anyone can read. The team built tools that take raw GISAID and GenBank submissions and turn them into a visual tree of how a virus is evolving across countries and over time, updated continuously.
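For intuition about what "building a tree" means, here is a toy sketch that groups aligned sequences by how many positions differ. It uses simple average-linkage clustering (UPGMA) as a stand-in; production pipelines rely on more sophisticated tree-building methods, and the four sequences below are made up for illustration.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Count differing positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def upgma_newick(seqs: dict) -> str:
    """Repeatedly merge the two closest clusters; return a Newick-style topology."""
    # Each cluster label maps to the list of original sequence names it contains
    clusters = {name: [name] for name in seqs}
    dist = {
        frozenset(pair): hamming(seqs[pair[0]], seqs[pair[1]])
        for pair in combinations(seqs, 2)
    }

    def avg_dist(c1: str, c2: str) -> float:
        pairs = [(a, b) for a in clusters[c1] for b in clusters[c2]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Sorting the labels keeps tie-breaking deterministic
        c1, c2 = min(combinations(sorted(clusters), 2), key=lambda p: avg_dist(*p))
        merged = f"({c1},{c2})"
        clusters[merged] = clusters.pop(c1) + clusters.pop(c2)
    return next(iter(clusters)) + ";"


# Hypothetical 8-base alignment: A and B differ by 1 site, C and D by 2
toy_alignment = {
    "A": "ACGTACGT",
    "B": "ACGTACGA",
    "C": "ACGAACCA",
    "D": "TTGAACCA",
}
tree = upgma_newick(toy_alignment)
print(tree)  # ((A,B),(C,D));
```

The output only captures which samples group together. Real trees also carry branch lengths and inferred dates, which is what lets a dashboard place lineages on a timeline.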

The signature views you can read at nextstrain.org include:

Tree view: phylogenetic relationships colored by country, lineage, or any other metadata
Map view: geographic spread of lineages over time
Frequencies: which variants are growing or shrinking as a fraction of sequenced samples
Genetic divergence: which mutations are accumulating and where on the genome they fall
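The frequencies view boils down to a simple calculation: for each time window, what share of sequenced samples belongs to each lineage. Here is a minimal sketch using ISO weeks. The sample list and lineage labels are hypothetical, and real dashboards typically smooth these estimates rather than using raw weekly counts.

```python
from collections import Counter, defaultdict
from datetime import date


def weekly_frequencies(samples):
    """samples: iterable of (collection_date, lineage) pairs.

    Returns {(iso_year, iso_week): {lineage: share_of_that_week}}.
    """
    counts = defaultdict(Counter)
    for collected, lineage in samples:
        iso = collected.isocalendar()
        counts[(iso[0], iso[1])][lineage] += 1
    return {
        week: {lin: n / sum(c.values()) for lin, n in c.items()}
        for week, c in sorted(counts.items())
    }


# Hypothetical samples: lineage_Y goes from 1 in 4 to 3 in 4 within a week
samples = [
    (date(2021, 11, 9), "lineage_X"),
    (date(2021, 11, 10), "lineage_X"),
    (date(2021, 11, 11), "lineage_X"),
    (date(2021, 11, 12), "lineage_Y"),
    (date(2021, 11, 16), "lineage_X"),
    (date(2021, 11, 17), "lineage_Y"),
    (date(2021, 11, 18), "lineage_Y"),
    (date(2021, 11, 19), "lineage_Y"),
]
freqs = weekly_frequencies(samples)
for week, shares in freqs.items():
    print(week, shares)
```

Because the denominator is sequenced samples, the shares reflect who gets sequenced as much as what is circulating, which is worth keeping in mind when reading the chart.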

Nextstrain runs separate dashboards for SARS-CoV-2, seasonal influenza A and B, mpox, RSV, dengue, Ebola, and several others. The underlying tools (Augur for analysis, Auspice for visualization) are open-source and used by national surveillance teams to build country-specific dashboards.
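The dashboards are driven by JSON files that describe a tree of nested nodes. The sketch below walks such a tree and counts tips by an attribute. The toy dictionary is an assumption modeled loosely on the Auspice dataset layout (a "tree" with "children" and per-node "node_attrs" holding {"value": ...} entries); check the actual schema before relying on these keys, since real datasets carry many more fields.

```python
from collections import Counter


def iter_tips(node):
    """Yield every leaf (a node without children) under the given node."""
    children = node.get("children", [])
    if not children:
        yield node
    for child in children:
        yield from iter_tips(child)


def tips_by_attr(dataset, attr):
    """Count tips per value of a node attribute, e.g. 'country'."""
    counts = Counter()
    for tip in iter_tips(dataset["tree"]):
        value = tip.get("node_attrs", {}).get(attr, {}).get("value", "unknown")
        counts[value] += 1
    return counts


# Hypothetical toy dataset, not real Nextstrain output
toy_dataset = {
    "version": "v2",
    "meta": {"title": "Toy dataset (hypothetical)"},
    "tree": {
        "name": "NODE_0",
        "children": [
            {"name": "tip_1", "node_attrs": {"country": {"value": "Country A"}}},
            {
                "name": "NODE_1",
                "children": [
                    {"name": "tip_2", "node_attrs": {"country": {"value": "Country B"}}},
                    {"name": "tip_3", "node_attrs": {"country": {"value": "Country B"}}},
                    {"name": "tip_4", "node_attrs": {}},
                ],
            },
        ],
    },
}
country_counts = tips_by_attr(toy_dataset, "country")
print(country_counts)
```

The same walk works for any attribute stored on the tips, which is roughly what "colored by country, lineage, or any other metadata" means in the tree view.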

How does genomic sequencing detect variants?

Variants are identified by clusters of mutations that share an inheritance pattern (a "lineage") and that exceed expected drift rates or carry mutations associated with biological consequences (immune escape, transmissibility, virulence). The Pango lineage system for SARS-CoV-2, the standard naming convention used worldwide, classifies sequences into a hierarchical tree where each new lineage represents a meaningfully distinct branch.
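The hierarchy is encoded in the names themselves: each dot-separated level is a child of the one before it, and short aliases stand in for long prefixes. Here is a minimal sketch of an ancestry check. The alias table is a two-entry illustrative subset (the Pango project maintains the full list), and recombinant lineages, which have more than one parent, are not handled.

```python
# Small illustrative subset of Pango aliases; the real table is much longer
ALIASES = {
    "BA": "B.1.1.529",
    "AY": "B.1.617.2",
}


def expand(lineage: str) -> str:
    """Replace a leading alias with its full dotted prefix."""
    head, _, rest = lineage.partition(".")
    full_head = ALIASES.get(head, head)
    return f"{full_head}.{rest}" if rest else full_head


def is_descendant(lineage: str, ancestor: str) -> bool:
    """True if `lineage` sits strictly below `ancestor` in the naming tree."""
    child = expand(lineage).split(".")
    parent = expand(ancestor).split(".")
    return len(child) > len(parent) and child[: len(parent)] == parent


print(expand("BA.2"))                       # B.1.1.529.2
print(is_descendant("BA.2", "B.1.1"))       # True
print(is_descendant("AY.4", "B.1.1.529"))   # False
```

Comparing whole levels rather than raw string prefixes matters: B.1.10 is not a descendant of B.1.1, even though one string starts with the other.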

Three signals trigger investigation:

Rapid frequency growth: a lineage rising as a share of recent samples faster than would be expected from random drift. Omicron showed this pattern in South African sequences within days.
Mutation pattern of concern: clusters of mutations on functionally important regions (the spike receptor-binding domain for SARS-CoV-2, hemagglutinin for flu) that prior research has shown affect immune escape or transmission.
Cross-host or cross-region jumps: a sequence appearing in a new geographic area, host species, or transmission setting where it had not previously been detected.
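The three signals above can be sketched as a simple triage function. This is an illustration, not how any agency actually scores variants: the growth threshold, the mutation-count threshold, and the approximate residue range used for the spike receptor-binding domain are all assumptions, and the inputs are hypothetical.

```python
import math

# Approximate spike receptor-binding domain residues; an assumption for illustration
RBD_RANGE = range(319, 542)


def logit_growth_per_week(freqs):
    """Least-squares slope of logit(frequency) against week index."""
    pts = [(i, math.log(f / (1 - f))) for i, f in enumerate(freqs) if 0 < f < 1]
    if len(pts) < 2:
        raise ValueError("need at least two weeks with 0 < frequency < 1")
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return sxy / sxx


def triage(freqs, spike_positions, seen_in, location,
           growth_threshold=0.5, rbd_threshold=3):
    """Return one boolean flag per signal; thresholds are illustrative."""
    rbd_hits = sum(pos in RBD_RANGE for pos in spike_positions)
    return {
        "rapid_growth": logit_growth_per_week(freqs) > growth_threshold,
        "mutations_of_concern": rbd_hits >= rbd_threshold,
        "new_setting": location not in seen_in,
    }


# Hypothetical lineage: weekly share of sequenced samples, mutated spike residues
flags = triage(
    freqs=[0.01, 0.05, 0.20, 0.60],
    spike_positions=[417, 484, 501, 614],
    seen_in={"Country A"},
    location="Country B",
)
print(flags)
```

Fitting the slope on the logit scale is a common way to compare growth between lineages, because a steadily spreading lineage traces a roughly straight line there even as its raw share curves toward 100 percent. A raised flag only means "look closer"; the laboratory work is what confirms or rules out concern.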

Once flagged, sequences feed into laboratory work (neutralization assays, animal infection studies, structural biology) that confirms or rules out the public health concern. The full pipeline from sequence to variant designation can run in days for well-resourced systems and weeks to months otherwise.

Where are the genomic surveillance gaps?

Coverage is wildly uneven. By the end of 2022, the UK and Denmark had sequenced more than 5 percent of all detected SARS-CoV-2 cases nationally. Most of sub-Saharan Africa, parts of South Asia, and Latin America stayed below 0.5 percent. Africa CDC's Pathogen Genomics Initiative has narrowed the gap substantially since 2021, but the geographic blind spots from earlier in the pandemic delayed identification of multiple variants. The pattern repeated for mpox in 2022 to 2024 and for H5N1 in 2024.
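The coverage figures here are just sequenced genomes divided by detected cases. A small sketch with made-up counts shows how far apart the 5 percent and 0.5 percent levels are in practice; the region names and numbers are hypothetical.

```python
def sequencing_coverage_pct(sequenced: int, detected_cases: int) -> float:
    """Percent of detected cases that were sequenced."""
    if detected_cases <= 0:
        raise ValueError("detected_cases must be positive")
    return 100 * sequenced / detected_cases


# Hypothetical counts: (genomes sequenced, cases detected)
regions = {
    "Region A (hypothetical)": (60_000, 1_000_000),
    "Region B (hypothetical)": (2_000, 1_000_000),
}
coverage = {name: sequencing_coverage_pct(*counts) for name, counts in regions.items()}
for name, pct in coverage.items():
    print(f"{name}: {pct:.1f}% of detected cases sequenced")
```

With the same case count, Region A has 30 times as many genomes as Region B. Note that the denominator is detected cases, so places with limited testing can have even lower effective coverage than the percentage suggests.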

Animal sequencing is the other major gap. Most genomic surveillance budgets target human samples, and surveillance of livestock, wildlife, and environmental samples is much thinner. The H5N1 dairy cattle outbreak that began in 2024 was detected partly through milk sequencing run by USDA, but routine livestock pathogen sequencing in most countries is limited. This is also a gap in the One Health framework, which treats human, animal, and environmental health as linked.

The third gap is for pathogens other than the well-funded ones: RSV, dengue, leptospirosis, and many others have far thinner global sequence coverage than SARS-CoV-2 or seasonal flu.

What can you read on Nextstrain yourself?

The Nextstrain dashboards at nextstrain.org are designed for both expert and lay audiences. A few useful entry points:

Seasonal flu (nextstrain.org/flu/seasonal): which influenza strains are dominant globally and regionally, ahead of WHO vaccine selection meetings
SARS-CoV-2 (nextstrain.org/ncov): current variant frequencies and recent lineage emergence
Mpox (nextstrain.org/monkeypox): clade Ib geographic spread since 2024
H5N1 (nextstrain.org/avian-flu): the global picture for highly pathogenic avian influenza, including the US dairy cattle outbreak

The dashboards update on a regular cadence (typically weekly) and pull from the latest public sequence submissions. They are useful for forming an independent picture of where outbreaks are heading, beyond what news headlines summarize. For a complementary signal, wastewater surveillance catches the same trends earlier in some cases but generally offers coarser lineage resolution than sequencing individual samples.

FAQ

Is GISAID free to use?

Yes, for registered researchers. Anyone with a verifiable academic, public health, or research affiliation can register and access the data. Use is governed by terms that require crediting data submitters when using sequences in publications. GISAID does not allow indiscriminate redistribution of the underlying sequence files, which is the source of most of the criticism against it.

Why isn't all sequence data on GenBank?

Many national labs choose GISAID specifically because of the credit-and-attribution model. Submitting a sequence to GenBank gives any user worldwide unrestricted access to use it without obligation to acknowledge the contributor. For low-income country labs, where sequencing capacity was built with limited funding and academic credit is part of the value proposition, GISAID has been a more politically sustainable model. The trade-offs are widely debated but the contribution differential is real.

Can Nextstrain detect a brand-new pathogen?

Not directly. Nextstrain visualizes already-identified pathogens. Detecting an entirely novel pathogen still requires unbiased metagenomic sequencing and bioinformatic pipelines that compare reads against all known references. Once a novel pathogen is identified and its first sequences are uploaded, Nextstrain can take over for tracking its spread and evolution.
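To illustrate the "compare reads against all known references" step, here is a toy sketch that flags reads sharing few short subsequences (k-mers) with a reference set. Real metagenomic pipelines use dedicated classifiers and far larger databases; the sequences, the k-mer length, and the cutoff below are assumptions for illustration.

```python
def kmers(seq: str, k: int) -> set:
    """All overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(references: dict, k: int) -> set:
    """Union of k-mers across every known reference genome."""
    index = set()
    for seq in references.values():
        index |= kmers(seq, k)
    return index


def known_fraction(read: str, index: set, k: int) -> float:
    """Share of a read's k-mers that appear in the reference index."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return 0.0
    return len(read_kmers & index) / len(read_kmers)


def unexplained_reads(reads, references, k=8, min_known=0.5):
    """Reads that mostly fail to match any known reference."""
    index = build_index(references, k)
    return [r for r in reads if known_fraction(r, index, k) < min_known]


# Hypothetical reference and reads
references = {"ref_a": "ACGTACGTTAGCCGATAGCTAGGCTA"}
reads = [
    references["ref_a"][4:20],   # drawn from the known reference
    "TTTTTTTTGGGGGGGGCCCC",      # matches nothing in the index
]
novel = unexplained_reads(reads, references)
print(novel)
```

Reads left unexplained are only candidates. They may be contamination, host material, or a known organism missing from the database, so confirmation needs assembly and follow-up lab work.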

How fast can a variant be identified?

For well-monitored pathogens like SARS-CoV-2 and seasonal flu, identification takes days. For less-funded pathogens, it takes weeks to months. The constraint is sample collection and sequencing throughput at the source, not the analytical infrastructure. The gap between "the variant exists" and "the variant is identified" is essentially the size of the surveillance investment in the country where it first emerges.

Does genomic surveillance help with outbreak response on the ground?

Yes, in two ways. It helps decision-makers know which strain is circulating, which informs vaccine, treatment, and diagnostic choices. And it helps trace transmission chains in real time, including identifying superspreader events, healthcare-associated outbreaks, and cross-border importation. Public health teams that have built sequence-informed contact tracing have closed outbreaks faster than those relying only on epidemiological case-finding.
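One way sequence data supports contact tracing is by grouping cases whose genomes are nearly identical, since those are more likely to belong to the same transmission chain. The sketch below links cases within a small number of differences. The two-difference threshold and the genomes are hypothetical, and real investigations combine genetic distance with dates and epidemiological links.

```python
from itertools import combinations


def snp_distance(a: str, b: str) -> int:
    """Differences between aligned genomes, ignoring ambiguous 'N' calls."""
    if len(a) != len(b):
        raise ValueError("genomes must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b) if x != "N" and y != "N")


def link_clusters(genomes: dict, max_snps: int = 2):
    """Group cases connected by pairwise distances of at most max_snps."""
    parent = {name: name for name in genomes}

    def find(x):
        # Follow parent pointers to the cluster root, shortening the path as we go
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(genomes, 2):
        if snp_distance(genomes[a], genomes[b]) <= max_snps:
            parent[find(a)] = find(b)

    clusters = {}
    for name in genomes:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical 10-base genomes; case_4 is genetically distant from the rest
genomes = {
    "case_1": "ACGTACGTAC",
    "case_2": "ACGTACGTAT",
    "case_3": "ACGTACNTAT",
    "case_4": "TTGAACCTGG",
}
clusters = link_clusters(genomes)
print(clusters)
```

Here case_4 lands in its own cluster, the kind of result that can rule a case out of a suspected hospital outbreak even when timing and location made it look connected.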