Interactive visualization of latent relationships between genomic regions
Understanding This Visualization
A guide to reading ALPHAGenome DNA Embeddings in latent space — for biologists
What You Are Looking At
Dataset Configuration
Genomic Bins: 12,421. Non-overlapping 131 kb windows covering the full human genome (hg38/GRCh38).
Embedding Dimensions: 3,072. Each bin is described by a 3,072-number vector produced by ALPHAGenome.
Bin Resolution: ~131 kb. The native input window size of the ALPHAGenome model (131,072 bp).
UMAP Distance Metric: cosine. Measures the angle between embedding vectors, so it captures functional direction, not magnitude.
UMAP n_neighbors: 15. How many local neighbors each point considers when building the map; balances local vs. global structure.
UMAP min_dist: 0.1. Minimum separation between points in 2D; controls cluster tightness.
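As a rough illustration of the settings above, the Python sketch below shows why cosine distance ignores vector magnitude and how the listed parameters might be passed to umap-learn. The names `UMAP_PARAMS`, `cosine_distance`, and `project_to_2d`, and the `random_state` value, are assumptions made for this example; this is not the code used to build the page.

```python
import math

# UMAP settings listed above; the dictionary name is illustrative.
UMAP_PARAMS = {
    "n_components": 2,
    "metric": "cosine",
    "n_neighbors": 15,
    "min_dist": 0.1,
}


def cosine_distance(u, v):
    """Return 1 - cos(angle) between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def project_to_2d(embeddings, random_state=0):
    """Project an (n_points, n_dims) array to 2D. Requires the umap-learn package."""
    import umap  # imported lazily so cosine_distance works without umap-learn

    reducer = umap.UMAP(random_state=random_state, **UMAP_PARAMS)
    return reducer.fit_transform(embeddings)


# Scaling a vector changes its length but not its direction,
# so its cosine distance to the original stays at (approximately) zero.
a = [1.0, 2.0, 3.0]
scaled = [10.0 * x for x in a]
orthogonal = [3.0, 0.0, -1.0]  # dot product with `a` is zero

print(round(cosine_distance(a, scaled), 6))
print(round(cosine_distance(a, orthogonal), 6))
```

Two bins whose embeddings point in the same direction are treated as neighbors under this metric even if one vector is much longer than the other.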
Three Levels of Biological Resolution
Use the UMAP Mapping Level selector in the sidebar to switch between three views:
Level | Points | What Each Dot Represents
Genomic Bins | ~12,400 | A raw 131 kb stretch of DNA. Colored by chromosome or by physical position along the chromosome.
Individual Genes | ~25,000 | One dot per RefSeq gene. Computed by averaging the embeddings of all bins that overlap that gene.
Biological Pathways | ~1,200 | One dot per KEGG or Reactome pathway. Computed by averaging the gene-level embeddings of all member genes.
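The two averaging steps behind the gene and pathway levels can be sketched in a few lines of Python. The bin names, gene names, pathway name, and 3-dimensional toy vectors are made up for illustration; the real embeddings have 3,072 dimensions, and the actual pipeline may differ in detail.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Toy 3-dimensional embeddings standing in for the 3,072-dimensional ones.
bin_embeddings = {
    "bin_0": [1.0, 0.0, 2.0],
    "bin_1": [3.0, 2.0, 0.0],
    "bin_2": [0.0, 4.0, 4.0],
}

# Hypothetical gene-to-bin overlaps and pathway memberships.
gene_to_bins = {"GENE_A": ["bin_0", "bin_1"], "GENE_B": ["bin_2"]}
pathway_to_genes = {"PATHWAY_X": ["GENE_A", "GENE_B"]}

# Gene level: mean of the embeddings of all bins overlapping the gene.
gene_embeddings = {
    gene: mean_vector([bin_embeddings[b] for b in bins])
    for gene, bins in gene_to_bins.items()
}

# Pathway level: mean of the gene-level embeddings of all member genes.
pathway_embeddings = {
    pathway: mean_vector([gene_embeddings[g] for g in genes])
    for pathway, genes in pathway_to_genes.items()
}

print(gene_embeddings["GENE_A"])        # [2.0, 1.0, 1.0]
print(pathway_embeddings["PATHWAY_X"])  # [1.0, 2.5, 2.5]
```

Because pathways average gene-level vectors rather than raw bins, a gene that spans many bins still counts once toward its pathway.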
How This Data Was Obtained
1. Genome Tiling: The human reference genome (hg38/GRCh38) was divided into ~12,400 non-overlapping windows of 131,072 base pairs each, the native input size of ALPHAGenome. This covers all 24 chromosomes (22 autosomes plus X and Y).
2. Embedding with ALPHAGenome: Each 131 kb bin was passed through ALPHAGenome, a DNA foundation model trained on vast amounts of genomic sequence data. The model's final-layer representation (a 3,072-dimensional vector) was extracted and saved for each bin. This was computed on GPU using Google Colab and took several hours.
3. UMAP Dimensionality Reduction: The 12,421 × 3,072 embedding matrix was reduced to 2D coordinates using UMAP (cosine metric, n_neighbors=15, min_dist=0.1). Two separate projections were computed: a global UMAP (all chromosomes together) and a per-chromosome local UMAP (for fine-grained exploration).
4. Gene & Pathway Projections: RefSeq gene coordinates (UCSC hg38) were overlapped with the bins. Each gene's embedding was computed as the mean of all overlapping bins. Pathway embeddings (KEGG + Reactome, via MyGene.info) were then computed as the mean of their member genes. The gene and pathway embeddings were each projected into 2D with UMAP independently.
5. Annotation Bundling: Gene functional summaries, KEGG/Reactome pathway memberships, and all UMAP coordinates were serialized into static JSON files bundled with this web page. Everything runs in your browser, with no server or internet connection needed.
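The tiling and gene-overlap steps above can be illustrated with a short Python sketch. It assumes 0-based, half-open coordinates and keeps a shorter final window at the end of a chromosome; the source does not say how chromosome ends were handled, so both choices, along with the function names, are assumptions for this example.

```python
BIN_SIZE = 131_072  # window length in base pairs, as stated in the text


def tile(chrom_length, bin_size=BIN_SIZE):
    """Return non-overlapping half-open (start, end) windows covering a chromosome.

    Assumption: a shorter final window is kept rather than dropped or padded.
    """
    return [
        (start, min(start + bin_size, chrom_length))
        for start in range(0, chrom_length, bin_size)
    ]


def overlapping_bins(gene_start, gene_end, bin_size=BIN_SIZE):
    """Return indices of bins overlapped by the half-open interval [gene_start, gene_end)."""
    first = gene_start // bin_size
    last = (gene_end - 1) // bin_size
    return list(range(first, last + 1))


# A toy 300,000 bp "chromosome" yields two full windows and one shorter one.
windows = tile(300_000)
print(windows)  # [(0, 131072), (131072, 262144), (262144, 300000)]

# A hypothetical gene spanning 100,000-300,000 touches all three windows.
print(overlapping_bins(100_000, 300_000))  # [0, 1, 2]
```

Integer division maps a coordinate straight to its bin index, so finding the bins a gene overlaps needs no search over the window list.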
Data Sources
ALPHAGenome: DNA foundation model that provides the 3,072-dimensional sequence embeddings for each genomic bin.
UCSC RefSeq (hg38): Gene coordinate annotations overlapping each genomic bin.
MyGene.info: Gene functional summaries, KEGG pathway memberships, and Reactome pathway memberships.
UMAP (umap-learn): Dimensionality reduction algorithm used to produce the 2D coordinates from the high-dimensional embeddings.
Plotly.js: Interactive scatter plot rendering in the browser.
Physical Genome Karyotype: Zoom UMAP to highlight corresponding physical regions.