ALPHAGenome UMAP Explorer

What You Are Looking At

Dataset Configuration

Setting	Value	Description
Genomic Bins	~12,400	Non-overlapping 131 kb windows covering the full human genome (hg38/GRCh38)
Embedding Dimensions	3,072	Each bin is described by a 3,072-number vector produced by ALPHAGenome
Bin Resolution	~131 kb	The native input window size of the ALPHAGenome model (131,072 bp)
UMAP Distance Metric	cosine	Measures the angle between embedding vectors, so it captures the direction of an embedding, not its magnitude
UMAP n_neighbors	15	How many nearest neighbors each point considers when building the map; balances local vs. global structure
UMAP min_dist	0.1	Minimum separation between points in 2D; controls how tightly clusters pack
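To make the cosine setting concrete, the minimal sketch below computes cosine distance in plain Python and shows that rescaling a vector leaves the distance unchanged. The toy vectors and the `UMAP_PARAMS` name are illustrative assumptions. The page does not show the actual projection code, although the `umap.UMAP` constructor in umap-learn does accept these three keyword arguments.

```python
import math

# Settings listed above, collected in one place. With umap-learn installed they
# could be passed as umap.UMAP(n_components=2, **UMAP_PARAMS); that call is an
# assumption about how the projection was run, not something the page shows.
UMAP_PARAMS = {"metric": "cosine", "n_neighbors": 15, "min_dist": 0.1}


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


# Toy 3-dimensional vectors; the real embeddings have 3,072 dimensions.
a = [1.0, 2.0, 3.0]
scaled = [10.0 * x for x in a]   # same direction, ten times the magnitude
orthogonal = [3.0, 0.0, -1.0]    # dot product with a is 3 + 0 - 3 = 0

same_direction = cosine_distance(a, scaled)      # approximately 0
right_angle = cosine_distance(a, orthogonal)     # approximately 1
```

Two bins whose embeddings point the same way are therefore treated as neighbors even if one vector is much longer than the other.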

Three Levels of Biological Resolution

Use the UMAP Mapping Level selector in the sidebar to switch between three views:

Level	Points	What Each Dot Represents
Genomic Bins	~12,400	A raw 131 kb stretch of DNA. Colored by chromosome or by physical position along the chromosome.
Individual Genes	~25,000	One dot per RefSeq gene. Computed by averaging the embeddings of all bins that overlap that gene.
Biological Pathways	~1,200	One dot per KEGG or Reactome pathway. Computed by averaging the gene-level embeddings for all member genes.
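The two averaging steps in the table can be sketched as follows. This is a toy illustration under stated assumptions, not the code used to build the page: the gene and pathway names, the bin indices, and the 3-dimensional vectors are all hypothetical stand-ins for the real 3,072-dimensional embeddings.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector to average")
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Hypothetical toy data: bin index -> embedding.
bin_embeddings = {
    0: [1.0, 0.0, 2.0],
    1: [3.0, 2.0, 0.0],
    2: [0.0, 4.0, 4.0],
}
# Hypothetical mappings: which bins each gene overlaps, which genes each pathway contains.
gene_to_bins = {"GENE_A": [0, 1], "GENE_B": [2]}
pathway_to_genes = {"PATHWAY_X": ["GENE_A", "GENE_B"]}

# Gene level: mean over the bins the gene overlaps.
gene_embeddings = {
    gene: mean_vector([bin_embeddings[b] for b in bins])
    for gene, bins in gene_to_bins.items()
}

# Pathway level: mean over the gene-level vectors of the member genes.
pathway_embeddings = {
    pathway: mean_vector([gene_embeddings[g] for g in genes])
    for pathway, genes in pathway_to_genes.items()
}
```

Because the pathway vector is a mean of gene-level means, each member gene contributes equally regardless of how many bins it spans.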

How This Data Was Obtained

1. Genome Tiling: The human reference genome (hg38/GRCh38) was divided into ~12,400 non-overlapping windows of 131,072 base pairs each, the native input size of ALPHAGenome. The windows cover all 24 chromosomes (22 autosomes plus X and Y).

2. Embedding with ALPHAGenome: Each 131 kb bin was passed through ALPHAGenome, a DNA foundation model trained on large amounts of genomic sequence data. The model's final-layer representation (a 3,072-dimensional vector) was extracted and saved for each bin. The embeddings were computed on a GPU in Google Colab, which took several hours.

3. UMAP Dimensionality Reduction: The ~12,400 × 3,072 embedding matrix was reduced to 2D coordinates using UMAP (cosine metric, n_neighbors=15, min_dist=0.1). Two separate projections were computed: a global UMAP (all chromosomes together) and a per-chromosome local UMAP (for fine-grained exploration).

4. Gene & Pathway Projections: RefSeq gene coordinates (UCSC hg38) were intersected with the bins. Each gene's embedding was computed as the mean of the embeddings of all bins it overlaps. Pathway embeddings (KEGG + Reactome, via MyGene.info) were then computed as the mean of the embeddings of their member genes. The gene and pathway embeddings were each projected into 2D with a separate UMAP run.

5. Annotation Bundling: Gene functional summaries, KEGG/Reactome pathway memberships, and all UMAP coordinates were serialized into static JSON files bundled with this web page. Everything runs in your browser; no server or internet connection is needed.
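The tiling and gene-to-bin overlap steps above can be illustrated with a short sketch. It assumes 0-based, half-open coordinates (the convention UCSC uses in its annotation tables) and keeps the final, shorter window at the end of a chromosome. How the actual pipeline handled that last window is not stated, so treat that choice as an assumption. The function names and the toy chromosome length are hypothetical.

```python
BIN_SIZE = 131_072  # window size in bp, as stated in the text


def tile_chromosome(chrom, length, bin_size=BIN_SIZE):
    """Return non-overlapping half-open windows (chrom, start, end) covering [0, length)."""
    return [
        (chrom, start, min(start + bin_size, length))
        for start in range(0, length, bin_size)
    ]


def overlapping_bin_indices(gene_start, gene_end, bin_size=BIN_SIZE):
    """Indices of the fixed-size bins that overlap the half-open interval [gene_start, gene_end)."""
    if gene_end <= gene_start:
        raise ValueError("gene_end must be greater than gene_start")
    first = gene_start // bin_size
    last = (gene_end - 1) // bin_size  # end is exclusive, so step back one base
    return list(range(first, last + 1))


# Toy chromosome of 300,000 bp (not a real hg38 chromosome length).
bins = tile_chromosome("chrToy", 300_000)

# A hypothetical gene that straddles the boundary between bins 0 and 1.
straddling = overlapping_bin_indices(130_000, 140_000)

# A hypothetical gene that exactly fills bin 0 and does not touch bin 1.
contained = overlapping_bin_indices(0, BIN_SIZE)
```

Because every bin has the same width, the overlapping bins follow from integer division alone, with no need for an interval tree or a pairwise scan.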

Data Sources

ALPHAGenome: DNA foundation model that provides the 3,072-dim sequence embeddings for each genomic bin

UCSC RefSeq (hg38): Gene coordinate annotations overlapping each genomic bin

MyGene.info: Gene functional summaries, KEGG pathway memberships, and Reactome pathway memberships

UMAP (umap-learn): Dimensionality reduction algorithm used to produce the 2D coordinates from the high-dimensional embeddings

Plotly.js: Interactive scatter plot rendering in the browser
