MLPerf Training Benchmark Data Download

Available Downloads

DLRM v2 Benchmark

Criteo 4TB multi-hot dataset

Criteo Click Logs dataset preprocessed to create a synthetic multi-hot dataset (~4.0TB)

bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/dlrmv2-preprocessed-criteo-click-logs.uri
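Several of the datasets on this page are multiple terabytes, so it can be worth confirming that the target filesystem has room before starting a download. The following is a minimal sketch of such a check. The `has_free_space` helper is hypothetical and not part of the MLCommons tooling, and the 4096 GB threshold is only an assumption derived from the ~4.0TB size quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the MLCommons tooling): succeeds only if
# the filesystem holding DIR has at least REQUIRED_GB gigabytes available.
has_free_space() {
    local dir="$1" required_gb="$2" avail_kb
    # With -P -k, df prints available space in 1024-byte blocks in column 4.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $(( required_gb * 1024 * 1024 )) ]
}

# Assumed threshold: roughly 4 TB for the Criteo multi-hot dataset.
if has_free_space . 4096; then
    echo "enough free space for the ~4.0TB dataset"
else
    echo "not enough free space in $(pwd)" >&2
fi
```

The check covers only the final dataset size. Any temporary space the downloader may need is not accounted for.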

FLUX.1 Benchmark

CC12M preprocessed dataset

Conceptual 12M preprocessed dataset (~2.4TB)

bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/flux-1-cc12m-preprocessed.uri

COCO preprocessed dataset

Common Objects in Context preprocessed dataset (~65GB)

bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/flux-1-coco-preprocessed.uri

CC12M dataset

Conceptual 12M dataset (~163GB)

bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/flux-1-cc12m-disk.uri

COCO dataset

Common Objects in Context dataset (~3.6GB)

bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/flux-1-coco.uri
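The one-line commands on this page stream the downloader script straight into `bash`. Because `curl -s` suppresses error output, a failed fetch can leave `bash` with an empty script and no visible error. One possible alternative is to save the script to a file first, review it, and then run it. The sketch below assumes that workflow. The `fetch_script` helper is hypothetical and not part of the MLCommons tooling.

```shell
#!/usr/bin/env bash
# fetch_script is a hypothetical helper, not part of the MLCommons tooling.
# It saves a remote script to a local file so it can be reviewed before it runs.
fetch_script() {
    local url="$1" dest="$2"
    # -f: fail on HTTP errors; -sS: quiet, but still print errors; -L: follow redirects
    curl -fsSL -o "$dest" "$url"
}

# Usage (the downloader URL and metadata URI are copied from the command above):
#   fetch_script https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh mlc-r2-downloader.sh
#   less mlc-r2-downloader.sh
#   bash mlc-r2-downloader.sh https://training.mlcommons-storage.org/metadata/flux-1-coco.uri
```

With `-f`, `curl` exits with a non-zero status when the server returns an HTTP error, so a broken link is caught before anything is executed.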

FLUX.1 empty encodings

Pre-computed empty encodings for the FLUX.1 benchmark (~2.1MB)

bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/flux-1-empty-encodings.uri

All FLUX.1 dataset files

CC12M and COCO, both preprocessed and unprocessed, plus the COCO 2014 30K validation set (~2.7TB)

bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d flux_1_datasets https://training.mlcommons-storage.org/metadata/flux-1-datasets.uri

COCO 2014 30K validation set

Subset of 30,000 image-caption pairs from the COCO 2014 validation data (~2.0MB)

bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d flux_1_datasets https://training.mlcommons-storage.org/metadata/flux-1-coco-2014-val-30k.uri

Graph Neural Network Benchmark

IGBH dataset

IGB Heterogeneous dataset for the Graph Neural Network benchmark (~2.4TB)

bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/gnn-igbh-dataset-full.uri

Llama 3.1 8B Benchmark

Llama 3.1 8B preprocessed C4 dataset

C4 dataset preprocessed with the Llama 3.1 8B tokenizer (~86GB)

bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d llama3_1_8b_preprocessed_c4_dataset https://training.mlcommons-storage.org/metadata/llama-3-1-8b-preprocessed-c4-dataset.uri

Llama 3.1 8B tokenizer

Llama 3.1 8B tokenizer (~33GB)

bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d llama3_1_8b_tokenizer https://training.mlcommons-storage.org/metadata/llama-3-1-8b-tokenizer.uri

Llama 3.1 405B Benchmark

C4 dataset preprocessed with Mixtral tokenizer

C4 dataset for the Llama 3.1 405B benchmark, preprocessed with the Mixtral 8x22b tokenizer (~389GB)

bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d c4/mixtral_8x22b_preprocessed https://training.mlcommons-storage.org/metadata/mixtral-8x22b-preprocessed-c4-dataset.uri

Mixtral 8x22b tokenizer

Mixtral 8x22b tokenizer used to preprocess the C4 dataset (~2.8MB)

bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d mixtral_8x22b_tokenizer https://training.mlcommons-storage.org/metadata/mixtral-8x22b-tokenizer.uri

C4 full dataset unzipped

Full C4 fileset, unzipped, including all raw train and validation files (~842GB)

bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d ./ https://training.mlcommons-storage.org/metadata/c4-full-dataset-unzipped.uri

C4 validation dataset zipped

Customized C4 validation dataset, zipped (~82MB)

bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/c4-validation-dataset-zipped.uri

C4 train and eval datasets

C4 train and eval datasets (~842GB)

bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d c4/original https://training.mlcommons-storage.org/metadata/c4-train-and-eval-datasets.uri

Mixtral 8x22b Benchmark (Retired)

Mixtral 8x22b v0.1 FSDP checkpoint

Mixtral 8x22b v0.1 FSDP checkpoint for tensor_parallelism=1 (~282GB)

bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/mixtral-8x22b-v0-1-fsdp-checkpoint.uri

Mixtral 8x22b v0.1 2D FSDP TP checkpoint

Mixtral 8x22b v0.1 2D FSDP TP checkpoint for tensor_parallelism>1 (~282GB)

bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/mixtral-8x22b-v0-1-2d-fsdp-tp-checkpoint.uri

Mixtral 8x22b Docker images

Docker images for setting up the Mixtral 8x22b benchmark environment (~24GB)

bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/mixtral-8x22b-docker-images.uri