MLPerf Training Benchmark Data Download
Available Downloads
DLRM v2 Benchmark
Criteo 4TB multi-hot dataset
Criteo Click Logs dataset preprocessed to create a synthetic multi-hot dataset (~4.0TB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/dlrmv2-preprocessed-criteo-click-logs.uri
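Several of the downloads on this page run to terabytes, so it can help to confirm free disk space before starting one. The following is a minimal sketch, not part of mlc-r2-downloader.sh: `has_space` is a hypothetical helper name, and the 4096 GiB threshold is only an assumption derived from the ~4.0TB figure listed above.

```shell
#!/usr/bin/env bash
# Sketch: check that the target filesystem has room before a large download.
# has_space is a hypothetical helper, not part of mlc-r2-downloader.sh.

has_space() {
  local dir="$1" need_gib="$2"
  local avail_kib
  # df -Pk prints POSIX-format output in 1024-byte blocks;
  # the 4th column of the second line is the available space.
  avail_kib=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
  # Compare in KiB so everything stays in integer arithmetic.
  [ "$avail_kib" -ge $(( need_gib * 1024 * 1024 )) ]
}

# Assumed threshold: ~4.0TB listed for the Criteo multi-hot dataset.
if has_space . 4096; then
  echo "Enough free space in $(pwd) for the download."
else
  echo "Less than 4096 GiB free in $(pwd); pick another directory." >&2
fi
```

The check only looks at the filesystem holding the given directory, so run it from (or point it at) the directory where the files will actually land.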
FLUX.1 Benchmark
CC12M preprocessed dataset
Conceptual 12M preprocessed dataset (~2.4TB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/flux-1-cc12m-preprocessed.uri
COCO preprocessed dataset
Common Objects in Context preprocessed dataset (~65GB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/flux-1-coco-preprocessed.uri
CC12M dataset
Conceptual 12M dataset (~163GB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/flux-1-cc12m-disk.uri
COCO dataset
Common Objects in Context dataset (~3.6GB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/flux-1-coco.uri
FLUX.1 empty encodings
Pre-computed empty encodings for the FLUX.1 benchmark (~2.1MB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/flux-1-empty-encodings.uri
All FLUX.1 dataset files
CC12M and COCO, both processed and unprocessed, as well as the COCO 2014 30K validation set (~2.7TB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d flux_1_datasets https://training.mlcommons-storage.org/metadata/flux-1-datasets.uri
COCO 2014 30K validation set
Subset of 30,000 image-caption pairs from the COCO 2014 validation data (~2.0MB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d flux_1_datasets https://training.mlcommons-storage.org/metadata/flux-1-coco-2014-val-30k.uri
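Every command on this page follows the same pattern: fetch mlc-r2-downloader.sh with curl, run it with bash, and pass a `.uri` metadata URL, optionally preceded by `-d` and a directory name. For anyone who prefers to review the script before running it, the sketch below saves a local copy and builds the metadata URL from a dataset name. `fetch_downloader` and `dataset_uri` are hypothetical helper names; the sketch assumes the script behaves the same when run from a local file as through process substitution, and that `-d` selects the download directory, as its use above suggests.

```shell
#!/usr/bin/env bash
# Sketch: keep a local, reviewable copy of the downloader script.
# fetch_downloader and dataset_uri are hypothetical helper names.

SCRIPT_URL="https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh"
METADATA_BASE="https://training.mlcommons-storage.org/metadata"

fetch_downloader() {
  # -f fails on HTTP errors, -sS is quiet but still reports errors,
  # -L follows redirects, -o writes to the given local path.
  curl -fsSL -o "$1" "$SCRIPT_URL"
}

dataset_uri() {
  # Print the metadata URL for a dataset name as it appears on this page.
  printf '%s/%s.uri\n' "$METADATA_BASE" "$1"
}

# Usage (commented out so that sourcing this file downloads nothing):
#   fetch_downloader ./mlc-r2-downloader.sh
#   less ./mlc-r2-downloader.sh        # review before running
#   bash ./mlc-r2-downloader.sh -d flux_1_datasets "$(dataset_uri flux-1-coco-2014-val-30k)"
```

Keeping a local copy also avoids re-fetching the script for each dataset, at the cost of not picking up later changes to the script automatically.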
Graph Neural Network Benchmark
IGBH dataset
IGB Heterogeneous dataset for the Graph Neural Network benchmark (~2.4TB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/gnn-igbh-dataset-full.uri
Llama 3.1 8B Benchmark
Llama 3.1 8B preprocessed C4 dataset
C4 dataset preprocessed with the Llama 3.1 8B tokenizer (~86GB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d llama3_1_8b_preprocessed_c4_dataset https://training.mlcommons-storage.org/metadata/llama-3-1-8b-preprocessed-c4-dataset.uri
Llama 3.1 8B tokenizer
Llama 3.1 8B tokenizer (~33GB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d llama3_1_8b_tokenizer https://training.mlcommons-storage.org/metadata/llama-3-1-8b-tokenizer.uri
Llama 3.1 405B Benchmark
C4 dataset preprocessed with Mixtral tokenizer
C4 dataset for the Llama 3.1 405B benchmark, preprocessed with the Mixtral tokenizer (~389GB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d c4/mixtral_8x22b_preprocessed https://training.mlcommons-storage.org/metadata/mixtral-8x22b-preprocessed-c4-dataset.uri
Mixtral 8x22b tokenizer
Mixtral 8x22b tokenizer used to preprocess the C4 dataset (~2.8MB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d mixtral_8x22b_tokenizer https://training.mlcommons-storage.org/metadata/mixtral-8x22b-tokenizer.uri
C4 full dataset unzipped
Full C4 fileset, unzipped, including all raw train and validation files (~842GB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d ./ https://training.mlcommons-storage.org/metadata/c4-full-dataset-unzipped.uri
C4 validation dataset zipped
Customized C4 validation dataset, zipped (~82MB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/c4-validation-dataset-zipped.uri
C4 train and eval datasets
C4 train and eval datasets (~842GB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d c4/original https://training.mlcommons-storage.org/metadata/c4-train-and-eval-datasets.uri
Mixtral 8x22b Benchmark (Retired)
Mixtral 8x22b v0.1 FSDP checkpoint
Mixtral 8x22b v0.1 FSDP checkpoint for tensor_parallelism=1 (~282GB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/mixtral-8x22b-v0-1-fsdp-checkpoint.uri
Mixtral 8x22b v0.1 2D FSDP TP checkpoint
Mixtral 8x22b v0.1 2D FSDP TP checkpoint for tensor_parallelism>1 (~282GB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/mixtral-8x22b-v0-1-2d-fsdp-tp-checkpoint.uri
Mixtral 8x22b Docker images
Docker images for setting up the Mixtral 8x22b benchmark environment (~24GB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/mixtral-8x22b-docker-images.uri