MLPerf Training Benchmark Data Download
Available Downloads
DLRM v2 Benchmark
Criteo 4TB multi-hot dataset (reference format)
Criteo Click Logs dataset preprocessed to create a synthetic multi-hot dataset in reference format (~4.0TB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/dlrmv2-preprocessed-criteo-click-logs-reference.uri
Criteo 4TB multi-hot dataset (HugeCTR format)
Criteo Click Logs dataset preprocessed to create a synthetic multi-hot dataset in HugeCTR format (~4.0TB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/dlrmv2-preprocessed-criteo-click-logs.uri
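Each DLRM v2 download is roughly 4.0TB, so it may be worth confirming free disk space before starting. The sketch below is one way to do that. `has_free_space` is a hypothetical helper, not part of the MLCommons downloader, and the 4096 threshold assumes the listed size can be treated as about 4 TiB.

```shell
# Hypothetical helper (not part of mlc-r2-downloader.sh):
# succeed if the filesystem holding DIR has at least REQUIRED_GB GiB available.
has_free_space() {
  local dir="$1" required_gb="$2"
  local avail_kb
  # df -P prints POSIX-format output; on the second line, field 4 is the
  # available space in 1K blocks (because of -k).
  avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
  # Compare in KiB: 1 GiB = 1024 * 1024 KiB.
  [ "$avail_kb" -ge $(( required_gb * 1024 * 1024 )) ]
}

# Example use (assumption: the download lands in the current directory):
#   has_free_space . 4096 || echo "Not enough free space for a ~4.0TB download" >&2
```

Running the check in the directory where the download will land keeps the result tied to the right filesystem.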
FLUX.1 Benchmark
CC12M preprocessed dataset
Conceptual 12M preprocessed dataset (~2.4TB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/flux-1-cc12m-preprocessed.uri
COCO preprocessed dataset
Common Objects in Context preprocessed dataset (~65GB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/flux-1-coco-preprocessed.uri
CC12M dataset
Conceptual 12M dataset (~163GB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/flux-1-cc12m-disk.uri
COCO dataset
Common Objects in Context dataset (~3.6GB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/flux-1-coco.uri
FLUX.1 empty encodings
Pre-computed empty encodings for the FLUX.1 benchmark (~2.1MB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/flux-1-empty-encodings.uri
All FLUX.1 dataset files
CC12M and COCO, both processed and unprocessed, as well as the COCO 2014 30K validation set (~2.7TB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d flux_1_datasets https://training.mlcommons-storage.org/metadata/flux-1-datasets.uri
COCO 2014 30K validation set
Subset of 30,000 image-caption pairs from the COCO 2014 validation data (~2.0MB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d flux_1_datasets https://training.mlcommons-storage.org/metadata/flux-1-coco-2014-val-30k.uri
Graph Neural Network Benchmark
IGBH dataset
IGB Heterogeneous dataset for the Graph Neural Network benchmark (~2.4TB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/gnn-igbh-dataset-full.uri
Llama 3.1 8B Benchmark
Llama 3.1 8B preprocessed C4 dataset
C4 dataset preprocessed with the Llama 3.1 8B tokenizer (~86GB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d llama3_1_8b_preprocessed_c4_dataset https://training.mlcommons-storage.org/metadata/llama-3-1-8b-preprocessed-c4-dataset.uri
Llama 3.1 8B tokenizer
Llama 3.1 8B tokenizer (~33GB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d llama3_1_8b_tokenizer https://training.mlcommons-storage.org/metadata/llama-3-1-8b-tokenizer.uri
Llama 3.1 405B Benchmark
C4 dataset preprocessed with Mixtral tokenizer
C4 dataset for the Llama 3.1 405B benchmark, preprocessed with the Mixtral tokenizer (~389GB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d c4/mixtral_8x22b_preprocessed https://training.mlcommons-storage.org/metadata/mixtral-8x22b-preprocessed-c4-dataset.uri
Mixtral 8x22b tokenizer
Mixtral 8x22b tokenizer used to preprocess the C4 dataset (~2.8MB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d mixtral_8x22b_tokenizer https://training.mlcommons-storage.org/metadata/mixtral-8x22b-tokenizer.uri
C4 full dataset unzipped
Full C4 fileset, unzipped, including all raw train and validation files (~842GB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d ./ https://training.mlcommons-storage.org/metadata/c4-full-dataset-unzipped.uri
C4 validation dataset zipped
Customized C4 validation dataset, zipped (~82MB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/c4-validation-dataset-zipped.uri
C4 train and eval datasets
C4 train and eval datasets (~842GB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d c4/original https://training.mlcommons-storage.org/metadata/c4-train-and-eval-datasets.uri
BERT Benchmark (Retired)
BERT input files
Checkpoint, config, and other input files for the BERT benchmark (~24GB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/bert-input-files.uri
BERT preprocessed Wikipedia dataset
Wikipedia dataset preprocessed for the BERT benchmark (~9.7GB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/bert-preprocessed-wikipedia-dataset.uri
GPT-3 Megatron Benchmark (Retired)
GPT-3 Megatron preprocessed dataset
C4 dataset preprocessed for the GPT-3 Megatron benchmark (~89GB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/gpt-3-megatron-preprocessed-dataset.uri
GPT-3 Megatron FP32 checkpoint
FP32 checkpoint for the GPT-3 Megatron benchmark (~2.1TB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/gpt-3-megatron-fp32-checkpoint.uri
GPT-3 Megatron BF16 checkpoint
BF16 checkpoint for the GPT-3 Megatron benchmark (~2.5TB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/gpt-3-megatron-bf16-checkpoint.uri
Mixtral 8x22b Benchmark (Retired)
Mixtral 8x22b v0.1 FSDP checkpoint
Mixtral 8x22b v0.1 FSDP checkpoint for tensor_parallelism=1 (~282GB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/mixtral-8x22b-v0-1-fsdp-checkpoint.uri
Mixtral 8x22b v0.1 2D FSDP TP checkpoint
Mixtral 8x22b v0.1 2D FSDP TP checkpoint for tensor_parallelism>1 (~282GB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/mixtral-8x22b-v0-1-2d-fsdp-tp-checkpoint.uri
Mixtral 8x22b Docker images
Docker images for setting up the Mixtral 8x22b benchmark environment (~24GB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/mixtral-8x22b-docker-images.uri
Stable Diffusion Benchmark (Retired)
Stable Diffusion LAION 400M filtered moments dataset
LAION 400M preprocessed moments for the Stable Diffusion benchmark (~897GB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d datasets/laion-400m/webdataset-moments-filtered https://training.mlcommons-storage.org/metadata/stable-diffusion-laion-400m-filtered-moments-dataset.uri
Stable Diffusion LAION 400M filtered images dataset
LAION 400M raw images for the Stable Diffusion benchmark (~300GB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d datasets/laion-400m/webdataset-filtered https://training.mlcommons-storage.org/metadata/stable-diffusion-laion-400m-filtered-images-dataset.uri
Stable Diffusion COCO 2014 validation prompts dataset
COCO 2014 validation prompts dataset for the Stable Diffusion benchmark (~2.1MB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d datasets/coco2014 https://training.mlcommons-storage.org/metadata/stable-diffusion-coco2014-validation-prompts-dataset.uri
Stable Diffusion COCO 2014 validation stats dataset
COCO 2014 validation stats dataset for the Stable Diffusion benchmark (~33MB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d datasets/coco2014 https://training.mlcommons-storage.org/metadata/stable-diffusion-coco2014-validation-stats-dataset.uri
Stable Diffusion SD checkpoint
StabilityAI's 512-base-ema.ckpt checkpoint for the Stable Diffusion benchmark (~5.3GB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d checkpoints/sd https://training.mlcommons-storage.org/metadata/stable-diffusion-sd-checkpoint.uri
Stable Diffusion Inception checkpoint
Inception checkpoint for the Stable Diffusion benchmark (~96MB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d checkpoints/inception https://training.mlcommons-storage.org/metadata/stable-diffusion-inception-checkpoint.uri
Stable Diffusion CLIP checkpoint
CLIP checkpoint for the Stable Diffusion benchmark (~3.9GB)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d checkpoints/clip https://training.mlcommons-storage.org/metadata/stable-diffusion-clip-checkpoint.uri
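The seven Stable Diffusion commands above differ only in the `-d` argument and the metadata file name, so they could be generated from a small table. The sketch below only prints the commands and downloads nothing. `sd_download_cmd`, `sd_plan`, and `SD_DOWNLOADS` are hypothetical names. It also assumes that `-d` selects the destination directory, and that the downloader behaves the same when run from a locally saved copy as it does through `bash <(curl ...)`.

```shell
METADATA_BASE="https://training.mlcommons-storage.org/metadata"

# "<destination dir> <metadata file name>" pairs, copied from the commands above.
SD_DOWNLOADS=(
  "datasets/laion-400m/webdataset-moments-filtered stable-diffusion-laion-400m-filtered-moments-dataset"
  "datasets/laion-400m/webdataset-filtered stable-diffusion-laion-400m-filtered-images-dataset"
  "datasets/coco2014 stable-diffusion-coco2014-validation-prompts-dataset"
  "datasets/coco2014 stable-diffusion-coco2014-validation-stats-dataset"
  "checkpoints/sd stable-diffusion-sd-checkpoint"
  "checkpoints/inception stable-diffusion-inception-checkpoint"
  "checkpoints/clip stable-diffusion-clip-checkpoint"
)

# Hypothetical helper: print one downloader invocation (nothing is executed).
sd_download_cmd() {
  local script="$1" dest="$2" name="$3"
  printf 'bash %s -d %s %s/%s.uri\n' "$script" "$dest" "$METADATA_BASE" "$name"
}

# Hypothetical helper: print the invocation for every entry in SD_DOWNLOADS.
sd_plan() {
  local script="$1" entry
  for entry in "${SD_DOWNLOADS[@]}"; do
    # Intentional word splitting: each entry holds exactly two space-free fields.
    # shellcheck disable=SC2086
    sd_download_cmd "$script" $entry
  done
}
```

One possible workflow, under the same assumptions: save the downloader with `curl -s -o mlc-r2-downloader.sh` followed by the script URL used above, review it, run `sd_plan ./mlc-r2-downloader.sh` to inspect the generated commands, and pipe that output to `bash` once it looks right.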