overview
What is GPIC?
GPIC, officially "A Giant Permissive Image Corpus for Visual Generation," is a large-scale image-text dataset developed by Stanford University. It was introduced by Stanford's vision lab in a publication that appeared on arXiv around May 29, 2026. The dataset consists of permissively licensed image-text pairs with VLM-generated captions, split into 100 million training, 200,000 validation, and 1 million test examples, and it provides approximately 28 trillion pixels across those splits. Its primary purpose is to give researchers and developers in visual generative modeling a stable, accessible resource for training and benchmarking, in support of open and reproducible research.
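To put these figures in perspective, the short sketch below derives a rough average image size from the numbers quoted above. It assumes the 28 trillion pixel figure covers all three splits combined, which the overview does not state explicitly, and the variable names are purely illustrative rather than part of any GPIC tooling.

```python
import math

# Split sizes as quoted in the overview.
TRAIN_EXAMPLES = 100_000_000
VALIDATION_EXAMPLES = 200_000
TEST_EXAMPLES = 1_000_000

# Approximate pixel count as quoted in the overview.
# Assumption: this total spans all three splits.
TOTAL_PIXELS = 28e12

total_examples = TRAIN_EXAMPLES + VALIDATION_EXAMPLES + TEST_EXAMPLES
avg_pixels_per_image = TOTAL_PIXELS / total_examples

# Side length of a square image holding the same number of pixels.
# Illustrative only: real images vary in size and aspect ratio.
equivalent_square_side = math.sqrt(avg_pixels_per_image)

print(f"Total examples: {total_examples:,}")
print(f"Average pixels per image: {avg_pixels_per_image:,.0f}")
print(f"Equivalent square side: {equivalent_square_side:.0f} px")
```

Under that assumption, the average comes out to roughly 277,000 pixels per image, about the area of a 526 x 526 square. This is a back-of-the-envelope estimate, not a statement about the actual resolution distribution of the dataset.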