Introducing CLIP: A Vision-Language Model
We’re introducing a neural network called CLIP which efficiently learns visual concepts from natural language supervision. CLIP can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the “zero-shot” capabilities of GPT-2 and GPT-3. Although deep learning has revolutionized computer vision, current approaches have several major problems: typical vision datasets are labor-intensive and costly to create while teaching only a narrow set of visual concepts; standard vision models are good at one task and one task only, and require significant effort to adapt to a new task; and models that perform well on benchmarks have disappointingly poor performance on stress tests, casting doubt on the entire deep learning approach to computer vision.
We present a neural network that aims to address these problems: it is trained on a wide variety of images with a wide variety of natural language supervision that is abundantly available on the internet. By design, the network can be instructed in natural language to perform a great variety of classification benchmarks without directly optimizing for any benchmark’s performance. This is a key change: because the benchmark is never optimized for directly, performance on it becomes a much more representative measure of real-world capability. Our system closes the resulting “robustness gap” (the drop from benchmark accuracy to stress-test accuracy) by up to 75% while matching the ImageNet accuracy of the original ResNet-50 zero-shot, without using any of the 1.28 million labeled examples that ResNet-50 was trained on.
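Concretely, “instructing the network in natural language” amounts to supplying a list of category names wrapped in a short prompt. Below is a minimal sketch of zero-shot classification using the open-source CLIP package released alongside the model (github.com/openai/CLIP); the class names, prompt template, and image path are illustrative placeholders, not values from the article.

```python
# Requires: torch, Pillow, and the "clip" package from github.com/openai/CLIP
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# The "classifier" is nothing more than a list of category names wrapped in a prompt.
class_names = ["dog", "cat", "bird"]  # hypothetical labels
texts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

# Hypothetical input image
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    # Similarities between the image embedding and each text embedding,
    # scaled by the learned temperature, then turned into probabilities.
    logits_per_image, _ = model(image, texts)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

for name, p in zip(class_names, probs[0]):
    print(f"{name}: {p:.3f}")
```

Swapping in a different benchmark is just a matter of changing `class_names`; no retraining or labeled examples from that benchmark are needed.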
CLIP was designed to mitigate these problems: costly datasets, narrow task performance, and poor real-world performance. CLIP learns from text–image pairs that are already publicly available on the internet, reducing the need for expensive labeled datasets. It can be adapted to a wide variety of visual classification tasks without needing additional training examples. And because it is evaluated zero-shot rather than fit to any one benchmark, it cannot “cheat” by overfitting to benchmark-specific training data, so it holds up better in real-world scenarios than traditional benchmark-tuned models.
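The pre-training objective over those text–image pairs is contrastive: given a batch of pairs, the model learns to predict which caption goes with which image. Here is a simplified PyTorch sketch of that symmetric loss, following the pseudocode in the CLIP paper; the encoders that produce `image_features` and `text_features` are assumed to exist elsewhere.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          logit_scale: torch.Tensor) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of N aligned image-text pairs.

    image_features, text_features: [N, d] embeddings from the two encoders.
    logit_scale: learned temperature, stored as a log-scale scalar.
    """
    # L2-normalize so the dot products below are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarities scaled by the learned temperature: [N, N]
    logits_per_image = logit_scale.exp() * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # The matching caption for image i sits at column i (and vice versa).
    labels = torch.arange(image_features.size(0), device=image_features.device)

    # Symmetric cross-entropy: pick the right caption for each image
    # and the right image for each caption.
    loss_i = F.cross_entropy(logits_per_image, labels)
    loss_t = F.cross_entropy(logits_per_text, labels)
    return (loss_i + loss_t) / 2

# Example with random embeddings as stand-ins for real encoder outputs:
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
scale = torch.tensor(2.659)  # ln(1/0.07), the initialization used in the paper
print(clip_contrastive_loss(img, txt, scale))
```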
Key takeaways:
- CLIP is highly efficient: It learns from unfiltered, highly varied, and highly noisy data, and is intended to be used in a zero-shot manner.
- CLIP is flexible and general: It can perform many different tasks zero-shot, which we validated on over 30 different existing datasets.
Limitations: CLIP struggles with more abstract or systematic tasks, such as counting the objects in an image, and with very fine-grained classification. It also generalizes poorly to images that are not well covered by its pre-training data.
Broader impacts: CLIP lets people design their own classifiers simply by specifying class names, removing the need for task-specific training data; however, the way those classes are designed can strongly influence both model performance and model biases. Because custom classifiers are so easy to build, CLIP also raises privacy- and surveillance-related concerns.
Conclusion: With CLIP, we’ve tested whether the task-agnostic pre-training on internet-scale natural language that has driven recent progress in NLP can also improve the performance of deep learning in other fields, such as computer vision. CLIP learns a wide variety of tasks during pre-training, which we demonstrate via zero-shot transfer, and our findings suggest that zero-shot evaluation is a more representative measure of a model’s real-world capability.
The original article: https://openai.com/research/clip