Computer vision has made unprecedented progress in recent years. Judging by the trajectory of Natural Language Processing (NLP), one thing seems clear: neural networks will keep growing in size, and so will their capabilities. Nonetheless, as the size and complexity of these models continue to expand, adapting them to novel tasks and domains presents significant challenges that differ from those faced in the NLP community.

The goal of this workshop is to explore and discuss ways of dealing with the new reality of ever larger models in computer vision. The sheer parameter counts and training dataset sizes mean that these models often cannot be trained in academia, and some models might not even fit on large GPUs for inference. These developments bring not only new challenges for computer vision researchers and practitioners but also many novel opportunities. In this workshop, we aim to bring together researchers from academia and industry to discuss topics of increasing importance for the vision community.

The workshop, BigMAC: Big Model Adaptation for Computer Vision, will cover topics related to how large pretrained models can be used effectively:

  • Prompting methods and techniques for vision models
  • New methodologies for fine-tuning pretrained models
  • Leveraging multi-modal weak-supervision techniques
  • Scaling and fine-tuning with self-supervision
  • Robust fine-tuning of large, generally pretrained models
  • Quantization and efficiency

This is the first iteration of the BigMAC event and does NOT have a call for papers, which might be added in future iterations. The workshop is organized as a half-day event with oral talks from the invited speakers.


The workshop is a half-day event. It will consist of a series of invited talks from leading experts in academia and industry.


The workshop will take place on October 2nd, 9am-1pm, in the Paris Convention Center, Room S03. Times are Paris time.

Recording available here

9:00-9:15 Welcome (Yuki M. Asano)

Scale has arrived in vision, stand-alone and via Visual Language Models. What to do?

9:15-9:45 Neil Houlsby

LLMs are increasingly being augmented with visual capabilities. This progress is, in part, driven by improvements in scalable and general visual pre-training. In this talk, I will present some recent advances in visual pre-training with a focus on Vision Transformers, and the results of applying such models in VLMs. These advances tackle scaling ViTs to unprecedented sizes, endowing them with the ability to flexibly handle any images, and unlocking the efficacy of sparsity.

9:45-10:15 Maria Attarian

The goal of the field of robotics is to create physical systems that can operate dexterously in dynamic real-world environments. This requires handling ambiguous task descriptions and creating action plans that take into account the physical constraints of the embodiment and environment. Large pretrained multimodal models contain a wealth of world knowledge, which makes them a promising backbone for such robotic systems. However, despite their strengths, these models don’t understand physical dynamics and do not natively produce robotic control outputs. In this talk, we describe these strengths and weaknesses in detail and discuss how to make these models more suitable for robotic planning and control.

10:15-10:45 Samir Gadre (for Ludwig Schmidt)


10:45-11:00 Break

11:00-11:30 Ishan Misra

In this talk, I’ll show how self-supervised learning can be used to improve foundational multimodal models so that they scale to more modalities, learn better representations, and are more efficient. A big challenge when training foundational multimodal models is the scarcity of paired data. While there are copious amounts of (image, text) data, other modalities such as depth or IMU have limited dataset sizes overall. Our first work in this direction, called ImageBind, shows that images can be used as a universal signal to “bind” multiple different modalities. We show that naturally co-occurring image pairings such as (image, IMU) and (image, depth) can be automatically used to learn a shared embedding space where unseen pairs of modalities are aligned. ImageBind enables emergent zero-shot recognition, cross-modal retrieval, and generation. Our second work shows that using self-supervised learning as a “pre” pre-training stage improves multimodal (image, text) representations across a wide range of model sizes and data sizes. Pre pre-training improves the performance of foundational models with billions of parameters trained on billions of images while also speeding up their convergence. The resulting models show state-of-the-art performance for full finetuning, linear probing, and zero-shot recognition tasks in image and video domains.

11:30-12:00 Aditi Raghunathan

Pretrained models provide strong feature representations which can be adapted to downstream tasks via fine-tuning. However, in this talk, we show that standard fine-tuning procedures can cause feature distortion: correspondences between in-distribution and out-of-distribution features are weakened. As a result, fine-tuned models are not “maximally” robust to distribution shifts. How do we devise better fine-tuning procedures that minimize such distortion? Via theoretical constructions, we provide simple, scalable and effective modifications to the fine-tuning process that vastly improve the accuracy and robustness of fine-tuned models. Overall, this talk highlights the importance of various factors often overlooked in the fine-tuning process in effectively preserving pretrained knowledge.

12:00-12:30 Sayak Paul

Large-scale text-to-image diffusion models like DALL-E 2, Imagen, and Stable Diffusion have been quite successful at the task of text-to-image generation. However, their text-only conditional control offers limited flexibility and controllability to the end users. To elevate that degree of freedom, we need ways to condition the generation process better. This talk will discuss some of the most promising and effective approaches to controlling text-to-image diffusion models. The approaches will be a mix of both training-time and inference-time techniques.

12:30-1:00 Carl Vondrick

Computer vision algorithms need to combine many skills (spatial, physical, mathematical, geometrical, and cognitive) in order to accurately analyze the visual world. In this talk, I will show how code synthesis equips neural networks with these skills, thereby providing versatile representations for answering questions and recognizing objects. Through a series of experimental results, I will moreover show how this approach naturally provides inherent explainability of the decision-making process, while also achieving state-of-the-art zero-shot performance across different tasks and benchmarks.

1:00 Closing remarks