Summary

Computer vision has seen unprecedented progress in recent years. Comparing this with Natural Language Processing (NLP), one thing seems clear: the size of neural networks will keep growing, and so will the capabilities of these models. Nonetheless, as the size and complexity of these models continue to expand, adapting them to novel tasks and domains presents significant challenges, ones that differ from those faced by the NLP community.

The goal of this workshop is to explore and discuss ways of dealing with the new reality of ever larger models in computer vision. The sheer parameter counts and training dataset sizes mean that these models often cannot be trained in academia, and some might not even fit on large GPUs for inference. These developments bring not only new challenges for computer vision researchers and practitioners but also many novel opportunities. In this workshop, we aim to bring together researchers from academia and industry to discuss topics of increasing importance to the vision community.

The workshop, BigMAC: Big Model Adaptation for Computer Vision, will cover topics related to how large pretrained models can be used effectively:

  • Prompting methods and techniques for vision models
  • New methodologies for fine-tuning pretrained models
  • Leveraging multi-modal weak-supervision techniques
  • Scaling and fine-tuning with self-supervision
  • Fine-tuning large general-purpose pretrained models for robustness
  • Quantization and efficiency

This is the first iteration of the BigMAC workshop and it does NOT have a call for papers; one might be added in future iterations. The workshop is organized as a half-day event with oral talks from the invited speakers.

Format

The workshop is a half-day event. It will consist of a series of invited talks on these topics from leading experts in academia and industry.

Schedule

The workshop will take place on October 2nd, 9am-1pm, in the Paris Convention Center, Room X. Times are Paris time.

9:00-9:15 Welcome

TBD

9:15-9:45 Neil Houlsby

LLMs are increasingly being augmented with visual capabilities. This progress is, in part, driven by improvements in scalable and general visual pre-training. In this talk, I will present some recent advances in visual pre-training with a focus on Vision Transformers, and the results of applying such models in VLMs. These advances tackle scaling ViTs to unprecedented sizes, endowing them with the ability to flexibly handle any images, and unlocking the efficacy of sparsity.
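
For context on the last point, one common way to introduce sparsity into large Vision Transformers is a mixture-of-experts layer that routes each token to only a small subset of expert MLPs. The PyTorch sketch below is a toy illustration of top-1 token routing, not any specific model from the talk; the class name, dimensions, and expert count are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    """Toy sparse mixture-of-experts layer: each token is sent to one expert MLP."""

    def __init__(self, dim: int = 256, num_experts: int = 4, hidden: int = 1024):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)  # produces routing logits per token
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
             for _ in range(num_experts)]
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, dim) flattened to (batch * seq, dim)
        flat = tokens.reshape(-1, tokens.shape[-1])
        gates = F.softmax(self.router(flat), dim=-1)   # (N, num_experts)
        weight, expert_idx = gates.max(dim=-1)         # top-1 routing decision
        out = torch.zeros_like(flat)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():                             # only run the chosen expert
                out[mask] = weight[mask, None] * expert(flat[mask])
        return out.reshape_as(tokens)

# Example: route 196 patch tokens for a batch of 2 "images"
layer = Top1MoE()
print(layer(torch.randn(2, 196, 256)).shape)  # torch.Size([2, 196, 256])
```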

9:45-10:15 Maria Attarian

TBD

10:15-10:45 Ludwig Schmidt

TBD

10:45-11:00 Break

11:00-11:30 Ishan Misra

TBD

11:30-12:00 Aditi Raghunathan

Pretrained models provide strong feature representations which can be adapted to downstream tasks via fine-tuning. However, in this talk, we show that standard fine-tuning procedures can cause feature distortion: correspondences between in-distribution and out-of-distribution features are weakened. As a result, fine-tuned models are not “maximally” robust to distribution shifts. How do we devise better fine-tuning procedures that minimize such distortion? Via theoretical constructions, we provide simple, scalable and effective modifications to the fine-tuning process that vastly improve the accuracy and robustness of fine-tuned models. Overall, this talk highlights the importance of various factors often overlooked in the fine-tuning process in effectively preserving pretrained knowledge.
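
One modification in this spirit that appears in the robust fine-tuning literature is to first train only the head on frozen features (linear probing) and only then unfreeze the backbone, so a randomly initialized head does not distort the pretrained features early in training. The sketch below is a minimal, hypothetical PyTorch illustration of that two-stage recipe, not necessarily the exact procedure covered in the talk; the backbone, data loader, and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

def two_stage_finetune(backbone: nn.Module, head: nn.Linear, loader,
                       probe_epochs: int = 5, finetune_epochs: int = 5):
    """Linear-probe the head on frozen features, then fine-tune end to end.

    A toy sketch: `backbone(x)` is assumed to return a feature vector per example,
    and `loader` to yield (inputs, labels) batches.
    """
    loss_fn = nn.CrossEntropyLoss()

    # Stage 1: freeze the backbone, train only the head on frozen features.
    for p in backbone.parameters():
        p.requires_grad_(False)
    opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
    for _ in range(probe_epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(head(backbone(x)), y).backward()
            opt.step()

    # Stage 2: unfreeze everything and fine-tune with a small learning rate,
    # starting from the head learned in stage 1.
    for p in backbone.parameters():
        p.requires_grad_(True)
    opt = torch.optim.AdamW(list(backbone.parameters()) + list(head.parameters()), lr=1e-5)
    for _ in range(finetune_epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(head(backbone(x)), y).backward()
            opt.step()
    return backbone, head
```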

12:00-12:30 Sayak Paul

Large-scale text-to-image diffusion models like DALL-E 2, Imagen, and Stable Diffusion have been quite successful at the task of text-to-image generation. However, their text-only conditional control offers limited flexibility and controllability to the end users. To elevate that degree of freedom, we need ways to condition the generation process better. This talk will discuss some of the most promising and effective approaches to controlling text-to-image diffusion models. The approaches will be a mix of both training-time and inference-time techniques.
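
As one concrete example, a widely used training-time approach is ControlNet, which adds an auxiliary network so the diffusion model can be conditioned on a structural input such as an edge map in addition to the text prompt. Below is a minimal sketch using the diffusers library; the model identifiers, file paths, and prompt are assumptions rather than details from the talk.

```python
import cv2
import numpy as np
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image
from PIL import Image

# Condition the generation on Canny edges extracted from a reference image.
reference = load_image("reference.png")                  # placeholder path
edges = cv2.Canny(np.array(reference), 100, 200)
edges = Image.fromarray(np.stack([edges] * 3, axis=-1))  # 3-channel edge map

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a watercolor painting of a city street",
    image=edges,                    # structural condition alongside the text prompt
    num_inference_steps=30,
).images[0]
image.save("controlled_sample.png")
```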

12:30-1:00 Carl Vondrick

Computer vision algorithms need to combine many skills — spatial, physical, mathematical, geometrical, and cognitive — in order to accurately analyze the visual world. In this talk, I will show how code synthesis equips neural networks with these skills, thereby providing versatile representations for answering questions and recognizing objects. Through a series of experimental results, I will moreover show how this approach naturally provides inherent explainability of the decision making process, while also achieving state-of-the-art zero-shot performance across different tasks and benchmarks.
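
To make the idea concrete, the toy sketch below shows how a question about an image can be answered by synthesizing a short Python program over vision primitives and executing it. The detector is stubbed and every name is hypothetical, so this illustrates the general recipe rather than the system presented in the talk.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Box:
    x0: float
    y0: float
    x1: float
    y1: float
    label: str

def detect(image, query: str) -> List[Box]:
    """Stand-in for an open-vocabulary detector; stubbed so the sketch runs offline."""
    canned = {"mug": [Box(10, 40, 60, 90, "mug"), Box(70, 42, 120, 95, "mug")]}
    return canned.get(query, [])

# In a real system, a code-generating language model would produce a program like
# this from the question "How many mugs are in the image?".
synthesized_program = """
def answer(image):
    mugs = detect(image, "mug")
    return len(mugs)
"""

namespace = {"detect": detect}
exec(synthesized_program, namespace)    # turn the generated source into a callable
print(namespace["answer"](image=None))  # prints 2 with the stubbed detector
```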

1:00 Closing remarks

Speakers

Organizers