This paper shows how to leverage large-scale pre-trained vision-language models (CLIP) for long-term action detection in videos.
This paper shows that jointly optimizing Vision Transformers for the primary task and a self-supervised auxiliary task is surprisingly beneficial when training data is limited.
Dominick Reilly has been awarded the Chateaubriand Fellowship and will be interning at Inria Sophia Antipolis, France.
🎃 Happy Halloween! We hosted a Trick or Research event at CharMLab: a spooky afternoon of research, fun, and treats! 🎃