Seeing the Pose in the Pixels: Learning Pose-Aware Representations in Vision Transformers

Two approaches for training ViTs to learn pose-aware representations for Activities of Daily Living (ADL) videos, enabling fine-grained and viewpoint-agnostic visual perception.