By fine-tuning the pre-trained CLIP model, we achieve state-of-the-art performance on the video highlight detection task, demonstrating that large-scale multimodal knowledge can be effectively transferred to specialized video understanding.
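To make the recipe concrete, here is a minimal sketch of fine-tuning CLIP for query-conditioned per-frame highlight scoring. This is not the paper's exact architecture: the linear scoring head, the elementwise-product fusion of frame and query embeddings, the per-frame binary labels, and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch: fine-tune CLIP to score video frames as highlights for a
# text query. Assumes per-frame binary highlight labels (an assumption; the
# paper's actual supervision and head may differ).
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class HighlightScorer(nn.Module):
    def __init__(self, clip: CLIPModel):
        super().__init__()
        self.clip = clip
        # Small trainable head on top of the fused frame-query features
        # (hypothetical design, not taken from the paper).
        self.head = nn.Linear(clip.config.projection_dim, 1)

    def forward(self, pixel_values, input_ids, attention_mask):
        # Frame embeddings: (num_frames, dim); query embedding: (1, dim).
        img = self.clip.get_image_features(pixel_values=pixel_values)
        txt = self.clip.get_text_features(input_ids=input_ids,
                                          attention_mask=attention_mask)
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        fused = img * txt              # broadcast the query over all frames
        return self.head(fused).squeeze(-1)  # one highlight logit per frame

scorer = HighlightScorer(model)
optimizer = torch.optim.AdamW(scorer.parameters(), lr=1e-5)
criterion = nn.BCEWithLogitsLoss()

# Toy batch: 8 random "frames" and one text query (stand-ins for real data).
frames = torch.rand(8, 3, 224, 224)
query = processor(text=["a skateboard trick"], return_tensors="pt", padding=True)
labels = torch.randint(0, 2, (8,)).float()

logits = scorer(frames, query["input_ids"], query["attention_mask"])
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
```

Fine-tuning the full CLIP backbone at a low learning rate, as above, is one common choice; freezing CLIP and training only the head is a cheaper alternative when data is scarce.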
The core message of this paper is that action detection can be effectively tackled as a three-image generation problem: the starting-point, ending-point, and action-class predictions are each generated as an image by a diffusion-based framework.
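For illustration, the following is a schematic sketch of the three-image idea, written under assumptions of our own rather than from the authors' code: the start-point, end-point, and class predictions are stacked as the three channels of a small 2D map and trained with a standard DDPM epsilon-prediction objective. The toy convolutional denoiser (which, unlike a real one, ignores the diffusion timestep), the map encoding, and all hyperparameters are placeholders.

```python
# Schematic sketch (not the authors' implementation) of casting action
# detection as three-image generation with a DDPM-style training step.
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 100                          # number of diffusion steps (assumed)
H = W = 32                       # spatial size of each prediction "image" (assumed)
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

# Stand-in denoiser; a real one would condition on the timestep t.
denoiser = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)

def q_sample(x0, t, noise):
    """Forward process: noise the clean three-channel prediction maps,
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    ab = alpha_bars[t].view(-1, 1, 1, 1)
    return ab.sqrt() * x0 + (1 - ab).sqrt() * noise

# Training step: the clean target stacks the three prediction maps as
# channels [start map, end map, class map] -- this encoding is an assumption.
x0 = torch.rand(4, 3, H, W)
t = torch.randint(0, T, (4,))
noise = torch.randn_like(x0)
pred = denoiser(q_sample(x0, t, noise))
loss = F.mse_loss(pred, noise)   # standard epsilon-prediction loss
loss.backward()
```

At inference, the same network would be applied iteratively to denoise pure noise into the three prediction maps, from which start times, end times, and class labels are read off.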