This work introduces T2AV-BENCH, a benchmark for video-aligned text-to-audio generation, and proposes the T2AV model, which integrates visually aligned text embeddings to improve audio synthesis.
RFWave introduces a multi-band Rectified Flow approach for high-fidelity audio waveform reconstruction, emphasizing efficiency and quality.