VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning
machinelearning.apple.com/research/vssflow
VSSFlow is a unified flow-matching framework for video-conditioned sound and speech generation, integrating video-to-sound (V2S) and visual text-to-speech (VisualTTS) tasks. It uses a novel condition aggregation mechanism and leverages the inductive biases of cross-attention and self-attention layers to handle different input signals. VSSFlow outperforms state-of-the-art baselines on both V2S and VisualTTS benchmarks.
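
For readers unfamiliar with flow matching, the following is a minimal sketch of the generic training objective, not VSSFlow's actual implementation. It assumes the common linear interpolation path between a noise sample and a data sample; the summary above does not say which path, network, or conditioning inputs VSSFlow uses. The function names `flow_matching_pair` and `flow_matching_loss` are hypothetical, and the conditioning signals (video, text) are omitted entirely.

```python
import random


def flow_matching_pair(x1, rng):
    """Build one training example for a linear-path flow-matching objective.

    x1 is a data sample (a list of floats standing in for an audio latent).
    Assumption: linear path x_t = (1 - t) * x0 + t * x1 with Gaussian x0.
    """
    # Sample the noise endpoint and a time in [0, 1).
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    t = rng.random()
    # Point on the straight line between noise and data at time t.
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    # Along a straight line the velocity is constant: data minus noise.
    v_target = [b - a for a, b in zip(x0, x1)]
    return x_t, t, v_target


def flow_matching_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(v_target)


if __name__ == "__main__":
    rng = random.Random(0)
    x1 = [1.0, -2.0, 0.5]
    x_t, t, v_target = flow_matching_pair(x1, rng)
    # A real model would predict the velocity from (x_t, t, conditions);
    # here a zero prediction is used only to exercise the loss function.
    print(flow_matching_loss([0.0] * len(x1), v_target))
```

In a full system, a network would take `x_t`, `t`, and the conditioning signals as input and be trained to minimize this loss; sampling then integrates the predicted velocity from noise toward data.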