Lyrics-to-Song (L2S) generation models promise end-to-end music synthesis from text, but their vulnerability to copyright leakage remains underexplored. To mitigate this risk, commercial systems typically block prompts containing copyrighted lyrics. In this work, we introduce Adversarial PhoneTic Prompting (APT), an attack that replaces iconic phrases with homophonic alternatives—e.g., "mom's spaghetti" becomes "Bob's confetti"—preserving the acoustic form while bypassing copyright filters.
We reveal that models can be prompted to regurgitate memorized songs using phonetically similar but semantically unrelated lyrics. Despite the semantic drift, black-box models like SUNO and open-source models like YuE generate outputs that are strikingly similar to the original songs—melodically, rhythmically, and vocally—achieving high scores on CLAP, AudioJudge, and CoverID. These effects persist across genres and languages.
More surprisingly, we find that phonetic prompts alone can trigger visual memorization in text-to-video models: when given altered lyrics from "Lose Yourself", Veo 3 generates scenes that mirror the original music video, complete with a hooded rapper and dim urban settings, despite no explicit visual cues in the prompt.
Key Findings: Through systematic testing with phoneme modifications (like "mom's spaghetti" → "Bob's confetti"), we demonstrate that AI music models exhibit significant memorization, raising important questions about copyright safety in generative music systems.
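To make the "mom's spaghetti" → "Bob's confetti" example concrete, the sketch below checks what such a substitution preserves at the phoneme level. It is an illustration only, not the procedure used in the paper: the ARPAbet transcriptions in `LEXICON` are hand-written assumptions, and the helper names (`to_phonemes`, `vowel_skeleton`, `rhyme_tail`, `shared_words`) are hypothetical.

```python
# Hand-written ARPAbet transcriptions (an assumption for illustration;
# a real pipeline would look words up in a pronunciation dictionary).
LEXICON = {
    "mom's": ["M", "AA1", "M", "Z"],
    "spaghetti": ["S", "P", "AH0", "G", "EH1", "T", "IY0"],
    "bob's": ["B", "AA1", "B", "Z"],
    "confetti": ["K", "AH0", "N", "F", "EH1", "T", "IY0"],
}


def to_phonemes(phrase):
    """Concatenate the phonemes of each word in the phrase."""
    phones = []
    for word in phrase.lower().split():
        phones.extend(LEXICON[word])
    return phones


def vowel_skeleton(phones):
    """Keep only vowels; in ARPAbet they carry a trailing stress digit."""
    return [p for p in phones if p[-1].isdigit()]


def rhyme_tail(phones):
    """Return everything from the last primary-stressed vowel onward."""
    stressed = [i for i, p in enumerate(phones) if p.endswith("1")]
    return phones[stressed[-1]:] if stressed else phones


def shared_words(a, b):
    """Words the two phrases have in common (what a text filter would see)."""
    return set(a.lower().split()) & set(b.lower().split())


original = "mom's spaghetti"
modified = "Bob's confetti"

orig_ph = to_phonemes(original)
mod_ph = to_phonemes(modified)

print("same vowel/stress skeleton:", vowel_skeleton(orig_ph) == vowel_skeleton(mod_ph))
print("same rhyme tail:", rhyme_tail(orig_ph) == rhyme_tail(mod_ph))
print("shared words:", shared_words(original, modified))
```

Under these assumed transcriptions, the two phrases share their vowel and stress pattern and their rhyme ending while having no words in common, which is consistent with the idea that a filter matching on lyric text would not flag the modified phrase.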
Demonstration showing how phonetic modifications can trigger visual memorization in text-to-video models.
Hip-hop tracks with phonetic modifications. Key transformations preserve rhythm while changing lyrics.
Pop songs with phonetic modifications across different languages.
Jingle Bells transformed through phonetic modifications.
Classic English songs regenerated with genre modifications.
Chinese language songs demonstrating cross-linguistic memorization.
@article{roh2025bob,
title={Bob's Confetti: Phonetic Memorization Attacks in Music and Video Generation},
author={Roh, Jaechul and Novack, Zachary and Peng, Yuefeng and Mireshghallah, Niloofar and Berg-Kirkpatrick, Taylor and Houmansadr, Amir},
journal={arXiv preprint arXiv:2507.17937},
year={2025}
}