A team of engineers at Google has unveiled a new music generation AI system called MusicLM. The model produces high-quality music from text descriptions such as "a calming violin melody backed by a distorted guitar riff." It works much like DALL-E, which generates images from text.
MusicLM uses AudioLM's multilevel autoregressive modeling as its generative component and extends it with text conditioning. To address the key problem of scarce paired audio-text data, the researchers applied MuLan, a joint music-text model trained to project music and its textual description onto nearby points in a shared embedding space.
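The idea behind such a joint embedding space can be sketched in a few lines. This is a minimal illustration, not MuLan's actual architecture: the projection matrices and feature dimensions below are hypothetical, and a real model would learn them with a contrastive objective over millions of music-caption pairs.

```python
import numpy as np

def embed(x, W):
    """Project raw features into the shared space and L2-normalize."""
    v = W @ x
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
dim_audio, dim_text, dim_shared = 128, 64, 32

# Hypothetical learned projections for each modality.
W_audio = rng.standard_normal((dim_shared, dim_audio))
W_text = rng.standard_normal((dim_shared, dim_text))

audio_features = rng.standard_normal(dim_audio)   # stand-in for audio features
text_features = rng.standard_normal(dim_text)     # stand-in for caption features

a = embed(audio_features, W_audio)
t = embed(text_features, W_text)

# Cosine similarity in the shared space: training pulls matching
# music/caption pairs toward 1 and pushes mismatched pairs apart.
similarity = float(a @ t)
print(similarity)
```

Because both modalities land in the same space, a text description can stand in for audio at generation time, which is how the paired-data shortage is sidestepped.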
MusicLM is trained on a large unlabeled music dataset; it treats conditional music generation as a hierarchical sequence modeling task and generates music at 24 kHz that remains consistent over several minutes. To address the lack of evaluation data, the developers released MusicCaps, a new high-quality music captioning dataset containing 5,500 music-text pairs prepared by professional musicians.
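The hierarchical sequence modeling mentioned above can be pictured as a coarse-to-fine token pipeline. The sketch below is a toy stand-in with hypothetical stage names and sampling logic; the real system runs Transformer decoders over learned semantic and acoustic token vocabularies, and a neural audio codec turns the final tokens into a 24 kHz waveform.

```python
import numpy as np

rng = np.random.default_rng(42)

def autoregressive_stage(conditioning, vocab_size, length):
    """Toy stand-in for one autoregressive stage: each token depends on
    the conditioning sequence and the tokens emitted so far."""
    tokens = []
    for _ in range(length):
        # A real model would run a Transformer here; we just sample.
        seed = (sum(conditioning) + sum(tokens)) % vocab_size
        tokens.append(int((seed + rng.integers(vocab_size)) % vocab_size))
    return tokens

# Stage 1: text conditioning -> coarse "semantic" tokens
# capturing long-term structure (melody, rhythm).
text_tokens = [3, 14, 15]  # hypothetical embedding-derived tokens
semantic = autoregressive_stage(text_tokens, vocab_size=1024, length=8)

# Stage 2: semantic tokens -> fine "acoustic" tokens carrying audio
# detail; a codec decoder would render these as a waveform.
acoustic = autoregressive_stage(text_tokens + semantic,
                                vocab_size=1024, length=32)
print(len(acoustic))
```

Splitting generation into stages like this is what lets the model keep a piece coherent over minutes: the coarse stage fixes the long-range structure before the fine stage fills in the sound.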
Experiments show that MusicLM outperforms previous systems in both audio quality and adherence to the text description. The model can also be conditioned on a melody in addition to text: it can transform a whistled or hummed tune into music in the style described by the caption.
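Conditioning on both a melody and a text description amounts to feeding the generator two signals side by side. The fragment below is only a schematic of that idea, with made-up shapes and token values; the real model extracts melody tokens from the hummed audio with a learned tokenizer.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical inputs: melody tokens extracted from a hummed recording,
# plus a text embedding for a style description.
melody_tokens = rng.integers(0, 512, size=16)
text_embedding = rng.standard_normal(32)

# The generator is conditioned on both signals at once: the melody
# tokens fix the tune, the text fixes style and instrumentation.
conditioning = np.concatenate([melody_tokens.astype(float),
                               text_embedding])
print(conditioning.shape)
```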
A demo of the model is available on the MusicLM project page.
The AI system learned to create music by training on a dataset of 5 million audio clips, representing 280,000 hours of music. MusicLM can create pieces of various lengths, from a quick riff to an entire song. It can even go further, generating longer works in which alternating sections create the feeling of a story, as in a symphony. The system also handles specific requests, such as particular instruments or genres, and can approximate the feel of vocals.
The creation of MusicLM is part of a broader wave of deep learning applications designed to replicate human abilities such as speaking, writing an essay, drawing a picture, taking a test, or proving a mathematical theorem.
The developers have announced that Google will not release the system publicly. Tests showed that about 1% of the music generated by the model directly copied music from real performers, so the company is wary of content theft and lawsuits.