Machine Learning
A bit of popular science
Turning audio into sheet music: the future of MIR
Over the past ten years, research has dramatically boosted computers’ ability to solve complex problems, especially in sequence processing. The emergence of Transformer neural networks marks a watershed moment: these architectures rethink the handling of sequence units called tokens and are revolutionizing Music Information Retrieval.
What Ivory does
At Ivory, we design proprietary models capable of turning an audio signal into a symbolic representation of music. We work on two distinct levels:
Signal analysis
- Convert the audio signal into a spectrogram
- Extract notes (onset, offset, pitch, velocity)
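To give a feel for the first step, here is a minimal sketch of a magnitude spectrogram built with a short-time Fourier transform in plain NumPy. This is an illustration only, not Ivory’s pipeline: the frame size, hop length, and window choice are assumptions, and production systems typically use log-frequency or mel-scaled spectrograms rather than a raw linear one.

```python
import numpy as np

def spectrogram(signal, sample_rate, frame_size=1024, hop=256):
    """Magnitude spectrogram: slice the signal into overlapping
    windowed frames and take the FFT of each frame."""
    window = np.hanning(frame_size)
    n_frames = 1 + (len(signal) - frame_size) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_size] * window for i in range(n_frames)]
    )
    spec = np.abs(np.fft.rfft(frames, axis=1)).T   # shape: (freq_bins, time_frames)
    freqs = np.fft.rfftfreq(frame_size, d=1.0 / sample_rate)
    return spec, freqs

# One second of a 440 Hz test tone: the strongest frequency bin
# in the spectrogram should land near 440 Hz.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
spec, freqs = spectrogram(tone, sr)
peak_hz = freqs[spec.mean(axis=1).argmax()]
print(round(peak_hz, 1))
```

The spectrogram is what the note-extraction model then reads: onsets show up as sudden energy across bins, and pitch as the position of the harmonic stack.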
Symbolic analysis
- Use the extracted notes to determine rhythm, tempo, separate melody from harmony, identify time signatures, chords, and all other elements of a traditional score
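As a toy example of the rhythm side of this step, the sketch below snaps raw onset times (in seconds) to a beat grid once a tempo is known. The tempo and grid subdivision are assumed inputs here; in practice both must themselves be inferred from the notes, which is much of what makes symbolic analysis hard.

```python
def quantize_onsets(onsets_sec, bpm, subdivision=4):
    """Snap onset times (seconds) to the nearest position on a
    rhythmic grid, e.g. a sixteenth-note grid when subdivision=4."""
    quarter = 60.0 / bpm              # duration of one quarter note in seconds
    grid = quarter / subdivision      # grid spacing in seconds
    return [round(t / grid) * grid for t in onsets_sec]

# At 120 BPM a quarter note lasts 0.5 s; a slightly rushed onset at
# 0.48 s and a slightly late one at 1.02 s both snap back onto the beat.
quantized = quantize_onsets([0.0, 0.48, 1.02], bpm=120)
print(quantized)  # -> [0.0, 0.5, 1.0]
```

Real performances deviate from the grid expressively, so production systems weigh such snapping against the context of the whole passage rather than quantizing each note in isolation.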
A hard problem, a fascinating research topic
Unsurprisingly, AI-assisted automatic transcription raises several major challenges we are striving to overcome:
- Wide semantics of expressiveness: While the human ear excels at processing periodic signals and is largely invariant to tempo changes, a machine operates in an absolute unit of time, seconds, which makes it hard to carry what was learned on one piece over to another. It must also handle live performance recordings, where performers play durations that sound natural to the ear but remain ambiguous from an informational standpoint. Context is therefore crucial, and the variability of input data, combined with the scarcity of public datasets, makes generalization difficult for neural networks.
- Harmony vs. melody: In natural language these concepts are clearly distinct, yet in music the boundary between the two can be blurry: they respond to and complement each other.
- Polyphony: Recent models process tokens one after another and are not designed to analyze tokens that are concurrent in time. Finding the best way to give the network a relevant temporal representation of simultaneous notes therefore remains an open research question: several strategies exist, but none has settled the matter.
- Human editorial choices: Some indications are stylistic or notational (free tempo, ambiguous time signatures, editor’s markings) and go beyond a simple succession of notes.
- Latent ambiguities: Notes alone do not always resolve the inherent uncertainty of musical notation, generating entropy that neural networks still struggle to model.
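To make the polyphony challenge concrete, one common strategy in the literature (the time-shift event vocabulary popularized by MIDI-like models) flattens simultaneous notes into a single token stream: clock-advance tokens separate moments in time, and events that happen together simply appear with no time shift between them. The sketch below is a generic illustration of that idea, not a description of Ivory’s own tokenization.

```python
def tokenize(notes, time_step=0.01):
    """Serialize possibly-simultaneous notes into one token stream.
    TIME_SHIFT_k advances the clock by k steps; NOTE_ON/NOTE_OFF
    mark pitch starts and ends. Concurrent events share a timestamp,
    so no TIME_SHIFT token separates them."""
    events = []
    for onset, offset, pitch in notes:
        events.append((onset, f"NOTE_ON_{pitch}"))
        events.append((offset, f"NOTE_OFF_{pitch}"))
    events.sort()
    tokens, clock = [], 0.0
    for t, name in events:
        steps = round((t - clock) / time_step)
        if steps > 0:
            tokens.append(f"TIME_SHIFT_{steps}")
            clock += steps * time_step
        tokens.append(name)
    return tokens

# A dyad: C4 (MIDI 60) and E4 (MIDI 64) struck and released together.
notes = [(0.0, 0.5, 60), (0.0, 0.5, 64)]
print(tokenize(notes))
```

The two NOTE_ON tokens appear back to back with no TIME_SHIFT between them, which is how a strictly sequential model can still encode chords; choosing the right such encoding is exactly the open question above.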
Full steam ahead for continuous improvement
The landscape of automatic music transcription is evolving rapidly, and Ivory’s latest models are at the forefront. We already deliver strong results on standard repertoire, and our recent upgrades tackle the complexities of highly polyphonic passages and live recordings. Our focus is clear: precise note detection, robust rhythm stabilization, and rich harmonic analysis. With these core pillars strengthening, our next major frontier is professional-grade engraving: turning accurate transcriptions into beautifully formatted, publication-ready sheet music.
We ship rapid, regular updates fueled by a powerful feedback loop. Your trials and insights are the catalyst for our progress: they directly shape our roadmap and accelerate breakthroughs. Join our community, experience our latest advancements, and help us build the ultimate audio-to-score tool for every musician.
Marius from Ivory