Turning audio into sheet music: the future of MIR
Over the past ten years, research has dramatically improved computers' ability to solve complex problems, especially in sequence processing. The emergence of Transformer neural networks marks a watershed moment: these architectures rethink how sequences of discrete units called tokens are processed, and they are revolutionizing Music Information Retrieval (MIR).
What Ivory does
At Ivory, we design proprietary models capable of turning an audio signal into a symbolic representation of music. We work on two distinct levels:
Signal analysis
- Convert the audio signal into a spectrogram, a time-frequency representation
- Extract note events (onset, offset, pitch, velocity); a sketch of this stage follows below
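To make this stage concrete, here is a minimal sketch built on the open-source librosa library rather than Ivory's proprietary models. The input file name, hop length, and the choice of pitch tracker are assumptions for the example; pyin, in particular, only handles monophonic material, which sidesteps the polyphony problem discussed later.

```python
# A minimal sketch of the signal-analysis stage using the open-source
# librosa library (illustrative only; not Ivory's proprietary pipeline).
import librosa
import numpy as np

HOP = 512  # analysis hop length in samples

# 1. Load the recording and build a time-frequency representation.
y, sr = librosa.load("recording.wav", sr=22050)    # hypothetical input file
C = np.abs(librosa.cqt(y, sr=sr, hop_length=HOP))  # constant-Q spectrogram
log_C = librosa.amplitude_to_db(C, ref=np.max)     # log magnitude, in dB

# 2. Detect note onsets, in seconds, from the spectral flux.
onset_times = librosa.onset.onset_detect(y=y, sr=sr, hop_length=HOP,
                                         units="time")

# 3. Track pitch per frame. pyin is monophonic only; polyphony is the
#    hard case that dedicated transcription models must solve.
f0, voiced, _ = librosa.pyin(y, sr=sr, hop_length=HOP,
                             fmin=librosa.note_to_hz("C2"),
                             fmax=librosa.note_to_hz("C7"))

# 4. Print the pitch sounding at each detected onset.
for t in onset_times:
    frame = librosa.time_to_frames(t, sr=sr, hop_length=HOP)
    if frame < len(f0) and voiced[frame]:
        print(f"{t:6.2f}s  {librosa.hz_to_note(f0[frame])}")
```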
Symbolic analysis
- Use the extracted notes to infer rhythm and tempo, separate melody from harmony, and identify time signatures, chords, and the other elements of a traditional score (a simplified sketch of one such step follows below)
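To illustrate one step of this stage, here is a deliberately simplified sketch that snaps played onsets, measured in absolute seconds, to a sixteenth-note grid once a tempo estimate is available. The function name and the constant-tempo assumption are ours for the example; real performances drift and use rubato, which is precisely what makes symbolic analysis hard.

```python
# A toy sketch of one symbolic-analysis step: quantizing note onsets
# (absolute seconds) to a sixteenth-note grid under a fixed estimated
# tempo. Names and the constant-tempo assumption are illustrative.
from fractions import Fraction

def quantize_onsets(onsets_sec, tempo_bpm, grid=Fraction(1, 4)):
    """Map onset times in seconds to positions in beats, snapped to
    `grid` (in beats; 1/4 of a beat is a sixteenth note in 4/4)."""
    sec_per_beat = 60.0 / tempo_bpm
    quantized = []
    for t in onsets_sec:
        beats = t / sec_per_beat            # absolute seconds -> beats
        steps = round(beats / float(grid))  # nearest grid step
        quantized.append(steps * grid)      # position as an exact fraction
    return quantized

# Example: a slightly "human" performance at 120 BPM (0.5 s per beat).
onsets = [0.02, 0.49, 0.76, 1.01]  # played onsets in seconds
print(quantize_onsets(onsets, tempo_bpm=120))
# -> [Fraction(0, 1), Fraction(1, 1), Fraction(3, 2), Fraction(2, 1)]
```

Note that the raw input lives in seconds while notation lives in beats; the whole step hinges on a tempo estimate that, in real recordings, is neither known nor constant.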
A hard problem, a fascinating research topic
Unsurprisingly, AI-assisted automatic transcription raises several major challenges we are striving to overcome:
- Expressive variability: the human ear excels at processing periodic signals and is largely invariant to tempo changes, but a machine operates on an absolute unit of time, the second, which makes it hard to transfer what it learns from one piece to another. It must also handle live recordings in which performers play durations that sound natural to the ear yet remain ambiguous from an informational standpoint. Context is therefore crucial, and the variability of input data, combined with the scarcity of public datasets, makes generalization difficult for neural networks.
- Harmony vs. melody: in natural language these concepts are clearly distinct, yet in music the boundary between the two can be blurry; they answer and complement each other.
- Polyphony: recent models process semantic units called tokens one after another and are not designed to analyze tokens that occur simultaneously. Several serialization strategies exist, but finding the temporal representation that serves the network best remains an open research question (see the sketch after this list).
- Human editorial choices: some indications are stylistic or notational (free tempo, ambiguous time signatures, editor’s markings) and go beyond a simple succession of notes.
- Latent ambiguities: notes alone do not always resolve the inherent uncertainty of musical notation, leaving a residual ambiguity that neural networks still struggle to model.
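To see why polyphony is tricky for sequence models, consider the toy serialization below, written in the spirit of the MIDI-like event vocabularies used in the transcription literature. The token names and time step are illustrative. A chord is three notes at the same instant, yet a sequence model must receive them one at a time, in some arbitrary order:

```python
# A toy illustration of the polyphony problem: token-based models need
# a one-dimensional sequence, so simultaneous notes must be serialized.
# The event vocabulary (TIME_SHIFT / NOTE_ON) mimics MIDI-like encodings
# from the transcription literature; exact token names are illustrative.

# A C-major chord (three notes at the same instant), then one more note.
notes = [
    {"pitch": 60, "onset": 0.0},  # C4
    {"pitch": 64, "onset": 0.0},  # E4
    {"pitch": 67, "onset": 0.0},  # G4
    {"pitch": 72, "onset": 0.5},  # C5, half a second later
]

def serialize(notes, time_step=0.01):
    """Flatten note events into a token sequence. Concurrent notes get
    an arbitrary order imposed by the sort key."""
    tokens, clock = [], 0.0
    for n in sorted(notes, key=lambda n: (n["onset"], n["pitch"])):
        shift = round((n["onset"] - clock) / time_step)
        if shift > 0:
            tokens.append(f"TIME_SHIFT_{shift}")
            clock = n["onset"]
        tokens.append(f"NOTE_ON_{n['pitch']}")
    return tokens

print(serialize(notes))
# -> ['NOTE_ON_60', 'NOTE_ON_64', 'NOTE_ON_67', 'TIME_SHIFT_50', 'NOTE_ON_72']
```

Any permutation of the first three NOTE_ON tokens denotes the same chord, so the model has to learn an equivalence that the flat representation itself does not express.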
Full steam ahead for continuous improvement
Automatic music transcription is advancing, but no model is universal yet. At Ivory, we achieve strong results on simple repertoires; highly polyphonic passages or live recordings remain areas of active research. Our priority is clear: refine note detection, stabilize rhythm, and enrich harmonic analysis to reduce residual errors.
We publish updates regularly and have set up a feedback loop. Your trials and comments are invaluable: they shape our roadmap and accelerate fixes. Join the community, stay updated on our progress, and help us make audio-to-score transcription a reliable tool for every musician.