Machine Learning

A bit of popular science

Turning audio into sheet music: the future of MIR

Over the past ten years, research has dramatically boosted computers’ ability to solve complex problems, especially in sequence processing. The emergence of Transformer neural networks marks a watershed moment: these architectures rethink how sequences of discrete units called tokens are processed, and they are revolutionizing Music Information Retrieval (MIR).

At the heart of Ivory is a custom-built Sequence-to-Sequence (Seq2Seq) Transformer. Unlike standard text-based AI, our model doesn't just read words; it ingests a rich, multidimensional stream of musical features—pitch, velocity, timing intervals, and melodic contours. By learning the deep contextual relationships between these elements, our Transformer translates raw audio events into a highly structured musical grammar.
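As a rough illustration, one note event might be packed into such a multi-feature token like this (the `NoteToken` fields and `tokenize` helper are hypothetical, simplified stand-ins for our actual vocabulary):

```python
from dataclasses import dataclass

@dataclass
class NoteToken:
    """Hypothetical multi-feature token: one note event as the model might see it."""
    pitch: int          # MIDI pitch, 0-127
    velocity: int       # MIDI velocity, 0-127
    onset_delta: float  # seconds elapsed since the previous onset
    contour: int        # melodic direction vs. previous note: -1 down, 0 same, +1 up

def tokenize(notes):
    """Turn (pitch, velocity, onset) triples into a stream of multi-feature tokens."""
    tokens = []
    prev_pitch, prev_onset = None, None
    for pitch, velocity, onset in notes:
        delta = 0.0 if prev_onset is None else onset - prev_onset
        if prev_pitch is None or pitch == prev_pitch:
            contour = 0
        else:
            contour = 1 if pitch > prev_pitch else -1
        tokens.append(NoteToken(pitch, velocity, delta, contour))
        prev_pitch, prev_onset = pitch, onset
    return tokens

# C4, then E4 a half-second later, then D4: pitch, velocity, onset time
stream = tokenize([(60, 80, 0.0), (64, 78, 0.5), (62, 75, 1.0)])
```

The point is that each token carries pitch, dynamics, timing, and contour jointly, so the attention layers can learn relationships across all of these dimensions at once.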

What Ivory does

At Ivory, we design proprietary models capable of turning an audio signal into a symbolic representation of music. We work on two distinct levels:

1. Signal analysis

  • Convert the audio signal into a spectrogram
  • Extract raw acoustic events (onset, offset, pitch, velocity)
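For intuition, step 1 can be sketched with nothing but NumPy: a windowed short-time Fourier transform produces the spectrogram, and a simple spectral-flux detector flags candidate onsets. This naive detector is purely illustrative; it is not the extractor Ivory actually uses.

```python
import numpy as np

def spectrogram(signal, n_fft=1024, hop=256):
    """Magnitude spectrogram via a Hann-windowed short-time Fourier transform."""
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft, hop)]
    return np.abs(np.fft.rfft(frames, axis=1))  # shape: (time, freq)

def onset_frames(spec, threshold=0.5):
    """Naive onset detection: frames where spectral energy jumps (spectral flux)."""
    flux = np.sum(np.maximum(spec[1:] - spec[:-1], 0.0), axis=1)
    return np.where(flux > threshold * flux.max())[0] + 1
```

Running this on half a second of silence followed by a 440 Hz tone flags the transition region, which is exactly the "onset" event the symbolic stage consumes.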

2. Symbolic analysis (The Transformer)

This is where our AI acts like an expert transcriber. It processes the extracted notes using a Dual-Encoder architecture: one "brain" focuses exclusively on the rhythm, while the other analyzes the melodic and harmonic content.

  • Multi-Task Learning: While predicting notes, the network is simultaneously trained on "auxiliary tasks": identifying downbeats, measure boundaries, and exactly where each note sits on the beat.
  • Rich Vocabulary: The model outputs a complete musical package, determining not just the pitch, but left/right hand staff assignments, exact measure positions, and even expressive modifiers like grace notes and strums.
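The multi-task idea boils down to one combined objective: the note-prediction loss plus down-weighted auxiliary losses. A minimal sketch, assuming cross-entropy heads and an illustrative 0.3 auxiliary weight (the task names and weights here are hypothetical, not Ivory's actual configuration):

```python
import numpy as np

def cross_entropy(logits, target):
    """Cross-entropy of one prediction (a logit vector) against an integer label."""
    log_probs = logits - np.log(np.sum(np.exp(logits)))
    return -log_probs[target]

def multi_task_loss(outputs, targets, aux_weight=0.3):
    """Combined training loss: note prediction plus auxiliary rhythm tasks.

    `outputs` and `targets` map task names ('note', 'downbeat', 'beat_pos')
    to logit vectors and integer labels respectively.
    """
    main = cross_entropy(outputs['note'], targets['note'])
    aux = sum(cross_entropy(outputs[t], targets[t])
              for t in ('downbeat', 'beat_pos'))
    return main + aux_weight * aux
```

Because the gradients of the auxiliary heads flow back through the shared encoder, the network cannot "cheat" by predicting notes while ignoring the metrical grid.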

A hard problem, a fascinating research topic

Unsurprisingly, AI-assisted automatic transcription raises several major challenges we are striving to overcome:

  • Wide semantics of expressiveness: A machine operates on an absolute unit of time (seconds), while humans play with feel, accelerating and decelerating naturally. To teach our model to handle live, unquantized performances, we use advanced data augmentation. During training, we warp the timing of the music using mathematical "S-curves" (simulating realistic rubato) and randomly shift tempos and dynamics, so the AI learns to feel the beat rather than just count milliseconds.
  • Harmony vs. melody: Natural language keeps its structural layers relatively distinct, but in music the boundary between harmony and melody can be blurry: the two respond to and complement each other.
  • Polyphony & Temporal Clumping: Transformers process units one after another, which makes concurrent notes (chords) tricky. Without careful guidance, models tend to "clump" time steps together. We developed custom training mechanisms—specifically weighting time-changes heavier than chord-blocks—to force the network to respect the independent rhythms of complex polyphony.
  • Latent ambiguities & Robustness: Notes alone do not always resolve the inherent uncertainty of musical notation. To prevent our model from cascading into errors when it encounters something ambiguous, we actively inject "structural noise" during training. By occasionally giving the model slightly corrupted data, we teach it to recover from its own mistakes, making it highly robust during real-world transcription.
  • Context and Memory: Music is highly contextual; the key or time signature of a measure depends on what happened pages ago. Our inference engine uses a sliding "context window," seamlessly stitching together chunks of generated music by feeding the tail-end of the last phrase as the prompt for the next one, ensuring long-term continuity.
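The S-curve timing warp from the expressiveness bullet can be pictured as a logistic curve applied to normalized onset times, with the endpoints pinned so the phrase keeps its total duration. The curve shape and `strength` parameter below are illustrative, not our production augmentation:

```python
import numpy as np

def s_curve_warp(onsets, duration, strength=2.0):
    """Warp note onsets with a logistic S-curve to simulate rubato.

    Time is normalized to [0, 1], pushed through a sigmoid-shaped curve,
    then rescaled so that 0 maps to 0 and 1 maps to 1: notes early in the
    phrase are stretched while later ones are compressed, yet the phrase
    starts and ends exactly where it did before.
    """
    t = np.asarray(onsets) / duration                    # normalize to [0, 1]
    s = 1.0 / (1.0 + np.exp(-strength * (t - 0.5)))      # raw sigmoid
    s0 = 1.0 / (1.0 + np.exp(strength / 2))              # sigmoid value at t = 0
    s1 = 1.0 / (1.0 + np.exp(-strength / 2))             # sigmoid value at t = 1
    warped = (s - s0) / (s1 - s0)                        # pin the endpoints
    return warped * duration
```

Training on many such warped copies of the same score is what lets the model line up an expressive performance with its strict notated rhythm.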
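The anti-clumping countermeasure from the polyphony bullet amounts to weighting per-token losses by token type. A sketch, assuming a vocabulary split into time-shift tokens (which advance the clock) and note tokens (which add to the current chord); the 3x weight is illustrative:

```python
import numpy as np

def weighted_token_loss(losses, token_types, time_weight=3.0):
    """Weighted mean of per-token losses, with time-shift tokens weighted heavier.

    `losses` are per-token cross-entropies; `token_types` marks each token
    as 'time' or 'note'. Penalizing timing errors more than chord-member
    errors discourages the model from clumping several time steps together.
    """
    weights = np.where(np.asarray(token_types) == 'time', time_weight, 1.0)
    return float(np.sum(weights * np.asarray(losses)) / np.sum(weights))
```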
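Structural-noise injection, from the robustness bullet, can be as simple as randomly replacing a small fraction of training tokens. The rate and vocabulary size below are illustrative placeholders:

```python
import random

def corrupt(tokens, noise_rate=0.05, vocab_size=128, rng=None):
    """Randomly replace a fraction `noise_rate` of tokens with random ones.

    Training on slightly corrupted targets teaches the decoder to recover
    when its own previous predictions were wrong, instead of cascading.
    """
    rng = rng or random.Random(0)
    return [rng.randrange(vocab_size) if rng.random() < noise_rate else t
            for t in tokens]
```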
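And the sliding context window is, at its core, a loop that hands each chunk the tail of the previous output as its prompt. In this sketch, `model(chunk, prompt)` is a stand-in for the real decoder call, and `overlap_tokens` is an illustrative parameter:

```python
def transcribe_long(audio_chunks, model, overlap_tokens=16):
    """Sliding-window inference: decode chunk by chunk, stitching outputs.

    Each chunk is decoded with the tail of the accumulated output as its
    prompt, so keys, time signatures, and phrasing carry across chunk
    boundaries instead of resetting.
    """
    output, prompt = [], []
    for chunk in audio_chunks:
        tokens = model(chunk, prompt)       # decode this chunk in context
        output.extend(tokens)
        prompt = output[-overlap_tokens:]   # tail of the last phrase
    return output
```

A fake `model` that simply echoes its chunk makes the stitching behavior easy to verify: the prompts it receives are exactly the tails of the running output.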

Full steam ahead for continuous improvement

Automatic music transcription is advancing rapidly, but no model is universal yet. At Ivory, our custom Transformer achieves remarkable results on structured repertoires; however, heavily improvised polyphony or extreme live rubatos remain areas of active research. Our priority is clear: continue refining our dual-encoder's sense of rhythm, expand our data augmentations, and enrich harmonic analysis to reduce residual errors.

We publish updates regularly and have set up a feedback loop. Your trials and comments are invaluable: they shape our roadmap and accelerate our model's learning. Join the community, stay updated on our progress, and help us make audio-to-score transcription a flawless, reliable tool for every musician.