Machine Learning of Musical Gestures

This tutorial/primer is adapted from the paper “Machine Learning of Musical Gestures” by B. Caramiaux and A. Tanaka presented at NIME 2013 (New Interfaces for Musical Expression).


Machine Learning has become a buzzword that can be seen in many tech-oriented journals and magazines. In Music Technology, machine learning has also attracted considerable interest from researchers and practitioners. In this tutorial, we aim to give non-specialists the elements needed to understand what Machine Learning is and how it has been used in music, especially to understand musical gestures. It is part of broader research on the use of Machine Learning in Interaction Design.

Please feel free to send me your comments.

Machine Learning

Machine Learning (ML) is a body of statistical analysis methods that achieve tasks by learning from examples. The field is closely linked to domains such as Data Mining (techniques that discover unknown structures in data) and Pattern Recognition (techniques that identify patterns within given datasets based on how likely they are to match preexisting patterns). Machine learning methods are distinguished by a learning component that allows inference and generalization. They are particularly useful in contexts where an application is too complex to be described by analytical formulations or manual brute-force design, and when an application depends on the environment in which it is deployed.

In order to “act by learning” rather than being explicitly programmed, ML is divided into two phases: training, which learns the data’s internal structure from given samples (training data); and testing, which takes new samples (testing data) and acts, or infers decisions, based on the previously learned structure.
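
The two phases can be sketched with a very small learner. The example below is only an illustration and is not taken from the paper: it assumes a nearest-centroid classifier, and the function names (`train`, `predict`), the feature values and the labels are all hypothetical.

```python
import math

def train(samples, labels):
    """Training phase: summarize each class by the mean (centroid) of its samples."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(model, x):
    """Testing phase: return the label whose centroid is closest to the new sample."""
    return min(model, key=lambda y: math.dist(model[y], x))

# Hypothetical training data: two features per gesture, one label per gesture.
training_data = [(0.1, 0.2), (0.0, 0.3), (0.9, 1.0), (1.0, 0.8)]
training_labels = ["slow", "slow", "fast", "fast"]

model = train(training_data, training_labels)   # training
prediction = predict(model, (0.8, 0.9))         # testing on a new sample
print(prediction)
```

The model here is just one centroid per label; real learners store richer structure, but the split between a training step and a testing step is the same.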

Types of Learning

Learning can be performed in several ways, depending on the data available as well as the imposed design. Classical types of learning include (non-exhaustive list): supervised, unsupervised and semi-supervised.


Illustration of supervised learning

In supervised learning, training data consist of pairs of inputs with their corresponding desired outputs (the goal is known).

The learner takes pairs of inputs and outputs and learns their relationship. Each learner has a set of features and constraints that make it more or less suitable for a given application or task. Once learning has been performed, the tester returns, for a new incoming input, the output that is most accurate according to the learned relationship.

Regarding musical gestures, a supervised mode allows for the personalization of a system based on a given idiosyncratic gesture vocabulary. It also allows for the automatic association between input gestures and desired output sound.



In unsupervised learning, training data comprise only inputs (the goal is unknown and must be learned from the data).

In this mode, the learner takes only inputs and discovers their internal structure; what it finds depends on the number of examples and on how representative these examples are of an existing structure. Once learning has been performed, a new input enters the tester and an output is returned based on the learned structure.

Regarding musical gestures, an unsupervised mode allows for the automatic discovery of gestural structure that could be used in human-machine improvisational settings. This has not been explored in the literature.



In semi-supervised learning, methods consider examples of pairs of inputs and desired outputs as well as examples comprising only inputs (the goal is partially known).

The learner first takes examples of inputs and outputs, which allow it to learn a prototype of their relationship. However, it is often hard to obtain labelled data that span the whole input-output space for training. Instead, starting from the learned prototype, the learner takes input-only examples to refine the learned relationship. Once learning has been performed, the tester returns the most accurate output for a new input.
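
The "prototype then refinement" idea can be sketched as a simple self-training loop. This is an assumption made for illustration, not the method of any particular toolkit: a prototype (one centroid per label) is learned from a few labelled examples, unlabelled inputs are given the label of the nearest centroid, and the centroids are recomputed. All names and values are hypothetical.

```python
def centroids(points, labels):
    """Mean of the one-dimensional points that share each label."""
    groups = {}
    for p, y in zip(points, labels):
        groups.setdefault(y, []).append(p)
    return {y: sum(ps) / len(ps) for y, ps in groups.items()}

def nearest(model, x):
    """Label of the centroid closest to x."""
    return min(model, key=lambda y: abs(model[y] - x))

def self_train(labeled, labels, unlabeled, rounds=5):
    # Step 1: learn a prototype from the few labelled pairs.
    model = centroids(labeled, labels)
    for _ in range(rounds):
        # Step 2: guess labels for the input-only examples.
        pseudo = [nearest(model, x) for x in unlabeled]
        # Step 3: refine the prototype using labelled and pseudo-labelled data.
        model = centroids(labeled + unlabeled, labels + pseudo)
    return model

labeled = [0.0, 10.0]                      # e.g. a gesture energy feature
labels = ["soft", "loud"]
unlabeled = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]

model = self_train(labeled, labels, unlabeled)
print(model)
```

With only two labelled points the initial centroids sit at the extremes; the unlabelled inputs pull them toward the actual centres of the two groups.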


ML techniques are configured to achieve specific tasks: Regression, Classification, Clustering, Segmentation, Forecasting.



Regression is the task of representing a set of continuous variables with another set of continuous variables. For example, the transformation can be performed through a function learned from examples.

Supervised regression

(Erratum: only supervised regression is mentioned in “Machine Learning of Musical Gestures”; both supervised and unsupervised regression must be considered.)

Supervised regression techniques are used for control: the mapping between gesture and sound is learned beforehand (as a regression problem) and is then used to drive the sound synthesis from new gestural inputs. A typical example is a performer who executes gestures while listening to sounds or music. The system automatically learns the relationship between gesture representation and sound description. A new input gesture then gives rise to a new sound output that is consistent with the learned relationship. While this situation seems closely related to cross-modal analysis, the techniques are not the same: analysis methods offer interpretable outcomes but are often not suited for prediction.
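
As a minimal sketch of this learn-then-drive scheme, the example below fits a straight line by ordinary least squares between one gesture feature and one sound parameter, and then maps a new gesture value to a sound value. The choice of a linear model, the feature names and the numbers are assumptions made for illustration; the systems discussed here typically use richer models such as neural networks.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical demonstrations: gesture speed paired with desired loudness.
speeds = [0.0, 1.0, 2.0, 3.0]
loudness = [0.1, 0.3, 0.5, 0.7]

slope, intercept = fit_line(speeds, loudness)   # learned beforehand

def gesture_to_sound(speed):
    """Drive the sound parameter from a new gesture value."""
    return slope * speed + intercept

print(gesture_to_sound(1.5))
```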

Unsupervised regression

Unsupervised regression techniques are used for analysis.


Streamed sensor data are often multidimensional, redundant or noisy (as shown in Figure 1), and it is worth trying to find a new representation of gesture with fewer dimensions. The goal is therefore to find a function that takes multidimensional gesture data as input and outputs data of lower dimension. Regression techniques can achieve this goal. In addition, if the new representation is not known in advance (which is often the case), the regression is unsupervised. Finding a new representation can be used for analysis ([Rasamimanana et al., 2006; Young et al., 2008; MacRitchie et al., 2009; Toiviainen et al., 2010]).
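
One common way to obtain such a lower-dimensional representation is PCA (listed in the glossary). The sketch below is an illustration under stated assumptions, not the procedure used in the cited studies: it estimates only the first principal axis by power iteration on the covariance matrix, and the sensor values are made up. It also assumes the data have non-zero variance, otherwise the normalization step would divide by zero.

```python
import math

def first_principal_axis(data, iterations=100):
    """Return the data mean and the direction of largest variance."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in data]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    axis = [1.0] * d                      # arbitrary starting direction
    for _ in range(iterations):           # power iteration
        axis = [sum(cov[i][j] * axis[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in axis))
        axis = [v / norm for v in axis]
    return mean, axis

def project(row, mean, axis):
    """Reduce one multidimensional frame to a single coordinate."""
    return sum((row[j] - mean[j]) * axis[j] for j in range(len(row)))

# Hypothetical 3-channel sensor frames: channel 2 duplicates channel 1 (scaled),
# channel 3 carries no information.
frames = [(1.0, 2.0, 0.0), (2.0, 4.0, 0.0), (3.0, 6.0, 0.0), (4.0, 8.0, 0.0)]

mean, axis = first_principal_axis(frames)
reduced = [project(f, mean, axis) for f in frames]
print(axis, reduced)
```

Here the three redundant channels collapse onto one coordinate per frame without losing the ordering of the frames.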

Unsupervised regression can also be used in a cross-modal configuration, namely for gesture-sound cross-modal analysis, for instance by inspecting how accompanying gestures are associated with sounds in a listening situation. Such analysis is motivated by recent advances in cognitive sciences, and more precisely by the fruitful application of embodied cognition theory to music (also known as embodied music cognition). Embodied cognition states that the physical elements of our body figure in our thought, meaning that thought is not confined to the brain but is also distributed over the body. Embodied music cognition, in turn, states that music perception and cognition are mediated by the body; consequently, it may be possible to assess gestural elements incorporated in the acoustic flux by studying the gestures performed while listening to it. ML techniques therefore seem well suited to this task: inspecting the intrinsic relationships between acoustic description (e.g. loudness, pitch, brightness, …) and gesture representation (e.g. speed, acceleration, energy, positions, …).
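
The simplest form of this inspection is a correlation between one gesture feature and one sound descriptor. The sketch below computes the Pearson correlation coefficient; it is a deliberately reduced stand-in for multivariate methods such as CCA (when each side has a single variable, the canonical correlation equals the absolute value of this coefficient). The feature names and values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical frame-by-frame features recorded while listening to a sound.
hand_speed = [0.1, 0.4, 0.9, 0.5, 0.2]
loudness = [0.2, 0.5, 1.0, 0.6, 0.3]     # rises and falls with the hand speed
brightness = [0.9, 0.6, 0.1, 0.5, 0.8]   # moves in the opposite direction

r_loudness = pearson(hand_speed, loudness)
r_brightness = pearson(hand_speed, brightness)
print(r_loudness, r_brightness)
```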


Classification is the task of deciding which category a dataset belongs to. For example, given incoming gesture data, the task is to decide which gesture of a given vocabulary it is.

Applications include multiparametric control of sound. In this case, classification techniques are used to control sound from a multidimensional input (e.g. video) whose individual dimensions are not necessarily interpretable or suitable for control. Classification makes it possible to set aside the raw description of the input and use instead a human-friendly description: the recognized gesture.
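
Deciding which gesture of a vocabulary an input belongs to can be sketched with k-nearest neighbors (kNN in the glossary). The vocabulary, the two-dimensional features and the value of `k` below are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def knn_classify(examples, query, k=3):
    """Return the majority label among the k examples closest to the query."""
    nearest = sorted(examples, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labelled examples: (feature vector, gesture name).
examples = [
    ((0.1, 0.1), "tap"), ((0.2, 0.0), "tap"), ((0.0, 0.2), "tap"),
    ((0.9, 1.0), "swipe"), ((1.0, 0.8), "swipe"), ((0.8, 0.9), "swipe"),
]

recognized = knn_classify(examples, (0.15, 0.1))
print(recognized)
```

The recognized label, rather than the raw feature vector, is what would then be used to control sound.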

Gestures are time-based processes. Differences in how a model gesture is executed can reflect expressive nuance. Being able to assess the temporal evolution of gestures, and the variations across them, leads to a greater potential for expressive interaction. One solution is to extend classification by creating a model of the gesture’s temporal evolution, possibly a statistical one in order to take into account noise or missing data. This leads to two typical applications: 1/ gesture following; 2/ variation tracking. Gesture following allows for the assessment of the time progression in a template while a gesture is being performed. This could be used, for instance, to synchronize musical events to the instrumentalist’s performance. Variation tracking allows for the assessment of how the live gesture varies from the template in terms of parameters such as speed, scale, orientation and so on. This could be used for multiparametric expressive control of sound.
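
Comparing a live gesture to a template despite differences in speed can be illustrated with Dynamic Time Warping (DTW in the glossary). The sketch below is an offline, one-dimensional version written for illustration; it is not the real-time, probabilistic gesture follower described above, and the sequences are made up.

```python
def dtw_distance(a, b):
    """Cumulative cost of the best time alignment between sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j]: best cost of aligning the first i samples of a with the first j of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances alone
                                     cost[i][j - 1],      # b advances alone
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]

template = [0.0, 1.0, 2.0, 1.0, 0.0]
slow_version = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 0.0, 0.0]
other_gesture = [2.0, 2.0, 2.0, 2.0, 2.0]

print(dtw_distance(template, slow_version))   # same shape at half speed
print(dtw_distance(template, other_gesture))  # different shape
```

The slower execution aligns perfectly with the template, whereas a plain sample-by-sample comparison would not even be defined for sequences of different lengths.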



Clustering is the task of finding groups in the data such that data belonging to the same group are as consistent as possible and as distinct as possible from data in other groups.

Clustering is used for unsupervised classification: the classes are learned from the input data and then used to classify new inputs during the testing phase.
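
A minimal sketch of this learn-the-classes-then-classify scheme, assuming the k-means algorithm with fixed starting centers (an illustrative choice; the data and names are hypothetical):

```python
import math

def closest(point, centers):
    """Index of the center nearest to the point."""
    return min(range(len(centers)), key=lambda c: math.dist(point, centers[c]))

def kmeans(points, centers, iterations=10):
    """Alternate between assigning points to centers and recomputing the centers."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            groups[closest(p, centers)].append(p)
        centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else c
            for g, c in zip(groups, centers)   # keep a center if its group is empty
        ]
    return centers

# Hypothetical unlabelled gesture features forming two groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]

centers = kmeans(points, centers=[(0.0, 0.0), (6.0, 5.0)])  # training
new_class = closest((4.0, 4.0), centers)                    # testing
print(centers, new_class)
```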



Segmentation is the task of defining boundaries between segments in a continuous multidimensional time series (e.g. gesture).
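
As a very simple illustration, boundaries can be placed wherever a one-dimensional activity measure crosses a threshold. This rule-based sketch involves no learning and is only an assumption chosen to make the notion of segment boundaries concrete; the signal and threshold are hypothetical.

```python
def segment(signal, threshold):
    """Return (start, end) index pairs of runs above the threshold (end exclusive)."""
    segments, start = [], None
    for i, value in enumerate(signal):
        if value > threshold and start is None:
            start = i                      # a segment begins
        elif value <= threshold and start is not None:
            segments.append((start, i))    # the segment ends
            start = None
    if start is not None:                  # signal ended while still active
        segments.append((start, len(signal)))
    return segments

# Hypothetical gesture energy over time: two bursts of movement.
energy = [0.0, 0.1, 0.8, 0.9, 0.7, 0.1, 0.0, 0.6, 0.9, 0.2]

boundaries = segment(energy, threshold=0.5)
print(boundaries)
```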


Prediction (or forecasting) is the task of analyzing historical data to forecast future events.
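
A baseline sketch of forecasting from past samples, assuming the gesture keeps the velocity observed between its last two samples. This constant-velocity rule is an illustrative assumption rather than a learned model; the names and values are hypothetical, and at least two past samples are required.

```python
def forecast(history, steps=1):
    """Extrapolate future samples from the last observed velocity."""
    velocity = history[-1] - history[-2]
    return [history[-1] + velocity * (k + 1) for k in range(steps)]

# Hypothetical hand positions sampled at regular intervals.
positions = [0.0, 0.5, 1.0, 1.5]

future = forecast(positions, steps=2)
print(future)
```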

Glossary of the techniques

ANN — Artificial Neural Networks

Regression and classification; Find non-linear relationship between input and output; Non-probabilistic and probabilistic versions of the method.

CCA — Canonical Correlation Analysis

DTW — Dynamic Time Warping

GMM — Gaussian Mixture Model

HMM — Hidden Markov Models

Classification, prediction; Generative model; Probabilistic

  • Hybrid HMM-DTW
  • Segmental
  • Hierarchical
  • Multimodal

kNN — k-Nearest Neighbors

LDA — Linear Discriminant Analysis

Non-probabilistic, linear, discriminant

PCA — Principal Component Analysis

Regression; Non-probabilistic (existing probabilistic version called Probabilistic PCA); linear

PF — Particle Filtering


This section gathers available software and further reading.


Wekinator is a toolkit for interactive machine learning based on Weka, the well-known suite of ML software written in Java. Wekinator implements the following supervised methods: multilayer perceptron neural networks, k-nearest neighbors, decision trees, AdaBoost, support vector machines.

Gesture Variation Follower is a library for realtime gesture recognition and gesture variation estimation, written in C++ and released as an openFrameworks add-on compatible with PureData and Max/MSP.

Probabilistic models, part of the MuBu collection of objects from Ircam, are a set of objects for Max/MSP implementing models for mapping-by-demonstration.

The SARC EyesWeb Catalog (SEC) proposes a suite of classification, clustering and regression methods for the EyesWeb cross-platform graphical programming environment.

The Gesture Recognition Toolkit is a C++ library that gathers methods from the SEC, but for multi-platform applications.

IRCAM’s MnM toolbox (part of FTM&Co) is a library of FTM-based objects and abstractions running in Max/MSP that facilitates handling matrices for gesture–sound mapping tasks. Although not exclusively an ML toolkit, it contains regression methods such as PCA and CCA, and classification methods such as HMM and GMM.

OpenCV (Open Source Computer Vision) contains ML tools that can also be used in musical performance.