Main

Zebra finches acquire complex, stereotyped vocalizations through a months-long process of sensory–motor learning3,19,20,21. During development, syllable order (that is, syntax) and the spectral structure of syllables evolve3. These two aspects of vocal learning may be mediated by largely independent mechanisms with distinct anatomical substrates21,22. Here we focus on characterizing the development of spectral structure. We began our studies by obtaining dense audio recordings of five male zebra finches between 35 and 123 days post-hatch (dph; mean ± standard deviation 73.4 ± 18.6 consecutive days of recording). Birds were isolated from other males after hatching and, on average, live-tutored from around 46 to 63 dph (Extended Data Fig. 1a). Band-passed (0.35–8 kHz) audio recordings were segmented into individual vocal renditions, and represented as song spectrogram segments (Fig. 1a; 563,124–1,203,647 renditions per bird). We excluded noise and isolated calls from the analyses.

Fig. 1: Fast and slow change in developing zebra finch vocalizations.

a, Vocalizations at three developmental stages. Dotted lines indicate syllable onsets. Crystallized song syllables (middle and bottom) fall into discrete categories (syllables i, a, b, c) and form a stereotyped ‘motif’, typically resembling the tutor song. b, Time course of one acoustic feature, entropy variance, for syllable b. c, Magnification of the region outlined in b, showing a period of within-day span (early to late, day k) and overnight shift (late day k to early day k + 1). The consolidation index (CI) is approximately −0.75. d, Histograms of consolidation indices over pairs of consecutive days, syllables and birds, for 32 acoustic features (top) and 32 random spectral projections (bottom). e–g, Three scenarios of slow developmental change (grey arrows) and fast within-day change in vocalizations. Each point represents the distribution of vocalizations from a given time and day. A larger distance between points indicates more dissimilar distributions. h, Linear projections of the points in g onto two example song features (dotted lines in g) for the misaligned, strong-consolidation scenario. Consolidation strength varies across directions. i, Consolidation indices over 10,000 random projections simulated from the three scenarios (1, 2 and 3 in e–g).

Behavioural change in single features

Vocal development is often characterized by considering changes in acoustic features such as pitch, frequency modulation3 or entropy variance2,14 (Fig. 1b). Such characterizations readily reveal multiple timescales of behavioural change: individual features can vary consistently within a day, display overnight discontinuities, and show drift over the duration of weeks or months (Fig. 1b, c).

We summarize the relation between change at these different timescales through a consolidation index (Fig. 1c), which measures whether within-day change in a feature (‘span’, Fig. 1c) is maintained or lost overnight (‘shift’, Fig. 1c). Weak consolidation2,14 corresponds to a consolidation index of close to −1 (no consolidation: the shift is equal but opposite to span); strong consolidation4,15 corresponds to an index of close to 0 (perfect consolidation: the shift is zero); and offline learning4,23,24 to an index of larger than 0. Across 32 commonly used acoustic features, the consolidation indices in our data are mostly negative, indicating weak consolidation (Fig. 1d, top; median −0.67). This finding holds even for random spectral features (Fig. 1d, bottom; median −0.64) and is consistent with past accounts of song development in zebra finches2,14.
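The reference values given above (−1, 0 and larger than 0) are reproduced if the index is taken to be the ratio of the overnight shift to the within-day span. The following is a minimal sketch under that assumption; the ratio form, the function name and the argument names are illustrative and are not taken from the paper's methods.

```python
def consolidation_index(early_k, late_k, early_next):
    """Ratio of overnight shift to within-day span for one feature.

    Assumed definition (illustrative, not confirmed by the text):
      span  = late_k - early_k      (within-day change on day k)
      shift = early_next - late_k   (overnight change into day k + 1)
    """
    span = late_k - early_k
    shift = early_next - late_k
    if span == 0:
        raise ValueError("span is zero; the index is undefined")
    return shift / span


# No consolidation: the overnight shift undoes the whole span.
print(consolidation_index(1.0, 2.0, 1.0))   # -1.0
# Perfect consolidation: nothing is lost overnight.
print(consolidation_index(1.0, 2.0, 2.0))   # 0.0
# Offline learning: change continues overnight in the same direction.
print(consolidation_index(1.0, 2.0, 2.5))   # 0.5
```

With these toy values, a feature that rises from 1.0 to 2.0 within a day and starts the next day at 1.25 yields an index of −0.75, the value quoted for the example in Fig. 1c.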

Individual features, however, may provide an incomplete account of change in a complex behaviour such as song vocalizations. To illustrate this point, we consider three simple scenarios. In the first two (Fig. 1e, f), the change in behaviour that occurs within any given day largely mirrors, on a faster timescale, the slow change that occurs over the course of many days or weeks. In the third scenario (Fig. 1g), within-day change is partly ‘misaligned’ with slow change: that is, it involves behavioural features that do not consistently change on slower timescales. Within-day change could reflect metabolic, neural or other changes that are not necessarily congruent with longer-term learning or development; the slow change reflects long-term modifications in behaviour that are typically equated with learning and development. We abstractly refer to these slow components as the direction of slow change (DiSC).

Notably, simulations of these scenarios show that negative consolidation indices for single features can result from very different time courses of development (Fig. 1h, i). Negative indices occur both when within-day and slow changes are closely aligned but daily gains along the DiSC are mostly lost overnight (weak consolidation, Fig. 1f), and when diurnal gains along the DiSC are perfectly consolidated but within-day change is substantially misaligned with slow change (Fig. 1g). The broad distributions of indices observed during song development (Fig. 1d, top), which also include strongly positive indices, seem more consistent with the misaligned scenario (Fig. 1i, histogram 3).
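The argument can be illustrated with a toy calculation. This is a sketch, not the simulation used for Fig. 1i: the dimensionality, the span and shift vectors, the ratio form of the index and the Gaussian random projections are all assumptions chosen for illustration. Axis 0 stands in for the DiSC and axis 1 for a misaligned direction of within-day change.

```python
import random
import statistics


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def projected_indices(span, shift, n_proj, rng):
    """Consolidation index (shift / span) along random Gaussian directions."""
    indices = []
    for _ in range(n_proj):
        p = [rng.gauss(0.0, 1.0) for _ in span]
        indices.append(dot(shift, p) / dot(span, p))
    return indices


rng = random.Random(0)
d = 10  # hypothetical feature-space dimension

# Aligned, weak consolidation: 75% of the daily gain along the DiSC is lost.
span_weak = [1.0] + [0.0] * (d - 1)
shift_weak = [-0.75] + [0.0] * (d - 1)

# Misaligned, strong consolidation: the gain along the DiSC is kept overnight;
# only the misaligned within-day excursion is undone.
span_mis = [0.25, 1.0] + [0.0] * (d - 2)
shift_mis = [0.0, -1.0] + [0.0] * (d - 2)

ci_weak = projected_indices(span_weak, shift_weak, 10_000, rng)
ci_mis = projected_indices(span_mis, shift_mis, 10_000, rng)

print("median, aligned/weak:     ", statistics.median(ci_weak))
print("median, misaligned/strong:", statistics.median(ci_mis))
print("share of positive indices:", sum(c > 0 for c in ci_mis) / len(ci_mis))
```

In this toy setting every projection of the aligned scenario returns the same negative index, whereas the misaligned scenario produces a broad distribution that is mostly negative but includes positive values, even though consolidation along the DiSC axis is perfect.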

Nearest-neighbour measures of change

We developed a general characterization of change in high-dimensional behavioural data, based on nearest-neighbour statistics12,13, that can distinguish between the scenarios in Fig. 1e–g. We initially analyse song-spectrogram segments of fixed duration aligned to syllable onset (Fig. 1a), but later extend our analysis to alternative parameterizations of the vocalization behaviour. Vocal renditions are represented as real-valued vectors \({x}_{i}\in {{\mathbb{R}}}^{d}\) (where i indexes renditions, and d denotes dimension), each associated with a production time, \({t}_{i}\in {\mathbb{R}}\) (for example, the bird’s age when singing \({x}_{i}\)). The K-neighbourhood of rendition \({x}_{i}\) is given by those K renditions (among the set of all renditions) that are closest to \({x}_{i}\) on the basis of some metric (for example, Euclidean distance). For small-enough values of K, different syllable types do not mix within a neighbourhood (Extended Data Fig. 1e) and neighbourhood statistics are largely independent of cluster boundaries, obviating the need for clustering renditions into syllables.
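A K-neighbourhood as defined above can be sketched with a brute-force search. The toy data, the function name and the choice to exclude the rendition itself from its own neighbourhood are assumptions made for illustration; datasets of the size analysed here would require a more efficient neighbour search.

```python
import math


def k_neighbourhood(X, i, K):
    """Indices of the K renditions closest to X[i] (Euclidean distance).

    The rendition itself is excluded, which is an assumption of this sketch.
    """
    dists = [(math.dist(X[i], x), j) for j, x in enumerate(X) if j != i]
    dists.sort()
    return [j for _, j in dists[:K]]


# Toy data: two-dimensional 'renditions' with production times in days.
X = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 5.0), (5.2, 4.9)]
t = [50, 51, 52, 70, 71, 72]

nbrs = k_neighbourhood(X, 0, 2)
print(nbrs, [t[j] for j in nbrs])  # [1, 2] [51, 52]
```

The production times of the returned neighbours are the quantities on which all later statistics are built.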

We visualize all vocalizations produced by a bird throughout development with Barnes–Hut t-distributed stochastic neighbour embedding (t-SNE)11 (which predominantly preserves local neighbourhoods11). Each point in the embedding corresponds to a spectrogram segment, \({x}_{i}\) (Fig. 1a). Different locations correspond to different vocalization types (Fig. 2b and Extended Data Fig. 2a). The embedding suggests that vocalizations change from undifferentiated subsong3,20 (Fig. 2a, middle) to clearly differentiated syllables that fall into at least four categories (Fig. 2a, syllables a, b, c and introductory note i, as in Fig. 1a). The emergence of clustered syllables from unclustered subsong can be confirmed by standard clustering approaches (Fig. 2g and Extended Data Fig. 1c, d). Notably, the embedding does not preserve all local structure in the data, as nearest neighbours in the embedding space are not necessarily nearest neighbours in the high-dimensional data space (Fig. 2a; black crosses represent high-dimensional neighbours). We therefore quantify behavioural change directly in the high-dimensional data by analysing the composition of high-dimensional neighbourhoods12,13 (Extended Data Fig. 2e–g).

Fig. 2: Neighbourhood mixing and repertoire dating.

a, t-SNE of all vocalizations from the bird in Fig. 1a. Each point is a syllable rendition. Clusters (syllables i, a, b, c) emerge during development. Arrows indicate renditions of syllable b from Fig. 1a. Crosses show the 600 nearest neighbours of the rendition from day 58. Inset, histogram of production times (neighbourhood times) over the 600 nearest neighbours. b, Average spectrograms for different locations in the t-SNE visualization from a. c, Pooled neighbourhood times for day 70. Percentiles (vertical lines) quantify the extent of the behavioural repertoire on day 70. d, Percentiles (5th, 50th and 95th) of neighbourhood times for individual renditions from day 70 (each row represents a rendition). Rows are sorted by the 50th percentile—the repertoire time (rT, red dots). Left and right black dots mark the 5th and 95th percentiles. A small random horizontal shift was added to each dot for visualization. e, Mixing matrix for all data points depicted in a. Each column of the matrix represents a histogram of production times, pooled over all neighbourhoods of points within a day (x axis), normalized by a shuffling null hypothesis (LMR, base-2 logarithm of the mixing ratio). The black arrow marks the first day of tutoring. f, Average mixing matrix for five birds (days 60–75). g, Single-day t-SNE, for three days (for the same bird as in Fig. 1a), illustrating the gradual emergence of clusters. h, Behavioural trajectory based on f, computed with ten-dimensional multidimensional scaling (MDS). Each point corresponds to a day. The two dimensions that capture the most variance in the trajectory are shown.

For each data point, we refer to the production times of all data points in its K-neighbourhood as ‘neighbourhood production times’ (or ‘neighbourhood times’; Fig. 2a, histogram). We summarize the neighbourhood times of many data points (Fig. 2d) through ‘pooled neighbourhood times’ (Fig. 2c) and the ‘neighbourhood mixing matrix’ (Fig. 2e and Extended Data Figs. 2g, 3d). Each value in the neighbourhood mixing matrix represents the similarity between behaviours from two production periods. Deviations from zero indicate that behaviours from the corresponding production periods are more similar (for values greater than 0), in terms of mixing at the level of K-neighbourhoods, or less similar (for values smaller than 0) than expected from a shuffling null hypothesis.
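One way to picture the mixing matrix is as a table of log ratios between observed and expected neighbour counts. The sketch below is a simplified stand-in, not the authors' procedure: it replaces the shuffling null with the analytic expectation that neighbours are drawn at random from all renditions, and the function name and dictionary layout are hypothetical.

```python
import math
from collections import Counter


def mixing_matrix(neighbours, days):
    """Base-2 log of observed over expected neighbour counts per pair of days.

    neighbours[i] lists the neighbour indices of rendition i and days[i] is
    its production day. Returns {(neighbour_day, rendition_day): LMR}.
    """
    n = len(days)
    per_day = Counter(days)
    observed = Counter()
    drawn = Counter()  # number of neighbours drawn by renditions of each day
    for i, nbrs in enumerate(neighbours):
        for j in nbrs:
            observed[(days[j], days[i])] += 1
            drawn[days[i]] += 1
    lmr = {}
    for col in per_day:
        for row in per_day:
            expected = drawn[col] * per_day[row] / n
            obs = observed[(row, col)]
            lmr[(row, col)] = math.log2(obs / expected) if obs else float("-inf")
    return lmr


days = [1, 1, 2, 2]
segregated = [[1], [0], [3], [2]]         # neighbours always from the same day
mixed = [[1, 2], [0, 3], [3, 0], [2, 1]]  # neighbours split evenly across days

print(mixing_matrix(segregated, days)[(1, 1)])  # 1.0: more similar than chance
print(mixing_matrix(mixed, days)[(2, 1)])       # 0.0: as expected under the null
```

Positive entries mark pairs of production periods whose renditions share neighbourhoods more often than chance, and negative entries mark pairs that share them less often.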

We use multidimensional scaling25 on the mixing matrix to represent the similarity between behaviours from different production times as a ‘behavioural trajectory’ (Fig. 2h). Each point on the trajectory represents the distribution of all vocalizations produced on a given day. Pairwise distances between points represent the dissimilarity between distributions (Extended Data Fig. 2e–g). Here we focus on a 16-day phase of gradual change midway through development (Fig. 2f). During this phase, the behavioural trajectory is structured differently on fast and slow timescales (Extended Data Fig. 3f–h). The two-dimensional projection of the trajectory that explains the maximal variance mainly reflects the direction of slow change (Fig. 1e–g, 2h).

The behavioural trajectory summarizes the progressive differentiation of vocalizations into distinct syllables, as well as simultaneous, continuous change in many spectral features of individual syllables. Notably, change is characterized through the behavioural trajectory by comparing the bird’s song to itself across time, rather than to a tutor song. Thus the behavioural trajectory may also reflect innate song priors that can result in crystallized song deviating from the tutor song26 and additional change due to other developmental processes27.

Repertoire extent and consolidation

Additional t-SNE visualizations of the data suggest that renditions from nearby days overlap considerably, whereby changes occurring within a day partly mimic the slow change across days (Extended Data Fig. 2b, c). We quantify this apparent spread along the DiSC—reflecting different degrees of behavioural ‘maturity’—through neighbourhood times (Fig. 2d). We refer to behavioural renditions that predominantly have neighbours produced in the future as ‘anticipations’, and to renditions that predominantly have neighbours that were produced in the past as ‘regressions’ (Extended Data Fig. 3b). By contrast, renditions that are ‘typical’ for a given developmental stage mostly have neighbours produced on the same or nearby days. We denote the median neighbourhood time as the ‘repertoire time’ of a rendition. The repertoire time effectively places each rendition along the DiSC (Fig. 2d, x axis): that is, it dates it with respect to the progression of vocal development (‘repertoire dating’). A broad distribution of repertoire times across all renditions in a day (Fig. 2d) suggests considerable behavioural variability along the DiSC; the most extreme regressions are backdated more than ten days into the past, and the most extreme anticipations are post-dated more than ten days into the future.
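Repertoire dating of a single rendition reduces to taking the median of its neighbourhood times. In the sketch below, the three-way labelling and its `tolerance` threshold are hypothetical additions for illustration only; the text defines the categories qualitatively and does not give a cut-off.

```python
import statistics


def repertoire_time(neighbourhood_times):
    """Median production time over a rendition's K-neighbourhood."""
    return statistics.median(neighbourhood_times)


def classify(rep_time, production_day, tolerance=2):
    """Label a rendition relative to its own day (threshold is hypothetical)."""
    offset = rep_time - production_day
    if offset > tolerance:
        return "anticipation"
    if offset < -tolerance:
        return "regression"
    return "typical"


day = 70
toy_neighbourhoods = [
    [57, 58, 60, 61, 70],  # neighbours mostly produced in the past
    [69, 70, 70, 71, 72],  # neighbours from the same or nearby days
    [70, 79, 81, 82, 83],  # neighbours mostly produced in the future
]
for times in toy_neighbourhoods:
    rt = repertoire_time(times)
    print(rt, classify(rt, day))
# 60 regression
# 70 typical
# 81 anticipation
```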

To quantify behavioural change on the timescale of hours, we subdivide each day into ten consecutive periods, and compute pooled neighbourhood times separately for each period. The percentiles of the pooled neighbourhood times chart the evolution of behaviour within and across days throughout development (Fig. 3a). Each repertoire-dating percentile is akin to a learning curve for a part of the behavioural repertoire (for example, typical renditions are described by the 50th percentile, and extreme anticipations by the 95th). The evolution of each percentile captures the progress along the DiSC (Fig. 3a, y axis) over time (Fig. 3a, x axis). We validated this characterization of behavioural change on simulated behaviour that mimicked vocal development (Extended Data Fig. 4a–d).
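The per-period percentiles can be sketched as follows. Two choices here are assumptions: periods are formed by splitting the renditions of a day, in production order, into groups of equal size (the text does not say whether periods are defined by rendition count or by clock time), and percentiles are computed by linear interpolation between order statistics.

```python
def percentile(values, q):
    """q-th percentile by linear interpolation between order statistics."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def period_percentiles(nbr_times, n_periods=10, qs=(5, 25, 50, 75, 95)):
    """Percentiles of pooled neighbourhood times for each period of one day.

    nbr_times holds one list of neighbourhood times per rendition, in
    production order. Periods with no renditions yield None.
    """
    n = len(nbr_times)
    table = []
    for p in range(n_periods):
        chunk = nbr_times[p * n // n_periods:(p + 1) * n // n_periods]
        pooled = [t for times in chunk for t in times]
        table.append([percentile(pooled, q) for q in qs] if pooled else None)
    return table


# Toy day: 20 renditions whose neighbourhoods drift forward during the day.
nbr_times = [[69 + i / 10, 70 + i / 10, 71 + i / 10] for i in range(20)]
table = period_percentiles(nbr_times)
medians = [row[2] for row in table]
print(medians[0], medians[-1])  # the 50th percentile rises from early to late
```

Plotting each column of `table` against the period index gives one curve per percentile, analogous to the lines in Fig. 3a.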

Fig. 3: Multiple components of behavioural change during sensory–motor learning.

a, Average repertoire dating percentiles (for five birds) describing within and across-day changes along the DiSC. For each production day and period, five percentiles of the pooled neighbourhood times (Fig. 2c) are arranged vertically (lines). b, Average of data from a across days 60–70, expressed relative to the average 50th percentile. c, Within-bout changes. As for a, but based on production day and period in a singing bout. d, As for b, but averaged across data from c. e, Span and shift for the 5th, 50th and 95th percentiles (blue arrows in b, analogous to Fig. 1c) averaged over days 50–80, separately for syllables (points) and birds (colours). Black lines indicate medians and 95% bootstrapped confidence intervals over all points. f, Simulated stratified mixing matrices (right) for three models (left) of the alignment of within-day and across-day change with the DiSC. g, Average measured stratified mixing matrices (five birds, days 60–70). h–j, Stratified behavioural trajectory based on g. Different two-dimensional projections reveal the DiSC (h), as well as within-day (i) and across-day (j) change not aligned with the DiSC (labels 1–5 represent different strata). The full ten-dimensional trajectories faithfully reproduce the structure of the stratified mixing matrices (MDS stress = 0.016); the depicted four-dimensional subspace captures 81% of the ten-dimensional variance. k, Separate projections for each stratum onto the local DiSC (black arrows in upper diagrams; points represent strata from h).

The repertoire-dating percentiles reveal that typical renditions move gradually along the DiSC throughout the day, and that changes along the DiSC acquired during the day are, on average, fully consolidated overnight (Fig. 3a, b, red). Anticipations undergo a similar or smaller degree of within-day change (Fig. 3a, b, 75th and 95th percentiles), whereas regressions move by a larger distance within each day, but this change is only weakly consolidated overnight (Fig. 3a, b, 5th and 25th percentiles; Fig. 3e). The most ‘immature’ renditions thus improve markedly throughout a day—more than typical renditions or anticipations—but these improvements are mostly lost overnight. This pattern of change seems to be characteristic of development, as it is absent in adults (Extended Data Figs. 5, 6).

Movement along the DiSC also occurs on timescales that are faster than hours, namely within bouts of singing—that is, groups of vocalizations that are preceded and followed by a pause (average bout duration 3.81 ± 0.83 s across birds). We subdivide each bout into ten consecutive periods, compute pooled neighbourhood times for each period (over all bouts in a day), and track change through the corresponding percentiles (Fig. 3c, d). Within bouts, large changes along the DiSC occur at the regressive tail of the behavioural repertoire: vocalizations are most regressive at the onset and offset of bouts (Fig. 3c, d, 5th percentile). Similar, albeit weaker changes occur for typical renditions (Fig. 3c, d, red). The same apparent changes in song maturity are observed when short and long bouts (durations 2.30 ± 0.54 s versus 6.28 ± 1.73 s) are considered separately. Song maturity thus decreases at the end of a bout, not after a fixed time into the bout (Extended Data Fig. 5a–c).

Misaligned behavioural components

The repertoire time reveals within-day and within-bout changes that mirror, on a faster timescale, changes that also occur over many days (see Supplementary Methods). As above (Fig. 1), we refer to such components of change as being aligned with the DiSC, and to components that are not reflected in the repertoire time as being misaligned.

We identify both aligned and misaligned components of change through the ‘stratified mixing matrix’, which combines a neighbourhood-mixing matrix (for example, Fig. 2f) with repertoire dating. Each day’s behavioural repertoire is binned into five consecutive production periods. Within each period, the behavioural repertoire is subdivided into five strata on the basis of repertoire time (Fig. 2d, quintiles). All renditions from a day thus fall into 5 × 5 = 25 bins. The stratified mixing matrix measures similarity between 50 bins that combine the data from two adjacent days (Fig. 3g). We compare the measured stratified mixing matrix with simulations that differ with respect to how within-day change and change across adjacent days align with the DiSC (Fig. 3f and Extended Data Fig. 4e–j). In model 1, development is one-dimensional and therefore aligned with the DiSC (Fig. 3f, top; similar to Fig. 1e). In model 2, within-day change involves a component that is not aligned with the DiSC (Fig. 3f, middle; similar to Fig. 1g). In model 3, adjacent days are separated not only along the DiSC, but also along a direction orthogonal to both the DiSC and the direction of within-day change (Fig. 3f, bottom, across-day change). Prominent ‘stripes’ along every other diagonal in the measured mixing matrix (Fig. 3g) indicate a larger similarity between renditions from the same day than between renditions from adjacent days, as predicted by model 3, suggesting that several misaligned components contribute to change at fast timescales.
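The 5 × 5 binning that underlies the stratified mixing matrix can be sketched as below. The sketch assumes that production periods contain equal numbers of renditions and that strata are quintiles of repertoire time computed within each period; the function name and the toy repertoire times are hypothetical.

```python
from collections import Counter


def stratify_day(rep_times, n_periods=5, n_strata=5):
    """Assign each rendition of one day to a (period, stratum) bin.

    rep_times lists repertoire times in production order. Periods are
    consecutive groups of equal size; strata rank renditions by repertoire
    time within their period.
    """
    n = len(rep_times)
    bins = [None] * n
    for p in range(n_periods):
        idx = list(range(p * n // n_periods, (p + 1) * n // n_periods))
        ranked = sorted(idx, key=lambda i: rep_times[i])
        for rank, i in enumerate(ranked):
            bins[i] = (p, rank * n_strata // len(ranked))
    return bins


# Toy day with 50 renditions and arbitrary repertoire times.
rep_times = [(i * 7) % 11 + 60 for i in range(50)]
bins = stratify_day(rep_times)
counts = Counter(bins)
print(len(counts), set(counts.values()))  # 25 {2}
```

Applying the same binning to two adjacent days gives the 50 bins between which the stratified mixing matrix measures similarity.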

From the stratified mixing matrix, we infer stratified behavioural trajectories. The two-dimensional projection that captures most of the variance due to strata (Fig. 3h) resembles Fig. 2h and reflects the DiSC. Consistent with repertoire dating, behavioural change along the DiSC between adjacent days (Fig. 3h, blue versus red for each stratum) is small compared with the spread of the behaviour for one day along the DiSC (for example, blue points, strata 1–5). For each stratum, however, much of the change that occurs within a day is misaligned with the DiSC (Fig. 3i, k; early versus late separated along the orthogonal dimension of within-day change). Yet another misaligned component is necessary to appropriately capture change across adjacent days (Fig. 3j). These properties of aligned and misaligned components are replicated by a linear analysis based on spectral features that are chosen to capture change at specific timescales (Extended Data Figs. 7, 8), and are robust to how song is parameterized and segmented, and to how nearest neighbours are defined (Extended Data Figs. 9, 10).

Discussion

Our analysis of high-dimensional vocalizations reveals that vocal learning and development do not reflect an underlying one-dimensional process. Single behavioural features in isolation therefore provide an incomplete account of behavioural change during development and learning. The weak consolidation observed here (Fig. 1d) and elsewhere2,14 at the level of single features appears to reflect prominent misaligned components of within-day change, rather than weak consolidation along the DiSC (Fig. 1h, i). Strong overnight consolidation along the DiSC across much of the behavioural repertoire (Fig. 3a, b) seems consistent with consolidation patterns observed for skilled motor learning in humans23,24,28 and for motor adaptation in humans1,18 and birds4.

Our characterization of behaviour on the basis of nearest-neighbour statistics can be applied when no accurate parametric model of the behaviour is known, as is the case at present for most natural, complex behaviours. The approach is largely complementary to methods that rely on clustering behaviour into distinct categories2,3,10,29. Forgoing an explicit clustering of the data can be advantageous, because assuming the existence of clusters can be an unwarranted approximation30 and may impede the characterization of behaviour that appears not to be clustered (such as juvenile zebra finch song; Extended Data Fig. 1); moreover, determining correct cluster boundaries is in general an ill-defined problem30. Notably, our analyses require only an indicator function that selects nearest neighbours (based here on a ‘locally meaningful’ distance metric)—a much weaker requirement than a globally valid distance metric or the existence of a low-dimensional feature space that globally maps behavioural space11. These properties make repertoire dating applicable to almost any behaviour and other high-dimensional datasets, including data that are characterized by ‘labels’ other than production time. Repertoire dating may thus provide a general account of learning and change that is amenable to comparisons between different behaviours and model systems, including different species17 and artificial systems5.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.