Update: The code for this notebook is available on GitHub.
# Aligning video recordings with Julia

Due to the restrictions imposed by COVID-19, dance teachers around the world are taking their classes online. As a dance student, it would be helpful to watch yourself against a recording of your teacher.
Something like this:
<iframe width="680" height="383" src="https://www.youtube-nocookie.com/embed/I8kuhv0FcTE" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

It is very unlikely that the song starts at exactly the same moment in both videos, and syncing them manually is not easy.
On the other hand, computers can compare the sound waves of each video file and quickly determine how long to wait before playing one video or the other so they are both aligned.
## Sound waves

[Sound waves](https://pudding.cool/2018/02/waveforms/) are represented in digital audio by very rapidly sampling the distortions of the medium (e.g. air) through a microphone and storing the resulting data in a vector.
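To make the idea of sampling concrete, here is a minimal sketch that builds such a vector by hand. The 440 Hz test tone and the 10 ms duration are arbitrary values chosen for illustration; they are not part of the recordings used in this notebook.

```julia
# Hypothetical example: sample a 440 Hz sine tone for 10 ms at 16 kHz.
samplerate = 16_000          # samples per second
duration = 0.01              # seconds
frequency = 440.0            # Hz, an arbitrary test tone

# Time stamp of every sample: 0, 1/16000, 2/16000, ...
t = (0:round(Int, duration * samplerate) - 1) ./ samplerate

# One amplitude reading per time stamp, stored in a plain vector.
samples = Float32.(sin.(2π .* frequency .* t))

length(samples)   # 160 samples for 10 ms of audio
```

A real recording works the same way, except that the readings come from a microphone instead of a formula.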
Jazmine, the teacher, sent Julia a [recording of a jazz song](https://freesound.org/s/37750/) ([cc0](https://creativecommons.org/publicdomain/zero/1.0/)), sampled 16,000 times per second:
```julia
begin
    using FileIO: load
    import LibSndFile
    using Plots
    using SampledSignals
    using FFTW
end
```

```julia
x = load("../data/jazz.ogg")

x_duration_seconds = length(x) / x.samplerate  # 7.00875

typeof(x), x.samplerate, length(x)  # (SampleBuf{Float32,2}, 16000.0, 112140)
```

Each sample in $$x$$ corresponds to a microphone reading. The six samples starting at $$1s$$ are:

```julia
x[16_000:16_005, 1]
```

```
-0.140871
-0.125916
-0.0961665
-0.0542877
-0.0306375
-0.0344809
```

Plotting the audio wave:
```julia
plot(domain(x), x[:, 1], xlabel="time (s)", ylabel="amplitude", title="Jazz song", linealpha=0.85, legend=false, linecolor=palette(:default)[2], fmt = :png, dpi=300)
```

Julia then danced to the same song and recorded herself near [the sea](https://freesound.org/s/9332/) ([cc-by-nc](https://creativecommons.org/licenses/by-nc/3.0/)):

```julia
x2 = load("../data/jazz-waves.ogg")

x2_duration_seconds = length(x2) / x2.samplerate  # 9.52

typeof(x2), x2.samplerate, length(x2)  # (SampleBuf{Float32,2}, 16000.0, 152320)
```

Let's look at Julia's audio wave:
```julia
plot(domain(x2), x2[:, 1], xlabel="time (s)", ylabel="amplitude", title="Jazz song at the Sea", linealpha=0.85, legend=false, fmt = :png, dpi=300)
```

We would like to know how much lag there is between the two recordings of the same song: $$x$$ and $$x2$$.

## Using Cross Correlation to align audio signals
Visually, we could superimpose both signals and slide one over the other along the time axis until they match as closely as possible.

For example, take Jazmine's recording ($$x$$) and place it next to Julia's ($$x2$$). They are clearly not aligned.

```julia
begin
    # Pad x with silence on the right so both signals have the same length
    x_right_padding_1 = zeros(Float32, length(x2) - length(x), 1)
    x_padded_1 = cat(x, x_right_padding_1, dims=1)
    mixed_1 = cat(x2, x_padded_1, dims=2)
    plot(domain(mixed_1), mixed_1, xlabel="time (s)", ylabel="amplitude", title="At the sea + Jazz song", linealpha=0.85, legend=false, fmt = :png, dpi=300)
end
```

Then slide Jazmine's recording ($$x$$) $$0.1$$ seconds to the right, then another $$0.1$$ seconds, and so on. $$1$$ second later:

```julia
begin
    # Silence on the left shifts x to the right by 1 second
    x_left_padding_2 = zeros(Float32, convert(Int, 1 * x.samplerate), 1)
    x_right_padding_2 = zeros(Float32, length(x2) - length(x) - length(x_left_padding_2), 1)
    x_padded_2 = cat(x_left_padding_2, x, x_right_padding_2, dims=1)
    mixed_2 = cat(x2, x_padded_2, dims=2)
    plot(domain(mixed_2), mixed_2, xlabel="time (s)", ylabel="amplitude", title="At the sea + Jazz song. Offset 1 second", linealpha=0.85, legend=false, fmt = :png, dpi=300)
end
```

$$2$$ seconds later:

```julia
begin
    x_left_padding_3 = zeros(Float32, convert(Int, 2 * x.samplerate), 1)
    x_right_padding_3 = zeros(Float32, length(x2) - length(x) - length(x_left_padding_3), 1)
    x_padded_3 = cat(x_left_padding_3, x, x_right_padding_3, dims=1)
    mixed_3 = cat(x2, x_padded_3, dims=2)
    plot(domain(mixed_3), mixed_3, xlabel="time (s)", ylabel="amplitude", title="At the sea + Jazz song. Offset 2 seconds", linealpha=0.85, legend=false, fmt = :png, dpi=300)
end
```

$$2.5$$ seconds later both $$x$$ and $$x2$$ are perfectly aligned.

```julia
begin
    x_left_padding_4 = zeros(Float32, convert(Int, 2.5 * x.samplerate), 1)
    x_right_padding_4 = zeros(Float32, length(x2) - length(x) - length(x_left_padding_4), 1)
    x_padded_4 = cat(x_left_padding_4, x, x_right_padding_4, dims=1)
    mixed_4 = cat(x2, x_padded_4, dims=2)
    plot(domain(mixed_4), mixed_4, xlabel="time (s)", ylabel="amplitude", title="At the sea + Jazz song. Offset 2.5 seconds", linealpha=0.85, legend=false, fmt = :png, dpi=300)
end
```

This sliding process is essentially what a cross-correlation measures. It tells us how similar 2 signals are when displaced relative to each other.
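The sliding idea can be sketched directly in code, before any Fourier transforms are involved. The functions `similarity_at_lag` and `best_lag` and the toy signals below are hypothetical names and data made up for this illustration; the sketch scores each shift with a dot product over the overlapping samples and keeps the best one.

```julia
# Similarity of `v` against `u` shifted `lag` samples to the right:
# the dot product over the samples where both signals overlap.
function similarity_at_lag(u::AbstractVector, v::AbstractVector, lag::Integer)
    total = zero(promote_type(eltype(u), eltype(v)))
    for i in eachindex(u)
        j = i + lag                      # position of u[i] after the shift
        if checkbounds(Bool, v, j)
            total += u[i] * v[j]
        end
    end
    return total
end

# Try every shift from 0 to `max_lag` and keep the one with the highest score.
function best_lag(u, v, max_lag)
    scores = [similarity_at_lag(u, v, lag) for lag in 0:max_lag]
    return argmax(scores) - 1            # convert the 1-based index to a lag
end

# Toy signals: `v` is `u` delayed by 3 samples.
u = Float32[0, 1, 2, 1, 0, -1, -2, -1]
v = vcat(zeros(Float32, 3), u)

best_lag(u, v, 5)   # 3
```

This brute-force version does work proportional to the signal length for every candidate shift, which becomes slow for recordings with hundreds of thousands of samples. That cost is the motivation for computing the same quantity with Fourier transforms.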
Mathematically, the [cross-correlation](https://en.wikipedia.org/wiki/Cross-correlation) of two temporal series $$u$$ and $$v$$ is defined as:

$$(u \star v)(t) \triangleq\ \int_{-\infty}^\infty \overline{u(\tau)} v(t + \tau) \, d\tau$$

Where $$\overline{u(\tau)}$$ is the [complex conjugate](https://en.wikipedia.org/wiki/Complex_conjugate) of $$u(\tau)$$ and $$t$$ is the lag.

The cross-correlation ($$\star$$) of two functions is equivalent to the convolution ($$*$$) of $$v$$ with $$u$$ conjugated and reversed in the time axis:

$$(u \star v)(t) = \overline{u(-t)} * v(t)$$

For real-valued signals such as audio, the conjugate has no effect. Moreover, the Fourier transform ($$\mathcal{F}$$) of a convolution [is equivalent](https://en.wikipedia.org/wiki/Convolution_theorem) to multiplying the Fourier transforms of the functions:

$$\mathcal{F}\{u * v\} = \mathcal{F}\{u\} \cdot \mathcal{F}\{v\}$$

Combining both results:

$$(u \star v)(t) = \mathcal{F}^{-1}\big\{\mathcal{F}\{\overline{u(-t)}\}\cdot\mathcal{F}\{v(t)\}\big\}$$

As such we can [implement the cross-correlation](https://dsp.stackexchange.com/questions/736/how-do-i-implement-cross-correlation-to-prove-two-audio-files-are-similar) with the following steps:

1. Zero padding the signals and reversing one of them:
```julia
begin
    # Difference in length between the two recordings (x2 is the longer one)
    x_max_size = abs(length(x2) - length(x))

    # Pad x with zeros so that it ends up as long as the padded x2
    x_pad_size = length(x) + 2 * x_max_size
    x_pad = SampledSignals.SampleBuf(zeros(Float32, x_pad_size, 1), x.samplerate)
    x_padded = vcat(x, x_pad)
    x_padded_reversed = reverse(x_padded, dims=1)

    # Pad x2 with as many zeros as it has samples
    x2_pad_size = length(x2)
    x2_pad = SampledSignals.SampleBuf(zeros(Float32, x2_pad_size, 1), x2.samplerate)
    x2_padded = vcat(x2, x2_pad)

    # Listen to this audio at 12 seconds. It is the same Jazz song (x) but reversed and with empty audio at the start
    x_padded_reversed
end
```

2. Performing a convolution between both signals with their Fourier transforms:
```julia
convolution = irfft(rfft(x2_padded) .* rfft(x_padded_reversed), length(x_padded))
```

```
304640×1 Array{Float32,2}:
 717.08154
 692.6897
 659.4013
 618.29803
 570.74023
 518.3476
 460.08493
   ⋮
 595.0318
 646.0196
 686.0295
 714.1685
 729.12396
 729.91205
```

```julia
plot(domain(x2_padded), convolution[:, 1], xlabel="time (s)", ylabel="convolution(t)", legend=false, fmt = :png, dpi=300)
```

The point where the signals are most correlated corresponds to the peak of the convolution. This is the lag between $$x$$ and $$x2$$:

```julia
lag_index, peak_value = argmax(convolution[:, 1]), maximum(convolution[:, 1])  # (40000, 6877.022f0)
```

The lag is exactly 2.5 seconds. We can now use this value to align the audio/video files for Jazmine and Julia.

```julia
lag_seconds = x_duration_seconds * lag_index / length(x)  # 2.5
```
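As a rough sketch of what "using this value" could mean for the audio, the delayed recording can be trimmed by the lag so that both signals start at the same instant. The function `align_by_lag` and the toy vectors below are hypothetical and only illustrate the idea on plain vectors; they are not part of the original notebook.

```julia
# Drop the first `lag_samples` samples of the delayed signal and cut both
# signals to the same length, so that sample i of each refers to the same instant.
function align_by_lag(reference::AbstractVector, delayed::AbstractVector, lag_samples::Integer)
    trimmed = delayed[(lag_samples + 1):end]
    n = min(length(reference), length(trimmed))
    return reference[1:n], trimmed[1:n]
end

# Toy check: `delayed` is `reference` preceded by 3 samples of silence.
reference = Float32[1, 2, 3, 4]
delayed = vcat(zeros(Float32, 3), reference, Float32[9, 9])

a, b = align_by_lag(reference, delayed, 3)
a == b   # true
```

With the recordings above, the equivalent step would be to skip the first 2.5 seconds (40000 samples) of Julia's recording, or to delay Jazmine's by the same amount. The same offset would apply to the video tracks.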