Update: The code for this notebook is available on GitHub.
md"# Aligning video recordings with Julia"
md"Due to the restrictions imposed by COVID-19, dance teachers around the world are taking their classes online. As a dance student, it would be helpful to watch yourself against a recording of your teacher.
Something like this:"
html"<iframe width=\"680\" height=\"383\" src=\"https://www.youtube-nocookie.com/embed/I8kuhv0FcTE\" frameborder=\"0\" allow=\"accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen></iframe>"
md"It is very unlikely that the song starts at exactly the same time in both videos. Manually syncing them is not easy.
On the other hand, computers can compare the sound waves of each video file and quickly determine how long to wait before playing one video or the other so they are both aligned."
md"## Sound waves"
md"[Sound waves](https://pudding.cool/2018/02/waveforms/) are represented in digital audio by very rapidly sampling the distortions of the medium (e.g. air) through a microphone and storing the resulting data in a vector."
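To make the idea of sampling concrete, here is a minimal sketch that builds a digital signal by hand instead of reading it from a microphone. The 440 Hz tone, the 10 ms duration and all variable names are assumptions chosen for illustration; only the 16 kHz sample rate matches the recordings used in this notebook.

```julia
# Hypothetical example: sample a pure 440 Hz tone 16,000 times per second for 10 ms.
samplerate = 16_000                       # samples per second
duration = 0.01                           # seconds
frequency = 440                           # Hz

n = round(Int, samplerate * duration)     # number of samples to store
t = (0:n-1) ./ samplerate                 # time (in seconds) of each sample
tone = Float32.(sin.(2π .* frequency .* t))

length(tone), tone[1]
```

Each entry of `tone` plays the same role as a microphone reading in the recordings loaded below.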
md"Jazmine, the teacher, sent Julia a [recording of a jazz song](https://freesound.org/s/37750/) ([cc0](https://creativecommons.org/publicdomain/zero/1.0/)), sampled 16,000 times per second:"
begin
using FileIO: load
import LibSndFile
using Plots
using SampledSignals
using FFTW
end
x = load("../data/jazz.ogg")
x_duration_seconds = length(x) / x.samplerate
# 7.00875
typeof(x), x.samplerate, length(x)
# (SampleBuf{Float32,2}, 16000.0, 112140)
md"Each sample in $$x$$ corresponds to a microphone reading. The six samples starting at about $$1s$$ are:"
x[16_000:16_005, 1]
# -0.140871
# -0.125916
# -0.0961665
# -0.0542877
# -0.0306375
# -0.0344809
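The relation between a sample index and its timestamp can be sketched as follows. This assumes the first sample sits at time 0 and uses Julia's 1-based indexing; `sample_time` and `sample_index` are hypothetical helper names, not part of SampledSignals.

```julia
samplerate = 16_000.0   # samples per second, as in the recordings above

# Time (in seconds) of the i-th sample, assuming sample 1 is at t = 0
sample_time(i) = (i - 1) / samplerate

# Index of the sample closest to time t (in seconds)
sample_index(t) = round(Int, t * samplerate) + 1

sample_time(16_000), sample_index(1.0)
```

Under these assumptions, index 16,000 falls one sample short of the 1-second mark, which is close enough for the illustration above.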
md"Plotting the audio wave:"
plot(domain(x), x[:, 1], xlabel="time (s)", ylabel="amplitude", title="Jazz song", linealpha=0.85, legend=false, linecolor=palette(:default)[2], fmt = :png, dpi=300)
md"Julia then danced to the same song and recorded herself near [the sea](https://freesound.org/s/9332/) ([cc-by-nc](https://creativecommons.org/licenses/by-nc/3.0/)):"
x2 = load("../data/jazz-waves.ogg")
x2_duration_seconds = length(x2) / x2.samplerate
# 9.52
typeof(x2), x2.samplerate, length(x2)
# (SampleBuf{Float32,2}, 16000.0, 152320)
md"Let's look at Julia's audio wave:"
plot(domain(x2), x2[:, 1], xlabel="time (s)", ylabel="amplitude", title="Jazz song at the Sea", linealpha=0.85, legend=false, fmt = :png, dpi=300)
md"We would like to know how much lag there is between the 2 different recordings of the same song: $$x$$ and $$x2$$."
md"## Using Cross Correlation to align audio signals"
md"Visually, we could superimpose both signals and slide one over the other along the time axis until they match as closely as possible.
For example, take Jazmine's recording ($$x$$) and place it next to Julia's ($$x2$$). They are clearly not aligned."
begin
x_right_padding_1 = zeros(Float32, length(x2) - length(x), 1)
x_padded_1 = cat(x, x_right_padding_1, dims=1)
mixed_1 = cat(x2, x_padded_1, dims=2)
plot(domain(mixed_1), mixed_1, xlabel="time (s)", ylabel="amplitude", title="At the sea + Jazz song", linealpha=0.85, legend=false, fmt = :png, dpi=300)
end
md"Then slide Jazmine's recording ($$x$$) $$0.1$$ seconds to the right, then again $$0.1$$ seconds, …
$$1$$ second later:"
begin
x_left_padding_2 = zeros(Float32, convert(Int, 1 * x.samplerate), 1)
x_right_padding_2 = zeros(Float32, length(x2) - length(x) - length(x_left_padding_2), 1)
x_padded_2 = cat(x_left_padding_2, x, x_right_padding_2, dims=1)
mixed_2 = cat(x2, x_padded_2, dims=2)
plot(domain(mixed_2), mixed_2, xlabel="time (s)", ylabel="amplitude", title="At the sea + Jazz song. Offset 1 second", linealpha=0.85, legend=false, fmt = :png, dpi=300)
end
md" $$2$$ seconds later:"
begin
x_left_padding_3 = zeros(Float32, convert(Int, 2 * x.samplerate), 1)
x_right_padding_3 = zeros(Float32, length(x2) - length(x) - length(x_left_padding_3), 1)
x_padded_3 = cat(x_left_padding_3, x, x_right_padding_3, dims=1)
mixed_3 = cat(x2, x_padded_3, dims=2)
plot(domain(mixed_3), mixed_3, xlabel="time (s)", ylabel="amplitude", title="At the sea + Jazz song. Offset 2 seconds", linealpha=0.85, legend=false, fmt = :png, dpi=300)
end
md" $$2.5$$ seconds later both $$x$$ and $$x2$$ are perfectly aligned."
begin
x_left_padding_4 = zeros(Float32, convert(Int, 2.5 * x.samplerate), 1)
x_right_padding_4 = zeros(Float32, length(x2) - length(x) - length(x_left_padding_4), 1)
x_padded_4 = cat(x_left_padding_4, x, x_right_padding_4, dims=1)
mixed_4 = cat(x2, x_padded_4, dims=2)
plot(domain(mixed_4), mixed_4, xlabel="time (s)", ylabel="amplitude", title="Ocean Wave + Jazz. Offset 2.5 seconds", linealpha=0.85, legend=false, fmt = :png, dpi=300)
end
md"This sliding process is essentially what a cross-correlation measures. It tells us how similar 2 signals are when displaced relative to each other."
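The sliding idea can be written down directly as a brute-force search. The sketch below uses a tiny made-up signal rather than the recordings, and `sliding_scores` is a hypothetical helper: for every shift it multiplies the overlapping samples and adds them up, so a larger score means a better match.

```julia
# Similarity between `long` and `short` for every shift of `short` to the right.
# Entry lag + 1 of the result is the score when `short` is delayed by `lag` samples.
function sliding_scores(long, short)
    maxlag = length(long) - length(short)
    return [sum(long[lag+1:lag+length(short)] .* short) for lag in 0:maxlag]
end

short = [1.0, -2.0, 3.0, -1.0]               # toy "song"
long = vcat(zeros(3), short, zeros(2))       # the same pattern, delayed by 3 samples

scores = sliding_scores(long, short)
best_lag = argmax(scores) - 1                # shift (in samples) with the best match
```

This direct approach needs roughly `length(long) * length(short)` multiplications, which is why the FFT-based route described next is preferable for real audio.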
md"Mathematically, the [cross-correlation](https://en.wikipedia.org/wiki/Cross-correlation) of two temporal series $$u$$ and $$v$$ is defined as:"
md"$$(u \star v)(t) \triangleq\ \int_{-\infty}^\infty \overline{u(\tau)} v(t + \tau) \, d\tau$$"
md"where $$\overline{u(\tau)}$$ is the [complex conjugate](https://en.wikipedia.org/wiki/Complex_conjugate) of $$u(\tau)$$ and $$t$$ is the lag."
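Digital audio is sampled rather than continuous, so in practice the integral becomes a sum over samples. As a sketch of the discrete counterpart of the definition, with $n$ indexing samples and $k$ the lag in samples (these index names are chosen here for illustration):

```latex
(u \star v)[k] \triangleq \sum_{n=-\infty}^{\infty} \overline{u[n]}\, v[n + k]
```

For real-valued, finite recordings the conjugate has no effect, and only the samples where both signals overlap contribute to the sum.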
md"The cross-correlation ($$\star$$) of two functions is equivalent to the convolution ($$*$$) of the two after one of them is conjugated and reversed along the time axis:
$$(u \star v)(t) = \overline{u(-t)} * v(t)$$
For real-valued signals such as audio, the conjugate has no effect.
Moreover, the Fourier transform ($$\mathcal{F}$$) of a convolution [is equivalent](https://en.wikipedia.org/wiki/Convolution_theorem) to the product of the Fourier transforms of the functions:
$$\mathcal{F}\{u * v\} = \mathcal{F}\{u\} \cdot \mathcal{F}\{v\}$$
Combining both results:
$$(u \star v)(t) = \mathcal{F}^{-1}\big\{\mathcal{F}\{\overline{u(-t)}\}\cdot\mathcal{F}\{v(t)\}\big\}$$
As such we can [implement the cross-correlation](https://dsp.stackexchange.com/questions/736/how-do-i-implement-cross-correlation-to-prove-two-audio-files-are-similar) with the following steps:"
md"1. zero padding the signals and reversing one of them:"
begin
# Pad both signals with zeros up to the same length (twice the length of x2),
# so the circular convolution computed via the FFT does not wrap around.
x_pad_size = 2 * length(x2) - length(x)
x_pad = SampledSignals.SampleBuf(zeros(Float32, x_pad_size, 1), x.samplerate)
x_padded = vcat(x, x_pad)
# Reverse x in time: convolving with the reversed signal yields the cross-correlation
x_padded_reversed = reverse(x_padded, dims=1)
x2_pad_size = length(x2)
x2_pad = SampledSignals.SampleBuf(zeros(Float32, x2_pad_size, 1), x2.samplerate)
x2_padded = vcat(x2, x2_pad)
# Listen to this audio at 12 seconds. It is the same Jazz song (x) but reversed and with empty audio at the start
x_padded_reversed
end
md"2. Performing a convolution between both signals with their Fourier transforms: "
convolution = irfft(rfft(x2_padded) .* rfft(x_padded_reversed), length(x_padded))
# 304640×1 Array{Float32,2}:
#  717.08154
#  692.6897
#  659.4013
#  618.29803
#  570.74023
#  518.3476
#  460.08493
#  ⋮
#  595.0318
#  646.0196
#  686.0295
#  714.1685
#  729.12396
#  729.91205
plot(domain(x2_padded), convolution[:, 1], xlabel="time (s)", ylabel="convolution(t)", legend=false, fmt = :png, dpi=300)
md"The point where the signals are most correlated corresponds to the peak of the convolution. This is the lag between $$x$$ and $$x2$$:"
lag_index, peak_value = argmax(convolution[:, 1]), maximum(convolution[:, 1])
# (40000, 6877.022f0)
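The padding, the FFT-based convolution and the peak search can be gathered into one small function and checked on a toy signal where the answer is known. This is a sketch under stated assumptions: `estimate_lag` is a hypothetical name, it works on plain real-valued vectors rather than `SampleBuf`s, and it assumes the second signal is a delayed copy of the first.

```julia
using FFTW

# Estimate by how many samples `v` is delayed with respect to `u`.
function estimate_lag(u::AbstractVector{<:Real}, v::AbstractVector{<:Real})
    n = length(u) + length(v)                      # common padded length
    u_padded = vcat(u, zeros(n - length(u)))
    v_padded = vcat(v, zeros(n - length(v)))
    # Convolution with the time-reversed signal, computed in the frequency domain
    correlation = irfft(rfft(v_padded) .* rfft(reverse(u_padded)), n)
    # With 1-based indexing the peak index equals the delay; index n means no delay
    return mod(argmax(correlation), n)
end

u = [1.0, -2.0, 3.0, -1.0]                 # toy "song"
v = vcat(zeros(3), u, zeros(2))            # the same pattern, delayed by 3 samples
estimate_lag(u, v)
```

Dividing the returned number of samples by the sample rate converts the lag to seconds.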
md"The lag is exactly 2.5 seconds. We can now use this value to align the audio/video files for Jazmine and Julia."
lag_seconds = x_duration_seconds * lag_index / length(x)
# 2.5
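One possible way to use the lag, sketched on toy vectors rather than on the actual recordings, is to drop the leading samples of the delayed signal so both start at the same point of the song. `trim_lag` is a hypothetical helper and the 3-sample delay is made up for illustration.

```julia
# Remove the first `lag` samples of the delayed recording
trim_lag(delayed::AbstractVector, lag::Integer) = delayed[lag+1:end]

reference = [1.0, -2.0, 3.0, -1.0]                 # toy "song"
delayed = vcat(zeros(3), reference, zeros(2))      # same song, starting 3 samples late

aligned = trim_lag(delayed, 3)
aligned[1:length(reference)] == reference          # both now start together
```

For video, the equivalent step would be to start playback of the delayed file `lag_seconds` later than the other one.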