Update: The code for this notebook is available on GitHub.

# Aligning video recordings with Julia

Due to the restrictions imposed by COVID-19, dance teachers around the world are taking their classes online. As a dance student, you would find it helpful to watch yourself next to a recording of your teacher.

Something like this:


```julia
html"<iframe width=\"680\" height=\"383\" src=\"https://www.youtube-nocookie.com/embed/I8kuhv0FcTE\" frameborder=\"0\" allow=\"accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen></iframe>"
```

It is very unlikely that the song starts at exactly the same time in both videos, and syncing them manually is not easy.

On the other hand, computers can compare the sound waves of each video file and quickly determine how long to wait before playing one video or the other so they are both aligned.


## Sound waves

[Sound waves](https://pudding.cool/2018/02/waveforms/) are represented in digital audio by very rapidly sampling the distortions of the medium (e.g. air) through a microphone and storing the resulting data in a vector.
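To make the idea of sampling concrete, here is a minimal sketch that builds such a vector for a synthetic tone instead of a microphone. The 440 Hz frequency and the 10 ms duration are arbitrary choices for illustration; only the 16 kHz rate matches the recordings used in this notebook.

```julia
# Hypothetical example: sample a 440 Hz sine tone for 10 ms at 16 kHz.
samplerate = 16_000          # samples per second
frequency = 440.0            # tone frequency in Hz (arbitrary choice)
duration = 0.01              # seconds

n_samples = round(Int, duration * samplerate)
times = (0:n_samples - 1) ./ samplerate          # time of each sample, in seconds
tone = Float32.(sin.(2π .* frequency .* times))  # one amplitude reading per sample

println(length(tone))   # number of samples stored in the vector
```

Each entry of `tone` plays the same role as one microphone reading in a real recording.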

Jazmine, the teacher, sent Julia a [recording of a jazz song](https://freesound.org/s/37750/) ([cc0](https://creativecommons.org/publicdomain/zero/1.0/)), sampled 16,000 times per second:

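```julia
begin
    using FileIO: load      # reads audio files from disk
    import LibSndFile       # audio file backend for FileIO (needed for .ogg)
    using Plots
    using SampledSignals    # SampleBuf type and `domain`
    using FFTW              # rfft / irfft
end
```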

```julia
x = load("../data/jazz.ogg")
```

```julia
x_duration_seconds = length(x) / x.samplerate
# 7.00875
```

```julia
typeof(x), x.samplerate, length(x)
# (SampleBuf{Float32,2}, 16000.0, 112140)
```

Each sample in $x$ corresponds to a microphone reading. The six samples around the 1-second mark are:

```julia
x[16_000:16_005, 1]
# -0.140871
# -0.125916
# -0.0961665
# -0.0542877
# -0.0306375
# -0.0344809
```
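Mapping between a sample index and a timestamp is simple arithmetic. The sketch below assumes 1-based indexing with the first sample at time 0 and a constant sample rate; `time_of_sample` and `sample_at_time` are hypothetical helper names, not part of any package used here.

```julia
# Hypothetical helpers, assuming the first sample (index 1) sits at time 0.
time_of_sample(index, samplerate) = (index - 1) / samplerate
sample_at_time(seconds, samplerate) = round(Int, seconds * samplerate) + 1

samplerate = 16_000
time_of_sample(16_000, samplerate)   # just under 1 s
sample_at_time(1.0, samplerate)      # index of the sample at exactly 1 s
```

Under these assumptions, index 16,001 is the sample taken at exactly 1 second.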

Plotting the audio wave:

```julia
plot(domain(x), x[:, 1], xlabel="time (s)", ylabel="amplitude", title="Jazz song", linealpha=0.85, legend=false, linecolor=palette(:default)[2], fmt = :png, dpi=300)
```

Julia then danced to the same song and recorded herself near [the sea](https://freesound.org/s/9332/) ([cc-by-nc](https://creativecommons.org/licenses/by-nc/3.0/)):

```julia
x2 = load("../data/jazz-waves.ogg")
```

```julia
x2_duration_seconds = length(x2) / x2.samplerate
# 9.52
```

```julia
typeof(x2), x2.samplerate, length(x2)
# (SampleBuf{Float32,2}, 16000.0, 152320)
```

Let's look at Julia's audio wave:

```julia
plot(domain(x2), x2[:, 1], xlabel="time (s)", ylabel="amplitude", title="Jazz song at the Sea", linealpha=0.85, legend=false, fmt = :png, dpi=300)
```

We would like to know how much lag there is between the two recordings of the same song: $x$ and $x2$.

## Using Cross Correlation to align audio signals

Visually, we could superimpose both signals and slide one over the other along the time axis until they match as closely as possible.

For example, take Jazmine's recording ($x$) and place it next to Julia's ($x2$). They are clearly not aligned.

```julia
begin
    x_right_padding_1 = zeros(Float32, length(x2) - length(x), 1)
    x_padded_1 = cat(x, x_right_padding_1, dims=1)
    mixed_1 = cat(x2, x_padded_1, dims=2)
    plot(domain(mixed_1), mixed_1, xlabel="time (s)", ylabel="amplitude", title="At the sea + Jazz song", linealpha=0.85, legend=false, fmt = :png, dpi=300)
end
```

Then slide Jazmine's recording ($x$) 0.1 seconds to the right, then another 0.1 seconds, and so on.

1 second later:

```julia
begin
    x_left_padding_2 = zeros(Float32, convert(Int, 1 * x.samplerate), 1)
    x_right_padding_2 = zeros(Float32, length(x2) - length(x) - length(x_left_padding_2), 1)
    x_padded_2 = cat(x_left_padding_2, x, x_right_padding_2, dims=1)
    mixed_2 = cat(x2, x_padded_2, dims=2)
    plot(domain(mixed_2), mixed_2, xlabel="time (s)", ylabel="amplitude", title="At the sea + Jazz song. Offset 1 second", linealpha=0.85, legend=false, fmt = :png, dpi=300)
end
```

2 seconds later:

```julia
begin
    x_left_padding_3 = zeros(Float32, convert(Int, 2 * x.samplerate), 1)
    x_right_padding_3 = zeros(Float32, length(x2) - length(x) - length(x_left_padding_3), 1)
    x_padded_3 = cat(x_left_padding_3, x, x_right_padding_3, dims=1)
    mixed_3 = cat(x2, x_padded_3, dims=2)
    # Use the time axis of mixed_3 itself rather than that of the previous cell
    plot(domain(mixed_3), mixed_3, xlabel="time (s)", ylabel="amplitude", title="At the sea + Jazz song. Offset 2 seconds", linealpha=0.85, legend=false, fmt = :png, dpi=300)
end
```

2.5 seconds later, $x$ and $x2$ are perfectly aligned.

```julia
begin
    x_left_padding_4 = zeros(Float32, convert(Int, 2.5 * x.samplerate), 1)
    x_right_padding_4 = zeros(Float32, length(x2) - length(x) - length(x_left_padding_4), 1)
    x_padded_4 = cat(x_left_padding_4, x, x_right_padding_4, dims=1)
    mixed_4 = cat(x2, x_padded_4, dims=2)
    # Use the time axis of mixed_4 itself rather than that of an earlier cell
    plot(domain(mixed_4), mixed_4, xlabel="time (s)", ylabel="amplitude", title="Ocean Wave + Jazz. Offset 2.5 seconds", linealpha=0.85, legend=false, fmt = :png, dpi=300)
end
```
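The padding cells above all follow the same pattern: zeros on the left for the offset, the signal, then zeros on the right to fill the remaining length. As a minimal sketch, that pattern could be factored into a helper. It assumes plain vectors rather than `SampleBuf` objects, and `pad_with_offset` is a hypothetical name introduced here for illustration.

```julia
# Hypothetical helper: place `signal` inside a zero vector of `total_length`
# samples, starting `offset_seconds` after the beginning.
function pad_with_offset(signal::AbstractVector, total_length::Integer,
                         offset_seconds::Real, samplerate::Real)
    offset = round(Int, offset_seconds * samplerate)
    offset >= 0 || throw(ArgumentError("offset must be non-negative"))
    offset + length(signal) <= total_length ||
        throw(ArgumentError("signal does not fit at this offset"))
    padded = zeros(eltype(signal), total_length)
    padded[offset + 1 : offset + length(signal)] = signal
    return padded
end

# 0.5 s at 4 samples per second is an offset of 2 samples
pad_with_offset(Float32[1, 2, 3], 8, 0.5, 4)
```

With such a helper, each offset plot would only differ in the value passed as `offset_seconds`.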

This sliding process is essentially what a cross-correlation measures: how similar two signals are when one is displaced relative to the other.
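The sliding idea can be written down directly, before any Fourier transforms come into play. The following is a minimal brute-force sketch on toy vectors, not the method used later in this notebook: for every candidate lag it sums the products of overlapping samples and keeps the lag with the highest score. The names `correlation_at_lag` and `best_lag` are hypothetical, and `best_lag` assumes `v` is at least as long as `u`.

```julia
# Similarity of `u` and `v` when `u` is delayed by `lag` samples:
# the sum of products over the overlapping samples.
function correlation_at_lag(u::AbstractVector, v::AbstractVector, lag::Integer)
    total = zero(promote_type(eltype(u), eltype(v)))
    for i in eachindex(u)
        j = i + lag
        if 1 <= j <= length(v)
            total += u[i] * v[j]
        end
    end
    return total
end

# Try every lag that keeps `u` inside `v` and return the best one.
function best_lag(u::AbstractVector, v::AbstractVector)
    lags = 0:(length(v) - length(u))
    scores = [correlation_at_lag(u, v, lag) for lag in lags]
    return lags[argmax(scores)]
end

u = [1.0, -2.0, 3.0, -1.0]
v = [0.0, 0.0, 0.0, 1.0, -2.0, 3.0, -1.0, 0.0]   # `u` delayed by 3 samples
best_lag(u, v)   # 3
```

This costs one pass over the signal per candidate lag, which becomes slow for recordings with hundreds of thousands of samples. That cost is the motivation for a Fourier-based formulation.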

Mathematically, the [cross-correlation](https://en.wikipedia.org/wiki/Cross-correlation) of two time series $u$ and $v$ is defined as:

$$(u \star v)(t) \triangleq \int_{-\infty}^{\infty} \overline{u(\tau)}\, v(t + \tau)\, d\tau$$

where $\overline{u(\tau)}$ is the [complex conjugate](https://en.wikipedia.org/wiki/Complex_conjugate) of $u(\tau)$ and $t$ is the lag.
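For sampled signals such as the recordings in this notebook, the integral turns into a sum over sample indices. A discrete-time counterpart of the definition, where the index names $n$ and $k$ are notation introduced only for this sketch, would read:

$$(u \star v)[k] = \sum_{n} \overline{u[n]}\, v[n + k]$$

Here $k$ is the lag measured in samples, so a lag of $k$ samples at a sample rate $f_s$ corresponds to $k / f_s$ seconds.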

The cross-correlation ($\star$) of two functions is equivalent to the convolution ($*$) of the two signals after one of them has been conjugated and reversed in time:

$$(u \star v)(t) = \overline{u(-t)} * v(t)$$

For real-valued signals such as audio, the conjugate has no effect, so only the time reversal matters.

Moreover, the Fourier transform ($\mathcal{F}$) of a convolution [is equivalent](https://en.wikipedia.org/wiki/Convolution_theorem) to the product of the Fourier transforms of the functions:

$$\mathcal{F}\{u * v\} = \mathcal{F}\{u\} \cdot \mathcal{F}\{v\}$$

Combining both results:

$$(u \star v)(t) = \mathcal{F}^{-1}\big\{\mathcal{F}\{\overline{u(-t)}\} \cdot \mathcal{F}\{v(t)\}\big\}$$

As such, we can [implement the cross-correlation](https://dsp.stackexchange.com/questions/736/how-do-i-implement-cross-correlation-to-prove-two-audio-files-are-similar) with the following steps:

1. Zero-pad the signals and reverse one of them:

```julia
begin
    # Pad both signals with zeros up to twice the length of the longer recording (x2)
    x_pad_size = 2 * length(x2) - length(x)
    x_pad = SampledSignals.SampleBuf(zeros(Float32, x_pad_size, 1), x.samplerate)
    x_padded = vcat(x, x_pad)
    # Reverse x in time so that the convolution in the next step is a cross-correlation
    x_padded_reversed = reverse(x_padded, dims=1)

    x2_pad_size = length(x2)
    x2_pad = SampledSignals.SampleBuf(zeros(Float32, x2_pad_size, 1), x2.samplerate)
    x2_padded = vcat(x2, x2_pad)

    # Listen to this audio at 12 seconds. It is the same jazz song (x), but reversed and with empty audio at the start
    x_padded_reversed
end
```

2. Convolve both signals by multiplying their Fourier transforms:

```julia
convolution = irfft(rfft(x2_padded) .* rfft(x_padded_reversed), length(x_padded))
# 304640×1 Array{Float32,2}:
#  717.08154
#  692.6897
#  659.4013
#  618.29803
#  570.74023
#  518.3476
#  460.08493
#    ⋮
#  595.0318
#  646.0196
#  686.0295
#  714.1685
#  729.12396
#  729.91205
```

```julia
plot(domain(x2_padded), convolution[:, 1], xlabel="time (s)", ylabel="convolution(t)", legend=false, fmt = :png, dpi=300)
```

The point where the signals are most correlated corresponds to the peak of the convolution. The position of that peak is the lag between $x$ and $x2$:

```julia
lag_index, peak_value = argmax(convolution[:, 1]), maximum(convolution[:, 1])
# (40000, 6877.022f0)
```
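The steps above can be sketched end to end on toy data without any package. This is an illustration under stated assumptions, not the notebook's implementation: it swaps FFTW's `rfft`/`irfft` for a naive O(n²) discrete Fourier transform written in plain Julia, it only handles the case where the second signal is the delayed one, and the names `dft`, `idft` and `lag_via_dft` are hypothetical.

```julia
# Naive O(n^2) discrete Fourier transform and its inverse, for illustration only.
function dft(x::AbstractVector)
    n = length(x)
    return [sum(x[j + 1] * cis(-2π * k * j / n) for j in 0:n-1) for k in 0:n-1]
end

function idft(X::AbstractVector)
    n = length(X)
    return [sum(X[k + 1] * cis(2π * k * j / n) for k in 0:n-1) / n for j in 0:n-1]
end

# Lag (in samples) of `delayed` relative to `reference`.
function lag_via_dft(reference::AbstractVector{<:Real}, delayed::AbstractVector{<:Real})
    n = 2 * max(length(reference), length(delayed))   # zero-pad to avoid wrap-around
    ref_padded = vcat(reference, zeros(n - length(reference)))
    del_padded = vcat(delayed, zeros(n - length(delayed)))
    # Step 1: reverse the padded reference. Step 2: multiply the transforms.
    product = dft(del_padded) .* dft(reverse(ref_padded))
    convolution = real.(idft(product))
    # With this padding, the 1-based peak index equals the lag; index n means lag 0.
    return argmax(convolution) % n
end

reference = [1.0, -2.0, 3.0, -1.0]
delayed = [0.0, 0.0, 0.0, 1.0, -2.0, 3.0, -1.0, 0.0]   # reference delayed by 3 samples
lag_via_dft(reference, delayed)   # 3
```

Because the reversed signal is padded to the full length before reversal, the 1-based position of the peak coincides with the lag in samples, which is why the peak index can be used directly.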

The lag is exactly 2.5 seconds. We can now use this value to align the audio/video files for Jazmine and Julia.


```julia
# Convert the lag from samples to seconds
lag_seconds = lag_index / x.samplerate
# 2.5
```
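Once the lag is known, aligning the audio comes down to skipping that many samples at the start of the delayed recording. The sketch below shows this on plain vectors; `align` is a hypothetical helper, and it assumes the lag is non-negative, that is, the second recording is the one that starts later in the song.

```julia
# Hypothetical helper: drop the first `lag` samples of the delayed recording
# so that both recordings start at the same point of the song.
function align(reference::AbstractVector, delayed::AbstractVector, lag::Integer)
    0 <= lag <= length(delayed) || throw(ArgumentError("lag out of range"))
    trimmed = delayed[lag + 1:end]
    n = min(length(reference), length(trimmed))   # keep only the overlapping part
    return reference[1:n], trimmed[1:n]
end

song = [1, 2, 3]
late_recording = [0, 0, 1, 2, 3, 9]   # the song starts 2 samples late
align(song, late_recording, 2)        # ([1, 2, 3], [1, 2, 3])
```

For video, the same offset in seconds would be applied as a playback delay between the two files.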

Rodrigo Castro
