Update: The code for this notebook is available on GitHub.

Aligning video recordings with Julia

Due to the restrictions imposed by COVID-19, dance teachers around the world are taking their classes online. As a dance student, you would find it helpful to watch yourself side by side with a recording of your teacher.

Something like this:

It is very unlikely that the song starts at exactly the same moment in both videos, and syncing them manually is not easy.

On the other hand, computers can compare the sound waves of each video file and quickly determine how long to wait before playing one video or the other so they are both aligned.

Sound waves

Sound waves are represented in digital audio by very rapidly sampling the distortions of the medium (e.g. air) through a microphone and storing the resulting data in a vector.
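
As a rough illustration of sampling, the sketch below builds a vector of readings from a pure tone instead of a microphone. The names `freq`, `duration` and `wave` and the 440 Hz tone are assumptions made for this example; only the 16,000 samples per second matches the recordings used later.

```julia
# Assumed values, for illustration only
fs = 16_000          # samples per second
duration = 0.01      # seconds of audio to generate
freq = 440.0         # frequency of the tone in Hz

# Time stamp of each sample: 0, 1/fs, 2/fs, ...
t = (0:round(Int, duration * fs) - 1) ./ fs

# One amplitude reading per time stamp, stored in a vector
wave = Float32.(sin.(2π .* freq .* t))

println(length(wave))   # number of samples in the vector
println(wave[1:5])      # the first five readings
```

A real recording works the same way, except that the readings come from the microphone rather than from a formula.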

Jazmine, the teacher, sent Julia a recording of a jazz song (cc0), sampled 16,000 times per second:

x_duration_seconds
7.00875

(SampleBuf{Float32,2}, 16000.0, 112140)

Each sample in x corresponds to a microphone reading. For example, we can look at the first 5 samples starting at the 1-second mark.
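
The relationship between sample index, sample rate and time can be sketched as follows. Here `x_demo` is a hypothetical 2-second stand-in for the real recording, which is not reproduced in this sketch, so the printed values are not the ones from Jazmine's file.

```julia
fs = 16_000    # samples per second, as in the recording

# Hypothetical stand-in for x: 2 seconds of a 440 Hz tone
x_demo = Float32.(sin.(2π .* 440 .* (0:2fs-1) ./ fs))

# Duration is the number of samples divided by the sample rate
duration_seconds = length(x_demo) / fs

# Julia arrays are 1-based, so the sample at time t seconds sits at index t * fs + 1
start = 1 * fs + 1
first_five = x_demo[start:start + 4]

println(duration_seconds)
println(first_five)
```

The same arithmetic explains the figures reported for x: 112140 samples divided by 16000 samples per second gives 7.00875 seconds.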

Plotting the audio wave:

Julia then danced to the same song and recorded herself near the sea (cc-by-nc):

x2_duration_seconds
9.52

(SampleBuf{Float32,2}, 16000.0, 152320)

Let's look at Julia's audio wave:

We would like to know how much lag there is between the 2 different recordings of the same song: x and x2.

Using Cross Correlation to align audio signals

Visually, we could superimpose both signals and slide one over the other along the time axis until they match as closely as possible.

For example, take Jazmine's recording (x) and place it next to Julia's (x2). They are clearly not aligned.

Then slide Julia's recording (x2) 0.1 seconds to the right, then again 0.1 seconds, …

1 second later:

2 seconds later:

2.5 seconds later both x and x2 are perfectly aligned.

This sliding process is essentially what a cross-correlation measures. It tells us how similar 2 signals are when displaced relative to each other.

Mathematically, the cross-correlation of two time series u and v is defined as:

$$(u \star v)(t) \triangleq \int_{-\infty}^{\infty} \overline{u(\tau)}\, v(t + \tau)\, d\tau$$

where $\overline{u(\tau)}$ is the complex conjugate of $u(\tau)$ and $t$ is the lag.
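
For sampled signals the integral becomes a sum over sample indices. The sketch below is a direct, unoptimized translation of that sum, assuming samples outside either vector are zero. The function name `crosscorrelate` and the toy vectors are made up for this example and are not taken from the notebook.

```julia
# Discrete form of the definition: c[t] = Σ_τ conj(u[τ]) * v[t + τ],
# treating samples outside either vector as zero.
function crosscorrelate(u::AbstractVector, v::AbstractVector)
    lags = -(length(u) - 1):(length(v) - 1)
    c = zeros(promote_type(eltype(u), eltype(v)), length(lags))
    for (k, t) in enumerate(lags)
        for τ in 1:length(u)
            j = t + τ                      # index into v for this lag
            if 1 <= j <= length(v)
                c[k] += conj(u[τ]) * v[j]
            end
        end
    end
    return lags, c
end

u = [1.0, 2.0, 3.0]                        # short pattern
v = [0.0, 0.0, 1.0, 2.0, 3.0, 0.0]         # same pattern, delayed by 2 samples

lags, c = crosscorrelate(u, v)
best_lag = lags[argmax(c)]                 # lag with the highest similarity
println(best_lag)
```

This double loop costs on the order of length(u) × length(v) operations, which is why a faster route is attractive for recordings with hundreds of thousands of samples.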

The cross-correlation ($\star$) of two functions is equivalent to the convolution ($*$) of the two signals after one of them is conjugated and reversed along the time axis:

$$(u \star v)(t) = \overline{u(-t)} * v(t)$$
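
One way to see why this holds, sketched under the usual definition of convolution $(w * v)(t) = \int_{-\infty}^{\infty} w(s)\, v(t - s)\, ds$: set $w(s) = \overline{u(-s)}$ and substitute $\tau = -s$.

$$
\begin{aligned}
(w * v)(t) &= \int_{-\infty}^{\infty} \overline{u(-s)}\, v(t - s)\, ds \\
           &= \int_{-\infty}^{\infty} \overline{u(\tau)}\, v(t + \tau)\, d\tau \\
           &= (u \star v)(t)
\end{aligned}
$$

The substitution flips the integration limits and introduces a factor of $-1$ from $ds = -d\tau$; the two sign changes cancel. For real-valued audio samples the conjugate has no effect, so only the time reversal matters.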

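Moreover, the Fourier transform ($\mathcal{F}$) of a convolution equals the product of the Fourier transforms of the individual functions:

$$\mathcal{F}\{u * v\} = \mathcal{F}\{u\} \cdot \mathcal{F}\{v\}$$

$$(u * v)(t) = \mathcal{F}^{-1}\{\mathcal{F}\{u(t)\} \cdot \mathcal{F}\{v(t)\}\}$$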

As such we can implement the cross-correlation with the following steps:

  1. Zero-pad the signals and reverse one of them:

  2. Perform a convolution of both signals using their Fourier transforms:

convolution
304640×1 Array{Float32,2}:
 717.08154
 692.6897
 659.4013
 618.29803
 570.74023
 518.3476
 460.08493
   ⋮
 595.0318
 646.0196
 686.0295
 714.1685
 729.12396
 729.91205

The point where the signals are most correlated corresponds to the peak of the convolution. This is the lag between x and x2:

(40000, 6877.022f0)

The lag is exactly 2.5 seconds (40,000 samples at 16,000 samples per second). We can now use this value to align the audio/video files for Jazmine and Julia.

lag_seconds
2.5
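
The whole procedure (zero padding, reversing one signal, multiplying Fourier transforms, locating the peak) can be sketched end to end on synthetic data. This is not the notebook's code: the small radix-2 FFT is a hand-written stand-in so the example runs without packages (in practice a library such as FFTW.jl would be the usual choice), and `fft_radix2`, `ifft_radix2`, `lag_via_fft`, the chirp signal and the 1,000 Hz sample rate are all assumptions made for illustration. The indexing convention here may also differ from the one the notebook uses.

```julia
# Recursive radix-2 FFT; the input length must be a power of two.
function fft_radix2(x::Vector{ComplexF64})
    n = length(x)
    n == 1 && return copy(x)
    even = fft_radix2(x[1:2:end])          # samples at 0-based even positions
    odd  = fft_radix2(x[2:2:end])          # samples at 0-based odd positions
    out = similar(x)
    half = n ÷ 2
    for k in 0:half-1
        tw = cis(-2π * k / n) * odd[k + 1] # twiddle factor times odd part
        out[k + 1] = even[k + 1] + tw
        out[k + 1 + half] = even[k + 1] - tw
    end
    return out
end

# Inverse transform via the conjugation identity
ifft_radix2(X) = conj.(fft_radix2(conj.(X))) ./ length(X)

# Returns (lag in samples, peak value) such that b[lag + i] best matches a[i].
function lag_via_fft(a::AbstractVector{<:Real}, b::AbstractVector{<:Real})
    n = nextpow(2, length(a) + length(b) - 1)   # room for the full linear convolution
    ar = zeros(ComplexF64, n); ar[1:length(a)] = reverse(a)   # reversed and zero-padded
    bp = zeros(ComplexF64, n); bp[1:length(b)] = b            # zero-padded
    conv = real.(ifft_radix2(fft_radix2(ar) .* fft_radix2(bp)))
    peak, idx = findmax(conv)
    return idx - length(a), peak                # convolution index m maps to lag m - length(a)
end

fs = 1_000                                           # assumed sample rate for the toy data
a = [sin(2π * (5 + 0.1k) * k / fs) for k in 0:199]   # short chirp as the reference
d = 250                                              # true delay in samples
b = zeros(600); b[d+1:d+length(a)] = a               # the same chirp, delayed by d

lag_samples, peak = lag_via_fft(a, b)
lag_seconds = lag_samples / fs
println((lag_samples, lag_seconds))
```

Dividing the peak position by the sample rate converts the lag from samples to seconds, the same step that turns 40000 samples into 2.5 seconds above.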