Automatically aligning multiple video/audio clips in kdenlive

I’ve recently begun trying to produce some youtube videos on programming topics that I find interesting. In that interest, I’ve come to use the kdenlive linux video editor 1. Kdenlive is pretty nice and will serve my needs just fine.

One thing I’ve aspired to do is some multi-camera videos. My wife practices a couple kinds of martial arts and I’d love to record some of her test katas from several angles.

Aligning a handful of clips really isn’t that hard to do manually, but since I’m a nerd, I decided to program a utility to do this. It’s a problem that I wanted to implement for years and I’ve finally gotten to it.

Similar to the product PluralEyes, my utility looks at the audio portions of your clips and aligns them based on that. In a nutshell, I compute the cross correlation of all pairs of clips using FFTs and then write a timeline based on the strongest matches.

The end of this post will have more implementation details.

Here is a quick demo


The steps to use it are pretty simple

Step 1 – create an unaligned kdenlive project

First, create a kdenlive project that contains all of the desired clips in the desired tracks as below (click for larger image). Note that the clock in the multiple views show different times. This demo project has three different cameras (an S5, HTCM8, and motorola something) in addition to a cheap Sony voice recorder and a Sony PCM-M10

Here’s the fancy rig I used. It’s for creating some 360 style VR videos. I have some scripting for merging stuff together using Hugin, but I haven’t been satisfied with the results yet.

Step 2 – run the utility

A binary of the utility can be found on my github. You can also find the source code for it.

To run it is easy. In the example below, my project is called easydump, because it’s just an easy dump of all the clips:

./sync_kdenlive video_and_voice_recorders/easydump.kdenlive

It will create the file easydump_aligned.kdenlive. It also creates easydump_aligned_aligned0.wav which is a wav file with as many channels as clips.

The demo is about 4.5 minutes and the utility runs in ~10seconds on my Skylake with 16GB. 2 Run time will scale linearly in the length of your longest clip and quadratic in the number of clips. So far, I’ve made little/no attempt to optimize it beyond using FFTs.

This utility is for my own benefit/enjoyment/curiosity. If you have any interest in it, please comment. If you tried to run it and it didn’t work, please comment. If you tried it and it did work, please comment. If I get no comments, I’ll assume no one cares and will make no effort to make it better.

Of course, the easydump part isn’t hardcoded. It’s just the file name that I chose for this experiment.

Step 3 – load the updated project file into kdenlive

Step 2 generated a easydump_aligned.kdenlive. Load it into kdenlive. In the picture below, note that the clocks show the same time.

Here’s the result. From listening to the audio of it, I’d say it worked pretty well:


So how’s it work?

The core idea is something that I learned many moons ago, when I took 6.003 – signals and systems. A playlist of the 2011 lectures can be found here.

In particular, this lecture about convolution:

Convolution or the variant of it cross-correlation that I use in the utility is basically that act of sliding one signal along another, multiply the two, and taking the area of the result. A large area means lots of similarity. Smaller area means less.

If you were to just compute the cross correlation directly, you’d have a lot of computation to do. Thankfully, there was a guy named Fourier who found a way to compute this in the frequency domain. This coupled with the Fast Fourier Transform enables me to to cross correlations on WAV files pretty quickly.

Here are the required steps:

  1. compute the discrete fourier transform of each of your wav files. The length of each result should be equal to the max of the lengths of the two wavs.
  2. piecewise,  multiply the first by the complex conjugate of the second
  3. compute the inverse transform of the result of those multiplies.

Here is what you get with you take two sine waves, shifted relative to each other. The two two waveforms are the two input sines. The bottom two are the real and imaginary components of the cross correlation. The source code for this experiment can be found here on my github.

Note that this technique works for any kind of data. Spatial/visual works too. I’ve had thoughts of implementing a jigsaw puzzle solver this way.

Try it with your own waves

I have a compiled (for ubuntu linux) binary of a utility that takes two waves and writes a wave showing the cross correlation. It can be found here. To invoke it:

./sync_wavs file2.wav file1.wav true

If you omit the “true” argument, you’ll get just the aligned channels. In this case, I added true to tell sync_wavs to include the cross-correlation as the third and fourth channels. Note the peak location corresponds to the beginning of the second channel.

Attribution of libraries I used

Very few real programming projects are implemented entirely from scratch and this one is no different. The external libraries I used are:

FFTW This library implements the fourier transform parts. In addition to its excellent tutorial, I have a couple simple programs that I used in the lead up to this project.

TinyXML2 This library implements the xml read/write functions. I have a lot of experience using LibXML2’s perl interface. LibXML2 is excellent but there are some areas where it makes me do more of the work than I care for. TinyXML2 is easier to use in most ways. The output xml is nicely formatted by default. Adding text and new elements is a bit quicker to implement. The main thing missing TinyXML2 is XPATH support. Using the visitor methods sort of makes up for this.

BOOST. It’s hard for me not to use boost these days. In particular, I use:

ffmpeg I don’t use this in the code directly. The code does call ffmpeg to extract wav files of the same samplerate from each of the clips.

Some additional implementation details

I have found that downsampling the audio of each clip to 5k samples/sec works fine. This is important because the longer the fourier transform, the longer the runtime.

When taking the ffts and iffts, you need to double the length of the input data but padding with zeros (ie, the first half is your wav, the second is silence). This is important in knowing the relative order and offset of any pair of clips. The way the cross-correlation works, if you have a negative offset, the peaks will show up in the second half of the transform data. At the same time, what happens if one clip begins near the end (past halfway) of the other? To sidestep this, I simply doubled the data lengths. 3

In the utility, I compute the cross correlation for all pairs of audios. Some of those pairs will correlate better than other and some don’t correlate at all (since they don’t overlap). Here’s the method I’ve found to work for me:

  • compute the mean/average and standard deviation of the cross-correlation result.
  • find the peak value
  • order the peaks of all the pairs by ordering by the number of standard deviations above the mean/average. For pairs that should correlate (they overlap), the peaks tend to be at least 15 standard deviations above the mean.

  1. I actually have a version of Movie Studio Platinum, which I paid for some years ago. The problem is that I use windows less and less

  2. I imagine that anyone doing video editing will need/want/have a machine that it also pretty beefy. I’m using the graphics stuff built into the Skylake.

  3. I imagine there’s a way to do this without doubling but I don’t know the math well enough. If you do know, please let me know

Using a phony C struct as a function selector

Generic programming, as used by the std and boost packages, depends heavily on template tricks to extract data from specific data structures in a generic way. When I first tried using some of the boost libraries, I felt pretty clueless in getting them to do what I wanted. Hence this post.

Say you have a C struct like this one:

struct Foo {
  int a;
  int b;

Generic programming accesses the a and b members using a global get function and a property map:

template <typename T, typename PMAP>
int get(T &t, PMAP pmap);

What’s in the property map? In this case nothing. The magic happens in the templated specialization of get:

struct get_a {};

int get(Foo &foo, get_a)
  return foo.a;

Which you call like this:

Foo f1;
f1.a = 10;
std::cout << "a is " << get(f1, get_a()) << std::endl;

How’s it work and what’s the compiled result?

The get_a struct is empty; it contains no data members. When you call get, the compiler selects the get_a version since it’s the most specific specialization of get. From there, things get interesting. Constructing get_a() is a noop. Pushing it as a function argument is a noop. After inlining, get simply becomes f1.a.


How to quickly compute the Euclidean MST of a large number of points (and why you might want to)

When I was in college, like most CS type majors, 1 I took the algorithms class. Sorting, binary trees, O(n) notation… all of that.

One of the things I didn’t understand is why such a big deal was made about some algorithms like the graph algorithms. Why would I want a minimum spanning tree of a million nodes?

A bunch of years ago, I did need to compute the MST of a couple hundred thousand nodes for my job. Later on, I needed to do it three dimensionally. Something made me think of it recently, and I have a hankering to write about it.

I’ve also recorded a series of videos to explain the programming aspects:

What problem was I trying to solve?

At the time, I was working on code to generate a clock buffer tree for large seas of logic gates. Say you have a million gates in your logic design and that 10% of those are clocked elements.2. So we have 100k logic gates that need a clock signal.

So, clock buffers can’t drive the capacitance of a 100k gates, at least not if you want the design to run at more than 1 cycle per second. The solution is a clock tree. The input clock drives 5 buffers, which each drive 5 more buffers,… until everyone has their clock.

How do you decide which buffer drives which cell? That’s where the MST comes in. Once I had computed the MST of all of the clocked elements, I picked a random end (of the MST) node 3 and collapsed it into it’s adjacent node. When a collapsed node passes a capacitance threshold (also need to include wiring capacitance), that’s where I add a buffer. Keep doing this until you have your first level of buffers, then recursively do the same up the tree.

But some cells are more timing critical than others.

In a design of 1M gates and 100k clocked elements, most of those gates will easily meet timing, no matter how badly you design things. Most. If memory serves me, we’re talking 90% (99%?) of the design meets timing, easy peasy.

So in those 100k gates, some are timing critical, many are not. An easy solution is to have two subtrees. One for the easy stuff and the other for the hard. The problem there is that you can have one little cell surrounded by critical stuff. Do I really need to segregate it? How can I allow the user to control the thresholds for such things?

The solution I wanted to implement, but never did (though I’m convinced this will work) is to turn the problem into a 3D MST problem. The X and Y axes are the normal X/Y of the design. The third axis represents timing. I simply need to devise a timing to distance coefficient.

MST basics

First off, the MST algorithm is pretty easy. To quote

Initially, T contains an arbitrary vertex. In each step, T is augmented with a least-weight edge (x,y) such that x is in T and y is not yet in T. By the Cut property, all edges added to T are in the MST. Its run-time is either O(m log n) or O(m + n log n), depending on the data-structures used.

Sounds pretty easy. The problem is you need a graph to operate on. I only had a set of points. The easy solution is to create a complete graph. All nodes are connected to all other nodes. O(n^2) in the number of nodes. 100k*100k can take up some space.

Triangulation to the rescue

Again, I refer to

every edge not in a Delaunay triangulation is also not in any EMST

I don’t really understand the math behind it but first compute a triangulation of your points and run MST on that.

The CGAL library will give you the Delaunay triangulation.

The BOOST Graph Library will give you the MST.

The problem is that these libraries were difficult for me to really understand. Now that I’ve spent some time with them, they’re actually pretty simple. I’ve created some videos that perhaps will help you also see them as simple.

Visualization with paraview

First, it’s important to be able to see that data. I’ve found that paraview works well and is easy to use (once you know how to use it)

CGAL how to

Boost Graph library how to

  1. My major was basically computer science except it was in the math department. Instead of the normal 150 students per year, there were 18 of us doing the Math with CS variant.

  2. The design size was actually smaller. 100k-500k. The percentage of clocked elements was higher. I’m going by memory, but I think the final number of nodes is in the same ballpark.

  3. I think I actually picked the bottom right one, but it doesn’t really make a difference. The main benefit to the bottom right is that you get the same or similar results from one run to the next.