On overflowing stacks

I recently set out to implement a few basic data structures in C for the hell of it (and to reassure myself that I can still code C), and ran into an interesting compiler wart…

I was trying to instantiate a static array of 10 million integers (who doesn’t?), in order to test insertions and deletions in my tree. However, as you can astutely deduce from the title of this post, this was too much for my poor program’s stack – 10 million ints is about 40 MB, way beyond the usual 8 MB limit – and it ended in a segfault, a textbook stack overflow.

I did not think of that at first though, and tried to isolate the offending piece of code by inserting a return 0; in main() after a chunk I knew to be fine, then moving it further down to pinpoint the issue.

Much to my dismay, this didn’t really work out. Why? Check the following code:
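
It boiled down to something like this – the variable name is mine and doesn’t matter, and the troublesome declaration is commented out for now:

#include <stdio.h>

int main()
{
    printf("Hello world!\n");
    return 0;
    /* int boom[10000000]; */
}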

Do you think it works with that last line uncommented? You’d be wrong!

[15:20:35]florent@Air:~/Experiments/minefield$ gcc boom.c
[15:20:40]florent@Air:~/Experiments/minefield$ ./a.out
Segmentation fault

GCC (4.2.1) instantiates the array even though it’s declared after the function returns! Without optimisation, the compiler reserves stack space for every local variable at function entry, whether the declaration is reachable or not.

Interestingly enough, when you tell GCC to optimise the code, it realises the array will never be reached and prunes it away.

[15:26:06]florent@Air:~/Experiments/minefield$ gcc -O2 boom.c
[15:26:16]florent@Air:~/Experiments/minefield$ ./a.out
Hello world!

Clang (1.7) exhibits exactly the same behaviour.

Lesson learnt? return is no way to debug a program.

Optimising a video editor plugin

During the past few weeks, I have been writing a C++ plugin to grade C41 digital intermediates in Cinelerra, an open-source Linux video editor. C41 is the most common chemical process for negatives, resulting in the orange-tinted film you probably recognise if you’ve ever shot film.

Of course, after scanning those negatives, you have to process (“grade”) them to turn them back to positive. And it’s not as simple as merely inverting the values of each channel for each pixel; C41 has a very pronounced orange shift that you have to take into account.

The algorithm

The core algorithm for this plugin was lifted from a script written by JaZ99wro for still photographs, based on ImageMagick, which does two things:
– Compute “magic” values for the image
– Apply a transformation to each channel (R, G, B) based on those magic values

The problem with film is that, due to tiny changes between the images, the magic values were all over the place from one frame to the next. Merely applying JaZ’s script to a series of frames gave a sort of “flickering” effect, with colours varying from one frame to the other – an unacceptable effect for video editing.

The plugin computes those magic values for each frame of the scene, but lets you pick and fix specific values for the duration of the scene. The values are therefore not “optimal” for each frame, but the end result is visually very good.

However, doing things this way is slow: less than 1 image/second for 1624×1234 frames.

Optimising: do less

The first idea was to make computing the magic values optional: after all, when you’re batch-processing a scene with fixed magic values, there’s no need to recompute them for every frame.

It was a bit faster, but not by much. A tad more than an image/second maybe.

Optimising: measure

The next step —which should have been the first!— was to actually benchmark the plugin, and see where the time was spent. Using clock_gettime() for maximum precision, the results were:

– ~0.3 seconds to compute the magic values (0.2 s to apply a box blur that smooths out noise, and 0.1 s to actually compute the values)
– ~0.9 seconds to apply the transformation
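
For the record, the measurements came from a plain clock_gettime() harness along these lines; the clock choice and the dummy workload are placeholders of mine, not the plugin’s actual instrumentation:

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

/* Time a section of work with clock_gettime(); CLOCK_MONOTONIC is not
   affected by wall-clock adjustments. Link with -lrt on older glibc. */
static double seconds_between(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
    struct timespec t0, t1;
    volatile double sink = 0.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 1; i < 10 * 1000 * 1000; i++)   /* stand-in for the per-frame work */
        sink += 1.0 / i;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("section took %.3f s (%f)\n", seconds_between(t0, t1), sink);
    return 0;
}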

Making the computation of the magic values optional was indeed a step in the right direction, but the core of the algorithm was clearly the most expensive part. Here’s what has to be computed for each pixel:

row[0] = (magic1 / row[0]) - magic4;
row[1] = pow((magic2 / row[1]),1/magic5) - magic4;
row[2] = pow((magic3 / row[2]),1/magic6) - magic4;

With row[0] being the red channel, row[1] the green channel, and row[2] the blue channel.
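
In context, those three lines sit inside a loop over every pixel of the frame; the shape of it is roughly the following, assuming a packed RGB float buffer rather than Cinelerra’s actual frame classes:

#include <math.h>

/* Rough shape of the per-frame pass; the packed float RGB layout is an
   assumption, Cinelerra's real frame accessors are not shown here. */
static void grade_frame(float *rgb, int width, int height,
                        float magic1, float magic2, float magic3,
                        float magic4, float magic5, float magic6)
{
    for (int y = 0; y < height; y++) {
        float *row = rgb + (long)y * width * 3;
        for (int x = 0; x < width; x++, row += 3) {
            row[0] = (magic1 / row[0]) - magic4;
            row[1] = pow((magic2 / row[1]), 1 / magic5) - magic4;
            row[2] = pow((magic3 / row[2]), 1 / magic6) - magic4;
        }
    }
}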

The most expensive call here is pow(), from math.h: two calls per pixel add up to roughly four million calls for each 1624×1234 frame. We don’t need to be extremely precise for each pixel value, so maybe we can trade some accuracy for raw speed?

Optimising: do better

Our faithful friend Google, asked for a fast float pow() implementation, turns up Ian Stephenson’s version: short, clear and, more importantly, working.
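
I won’t paste his code here, but the general idea – decompose pow(a, b) as exp2(b * log2(a)) and approximate both halves straight from the IEEE 754 bit layout – looks roughly like this (a generic sketch, not Ian Stephenson’s actual implementation):

#include <stdint.h>

/* Crude log2: a float's bits, read as an integer and divided by 2^23,
   give the biased exponent plus the mantissa fraction; subtract the
   bias and you have log2(x) to within ~0.09. */
static float approx_log2(float x)
{
    union { float f; uint32_t i; } u = { x };
    return (float)u.i / (1 << 23) - 127.0f;
}

/* Crude exp2: the exact inverse of the trick above (valid for x > -127). */
static float approx_exp2(float x)
{
    union { float f; uint32_t i; } u;
    u.i = (uint32_t)((x + 127.0f) * (1 << 23));
    return u.f;
}

/* Approximate pow() for positive bases, trading accuracy for speed. */
static float fast_pow(float a, float b)
{
    return approx_exp2(b * approx_log2(a));
}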

But we can’t just throw that in without analysing how it affects the resulting frame. The next thing to do was to add a button that would switch between the “exact” version and the approximation: the results were visually identical.

Just to be sure, I measured the difference between the two methods: 0.2% on average, going as high as 5% in the worst case – acceptable values.
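
A quick way to get such numbers is a brute-force sweep in this spirit, reusing the fast_pow() sketch above and made-up ranges rather than the plugin’s real pixel values:

#include <math.h>
#include <stdio.h>

/* Compare powf() with the approximation over a grid of (base, exponent)
   pairs and report the average and worst relative error. The ranges are
   illustrative only; assumes fast_pow() from the sketch above. */
int main(void)
{
    double sum = 0.0, worst = 0.0;
    long n = 0;

    for (float a = 0.1f; a <= 4.0f; a += 0.01f) {
        for (float b = 0.2f; b <= 2.0f; b += 0.01f) {
            double exact = powf(a, b);
            double err = fabs(fast_pow(a, b) - exact) / exact;
            sum += err;
            if (err > worst)
                worst = err;
            n++;
        }
    }
    printf("average error: %.2f%%, worst: %.2f%%\n",
           100.0 * sum / n, 100.0 * worst);
    return 0;
}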

And the good news is that the plugin now takes only 0.15 to 0.20 seconds per image, i.e. between 5 and 6 images/second – an 8-fold gain over the first version. Mission accomplished!