Bartosz Ciechanowski

Drawing Bézier Curves

A few months ago I’ve released my latest iPad app – Revolved. The core idea of Revolved is super simple – you draw curves on the right hand side of the screen and they get revolved around the axis to create a 3D model.

In this article I’m going to walk through the process of displaying Bézier curves on the iPad screen as it happens in the app. Since the line drawing engine powering Revolved is OpenGL based, presenting curves on the display is not as easy as calling some Core Graphics functions here and there.

To make the story more interesting I made a few interactive demos that should make understanding the presented concepts much easier. Here’s the small part of Revolved in your browser. This is what we’re going to build:

Note: the demos are made in HTML5 using the canvas API which (unfortunately) is not a match for the full-fledged OpenGL based rendering. While some small glitches and artifacts may occur in the browser, Revolved doesn’t suffer from those problems.

The OpenGL code powering the line drawing in Revolved is available on my GitHub. As an added bonus I’ve also published the canvas/JS version that runs the demos in this article.

Bézier Curves

I’m absolutely fascinated with Bézier curves. Not only are they ubiquitous in computer graphics, but also they’re just so much fun to play with. Dragging the control points and watching the curve wiggle is an experience on its own.

The most popular Bézier curves out there are cubic Bézier curves. They’re defined in terms of four control points. While the first and last control points specify locations of the endpoints of the drawn curve, the second and third control point influence the tangency of the curve.

If you’re interested in mastering Bézier curves, I highly recommend diving into A Primer on Bézier Curves, which is a definite resource on the matter. However, this article doesn’t require any meaningful background apart from intuitive understanding of how curves work.

Having decided to utilize curves in Revolved, I’ve faced the challenge of rendering smooth cubic Bézier curves on the GPU, using the OpenGL primitives.

Subdivisions

One of the easiest way to approximate any curvy shape is to split it into line segments. We’re going to divide a Bézier curve into a set of connected linear components, with the intention that large enough number of segments will look sufficiently smooth for the user. The most important problem we have to solve is finding the connection points for those segments.

Fortunately, Bézier curves are parametric that is their x and y components are defined in terms of a parameter t, which varies from 0 to 1. For the value of t equal to 0, we land on the first end point of the curve and for the value of t equal to 1, we land on the last end point of the curve. For in-between values of t we get some points that lie on the curve in-between those end points.

Bézier curve parametrization

To approximate a shape of a Bézier curve using 2 line segments, calculate the position of 3 connection points for t equal to 0, 0.5 and 1. Using 10 segments requires 11 points for t values of 0, 0.1, 0.2, …, 0.9 and 1.0 respectively.

Subdivision in Action

The little simulator below depicts the method. Try dragging both end points and control points. You can adjust the number of segments with a slider. The faint gray line is the exact (hopefully) Bézier curve drawn with HTML5 canvas primitives.

A very nice feature of Bézier curves is that the subdivision points accumulate near areas of large curvature. Try making a V-shaped curve and notice, how little red knots tend to stick together near the tip, they don’t divide the curve to equal-length segments. Moreover, even for small number of linear segments the drawn shape is not far off from the exact curve.

Drawing Lines

Even though OpenGL does support line drawing, I didn’t want to use this feature for two reasons. First of all, the line width can only be set once per draw call. Animating curves’ thicknesses would require separating the drawing commands into different batches thus potentially requiring multiple draw calls to render all the segments. This is unnecessarily complex.

The most important factor is, the lines rendered by OpenGL are plain wrong.

This is actual slanted line drawn by OpenGL

The “algorithm” seems to clone pixels in either horizontal or vertical direction, making lines look like parallelograms instead of rotated rectangles. Needless to say, this doesn’t fulfill my quality requirements. While simple line drawing would work for very thin lines, I wanted the curves in Revolved to be bold and prominently presented on the screen.

Adding Width

Having nailed approximating the shape of Bézier curves with line segments, we should focus on making them wider. Rejecting OpenGL line drawing method forces us to jump into the usual GPU suspect – a triangle. A single line segment will consist of two triangles that will form a quadrilateral. We just have to calculate 4 corner points for a given segment and connect them with edges:

  1. Start with a linear segment
  2. Extend it in a perpendicular direction
  3. Create vertices at new endpoints
  4. Connect them to create two triangles

While this high-level overview sounds relatively simple, the devil, as usual, is in the details.

The Perpendicular

Given a vector, finding a vector perpendicular to it is one of the easiest trick in vector math. Just exchange the coordinates and negate one of them. That’s it:

Perpendicular vector has its coordinates flipped and one of them is negated

Having the perpendicular direction makes it easy to calculate the positions of vertices that will get further used for triangles rendering.

The Bad Tangent

We’re just one step short of rendering the Bézier curve with some width. Unfortunately, making the straight line segments wider by applying the above-mentioned perpendicularity wouldn’t work well. Try making V-shape again and notice the discontinuities:

The problem is we’re treating the linear segments as independent entities, forgetting that they are part of the curve.

Two rectangular segments create cracks

While we could somehow merge the vertices in the midpoint, this wouldn’t be perfect. Since Bézier curves are defined in pristine mathematical terms we can get the vertex position much more precisely.

The Good Tangent

What’s wrong in the above approach is that we’re ignoring the fact that a curve is, well, a curve. As mentioned before, to get direction perpendicular to the direction of the curve we have to calculate the actual local direction of the curve itself. In other words, if we were to draw an arrow at the given point of the curve showing its local direction, which way should the arrow point? What’s the tangent at this point?

A tangent vector

Derivatives come to the rescue. Since a Bézier curve is parametrized in terms of t in a form of x(t) and y(t), we can simple derive the equations with respect to t. I won’t bother you with the equations, you can look up the exact solution directly in the source code.

With the tangent vector in hand it’s just the matter of generating the correct positions of vertices:

Correct segments have their vertices connected

Having derived the exact direction of a curve at a given point, we can generate the exact perpendicular direction finally receiving the exact position of the vertex. Awesome!

Notice that for small number of segments the tight curves look really weird. This is actually easy to explain – for sufficiently long segments, the tangent vector may vary greatly, thus the surplus width may be added in completely different directions. However, when the segments are small enough it’s really difficult for the tangent vector to mess things up.

Auto-Segmentation

At this point we’ve almost recreated the line drawing technique used in Revolved. For the final step we’re not going to add anything new to our drawing routine. Instead, we’re going to reduce the interaction complexity. We’re going to get rid of the slider.

The purpose of the slider is to define how many segments should the curve be split into. We don’t want to stick in a single large value and call it a day. Rendering a few dozen triangles for even the shortest segment is surely a waste. Instead, we’re going to figure out a way to automatically calculate the number of required segments.

Estimating Length

If you look closely at the Bézier curve you’ll notice that its shape always resembles the shape of line connecting all four control points.

Curve’s shape resembles the line connecting control points

The length of the curve is never longer than the length of the connecting polyline. Calculating the total length of AB, BC, and CD segments we derive an upper bound for curve’s length:

1
2
3
float estimatedLength = (GLKVector2Length(GLKVector2Subtract(b, a)) +
                         GLKVector2Length(GLKVector2Subtract(c, b)) +
                         GLKVector2Length(GLKVector2Subtract(d, c)));

As you might have noticed, I’m using Apple’s GLKit for the vector operations. While most parts of GLKit quickly become too primitive for any serious OpenGL programming, the math part of GLKit is wonderful. The only real drawback is its verbosity, but this is something most Cocoa devs should be used to.

Estimating number of segments

Having calculated the estimated length it’s just a matter of getting the number of segments required to display a curve. While we could simply divide the estimated length and take the ceil of the value, it would mean that for very short curves we could end up with only one or two segments. Instead, we should opt for slightly biased function that will boost smaller input values. We’re going to pick hyperbola – it’s just so awesome:

Hyperbola

For small input values hyperbola bumps the output value, for large input values the function is asymptotically similar to plain old linear function. A few minutes spend on parameters tweaking crafts the final solution. Notice how the number of knots changes when the curve gets longer or shorter:

Block Based Subdivider

While writing Revolved I’ve found yet another awesome application for ObjC blocks. Instead of passing the model object through the rendering system, I’ve opted for using a subdivider block. Given a value of t (the same one that paremetrizes the curve), it returns the SegmentSubdivision which is a struct that defines both position and perpendicular vector at a given point of a curve. Naturally, the block is a typedef. I don’t seem to be the only one who has problems remembering block syntax.

1
2
3
4
5
6
typedef struct SegmentSubdivision {
    GLKVector2 p;
    GLKVector2 n;
} SegmentSubdivision;

typedef SegmentSubdivision (^SegmentDivider)(float t);

The line mesh generator can then utilize the passed block to generate the vertex positions:

1
2
3
4
5
6
7
for (int seg = 0; seg <= segmentsCount; seg++) {
  
  SegmentSubdivision subdivision = subdivider((double)seg/(double)segmentsCount);
  v[0] = GLKVector2Add(subdivision.p, GLKVector2MultiplyScalar(subdivision.n, +_lineSize /2.0f));
  v[1] = GLKVector2Add(subdivision.p, GLKVector2MultiplyScalar(subdivision.n, -_lineSize /2.0f));
  ...  
}

Why not Core Graphics?

Having read all that you might be wondering: why bother with OpenGL drawing at all? The obvious choice for any seasoned iOS developer would be enrolling Core Graphics and drawing the curves using its functions. Let’s discuss the options I could have used.

For the stage setting purpose I should note that Revolved is currently limited to 60 curves and the drawing area of the app has dimensions of 448x768 points (896x1536 pixels on a retina device).

A Single Layer for all the Curves

The first Quartz-based solution is quite obvious – a single view with custom - drawRect: implementation. Whenever user drags anything, the entire view would have to be redrawn. Segment deletion? Redraw the entire thing. A single animation? Redraw the entire thing, 60 times per second.

Good luck pulling off redrawing that many segments that fast and that often. Did I mention that the rasterization of Core Graphics calls happens on the CPU, not on the GPU (Apple Dev account required)?

One Layer per Curve

A single view is definitely not a good idea. Alternatively I could dedicate a single view per segment.

First of all, let’s consider memory requirements. Each pixel has four 8-bits components (RGBA). Storing the data drawn in - drawRect: for each curve would take a total of 60*896*1536*4 bytes, that is 315 MB! Even though modern iPads are capable of handling those huge RAM demands, we still have to draw a 3D model. Adding space for OpenGL buffers doesn’t really help.

Secondly, the overdraw is huge. Since layers have to be transparent, every pixel on the right side of the screen is overdrawn up to 60 times. In total, the GPU has to push over 26 full-screens worth of pixels, at 60 frames per second. This is not something you want in your app.

Optimizations

I tested these approaches on the latest & greatest iPad Air and the stuttering was absolutely abysmal.

The performance of both methods could be improved by implementing various optimizations, mostly in terms of keeping the redrawn areas small. This would require huge amounts of code that would dynamically resize the views, clip the Bézier paths and do other crazy tricks, just with an expectation of making things better. All those efforts would still become worthless for curves covering the entirety of the drawing area.

Drawing dozens of animated curves, as it exists in Revolved, is a case in which abstraction starts to leak, forcing the programmer to fight against the clean, high-level API to do the work that could done on a lower level much more efficiently.

Throwing Stuff at GPU

Revolved features 3D models, so it was obvious that some parts of the app would rely on OpenGL. Generating models’ meshes required converting the abstract, mathematical notion of Bézier curves into concrete set of vertices and faces. This got me thinking – if I already have to go through the trouble of transforming the Béziers into the vertex representation of 3D models, why not to enroll the same process for 2D lines? This was an important reason for going the GPU way.

The crucial factor for choosing OpenGL is that it made the animation and interaction possibilities endless. Some of the line effects I’ve implemented and/or had in mind would simply be next to impossible to craft using Core Graphics.

Final Words

Cushioned with easy to use, high level API one often ignores the underlying concepts and nitty-gritty details of how stuff works. Consciously depriving myself of Core Graphics conveniences forced me to think about drawing Bézier curves in terms of “how?”. Understanding the mathematical foundations guiding the drawn shapes is something worth going through, even just for the sake of curiosity.

The presented method is not state of the art. There are actually more complex and robust methods of drawing the curves, but the solution implemented in Revolved fits the purpose completely (the fact that I found the linked resource just a few days ago was certainly an influencing factor).

As for the rendering implementation, OpenGL was the right choice. While some less important elements of Revolved like buttons, tutorial and credits panels are UIKit based, the essential part of the application runs entirely in OpenGL. This is still not a full Brichter, but writing an app this way was an absolute joy.

If you get bored dissecting the Bézier curves I encourage you to try playing with them, this time in 3D, on your iPad.

Exploring GPGPU on iOS

I’ve always wanted to try some GPGPU programming on iOS. The idea of harnessing the highly parallelized power just tickled my inner geek. Using GPU to run image processing is already a solved problem. Unfortunately, fragment shader calculations do not support the notion of more general computation. With advent of A7 chip and OpenGL ES 3.0, I’ve decided to tackle this problem and I’ve managed to get some exceptionally good results. Granted, the method I rolled in has some limitations and performs well only in fairly computationally heavy situations, but when it works, it works like magic.

As a companion to this article, I created a GitHub repo that contains the source code for many discussed concepts, as well as benchmarks that were run to measure the effectiveness of GPU computation.

Why?

Why on earth would one want to use GPU for calculations? The answer is obvious – blazingly fast speed. The following paragraphs consider three different use cases, each with increasing complexity. While the presented examples are somehow naive, the intention is to highlight the typical patterns of applications that might benefit from GPU-powered calculations. Each example contains a CPU vs GPU comparison chart – a result of direct computation of a presented problem. The timings were measured on the iPad Air (5th generation iPad) with iOS 7.1 beta 2.

The Disappointing Linear Function

Let’s start with the most basic function there is – a linear function. There are countless of applications for linear functions, so let’s pick one and see how well it performs:

1
f(x) = 3.4 * x + 10.2;

How should we compare the performance of GPU to CPU? While using time is the obvious choice, we shouldn’t focus on raw milliseconds. Comparing run-time for small and large input sets, would render the former one useless, as they would be merely visible on the plot, pitifully overlapping horizontal axis. Instead, we should consider the ratio of CPU to GPU time. Making CPU performance normalized at 1.0, makes the GPU value easy to interpret – if the GPU performs twice as fast as the CPU, then the GPU’s value is 2.0. Here’s the comparison chart for GPU vs CPU for the computed linear function:

GPU vs CPU comparison

The horizontal axis of a chart represents the number of input float values and is in logarithmic scale.

As you can see, GPU sucks in this case. Even for the very large input set (224 = over 16 million values), the GPU runs 0.31 times as “fast” as CPU i.e. it runs over 3 times slower. The only promising result is that the performance increases with input size. For sufficiently large data set, the GPU would overcome the CPU. Notice, however, that representing 224 float values already takes 64 MB of memory. Surely would we hit memory limits before reaching any beneficial performance level.

The Promising Complication

Since linear function was so disappointing let’s reach into some more complicated functions from our math repository. The exponential function transpires in many natural phenomena e.g. exponential decay. Paired with basic trigonometry functions it forms a solution of many differential equations, like the ones that rule the damping. Let’s consider this kind of complicated mixture of exponent, sine and a cosine with some random coefficients:

1
f(x) = 2.0 * exp(-3.0 * x) * (sin(0.4 * x) + cos(-1.7 * x)) + 3.7;

Here’s the comparison chart for GPU vs CPU execution time:

GPU vs CPU comparison

First success! As soon as the input size crosses 216 (little over 65k values), the GPU is faster than the CPU. For very large data sets, the GPU is over 5 times faster.

The Amazing Iteration

As exciting as the results of the previous example were, we can do even better. Let’s consider an iterative computation. Imagine a simulation of huge number of bodies in a central gravitational field. Let’s make a few assumptions:

  • each body has the same mass
  • each body has the same initial position, but different initial velocity
  • we can ignore body-to-body interaction
  • the problem is two dimensional

The goal of the calculation is to answer the question: how far have the bodies flown in 10 seconds? In other words, what is the position of every single body after 10 seconds?

Reaching back to fundamentals of physics this simple equation comes to mind:

1
p(t) = p0 + v0 * t + 0.5 * a0 * t * t;

Unfortunately we can’t merely plug the values in and get the result in one step, because the acceleration of body is not constant – it depends on the distance from field’s center. A very simple solution is to run the calculation in an iterative fashion, just like most of the physics engines do.

For a given position in space, we can get the body’s acceleration by measuring the the value of gravitational force which is governed by Newton’s law of universal gravitation. Since this is merely an example, we can ignore all the constant coefficients and just notice that the force is proportional to the distance squared and points at the center of the field.

Let’s focus on the gist of the iteration itself. Each step will evaluate the current acceleration value, and then update values of both position and velocity using a forward Euler method. The body of the iterative loop goes as follow:

1
2
3
4
5
6
7
8
9
float l = 1.0f/sqrtf(x*x + y*y);
float ax = -x*l*l*l;
float ay = -y*l*l*l;

x += vx*dt;
y += vy*dt;

vx += ax*dt;
vy += ay*dt;

Let’s settle for a time step of 0.1 seconds. Running the loop 100 times will result in good approximation of body’s position at t = 10s. Without further ado, here’s the GPU vs CPU comparison for the computation:

GPU vs CPU comparison

That’s right, the GPU was over 64 times faster. Sixty four. Times. Faster. While this example is somehow contrived (2D over 3D space, ignoring body to body collisions), the result is astonishing. In terms of absolute time, the CPU calculation took over 21 seconds, while the GPU took a third of a second to do the work. Even for as little as 4096 float input values (2048 two dimensional vectors), the GPU is almost twice as fast as the CPU.

How?

From developer’s perspective the typical GPU programming works as follow:

  • submit some vertex geometry
  • process it in vertex shader
  • let the GPU figure out the triangles
  • process each fragment in fragment shader
  • present result on the screen

While both vertex and fragment shader are customizable, so far only the fragment shader’s output was accessible to the developer, either in form of generated framebuffer.

Long story short, using a new feature of OpenGL ES 3.0 called Transform Feedback, we do have an access to the output of the vertex shader.

Transform Feedback

Here’s the excerpt from the definition of Transform Feedback:

Transform Feedback is the process of capturing Primitives generated by the Vertex Processing step(s), recording data from those primitives into Buffer Objects. This allows one to preserve the post-transform rendering state of an object and resubmit this data multiple times.

The crucial part is “preserve the post-transform rendering state”. Basically, we let the GPU do the work on the vertex data and then we can access the computed output. Actually, it doesn’t even have to be vertex data per se, the GPU will happily make operations on whatever float values we provide. This is the essence of my method.

Setting up the Transform Feedback in OpenGL takes a dozen lines of code and I’ve done this with help of this short article and this much longer tutorial. The setup can be split into two different steps.

First of all, we should inform OpenGL which shader variables should be used to write output data into the memory buffer. This stage must happen before shader gets linked. The following two lines of code present the setup, the crucial call being glTransformFeedbackVaryings:

1
2
const char * varyings[] = {"nameOfOutput1", "nameOfOutput2"};
glTransformFeedbackVaryings(program, sizeof(varyings)/sizeof(*varyings), varyings, GL_INTERLEAVED_ATTRIBS);

The second stage is very similar to a regular draw call setup, although enriched with some additional API calls:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
glUseProgram(program);

glEnable(GL_RASTERIZER_DISCARD); // discard primitives before rasterization stage

glBindBuffer(GL_ARRAY_BUFFER, readBuffer);

/* bind VAO or configure vertex attributes */

glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, writeBuffer); // the output should be written to writeBuffer

glBeginTransformFeedback(GL_POINTS);
glDrawArrays(GL_POINTS, 0, count);
glEndTransformFeedback();

glDisable(GL_RASTERIZER_DISCARD);

glFinish(); // force the calculations to happen NOW

Note, that there are some differences between desktop and mobile OpenGL implementations. For instance, I couldn’t compile my vertex shader on its own, I had to provide the dummy implementation of fragment shader as well (which apparently is not a requirement on a desktop OpenGL).

As a final step, remember to include OpenGLES/ES3/gl.h header.

Writing and Reading to Buffers

Unlike traditional PCs, all iOS devices have Unified Memory Architecture. It doesn’t matter whether you read or write from “regular” memory or OpenGL allocated buffer – it’s all the same pool of on-device transistors.

I’ve done some crude tests and in fact, reading from “OpenGL” and “regular” memory takes the same amount of time and this is true for writing to memory as well.

On the desktop PCs, the RAM→VRAM memory transfer usually takes nontrivial amount of time. On the iOS however, not only can you generate your data directly into OpenGL accessible memory, but also you can read back from it as well, without any apparent penalty! This is fantastic feature of iOS as mobile GPGPU platform.

Accessing data pointer for reading is super easy:

1
2
3
4
5
6
glBindBuffer(GL_ARRAY_BUFFER, sourceGPUBuffer);
float *memoryBuffer = glMapBufferRange(GL_ARRAY_BUFFER, 0, bufferSize, GL_MAP_READ_BIT);

//read from memoryBuffer

glUnmapBuffer(GL_ARRAY_BUFFER);

And so is accessing pointer for writing:

1
2
3
4
5
6
glBindBuffer(GL_ARRAY_BUFFER, targetGPUBuffer);
float *memoryBuffer = glMapBufferRange(GL_ARRAY_BUFFER, 0, bufferSize, GL_MAP_WRITE_BIT);

//write to memoryBuffer

glUnmapBuffer(GL_ARRAY_BUFFER);

Wrapping up data reading/writing logic in 3 additional lines of OpenGL boilerplate is surely worth the added benefits.

Shader Calculations

OpenGL ES shaders are written in GLSL ES which is very much C-like with some sweet vector and matrix additions.

Simple Vector Operations

GLSL defines vec4 type that is a four component vector (with x, y, z and w components). While the language specification is the definite guide on the matter, there are few simple operations that are worth glancing over.

Let a and b be GLSL variables of vec4 type. One can easily multiply all vector components by scalar:

1
2
float c = 2.0;
vec4 r = a * c; // r.x = a.x * c, r.y = a.y * c...

One can also multiply two vectors component-wise:

1
vec4 r = a * b; // r.x = a.x * b.x, r.y = a.y * b.y...

As well as feed entire vector into plethora of built-in functions:

1
vec4 r = exp(a); // r.x = exp(a.x), r.y = exp(a.y)...

I highly recommend OpenGL ES 3.0 API Reference Card which is a short primer on OpenGL API as well as GLSL cheat sheet. It contains a list of all built-in functions and it is generally recommended to use them instead of manually recoding even the trivial functionality. Built-ins may have direct hardware implementation and thus should be faster than their hand-written counterparts.

Example Shader

Here’s the simplest shader written in GLSL ES 3.0 that doubles the input vector:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
#version 300 es

in  vec4 inVec;
out vec4 outVec;

vec4 f(vec4 x)
{
    return 2.0 * x;
}

void main()
{
    outVec = f(inVec);
}

As you can see, it’s very readable. We return a transformed input vector (marked as in) as an output vector (marked as out).

Processing Power

Why do we bother doing four calculations at once in a vec4 operations, instead of calculating single float values separately? Because crunching entire vectors is way faster. Some quick benchmarks showed using one vec4 operation instead of four float ones is 3.5 times faster.

What’s even more surprising, we don’t even have to return just one vec4 in a single shader calculation. We can use up to 16 in vectors to feed data to the shader. This seemingly arbitrary number is a platform specific value of MAX_VERTEX_ATTRIBS constant and is set in stone for all A7 devices. For the output vectors we’re limited by the GL_MAX_TRANSFORM_FEEDBACK_INTERLEAVED_COMPONENTS constant which for A7 is equal to 64. This means that we can define at most 16 out variables of vec4 type (since each vec4 has 4 components).

With that knowledge we can extend our shader to take in more input and generate more output:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
#version 300 es

in vec4 inVec1;
in vec4 inVec2;
in vec4 inVec3;
in vec4 inVec4;

out vec4 outVec1;
out vec4 outVec2;
out vec4 outVec3;
out vec4 outVec4;

vec4 f(vec4 x)
{
    return 2.0 * x;
}

void main()
{
    outVec1 = f(inVec1);
    outVec2 = f(inVec2);
    outVec3 = f(inVec3);
    outVec4 = f(inVec4);
}

By utilizing this method one achieves much faster execution. For instance going from 1 in/out vector pairs to 16 in/out vector pairs results in over 4 times performance increase (The Disappointing Raise example). In some cases however, the sweet spot is at 8 (The Promising Plot) or even 2 vector pairs (The Amazing Simulation), so this must be tuned manually on case by case basis.

Obviously, writing that much code by hand and changing pieces of function calls all over the place is tedious, so my GitHub implementation generates the shader for a given number of vectors. The only responsibility of programmer is to provide a source for vec4 f(vec4 x) function.

Multi-argument Functions

Throughout previous examples I’ve used a simple function that takes one argument and returns one value. However, it’s not that hard to write more complicated functions that take arbitrary (with some moderation) number of arguments.

Let’s consider a multi-argument function y = f(a, b, c) = a*b + c. In terms of math definitions this function is R³ → R. We don’t want to calculate just a single float value (it’s slow), so instead, we will calculate four values at once (an entire vec4 of values). Therefore, a simple shader would have three in vectors and one out vector. Assuming inVec1 contains a₀a₁a₂… values, inVec2 contains b₀b₁b₂… values and so on, the shader code is trivial

1
2
3
4
vec4 f(vec4 a, vec4 b, vec4 c)
{
    return a * b + c; // adding vectors here!
}

While another option is for the data is to be fetched from a single continuous buffer with a₀b₀c₀a₁b₁c₁a₂… ordering, notice that the shader would have to pick separate components from vector and perform calculations one by one, which is hardly optimal.

Single Argument Multivalued Functions

Let’s consider a multi-argument function (a, b) = f(x) = (cos(x), sin (x)), which means f is R → R². This function calculates location of point on a unit circle for a given angle x.

We can’t use a regular vec4 f(vec4 x) since we can’t output two vectors from on function. Instead, we could make use of a vec4 f(vec2 x) function, that would transform two input values into four output values. Clearly, the number of out shader vectors would have to be twice as large as number of in vectors.

Multi-argument Multivalued Functions

Finally, let’s consider a function (r, s) = f(a, b). This function takes in a 2D vector and returns 2D vector (R² → R²). Considering previous cases one might argue this decays into calculating two values of regular y = f(x) function, however this is hardly the case. For instance we can define function that return perpendicular vector f(x, y) = (-y, x). We couldn’t have done that in the plain old f(x) = y. On a side note, since we’re doing calculations on vec4 technically we could lookup the next component, but this technique is fairly limited, not to mention the extreme abuse of blindly reducing R² → R² to R → R.

As usual, we want to return as much data as possible, so we’re going to calculate two values of a function at once, packing them into one vec4:

1
2
3
4
vec4 f(vec4 a)
{
    return vec4(-a.y, a.x, -a.w, a.z);
}

Speaking of crazy R² → R² to R → R reductions, consider function f(x, y) = (3.0 * x, 2.0 * y). In this case the calculation does decay into regular f(x) = y calculation:

1
2
3
4
vec4 f(vec4 a)
{
    return vec4(3.0, 2.0, 3.0, 2.0) * a;
}

Notice, that this is merely a technical twist since a shader doesn’t care whether the function is R² → R² or R → R, it’s just patiently crunching the numbers.

Build Settings and Caveats

For my CPU vs GPU comparison I’ve compiled all programs with following settings.

  • Optimization Level – Fastest, Aggressive Optimizations [-Ofast]
  • Vectorize Loops – Yes
  • Relax IEEE Compliance – Yes

The application was build using Xcode 5.1 (5B45j). I run the tests on the iOS 7.1 beta2. Due to a bug (Apple Developer account required) in iOS 7.0.x, one can’t apply user defined functions inside shaders used for Transform Feedback. Only main() and built-ins work.

Unsurprisingly, manual “inlining” the body of f function into main did work. If you want to use Transform Feedback on iOS 7.0.x you can’t use function calls, so you’ll have to write a lot of code to make use of all 16 in vectors.

The time measurements were taken by running each test 8 times and then taking the average.

When

Now that we’ve seen how much performance the GPU offers and how to use it, let’s discuss the factors that can influence the application of the method.

Data Size and Computation Complexity

Basically, there are two requirements for the GPU powered computation to triumph: large data sets and non trivial calculations.

The simple polynomial evaluation inside The Disappointing Linear Function case is the best example of low-overhead calculations. This calculation is an example of fused multiply-add which usually can be executed with just a single instruction. Moreover, Clang is smart enough to generate vectorized code – the A7 computes four fused multiply-adds at once. It’s an additional performance boost explaining why the CPU is running circles around the GPU in this case.

The more complex evaluation of The Promising Complication case shows the benefits of parallelization. For large enough inputs, the simultaneous calculation performed by GPU starts winning.

Finally, The Amazing Iteration allows GPU to show its potential. While CPU has to calculate each case one by one, the GPU crunches multiple cases at once, leaving CPU in the dust.

The conclusion here is obvious. There is a huge overhead of firing up the OpenGL machinery, so once started it’s better to keep it rolling. There is one more amazing benefit of running the calculations on the GPU. After firing the calculations, the CPU is free to do whatever it wants.

Precision

It must be noted that values returned by GPU calculations might not be equal to the values returned by CPU. Are the GPU vs CPU errors large?

Comparisons

Bruce Dawson wrote an amazing article series on floating point format. In part 4 of thereof, he presented a genuinely cool way to compare floating point values. Basically, due the internal representation of floating point format, interpreting bits of float as int, incrementing that integer, and reinterpreting those new bits as float again, creates the next representable float number. To quote “Dawson’s obvious-in-hindsight theorem”:

If the integer representations of two same-sign floats are subtracted then the absolute value of the result is equal to one plus the number of representable floats between them.

For instance, if the integer difference between two float values is 4, it means that one could fit 3 other float values in between. Quoting Dawson just once more:

The difference between the integer representations tells us how many Units in the Last Place the numbers differ by. This is usually shortened to ULP, as in “these two floats differ by two ULPs.”

Just for the reference, I’ve also calculated the value of relative float error, defined as:

fabsf((cpu_value - gpu_value)/cpu_value);

For the evaluated examples, I will refer to the integer difference as ULP, and to the relative float value error as FLT.

Unfortunately, both methods have some issues for comparing near-zero values, but in general they provide a good overview of what’s going on.

The Disappointing Linear Function

ULP: avg: 0.0620527
     max: 1
     min: 0
FLT: avg: 6.24647e-09
     max: 1.19209e-07
     min: 0

The values calculated by GPU are as little as 1 ULP away from the CPU’s result and the average being so low, shows that less than 7% of values were not exactly equal to the CPU computation.

The Promising Complication

ULP: avg: 0.527707
     max: 3
     min: 0

FLT: avg: 3.34232e-08
     max: 2.49631e-07
     min: 0

The maximum ULP error jumped to 3 and the average ULP error increased as well. The relative error still provides a good indication that the value of error is insignificant.

The Amazing Iteration
ULP: avg: 46.6319
     max: 100
     min: 0

FLT: avg: 3.49051e-06 
     max: 8.61719e-06
     min: 0

The maximum value of 100 ULPs is a hint that each iteration accumulated some error. The relative error is still low at 3.49×10-4%.

Reasons

By default the precision of float (and, as direct result, of vec2, vec3 and vec4 as well) inside vertex shader is set to highp (GLSL ES 3.0 spec 4.5.4) which essentially is regular IEEE 754 float. Where do the differences in comparison to CPU results come from?

My familiarity with secrets of floating point values is not that great (despite starting to read What Every Computer Scientist Should Know About Floating-Point Arithmetic a couple of times). I have some hypotheses, but you should take them with a grain of salt.

One of the reasons for this effects could be slightly different implementations of exp, pow and trigonometry functions. However, The Disappointing Linear Function doesn’t use any built-ins and the small errors are still there. I suspect the most important reason is that the rounding mode inside GLSL computation is not defined by the specification. This could explain why doing 100 iterations of an algorithm inside GPU results in at most 100 ULPs difference.

Does it Matter?

Obviously, if precision is mission critical you wouldn’t use float anyway, jumping to double instead, thus skipping the GPU calculations entirely. Just for the record (and for the sake of linking to an interesting rant), you shouldn’t rely on the math library to perform exact calculations anyway.

For the more mundane tasks, the precision doesn’t matter, unless you do thousands of iterations inside a shader. For games and simple simulations you shouldn’t care at all.

Final Words

I’m extremely pleased with the results. It really shows that GPUs inside iOS devices are extremely powerful. I expected perhaps a 3-5x speed improvement, but the extremity of 64x is a shocker. In case of highly iterative calculations, you can reduce the wait time for the user from about a minute to little under one second. Even though the presented examples were somehow artificial, the benefits of huge performance gains and offloading the CPU are of utmost importance. Without a doubt, using Transform Feedback limits the effectiveness of the method to the brand new devices powered by A7 chip, but as time goes, this will become less of an issue.

As for the further work, a lot can still be done. Now that iOS7 enables reading textures in vertex shader (although some claim it was possible in previous releases as well), one could treat a float texture as a source of truly random (as in random access, not randomness) memory resulting in much more advanced vertex shader computation techniques. However, this still doesn’t solve the random write problem.

On the other hand, one might also try going the fragment shader way, using float textures as input and half float textures as output, but it still does not provide full featured GPGPU capabilities.

All those effort will become vein, when Apple provides access to the now private OpenCL framework. Until then, OpenGL abuse is the primary toy for us to play.