## Saturday, February 6, 2016

### ray-box intersection and fmin

In working on the code for the next mini-book, I have followed the philosophy of using the simplest code and trusting that it is good enough.   For bounding volume hierarchies I tried making a simpler ray-box test that people usually use.   I was especially impressed with this post by Tavian Barnes.

This is the simplest ray-box test I was able to write:

The overall code seemed pretty slow so I tried some old more complicated code that stores the sign of the ray propagation directions as well as the inverses of the ray direction components:

This was a lot faster in the overall ray tracer so I started digging into it a bit.   I wrote an isolation test that sends lots of rays from near the corner of a box in random directions (so a little less than 1/8 should hit).    It occurred to me that the fmax and fmin might be slower due to their robust NaN handling and that does appear to be the case.    Here is the test I used on my old macbook pro compiled with -O.   Perhaps it gives a little too much amortization advantage to the long code because 100 trials for each ray, but maybe not (think big BVH trees).

The runtimes on my machine are:

using mymin/mymax: 5.4s
using fmin/fmax: 6.9s

As Tavian B points out, the properties of fmax and NaN can be used to programmer advantage, and vectorization might make system max functions a winner.   But the gaps between all three functions were bigger than I expected.

Here is my code.   If you try it on other compilers/machines please let us know what you see.

Ian Mallett said...

I actually looked at this exact problem in conjunction with ray-box intersections. This is what my code looks like (simplified):

bool intersects(Ray const& ray,AABB const& aabb, float* min_possible_dist_contents) {
Vec3 ts0 = (aabb. low - ray.orig) * ray.dir_inv;
Vec3 ts1 = (aabb.high - ray.orig) * ray.dir_inv;

float tmin = vec_min(ts0,ts1).get_max();
float tmax = vec_max(ts0,ts1).get_min();

//Early termination, surprisingly, seems to be worth the branches.
if (tmin>tmax) return false; //Doesn't hit box at all
if (tmin>EPS_NEAREST_HIT) { //Box is ahead
*min_possible_dist_contents = tmin;
return true;
} else if (tmax>EPS_NEAREST_HIT) { //Ray starts inside box
*min_possible_dist_contents = 0.0f;
return true;
}
return false; //Box is behind
}

N.B. "vec_min" returns a "Vec3", and then ".get_max()" is the maximum over its elements. What's excellent about this scheme is that:
1: You can do the first min/max as a vector operation, (in this case, "_mm_min_ps"), (my "Vec3"s are of course padded to 4 floats). So you get parallelism.
2: "_mm_min_ps"/"_mm_max_ps" are much faster than their std-library counterparts. They're three cycles each (the double precision versions are also!).
3: The horizontal .get_max()/.get_min() can be parallelized too, with a bit of work--although I just use scalar operations.
4: This is a lot less branchy. As the comment attests, I actually had a fully branchless version, but it's actually slower.

But the main thing, if you look at the disassembly, is that the normal std-library min/max are doing lots of branchy checking for NaN and overflows and things. That shouldn't ever happen, so don't use that. Use "_mm_min_ps"/"_mm_max_ps", even if you're using scalar code.

Thiago said...

The link to your code is going to the tavianator blog: is that intended?

Did you really mean to compile with -O? That's going to produce terribly slow code which is only useful for debugging. For comparing algorithms and low-level optimizations, it's almost worthless. I recommend making -O3 a default setting.

std::min/max are implementation dependent. On one particular x64 linux machine, all reasonable compilers (clang, gcc, and icc) will convert that to the minss/maxss scalar SSE instructions which is great (proof: http://goo.gl/RaFbDg). But try and implement min/max yourself and icc 13 gets dumb (see same link but choose icc in compiler list). On the other hand, the order of operands on some systems, such as the example I provided above, is backwards, so the nan pass-through behavior will also be backwards. This prevents writing portable robust BVHs with std::min/max. I don't have Windows, but it wouldn't surprise me that the MSVC compiler probably does the absolute worst thing possible and probably explains Ian's results.

If you want the fastest possible robust code over a variety of OS and compiler combinations, you have to write it using scalar SSE intrinsics (_mm_min_ss) which defeats your goal of simple code, not to mention at this point you might as well switch from scalar to the more complex SIMD. I'd advocate implementing your own min function like in my example above and hope that most compilers will be smart enough to optimize this to the minss assembly instruction for you.

If you don't care about handling nans, which honestly is probably fine most of the time, especially if you're writing an intro book, then just use std::min instead of fminf and call it a day. Some users might get worse performance, but I'm sure it'll still be faster than fminf.

As for testing bounding box intersection performance, you have to be careful of synthetic tests which might put too much or too little emphasis on early outs. I'm just guessing here, but I'd assume that a 1/8 hit probability is way too low for a quality BVH, making early outs seem better than they really are.

Ian Mallett said...

Thiago, just to clarify, by std-library min, I meant the C std library fmin and friends, not C++ std::min. These are required by the standard to handle NaNs and suchlike. Sorry for any confusion. Also, these results came from GCC 5.3 on Linux (and variously, Intel 16.0), not MSVC on Windows.

---

To follow up on my first comment yesterday, I did some work trying to parallelize the horizontal operation (for vec4, not vec3). It turns out to be not worth it. The simple method is two movss and three minss/maxss. The parallel verion has two minss/maxss. The problem is that pipelining issues make it slower anyway.

Peter Shirley said...

Thanks -- link is fixed. http://www.cs.utah.edu/~shirley/aabb/

I actually tried -Ofast and -O and -O was slightly faster. I'll do some more tests.

Peter Shirley said...

Tried all the versions on clang with -O, -Ofast, and -O3 and interestingly -O was the fastest on two of them. Weird stuff.

Thiago said...

Oops, I thought -O meant -O0 and not -O1, still not as great as -O3, but not too bad. In this simple test, -O1 and -O3 actually produce identical code for me. In more complex code, there could be a difference. I don't recommend -Ofast if you're trying to write numerically robust code since it will enable fast math optimizations.

I tried the benchmark you provided and it does a great job as a drand48 benchmark -- about 70% of the time is spent creating random directions. Here's a simple modification that lets you better measure things (compile with -std=c++11):

#include "aabb.h"
#include

int main() {
int hit = 0;
int n = 1000000;
int trials = 100;

vec3 origin(2,2,2);
aabb box(vec3(0.1, 0.1, 0.1), vec3(1.99, 1.99, 1.99));

vec3 *dirs = new vec3[n * trials];
for (int i=0; i < n * trials; ++i)
dirs[i] = vec3(drand48()-0.5,drand48()-0.5,drand48()-0.5);

auto start = std::chrono::high_resolution_clock::now();

for (int i = 0; i < n; i++) {
for (int t =0; t < trials; t++) {
ray r(origin, dirs[i*trials + t]);
if (box.hit1(r, 0.0, 10.0))
hit++;
}
}

auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration diff = end-start;
std::cout << diff.count() <<"s\n";

std::cerr << float(hit)/float(n*trials) << "\n";
}

This is a much better benchmark at showing the relative time differences of each intersection method, but it's still not great at evaluating how these methods will perform in a BVH. One major problem remaining is that since the drand48s were inside the trials loop, each trial had a different direction so you aren't measuring the amortization of the long bbox intersection variant. The trials loop in the original code didn't really add anything that the upper for loop couldn't do with more iterations. My version hasn't fixed that.

If you do use the same direction for each trial, there's also the problem that the compiler might optimize the inner loop out since the result is the same for each run. In fact, when I tried this, it looked like the compiler was in fact doing this. Compilers can be frustratingly amazing sometimes.

After removing the drand48 overhead, the hit3 version ends up being about 3x slower than hit1 for me even though they do almost the same thing. This happens for two reasons:
1) hit3 is doing the same divide over and over instead of multiplying by the inverse.
2) ray constructor is creating all the precomputed data needed for hit1 but not needed for hit3 (or hit2)

If I fix these, then hit3, as expected, is about the same speed as hit1 in the benchmark.

Moral of the story, in the book you should use the simple to understand hit3 but with the division precomputed at the start of the for loop and you should double check all synthetic benchmarks with actual tests in the ray tracer.

Eric Haines said...

Frank Gennari said...

I've experimented with the performance of this exact same function myself. I don't have the code for reference right now, but I took a different approach. I started with an early reject test in the case where both points on the line were on the same (outside) side of any plane of the box. Then I ended up creating an 8-way case split with template code for the various combinations of signs in the ray direction. That reduces the code to a single greater/less operation per dimension, rather than using min() or max().

This approach seemed to outperform other code that I tried. However, it may be optimized for the data layout of my point and line structs. I also tried using SSE2 instructions but it was slower, probably because I had to transpose some of the floating-point numbers to have all of the X component data adjacent (and same with Y and Z).

I would think that a better performance test would be to choose both the starting and ending points of the rays randomly, so that rays with directions of both signs are present. This will help ensure that there are examples where the ray will intersect each face of the box. Maybe have the n loop select the starting point and trials loop select the ending point from two precomputed sets of points.

Frank Gennari said...

As promised, I found my ray-AABB intersection code. It looks like I removed the early rejection test. There are two parts to the code. The first is the intersection template function where the signs of the three axes are template parameters to simplify the math and produce a ~20% speedup:

template bool get_line_clip(point const &p1, vector3d const &dinv, float const d[3][2]) {

float tmin(0.0), tmax(1.0);
float const t1((d[0][xneg] - p1.x)*dinv.x), t2((d[0][!xneg] - p1.x)*dinv.x);
if (t2 < tmax) {tmax = t2;} if (t1 > tmin) {tmin = t1;}

if (tmin < tmax) {
float const t1((d[1][yneg] - p1.y)*dinv.y), t2((d[1][!yneg] - p1.y)*dinv.y);
if (t2 < tmax) {tmax = t2;} if (t1 > tmin) {tmin = t1;}

if (tmin < tmax) {
float const t1((d[2][zneg] - p1.z)*dinv.z), t2((d[2][!zneg] - p1.z)*dinv.z);
if (t2 < tmax) {tmax = t2;} if (t1 > tmin) {tmin = t1;}
if (tmin < tmax) {return 1;}
}
}
return 0;
}

Here, "p1" is the first point on the line segment, "dinv" is the inverse of the line segment's delta = 1.0/(p2 - p1), and "d" is the AABB stored as {x1,x2},{y1,y2},{z1,z2}.

The code to select a function pointer based on line slope is here:
if (dinv.x < 0.0) {
if (dinv.y < 0.0) {
if (dinv.z < 0.0) {get_line_clip_func = get_line_clip<1,1,1>;}
else {get_line_clip_func = get_line_clip<1,1,0>;}} else {
if (dinv.z < 0.0) {get_line_clip_func = get_line_clip<1,0,1>;}
else {get_line_clip_func = get_line_clip<1,0,0>;}}} else {
if (dinv.y < 0.0) {
if (dinv.z < 0.0) {get_line_clip_func = get_line_clip<0,1,1>;}
else {get_line_clip_func = get_line_clip<0,1,0>;}} else {
if (dinv.z < 0.0) {get_line_clip_func = get_line_clip<0,0,1>;}
else {get_line_clip_func = get_line_clip<0,0,0>;}}}

I experimented with dozens of versions of this function and this one turned out to be the best for multithreaded performance (8 threads on a quad core CPU with hyperthreading) on Windows using MSVC++ 2010.

Frank Gennari said...

Ah, looks like the template parameters got messed up in the XML formatting, and I don't know how to edit that post. They should be:
bool xneg, bool yneg, bool zneg