BuschnicK: Writing a Voxel engine from scratch in Vulkan

This is a series of posts that I originally published to my Facebook feed in late 2017. I repost them here for posterity because I still hope to find the time to continue the project.

November 13, 2017

Why are build systems and dependency management for C and C++ open source software still unsolved problems on Windows in 2017? I wanted to render some text in my toy Vulkan 3D engine so I figured I'd do the right thing and use FreeType, Harfbuzz and Pango. FreeType for rendering individual glyphs, Harfbuzz for shaping lines of text and Pango for laying out entire paragraphs. Making FreeType work was a breeze. Harfbuzz was painful, with lots of broken links, missing documentation and a more involved build process. Pango I completely gave up on. It's ridiculously convoluted and pulls in the entire world in dependencies. No wonder most people give up on the endeavour entirely and resort to hacking bitmap fonts manually. What a waste of time! I only ever get a few minutes here and there while Leonie is sleeping...

November 16, 2017

I'm reading up on modern approaches to global illumination in 3d engines. The process of tracing individual rays of light through complex 3d scenes, following them around for multiple bounces to create realistic reflections, refractions and shadows. It's mind boggling how much we can get away with brute force methods on contemporary graphics hardware. Millions of rays through millions of voxels 60 times per second. The supercomputer under everyone's desk today is truly marvelous. And yet I agree with Halvar Flake that there's no such thing as 'enough compute resources'. Exciting times watching where this will lead...

November 21, 2017

A few thousand lines of code later and I can render a first simple scene. The screenshot shows a MagicaVoxel (https://ephtracy.github.io/) scene and a little soldier dude in the background who's just there to make sure alpha transparency works. Voxels are flat shaded and colored from a palette. The vertex shader looks up a color value in a 1D palette image, performs the lighting computation and forwards the result to the fragment shader which is just a pass-through.

November 22, 2017

Leonie spent her first hour in the day nursery all alone today (and rocked it!). So I had time to bang out 200 more lines of code and implement Multisample Anti-Aliasing (MSAA). This is another one of these crazy brute force methods where the power of modern (graphics) hardware is just amazing. What we're trying to do is make sharp edges appear less jaggy/pixelated. So instead of drawing a red line with only red pixels we interpolate sub-pixel values and fill them with colors like "half-red, half-transparent". The effect is quite dramatic as you can see in the attached two images comparing MSAA ON vs OFF. Now the way to achieve this is to render the entire scene at a multiple of the screen resolution and then scale it down again. So basically the graphics hardware is doing the work to fill a screen 8 times the size and then throws away most of that information again. There is some smarts in modern hardware to avoid being quite so dumb, but this is an accurate description of how the first implementors approached it.

Here's a great explanation of the details:
https://mynameismjp.wordpress.com/2012/10/24/msaa-overview/

November 23, 2017

Leonie is spending more time in the day nursery so I have more time for coding ;-) Today was a major milestone. The image may not look like much and just another model but it is significant:
I store my voxels in cubic chunks of 32^3. So far I have only ever rendered a single chunk and didn't worry about crossing chunk boundaries. The model you see below is significantly bigger than a single chunk though, so I needed to tackle a whole slew of new problems.

First, since I plan on having very many chunks, more than the graphics card can store at once, I needed a way to only keep the set of chunks currently around the camera in graphics memory. To this end I have implemented a Least Recently Used (LRU) Cache. This keeps track of the chunks rendered recently and if memory gets tight, drops off the ones that haven't been touched in awhile. The LRU cache is implemented as a combination of a hash_map and a list to make all operations O(1).

Second, I needed a way to actually manage memory on the graphics card. My vertex buffers differ significantly in size, depending on the exact content of the chunk. Empty space in the model requires no triangles to render while complex stairs and such require very many. So I implemented a First Fit memory allocator that tracks allocated blocks. The LRU cache decides which blocks to keep and the First Fit allocator decides where to put them, hopefully maintaining low fragmentation.

Third, I needed a way to efficiently query for neighboring chunks and chunks around the camera. To this end I'm using an R*Tree (https://en.wikipedia.org/wiki/R-tree) as a spatial subdivision structure that allows for efficient intersection and coverage queries. So far I'm only using it for neighbor queries, but the next step will be querying for the camera frustum.

November 25, 2017

Added a heightmap reader (what you see is a section of Tamriel from Elder Scrolls - for some reason that turned up as the first hit for my heightmap Google search). Then wasted a day trying to figure out why my frustum math was broken. In the end it turns out that one of my assumptions on how GLM works was incorrect and the matrix is stored transposed relative to what I expected. Oh well. Now I can query my R*Tree for chunks intersecting the camera frustum, so that's good.

I've been looking for more heightmaps and was slightly disappointed:
- most algorithmically generated data sucks/is boring. Once you've seen a few of these you've seen them all. With a bit of experience it's often possible to directly discern the noise function they've used for generating the terrain.
- real digital elevation data is often freely available from agencies like the US geological survey. Even up to 1m resolution which would be perfect for my use case. However, they often provide it in custom data formats only supported by expensive Geo information systems. I don't want to waste time writing an importer for that. Sucks.

In an ideal world I'd have a synthesizer system based on something like TensorFlow. A neural net trained on real world elevation data that then synthesizes new textures from that input. Ideally with the option of accepting coarse additional human input as a guide. E.g.: "I want a mountain here, but feel free to fill in all the rest." Should be possible to build.

November 27, 2017

I was trying to load a huge heightmap to test my caching strategy. Turns out Visual Studio's default setting of building x86 binaries tripped me. It's tricky handling a multi gigabyte dataset in a 32 bit address space which only affords you 3GB of RAM if you're lucky (no fragmentation). So I had to change my project's configuration to build a 64 bit x64 binary. That change by itself was trivial - except for a few warnings about the now larger size_t type my code built just fine. The issue was rebuilding all dependencies (SDL, SDL image, boost, Google protobuffers, gunit, harffbuzz, FreeType). This took a while. Luckily Vulkan is a prebuilt library and GLM is a header only library.

After that was fixed I had to debug an issue with my caching strategy. I had to update it to be able to deal with zero byte chunks. This may seem a bit counterintuitive, but a chunk may completely disappear when producing a mesh from it, even if it is actually non-empty and contains voxels. Specifically this is true for fully solid interior chunks. A chunk completely surrounded by other solid chunks will not get any polygons created for it and result in a null mesh.

So now I can cruise over a giant heightmap with chunks happily streaming in and out of the GPU.

November 27, 2017

Lots of algorithms on voxels are of the "embarrassingly parallel" nature. So far I have been holding off parallelizing anything. I thought I'd make it correct first and fast later. However, testing with the large heightmap took about 8 seconds for loading and generating meshes from it. This became old pretty fast. So I looked at intel's Thread Building Blocks (TBB) library for writing parallel code. It seems very well designed and integrating it only took a few minutes. The parallelized loading code got a nice near linear speedup from 8 seconds to about 1.3 seconds utilizing all my 8 cores. Quick win!

There are even more optimizations to be had by utilizing the SIMD instruction set and processing entire rows of voxels at once instead of one by one. This is far more involved though and makes the code much more rigid. Since I haven't finalized the data format yet and in fact don't even know what properties I'll eventually store per voxel, I'm postponing this for now. But it does look like a fun challenge! In any case, storing the data in a column oriented and SIMD friendly manner (i.e. structs of arrays instead of the more familiar arrays of structs OOP design) should have cache benefits even if I don't use the enhanced instruction set right away.

November 29, 2017

Leonie has been sick the past few days so we had some bad nights and little time for hobby projects. I still managed to implement a few things:

1) Improved MagicaVoxel file format support. See attached screenshot.

2) Greedy meshing. This is the biggest and trickiest new feature. Took me quite a while to get right and the resulting code is 7 nested loops and a goto (!). It's pretty cool though. When naively meshing a voxel grid you create quads (pairs of triangles) per cube face. That means 12 triangles for a single voxel. This adds up quickly! One way to simplify this is to perform what amounts to two dimensional run length encoding. You merge neighboring voxels with the same surface properties into larger rectangles and create a single big quad to cover all of them. The downside is that the mesh topology gets worse (because you introduce t-junctions), but the amount of geometry you have to send to the GPU is reduced dramatically (for my models a factor of 4x to 6x).

More background on meshing voxels:
https://0fps.net/2012/06/30/meshing-in-a-minecraft-game/

And before you ask about the nested for loops. You go through the chunk volume in axis aligned slices and greedily construct rectangles:
for axis in [0..3]
for side in [0..1]
for dimension one in [0..chunksize]
for dimension two in [0..chunksize]
for dimension three in [0..chunksize]
while horizontal run continues
while vertical run continues
for length of horizontal run

The goto is just an early out of one of the inner loops in case the surface properties no longer warrant merging voxels.

3) More parallelization work. Since the new meshing strategy is about twice as expensive computationally as the naive approach I needed to speed it up. I took advantage of Vulkan's ability to record command buffers in multiple threads and use secondary command buffers. If we need to mesh multiple chunks in a frame the work is branched out to multiple threads on a TBB (intel thread building blocks) thread group. Compared to the serial version this gives me a 3x to 4x speedup. The remaining performance bottlenecks are the boost RTree traversal and memory allocations for the meshes. The former is amenable to parallelization as well, but would require me to hack the boost library which I'd like to avoid for the time being. I could play around with different allocators for addressing the latter, but that will probably not improve things dramatically. Better to find a way of avoiding allocations altogether

December 3, 2017

"compiler is out of heap space in pass 2"

;-( Don't you just love it when your tools fail you?

December 8, 2017

My first tree! This may seem unremarkable, but required a lot of work. The other complex models I posted were all hand made with the external modelling package MagicaVoxel and required painstakingly placing each individual voxel. This tree on the other hand is procedurally grown. If I change just a handful of easy parameters or the random seed I can grow a new tree or an entire forest.

The intuitive approach for growing a tree would be to start at the root and grow out branches one by one. And in fact, many algorithms do just that and use L-Systems (https://en.wikipedia.org/wiki/L-system). However this has a few downsides: One, the resulting plants tend to look very fractal and with a discernable repeating structure. Two, it is extremely hard to avoid self intersection with this approach. Branches will grow straight through other branches or the stem of the tree. Even if you manage to solve this for a single tree, the problem gets compounded if you want to grow a forest where branches should not intersect neighboring trees. Three, finding good parameters and rulesets for these algorithms is hard.

So instead of using L-Systems I chose a Space Colonization Algorithm (see paper below). The crucial insight of this algorithm is an observation of what a tree is trying to achieve: grow its branches into the light. Thus you start from the goal instead of from the bottom. Seed a volume of arbitrary shape (I'm currently using a simple sphere) with randomly distributed attraction points. These will determine the shape of the crown. Then you grow branches towards these attraction points following a few simple rules:
1) each attraction point only influences the closest branch segment
2) an attraction point that is closer than a given threshold to a branch segment gets killed off
3) every iteration of the algorithm you add new branch segments to the tree, growing in the average direction of all attraction points influencing the new segment
4) keep going until no attraction points are left or none can be reached within a radius of influence

It's a beautifully simple idea with remarkably lifelike results. And as a bonus growing an entire forest (or roots for that matter) is no different than growing a single tree - you distribute attraction points everywhere and start with more than one root and voila - a forest. You need to perform lots of nearest neighbor queries to thousands of attraction points every iteration. This is expensive, so to make it efficient I'm resorting to using an R*Tree (the data structure ;-)) as a space partitioning mechanism. I'm a little proud that despite the complexity of the algorithm and a significant amount of code my very first execution of the program led to the result you see. Usually such things require a few attempts to get right ;-)

Now the next task is to give the tree, especially the trunk, volume. Leonardo da Vinci observed that trees are essentially a network of pipes transporting water from their roots to the branches. As such, the rule for the width of a tree is to keep the pipe capacity constant. That is, starting from the terminal branches progress towards the trunk and whenever branches merge preserve the surface area of the intersection disks. This is easy enough to compute - however drawing "thick" lines is surprisingly hard. Even just the single voxel width line drawing algorithm isn't entirely trivial. An exercise for another day.

Lastly, the trees as a network of pipes model is only correct for trees in moderate climates where transporting water is the primary concern. Trees in desert climates are more interested in storing water and you end up with funky shapes like baobab trees. Essentially barrels with a few sticks on their heads. Should be an interesting task to adapt the algorithm to such specialized trees. Coniferous trees and weeping willows should pose cool challenges too.

"Modeling Trees with a Space Colonization Algorithm"
http://algorithmicbotany.org/papers/colonization.egwnp2007.large.pdf
Aside: Department of Algorithmic Botany?! How cool is that?

December 9, 2017

The voxel editor I use, MagicaVoxel, encodes the color of a voxel in a single byte. This implies every model has a palette of at most 256 unique colors. Typically every model uses its own palette, tailored for that specific model. A tree would use mostly greens and browns while an 80s style arcade might be all bright neon. This presents a problem if you want to create a scene containing more than one model, e.g. an arcade underneath a tree. We need to squeeze the colors of both models into a new, joint palette somehow. Preferably while preserving as much visual fidelity as possible.

The way I solved this dilemma is by creating a master palette that covers a broad range of colors. When loading a model into my scene I now need to remap the model's colors to the best matching color in the master palette. Images are typically stored in RGB (red, green, blue) color space. This is fine for sending to a monitor, but it is terrible for computing distances (=differences) between colors. RGB space doesn't match the human visual system's perception of what constitutes close colors very well at all. Humans have evolved to be much more sensitive to differences in green than blue for example. The color space should reflect that.

Luckily a bunch of people created such a color space back in the 70s. It's called Lab (https://en.wikipedia.org/wiki/Lab_color_space). You take your RGB values and transform them through a bunch of magic numbers and square roots. The resulting space reflects human perception much better and a special formula for color differences, CIELAB delta E (https://en.wikipedia.org/wiki/Color_difference) has been devised. I think they ran actual experiments where humans were shown colors side by side and should press a button if they thought they were different. Whatever the details, today this deltaE color difference is used for measuring color reproduction quality in print, textiles, photography and many other domains.

Anyway, back to my tree over the arcade: I load both models and remap their private palettes to the master palette by picking the best matching color according to deltaE. If the master palette contains anything even close to the original colors the models come out nicely at the end and I can have a merry collection of things in a single scene. Win!

While researching some of this I found forum entries where people painstakingly did this remapping manually in MagicaVoxel. Maybe I should release a tool and ease their pain ;-)

December 10, 2017

I fell down a completely unexpected rabbit hole today. When I generated my first tree I was just drawing it as a stick figure, thinking that extending those thin limbs to branches with volume would be easy. Boy was I wrong. Now the Leonardo pipe model of a tree does give me correct radii for every branch. This just required careful traversal of the tree data structure. But actually rendering thick lines proved to be a hard problem indeed!

A line with volume is a cylinder in 3D. To draw a cylinder into a voxel grid you need to scan convert it. That is, you need to find all the voxels partially covered by the cylinder. The cylinder is given in parametric form, i.e. as its two endpoints and a radius. The task is to somehow iterate this and compute integer grid coordinates without visiting a single grid cell more than once or, worse, missing one.

A plethora of efficient scan conversion algorithms exist for the 2D case. We have been projecting any kind of geometry down to a 2D grid of pixels (the screen!) since the beginning of computing. Projecting them into a 3D voxel grid is something else entirely though and the literature is depressingly thin on the topic. True 3D displays are rare (although the Zürich Christmas market would be perfect with its lines of LEDs dangling above the street), so few people have tackled the problem.

I found some ancient papers but no ready made solution. I'm interested in your ideas!

What I did in the meantime is write a brute force solution:
Compute the smallest axis aligned bounding box containing the cylinder. This is not trivial by itself and warrants another post sometime. It involves some cool tricks to make it efficient. We know all voxels touching the cylinder must be contained in this bounding box. Now we iterate the bounding box and ask for every single voxel: "am I in the cylinder?". This can be done relatively efficiently by computing a few dot products to get the distance of the query point to the endcaps of the cylinder and the axis. If the query point passes the tests we draw it. Voila, a thick line or cylinder.

This works well enough for cylinders that are roughly parallel to an axis and not too long. It gets miserably inefficient if you have a long thin diagnoal cylinder. The volume of the bounding box relative to the volume of the cylinder will be huge, making my simple algorithm do way too much work.

The other issue, even if I had an efficient cylinder rendering algorithm, is how to make multiple cylinders meet up without gaps. If you have two cylinders meet up at an angle you'll have a wedge missing between them, making my trees look as if someone started cutting them down. I'm thinking rendering capsules instead of cylinders will mitigate this problem. But maybe an entirely different approach is needed and I should use some sort of spline or so.

Anyway, attached is the current state of the art. The white floating voxels are the attractors that led to the current tree shape.

December 11, 2017

Today I grew a bunch of trees simultaneously, letting them compete for the same set of attraction points. I also spread the attraction points in a more interesting pattern than just a simple sphere. With a little bit of imagination you can see the forest ;-)

I also experimented with different exponents for the pipe capacity model of the trees. The "correct" solution is to use 0.5, i.e. the square root for calculating the radius given the area of a disc (2 * pi * r^2). I tried using things like 0.3 and 0.7 to have the trees thin out faster or slower. But the results didn't look convincing, so I stayed with the original model (which I think produces relatively slim trees? What do you think?).

Then I implemented an algorithm for scan converting a solid sphere or ball. Because of the inherent symmetry this is much easier than discs, cylinders or any other three dimensional structure. There's an efficient algorithm for circles which a certain Mr Bresenham came up with a long time ago. It also takes advantage of the symmetrie and basically mirrors the octants of a circle around the center point. It uses only simple integer operations and is thus pretty efficient.

To convert the Bresenham circle algorithm to three dimensions we observe that we can build up the ball as a stack of vertical slices. So we apply Bresenham twice in a row: once to compute the radius for a given vertical slice and again to actually render that slice. Works like a charm.

Now why did I want balls? (haha). Because stacking thick cylinders at angles cut wedges into my tree trunks (see images). To fix this I'm now rendering a ball as a joint between any two cylinders. Problem solved.

December 18, 2017

I can generate lots of trees now. Question is: where should they grow? Spawning them randomly across the map leads to a very chaotic and busy look. It also doesn't necessarily feel realistic. The other extreme, growing them in strict rectangular patterns is boring and only ever happens in plantations anyway. Halfway between the two extremes one could assign each tree to a grid cell and then wiggle it (randomize) the position only within that cell. Better, but also not the look I was hoping for.

What we really want is blue noise. Did you know noise has color? Most people are familiar with the term white noise, but there are also pink, brown, blue and others [1]. Anyway, there's an efficient algorithm for implementing Poisson Disk sampling [2] in arbitrary dimensions. The idea is to generate a bunch of points that are at least a certain distance d and at most 2*d apart from one another. In two dimensions you start by picking a single point on the plane randomly and put it on an "open" list. You create a number k of random points in the annulus surrounding that point. Check for each of the newly generated points if the minimum and maximum distance conditions hold (see below). If so, put them on the open list. Iterate until the list is empty by popping a new reference point from the open list.

Ordinarily to make sure the distance criteria are being obeyed you'd have to check every point against every other point. This would be terribly inefficient for a large set of points. So we use a grid of cell size min_distance / sqrt(num_dimensions). Every point we keep is marked in this grid. Now to check whether a newly generated point is a valid candidate we just have to traverse the grid in its neighborhood. If all cells are empty, we can keep the point. If a cell is occupied, we compute the exact distance (since the grid only gives us an approximate answer), checking the condition again. The grid is constructed such that we need only check one cell in the diagonal directions and two cells in horizontal and vertical directions.

Now the last tricky bit to get right is how to uniformly sample points from an annulus, or ring. If you naively choose a point by picking a random angle and distance in [min_distance, max_distance] you'll end up with points clustering closer to the inside of the ring than the outside. The reason of course is that there's more surface area on the outside of the ring for the same number of points. To compensate for this we observe that the surface grows quadratically and we thus should choose a random distance according to:
distance = sqrt(random * (max_distance^2 - min_distance^2) + min_distance^2)
Voila, now the points are uniformly distributed on the ring.

The result is now a bit too regular for my liking, but I'll explore two ways of fixing this next time ;-)

[1] https://en.wikipedia.org/wiki/Colors_of_noise#Blue_noise
[2] https://www.cct.lsu.edu/~fharhad/ganbatte/siggraph2007/CD2/content/sketches/0250.pdf