Efficient Octree Storage and Traversal

Introduction

I had a problem. The Galaxy Generator worked fine when I was moving around inside the Galaxy, but it seems important that I should be able to move outside it, turn the brightness up, and see the whole thing at once. When I tried that, the frame rate would drop to nearly zero and the machine would then run out of memory. Here was the sort of image that I was looking for:

Galaxy generator!

The problem turned out to be caused by generating way too many nodes in the octree that I use to generate stars, and the size of each node being too large. This post describes the process of resolving those issues by making the octree nodes much smaller, and generating them selectively.

Introduction
Contents
Octrees and traversal
Storage for reuse
Storing the tree more efficiently
Using even less memory
Conclusion

Octrees and traversal

An Octree is a data structure for recursively dividing space where each internal node has at most (or exactly, according to which definition is used, I use the ‘at most’ version) 8 sub-nodes. Each node represents a cubic area in space, and the volume of each node will be reduced by 8 for every level in the tree. This shows how a two level octree divides space, with the blue cube being an example of a level 2 octant:

Octants!

The Galaxy Generator procedurally generates a galaxy containing a large number of stars, and can’t generate them all at once, so uses an octree and a fitness function.

The fitness function is a function \(F(e_x,e_y,e_z,x,y,z,l)\) where \(e_x,e_y,e_z\) are the coordinates of the eye or viewpoint, \(x,y,z\) are the coordinates of some specified corner of the octant (I pick minimum x,y,z of the octant), and \(l\) is the level of the octant in the tree. For \(F\), a lower value means a better fit, so for instance “distance squared to the nearest point in the octant” (squared to avoid the expense of the square root in the formula \(d=\sqrt{x^2+y^2+z^2}\)) could be used as such a function:

int64_t minimum_distance2_eye_octant(int32_t ex, int32_t ey, int32_t ez,
			int32_t basex, int32_t basey, int32_t basez, uint32_t level) {
	int32_t octant_size=1LL<<(32-level);
	return min_distance2_dim1(basex,ex,basex+octant_size)
		+min_distance2_dim1(basey,ey,basey+octant_size)
		+min_distance2_dim1(basez,ez,basez+octant_size);
}

where

inline int64_t min_distance2_dim1(int32_t base, int32_t eye, int32_t top) {
	uint64_t d;
	if(eye>=base) {
		if(eye<=top) {
			return 0;
		} else {
			d=eye-top;
		}
	} else {
		d=base-eye;
	}
	return d*d;
}

One important property of \(F\) is that a more interesting octant can not be contained in a less interesting one. More formally:

If the octant \(x_1,y_1,z_1,l_1\) contains the octant \(x_2,y_2,z_2,l_2\) then \(\forall e_x,e_y,e_z, F(e_x,e_y,e_z,x_1,y_1,z_1,l_1)\leq F(e_x,e_y,e_z,x_2,y_2,z_2,l_2)\).

Now it’s easy to do a depth first partial traversal, given \(F_{max}\) is the maximum value of interest. In pseudocode:

depth_first_traversal(e_x,e_y,e_z,x,y,z,level) {
	if F(e_x,e_y,e_z,x,y,z,level)>f_max or level>level_max then return
	do any processing associated with the octant at x,y,z,level
	let s=width_associated_with(level)/2
	depth_first_traversal(e_x,e_y,e_z,x  ,y  ,z  ,level-1)
	depth_first_traversal(e_x,e_y,e_z,x  ,y  ,z+s,level-1)
	depth_first_traversal(e_x,e_y,e_z,x  ,y+s,z  ,level-1)
	depth_first_traversal(e_x,e_y,e_z,x  ,y+s,z+s,level-1)
	depth_first_traversal(e_x,e_y,e_z,x+s,y  ,z  ,level-1)
	depth_first_traversal(e_x,e_y,e_z,x+s,y  ,z+s,level-1)
	depth_first_traversal(e_x,e_y,e_z,x+s,y+s,z  ,level-1)
	depth_first_traversal(e_x,e_y,e_z,x+s,y+s,z+s,level-1)
}

For the example \(F\) above, this will traverse any octants contacting a sphere of radius \(\sqrt{F_{max}}\).

Storage for reuse

Doing the processing each time \(F\) changes (which will be every time the viewpoint changes, which might be each frame) can be expensive, so it makes sense to store all or part of the tree. I need to store:

Pointers to the sub-octants,
The list of generated stars (I use CompactStar as the class to contain an individual star),
A pointer to the part of the density function that is used for generation.

It ends up looking something line this:

struct Octant {
	array<Octant*,8> sub_octants;
	vector<CompactStar> stars;
	DensifyEval* density_eval;
};

The total size of this is 96 bytes (vector is 8 bytes pointer to data, 8 bytes pointer to end of data and 8 bytes pointer to end of capacity), plus whatever the overhead is required by the memory allocator for allocating the Octants and the CompactStars. It turns out that sometimes the display will benefit from having 10 million octants or more, which means that I might end up with over 1 GB of storage required - and I consider 1 GB to be my total memory budget.

In some circumstances it gets even worse - because the location and size of the octant is encoded by its place in the tree, to do any sort of traversal other than depth first would involve storing some additional state. State that needs to be stored includes x,y,z, each with a minimum number of bits corresponding to the tree level (the trees I’m using are more than 16 level, so I go up to the next size of 32 bits), and the level. This will be an additional 13 bytes:

int32_t xhi, yhi, zhi;
uint8_t level;

bringing the total to 107 bytes (plus allocation overhead) for say a breadth first traversal. This gets padded to a multiple of 8 for a total of 112 bytes.

One way of bring the size down is to not use individual new (or malloc for C programmers) calls for each Octant and for each array of CompactStar, but to have that effectively memory managed by some underlying structure (this could be an array if the maximums size is known in advance, or a vector if not, or some other suitable structure). Assuming that there aren’t more than \(2^{32}\) Octants, they can now be referred to by index rather than by pointer, and the same can be done with the density evaluation function and the CompactStar list, although in the last case, some care might need to be taken to avoid too much fragmentation. Lastly, I make an arbitrary decision that there are never more than \(2^{16}-1\) stars associated with an octant. The result is much smaller:

typedef uint32_t OctantIndex;
typedef uint32_t CompactStarIndex;
typedef uint32_t DensityEvalIndex;

struct Octant {
	array<OctantIndex,8> sub_octant_indices;
	DensityEvalIndex density_eval_index;
	CompactStarIndex compact_star_index;
	uint16_t compact_star_count;
};

The total size of this is 42 bytes (padded to 44), or 55 bytes (padded to 56) if the extra non-depth first state is needed. Originally, I was intending that there would never be more than \(2^{16}\) octants stored at once, which would have allowed OctantIndex to be a uint16_t, however that turned out to have an unfeasibly large number of stars per octant.

Storing the tree more efficiently

Note that the lion’s share of that last structure (32 out of 42/55 bytes) is the indices to the sub-octants. Can I get rid of these? It turns out that I can under the following conditions:

Octants don’t need to know directly where their children are or interact with them (so this isn’t going to work for a ray tracer or collision detector)
Traversals are ‘mostly’ fitness first, in that octants get evaluated in approximately best fitness order.

This is very suitable for what I’m doing - stars are treated as being independent of each other, and fitness first traversals have advantages over depth first traversals in cases where there are resource limitations (time, octant count or star count in my case). Here is the data structure, with explanation of fields below:

struct Octant {
	int32_t xhi, yhi, zhi;
	OctantIndex parent;
	DensityEvalIndex density_eval_index;
	ComactStarIndex compact_star_index;
	uint16_t compact_star_count;
	uint8_t sub_octant_mask;
	uint8_t level;
};

The total size for this is 28 bytes.

density_eval_index, compact_star_index, compact_star_count are as before,
xhi, yhi, zhi, level from the non-depth first traversal case are now required,
each octant has a reference to its parent,
Each octant has a boolean indicator for each of its potential children for whether they exist or not. I really wanted to use a std::bitmask for this, but it turns out that the size for this is a multiple of 8 bytes, so instead I use an 8-bit sub_octant_mask and do the bit wrangling myself.

How the parent and sub_octant_mask are used:

The top octant is a special case that doesn’t have a parent - the easiest way of dealing with this is to just to create it early, then never delete it until just before program close, otherwise the special case has to be checked often.
Every time a (non top) octant is created, the parent is known at creation time, so put in the parent field, and the bit in the parent’s sub_octant_mask corresponding to the new octant is set,
Just before every time a (non top) octant is deleted, the bit in the parent’s sub_octant_mask corresponding to the octant to be deleted is cleared.

This way, each octant knows which of its children exist, but has no quick way of accessing them (it would take a linear search of the underlying structure).

Traversal

One thing that is going to be needed is a version of \(F\) (call it \(f\) that has a smallish number of possible values, as octants are going to be arranged in buckets. The easiest way of doing that is compose \(F\) with an another function that flattens the result, so:

\(f(e_x,e_y,e_z,x,y,z,l)=\mbox{flatten}(F(e_x,e_y,e_z,x,y,z,l))\)

A good flatten function is some sort of logarithm (distance changes matter less as distances get larger), but logarithms are expensive to compute. Fortunately, for positive values, the IEEE-754 representation of a floating point number is an approximate logarithm, so all that needs to be done is to cast the result of \(F\) to a float, convert that sequence of bits to an integer (don’t cast), then shift right by some amount and subtract a constant. The following function does this, so each power of 2 in the input results in an increase of \(32=2^{\mbox{LOG2_BUCKET_SIZE}}\) in the output.

const static int LOG2_BUCKET_SIZE=5;

int32_t approx_log(float x) {
	uint32_t i=*reinterpret_cast<uint32_t*>(&x);
	return std::max(0,(int32_t)(i>>(23-LOG2_BUCKET_SIZE))
			-(0x3F800000>>(23-LOG2_BUCKET_SIZE)));
}

Approx Log!

The flattened function \(f\) will allow octants to be placed into buckets (about 3000 of them), so that the octants can be traversed in order from 0 working upwards, which is roughly a fitness first search. Before describing this in more detail I need one more concept - the idea of evaluated and unevaluated octants.

An unevaluated octant has the fields xhi, yhi, zhi, level, parent, density_eval_index, sub_octant_mask filled in - everything except the ‘payload’ of compact_star_index, compact_star_count. It can be thought of as a placeholder for an octant of interest to evaluate later. It typically doesn’t have any sub-octants.
An evaluated octant has all the fields filled in. In my case, that means that the stars have been generated and made ready to send to the the graphics card.

Here is the algorithm for performing a traversal. First the setup:

Create an array of buckets, where each bucket contains a empty list. For the moment I’m going to say that the buckets are of type vector<OctantIndex>, with the size of the array set to \(\mbox{max}(f)+1\) (called F_MAX+1, so the structure is:

array<vector<OctantIndex>,F_MAX+1> buckets;

Next post, I will replace this with something more efficient.

Set a level called the cutoff below which no octants will be added. This could be set to F_MAX to evaluate all octants.
Go through the underlying store, evaluating each octant with the function \(f\), and appending a reference to that octant to the vector buckets[f(...)]. If this is the first time through, make sure that the top octant in generated first.

Then the algorithm:

For each i in the range 0..F_MAX:
	For each octant associated with an entry in the vector `buckets[i]`
		Evaluate the octant
		For each bit in sub_octant_mask that is 0:
			Evaluate t=f(...) for the associated sub-octant
			If t < cutoff:
				Set the bit in `sub_octant_mask`
				Create the sub-octant (unevaluated)
				Add the sub-octant reference to the vector `buckets[t]`

Note that t might be the same as i so the bucket might grow during this loop and the new entries must be included.

Showing this in diagrams, with only 11 buckets and a small number of octants for simplicity, and a cutoff of 9 (so cutting off any octants with \(f \geq 9\)) and the following key for evaluated/unevaluated:

Traverse Key!

Here is what the buckets might look like after step 3:

Traverse Start!

After the first octant in buckets[0] has the new sub-octants (shown in red) added:

Traverse Fill!

After all the octants in buckets[0] have been run through the loop (note that the newly generated level 0 octant is included):

Traverse Fill2!

The result of this will be an approximately fitness first traversal - if this is used for rendering stars and viewed as it was running, it would show a sphere of stars approximately filling out from the center.

Dealing with resource constraints

It is sometimes necessary to impose resource constraints on this process to not have the user experience being too laggy, or not run out of memory. The three constraints that the Galaxy Generator uses are time, octant count or star count.

Maximum time

If there is a maximum runtime for this (say I want to try to achieve a particular frame rate), it is easy to just stop the traversal when time runs out. One risk of doing this if the time limit is too small is thrashing, where the same octants are tested over and over again. There are two techniques to mitigate this if \(F\) has not changed between executions, either or both of which can be used:

start the next execution where the last one left off.
ensure that some minimum amount of progress is made - say complete the next bucket before stopping.

Maximum total octant count

In order to run in bounded memory, the Galaxy Generator has a maximum total octant count. In order to generate new octants, less fit octants need to be deleted.

When there is an octant with t=f(...) to be created that would take the total over the maximum, do the following before adding it:

While t < cutoff:
	If buckets[cutoff] is not empty:
		Let x = the octant referenced by the last element of buckets[cutoff]
		clear the bit corresponding to x in the sub_octant_mask of the parent
		remove x
		remove the last element of buckets[cutoff]
		success, exit while loop
	else
		cutoff--;

If t < cutoff at the end of the loop, a less fit octant must have been removed, so the new one can be added, otherwise t == cutoff and the new octant is now over the cutoff and doesn’t get added. Also see below

Traverse Delete!

There are two reasons to remove from the back of vector representing the bucket rather than the front:

If an octant and a child of that octant have the same f, it is essential that the child is removed before the parent, otherwise the accuracy of the sub_octant_mask will get corrupted.
If buckets[cutoff] is a vector or something similar, it is much more efficient to remove from the back than the front.

Maximum total star count

This case is broadly similar to the maximum total octant count, except that more than one octant may have to be removed to free up enough stars for the new octant:

While t < cutoff:
	If buckets[cutoff] is not empty:
		Let x = the octant referenced by the last element of buckets[cutoff]
		clear the bit corresponding to x in the sub_octant_mask of the parent
		remove x
		remove the last element of buckets[cutoff]
		If now enough space for the new stars:
			success, exit while loop
	else
		cutoff--;

Ordering concerns when constructing the buckets

If octants are going to be removed, as in the case of maximum octant or star count, it is important that for a given bucket that a child is never earlier than the parent, or there is a risk of a parent being removed before the child, which would corrupt the accuracy of sub_octant_mask, resulting in subtle and hard to diagnose bugs.

This is no problem when adding new octants during traversal, as the parent octant will already exist, but what about when constructing the buckets from the underlying store in step 3 of the preparation for traversal? - there are no order guarantees there.

There are three solutions to this that I can think of:

Arrange the fitness function \(F\) and the flattening function so that \(f(\mbox{child}) > f(\mbox{parent})\), and a child is never in the same bucket as the parent - this is the technique that I use,
Sort each of the bucket vectors by increasing level after step 3 - there is possibly a speed penalty doing this,
Instead of constructing the buckets from an underlying store, construct it by going through the previous array of buckets in order - this is fast but takes up a little more memory.

Using even less memory

For my purposes, the Octant struct listed above was good enough, but what if I wanted to save more space?

It is possible to reduce the size of the Octant structure even further if maximum sizes of some of the values are known. For instance, if there are no more than 21 levels, xhi, yhi, zhi only need 21 bits of storage each, and can be packed into a 64 bit integer:

struct Octant {
	int64_t xhi:21, yhi:21 ,zhi:21;
	OctantIndex parent;
	DensityEvalIndex density_eval_index;
	ComactStarIndex compact_star_index;
	uint16_t compact_star_count;
	uint8_t sub_octant_mask;
	uint8_t level;
};

The total size is 24 bytes. (note that int32_t xhi:21, yhi:21 ,zhi:21 doesn’t work, as the packed values are not allowed cross an aligned boundary unless __attribute__ ((packed)) is used).

If other size limits are known, it may be possible to pack the structure even smaller. Something like the following might work (if all the values fit inside the allowed number of bits), but it’s going to take __attribute__ ((packed)) to fit it in a C++ structure, as values are going to cross 32 bit boundaries. Here is an example, that has a size of 20 bytes:

struct __attribute__ ((packed)) Octant {
	int32_t xhi:21;
	int32_t yhi_hi:21;
	int32_t zhi:21;
	uint32_t compact_star_index:24;
	uint32_t compact_star_count:16;
	uint32_t parent:24;
	uint32_t density_eval_index:19;
	uint32_t level:5;
	uint32_t sub_octant_mask:8;
};

Conclusion

With the techniques here I reduced a 120 byte structure to 48 bytes. I didn’t quite get the 112 byte structure to 28 bytes because I have some fields I didn’t discuss here (in order to try and keep the post under ~~3000~~3500 words), and I still have the star_list as a pointer which I will clean up sometime.

This, combined with the technique I will discuss in the next post, allowed the Galaxy Generator to max out at about 450 MB of memory and running reasonably smoothly in most cases - there is popping moving through dense star fields at high speed with the brightness turned up, but that’s going to be simply unavoidable on low-end hardware.

This is a good demonstration of why paying attention to data structure size is important, and thinking of ways of accessing it that can result in a smaller structure.

The next post is tentatively titled “Lists of Lists of Tiny Things”.

I welcome any comments, corrections, observations, or improvement suggestions. Please feel free to contact me

Introduction#

Contents#

Octrees and traversal#

Storage for reuse#

Storing the tree more efficiently#

Traversal#

Dealing with resource constraints#

Maximum time#

Maximum total octant count#

Maximum total star count#

Ordering concerns when constructing the buckets#

Using even less memory#

Conclusion#