Introduction
While building my Galaxy Generator, one of the performance bottlenecks was in the generation of Poisson random variates to determine how many stars should be placed inside a voxel. This post looks at how to write a fast crossplatform Poisson random variate generator.
First I motivate this by writing about why use Poisson random variates at all, then I explore the floating point case (using double
) which gives some good results, but may not reliably generate the same results crossplatform. Lastly I look at integer only versions of the Poisson random variate algorithms that can safely used crossplatform.
Using integer only code might also offer advantages in systems (say embedded) where floating point support is not available.
At the time of writing the vectorization part of this, I hadn’t found Wolfgang Hörmann’s paper, but the results from that have now been incorporated.
The final versions of this code is available at GitHub:

poisson_random_variate_integer.c
contains the best integer version, 
poisson_random_variate_double.c
contains the best double version.
Contents
 Introduction
 Contents
 Requirements
 Why use Poisson random variates?
 Floating point (double) implementations
 Integer implementations
 Conclusion
Requirements
The Galaxy Generator is written using is written using C++ and Vulkan, and primarily targeted at Android phones, although there are Linux and Windows versions too. In particular:

The code needs to compile without warnings with Wall and Wextra on g++ (Linux), clang++ (Android and Linux) and with most of the warnings turned on for Visual Studio (Windows).

It needs to generate exactly the same results on all platforms, so I can leave the door open to users switching platforms, or one day having multiple people exploring the same galaxy. This makes use of floating point operations problematic, and for safety, everything needs to be done with integers.

I want to support any 64bit ARM (ARMv8A or later).

I want to support any x8664 bit CPU that supports SSE4.1 and BMI2 (Haswell, Excavator or later). I may later remove the BMI2 requirement to support even older processors.

It needs to be fast.
Why use Poisson random variates?
Say I want to generate a certain number of stars in a particular volume, and have it be reproducible. For an example, I will say that I want to generate approximately 256 stars in a 16x16 area. The easy way to do it is to just to pick a seed and generate all the stars with pseudorandom x and y coordinates. I get something like this:
In the full problem I want to solve (a 400 billion star galaxy), this is not feasible  even the most heavily optimized generation would take many hours and several terabytes of storage would be required. This is not going to happen on a mobile phone, so the galaxy needs to be lazily generated in pieces, focusing on generating stars that may be visible. It is also important that the order of generation doesn’t matter  I don’t want to find different star fields in a volume depending on the route I use to arrive there.
Translating to the 16x16 area example, I want to be able to pick some of the 1x1 areas (which I will call boxes) in any order, and get the same overall layout no matter which order I evaluate them in. The easy way of doing that would be to have a way of generating a seed for the pseudorandom generator for each box, and generate an average number of stars (in this case one star) using that seed. The results look a lot less random (as if the stars were repelling each other a little):
Here is the same layout with the box boundaries superimposed, showing one star per box:
The problem here is the one star per box.
I observe that if stars exist with a density \(\rho\), then (assuming that the placement of stars is completely random, which is not quite physically true due to gravity effects and two stars not being able to occupy the same location, but these effects are tiny) the number of stars in a particular volume (or area) will follow something called the Poisson Distribution. The formula for the Poisson distribution is
\[P(k;\lambda)=\frac{\lambda^k e^{\lambda}}{k!}\mbox{ where }\lambda=\rho\times\mbox{volume}\mbox{ (or }\lambda=\rho\times\mbox{area}\mbox{)}\]The number of stars in each box now needs to be a random number that fits on this distribution  a Poisson random variate. When I do this, I get the following (first with the box boundaries superimposed, then without):
This looks very similar to the first image, but now the boxes can be generated in any order and the same distribution of stars will be produced.
Note that something gets lost in this process  with each box having a random number of stars, the total may not be 256 any more, and this will be hard to correct. Anyone who counted the stars will have already seen that there are 252 in this example. One of the properties of the Poisson distribution is that
\[\mbox{variance}(X)=\mbox{expected}(X)=\lambda\]and for large \(\lambda\) the Poisson distribution can be approximated by the normal distribution, so for the 256 star example, the standard deviation \(\sigma=\sqrt{256}=16\). For 400 billion stars, the standard deviation becomes relatively tiny: \(\sigma=\sqrt{4\times 10^{11}}\approx600000\). To put it more simply, the variations average out.
The rest of this discussion is concerned with generating the Poisson random variates.
Why performance is important
When the naive implementation was used (described as ‘Knuth int64’ below), a substantial chunk of the CPU time (I don’t have records, but I seem to recall that in excess of 10% of the executed instructions according to Valgrind) was spent in the Poisson random variate function. Why is this so high, given that even in the simplest implementations, the time taken is \(O(\lambda)\) and the loop should be fast? One of the steps of star generation (that is otherwise outside the scope of this discussion) is that first the number of potential stars in a volume is calculated (a Poisson random variate), the potential stars are placed, then a density function is applied as a probability so that the distribution fits the shape of the galaxy. This is very generic in that any shaped distribution of stars can be generated, but many potential stars are generated then thrown out, so the Poisson random variate generation, star placement and filtering need to be as efficient as possible.
Floating point (double) implementations
A good starting place is to look at the currently available implementations. All the ones that I have found work on floating point numbers (double
), and therefore can’t be guaranteed to return the same results on different compiler versions, compiler flags or platforms.
Internal C++ implementation
The C++ standard library has supported Poisson random variates since C++11. Note that using it for this is a little offbrand: the implementation seems to be best suited to setting up a generator with particular \(\lambda\) then generating many variates, whereas procedural generation will mean the seed will have to be set every time to ensure repeatability, and because lambda will vary, I need to create a std::poisson_distribution
every time.
// set this up somewhere in advance
std::default_random_engine generator;
uint32_t poisson_random_variable_internal(uint64_t seed, double lambda) {
// Generate a Poisson random variate
generator.seed(seed);
std::poisson_distribution<int> distribution(lambda);
return distribution(generator);
}
Knuth version
Donald Knuth presents an algorithm for generating Poisson random variates in The Art of Computer Programming. Vol. 2  Seminumerical Algorithms. Here is a version based on a slightly modified version of that algorithm that prevents underflow. The time taken for this is \(O(\lambda)\).
uint32_t poisson_random_variable_knuth_double(uint64_t* seed, double lambda) {
if(lambda<=0) {
return 0;
}
uint32_t ret=0;
while(lambda>0) {
double l=std::min(lambda,500.0);
lambda=l;
double L=exp(l);
double p=fast_rand_double(seed);
while(p>L) {
ret++;
p*=fast_rand_double(seed);
}
}
return ret;
}
where fast_rand_double
returns a number in the range [0,1) and is defined as (fast_rand64
will be defined below):
double fast_rand_double(uint64_t* seed) {
// yes, the compiler does the right thing and turns it into a multiplication
return fast_rand64(seed)/18446744073709551616.0;
}
Hörmann version
Wolfgang Hörmann describes methods for generating fast Poisson random variates in The transformed rejection method for generating Poisson random variables in Insurance: Mathematics and Economics, Volume 12, Issue 1, February 1993, Pages 3945. A non paywalled version is here. He describes two algorithms, PTRD (Poisson Transformed Rejection with Decomposition) and PTRS (Poisson Transformed Rejection with Squeeze). I implemented PTRD.
This algorithm assumes that uniform random variates U and V are in the range (0,1). Because I generate uniform random variates in the range [0,1), I put in an extra test that us!=0
so that a division by zero doesn’t happen in the following step.
Note that this is only accurate for \(10 \leq \lambda \leq 10^8\), which is plenty  if I needed \(\lambda\) above \(10^8\) (I don’t), the Gaussian would be a very good approximation. \(\lambda\) is given the value u
in the following code.
uint64_t poisson_random_variable_hormann_double(uint64_t* seed, double u) {
while(true) {
double smu=std::sqrt(u);
double b=0.931+2.53*smu;
double a=0.059+0.02483*b;
double vr=0.92773.6224/(b2.0);
double V=fast_rand_double(seed);
if(V<0.86*vr) {
double U=V/vr0.43;
double us=0.5abs(U);
return std::floor((2.0*a/us+b)*U+u+0.445);
}
double t=fast_rand_double(seed);
double U;
if(V>=vr) {
U=t0.5;
} else {
U=V/vr0.93;
U=((U<0)?0.5:0.5)U;
V=t*vr;
}
double us=0.5abs(U);
if(us==0  (us<0.013 && V>us)) { // add test to deal with us==0 case
continue;
}
double k=std::floor((2.0*a/us+b)*U+u+0.445);
double inv_alpha=1.1239+1.1328/(b3.4);
V=V*inv_alpha/(a/(us*us)+b);
if(k>=10.0) {
if(std::log(V*smu)<=(k+0.5)*std::log(u/k)u
std::log(std::sqrt(2*M_PI))
+k(1.0/12.01.0/(360*k*k))/k) {
return k;
}
} else if(0<=k && std::log(V)<k*std::log(u)ulog_fact_table[k]) {
return k;
}
}
}
Floating point results
I next ran these on my machine, an Intel(R) Core(TM) i54340M CPU @ 2.90GHz
(Haswell), running at about 3.5 GHz under load, with \(\lambda\) ranging from 1 to 200. I ran it for 10,000,000 iterations each, and the results are the average number of nanoseconds taken to run. Here are the results for selected \(\lambda\):
\(\lambda\)  1  10  25  50  100  200 

Internal  31  127  220  211  210  208 
Knuth double  22  75  151  277  529  1033 
Hörmann double    56  44  41  38  37 
As expected, Knuth’s algorithm having an execution time of \(O(\lambda)\) performs quickly on small \(\lambda\) and slowly on large \(\lambda\).
What might be a little more surprising is how much faster Hörmann’s algorithm is than the one provided by the C++ standard library (for higher \(\lambda\), about 37 ns vs. 208 ns  over 5 times as fast). I was unable to track down a reference to the algorithm used by the standard library (I found the source code, but I want a solution that is unencumbered by the GNU Public License, which rules out reverse engineering this), but it seems unlikely that an algorithm was selected that wasn’t the fastest available. Hörmann’s PTRD algorithm is used by Boost.
Possible reasons that my code is faster:

The standard library is written for a very generic case  there are multiple options for random number generators and multiple distributions available. My code is specialized for one case, and doing that frequently makes code faster, commonly by a factor of two or more. Because my code is in one place, there are minimal function calls and no virtual method calls. This avoids call overhead and also may allow the compiler to optimize better than would otherwise be possible,

It looks like the standard library is optimized for creating a generator with a particular \(\lambda\) and then reusing that generator. This includes precalculating and storing some \(\lambda\) dependent values. My use case might have a different \(\lambda\) each time, so all that precalculation is repeated, and I also set the seed every time. My implementation doesn’t store anything except the static table for
log_fact_table
so avoids that overhead. 
I may be using a worse but faster random number generator. High quality random numbers seem to not be needed for this.
In any case, although I now have a very fast Poisson random variate generator (use Knuth for \(\lambda<10\), Hörmann for \(\lambda \geq 10\)), it does not solve the problem I set out to solve, as any nontrivial code using floating point may return different results on different platforms. Part of the requirements are that exactly the same results are returned on all platforms, so I am going to have to turn to the integers.
Integer implementations
Infrastructure
There is some infrastructure setup required before being able to produce integer code for these algorithms.
Count leading zeros and popcount
The gcc/clang and Visual Studio compilers provide differently named built in functions to access the bsr
(lzcnt
on later processors) and popcnt
, and I am going to need these:
#ifdef _MSC_VER // windows
#define clz64 (uint32_t)__lzcnt64
#define clz32 (uint32_t)__lzcnt
#define popcount __popcnt
#elif defined(__GNUC__)  defined(__clang__) // gcc/clang
#define clz64 __builtin_clzll
#define clz32 __builtin_clz
#define popcount __builtin_popcount
#endif
128 bit multiplication
gcc and clang support 128 bit integers, whereas Visual Studio does not.
As a frequent user of 64 bit fixed point, it’s not ideal to have to write a compatibility layer to support Visual Studio, but here we are  the suggestion in that support thread to just use intrinsics adds complexity, particularly when building cross platform. Luckily, I’m only going to need a few versions of unsigned multiplication:

multu64hilo
to do a 64 bit x 64 bit unsigned multiplication, returning the 128 bit result in two 64 bit chunks, 
multu64hi
to do a 64 bit x 64 bit unsigned multiplication, returning only the high 64 bits. 
mults64hi
to do a 64 bit x 64 bit signed multiplication, returning only the high 64 bits. 
fixed_mult64u
to do a 64 bit x 64 bit unsigned multiplication, returning only bits 32..95. 
fixed_mult64s
to do a 64 bit x 64 bit multiplication, returning only bits 32..95.
The last two of these are multiplying 32.32 fixed point values together and return a 32.32 fixed point result.
Here are the definitions that I need (both gcc/clang and Visual Studio generate the same instructions under the hood):
#ifdef _MSC_VER // Windows
static inline uint64_t multu64hi(uint64_t x, uint64_t y) {
unsigned __int64 ret;
_umul128(x, y, &ret);
return ret;
}
static inline int64_t mults64hi(int64_t x, int64_t y) {
__int64 ret;
_mul128(x, y, &ret);
return ret;
}
static inline void multu64hilo(uint64_t x, uint64_t y, uint64_t* rhi, uint64_t* rlo) {
*rlo=_umul128(x, y, rhi);
}
static inline int64_t fixed_mult64s(int64_t x, int64_t y) {
int64_t rhi;
int64_t rlo = _mul128(x, y, &rhi);
return (int64_t)__shiftleft128(rlo, rhi, 32);
}
static inline int64_t fixed_mult64u(uint64_t x, uint64_t y) {
uint64_t rhi;
uint64_t rlo = _umul128(x, y, &rhi);
return (uint64_t)__shiftleft128(rlo, rhi, 32);
}
#elif defined(__GNUC__)  defined(__clang__) // gcc/clang
static inline uint64_t multu64hi(uint64_t x,uint64_t y) {
return (uint64_t)((((unsigned __int128)x)*y)>>64);
}
static inline int64_t mults64hi(int64_t x,int64_t y) {
return (int64_t)((((__int128)x)*y)>>64);
}
static inline void multu64hilo(uint64_t x,uint64_t y,uint64_t* rhi,uint64_t* rlo) {
unsigned __int128 ret=((unsigned __int128)x)*y;
*rhi=ret>>64;
*rlo=ret;
}
static inline int64_t fixed_mult64s(int64_t x,int64_t y) {
return (int64_t)((((__int128)x)*y)>>32);
}
static inline int64_t fixed_mult64u(uint64_t x,uint64_t y) {
return (uint64_t)((((unsigned __int128)x)*y)>>32);
}
#endif
A fast random number generator
This doesn’t need to be all that good. Here’s the one that I’m using (using the multu64hilo
defined above).
static inline uint64_t fast_rand64(uint64_t* seed) {
*seed += 0x60bee2bee120fc15ULL;
uint64_t hi,lo;
multu64hilo(*seed,0xa3b195354a39b70dULL,&hi,&lo);
uint64_t m1 = hi^lo;
multu64hilo(m1,0x1b03738712fad5c9ULL,&hi,&lo);
uint64_t m2 = hi^lo;
return m2;
}
Integer versions of square root and transcendental functions
Completely integer based fixed point versions of sqrt()
, exp()
and log()
will be needed by some of following code. These are not provided by the standard library, so I have provided these myself, the sqrt()
using Newton’s Method, the others using the Remez Algorithm and the tool lolremez to create the coefficients.
static inline uint32_t multu32hi(uint32_t x,uint32_t y) {
return (((uint64_t)x)*y)>>32;
}
static inline uint32_t mults32hi(int32_t x,int32_t y) {
return (((int64_t)x)*y)>>32;
}
// inputs and outputs are both 32.32 fixed point integers
uint64_t fixed_sqrt_32_32(uint64_t x) {
uint32_t lead=clz64(x)>>1;
x<<=(lead<<1);
if((1ULL<<62)==x) {
return x>>(lead+15);
}
// x is a 2.62 unsigned fixed in the range (1,4)
// result is 1.63 unsigned fixed in the range (0.5,1)
// start with a linear interpolation correct at the endpoints
// 7/6  1/6 x, so 1>1, 4>0.5
uint64_t y=3074457345618258602ULLmultu64hi(x,12297829382473034410ULL);
// now do some NewtonRaphson
// y=y*(3/21/2*x*y*y)
// Maximum error for #iterations:
// 0 ~0.2
// 1 ~0.06
// 2 ~0.005
// 3 ~3e5
// 5 ~1e9 (about 30 bits  limit of the algorithm)
y=multu64hi(y,0xC000000000000000ULL((multu64hi(multu64hi(y,y),x))))<<1;
y=multu64hi(y,0xC000000000000000ULL((multu64hi(multu64hi(y,y),x))))<<1;
y=multu64hi(y,0xC000000000000000ULL((multu64hi(multu64hi(y,y),x))))<<1;
// dont shift left on the last one
y=multu64hi(y,0xC000000000000000ULL((multu64hi(multu64hi(y,y),x))));
return multu64hi(y,x)>>(lead+14);
}
// input is 0.32 fixed representing a range [0,1)
// operation is 2^x
// result is a 2.30 fixed representing a range [1,2)
// static inline uint32_t p_exp2_32_internal(uint32_t x) {
uint32_t u=58831021U;
u=multu32hi(u,x)+222008398U;
u=multu32hi(u,x)+1037829222U;
u=multu32hi(u,x)+(2976266834U+0x7C4F);
return (multu32hi(u,x)>>2)+0x40000000U;
}
// inputs and outputs are both 32.32 fixed point integers
int64_t log_64_fixed(uint64_t lx) {
int32_t lead=clz64(lx);
int32_t x=((lx<<lead)>>32)0x80000000ULL;
// x is a 1.63 unsigned fixed in the range [0,1)
// calculate ln2(x+1)
// result is 1.63 unsigned fixed in the range [0,1)
int32_t u= 19518282;
u=mults32hi(u<<1,x) +109810370;
u=mults32hi(u<<1,x) 291900857;
u=mults32hi(u<<1,x) +516277066;
u=mults32hi(u<<1,x) 744207376;
u=mults32hi(u<<1,x) +1027494097;
u=mults32hi(u<<1,x) 1548619616;
u=mults32hi(u,x)+ (1549074032+93);
uint64_t d=mults32hi(u,x)<<3;
return mults64hi((d+((31LLlead)<<32)),6393154322601327829LL)<<1;
}
Knuth, with integers
Now I have the tools to write a Poisson random variate generator using entirely integer operations. It is largely similar to the floating point version, except instead of accumulating a number that gets smaller and smaller, I use clz64
and shift operations to normalize the result to [0.5,1) every time (treating it as a 0.64 fixed point).
const uint64_t P_LN2_INV_2_POW_63=13306513097844322492ULL;
// lambda is fixed 32.32
uint32_t poisson_random_variable_knuth_int64(uint64_t* seed, uint64_t lambda) {
if(lambda<=0) {
return 0;
}
uint64_t num_digits=multu64hi(lambda,P_LN2_INV_2_POW_63)<<1;
int32_t int_digits=num_digits>>32;
uint64_t start=((uint64_t)p_exp2_32_internal(num_digits&0xFFFFFFFF))<<33;
uint32_t ret=1;
while(int_digits>=0) {
uint64_t x=(fast_rand64(seed)1);
start=multu64hi(start,x);
uint32_t z=clz64(start);
int_digits=z;
start<<=z;
ret++;
}
return ret;
}
This performs well compared to the version using doubles.
\(\lambda\)  1  10  25  50  100  200 

Knuth double  22  75  151  277  529  1033 
Knuth int64  17  43  82  148  279  542 
A question arises  do I really need to use 64 bits? How about 32? 16? 8? How about vectorizing? (The answers are no, yes, yes, sort of, and yes).
It turns out that 32 bit and a 16 bit versions work just fine. Here’s the 16 bit version. I tried an alternate version where a fresh random number was generated every loop instead of splitting a 64 bit random number into 4, but performance was about the same. This might be improved by using a specifically 16 or 32 bit random number generator.
// lambda is fixed 32.32
uint32_t poisson_random_variable4_knuth_int16(uint64_t* seed, uint64_t lambda) {
if(lambda<=0) {
return 0;
}
uint64_t num_digits=multu64hi(lambda,P_LN2_INV_2_POW_63)<<1;
int32_t int_digits=num_digits>>32;
uint16_t start=p_exp2_32_internal(num_digits&=0xFFFFFFFF)>>15;
int32_t ret=1;
uint32_t i=4;
uint64_t r=0;
while(int_digits>=0) {
if(i==4) {
r=fast_rand64(seed);
i=0;
}
start=(((uint32_t)start)*(((uint32_t)r)&0xFFFF)>>16)1U;
uint32_t z=clz32(start)16;
int_digits=z;
start<<=z;
ret++;
i++;
r>>=16;
}
return ret;
}
There is a problem with 8 bits  the results are lower than expected, and even if this is adjusted for (by scaling \(\lambda\)), the distribution is too narrow. This can be sort of fixed by adjusting the scaled \(\lambda\) up by some integer amount, and the result down, but it’s not generating particularly high quality results (although it’s probably plenty good for something like procedurally generated stars or other artifacts if speed was really needed).
What I do is add a small integer increment (approximately \(0.0242\lambda\) at the start and store the same value to subtract from the result, then multiply \(\lambda\) by 1.000121. These values were found by trial and error to give something approximately right. I am not recommending this  I just wanted to see how close I could get it, as an accurate 8 bit version would really help when vectorizing is done.
//lambda is fixed 32.32
uint32_t poisson_random_variable_knuth_int8(uint64_t* seed, uint64_t lambda) {
if(lambda<=0) {
return 0;
}
lambda=multu64hi(9280556943483275418ULL,(lambda<<1));
uint64_t lambda_diff=(((lambda)>>32)*104000000+0x80000000U)>>32;
lambda+=lambda_diff<<32;
... // largely the same in here
return ret(uint32_t)lambda_diff;
}
\(\lambda\)  1  10  25  50  100  200 

Knuth double  22  75  151  277  529  1033 
Knuth int64  17  43  82  148  279  542 
Knuth int32  16  45  89  160  306  597 
Knuth int16  17  55  111  204  392  769 
Knuth int8 (imprecise)  20  58  118  210  400  784 
Knuth, with integers and vectorization
As the integer width reduces, the performance reduces. However, the narrower the integer, the more operations that can be done at once with vectorization, which should more than make up for this performance reduction.
Here is the vectorization of the 16 bit version using SSE. There is a lot of work to do at the ends:

At the start, the ’leftover’ part (the noninteger part of \(\lambda log_2e\)) is placed in one of the lanes, the other lanes are set to 0xFFFF (as close as possible to 1).

One iteration is performed outsize the loop to get everything set up (including a lot of previous state),

The inner loop is set to overrun, and enough previous state (
old_old_start_flag
,old_old_int_digits
,old_old_rand
) is kept to back up one iteration, multiply all the lanes (horizonal_mult8_16_corr
) together, then step through without vectorization to the right place.
Why not use AVX? It would double the width of the vector for little extra effort, therefore improving the speed on x8664. I chose not to do this, as Android is the primary target for this application, and ARM NEON does not support more than 128 bit vectors.
union variant16 {
#if __x86_64
__m128i v;
#elif __aarch64__
uint16x8_t v;
#endif
uint16_t s16[8];
uint64_t s64[2];
};
#if __x86_64
typedef __m128i uint16x8_t;
#elif __aarch64__
#else
typedef variant16 uint16x8_t;
#endif
// helper fuction to multiply all the lanes together
static inline uint64_t horizonal_mult8_16_corr(uint16x8_t x) {
#if __x86_64  _M_X64
union variant16 u;
u.v=x;
uint64_t startx0=u.s16[0];
uint64_t startx1=u.s16[4];
for(uint32_t i=1;i<4;i++) {
startx0*=u.s16[i];
startx1*=u.s16[i+4];
}
#elif __aarch64__
uint32x4_t t1=vmull_u16(vget_low_u16(x),vget_high_u16(x));
uint64x2_t t2=vmull_u32(vget_low_u32(t1),vget_high_u32(t1));
uint64_t startx0=vdupd_laneq_u64(t2,0);
uint64_t startx1=vdupd_laneq_u64(t2,1);
#else
uint64_t startx0=x.s16[0];
uint64_t startx1=x.s16[4];
for(uint32_t i=1;i<4;i++) {
startx0*=x.s16[i];
startx1*=x.s16[i+4];
}
#endif
return multu64hi(startx0,startx1);
}
// lambda is fixed 32.32
uint32_t poisson_random_variable_knuth_int16_sse(uint64_t* seed, uint64_t lambda) {
if(lambda<=0) {
return 0;
}
uint64_t num_digits=multu64hi(lambda,P_LN2_INV_2_POW_63)<<1;
int32_t int_digits=num_digits>>32;
num_digits&=0xFFFFFFFFULL;
// e^(ln(2)*x) = (e^ln(2))^x = 2^x
uint32_t r7=p_exp2_32_internal((uint32_t)num_digits)>>15;
int32_t ret=1;
uint16_t old_start_flag=r7;
int32_t old_int_digits=int_digits;
#if __x86_64  _M_X64
__m128i zero=_mm_setzero_si128();
__m128i const_1=_mm_set1_epi16(1);
__m128i const_FF00=_mm_set1_epi16(0xFF00U);
__m128i const_F000=_mm_set1_epi16(0xF000U);
__m128i const_C000=_mm_set1_epi16(0xC000U);
__m128i const_8000=_mm_set1_epi16(0x8000U);
__m128i old_start=_mm_insert_epi16(_mm_set1_epi16(0xFFFFU),r7,0);
uint64_t a=fast_rand64(seed);
uint64_t b=fast_rand64(seed);
__m128i old_rand=_mm_set_epi64x(b,a);
__m128i startx=_mm_insert_epi16(old_rand,
(((uint32_t)_mm_extract_epi16(old_rand,0))*r7)>>16,0);
startx=_mm_or_si128(startx,const_1);
__m128i clz_select8=_mm_cmpeq_epi16(_mm_and_si128(startx,const_FF00),zero);
startx=_mm_blendv_epi8(startx,_mm_slli_epi16(startx,8),clz_select8);
__m128i clz_select4=_mm_cmpeq_epi16(_mm_and_si128(startx,const_F000),zero);
startx=_mm_blendv_epi8(startx,_mm_slli_epi16(startx,4),clz_select4);
__m128i clz_select2=_mm_cmpeq_epi16(_mm_and_si128(startx,const_C000),zero);
startx=_mm_blendv_epi8(startx,_mm_slli_epi16(startx,2),clz_select2);
__m128i clz_select1=_mm_cmpeq_epi16(_mm_and_si128(startx,const_8000),zero);
startx=_mm_blendv_epi8(startx,_mm_slli_epi16(startx,1),clz_select1);
uint32_t t=popcount(_mm_movemask_epi8(clz_select8));
t=popcount(_mm_movemask_epi8(clz_select4))+t+t;
t=popcount(_mm_movemask_epi8(clz_select2))+t+t;
t=popcount(_mm_movemask_epi8(clz_select1))+t+t;
int_digits=(t>>1);
#elif __aarch64__
uint16x8_t const_1=vdupq_n_u16(1);
uint16x8_t old_start=vsetq_lane_u16(r7,vdupq_n_u16(0xFFFF),0);
uint64_t a=fast_rand64(seed);
uint64_t b=fast_rand64(seed);
uint16x8_t old_rand=vcombine_u16(vcreate_u16(a),vcreate_u16(b));
uint16x8_t startx=vsetq_lane_u16((((uint32_t)vgetq_lane_u16(old_rand,0))*r7)>>16,
old_rand,0);
startx=vorrq_u16(startx,const_1);
uint16x8_t zz=vclzq_u16(startx);
startx=vshlq_u16(startx,vreinterpretq_s16_u16(zz));
int_digits=vaddvq_u16(zz);
#else // don't know what the processor is
variant16 startx,old_start,old_rand;
old_rand.s64[0]=(fast_rand64(seed));
old_rand.s64[1]=(fast_rand64(seed));
startx=old_rand;
startx.s16[0]=((((uint32_t)old_rand.s16[0])*r7)>>16)1U;
for(uint32_t i=0;i<8;i++) {
old_start.s16[i]=0xFFFF;
uint16_t x=startx.s16[i](uint16_t)1;
int32_t z=clz32(x)16;
int_digits=z;
x<<=z;
startx.s16[i]=x;
}
old_start.s16[0]=r7;
#endif
ret += 8;
uint16x8_t old_old_start=old_start;
uint16_t old_old_start_flag=old_start_flag;
int32_t old_old_int_digits=old_int_digits;
uint16x8_t old_old_rand=old_rand;
while (int_digits >= 0) {
old_old_start=old_start;
old_old_start_flag=old_start_flag;
old_old_int_digits=old_int_digits;
old_old_rand=old_rand;
old_start=startx;
old_start_flag=0;
old_int_digits=int_digits;
#if __x86_64  _M_X64
uint64_t a=fast_rand64(seed);
uint64_t b=fast_rand64(seed);
old_rand=_mm_set_epi64x(b,a);
startx=_mm_mulhi_epu16(startx,old_rand);
startx=_mm_or_si128(startx,const_1);
__m128i clz_select8=_mm_cmpeq_epi16(_mm_and_si128(startx,const_FF00),zero);
startx=_mm_blendv_epi8(startx,_mm_slli_epi16(startx,8),clz_select8);
__m128i clz_select4=_mm_cmpeq_epi16(_mm_and_si128(startx,const_F000),zero);
startx=_mm_blendv_epi8(startx,_mm_slli_epi16(startx,4),clz_select4);
__m128i clz_select2=_mm_cmpeq_epi16(_mm_and_si128(startx,const_C000),zero);
startx=_mm_blendv_epi8(startx,_mm_slli_epi16(startx,2),clz_select2);
__m128i clz_select1=_mm_cmpeq_epi16(_mm_and_si128(startx,const_8000),zero);
startx=_mm_blendv_epi8(startx,_mm_slli_epi16(startx,1),clz_select1);
uint32_t t=popcount(_mm_movemask_epi8(clz_select8));
t=popcount(_mm_movemask_epi8(clz_select4))+t+t;
t=(popcount(_mm_movemask_epi8(clz_select1))>>1)
+popcount(_mm_movemask_epi8(clz_select2))+t+t;
int_digits=t;
#elif __aarch64__
uint64_t a=fast_rand64(seed);
uint64_t b=fast_rand64(seed);
old_rand=vcombine_u16(vcreate_u16(a),vcreate_u16(b));
uint32x4_t mul_lo=vmull_u16(vget_low_u16(startx),vget_low_u16(old_rand));
uint32x4_t mul_hi=vmull_high_u16(startx,old_rand);
startx=vcombine_u16(vshrn_n_u32(mul_lo,16),vshrn_n_u32(mul_hi,16));
uint16x8_t mult=vorrq_u16(startx,const_1);
uint16x8_t z=vclzq_u16(mult);
startx=vshlq_u16(mult,vreinterpretq_s16_u16(z));
int_digits=vaddvq_u16(z);
#else // don't know what the processor is
#warning noopt
old_rand.s64[0]=(fast_rand64(seed));
old_rand.s64[1]=(fast_rand64(seed));
for(uint32_t i=0;i<8;i++) {
uint32_t x=startx.s16[i];
x=((x*old_rand.s16[i])>>16)1U;
int32_t z=clz32(x)16;
int_digits=z;
x<<=z;
startx.s16[i]=x;
}
#endif
ret+=8;
}
//__asm__("int3");
union variant16 urand;
ret=8;
uint64_t start64=horizonal_mult8_16_corr(old_start);
int32_t z=clz64(start64);
if(old_start_flag==0 && old_int_digits<z) {
ret=8;
int_digits=old_old_int_digits;
#if __x86_64  _M_X64  __aarch64__
urand.v=old_old_rand;
#else
urand=old_old_rand;
#endif
old_start_flag=old_old_start_flag;
start64=horizonal_mult8_16_corr(old_old_start);
z=clz64(start64);
} else {
int_digits=old_int_digits;
#if __x86_64  _M_X64  __aarch64__
urand.v=old_rand;
#else
urand=old_rand;
#endif
}
uint16_t start;
if(old_start_flag==0) {
int_digits=z;
start=(uint16_t)(start64>>(48z));
} else {
start=old_start_flag;
}
uint32_t i=0;
while(int_digits>=0 && i<8) {
start=((((uint32_t)start)*urand.s16[i++])>>16)1U;
int32_t z=clz32(start)16;
int_digits=z;
start<<=z;
ret++;
}
return ret;
}
It is illustrative to look at the inner loop to see how the NEON and SSE instructions compare with each other.
In general, the NEON instructions are reasonably orthogonal, and there aren’t very many surprises  if something exists for one size/signedness, it likely exists for the others. It is also well typed, with the an expressive list of types (float64x1_t
and float64x2_t
are missing from this list), which makes coding a little more verbose but will catch many errors at compile time. Coding using ARM NEON feels like eating at a well stocked salad bar, where the ingredients are mostly simple, but there are lots of them, they go well together, and you can mix and match to taste.
The SSE and AVX instructions on the other hand are way less orthogonal, certainly for integer operations. There are many operations that only exist for a range of the sizes or signedness  it looks like operations have been added as they seem useful from the designers’ point of view. There are however some more powerful instructions, such as the movemask for 8, 32 and 64 bit values (not for 16) that can be useful. The typing is also much weaker  there is __m128
,__m128
for single precision floats, __m128d
,__m256d
for double precision floats, and __m128i
,__m256i
for all the different types of integer. Coding for SSE or AVX feels like eating at a Smorgasbord, where there are some simple options and also some fancier dishes that you couldn’t create yourself, but you have to spend much more time looking at the options, and while there might be the prebuilt thing that you need, there are also gaps where it is harder to create what you want. The lack of good typing also makes it much easier to make a culinary surprise (something inedible).
The following breaks the inner loop into sections, comparing ‘Agnostic’ (plain C), ARM intrinsics (using NEON), and x8664 intrinsics (using SSE up to 4.1 and popcount). If there would be different choices required because 8 bit and 16 bit components have different instruction support, there may be a x8664 (8 bit case) section.
All CPUs:
while (int_digits >= 0) {
// These are sse registers
old_old_start=old_start;
old_old_rand=old_rand;
old_start=startx;
// Thse are regular registers
old_old_int_digits=old_int_digits;
old_int_digits=int_digits;
old_old_start_flag=old_start_flag;
old_start_flag=0;
Loading old_rand with a new random number.
Agnostic:
old_rand.s64[0]=(fast_rand64(seed));
old_rand.s64[1]=(fast_rand64(seed));
ARM:
vcombine_u8
has the low word first.
uint64_t a=fast_rand64(seed);
uint64_t b=fast_rand64(seed);
old_rand=vcombine_u16(vcreate_u16(a),vcreate_u16(b));
x8664:
_mm_set_epi64x
has the high word first. It’s easy to miss this in the documentation, and I was getting different results than on ARM until I spotted it.
uint64_t a=fast_rand64(seed);
uint64_t b=fast_rand64(seed);
old_rand=_mm_set_epi64x(b,a);
Multiply the random number by start, taking the high byte of the result.
Agnostic:
for(uint32_t i=0;i<8;i++) {
uint32_t x=startx.s16[i];
x=((x*old_rand.s16[i])>>16)1U;
ARM:
ARM NEON does not support multiplying two vectors of 8 bit integers and returning the high part as a vector of 8 bit integer (Helium does with [__arm_]vmulhq[_u16]
). Instead, the lower 4 values and upper 4 values are dealt with separately and recombined.
// 32 bit results of multiplying the low part of the vector
uint32x4_t mul_lo=vmull_u16(vget_low_u16(startx),vget_low_u16(old_rand));
// 32 bit results of multiplying the low part of the vector
uint32x4_t mul_hi=vmull_high_u16(startx,old_rand);
// recombine
startx=vcombine_u16(vshrn_n_u32(mul_lo,16),vshrn_n_u32(mul_hi,16));
uint16x8_t mult=vorrq_u16(startx,const_1);
x8664:
startx=_mm_mulhi_epu16(startx,old_rand);
startx=_mm_or_si128(startx,const_1);
x8664 (8 bit case)
SSE does not support multiplying two vectors of 8 bit integers and returning the high part as a vector of 8 bit integer. One of the nonorthogonalities of SSE/AVX is that there are frequently operations that exist on one integer size, but not others In this case there is a _mm_mulhi_epu16
that works on 16 bit integers, but not a _mm_mulhi_epu8
that works on 8 bits. Instead, the lower 8 values and upper 8 values are dealt with separately and recombined (similar to ARM NEON).
__m128i start_lo=_mm_unpacklo_epi8(startx,zero);
__m128i start_hi=_mm_unpackhi_epi8(startx,zero);
__m128i x_lo=_mm_unpacklo_epi8(old_rand,zero);
__m128i x_hi=_mm_unpackhi_epi8(old_rand,zero);
// 16 bit results of multiplying the low part of the vector
start_lo=_mm_srli_epi16(_mm_add_epi16(_mm_mullo_epi16(start_lo,x_lo),const_A7),8);
// 16 bit results of multiplying the low part of the vector
start_hi=_mm_srli_epi16(_mm_add_epi16(_mm_mullo_epi16(start_hi,x_hi),const_A7),8);
// recombine
startx=_mm_or_si128(_mm_packus_epi16(start_lo,start_hi),const_1);
Count leading zeros, store the count and shift left so there are no leading zeros.
Agnostic:
int32_t z=clz32(x)24;
x<<=z;
startx.s8[i]=x;
ARM:
The componentwise count leading zeros and shift left make this very easy.
uint8x16_t z=vclzq_u8(mult);
startx=vshlq_u8(mult,vreinterpretq_s8_u8(z));
x8664:
There are no SSE or AVX instructions to do what is needed directly. AVX512 has _mm_mask_sllv_epi16
which would only do part of this. I used a method based on Hacker’s Delight, that generates both the leading zero counts and the shifted values.
The C code for this algorithm for a single value would be:
uint16_t v; // input, known not to be zero
int count = 0;
if(v&0xFF00) {
count += 8;
v <<= 8;
}
if(v&0xF000) {
count += 4;
v <<= 4;
}
if(v&0xC000) {
count += 2;
v <<= 2;
}
if(v&0x8000) {
count += 1;
v <<= 1;
}
Here is the translation into SSE. It could be made somewhat faster if an equivalent of the NEON vaddvq_u16
(add up all the 16 bit components) instruction existed (_mm_hadd_epi16
adds pairs of components), but instead the quickest way seems to be counting the clz_selectX results using popcount(_mm_movemask_epi8(...))
There is a wrinkle here  there is no _mm_movemask_epi16
instruction, so _mm_movemask_epi8
has to be used, which means that each leading zero is counted twice. This is corrected for at the end where a multiply by 2 is left out and popcount(_mm_movemask_epi8(clz_select1))
is divided by 2.
__m128i clz_select8=_mm_cmpeq_epi16(_mm_and_si128(startx,const_FF00),zero);
startx=_mm_blendv_epi8(startx,_mm_slli_epi16(startx,8),clz_select8);
__m128i clz_select4=_mm_cmpeq_epi16(_mm_and_si128(startx,const_F000),zero);
startx=_mm_blendv_epi8(startx,_mm_slli_epi16(startx,4),clz_select4);
__m128i clz_select2=_mm_cmpeq_epi16(_mm_and_si128(startx,const_C000),zero);
startx=_mm_blendv_epi8(startx,_mm_slli_epi16(startx,2),clz_select2);
__m128i clz_select1=_mm_cmpeq_epi16(_mm_and_si128(startx,const_8000),zero);
startx=_mm_blendv_epi8(startx,_mm_slli_epi16(startx,1),clz_select1);
uint32_t t=popcount(_mm_movemask_epi8(clz_select8));
t=popcount(_mm_movemask_epi8(clz_select4))+t+t;
t=popcount(_mm_movemask_epi8(clz_select2))+t+t;
t=(popcount(_mm_movemask_epi8(clz_select1))>>1)+t;
x8664 (8 bit case):
This is similar to the 16 bit case, but with a couple of changes:

there is no
_mm_slli_epi8
instruction, so_mm_slli_epi16
has to be replaced by shifting the whole 128 bit register, and an extra_mm_and_si128
instruction needs to be used to stop unwanted bits spilling over from neighboring 8 bit values. 
_mm_movemask_epi8
returns accurate results without the doubling.
The extra and
step makes the code even bulkier and requires more constants (const_FC
and const_FE
):
__m128i clz_select4=_mm_cmpeq_epi8(_mm_and_si128(startx,const_F0),zero);
startx=_mm_blendv_epi8(startx,_mm_and_si128(_mm_slli_epi64(startx,4),const_F0),
clz_select4);
__m128i clz_select2=_mm_cmpeq_epi8(_mm_and_si128(startx,const_C0),zero);
startx=_mm_blendv_epi8(startx,_mm_and_si128(_mm_slli_epi64(startx,2),const_FC),
clz_select2);
__m128i clz_select1=_mm_cmpeq_epi8(_mm_and_si128(startx,const_80),zero);
startx=_mm_blendv_epi8(startx,_mm_and_si128(_mm_slli_epi64(startx,1),const_FE),
clz_select1);
uint32_t t=popcount(_mm_movemask_epi8(clz_select4));
t=popcount(_mm_movemask_epi8(clz_select2))+t+t;
t=popcount(_mm_movemask_epi8(clz_select1))+t+t;
Subtract the number of leading zeros found from the total count
Agnostic:
int_digits=z;
}
ARM: Horizontal add instruction to add all the leading zero counts together.
int_digits=vaddvq_u16(z);
x8664:
The zero counts were already added together in the last step and stored in t
.
int_digits=t;
Finishing up
All CPUs:
ret+=8;
}
Software pipelining
It is possible to use software pipelining to do more than one calculation at once (as described in the Mandelbrot example), effectively widening the vector. It adds a lot of complexity to the code, and improves performance only a little. At the time of writing, I still have an edge case that is not working properly, and the performance improvement is not worth the effort, particularly in light of what I look at in the next section where I write an integer version of Hörmann. Here is an idea of the likely timing:
\(\lambda\)  1  10  25  50  100  200 

Knuth int16 sse  29  53  73  101  155  258 
Knuth int16 sse (pipeline)  33  56  72  92  137  221 
Hörmann, with integers
This is a faithful as possible conversion of the floating point version to fixed point, mainly using 32.32 integers.
Methodology:

I went through floating point version of the code, calculating the range of most of the variables and some of the subexpressions, particularly where they feed into a division. This is because division is the most problematic part of converting floating point to fixed point. This analysis is inexact enough that, except for the special case of zero, I didn’t bother distinguishing between inclusive ‘[…]’ and exclusive ‘(…)’ ranges.

I then wrote equivalent fixed point statements in parallel to the floating point ones, using the same variable names with an ‘i’ prefix. I didn’t comment out the floating point statements at this stage.

Addition and subtraction and comparison were straightforward conversion, multiplication I used
fixed_mult64s
andfixed_mult64u
as described above. 
Division is tricky. x8664 supports a ‘128 bit / 64 bit > 64’ bit division (although clang/gcc don’t seem to offer a way of getting at it without using inline assembly, Visual Studio does have intrinsics for it), but ARM does not, so the best native division that can be relied on is ‘64 bit / 64 bit > 64 bit’. It would be possible to use software division with more precision, but that would be very slow, which defeats the point of the exercise here. What I did case by case was shift values (where necessary) so that the dividend uses most of 64 bits, the divisor uses around 32 bits, so that the quotient will be around 32 bits. This offers a little under 32 bits of accuracy, which is the best that can be done with that technique in the general case. Division by integer values was done asis.

sqrt
andlog
called the functions defined above.


I then debugged the code using the floating point and fixed point code in parallel. I could (where necessary) check the fixed point values that I was getting against known good floating point ones.

Lastly, I commented out all the floating point code, and ran some more tests.
This is a slow, tedious and errorprone process, and it seems that some of it at least could be automated using variable range analysis, but a quick search did not find any suitable tools.
Some particular notes:

us==0
gets replaced byius<65536
, otherwisefixed_mult64u(ius,ius)
which is used as a divisor returns zero. 
For
log(V*smu)<=(k+0.5)*log(u/k)ulog(sqrt(2*M_PI))+k(1.0/12.01.0/(360*k*k))/k
, I allowed the possibility to exit early before the last part ((1.0/12.01.0/(360*k*k))/k
) is calculated, as that can save a calculation with two divisions.
The result is something that is a little degraded from the double
case because of reduced accuracy, but the returned distribution still seems to be good.
// iu is fixed 32.32
uint64_t poisson_random_variable_hormann_int(uint64_t* seed, uint64_t iu) {
while(true) {
// double smu=std::sqrt(u); // [3.1623,10000]
uint64_t ismu=fixed_sqrt_32_32(iu);
// double b=0.931+2.53*smu; // [8.9316,25301]
uint64_t ib=3998614553ULL+(multu64hi(ismu,11667565626621291397ULL)<<2);
// double a=0.059+0.02483*b; // [0.16277,629]
uint64_t ia=multu64hi(ib,458032655350208166ULL)253403070ULL;
// double vr=0.92773.6224/(b2.0); // [0.4051,0.9277]
uint64_t ivr=3984441160ULL((16705371433151369943ULL/(ib8589934592ULL))<<2);
uint64_t iV=fast_rand64(seed)>>32
// double V=iV/4294967296.0; // [0,1]
//if(V<0.86*vr) { // V/vr:[0,0.86]
if(iV<multu64hi(15864199903390214389ULL,ivr)) {
// double U=V/vr0.43; // [0.43,0.43]
int64_t iU=(iV<<32)/ivr1846835937ULL;
// double us=0.5abs(U); // [0.07,0.5]
uint64_t ius=2147483648ULLabs(iU);
// uint64_t k=std::floor((2.0*a/us+b)*U+u+0.445);
// 2.0*a range:[0.32554,1258]
uint64_t i2a_div_us=((ia<<21)/ius)<<12;
uint64_t ik=(fixed_mult64s(i2a_div_us+ib,iU)+iu+1911260447ULL)>>32;
return (uint32_t)ik;
}
uint64_t it=fast_rand64(seed)>>32;
//double t=it/4294967296.0; // [0,1]
int64_t iU;
if(iV>=ivr) {
//U=t0.5; // [0.5,0.5]
iU=it2147483648ULL;
} else {
// U=V/vr0.93; // [0.93,0.07]
iU=(iV<<32)/ivr3994319585ULL;
// U=((U<0)?0.5:0.5)U; // [0.5,0.5]
iU=((iU<0)?2147483648LL:2147483648LL)iU;
// V=t*vr; // [0,0.9277]
iV=fixed_mult64u(it,ivr);
}
// U range:[0.5,0.5]
// V range:[0,1]
//double us=0.5abs(U); // [0,0.5]
uint64_t ius=2147483648ULLabs(iU);
// if(us==0  (us<0.013 && V>us))
if(ius<65536  (ius<55834575ULL && iV>ius)) {
continue;
}
// double k=std::floor((2.0*a/us+b)*U+u+0.445); // anything
// 2.0*a range:[0.32554,1258]
// us range:(0,0.5]
uint64_t i2a_div_us=((ia<<21)/ius)<<12; //udiv64fixed(ia<<1,ius);
int64_t ik=(fixed_mult64s(i2a_div_us+ib,iU)+(int64_t)iu+1911260447LL)>>32;
// b3.4 range: [5.5316,25298]
// double inv_alpha=1.1239+1.1328/(b3.4); // >=1.1239, <1.3287
uint64_t iinv_alpha=4827113744ULL+((10448235843349090035ULL/
(ib14602888806ULL))<<1);
// us*us range:(0,0.25]
// V=V*inv_alpha/(a/(us*us)+b);
// don't need ranges from here on, as all further divisions are by integers
iV=((fixed_mult64u(iinv_alpha,iV)<<31)/
((((ia<<20)/fixed_mult64u(ius,ius))<<12)+ib)<<1);
if(ik>=10.0) {
// if(std::log(V*smu)<=(k+0.5)*log(u/k)ulog(sqrt(2*M_PI))+k
(1.0/12.01.0/(360*k*k))/k) {
int64_t lhs=log_64_fixed(fixed_mult64u(iV,ismu));
int64_t rhs=((ik<<1)+1)*((log_64_fixed(iu/ik))>>1)iu3946810947LL
+(ik<<32);
if(lhs>rhs) {
continue;
}
rhs=(357913941ULL(4294967296ULL/(360*ik*ik)))/ik;
if(lhs>rhs) {
continue;
}
return (uint32_t)ik;
} else if(0<=ik && log_64_fixed(iV)<((int64_t)ik)*log_64_fixed(iu)
(int64_t)iulog_fact_table_fixed[ik]) {
return (uint32_t)ik;
}
}
}
Timings:
\(\lambda\)  1  10  25  50  100  200 

Hörmann double    56  44  41  38  37 
Hörmann int64    102  93  85  79  76 
The integer version takes over twice as long. Why is this? This is some extra shifting going on, but the major culprit is likely to be division  there are 10 of them in the code, although at most 8 of them are used in any particular execution. Looking at a portion of Agner Fog’s Instruction tables: Lists of instruction latencies, throughputs and microoperation breakdowns for Intel, AMD and VIA CPUs, for the Haswell processor:
Instruction  dividend  divisor/ quotient  Ports  Latency  Reciprocal Throughput 

div  unsigned __int128  uint64_t  p0 p1 p5 p6  3296  2174 
idiv (unused)  __int128  int64_t  p0 p1 p5 p6  39103  2481 
divsd  double  double  p0  1020  814 
The floating point uses divsd
and the integer version uses div
which is much slower. idiv
would be even slower, but all the divisions in this example can done unsigned. This is an example of CPU resources being prioritized towards floating point performance rather than integer performance (the 64 bits vs 53 bits doesn’t help either). It would be quicker (and more accurate) to convert the integers to doubles, divide, then convert back. However this would risk breaking cross platform consistency.
Higher accuracy
The weakest point in the above is the divisions. This could be improved by shifting the dividend to be 64 bits and the divisor to be 32 bits, then shifting the result back, using something like the following code:
uint64_t udiv64_fixed(uint64_t a, uint64_t b) {
uint32_t leada=clz64(a);
int32_t leadb=clz64(b)32;
a<<=leada;
if(leadb>0) {
b<<=leadb;
} else {
b>>=leadb;
}
uint64_t d=a/b;
int32_t new_lead=leadaleadb32;
if(new_lead>0) {
d>>=new_lead;
} else {
d<<=new_lead;
}
return d;
}
Doing this makes the fixed point results very close to the floating point ones, but at the expense of speed because of the extra tests and shifts  it made the execution time a little over 50% longer.
Picking the best integer code
This is the timing for the integer implementations:
\(\lambda\)  1  10  25  50  100  200 

Knuth int64  17  43  82  148  279  542 
Knuth int32  16  45  89  160  306  597 
Knuth int16  17  55  111  204  392  769 
Knuth int16 sse  29  53  73  101  155  258 
Knuth int8 sse (imprecise)  33  77  91  101  134  196 
Hörmann int64    102  93  85  79  76 
Here’s a graph of the fastest four:
First note that the imprecise ‘Knuth int8 sse’ doesn’t even get a look in, because by the time the vectorization makes up for the setup overhead, ‘Hörmann int64’ has taken the lead. Even if I was tempted to put up with a slightly distorted distribution in the interests of speed (I wasn’t), it doesn’t help.
On my x8664 CPU, up to about \(\lambda=18\), ‘Knuth int64’ is the fastest, then up to \(\lambda=38\) ‘Knuth int16 sse’ is fastest, then ‘Hörmann int64’ is the clear winner for large values. The crossover points will be different on ARM, but it’s quite a lot of effort to measure timing (the only ARM system I have access to is my phone), so I’m going to assume it’s not too much different than x8664. Note that I want to pick the same crossover points across all processors, otherwise I won’t get identical results. The final algorithm is:
if (lambda<18)
use Knuth int64
else if(lambda<38)
use Knuth int16 sse
else
use Hormann int64
The band for using ‘Knuth int16 sse’ might widen a little if the software pipelined version was working, but having something that averages less than 90 ns for all \(\lambda\) is still good. Even a simplified version that just chooses between the ‘Knuth int64’ and ‘Hörmann int64’ (crossover at about \(\lambda=28\)) would still be more than adequate.
Conclusion
Here are the collected results:
\(\lambda\)  1  10  25  50  100  200 

Internal  31  127  220  211  210  208 
Knuth double  22  75  151  277  529  1033 
Knuth int64  17  43  82  148  279  542 
Knuth int32  16  45  89  160  306  597 
Knuth int16  17  55  111  204  392  769 
Knuth int8 (imprecise)  20  58  118  210  400  784 
Knuth int16 sse  29  53  73  101  155  258 
Knuth int8 sse (imprecise)  33  77  91  101  134  196 
Hörmann double    56  44  41  38  37 
Hörmann int64    102  93  85  79  76 
If a solution is allowed to use floating point, a fast solution is given (an average of below 50 ns for most \(\lambda\)) using:

‘Knuth double` for \(\lambda<10\),

‘Hörmann double’ for \(\lambda \geq 10\).
If we want reliable cross platform results, an integer solution is needed, so a fast solution (an average of below 90 ns for \(\lambda\)) is:

‘Knuth int64’ for \(\lambda<18\),

‘Knuth int16 sse’ for \(18 \leq \lambda<38\),

‘Hörmann int64’ for \(\lambda \geq 38\).
Either case is likely to be fast enough for most purposes. In my project, Poisson random variate code less than 4% of the executed instructions according to Valgrind. This might not be quite an apples to apples comparison with the 10+% mentioned earlier as there have also been many other changes, but the new code is definitely an improvement.
This is also a good demonstration that even heavily parallelized \(O(\lambda\)) code will eventually be beaten by \(O(1)\) code.
I welcome any comments, corrections, observations, or improvement suggestions. Please feel free to contact me