So, I've been running all sorts of tests lately, including md5 sums of function results over most of the float range (takes forever on a Raspberry Pi 3).
And it turns out Carmack's method is actually slower than plain 1/sqrt() on the RPi, and on x86 it is easily beaten by an honest 1/sqrt() on SSE if you calculate 4 at once!
1. Most of the time goes into md5 calculation
2. Array size is 8k (2048 floats) to fit in L1 cache.
3. sin() is currently not included, but is horrifically slow (15x slower than a multiply)
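For anyone who wants to reproduce the idea, the check described above can be sketched roughly like this. This is not the actual test program: all names here (DeterminismSketch, ChecksumOver, Fnv1a, InvSqrt) are made up for illustration, a simple FNV-1a hash stands in for md5, and only a strided sample of positive float bit patterns is walked instead of most of the range.

```pascal
program DeterminismSketch;
{$mode objfpc}
{$Q-}{$R-} // overflow/range checks off: bit patterns are advanced as plain integers

const
  BatchSize = 2048; // 2048 singles = 8 KiB, small enough to stay in L1 cache

type
  TBatch = array[0..BatchSize - 1] of single;
  TFunc = function(x: single): single;

// Function under test; swap in any other single -> single function.
function InvSqrt(x: single): single;
begin
  Result := 1.0 / sqrt(x);
end;

// FNV-1a over raw bytes. ASSUMPTION: a lightweight stand-in for md5.
function Fnv1a(hash: dword; const buf; len: integer): dword;
var
  p: PByte;
  i: integer;
begin
  p := @buf;
  for i := 0 to len - 1 do
    hash := dword((qword(hash xor p[i]) * 16777619) and $FFFFFFFF);
  Result := hash;
end;

// Fills one batch at a time from consecutive (strided) bit patterns,
// applies f, and folds the raw result bytes into a running checksum.
function ChecksumOver(f: TFunc; startBits, stride: dword; batches: integer): dword;
var
  data: TBatch;
  bits: dword;
  b, i: integer;
begin
  Result := 2166136261; // FNV-1a offset basis
  bits := startBits;
  for b := 1 to batches do
  begin
    for i := 0 to BatchSize - 1 do
    begin
      data[i] := f(PSingle(@bits)^); // reinterpret the bit pattern as a single
      bits := bits + stride;
    end;
    Result := Fnv1a(Result, data, SizeOf(data));
  end;
end;

var
  c: dword;
begin
  // Start at the smallest positive normal single ($00800000).
  c := ChecksumOver(@InvSqrt, $00800000, 4099, 16);
  WriteLn('checksum = ', HexStr(int64(c), 8));
end.
```

If two CPU/compiler combos print the same checksum for the same function, they produced bit-identical results for every sampled input, which is the whole point of hashing the raw bytes rather than comparing with a tolerance.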
CPU Phenom II X6 1090T 3.2GHz (linpack reports 10 GFLOPS for a single core)
Checking CPU/compiler combo for floating point determinism...
..checking quick reverse square root using Carmack's method..
..ok, in 37 (pure 12.7) seconds (0.168 GFLOPS)
..md5 checksum = E1F449F174DB646E7D3075870D37BD4F
..checking SSE SIMD4 1/sqrt(x)..
..ok, in 29 (pure 5.02) seconds (0.425 GFLOPS)
..md5 checksum = 7BA70F1439D5E2955151CC565477E924
..checking 1/sqrt(x)..
..ok, in 42 (pure 17.5) seconds (0.122 GFLOPS)
..md5 checksum = 7BA70F1439D5E2955151CC565477E924
..checking SSE SIMD4 RSQRTPS (packed quick reverse square root)..
..ok, in 25 (pure 1.44) seconds (1.48 GFLOPS)
..md5 checksum = EF9B294032F7BA3051A1025B06EA3C96
..checking rsqrt (whatever implementation Chentra uses, a single-float SSE RSQRTSS on x86)..
..ok, in 35 (pure 10.7) seconds (0.199 GFLOPS)
..md5 checksum = EF9B294032F7BA3051A1025B06EA3C96
..checking x * Pi (typed const)..
..ok, in 57 (pure 9.88) seconds (0.425 GFLOPS)
..md5 checksum = 9CA6E7B818FA046C3DAE722C35196729
..checking x * 3.141592653589793 (inline const)..
..ok, in 57 (pure 9.93) seconds (0.422 GFLOPS)
..md5 checksum = 0FC3738303DEA3CFC8C6F7AFBF585BE6
..checking x * float(3.141592653589793) (inline const with type-cast)..
..ok, in 55 (pure 7.74) seconds (0.542 GFLOPS)
..md5 checksum = 9CA6E7B818FA046C3DAE722C35196729
CPU Core i5-2450M 2.50GHz (linpack reports 19 GFLOPS for a single core)
Checking CPU/compiler combo for floating point determinism...
..checking quick reverse square root using Carmack's method..
..ok, in 40 (pure 14.6) seconds (0.146 GFLOPS)
..md5 checksum = E1F449F174DB646E7D3075870D37BD4F
..checking SSE SIMD4 1/sqrt(x)..
..ok, in 31 (pure 5.32) seconds (0.4 GFLOPS)
..md5 checksum = 7BA70F1439D5E2955151CC565477E924
..checking 1/sqrt(x)..
..ok, in 48 (pure 21.6) seconds (0.0987 GFLOPS)
..md5 checksum = 7BA70F1439D5E2955151CC565477E924
..checking SSE SIMD4 RSQRTPS (packed quick reverse square root)..
..ok, in 27 (pure 1.22) seconds (1.74 GFLOPS)
..md5 checksum = F881C03FB2C6F5BBDFF57AE5532CFFFD
..checking rsqrt (whatever implementation Chentra uses, a single-float SSE RSQRTSS on x86)..
..ok, in 32 (pure 6.88) seconds (0.309 GFLOPS)
..md5 checksum = F881C03FB2C6F5BBDFF57AE5532CFFFD
..checking x * Pi (typed const)..
..ok, in 57 (pure 6.3) seconds (0.665 GFLOPS)
..md5 checksum = 9CA6E7B818FA046C3DAE722C35196729
..checking x * 3.141592653589793 (inline const)..
..ok, in 57 (pure 6.79) seconds (0.618 GFLOPS)
..md5 checksum = 0FC3738303DEA3CFC8C6F7AFBF585BE6
..checking x * float(3.141592653589793) (inline const with type-cast)..
..ok, in 56 (pure 5.9) seconds (0.711 GFLOPS)
..md5 checksum = 9CA6E7B818FA046C3DAE722C35196729
Raspberry Pi 3:
CPU ARMv7 rev 4 (v7l)
x4 logical cores
Checking CPU/compiler combo for floating point determinism...
..checking quick reverse square root using Carmack's method..
..ok, in 890 (pure 271) seconds (0.00787 GFLOPS)
..md5 checksum = E1F449F174DB646E7D3075870D37BD4F
..checking 1/sqrt(x)..
..ok, in 722 (pure 103) seconds (0.0206 GFLOPS)
..md5 checksum = 7BA70F1439D5E2955151CC565477E924
..checking rsqrt (whatever implementation Chentra uses, a single-float SSE RSQRTSS on x86)..
..ok, in 890 (pure 271) seconds (0.00787 GFLOPS)
..md5 checksum = E1F449F174DB646E7D3075870D37BD4F
..checking x * Pi (typed const)..
..ok, in 1294 (pure 75.6) seconds (0.0555 GFLOPS)
..md5 checksum = 9CA6E7B818FA046C3DAE722C35196729
..checking x * 3.141592653589793 (inline const)..
..ok, in 1331 (pure 112) seconds (0.0373 GFLOPS)
..md5 checksum = 0FC3738303DEA3CFC8C6F7AFBF585BE6
..checking x * float(3.141592653589793) (inline const with type-cast)..
..ok, in 1282 (pure 63.3) seconds (0.0662 GFLOPS)
..md5 checksum = 9CA6E7B818FA046C3DAE722C35196729
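A plausible reason the inline-const checksum differs from the other two on every machine: an untyped literal like 3.141592653589793 cannot be represented exactly as a single, so the multiply is likely carried out at higher precision and rounded once at the end, while the typed const and the float() cast round Pi to single first. The sketch below forces both paths explicitly. It is only an illustration under that assumption, not the original test code, and the names (MulWide, MulNarrow, CountDifferences) are made up.

```pascal
program ConstPrecisionSketch;
{$mode objfpc}

const
  PiDouble: double = 3.141592653589793;
  PiSingle: single = 3.141592653589793; // rounded to single when compiled

// Multiply in double precision, then round the product to single once.
function MulWide(x: single): single;
var
  d: double;
begin
  d := x;
  d := d * PiDouble;
  Result := d;
end;

// Constant already rounded to single, then single * single.
function MulNarrow(x: single): single;
begin
  Result := x * PiSingle;
end;

// Counts how many of the inputs 1..n give different bits on the two paths.
function CountDifferences(n: integer): integer;
var
  i: integer;
  x: single;
begin
  Result := 0;
  for i := 1 to n do
  begin
    x := i;
    if MulWide(x) <> MulNarrow(x) then
      Inc(Result);
  end;
end;

begin
  WriteLn('differing results: ', CountDifferences(100000), ' of 100000');
end.
```

Both variants are deterministic on their own (the checksums above match across all three machines), they just are not interchangeable with each other.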
My translation of fast inverse square root from Wikipedia:
// https://en.wikipedia.org/wiki/Fast_inverse_square_root
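function FastInverseSquareRoot(a: float): float; inline;
var
  i: longint; // absolute Result; // code generator FAILS to marry SSE2 and general-purpose registers
begin
  //Result:= a;
  i:= longint(pointer(@a)^);    // reinterpret the float's bits as an integer
  i:= $5f3759df - (i shr 1);    // magic-constant initial guess
  Result:= float(pointer(@i)^); // reinterpret the bits back as a float
  Result*= 1.5 - (a * 0.5 * Result * Result); // 1st Newton-Raphson iteration
  Result*= 1.5 - (a * 0.5 * Result * Result); // 2nd Newton-Raphson iteration
end;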
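To get a feel for what the two Newton-Raphson steps buy, here is a small accuracy check against plain 1/sqrt(x). It is a sketch with assumed names (FastInvSqrt, MaxRelError), it uses single instead of my float alias, and it takes the iteration count as a parameter so the steps can be compared. The sampled range 0.01..100 is arbitrary.

```pascal
program RsqrtAccuracySketch;
{$mode objfpc}

// Same bit trick as above, with a configurable number of Newton-Raphson steps.
function FastInvSqrt(a: single; iterations: integer): single;
var
  i: longint;
  k: integer;
begin
  i := PLongint(@a)^;          // float bits as integer
  i := $5f3759df - (i shr 1);  // magic-constant initial guess
  Result := PSingle(@i)^;      // integer bits back to float
  for k := 1 to iterations do
    Result := Result * (1.5 - (a * 0.5 * Result * Result));
end;

// Largest relative error versus 1/sqrt(x) over a geometric sweep of x.
function MaxRelError(iterations: integer): double;
var
  x: single;
  exact, err: double;
begin
  Result := 0.0;
  x := 0.01;
  while x < 100.0 do
  begin
    exact := 1.0 / sqrt(x);
    err := abs(FastInvSqrt(x, iterations) - exact) / exact;
    if err > Result then
      Result := err;
    x := x * 1.01;
  end;
end;

begin
  WriteLn('max relative error, 0 iterations: ', MaxRelError(0):0:8);
  WriteLn('max relative error, 1 iteration:  ', MaxRelError(1):0:8);
  WriteLn('max relative error, 2 iterations: ', MaxRelError(2):0:8);
end.
```

Each extra iteration costs a few multiplies, which is part of why the two-step version loses to a hardware 1/sqrt() once SIMD is in play.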
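And the faster but NOT deterministic alternative I use on x86 (RSQRTSS only returns an approximation whose exact bits depend on the CPU model; note the different RSQRTPS/rsqrt checksums on the Phenom II and the Core i5 above):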
function FastInverseSquareRoot(a: float): float; inline; assembler;
asm
  RSQRTSS xmm7, a
  MOVSS [Result], xmm7
end['xmm7'];
P.S. I haven't seen an x86 CPU without SSE2 since forever. You would need a Pentium III or a really ancient AMD Sempron for SSE2 code to fail, and RSQRTSS itself only needs plain SSE, which even those CPUs have.