Topic: Is Carmack's rsqrt dead? (at least in Pascal)
cheb
« on: February 06, 2018, 05:22:02 AM »

So, I was doing all sorts of tests lately... including calculating md5 sums of function results over most of the float range (takes forever on a Raspberry Pi 3).
And... and... Carmack's method was actually slower on the RPi, and on x86 it is easily beaten by an honest 1/sqrt() on SSE if you calculate 4 at once!

1. Most of the time goes into md5 calculation
2. Array size is 8 KB (2048 floats) to fit in L1 cache.
3. sin() is currently not included, but is horrifically slow (15x slower than multiply)
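
For readers who want to try something similar, here is a minimal sketch of how such a checksum sweep could be put together. It is not the actual test code from the post: `SweepChecksum`, `TBlock` and `BlockLen` are hypothetical names, the sweep covers only a short run of bit patterns instead of most of the float range, and it assumes Free Pascal's bundled `md5` unit. Results are computed in blocks of 2048 singles (8 KB) and each block is fed to the running MD5.

```pascal
program Md5SweepSketch;
{$mode objfpc}
uses
  md5; // Free Pascal's bundled MD5 unit

const
  BlockLen = 2048; // 2048 singles = 8 KB, small enough to stay in L1 cache

type
  TBlock = array[0..BlockLen - 1] of single;

// Hypothetical helper: hashes 1/sqrt(x) for blockCount * BlockLen consecutive
// float bit patterns, starting at firstBits.
function SweepChecksum(firstBits: longword; blockCount: integer): string;
var
  ctx: TMD5Context;
  digest: TMD5Digest;
  buf: TBlock;
  bits: longword;
  b, k: integer;
begin
  MD5Init(ctx);
  bits := firstBits;
  for b := 1 to blockCount do
  begin
    for k := 0 to BlockLen - 1 do
    begin
      // Reinterpret the integer bit pattern as a single and store the result.
      buf[k] := 1.0 / sqrt(PSingle(@bits)^);
      Inc(bits);
    end;
    // Only the raw result bytes go into the hash.
    MD5Update(ctx, buf, SizeOf(buf));
  end;
  MD5Final(ctx, digest);
  Result := MD5Print(digest);
end;

begin
  // $3F800000 is the bit pattern of 1.0; 16 blocks is an arbitrary short run.
  WriteLn(SweepChecksum($3F800000, 16));
end.
```

Two machines that print the same string for the same arguments produced bit-identical results for that run, which is the kind of comparison the md5 lines in the logs below allow.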


CPU Phenom II X6 1090T 3.2GHz (linpack reports 10 GFLOPS for a single core)
  Checking CPU/compiler combo for floating point determinism...
    ..checking quick reverse square root using Carmack's method..
      ..ok, in 37 (pure 12.7) seconds (0.168 GFLOPS)
      ..md5 checksum = E1F449F174DB646E7D3075870D37BD4F
    ..checking SSE SIMD4 1/sqrt(x)..
      ..ok, in 29 (pure 5.02) seconds (0.425 GFLOPS)
      ..md5 checksum = 7BA70F1439D5E2955151CC565477E924
    ..checking 1/sqrt(x)..
      ..ok, in 42 (pure 17.5) seconds (0.122 GFLOPS)
      ..md5 checksum = 7BA70F1439D5E2955151CC565477E924
    ..checking SSE SIMD4 RSQRTPS (packed quick reverse square root)..
      ..ok, in 25 (pure 1.44) seconds (1.48 GFLOPS)
      ..md5 checksum = EF9B294032F7BA3051A1025B06EA3C96
    ..checking rsqrt (whatever implementation Chentra uses, a single-float SSE RSQRTSS on x86)..
      ..ok, in 35 (pure 10.7) seconds (0.199 GFLOPS)
      ..md5 checksum = EF9B294032F7BA3051A1025B06EA3C96
    ..checking x * Pi (typed const)..
      ..ok, in 57 (pure 9.88) seconds (0.425 GFLOPS)
      ..md5 checksum = 9CA6E7B818FA046C3DAE722C35196729
    ..checking x * 3.141592653589793 (inline const)..
      ..ok, in 57 (pure 9.93) seconds (0.422 GFLOPS)
      ..md5 checksum = 0FC3738303DEA3CFC8C6F7AFBF585BE6
    ..checking x * float(3.141592653589793) (inline const with type-cast)..
      ..ok, in 55 (pure 7.74) seconds (0.542 GFLOPS)
      ..md5 checksum = 9CA6E7B818FA046C3DAE722C35196729


    
CPU Core i5-2450M 2.50GHz (linpack reports 19 GFLOPS for a single core)
  Checking CPU/compiler combo for floating point determinism...
    ..checking quick reverse square root using Carmack's method..
      ..ok, in 40 (pure 14.6) seconds (0.146 GFLOPS)
      ..md5 checksum = E1F449F174DB646E7D3075870D37BD4F
    ..checking SSE SIMD4 1/sqrt(x)..
      ..ok, in 31 (pure 5.32) seconds (0.4 GFLOPS)
      ..md5 checksum = 7BA70F1439D5E2955151CC565477E924
    ..checking 1/sqrt(x)..
      ..ok, in 48 (pure 21.6) seconds (0.0987 GFLOPS)
      ..md5 checksum = 7BA70F1439D5E2955151CC565477E924
    ..checking SSE SIMD4 RSQRTPS (packed quick reverse square root)..
      ..ok, in 27 (pure 1.22) seconds (1.74 GFLOPS)
      ..md5 checksum = F881C03FB2C6F5BBDFF57AE5532CFFFD
    ..checking rsqrt (whatever implementation Chentra uses, a single-float SSE RSQRTSS on x86)..
      ..ok, in 32 (pure 6.88) seconds (0.309 GFLOPS)
      ..md5 checksum = F881C03FB2C6F5BBDFF57AE5532CFFFD
    ..checking x * Pi (typed const)..
      ..ok, in 57 (pure 6.3) seconds (0.665 GFLOPS)
      ..md5 checksum = 9CA6E7B818FA046C3DAE722C35196729
    ..checking x * 3.141592653589793 (inline const)..
      ..ok, in 57 (pure 6.79) seconds (0.618 GFLOPS)
      ..md5 checksum = 0FC3738303DEA3CFC8C6F7AFBF585BE6
    ..checking x * float(3.141592653589793) (inline const with type-cast)..
      ..ok, in 56 (pure 5.9) seconds (0.711 GFLOPS)
      ..md5 checksum = 9CA6E7B818FA046C3DAE722C35196729

    
    
    
Raspberry Pi 3:    
CPU ARMv7 rev 4 (v7l)
  x4 logical cores
  Checking CPU/compiler combo for floating point determinism...
    ..checking quick reverse square root using Carmack's method..
      ..ok, in 890 (pure 271) seconds (0.00787 GFLOPS)
      ..md5 checksum = E1F449F174DB646E7D3075870D37BD4F
    ..checking 1/sqrt(x)..
      ..ok, in 722 (pure 103) seconds (0.0206 GFLOPS)
      ..md5 checksum = 7BA70F1439D5E2955151CC565477E924
    ..checking rsqrt (whatever implementation Chentra uses, a single-float SSE RSQRTSS on x86)..
      ..ok, in 890 (pure 271) seconds (0.00787 GFLOPS)
      ..md5 checksum = E1F449F174DB646E7D3075870D37BD4F
    ..checking x * Pi (typed const)..
      ..ok, in 1294 (pure 75.6) seconds (0.0555 GFLOPS)
      ..md5 checksum = 9CA6E7B818FA046C3DAE722C35196729
    ..checking x * 3.141592653589793 (inline const)..
      ..ok, in 1331 (pure 112) seconds (0.0373 GFLOPS)
      ..md5 checksum = 0FC3738303DEA3CFC8C6F7AFBF585BE6
    ..checking x * float(3.141592653589793) (inline const with type-cast)..
      ..ok, in 1282 (pure 63.3) seconds (0.0662 GFLOPS)
      ..md5 checksum = 9CA6E7B818FA046C3DAE722C35196729



My translation of fast inverse square root from Wikipedia:
Code:
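    // https://en.wikipedia.org/wiki/Fast_inverse_square_root
    // "float" is a 32-bit floating point type here; "*=" needs C-style operators enabled (-Sc in FPC).
    function FastInverseSquareRoot(a: float): float; inline;
    var
      i: longint; // not declared "absolute Result": the code generator FAILS to marry SSE2 and general-purpose registers
    begin
      i:= longint(pointer(@a)^);                  // reinterpret the float's bits as an integer
      i:= $5f3759df - (i shr 1);                  // magic-constant initial guess
      Result:= float(pointer(@i)^);               // reinterpret the bits back as a float
      Result*= 1.5 - (a * 0.5 * Result * Result); // first Newton iteration
      Result*= 1.5 - (a * 0.5 * Result * Result); // second Newton iteration
    end;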
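
A quick way to see how close the two Newton iterations get is to compare against the library 1/sqrt(x). The program below is only a sketch under a few assumptions: it restates the function with plain `single` and the `PLongint`/`PSingle` pointer types so it compiles without the project's `float` alias or C-style operators, and `FastInvSqrt`, `RelError` and the sample values are names and numbers picked for illustration.

```pascal
program RsqrtAccuracySketch;
{$mode objfpc}

// Same algorithm as the translation above, written with "single" and typed
// pointer casts so it builds with default compiler switches.
function FastInvSqrt(a: single): single;
var
  i: longint;
begin
  i := PLongint(@a)^;              // float bits -> integer
  i := $5f3759df - (i shr 1);      // magic-constant initial guess
  Result := PSingle(@i)^;          // integer bits -> float
  Result := Result * (1.5 - (a * 0.5 * Result * Result)); // Newton step 1
  Result := Result * (1.5 - (a * 0.5 * Result * Result)); // Newton step 2
end;

// Relative error against 1/sqrt(x) evaluated in double precision.
function RelError(x: single): double;
var
  exact: double;
begin
  exact := 1.0 / sqrt(double(x));
  Result := abs(FastInvSqrt(x) - exact) / exact;
end;

const
  Samples: array[0..4] of single = (0.01, 0.5, 1.0, 2.0, 1000.0);
var
  k: integer;
begin
  for k := Low(Samples) to High(Samples) do
    WriteLn(Samples[k]:10:4, '  relative error = ', RelError(Samples[k]):0:8);
end.
```

With both iterations in place the relative error should stay in the range of a few parts per million for positive, normal inputs; dropping the second iteration makes it roughly a thousand times worse, which is the usual speed/accuracy trade-off with this trick.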

And the faster but NOT deterministic alternative I use on x86:
Code:
    function FastInverseSquareRoot(a: float): float; inline; assembler;
    asm
      RSQRTSS xmm7, a
      MOVSS [Result], xmm7
    end['xmm7']; 

P.S. I haven't seen an x86 CPU without SSE2 since forever. You need a Pentium 3 or a really ancient AMD Sempron for this code to fail.

Imma lazy dreamer. I achieved nothing.
Neon_Knight
« Reply #1 on: February 25, 2018, 06:41:17 AM »

Sago once explained the problems that rsqrt has. Apparently it's system-specific and unportable.


"Detailed" is nice, but if it gets in the way of clarity, it ceases being a nice addition and becomes a problem. - TVT
sago007
« Reply #2 on: February 25, 2018, 09:24:56 AM »

Well, I did not explain it.
I read a description of it somewhere, where they discussed that the cast between int and float would hurt performance on all processors with dedicated floating point registers.
The code was likely copied from a math library targeting an early RISC-based processor.

One thing to remember when comparing the functions is that Carmack's function is deterministic. Your hardware accelerated functions might not be.
« Last Edit: February 25, 2018, 09:28:06 AM by sago007 »

There are nothing offending in my posts.
cheb
« Reply #3 on: February 28, 2018, 01:15:49 AM »

Quote
Your hardware accelerated functions might not be.
That's a really interesting topic (and that's what the md5 sums are for). Interestingly enough, I found that most hardware-accelerated calculations in Pascal are consistently deterministic between x86 and whichever ARM the Raspberry Pi 3 has. That includes sin() (horrifically slow, that one), sqrt() and everything else, even when using SSE2.

What is NOT deterministic are the quick-and-dirty SSE instructions that calculate 1/x or rsqrt(x). These are lightning-fast, but their checksums do not match between different CPUs (compare the RSQRTPS checksums for the Phenom II and the Core i5 in my first post).

Quote
would hurt performance on all processors with dedicated floating point registers.
Very true. When I switched Pascal to generate assembly output, I found that to pass a value from an xmm register to a general-purpose register it writes the value to memory and reads it back. It does that twice, once in each direction. Definitely a case of "mah eyes are bleeding!"/"make me unsee it!"
« Last Edit: February 28, 2018, 01:18:48 AM by cheb »
