Age | Commit message (Collapse) | Author | Files | Lines |
|
results, show results for multiple backends on all test systems
|
- switch row_eor() from macro to static inline function
- compress rho rotate values from 15 128-bit registers into two to
reduce register pressure (still spilling, though)
- remove PERMUTE macro
- switch from unrolled loop with macro in body of permute_n_neon() to
regular loop
- add documentation for register/lane layout and for compressed rho
rotations
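one way to picture the compressed rho rotations from the list above, as a portable C sketch rather than the actual neon code: every rho rotation amount is below 64, so each fits in one byte, and 25 bytes fit in the 32 bytes that two 128-bit registers provide. the names RHO_PACKED, rol64(), and rho_packed() are hypothetical, and the byte-per-lane packing is an assumption, not necessarily the layout this commit uses.

```c
#include <stdint.h>

// Keccak rho rotation amounts, one byte per lane (index = x + 5*y),
// padded to 32 bytes (the size of two 128-bit registers).
// hypothetical layout, for illustration only.
static const uint8_t RHO_PACKED[32] = {
   0,  1, 62, 28, 27,
  36, 44,  6, 55, 20,
   3, 10, 43, 25, 39,
  41, 45, 15, 21,  8,
  18,  2, 61, 56, 14,
   0,  0,  0,  0,  0,  0,  0, // padding
};

// rotate left by n bits; masking both shift counts keeps n == 0 well-defined
static inline uint64_t rol64(const uint64_t v, const unsigned n) {
  return (v << (n & 63)) | (v >> (-n & 63));
}

// apply the rho step to all 25 lanes, reading amounts from the packed table
static inline void rho_packed(uint64_t a[25]) {
  for (int i = 0; i < 25; i++) {
    a[i] = rol64(a[i], RHO_PACKED[i]);
  }
}
```

a vector backend would additionally have to widen each byte to a 64-bit shift count before use; that step is omitted here.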
with these changes the neon backend still uses ~50% more cycles than
the scalar backend, so i will probably leave it disabled for the initial
release.
scalar (pi5):
> ./bench 2000
info: cpucycles: version=20240318 implementation=arm64-vct persecond=2400000000
info: backend=scalar num_trials=2000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,20.2,10.3,10.3,9.3,9.2
sha3_256,32,20.2,10.3,10.3,9.9,9.7
sha3_384,48,20.9,15.3,12.8,12.7,12.5
sha3_512,64,20.2,20.2,18.9,25.3,17.9
shake128,32,20.2,10.1,9.0,8.1,7.9
shake256,32,20.2,10.3,10.3,9.9,9.7
neon backend bench results (pi5):
> ./bench 2000
info: cpucycles: version=20240318 implementation=arm64-vct persecond=2400000000
info: backend=neon num_trials=2000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,32.7,16.3,16.4,14.9,14.6
sha3_256,32,32.0,16.2,16.4,15.9,15.5
sha3_384,48,32.7,24.2,20.4,20.2,20.0
sha3_512,64,32.0,32.2,30.1,28.6,28.5
shake128,32,32.7,16.2,14.2,12.8,12.5
shake256,32,32.7,16.2,16.3,15.7,15.4
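the rough "~50% more cycles" estimate can be checked against the two pi5 tables above. a minimal sketch, using only the 16384-byte column copied from that output; overhead_pct(), ROWS, and print_overheads() are hypothetical names.

```c
#include <stdio.h>

// percentage by which `other` exceeds `base`
static double overhead_pct(const double base, const double other) {
  return 100.0 * (other - base) / base;
}

// 16384-byte column from the scalar and neon tables above (pi5)
static const struct {
  const char *name;
  double scalar, neon;
} ROWS[] = {
  { "sha3_224",  9.2, 14.6 },
  { "sha3_256",  9.7, 15.5 },
  { "sha3_384", 12.5, 20.0 },
  { "sha3_512", 17.9, 28.5 },
  { "shake128",  7.9, 12.5 },
  { "shake256",  9.7, 15.4 },
};

// print the neon overhead relative to scalar for each function
static void print_overheads(void) {
  for (size_t i = 0; i < sizeof(ROWS) / sizeof(ROWS[0]); i++) {
    printf("%s: +%.1f%%\n", ROWS[i].name,
           overhead_pct(ROWS[i].scalar, ROWS[i].neon));
  }
}
```

for these six rows the overhead lands between roughly 58% and 60%, so the neon backend is somewhat further behind than the round figure suggests.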
|
cycles, see commit message)
made the following changes:
- row_t contents are now 3 uint64x2_t instead of uint64x2x3_t (so they
are stored in registers instead of memory)
- fetch round constants 2 at a time
- round loop unrolled once
- drop convoluted ext/trn store (hard to read, doesn't help)
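to illustrate the first item in the list above: a Keccak row holds five 64-bit lanes, which fit in three two-lane vectors with one lane left over. the sketch below emulates that shape in portable C. vec2_t stands in for uint64x2_t, and vec2_t, vec2_xor(), row_load(), row_xor(), and row_store() are hypothetical names, not the backend's actual code. whether a compiler keeps such a struct in registers depends on the target and optimization level.

```c
#include <stdint.h>

// stand-in for a 128-bit vector of two 64-bit lanes (uint64x2_t on arm64)
typedef struct { uint64_t v[2]; } vec2_t;

// one row of the 5x5 state: lanes 0-1, lanes 2-3, lane 4 (upper half of p2 unused)
typedef struct { vec2_t p0, p1, p2; } row_t;

// lane-wise exclusive-or of two vectors
static inline vec2_t vec2_xor(const vec2_t a, const vec2_t b) {
  return (vec2_t) {{ a.v[0] ^ b.v[0], a.v[1] ^ b.v[1] }};
}

// load five consecutive lanes into a row
static inline row_t row_load(const uint64_t p[5]) {
  return (row_t) {
    .p0 = {{ p[0], p[1] }},
    .p1 = {{ p[2], p[3] }},
    .p2 = {{ p[4], 0 }},
  };
}

// exclusive-or two rows, one vector at a time
static inline row_t row_xor(const row_t a, const row_t b) {
  return (row_t) {
    .p0 = vec2_xor(a.p0, b.p0),
    .p1 = vec2_xor(a.p1, b.p1),
    .p2 = vec2_xor(a.p2, b.p2),
  };
}

// store a row back to five consecutive lanes
static inline void row_store(uint64_t p[5], const row_t r) {
  p[0] = r.p0.v[0];
  p[1] = r.p0.v[1];
  p[2] = r.p1.v[0];
  p[3] = r.p1.v[1];
  p[4] = r.p2.v[0];
}
```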
bench results
-------------
scalar backend:
> make clean all SHA3_BACKEND=1
...
> ./bench 10000
info: cpucycles: version=20240318 implementation=arm64-vct persecond=2400000000
info: backend=scalar num_trials=10000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,20.2,10.3,10.3,9.3,9.2
sha3_256,32,20.2,10.3,10.3,9.9,9.7
sha3_384,48,20.9,15.3,12.8,12.7,12.7
sha3_512,64,20.2,20.2,18.9,17.9,18.1
shake128,32,20.2,10.3,9.0,8.1,7.9
shake256,32,20.2,10.1,10.3,9.9,9.7
neon backend:
> make clean all SHA3_BACKEND=3
...
> ./bench 10000
info: cpucycles: version=20240318 implementation=arm64-vct persecond=2400000000
info: backend=neon num_trials=10000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,9.7,5.0,5.0,4.6,4.5
sha3_256,32,9.7,5.0,5.0,4.9,4.8
sha3_384,48,9.7,7.3,6.2,6.2,6.1
sha3_512,64,9.7,9.7,9.1,8.7,8.7
shake128,32,9.7,5.0,4.5,4.0,4.0
shake256,32,9.7,5.0,5.1,4.9,4.8
|
scalar bench results:
info: cpucycles: version=20240318 implementation=arm64-vct persecond=2400000000
info: backend=scalar num_trials=50000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,20.2,10.3,10.3,9.3,9.2
sha3_256,32,20.2,10.1,10.3,9.9,9.7
sha3_384,48,20.9,15.1,12.8,12.7,12.5
sha3_512,64,20.2,20.2,18.9,18.0,18.0
shake128,32,20.2,10.1,9.0,8.1,7.9
shake256,32,20.2,10.1,10.3,9.9,9.7
neon bench results:
info: cpucycles: version=20240318 implementation=arm64-vct persecond=2400000000
info: backend=neon num_trials=50000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,32.7,16.2,16.3,14.8,14.5
sha3_256,32,32.7,16.2,16.3,15.8,15.4
sha3_384,48,32.7,24.2,20.3,20.3,20.0
sha3_512,64,32.0,32.3,30.2,28.7,29.3
shake128,32,34.8,16.9,14.9,13.3,13.4
shake256,32,35.5,18.1,17.4,17.2,16.4
diet-neon bench results:
info: cpucycles: version=20240318 implementation=arm64-vct persecond=2400000000
info: backend=diet-neon num_trials=50000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,33.4,16.5,16.6,15.1,15.0
sha3_256,32,33.4,16.5,16.6,16.1,15.9
sha3_384,48,33.4,25.0,21.0,20.7,21.4
sha3_512,64,33.4,34.9,33.5,31.1,32.0
shake128,32,36.8,18.4,16.3,14.3,14.0
shake256,32,34.1,17.7,18.2,17.6,17.3
|
|
command-line argument
|
commit message)
scalar (odroid n2l):
pabs@pizza:~/git/sha3/tests/bench> ./bench 1000
info: cpucycles: version=20240318 implementation=arm64-vct persecond=1800000000
info: backend=scalar num_trials=1000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,34.0,16.4,15.5,14.0,13.7
sha3_256,32,34.0,16.1,15.4,14.8,14.4
sha3_384,48,34.0,23.4,19.0,18.8,18.6
sha3_512,64,34.0,30.8,28.1,26.5,26.5
shake128,32,34.0,16.1,13.6,12.1,11.8
shake256,32,34.0,16.1,15.5,14.8,14.4
neon (odroid n2l):
pabs@pizza:~/git/sha3/tests/bench> ./bench 1000
info: cpucycles: version=20240318 implementation=arm64-vct persecond=1800000000
info: backend=neon num_trials=1000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,65.6,32.5,31.8,28.7,28.2
sha3_256,32,65.6,32.5,31.9,30.8,30.0
sha3_384,48,65.6,48.0,39.7,39.5,39.0
sha3_512,64,68.0,63.9,59.1,56.0,55.9
shake128,32,65.6,32.5,28.4,25.4,24.8
shake256,32,65.6,32.5,31.6,30.5,29.7
|
i verified that gcc (at least) does constant propagation and inlines
permute_n_<backend>, and that this change does not affect performance.
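the pattern described above can be sketched in scalar C: one generic round loop that takes the round count as a parameter, plus thin wrappers that pass a constant. permute_n(), permute24(), and permute12() here are hypothetical stand-ins for the permute_n_<backend> functions, and the compact table-driven round body is an assumption, not the library's code. whether the wrappers get inlined and specialized is up to the compiler.

```c
#include <stdint.h>

// Keccak-f[1600] round constants (iota step)
static const uint64_t RCS[24] = {
  0x0000000000000001ULL, 0x0000000000008082ULL, 0x800000000000808aULL,
  0x8000000080008000ULL, 0x000000000000808bULL, 0x0000000080000001ULL,
  0x8000000080008081ULL, 0x8000000000008009ULL, 0x000000000000008aULL,
  0x0000000000000088ULL, 0x0000000080008009ULL, 0x000000008000000aULL,
  0x000000008000808bULL, 0x800000000000008bULL, 0x8000000000008089ULL,
  0x8000000000008003ULL, 0x8000000000008002ULL, 0x8000000000000080ULL,
  0x000000000000800aULL, 0x800000008000000aULL, 0x8000000080008081ULL,
  0x8000000000008080ULL, 0x0000000080000001ULL, 0x8000000080008008ULL,
};

// rho rotation amounts and pi destinations, in rho/pi chain order
static const uint8_t ROTS[24] = {
  1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 2, 14,
  27, 41, 56, 8, 25, 43, 62, 18, 39, 61, 20, 44,
};
static const uint8_t PIS[24] = {
  10, 7, 11, 17, 18, 3, 5, 16, 8, 21, 24, 4,
  15, 23, 19, 13, 12, 2, 20, 14, 22, 9, 6, 1,
};

// rotate left; n must be in 1..63 (true for every entry in ROTS)
static inline uint64_t rol64(const uint64_t v, const unsigned n) {
  return (v << n) | (v >> (64 - n));
}

// generic permutation: runs the last num_rounds rounds of Keccak-f[1600]
static inline void permute_n(uint64_t a[25], const unsigned num_rounds) {
  for (unsigned r = 24 - num_rounds; r < 24; r++) {
    uint64_t c[5], t;

    // theta
    for (int i = 0; i < 5; i++) {
      c[i] = a[i] ^ a[i + 5] ^ a[i + 10] ^ a[i + 15] ^ a[i + 20];
    }
    for (int i = 0; i < 5; i++) {
      t = c[(i + 4) % 5] ^ rol64(c[(i + 1) % 5], 1);
      for (int j = 0; j < 25; j += 5) a[j + i] ^= t;
    }

    // rho and pi
    t = a[1];
    for (int i = 0; i < 24; i++) {
      const uint64_t tmp = a[PIS[i]];
      a[PIS[i]] = rol64(t, ROTS[i]);
      t = tmp;
    }

    // chi
    for (int j = 0; j < 25; j += 5) {
      for (int i = 0; i < 5; i++) c[i] = a[j + i];
      for (int i = 0; i < 5; i++) a[j + i] ^= ~c[(i + 1) % 5] & c[(i + 2) % 5];
    }

    // iota
    a[0] ^= RCS[r];
  }
}

// wrappers pass a constant round count, so an inlining compiler can specialize the loop
static void permute24(uint64_t a[25]) { permute_n(a, 24); }
static void permute12(uint64_t a[25]) { permute_n(a, 12); }
```

comparing the generated assembly for permute24() and permute12() (for example with gcc -O2 -S) is one way to confirm that the round count was propagated, alongside the before/after bench results below.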
bench results, before:
pabs@flex:~/git/sha3/tests/bench> ./bench
info: cpucycles: version=20240318 implementation=amd64-pmc persecond=4800000000
info: backend=avx512 num_trials=100000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,15.4,7.8,7.8,7.1,7.0
sha3_256,32,15.5,7.8,7.8,7.6,7.4
sha3_384,48,15.5,11.7,9.8,9.8,9.7
sha3_512,64,15.6,15.5,14.6,13.9,13.9
shake128,32,15.5,7.8,6.9,6.2,6.1
shake256,32,15.5,7.8,7.9,7.6,7.4
bench results, after change:
pabs@flex:~/git/sha3/tests/bench> ./bench
info: cpucycles: version=20240318 implementation=amd64-pmc persecond=4800000000
info: backend=avx512 num_trials=100000 src_lens=64,256,1024,4096,16384 dst_lens=32
function,dst_len,64,256,1024,4096,16384
sha3_224,28,15.4,7.8,7.8,7.1,7.0
sha3_256,32,15.6,7.8,7.8,7.6,7.4
sha3_384,48,15.6,11.7,9.8,9.8,9.7
sha3_512,64,15.6,15.5,14.6,13.8,13.8
shake128,32,15.6,7.9,6.9,6.2,6.1
shake256,32,15.7,7.9,7.9,7.6,7.4
|
permute{,12}_{scalar,avx512}() to use them
|
cleanups
|
|
from output, use defines for src/dst lengths
|
permute_{scalar,avx512}(), hard-code num_rounds to 24 in permute_{scalar,avx512}(), add permute12_{scalar,avx512}(), absorb12(), and xof12_{init,absorb_raw,absorb,squeeze_raw,squeeze,once}(), update turboshake to use xof12_*(), move permute tests to PERMUTE_TESTS static array, rename test_permute() to test_permute_scalar(), add test_permute_avx512(), add PERMUTE12_TESTS and test_permute12_{scalar,avx512}()
|
compiler flags
|
improve comments
|