When I see software that provides assembly implementations for my architecture, my gut reaction is to enable them. They must be faster! Why else would someone have taken the time to descend to the lowest levels of coding hell?
It’s not a magic bullet though. Assembly is still written by a programmer, and it still suffers from bit rot just like any other code. Its platform specificity can magnify the effect; you may be running the code on a platform that has a compatible instruction set but which needs totally different instructions to properly optimize the same operation.
While running OpenSSL benchmarks, I observed some curious behavior. OpenSSL on newer Xeon processors is as much as 50% slower with asm than without (for certain ciphers).
For reasons addressed elsewhere, I’m mainly interested in just three algorithms:
It turns out that this problem affects only RC4 and AES, which is unlucky for me.
I used openssl-1.0.0d for all of my testing, although I found similar results with openssl-0.9.8r. openssl speed is the source of the benchmark results, running on just a single core in all cases.
Is it the compiler?
At first I wondered if it was just a matter of a newer compiler toolchain producing better code than the pseudo-hand-written asm that OpenSSL originally acquired years ago. Since I’m beholden to RHEL at work, I ran some tests[1] with versions of OpenSSL built on RHEL 4 and 5[2], which have very different versions of the common toolchain.
RHEL 4 (gcc 3.4.6)
asm:    aes-128-cbc  85,513.56k
no-asm: aes-128-cbc 146,107.05k
RHEL 5 (gcc 4.1.2)
asm:    aes-128-cbc  85,284.18k
no-asm: aes-128-cbc 167,597.40k
The C code gains about 15% from the newer toolchain, which isn’t bad, but since I would normally build openssl with asm enabled I would never see that gain for AES. asm performance is essentially the same between the two builds, and is far lower than the C version. Something else is at issue here.
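The percentages here follow directly from the throughput figures. As a quick sanity check, here is a minimal sketch of the arithmetic; the helper name pct_change is my own, and it assumes awk is available and that the thousands separators have been stripped from the numbers:

```shell
# pct_change OLD NEW: print the relative change from OLD to NEW, in percent.
pct_change() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (b - a) / a * 100 }'
}

# aes-128-cbc throughput (thousands of bytes per second) quoted above
pct_change 146107.05 167597.40   # no-asm, first build vs second: 14.7
pct_change 85513.56 85284.18     # asm, first build vs second: -0.3
pct_change 167597.40 85284.18    # second build, no-asm vs asm: -49.1
```

The last line is where the "as much as 50% slower" figure comes from.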
x86_64 processors keep the same base instruction set (and occasionally add new instructions such as AESNI), but that doesn’t mean they keep the same performance characteristics forever. Optimized code for an Opteron 270 will probably still run on a Xeon X5560 five years later, but it isn’t necessarily optimized anymore. It can be a bit of an arms race to keep up with the latest hardware quirks.
I ran openssl speed[3] across three hosts with different Xeon processors: Harpertown (5400 series), Nehalem (5500 series), Westmere (5600 series). This revealed that some change in the CPU architecture from Nehalem on caused OpenSSL’s assembly code to become less efficient than the GCC compiled C code, but only for AES and RC4 (which coincidentally are the ciphers I care about).
Every other algorithm continues to perform at least as well with asm as without, and many of them (such as RSA) are substantially faster.
openssl-1.0.0d with asm on Westmere
type                 16 bytes    64 bytes   256 bytes  1024 bytes  8192 bytes
md5                 25733.55k   85951.91k  234419.20k  410698.07k  522136.23k
hmac(md5)           40607.96k  126173.42k  297999.27k  451871.74k  531436.89k
sha1                24997.02k   77700.48k  184074.50k  283976.43k  335233.02k
rmd160              20760.18k   58180.05k  124640.68k  173745.83k  196400.47k
rc4                283818.19k  308370.67k  261646.51k  263007.91k  262647.50k
des cbc             44095.51k   45725.21k   46270.38k   46344.53k   46388.57k
des ede3            17276.04k   17488.11k   17485.82k   17500.50k   17484.46k
seed cbc            45346.65k   45829.25k   45666.74k   45812.39k   45809.66k
blowfish cbc        89570.52k   93845.12k   95046.06k   95244.63k   94516.57k
cast cbc            77486.21k   80917.33k   81672.87k   81765.03k   81862.66k
aes-128 cbc         78122.06k   83396.93k   84696.03k   84403.54k   85325.14k
aes-192 cbc         66005.65k   69767.59k   70664.70k   70910.29k   71065.60k
aes-256 cbc         57154.74k   60040.38k   60761.86k   60856.32k   60959.40k
camellia-128 cbc    76557.16k  106521.11k  120180.99k  124169.90k  125476.86k
camellia-192 cbc    63078.86k   83880.51k   91553.11k   93668.01k   94090.97k
camellia-256 cbc    63201.99k   83897.11k   91556.78k   93672.79k   94401.88k
sha256              32503.04k   69011.39k  114596.78k  136929.28k  145926.83k
sha512              24876.78k   99736.85k  145548.37k  200672.83k  226355.88k
whirlpool           19835.40k   41370.62k   69102.45k   80712.29k   75937.11k
                       sign     verify     sign/s   verify/s
rsa  512 bits     0.000114s  0.000009s     8804.9   105697.7
rsa 1024 bits     0.000572s  0.000029s     1749.3    34239.9
rsa 2048 bits     0.003495s  0.000101s      286.1     9882.1
rsa 4096 bits     0.024595s  0.000386s       40.7     2590.2
openssl-1.0.0d without asm on Westmere
type                 16 bytes    64 bytes   256 bytes  1024 bytes  8192 bytes
md5                 18629.85k   69272.04k  192093.38k  350936.41k  459634.22k
hmac(md5)           36383.35k  113622.20k  266558.29k  401562.97k  468322.99k
sha1                19238.83k   61410.09k  152286.46k  242361.34k  293481.13k
rmd160              17462.79k   51741.31k  116493.57k  170108.25k  196741.80k
rc4                324468.11k  349412.22k  357622.95k  360358.19k  366261.59k
des cbc             44055.78k   46050.52k   46879.49k   46962.35k   47125.85k
des ede3            17255.43k   17489.71k   17535.40k   17550.34k   17502.58k
seed cbc            45675.97k   45870.89k   45715.14k   45889.54k   45886.12k
blowfish cbc        89574.70k   93861.93k   95049.98k   94363.13k   94524.76k
cast cbc            78034.80k   80866.50k   81670.31k   81606.31k   81920.00k
aes-128 cbc        158038.23k  165169.24k  167165.95k  167066.62k  167766.70k
aes-192 cbc        137546.14k  142435.09k  143960.83k  143626.92k  144209.24k
aes-256 cbc        122027.35k  125287.85k  126534.74k  126574.93k  126872.23k
camellia-128 cbc    93205.31k   95191.42k   95708.93k   95844.69k   95690.75k
camellia-192 cbc    72242.07k   73267.86k   73551.62k   73626.97k   73385.08k
camellia-256 cbc    72224.74k   73108.46k   73555.54k   73559.04k   73594.20k
sha256              19389.61k   45023.25k   79790.42k   98912.94k  106834.60k
sha512              14506.04k   57937.00k   98487.38k  144708.55k  169426.94k
whirlpool           13411.79k   27530.37k   44540.43k   52997.12k   56243.54k
                       sign     verify     sign/s   verify/s
rsa  512 bits     0.000271s  0.000020s     3688.9    48818.2
rsa 1024 bits     0.001408s  0.000064s      710.1    15543.2
rsa 2048 bits     0.008497s  0.000223s      117.7     4476.0
rsa 4096 bits     0.054863s  0.000788s       18.2     1269.7
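Eyeballing two tables of this size is tedious, so here is a rough sketch of how the comparison could be automated. It assumes each build’s openssl speed output has been saved to a file; the function name compare_speed and the file names asm.txt and noasm.txt are placeholders of mine. It only looks at the 8192 byte column and skips the RSA rows.

```shell
# compare_speed ASM_FILE NOASM_FILE
# For each algorithm present in both files, print how the asm build's
# 8192-byte throughput differs from the no-asm build's, in percent
# (negative means the asm build is slower).
compare_speed() {
    awk '
        # Throughput rows end in five "NNN.NNk" columns; the header and
        # the RSA rows do not end in such a value and are ignored.
        NF >= 6 && $NF ~ /^[0-9.]+k$/ {
            name = $1
            for (i = 2; i <= NF - 5; i++) name = name " " $i
            rate = $NF + 0               # "85325.14k" becomes 85325.14
            if (FNR == NR) {
                asm[name] = rate         # first file: remember the asm rate
            } else if (name in asm) {
                printf "%-18s %+6.1f%%\n", name, (asm[name] - rate) / rate * 100
            }
        }
    ' "$1" "$2"
}

# Example use with hypothetical file names:
#   compare_speed asm.txt noasm.txt
```

Run against the two Westmere tables above, this should flag rc4 and the three aes rows as clearly negative, while the hashes come out positive.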
AESNI is a partial solution, since it supersedes the AES asm code on supported CPUs (Westmere and newer). But that doesn’t help out poor Nehalem, who arrived before that instruction set. AESNI is also a huge performance gain that makes AES easily the preferred SSL cipher if you have hardware that supports the instruction set.
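To find out which side of that line a given host falls on, one option on Linux is to look for the aes flag in /proc/cpuinfo. This is only a sketch: the function name has_aesni is my own, and it assumes a Linux-style cpuinfo file with a "flags" line.

```shell
# has_aesni [CPUINFO_FILE]
# Exit 0 if the first "flags" line lists the "aes" CPU flag, non-zero
# otherwise. Linux-specific: defaults to /proc/cpuinfo.
has_aesni() {
    awk -F: '
        /^flags/ {
            n = split($2, flag, " ")
            for (i = 1; i <= n; i++) if (flag[i] == "aes") found = 1
            exit                      # one CPU entry is enough
        }
        END { exit !found }
    ' "${1:-/proc/cpuinfo}" 2>/dev/null
}

if has_aesni; then
    echo "aes flag present: this CPU reports AES-NI"
else
    echo "aes flag not found: no AES-NI reported (or no readable cpuinfo)"
fi
```

The flag is matched as a whole word, so similarly named flags do not count as a match.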
I didn’t test as thoroughly with a 32-bit openssl, but asm vs no-asm on a Westmere produced:
RC4: 32% slower
AES-128: 5.6% slower
32-bit OpenSSL is much slower for most operations than 64-bit OpenSSL, so if at all possible you should be using a 64-bit OS and application.
Since just wholesale disabling asm hurts us badly[4], the ideal solution is to either fine-tune OpenSSL’s asm for these processors or to selectively disable it when the C solution produces better results (such as by consulting the cpuid level).
The other three problem cases (RC4 on 586, AES on x86_64 and AES on 586) have no solutions at present. The best option right now is to disable asm for these ciphers using this patch to Configure. This has a relatively modest negative performance impact on pre-Nehalem Xeons, but significantly improves performance on Nehalem and later Xeons.
You can also disable asm entirely by configuring a build with:
$ ./config no-asm
although as mentioned, this is a bit overkill since there are other algorithms which still benefit from asm optimizations.
You can tell how openssl was built with:
$ openssl version -f -p
platform: linux-x86_64
compiler: gcc -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -m64 -DL_ENDIAN -DTERMIO -O3 -Wall -DMD32_REG_T=int -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DWHIRLPOOL_ASM
The *_ASM defines on the compiler line (such as -DSHA1_ASM and -DMD5_ASM) suggest that asm was enabled when this build was made.
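If you need to check several hosts, picking the asm-related defines out of that long compiler line by eye gets old quickly. Here is a small sketch that filters them; the function name asm_defines is my own, and which defines show up will depend on the OpenSSL version and platform.

```shell
# asm_defines: read `openssl version -f` output on stdin and print every
# -D..._ASM define it contains, one per line. Exits non-zero if none are found.
asm_defines() {
    tr ' ' '\n' | grep -E '^-D[A-Z0-9_]*_ASM'
}

# Demo on a shortened copy of the compiler line shown above
printf '%s\n' 'compiler: gcc -m64 -O3 -Wall -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DMD5_ASM' \
    | asm_defines

# Against an installed openssl:
#   openssl version -f | asm_defines || echo "no asm defines found"
```

An empty result is a hint that the build was configured with no-asm, although I would still confirm that with a quick benchmark.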
[1] $ openssl speed -evp aes-128-cbc
[2] gcc 3.4.6 and 4.1.2 respectively
[3] $ openssl speed
[4] RSA performance is 58% slower in my tests without asm
[5] If you don’t apply this patch, you should also remove ‘rc4-x86_64.o’ from line 130 in Configure in addition to the other changes in openssl-1.0.0d-noasm_aes_rc4.patch