March 24, 2011
OpenSSL: Outmoded Assembly

When I see software that provides assembly routines for my architecture, my gut reaction is to enable them. They must be faster! Why else would someone have taken the time to descend to the lowest levels of coding hell?

It’s not a magic bullet though. Assembly is still written by a programmer, and it still suffers from bit rot just like any other code. Its platform specificity can magnify the effect; you may be running the code on a platform that has a compatible instruction set but which needs totally different instructions to properly optimize the same operation.

The problem

While running OpenSSL benchmarks, I observed some curious behavior. OpenSSL on newer Xeon processors is as much as 50% slower with asm than without (for certain ciphers).

For reasons addressed elsewhere, I’m mainly interested in just three algorithms:

  • aes-128-cbc
  • rc4
  • rsa-1024

It turns out that this problem affects only RC4 and AES, which is unlucky, since those are two of the three algorithms I care about.

OpenSSL asm v. C

I used openssl-1.0.0d for all of my testing, although I found similar results with openssl-0.9.8r. All benchmark results come from openssl speed, running on a single core in every case.

Is it the compiler?

At first I wondered if it was just a matter of a newer compiler toolchain producing better code than the pseudo-hand-written asm that OpenSSL originally acquired years ago. Since I’m beholden to RHEL at work, I ran some tests[1] with versions of OpenSSL built on RHEL 4 and RHEL 5[2], which have very different versions of the common toolchain.

RHEL 4 v. 5

RHEL 4
asm:    aes-128-cbc 85,513.56k
no-asm: aes-128-cbc 146,107.05k
RHEL 5
asm:    aes-128-cbc 85,284.18k
no-asm: aes-128-cbc 167,597.40k

A 15% improvement in the C build from the newer toolchain isn’t bad, but since I would normally build OpenSSL with asm enabled, I would never see that gain for AES. asm performance is essentially the same on both releases, and is far lower than the C version. Something else is at issue here.
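
The arithmetic behind that 15% figure can be checked with a few lines of Python. This is only a sketch: the helper name percent_change is my own, and the numbers are copied from the RHEL 4 and RHEL 5 results above.

```python
def percent_change(old, new):
    """Relative change from old to new, expressed in percent."""
    return (new - old) / old * 100.0


# aes-128-cbc throughput in thousands of bytes per second,
# copied from the RHEL 4 v. 5 figures above.
rhel4 = {"asm": 85513.56, "no-asm": 146107.05}
rhel5 = {"asm": 85284.18, "no-asm": 167597.40}

for build in ("asm", "no-asm"):
    change = percent_change(rhel4[build], rhel5[build])
    print(f"{build}: {change:+.1f}% going from RHEL 4 to RHEL 5")
```

The no-asm build gains roughly 15% from the newer toolchain, while the asm build barely moves, which is what you would expect if the compiler never touches the hot loop.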

Architectural changes

x86_64 processors keep the same instruction set (and occasionally add new instructions like AESNI), but that doesn’t mean they keep the same performance characteristics forever. Code optimized for an Opteron 270 will probably still run on a Xeon X5560 five years later, but it isn’t necessarily optimal anymore. It can be a bit of an arms race to keep up with the latest hardware quirks.

I ran openssl speed[3] across three hosts with different Xeon processors: Harpertown (5400 series), Nehalem (5500 series), and Westmere (5600 series). This revealed that some change in the CPU architecture from Nehalem on caused OpenSSL’s assembly code to become less efficient than the GCC-compiled C code, but only for AES and RC4 (which happen to be the ciphers I care about).

Every other cipher continues to perform about as well with asm as without, or better (the one exception in the results below is Camellia at 16-byte blocks), and many operations (such as RSA) are substantially faster.

Results

openssl-1.0.0d with asm on Westmere
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md5              25733.55k    85951.91k   234419.20k   410698.07k   522136.23k
hmac(md5)        40607.96k   126173.42k   297999.27k   451871.74k   531436.89k
sha1             24997.02k    77700.48k   184074.50k   283976.43k   335233.02k
rmd160           20760.18k    58180.05k   124640.68k   173745.83k   196400.47k
rc4             283818.19k   308370.67k   261646.51k   263007.91k   262647.50k
des cbc          44095.51k    45725.21k    46270.38k    46344.53k    46388.57k
des ede3         17276.04k    17488.11k    17485.82k    17500.50k    17484.46k
seed cbc         45346.65k    45829.25k    45666.74k    45812.39k    45809.66k
blowfish cbc     89570.52k    93845.12k    95046.06k    95244.63k    94516.57k
cast cbc         77486.21k    80917.33k    81672.87k    81765.03k    81862.66k
aes-128 cbc      78122.06k    83396.93k    84696.03k    84403.54k    85325.14k
aes-192 cbc      66005.65k    69767.59k    70664.70k    70910.29k    71065.60k
aes-256 cbc      57154.74k    60040.38k    60761.86k    60856.32k    60959.40k
camellia-128 cbc    76557.16k   106521.11k   120180.99k   124169.90k   125476.86k
camellia-192 cbc    63078.86k    83880.51k    91553.11k    93668.01k    94090.97k
camellia-256 cbc    63201.99k    83897.11k    91556.78k    93672.79k    94401.88k
sha256           32503.04k    69011.39k   114596.78k   136929.28k   145926.83k
sha512           24876.78k    99736.85k   145548.37k   200672.83k   226355.88k
whirlpool        19835.40k    41370.62k    69102.45k    80712.29k    75937.11k
                  sign    verify    sign/s verify/s
rsa  512 bits 0.000114s 0.000009s   8804.9 105697.7
rsa 1024 bits 0.000572s 0.000029s   1749.3  34239.9
rsa 2048 bits 0.003495s 0.000101s    286.1   9882.1
rsa 4096 bits 0.024595s 0.000386s     40.7   2590.2

openssl-1.0.0d without asm on Westmere
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md5              18629.85k    69272.04k   192093.38k   350936.41k   459634.22k
hmac(md5)        36383.35k   113622.20k   266558.29k   401562.97k   468322.99k
sha1             19238.83k    61410.09k   152286.46k   242361.34k   293481.13k
rmd160           17462.79k    51741.31k   116493.57k   170108.25k   196741.80k
rc4             324468.11k   349412.22k   357622.95k   360358.19k   366261.59k
des cbc          44055.78k    46050.52k    46879.49k    46962.35k    47125.85k
des ede3         17255.43k    17489.71k    17535.40k    17550.34k    17502.58k
seed cbc         45675.97k    45870.89k    45715.14k    45889.54k    45886.12k
blowfish cbc     89574.70k    93861.93k    95049.98k    94363.13k    94524.76k
cast cbc         78034.80k    80866.50k    81670.31k    81606.31k    81920.00k
aes-128 cbc     158038.23k   165169.24k   167165.95k   167066.62k   167766.70k
aes-192 cbc     137546.14k   142435.09k   143960.83k   143626.92k   144209.24k
aes-256 cbc     122027.35k   125287.85k   126534.74k   126574.93k   126872.23k
camellia-128 cbc    93205.31k    95191.42k    95708.93k    95844.69k    95690.75k
camellia-192 cbc    72242.07k    73267.86k    73551.62k    73626.97k    73385.08k
camellia-256 cbc    72224.74k    73108.46k    73555.54k    73559.04k    73594.20k
sha256           19389.61k    45023.25k    79790.42k    98912.94k   106834.60k
sha512           14506.04k    57937.00k    98487.38k   144708.55k   169426.94k
whirlpool        13411.79k    27530.37k    44540.43k    52997.12k    56243.54k
                  sign    verify    sign/s verify/s
rsa  512 bits 0.000271s 0.000020s   3688.9  48818.2
rsa 1024 bits 0.001408s 0.000064s    710.1  15543.2
rsa 2048 bits 0.008497s 0.000223s    117.7   4476.0
rsa 4096 bits 0.054863s 0.000788s     18.2   1269.7
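
Comparing the two tables above by eye is tedious, so here is a rough Python sketch that parses the throughput rows and reports how asm fares relative to no-asm. The function names parse_speed and relative_percent are hypothetical, the regular expression only assumes the row layout shown above (a name followed by values ending in "k"), and the sample rows are copied from the Westmere results.

```python
import re

# A cipher/digest row: a name (which may contain spaces) followed by
# one or more throughput values that each end in "k".
ROW = re.compile(r"^(?P<name>\S.*?)\s+(?P<vals>(?:[\d.]+k\s*)+)$")


def parse_speed(text):
    """Map each algorithm name to its list of throughput values."""
    results = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:  # header and RSA rows do not match and are skipped
            values = match.group("vals").split()
            results[match.group("name")] = [float(v.rstrip("k")) for v in values]
    return results


def relative_percent(asm, noasm):
    """How much faster (+) or slower (-) asm is than no-asm, in percent."""
    return (asm - noasm) / noasm * 100.0


ASM_SAMPLE = """
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
rc4             283818.19k   308370.67k   261646.51k   263007.91k   262647.50k
aes-128 cbc      78122.06k    83396.93k    84696.03k    84403.54k    85325.14k
sha256           32503.04k    69011.39k   114596.78k   136929.28k   145926.83k
"""

NOASM_SAMPLE = """
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
rc4             324468.11k   349412.22k   357622.95k   360358.19k   366261.59k
aes-128 cbc     158038.23k   165169.24k   167165.95k   167066.62k   167766.70k
sha256           19389.61k    45023.25k    79790.42k    98912.94k   106834.60k
"""

asm = parse_speed(ASM_SAMPLE)
noasm = parse_speed(NOASM_SAMPLE)
for name in asm:
    # Compare the last column (8192-byte blocks).
    delta = relative_percent(asm[name][-1], noasm[name][-1])
    print(f"{name}: asm is {delta:+.1f}% relative to no-asm at 8192 bytes")
```

On these rows, RC4 and AES-128 come out clearly negative and SHA-256 clearly positive, which matches the pattern described above.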

AESNI

AESNI is a partial solution, since it supersedes the AES asm code on supported CPUs (Westmere and newer). But that doesn’t help out poor Nehalem, who arrived before that instruction set. AESNI is also a huge performance gain that makes AES easily the preferred SSL cipher if you have hardware that supports the instruction set.
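
To find out whether a given host has AESNI, one option on Linux is to look for the aes flag in /proc/cpuinfo. The following Python sketch assumes an x86 Linux host; the function name has_aesni is my own, and it only inspects the first flags line it finds.

```python
def has_aesni(cpuinfo_text):
    """Return True if the first 'flags' line lists the 'aes' CPU flag."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            # Split into whole words so that a longer flag containing
            # "aes" as a substring is not mistaken for AESNI.
            return "aes" in value.split()
    return False


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            print("AESNI available:", has_aesni(handle.read()))
    except OSError:
        print("could not read /proc/cpuinfo (probably not Linux)")
```

Note that this only tells you what the CPU supports. Whether your OpenSSL build actually uses AESNI is a separate question.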

32-bit

I didn’t test as thoroughly with a 32-bit openssl, but asm vs no-asm on a Westmere produced:

RC4: 32% slower
AES-128: 5.6% slower

32-bit OpenSSL is much slower for most operations than 64-bit OpenSSL, so if at all possible you should be using a 64-bit OS and application.

Conclusion

Since just wholesale disabling asm hurts us badly[4], the ideal solution is to either fine-tune OpenSSL’s asm for these processors or to selectively disable it when the C solution produces better results (such as by consulting the cpuid level).

One of the problem cases (RC4 on x86_64) has been solved in OpenSSL already and a patch is available[5].

The other three problem cases (RC4 on 586, AES on x86_64 and AES on 586) have no solutions at present. The best option right now is to disable asm for these ciphers using this patch to Configure. This has a relatively modest negative performance impact on pre-Nehalem Xeons, but significantly improves performance on Nehalem and later Xeons.

You can also disable asm entirely by configuring a build with:

$ ./config no-asm

although as mentioned, this is a bit overkill since there are other algorithms which still benefit from asm optimizations.

You can tell how openssl was built with:

$ openssl version -f -p

platform: linux-x86_64
compiler: gcc -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H
-m64 -DL_ENDIAN -DTERMIO -O3 -Wall -DMD32_REG_T=int -DOPENSSL_IA32_SSE2
-DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM
-DWHIRLPOOL_ASM

The *_ASM defines in this output (-DSHA1_ASM, -DMD5_ASM and so on) suggest that asm was enabled when this build was made.
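
If you need to check many hosts, the same inspection can be scripted. This Python sketch pulls the asm-related defines out of a compiler line like the one above. The function name asm_defines is hypothetical, and treating "no *_ASM defines" as "built without asm" is a heuristic of mine, not something OpenSSL guarantees.

```python
import re

# The compiler line from the output above, joined into one string.
SAMPLE = (
    "compiler: gcc -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H "
    "-m64 -DL_ENDIAN -DTERMIO -O3 -Wall -DMD32_REG_T=int -DOPENSSL_IA32_SSE2 "
    "-DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM "
    "-DWHIRLPOOL_ASM"
)


def asm_defines(compiler_line):
    """List the -D macros whose names contain _ASM."""
    names = re.findall(r"-D(\w*_ASM\w*)", compiler_line)
    # Defensive: a macro such as OPENSSL_NO_ASM, if it appears, signals
    # the opposite of what we are looking for, so leave it out.
    return [name for name in names if not name.startswith("OPENSSL_NO_")]


found = asm_defines(SAMPLE)
if found:
    print("asm appears to be enabled:", ", ".join(found))
else:
    print("no asm defines found; this may be a no-asm build")
```

To run it against a live system, feed it the output of openssl version -f instead of the sample string.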


  1. $ openssl speed -evp aes-128-cbc 

  2. gcc 3.4.6 and 4.1.2 respectively 

  3. $ openssl speed 

  4. RSA performance is 58% slower in my tests without asm 

  5. If you don’t apply this patch, you should also remove ‘rc4-x86_64.o’ from line 130 in Configure in addition to the other changes in openssl-1.0.0d-noasm_aes_rc4.patch 

  1. aesthma posted this