Red Hat Bugzilla – Bug 857035
PRD32 - cache ssl handshakes with the hosts to reduce cpu load
Last modified: 2016-02-10 14:43:50 EST
rhevm is installed on a bare-metal machine with an AMD Opteron 4170 HE processor and 32GB RAM.
I installed 224 hosts: 200 of them are fake hosts and the rest are bare-metal hosts, one of which runs the 200 VMs that serve as the fake hosts. This scale is entirely reasonable even in rhevm 3.0.
rhevm's java process consumes a lot of CPU (~35-75% of the total), and the Java thread dump shows that most of the busy threads are in the sun.security.rsa.RSACore.crtCrypt method, which is used for the SSL handshake with the hosts.
rhevm and webadmin are so slow that it's not possible to work with them!
Example of a thread that does the ssl handshake:
"pool-4-thread-2837" prio=10 tid=0x00007f4bc8004000 nid=0x39c runnable [0x00007f4b7c6c4000]
- locked <0x00000000ec66ee18> (a sun.security.ssl.SSLSocketImpl)
- locked <0x00000000ec66efe8> (a java.lang.Object)
- locked <0x00000000ec66efa8> (a java.lang.Object)
- locked <0x00000000eef8ca68> (a sun.security.ssl.AppOutputStream)
- locked <0x00000000eef91f48> (a java.io.BufferedOutputStream)
at $Proxy32.list(Unknown Source)
at sun.reflect.GeneratedMethodAccessor91.invoke(Unknown Source)
After trying several things, the only "simple" way I found to improve this is to start using OpenSSL in a way that allows session caching. This is the change I am proposing to do that in VDSM:
This has to be combined with another change in the engine to make sure that we use one SSL context per host. I am working on that.
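The patch itself isn't quoted here, and it uses pyOpenSSL (later M2Crypto) rather than the standard library. As a hedged illustration of the technique only, not the actual VDSM change: Python's stdlib `ssl` module maintains exactly this kind of server-side session cache on an `SSLContext`, and exposes its counters, so repeat connections can resume a session instead of paying for a full RSA handshake.

```python
import ssl

# Illustration of server-side SSL session caching (an analogy for the VDSM
# patch, which uses pyOpenSSL/M2Crypto, not the stdlib ssl module).
# A server-side SSLContext keeps an internal session cache; OpenSSL counts
# cache hits, misses, timeouts, etc., which session_stats() exposes.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
stats = ctx.session_stats()

# On a freshly created context no sessions have been cached or resumed yet.
print("hits:", stats["hits"], "misses:", stats["misses"])
```

After real clients connect and reconnect, a growing `hits` counter is the sign that handshakes are being reused instead of renegotiated from scratch.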
After studying how the engine manages the SSL connections with the hosts I found that it creates only one SSL context that is shared by all the hosts. However this context (the SSLContext class, part of the JDK) manages a cache of sessions indexed by host name and port number. I verified that with the VDSM change proposed in comment 8 the engine automatically reuses the handshakes with the hosts, so no additional changes are required in the engine.
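The behavior described above, one shared SSLContext whose session cache is indexed by host name and port, can be modeled with a toy sketch (a deliberate simplification for illustration, not the JDK code): reconnecting to a known (host, port) pair returns the cached session, while a new host forces a fresh handshake.

```python
# Toy model of the JDK SSLContext client session cache described above:
# sessions are keyed by (host, port), so one shared context reuses the
# handshake per host. Simplified illustration only, not the real JSSE code.
class SessionCache:
    def __init__(self):
        self._sessions = {}

    def get(self, host, port):
        # Returns a cached session for this host:port, or None (meaning a
        # full handshake is needed).
        return self._sessions.get((host, port))

    def put(self, host, port, session):
        self._sessions[(host, port)] = session

cache = SessionCache()
cache.put("h0.example.com", 54321, "session-A")
reused = cache.get("h0.example.com", 54321)   # cache hit: reuse handshake
fresh = cache.get("h1.example.com", 54321)    # cache miss: full handshake
```

This is why no engine change was needed: as soon as VDSM started accepting resumed sessions, the per-(host, port) cache on the engine side kicked in automatically.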
Rami did tests with and without SSL, and the results clearly show a very significant difference in CPU usage. I took a sample (using sysstat) of the first hour after the installation, and it shows that the system uses 150% of the CPU with SSL disabled and 300% with SSL enabled. We all know that some overhead is expected, but having half of the CPU dedicated to SSL doesn't look right to me. See the attached graph.
The stack traces that Rami collected again show many threads doing RSA-related computations, so I think it is worth trying to reduce the impact of SSL on CPU usage. Using a session cache as proposed in the patch is a step in that direction. I don't have hard data to compare the situation with/without the cache; I will try to run some simple benchmarks.
Rami, what is your "feeling" about the environment with SSL disabled? Is it more responsive?
Created attachment 624914 [details]
Use of CPU with and without SSL
Juan, thanks for the results analysis!
My "feeling": hard to say, as I had problems connecting the storage to the 200 fake hosts, so I haven't done anything useful with the env. But judging by the environment removal, I think it was faster and smoother; the change was also noticeable while switching tabs in webadmin.
Created attachment 624961 [details]
Java threads dumps + timestamps
Tested on 3.1 si20.
I've tested the proposed ssl cache patch, which achieves impressive results: it reduces CPU usage by ~40% compared to the current ssl implementation. I'll attach the charts.
Note that the environment (200 fake hosts) with this patch uses twice as much kernel-space CPU time as the current ssl implementation, but the patch is very economical in user-space CPU time.
New errors (afaik) also arose while using the patch. Juan, could you please tell whether they might be related to the patch?
Thread-66272::ERROR::2012-11-02 20:10:36,815::SecureXMLRPCServer::210::root::(handle_error) client ('10.35.160.93', 38849)
Traceback (most recent call last):
File "/usr/lib64/python2.6/SocketServer.py", line 560, in process_request_thread
File "/usr/lib64/python2.6/site-packages/vdsm/SecureXMLRPCServer.py", line 207, in finish_request
File "/usr/lib64/python2.6/SocketServer.py", line 322, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/usr/lib64/python2.6/SocketServer.py", line 617, in __init__
File "/usr/lib64/python2.6/BaseHTTPServer.py", line 329, in handle
File "/usr/lib64/python2.6/BaseHTTPServer.py", line 323, in handle_one_request
File "/usr/lib64/python2.6/SimpleXMLRPCServer.py", line 498, in do_POST
File "/usr/lib64/python2.6/site-packages/vdsm/SecureXMLRPCServer.py", line 176, in shutdown
File "<string>", line 1, in shutdown
error: [Errno 107] Transport endpoint is not connected
libvirtEventLoop::ERROR::2012-11-02 17:43:18,423::guestIF::188::vm.Vm::(desktopLock) vmId=`d60b0a9d-7032-468e-8942-0c8e609b2c49`::desktopLock failed
Traceback (most recent call last):
File "/usr/share/vdsm/guestIF.py", line 186, in desktopLock
File "/usr/share/vdsm/guestIF.py", line 93, in _forward
error: [Errno 107] Transport endpoint is not connected
Created attachment 640591 [details]
CPU usage charts of ssl vs cached_ssl + java threads dump
Thanks for the analysis Rami.
The first error can be related to the patch as it happens in one of the lines that it adds. Do you know what operation is triggering it? Can you provide a complete log file containing it?
The second error I think is unrelated.
Created attachment 643974 [details]
In the log provided in comment #18 I can't find any reason for the errors described in comment #14.
However, I have been asked to replace pyOpenSSL with M2Crypto in the implementation, which is a major change and I think it invalidates the previous tests. Can we repeat the performance test with the new version of the patch, using M2Crypto?
The change proposed in comment #8 has been merged upstream:
I've reproduced the described scenario on si24.1. We found that the cause of the slowness is ~99% I/O on the rhevm machine, due to database reads/writes on the machine's local disk; there was also ~60% CPU usage, but that is not the cause of the slowness.
I've attached an external iSCSI disk and moved postgresql's data directory onto it, and now the system behaves well: webadmin is responsive, and I managed to attach 2 iSCSI storage domains to a data center containing 250 fake hosts.
Closing this bug.
Any idea what was causing that 99% disk activity in the database? Was the system idle, or performing some test?
It was during storage domain creation/attachment/activation on the 250-host data center; I think there was 99% disk utilization also after the storage was up.
moved to backend to track the overall feature.
Juan, what is blocking pushing this patch upstream for both vdsm and the engine?
The VDSM patch is already merged upstream (see comment #24), and no change is required in the engine.
I'm setting docs_scoped- on this bug as I do not believe this change impacts the formal documentation. My interpretation is that we are rearranging the behind-the-scenes implementation to resolve performance issues, and that this change is best reflected in the Technical Notes for the release (we will come to actually writing the Technical Notes when it is time to approve the errata).
To verify that the SSL cache is working on the VDSM side we can use the openssl s_client command:
1. From the engine machine run the following command to connect to the VDSM service (replace h0.example.com with your host name) and save the session information to a file (note the -sess_out argument):
openssl s_client -connect h0.example.com:54321 -CAfile /etc/pki/ovirt-engine/ca.pem -cert /etc/pki/ovirt-engine/certs/engine.cer -key /etc/pki/ovirt-engine/keys/engine_id_rsa -sess_out session.txt
That should establish an SSL connection, save the session parameters to the session.txt file, and print output similar to this:
depth=1 C = US, O = Example Inc., CN = CA-r6.example.com.72202
depth=0 O = Example Inc., CN = 192.168.200.100
0 s:/O=Example Inc./CN=192.168.200.100
1 s:/C=US/O=Example Inc./CN=CA-r6.example.com.72202
No client certificate CA names sent
SSL handshake has read 3163 bytes and written 2793 bytes
New, TLSv1/SSLv3, Cipher is AES256-SHA
Server public key is 2048 bit
Secure Renegotiation IS supported
Compression: zlib compression
Expansion: zlib compression
Protocol : TLSv1
Cipher : AES256-SHA
Key-Arg : None
Krb5 Principal: None
PSK identity: None
PSK identity hint: None
TLS session ticket:
0000 - 04 a0 a3 47 e4 28 d3 18-6b 11 74 e8 4e 42 a9 01 ...G.(..k.t.NB..
04b0 - cc b5 ac 70 6f 28 98 d6-29 c3 3f 86 ae 6b dc 4f ...po(..).?..k.O
Compression: 1 (zlib compression)
Start Time: 1369044609
Timeout : 300 (sec)
Verify return code: 0 (ok)
The relevant information in this case is the Session-ID; take note of it.
2. Use the openssl s_client command again, providing the session parameters saved to the session.txt file (note the -sess_in argument):
openssl s_client -connect h0.example.com:54321 -CAfile /etc/pki/ovirt-engine/ca.pem -cert /etc/pki/ovirt-engine/certs/engine.cer -key /etc/pki/ovirt-engine/keys/engine_id_rsa -sess_in session.txt
This should give an output similar to the previous one, and the *Session-ID must be exactly the same*. If it is different, then session caching is not working.
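The comparison in step 2 can also be automated. This small helper is an assumption of mine (not part of the bug's tooling): it extracts the Session-ID line from two captured `openssl s_client` outputs and checks that they match, which is the sign that the session was resumed from the cache.

```python
import re

# Hypothetical helper to automate the s_client check above: pull the
# Session-ID out of captured `openssl s_client` output and compare runs.
def session_id(s_client_output):
    m = re.search(r"Session-ID\s*:\s*([0-9A-Fa-f]+)", s_client_output)
    return m.group(1) if m else None

# Sample (fabricated) fragments of two s_client runs against the same host.
first = "New, TLSv1/SSLv3, Cipher is AES256-SHA\n    Session-ID: AB12CD\n"
second = "Reused, TLSv1/SSLv3, Cipher is AES256-SHA\n    Session-ID: AB12CD\n"

# Matching Session-IDs mean the handshake was reused from the cache.
assert session_id(first) == session_id(second)
```

A real OpenSSL run also prints "Reused," instead of "New," on the resumed connection, which is a second, independent confirmation.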
moving to assigned, since the fix was not backported from upstream yet
changing back to assigned, since bug was reverted.
Works on sf18; tested as described in comment 40.
openssl s_client -connect dell-r210ii-05.rhev.lab.eng.brq.redhat.com:54321 -CAfile /etc/pki/ovirt-engine/ca.pem -cert /etc/pki/ovirt-engine/certs/engine.cer -key /etc/pki/ovirt-engine/keys/engine_id_rsa -sess_out session.txt
openssl s_client -connect dell-r210ii-05.rhev.lab.eng.brq.redhat.com:54321 -CAfile /etc/pki/ovirt-engine/ca.pem -cert /etc/pki/ovirt-engine/certs/engine.cer -key /etc/pki/ovirt-engine/keys/engine_id_rsa -sess_in session.txt
session ids match
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.