Bug 857035 - (ssl_handshake_cache) PRD32 - cache ssl handshakes with the hosts to reduce cpu load
PRD32 - cache ssl handshakes with the hosts to reduce cpu load
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm (Show other bugs)
Unspecified Unspecified
urgent Severity urgent
: ---
: 3.2.1
Assigned To: Juan Hernández
Martin Pavlik
: Regression, Reopened, TestBlocker, ZStream
Depends On:
Blocks: 613180 915537
  Show dependency treegraph
Reported: 2012-09-13 08:25 EDT by Rami Vaknin
Modified: 2016-02-10 14:43 EST (History)
19 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
SSL handshakes are now cached with the hosts in order to reduce CPU load on the hosts.
Story Points: ---
Clone Of:
Last Closed: 2013-07-16 09:29:05 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
ykaul: needinfo+

Attachments (Terms of Use)
Use of CPU with and without SSL (21.18 KB, image/jpeg)
2012-10-10 10:26 EDT, Juan Hernández
no flags Details
Java threads dumps + timestamps (39.57 KB, application/x-tar)
2012-10-10 11:29 EDT, Rami Vaknin
no flags Details
CPU usage charts of ssl vs cached_ssl + java threads dump (108.57 KB, application/x-tar)
2012-11-08 01:10 EST, Rami Vaknin
no flags Details
vdsm log (1012.66 KB, application/x-xz)
2012-11-13 02:05 EST, Rami Vaknin
no flags Details

  None (edit)
Description Rami Vaknin 2012-09-13 08:25:38 EDT

rhevm's is intalled on a bare-metal with AMD Opteron 4170 HE processor and 32GB RAM.
I installed 224 hosts, 200 of them are fake hosts and the rest are bare-metal hosts, 1 of them runs 200 vms which serve as fake hosts. This scale is totally reasonable even in rhevm 3.0.
rhevm's java process takes a lot of cpu - ~35-75% of the total cpus, the java threads dump shows that most of the heavy threads use the sun.security.rsa.RSACore.crtCrypt method which is used for ssl handshake with the hosts.

rhevm and webadmin are so slow that it's not possible to work on them!!!

Example of a thread that does the ssl handshake:

"pool-4-thread-2837" prio=10 tid=0x00007f4bc8004000 nid=0x39c runnable [0x00007f4b7c6c4000]
   java.lang.Thread.State: RUNNABLE
        at java.math.BigInteger.oddModPow(BigInteger.java:1850)
        at java.math.BigInteger.modPow(BigInteger.java:1599)
        at sun.security.rsa.RSACore.crtCrypt(RSACore.java:154)
        at sun.security.rsa.RSACore.rsa(RSACore.java:101)
        at com.sun.crypto.provider.RSACipher.doFinal(RSACipher.java:345)
        at com.sun.crypto.provider.RSACipher.engineDoFinal(RSACipher.java:382)
        at javax.crypto.Cipher.doFinal(Cipher.java:1805)
        at java.security.Signature$CipherAdapter.engineSign(Signature.java:1257)
        at java.security.Signature$Delegate.engineSign(Signature.java:1160)
        at java.security.Signature.sign(Signature.java:553)
        at sun.security.ssl.RSASignature.engineSign(RSASignature.java:166)
        at java.security.Signature$Delegate.engineSign(Signature.java:1160)
        at java.security.Signature.sign(Signature.java:553)
        at sun.security.ssl.HandshakeMessage$CertificateVerify.<init>(HandshakeMessage.java:1547)
        at sun.security.ssl.ClientHandshaker.serverHelloDone(ClientHandshaker.java:979)
        at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:282)
        at sun.security.ssl.Handshaker.processLoop(Handshaker.java:868)
        at sun.security.ssl.Handshaker.process_record(Handshaker.java:804)
        at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:998)
        - locked <0x00000000ec66ee18> (a sun.security.ssl.SSLSocketImpl)
        - locked <0x00000000ec66efe8> (a java.lang.Object)
        at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1294)
        - locked <0x00000000ec66efa8> (a java.lang.Object)
        at sun.security.ssl.SSLSocketImpl.writeRecord(SSLSocketImpl.java:685)
        at sun.security.ssl.AppOutputStream.write(AppOutputStream.java:111)
        - locked <0x00000000eef8ca68> (a sun.security.ssl.AppOutputStream)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
        - locked <0x00000000eef91f48> (a java.io.BufferedOutputStream)
        at java.io.FilterOutputStream.flush(FilterOutputStream.java:140)
        at org.apache.xmlrpc.client.XmlRpcCommonsTransport$1$1.close(XmlRpcCommonsTransport.java:204)
        at org.apache.xmlrpc.client.XmlRpcHttpTransport$ByteArrayReqWriter.write(XmlRpcHttpTransport.java:55)
        at org.apache.xmlrpc.client.XmlRpcCommonsTransport$1.writeRequest(XmlRpcCommonsTransport.java:214)
        at org.apache.commons.httpclient.methods.EntityEnclosingMethod.writeRequestBody(EntityEnclosingMethod.java:499)
        at org.apache.commons.httpclient.HttpMethodBase.writeRequest(HttpMethodBase.java:2114)
        at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1096)
        at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
        at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
        at org.apache.xmlrpc.client.XmlRpcCommonsTransport.writeRequest(XmlRpcCommonsTransport.java:227)
        at org.apache.xmlrpc.client.XmlRpcStreamTransport.sendRequest(XmlRpcStreamTransport.java:151)
        at org.apache.xmlrpc.client.XmlRpcHttpTransport.sendRequest(XmlRpcHttpTransport.java:143)
        at org.apache.xmlrpc.client.XmlRpcClientWorker.execute(XmlRpcClientWorker.java:56)
        at org.apache.xmlrpc.client.XmlRpcClient.execute(XmlRpcClient.java:167)
        at org.apache.xmlrpc.client.XmlRpcClient.execute(XmlRpcClient.java:137)
        at org.apache.xmlrpc.client.XmlRpcClient.execute(XmlRpcClient.java:126)
        at org.apache.xmlrpc.client.util.ClientFactory$1.invoke(ClientFactory.java:140)
        at $Proxy32.list(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor91.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:601)
        at org.ovirt.engine.core.vdsbroker.xmlrpc.XmlRpcUtils$AsyncProxy$1.call(XmlRpcUtils.java:185)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalWrapperRunnable.run(ThreadPoolUtil.java:64)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
Comment 8 Juan Hernández 2012-09-22 16:05:46 EDT
After trying several things the only "simple" way I find to improve this is to start using OpenSSL in a way that allows session caching. This is the change I am proposing to do that in VDSM:


This has to be combined with another change in the engine to make sure that we use one SSL context per host. I am working on that.
Comment 9 Juan Hernández 2012-09-24 11:14:58 EDT
After studying how the engine manages the SSL connections with the hosts I found that it creates only one SSL context that is shared by all the hosts. However this context (the SSLContext class, part of the JDK) manages a cache of sessions indexed by host name and port number. I verified that with the VDSM change proposed in comment 8 the engine automatically reuses the handshakes with the hosts, so no additional changes are required in the engine.
Comment 10 Juan Hernández 2012-10-10 10:24:17 EDT
Rami did tests with and without SSL, and the results clearly show that there is an important, very important, change in the use of CPU. I took a sample (using sysstat) of the first hour after the installation, and it shows that the system uses 150% of CPU with SSL disabled and 300% of CPU with SSL enabled. We all know that this is expected, but having half of the CPU dedicated to SSL doesn't look right to me. See the attached graph.

The stack traces that Rami collected again show many threads doing RSA related computations, so I think it is worth trying to reduce the impact of SSL in the use of CPU. Using a session cache as proposed in the patch is an step in that direction. I don't have hard data to compare the situation with/without the cache. Will try to do some simple benchmarks.

Rami, what is your "feeling" about the environment with SSL disabled? Is it more responsive?
Comment 11 Juan Hernández 2012-10-10 10:26:50 EDT
Created attachment 624914 [details]
Use of CPU with and without SSL
Comment 12 Rami Vaknin 2012-10-10 11:28:17 EDT
Juan, thank for the results analysis!

My "feeling" - hard to say as I had problem to connect the storage to the 200 fake hosts so I haven't done anything usefull with the env, but by judging the environment removal, I think it was faster and smoother, the change was also noticable while changing tabs in webadmin.
Comment 13 Rami Vaknin 2012-10-10 11:29:56 EDT
Created attachment 624961 [details]
Java threads dumps + timestamps
Comment 14 Rami Vaknin 2012-11-08 01:08:21 EST
Tested on 3.1 si20.

I've tested the proposed ssl cache patch which achieves imporessive results - it reduces the cpu usage by ~40% in comparison to the current ssl implementation. I'll attach the charts.

Note that the environment (200 fake hosts) with this patch uses twice the kernel space cpu time than the current ssl implementation, but the patch is very economic in the user space cpu time.

There are also new (afaik) errors arouse while using the patch, Juan - could you please tell whether they might be related to the patch?

Thread-66272::ERROR::2012-11-02 20:10:36,815::SecureXMLRPCServer::210::root::(handle_error) client ('', 38849)
Traceback (most recent call last):
  File "/usr/lib64/python2.6/SocketServer.py", line 560, in process_request_thread
    self.finish_request(request, client_address)
  File "/usr/lib64/python2.6/site-packages/vdsm/SecureXMLRPCServer.py", line 207, in finish_request
  File "/usr/lib64/python2.6/SocketServer.py", line 322, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib64/python2.6/SocketServer.py", line 617, in __init__
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 329, in handle
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 323, in handle_one_request
  File "/usr/lib64/python2.6/SimpleXMLRPCServer.py", line 498, in do_POST
  File "/usr/lib64/python2.6/site-packages/vdsm/SecureXMLRPCServer.py", line 176, in shutdown
  File "<string>", line 1, in shutdown
error: [Errno 107] Transport endpoint is not connected

libvirtEventLoop::ERROR::2012-11-02 17:43:18,423::guestIF::188::vm.Vm::(desktopLock) vmId=`d60b0a9d-7032-468e-8942-0c8e609b2c49`::desktopLock failed
Traceback (most recent call last):
  File "/usr/share/vdsm/guestIF.py", line 186, in desktopLock
  File "/usr/share/vdsm/guestIF.py", line 93, in _forward
error: [Errno 107] Transport endpoint is not connected
Comment 15 Rami Vaknin 2012-11-08 01:10:06 EST
Created attachment 640591 [details]
CPU usage charts of ssl vs cached_ssl + java threads dump
Comment 16 Juan Hernández 2012-11-12 12:14:47 EST
Thanks for the analysis Rami.

The first error can be related to the patch as it happens in one of the lines that it adds. Do you know what operation is triggering it? Can you provide a complete log file containing it?

The second error I think is unrelated.
Comment 18 Rami Vaknin 2012-11-13 02:05:50 EST
Created attachment 643974 [details]
vdsm log
Comment 20 Juan Hernández 2012-11-13 14:37:13 EST
In the log provided in comment #18 I can't find any reason for the errors described in comment #14.

However I have been asked to replace pyOpenSSL with M2Crypto in the implementation, which is a major change and I think it invalidates previous tests. Can we repeat the performance test with the new version of the patch using M2Crypto?
Comment 24 Juan Hernández 2012-11-14 08:28:30 EST
The change proposed in comment #8 has been merged upstream:

Comment 29 Rami Vaknin 2012-11-15 10:15:04 EST
I've reproduced the described scenario on si24.1, we have found that the cause to the slowness is the ~99% io on the rhevm machine due to db r/w in rhevm machine's local disk, there was also ~60% cpu usage but it's not the cause to the slowness.

I've attached an external iscsi disk and moved the postgresql's data dir into that external disk and now the system behaves well, webadmin is responsive and I managed to attach 2 iscsi storage domains into a data center which contains 250 fake hosts.

Closing this bug.
Comment 31 Juan Hernández 2012-11-15 10:37:52 EST
Any idea of what was causing that 99% disk activity in the database? Was the system idle or performing some test?
Comment 32 Rami Vaknin 2012-11-15 10:42:36 EST
It was during storage domain creation/attachement/activation to the 250 hosts data center, I think there was 99% disk utilization also after the storage was up.
Comment 34 Itamar Heim 2012-11-15 12:02:18 EST
moved to backend to track the overall feature.
juan - what is blocking pushing this patch upstream for both vdsm and engine?
Comment 35 Juan Hernández 2012-11-15 12:05:24 EST
The VDSM patch is already merged upstream (see comment #24), and no change is required in the engine.
Comment 39 Stephen Gordon 2012-12-19 14:51:13 EST
I'm setting docs_scoped- on this bug as I do not believe this change impacts the formal documentation. My interpretation is that we are re-arranging the behind the scenes implementation to resolve performance issues and that this change is best reflected in the Technical Notes for the release (we will come to actually writing the Technical Notes when it comes time to approve the errata).
Comment 40 Juan Hernández 2013-05-20 06:22:40 EDT
To verify that the SSL cache is working on the VDSM side we can use the openssl s_client command:

1. From the engine machine run the following command to connect to the VDSM service (replace h0.example.com with your host name) and save the session information to a file (note the -sess_out argument):

openssl s_client -connect h0.example.com:54321 -CAfile /etc/pki/ovirt-engine/ca.pem -cert /etc/pki/ovirt-engine/certs/engine.cer -key /etc/pki/ovirt-engine/keys/engine_id_rsa -sess_out session.txt

That should establish a SSL, save the session parameters to the session.txt file and similar to this:

depth=1 C = US, O = Example Inc., CN = CA-r6.example.com.72202
verify return:1
depth=0 O = Example Inc., CN =
verify return:1
Certificate chain
 0 s:/O=Example Inc./CN=
   i:/C=US/O=Example Inc./CN=CA-r6.example.com.72202
 1 s:/C=US/O=Example Inc./CN=CA-r6.example.com.72202
   i:/C=US/O=Example Inc./CN=CA-r6.example.com.72202
Server certificate
subject=/O=Example Inc./CN=
issuer=/C=US/O=Example Inc./CN=CA-r6.example.com.72202
No client certificate CA names sent
SSL handshake has read 3163 bytes and written 2793 bytes
New, TLSv1/SSLv3, Cipher is AES256-SHA
Server public key is 2048 bit
Secure Renegotiation IS supported
Compression: zlib compression
Expansion: zlib compression
    Protocol  : TLSv1
    Cipher    : AES256-SHA
    Session-ID: E3CF9EFCF181863E930094FF09C6A7CA19B90768A568EECE6196B06B1B9E0588
    Master-Key: 39A2003F979CEE7AF9C1E38B02788914B62E3121F07F838D6F7118182195079F725EA7A4BC16D0576D8F0BD5264B348D
    Key-Arg   : None
    Krb5 Principal: None
    PSK identity: None
    PSK identity hint: None
    TLS session ticket:
    0000 - 04 a0 a3 47 e4 28 d3 18-6b 11 74 e8 4e 42 a9 01   ...G.(..k.t.NB..
    04b0 - cc b5 ac 70 6f 28 98 d6-29 c3 3f 86 ae 6b dc 4f   ...po(..).?..k.O

    Compression: 1 (zlib compression)
    Start Time: 1369044609
    Timeout   : 300 (sec)
    Verify return code: 0 (ok)

The relevant information in this case is the Session-ID, take note of it.

2. Use the openssl s_client command again providing the session parameters saved to the session.txt file (note the --sess_in parameter):

openssl s_client -connect h0.example.com:54321 -CAfile /etc/pki/ovirt-engine/ca.pem -cert /etc/pki/ovirt-engine/certs/engine.cer -key /etc/pki/ovirt-engine/keys/engine_id_rsa -sess_in session.txt

This should vie an output similar to the previous one, and the *Session-ID must be exactly the same*. If it is different then session caching is not working.
Comment 41 Martin Pavlik 2013-05-20 10:12:37 EDT
moving to assigned, since the #BZ fix was not back ported from upstream yet
Comment 43 Eyal Edri 2013-05-22 05:13:19 EDT
changing back to assigned, since bug was reverted.
Comment 46 Martin Pavlik 2013-06-17 04:09:28 EDT
Works on sf18, Tested as described in Comment 40

openssl s_client -connect dell-r210ii-05.rhev.lab.eng.brq.redhat.com:54321 -CAfile /etc/pki/ovirt-engine/ca.pem -cert /etc/pki/ovirt-engine/certs/engine.cer -key /etc/pki/ovirt-engine/keys/engine_id_rsa -sess_out session.txt

Session-ID: 7E4091CB04432BE9DB1ED2D400DC44F2A3C2E2E17D38D24325DA2E7E3B262A66

openssl s_client -connect dell-r210ii-05.rhev.lab.eng.brq.redhat.com:54321 -CAfile /etc/pki/ovirt-engine/ca.pem -cert /etc/pki/ovirt-engine/certs/engine.cer -key /etc/pki/ovirt-engine/keys/engine_id_rsa -sess_in session.txt

Session-ID: 7E4091CB04432BE9DB1ED2D400DC44F2A3C2E2E17D38D24325DA2E7E3B262A66

session ids match
Session-ID: 7E4091CB04432BE9DB1ED2D400DC44F2A3C2E2E17D38D24325DA2E7E3B262A66
Session-ID: 7E4091CB04432BE9DB1ED2D400DC44F2A3C2E2E17D38D24325DA2E7E3B262A66
Comment 48 errata-xmlrpc 2013-07-16 09:29:05 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.


Note You need to log in before you can comment on or make changes to this bug.