Red Hat Bugzilla – Bug 857035
PRD32 - cache ssl handshakes with the hosts to reduce cpu load
Last modified: 2016-02-10 14:43:50 EST
rhevm is installed on a bare-metal machine with an AMD Opteron 4170 HE processor and 32GB RAM.
I installed 224 hosts: 200 of them are fake hosts and the rest are bare-metal hosts, one of which runs the 200 VMs that serve as the fake hosts. This scale is entirely reasonable even in rhevm 3.0.
rhevm's java process consumes a lot of CPU (~35-75% of the total), and the Java thread dump shows that most of the busy threads are in the sun.security.rsa.RSACore.crtCrypt method, which is used for the SSL handshake with the hosts.
rhevm and webadmin are so slow that it's not possible to work with them!
Example of a thread that does the ssl handshake:
"pool-4-thread-2837" prio=10 tid=0x00007f4bc8004000 nid=0x39c runnable [0x00007f4b7c6c4000]
- locked <0x00000000ec66ee18> (a sun.security.ssl.SSLSocketImpl)
- locked <0x00000000ec66efe8> (a java.lang.Object)
- locked <0x00000000ec66efa8> (a java.lang.Object)
- locked <0x00000000eef8ca68> (a sun.security.ssl.AppOutputStream)
- locked <0x00000000eef91f48> (a java.io.BufferedOutputStream)
at $Proxy32.list(Unknown Source)
at sun.reflect.GeneratedMethodAccessor91.invoke(Unknown Source)
After trying several things, the only "simple" way I found to improve this is to start using OpenSSL in a way that allows session caching. This is the change I am proposing to do that in VDSM:
This has to be combined with another change in the engine to make sure that we use one SSL context per host. I am working on that.
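The patch itself isn't quoted here, and it uses pyOpenSSL (later M2Crypto) rather than the standard library. As a hedged illustration of the technique only, not the actual VDSM change: Python's stdlib `ssl` module maintains exactly this kind of server-side session cache on an `SSLContext`, and exposes its counters, so repeat connections can resume a session instead of paying for a full RSA handshake.

```python
import ssl

# Illustration of server-side SSL session caching (an analogy for the VDSM
# patch, which uses pyOpenSSL/M2Crypto, not the stdlib ssl module).
# A server-side SSLContext keeps an internal session cache; OpenSSL counts
# cache hits, misses, timeouts, etc., which session_stats() exposes.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
stats = ctx.session_stats()

# On a freshly created context no sessions have been cached or resumed yet.
print("hits:", stats["hits"], "misses:", stats["misses"])
```

After real clients connect and reconnect, a growing `hits` counter is the sign that handshakes are being reused instead of renegotiated from scratch.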
After studying how the engine manages the SSL connections with the hosts I found that it creates only one SSL context that is shared by all the hosts. However this context (the SSLContext class, part of the JDK) manages a cache of sessions indexed by host name and port number. I verified that with the VDSM change proposed in comment 8 the engine automatically reuses the handshakes with the hosts, so no additional changes are required in the engine.
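The behavior described above, one shared SSLContext whose session cache is indexed by host name and port, can be modeled with a toy sketch (a deliberate simplification for illustration, not the JDK code): reconnecting to a known (host, port) pair returns the cached session, while a new host forces a fresh handshake.

```python
# Toy model of the JDK SSLContext client session cache described above:
# sessions are keyed by (host, port), so one shared context reuses the
# handshake per host. Simplified illustration only, not the real JSSE code.
class SessionCache:
    def __init__(self):
        self._sessions = {}

    def get(self, host, port):
        # Returns a cached session for this host:port, or None (meaning a
        # full handshake is needed).
        return self._sessions.get((host, port))

    def put(self, host, port, session):
        self._sessions[(host, port)] = session

cache = SessionCache()
cache.put("h0.example.com", 54321, "session-A")
reused = cache.get("h0.example.com", 54321)   # cache hit: reuse handshake
fresh = cache.get("h1.example.com", 54321)    # cache miss: full handshake
```

This is why no engine change was needed: as soon as VDSM started accepting resumed sessions, the per-(host, port) cache on the engine side kicked in automatically.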
Rami did tests with and without SSL, and the results clearly show a very significant difference in CPU usage. I took a sample (using sysstat) of the first hour after the installation, and it shows that the system uses 150% of the CPU with SSL disabled and 300% with SSL enabled. We all know that some overhead is expected, but having half of the CPU dedicated to SSL doesn't look right to me. See the attached graph.
The stack traces that Rami collected again show many threads doing RSA-related computations, so I think it is worth trying to reduce the impact of SSL on CPU usage. Using a session cache as proposed in the patch is a step in that direction. I don't have hard data to compare the situation with/without the cache; I will try to run some simple benchmarks.
Rami, what is your "feeling" about the environment with SSL disabled? Is it more responsive?
Created attachment 624914 [details]
Use of CPU with and without SSL
Juan, thanks for the results analysis!
My "feeling": hard to say, as I had problems connecting the storage to the 200 fake hosts, so I haven't done anything useful with the env. But judging by the environment removal, I think it was faster and smoother; the change was also noticeable while switching tabs in webadmin.
Created attachment 624961 [details]
Java threads dumps + timestamps
Tested on 3.1 si20.
I've tested the proposed ssl cache patch, which achieves impressive results: it reduces CPU usage by ~40% compared to the current ssl implementation. I'll attach the charts.
Note that the environment (200 fake hosts) with this patch uses twice as much kernel-space CPU time as the current ssl implementation, but the patch is very economical in user-space CPU time.
New errors (afaik) also arose while using the patch. Juan, could you please tell whether they might be related to the patch?
Thread-66272::ERROR::2012-11-02 20:10:36,815::SecureXMLRPCServer::210::root::(handle_error) client ('10.35.160.93', 38849)
Traceback (most recent call last):
File "/usr/lib64/python2.6/SocketServer.py", line 560, in process_request_thread
File "/usr/lib64/python2.6/site-packages/vdsm/SecureXMLRPCServer.py", line 207, in finish_request
File "/usr/lib64/python2.6/SocketServer.py", line 322, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/usr/lib64/python2.6/SocketServer.py", line 617, in __init__
File "/usr/lib64/python2.6/BaseHTTPServer.py", line 329, in handle
File "/usr/lib64/python2.6/BaseHTTPServer.py", line 323, in handle_one_request
File "/usr/lib64/python2.6/SimpleXMLRPCServer.py", line 498, in do_POST
File "/usr/lib64/python2.6/site-packages/vdsm/SecureXMLRPCServer.py", line 176, in shutdown
File "<string>", line 1, in shutdown
error: [Errno 107] Transport endpoint is not connected
libvirtEventLoop::ERROR::2012-11-02 17:43:18,423::guestIF::188::vm.Vm::(desktopLock) vmId=`d60b0a9d-7032-468e-8942-0c8e609b2c49`::desktopLock failed
Traceback (most recent call last):
File "/usr/share/vdsm/guestIF.py", line 186, in desktopLock
File "/usr/share/vdsm/guestIF.py", line 93, in _forward
error: [Errno 107] Transport endpoint is not connected
Created attachment 640591 [details]
CPU usage charts of ssl vs cached_ssl + java threads dump
Thanks for the analysis Rami.
The first error can be related to the patch as it happens in one of the lines that it adds. Do you know what operation is triggering it? Can you provide a complete log file containing it?
The second error I think is unrelated.
Created attachment 643974 [details]
In the log provided in comment #18 I can't find any reason for the errors described in comment #14.
However, I have been asked to replace pyOpenSSL with M2Crypto in the implementation, which is a major change and I think it invalidates the previous tests. Can we repeat the performance test with the new version of the patch, using M2Crypto?
The change proposed in comment #8 has been merged upstream:
I've reproduced the described scenario on si24.1. We found that the cause of the slowness is ~99% I/O on the rhevm machine, due to database reads/writes on the machine's local disk; there was also ~60% CPU usage, but that is not the cause of the slowness.
I've attached an external iSCSI disk and moved postgresql's data directory onto it, and now the system behaves well: webadmin is responsive, and I managed to attach 2 iSCSI storage domains to a data center containing 250 fake hosts.
Closing this bug.
Any idea what was causing that 99% disk activity in the database? Was the system idle, or performing some test?
It was during storage domain creation/attachment/activation on the 250-host data center; I think there was 99% disk utilization also after the storage was up.
moved to backend to track the overall feature.
Juan, what is blocking pushing this patch upstream for both vdsm and the engine?
The VDSM patch is already merged upstream (see comment #24), and no change is required in the engine.
I'm setting docs_scoped- on this bug as I do not believe this change impacts the formal documentation. My interpretation is that we are rearranging the behind-the-scenes implementation to resolve performance issues, and that this change is best reflected in the Technical Notes for the release (we will come to actually writing the Technical Notes when it is time to approve the errata).
To verify that the SSL cache is working on the VDSM side we can use the openssl s_client command:
1. From the engine machine run the following command to connect to the VDSM service (replace h0.example.com with your host name) and save the session information to a file (note the -sess_out argument):
openssl s_client -connect h0.example.com:54321 -CAfile /etc/pki/ovirt-engine/ca.pem -cert /etc/pki/ovirt-engine/certs/engine.cer -key /etc/pki/ovirt-engine/keys/engine_id_rsa -sess_out session.txt
That should establish an SSL connection, save the session parameters to the session.txt file, and print output similar to this:
depth=1 C = US, O = Example Inc., CN = CA-r6.example.com.72202
depth=0 O = Example Inc., CN = 192.168.200.100
0 s:/O=Example Inc./CN=192.168.200.100
1 s:/C=US/O=Example Inc./CN=CA-r6.example.com.72202
No client certificate CA names sent
SSL handshake has read 3163 bytes and written 2793 bytes
New, TLSv1/SSLv3, Cipher is AES256-SHA
Server public key is 2048 bit
Secure Renegotiation IS supported
Compression: zlib compression
Expansion: zlib compression
Protocol : TLSv1
Cipher : AES256-SHA
Key-Arg : None
Krb5 Principal: None
PSK identity: None
PSK identity hint: None
TLS session ticket:
0000 - 04 a0 a3 47 e4 28 d3 18-6b 11 74 e8 4e 42 a9 01 ...G.(..k.t.NB..
04b0 - cc b5 ac 70 6f 28 98 d6-29 c3 3f 86 ae 6b dc 4f ...po(..).?..k.O
Compression: 1 (zlib compression)
Start Time: 1369044609
Timeout : 300 (sec)
Verify return code: 0 (ok)
The relevant information in this case is the Session-ID; take note of it.
2. Use the openssl s_client command again, providing the session parameters saved to the session.txt file (note the -sess_in argument):
openssl s_client -connect h0.example.com:54321 -CAfile /etc/pki/ovirt-engine/ca.pem -cert /etc/pki/ovirt-engine/certs/engine.cer -key /etc/pki/ovirt-engine/keys/engine_id_rsa -sess_in session.txt
This should give an output similar to the previous one, and the *Session-ID must be exactly the same*. If it is different, then session caching is not working.
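The comparison in step 2 can also be automated. This small helper is an assumption of mine (not part of the bug's tooling): it extracts the Session-ID line from two captured `openssl s_client` outputs and checks that they match, which is the sign that the session was resumed from the cache.

```python
import re

# Hypothetical helper to automate the s_client check above: pull the
# Session-ID out of captured `openssl s_client` output and compare runs.
def session_id(s_client_output):
    m = re.search(r"Session-ID\s*:\s*([0-9A-Fa-f]+)", s_client_output)
    return m.group(1) if m else None

# Sample (fabricated) fragments of two s_client runs against the same host.
first = "New, TLSv1/SSLv3, Cipher is AES256-SHA\n    Session-ID: AB12CD\n"
second = "Reused, TLSv1/SSLv3, Cipher is AES256-SHA\n    Session-ID: AB12CD\n"

# Matching Session-IDs mean the handshake was reused from the cache.
assert session_id(first) == session_id(second)
```

A real OpenSSL run also prints "Reused," instead of "New," on the resumed connection, which is a second, independent confirmation.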
moving to assigned, since the fix was not backported from upstream yet
changing back to assigned, since bug was reverted.
Works on sf18; tested as described in comment 40.
openssl s_client -connect dell-r210ii-05.rhev.lab.eng.brq.redhat.com:54321 -CAfile /etc/pki/ovirt-engine/ca.pem -cert /etc/pki/ovirt-engine/certs/engine.cer -key /etc/pki/ovirt-engine/keys/engine_id_rsa -sess_out session.txt
openssl s_client -connect dell-r210ii-05.rhev.lab.eng.brq.redhat.com:54321 -CAfile /etc/pki/ovirt-engine/ca.pem -cert /etc/pki/ovirt-engine/certs/engine.cer -key /etc/pki/ovirt-engine/keys/engine_id_rsa -sess_in session.txt
session ids match
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.