Bug 1416308 - [downstream clone - 4.0.7] ovirt-shell: sporadic HTTP 500 errors
Summary: [downstream clone - 4.0.7] ovirt-shell: sporadic HTTP 500 errors
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: unspecified
Hardware: Unspecified
OS: Unspecified
unspecified
medium
Target Milestone: ovirt-4.0.7
: ---
Assignee: Ravi Nori
QA Contact: Petr Matyáš
URL:
Whiteboard:
Depends On: 1396833
Blocks:
TreeView+ depends on / blocked
 
Reported: 2017-01-25 08:54 UTC by rhev-integ
Modified: 2017-03-16 15:31 UTC (History)
18 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
This update adds a connection pool to reuse existing connections for communication between the Manager and the SSO module. By default, a connection pool is created on the Manager to communicate with the SSO module. The size of this pool is defined by the ENGINE_SSO_SERVICE_CLIENT_POOL_SIZE option (default size is 10 connections). Similarly, a connection pool is created on the SSO side to communicate with the Manager. The size of this pool is defined by the SSO_CALLBACK_CLIENT_POOL_SIZE option (default size is 10 connections). If needed, these options can be customized by creating the /etc/ovirt-engine/engine.conf.d/99-custom-connection-pool.conf file with the following (where N and M are the new sizes): ENGINE_SSO_SERVICE_CLIENT_POOL_SIZE=N SSO_CALLBACK_CLIENT_POOL_SIZE=M
Clone Of: 1396833
Environment:
Last Closed: 2017-03-16 15:31:34 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:0542 0 normal SHIPPED_LIVE Red Hat Virtualization Manager 4.0.7 2017-03-16 19:25:04 UTC
oVirt gerrit 67294 0 master MERGED core: Use persistent HTTP connections between engine and SSO 2017-01-25 08:57:12 UTC
oVirt gerrit 67724 0 master MERGED aaa: Handle session creation during clean expired user sessions 2017-01-25 08:57:12 UTC
oVirt gerrit 67742 0 ovirt-engine-4.0 MERGED aaa: Handle session creation during clean expired user sessions 2017-01-25 08:57:12 UTC
oVirt gerrit 68393 0 ovirt-engine-4.0 MERGED core: Align 'httpcomponents' version with WildFly 2017-01-25 08:57:12 UTC
oVirt gerrit 68394 0 ovirt-engine-4.0 MERGED core: Use persistent HTTP connections between engine and SSO 2017-01-25 08:57:12 UTC

Description rhev-integ 2017-01-25 08:54:07 UTC
+++ This bug is an upstream to downstream clone. The original bug is: +++
+++   bug 1396833 +++
======================================================================

Description of problem:

ovirt-shell gives sometimes http 500 errors.


Version-Release number of selected component (if applicable):

Ovirt 4.0.5

How reproducible:

100%

Steps to Reproduce:

1) Run "list disk" loop with script:

#!/bin/sh
while [ 1 -eq 1 ]; do
  /usr/bin/ovirt-shell -c -E 'list disks' >/dev/null
  echo -n .
done

2) wait for http 500 error. Will occur once in 10-15 minutes.

Actual results:

...........................
=========================== ERROR =======================
[500] - Internal Server Error
=========================================================
=========================== ERROR =======================
command "list" is not valid or not available while not connected.
=========================================================
....................................

Expected results:

no errors.

Additional info:

content of .ovirtshellrc:

[cli]
autoconnect = True
autopage = True
[ovirt-shell]
username = admin@internal
renew_session = False
timeout = None
extended_prompt = False
url = https://<engine.host>/ovirt-engine/api
insecure = True
kerberos = False
filter = False
session_timeout = None
ca_file = None
dont_validate_cert_chain = True
key_file = None
password = <password>
cert_file = None

(Originally by mst)

Comment 1 rhev-integ 2017-01-25 08:54:19 UTC
That error message is the expected reaction of the CLI when the engine fails with a 500 error. To determine what is happening we need to take a look at the /var/log/ovirt-engine/server.log and /var/log/ovirt-engine/engine.log files. Can you share them?

(Originally by juan.hernandez)

Comment 3 rhev-integ 2017-01-25 08:54:25 UTC
Last error:

Di 22. Nov 20:52:50 CET 2016
============================= ERROR =======================================
               [500] - Internal Server Error
=========================================================================

============================ ERROR =======================================
command "list" is not valid or not available while not connected.
==========================================================================
Di 22. Nov 20:52:50 CET 2016

server/engine logs attached

(Originally by mst)

Comment 4 rhev-integ 2017-01-25 08:54:33 UTC
Created attachment 1222889 [details]
server log

(Originally by mst)

Comment 5 rhev-integ 2017-01-25 08:54:39 UTC
Created attachment 1222890 [details]
engine log

(Originally by mst)

Comment 6 rhev-integ 2017-01-25 08:54:45 UTC
*** Bug 1396835 has been marked as a duplicate of this bug. ***

(Originally by mst)

Comment 7 rhev-integ 2017-01-25 08:54:52 UTC
Can you attach /var/log/httpd/ssl_access_log?

(Originally by juan.hernandez)

Comment 8 rhev-integ 2017-01-25 08:54:59 UTC
Created attachment 1222891 [details]
ssl log

(Originally by mst)

Comment 9 rhev-integ 2017-01-25 08:55:05 UTC
Unfortunately I can't reproduce this problem, and the log messages aren't really useful. However, why trying to reproduce it, using the script provided by Markus, I see that the engine is killed by the out of memory killer of the kernel. The reason is that the RSS size grows without limit. This suggests that there is a memory leak, which may cause all kinds of strange results. I am focusing on that.

(Originally by juan.hernandez)

Comment 10 rhev-integ 2017-01-25 08:55:12 UTC
Juan - should we backport for 4.0.6/4.0.7?

(Originally by Oved Ourfali)

Comment 11 rhev-integ 2017-01-25 08:55:19 UTC
I am not 100% sure of the cause of the problem that Markus detected. Currently I am focusing on a possible leak of native memory, caused by the creation of multiple SSL sockets. But the problem that I can reproduce, based on that hypothesis, is a complete crash of the engine, which isn't exactly what Markus sees.

However the optimization to use persistent SSL connections is important enough to consider backporting it 4.0. It is also invasive enough to consider it very carefully. I'd prefer to wait a bit to see what Martin and Ravi think. Then we can decide what to do.

(Originally by juan.hernandez)

Comment 12 rhev-integ 2017-01-25 08:55:25 UTC
By the way Markus, can you share what is the configuration of your engine? What is the amount of RAM of the machine? And the amount of RAM assigned to the engine (from /etc/ovirt-engine/engine.conf.d/10-setup-java.conf)?

(Originally by juan.hernandez)

Comment 13 rhev-integ 2017-01-25 08:55:33 UTC
engine has 12GB RAM.

config is:

# cat /etc/ovirt-engine/engine.conf.d/10-setup-java.conf
ENGINE_HEAP_MIN="4096M"
ENGINE_HEAP_MAX="4096M"

@Juan: you are right your bugfix for the SSL pool does not directly fix the HTTP 500 error. As you might have noticed: The error logs give no real idea of the root cause.

But what I have noticed: The error occurs more often if you are running a live disk migration during the calls. To be more precise: I run

one time: 
/usr/bin/ovirt-shell -c -E 'action disk <diskname> move --storage_domain-name <targetdomain>

multiple time (to wait until migration finishes)

rdy=0
while [ $rdy -eq 0 ]; do
  sleep 10
  echo -n "."

  rdy=`/usr/bin/ovirt-shell -c -E 'list disks --query "storage = '<targetdomain>' and name = '<diskname>'"' | grep <diskname> | wc -l`
done

(Originally by mst)

Comment 14 rhev-integ 2017-01-25 08:55:40 UTC
Thanks for the additional information Markus. Unfortunately I am not able to reproduce this in my environment, even with that example. I have the suspicion that the exception is generated in the authentication layer, as that is one of the parts of the application where we don't send to the log the complete exception stack traces.

Can you enable the debug log for the AAA module? That, in general, is described here:

  https://www.ovirt.org/develop/developer-guide/engine/jmx-support

In this particular case it should be a matter of running these commands:

  echo 'ENGINE_JMX_INTERFACE=loopback' > /etc/ovirt-engine/engine.conf.d/99-jmx.conf
  systemctl restart ovirt-engine
  export JBOSS_HOME=/usr/share/ovirt-engine-wildfy
  $JBOSS_HOME/bin/jboss-cli.sh --controller=localhost:8706 --connect --user=admin@internal

Then, in the prompt of the 'jboss-cli' tool:

  /subsystem=logging/logger=org.ovirt.engine.core.aaa:add
  /subsystem=logging/logger=org.ovirt.engine.core.aaa:write-attribute(name=level,value=DEBUG)

Once that is done the AAA module will start sending messages to the 'engine.log' file, hopefully including the complete stack trace of the exception.

Once you are done collecting the results you can remove the '99-jmx.conf' file, and restart the engine, and everything will be back to the initial configuration.

(Originally by juan.hernandez)

Comment 15 rhev-integ 2017-01-25 08:55:47 UTC
Created attachment 1226483 [details]
new server log

(Originally by mst)

Comment 16 rhev-integ 2017-01-25 08:55:53 UTC
Created attachment 1226484 [details]
new engine log

(Originally by mst)

Comment 17 rhev-integ 2017-01-25 08:56:00 UTC
Next error:

Mi 30. Nov 21:08:20 CET 2016
======================================= ERROR ========================
  status: 500
  reason: Internal Server Error
  detail: java.lang.IllegalStateException
  ====================================================================
Mi 30. Nov 21:08:22 CET 2016

logs attached

(Originally by mst)

Comment 18 rhev-integ 2017-01-25 08:56:06 UTC
This is a different error, but at least we have an stack trace for it. The relevant log message is the following:

2016-11-30 21:08:21,928 ERROR [org.ovirt.engine.core.bll.SearchQuery] (default task-62) [] Exception: java.lang.NullPointerException
        at org.ovirt.engine.core.bll.QueriesCommandBase.validatePermissions(QueriesCommandBase.java:157) [bll.jar:]

That means that there is a null pointer exception here:

  https://github.com/oVirt/ovirt-engine/blob/ovirt-engine-4.0.5.5/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/QueriesCommandBase.java#L157

The code there is like this:

  return getUser().isAdmin();

The only way that can generate a null pointer exception is if the "getUser()" method returns null, and that should never happen for a an authenticated request, unless the backend session is removed in the middle of the command, which shouldn't happen either.

Ravi, can you take a look? Note that the logs contain the AAA debug info.

(Originally by juan.hernandez)

Comment 19 rhev-integ 2017-01-25 08:56:13 UTC
I was able to reproduce the issue, it happens when a session is created during the execution of cleanExpiredUsersSessions after the session status was retrieved from SSO.

(Originally by Ravi Nori)

Comment 20 rhev-integ 2017-01-25 08:56:20 UTC
Please backport at least the session handling fix of Ravi Nori to 4.0.x.

(Originally by mst)

Comment 21 rhev-integ 2017-01-25 08:56:27 UTC
I agree with Markus that this should be backported to 4.0.z. Martin, can you re-target?

(Originally by juan.hernandez)

Comment 22 rhev-integ 2017-01-25 08:56:34 UTC
I agree, let's finish the fix for master and when done decide how to fix that in 4.0.z

(Originally by Martin Perina)

Comment 23 rhev-integ 2017-01-25 08:56:41 UTC
*** Bug 1379690 has been marked as a duplicate of this bug. ***

(Originally by Martin Perina)

Comment 24 rhev-integ 2017-01-25 08:56:47 UTC
Backport to ovirt-engine-4.0 branch is needed

(Originally by Martin Perina)

Comment 25 rhev-integ 2017-01-25 08:56:54 UTC
4.0.6 has been the last oVirt 4.0 release, please re-target this bug.

(Originally by Sandro Bonazzola)

Comment 26 rhev-integ 2017-01-25 08:57:01 UTC
Setting flags back to allow cloning

(Originally by Martin Perina)

Comment 28 Petr Matyáš 2017-02-01 14:53:26 UTC
Verified on 4.0.7-1

Comment 30 errata-xmlrpc 2017-03-16 15:31:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0542.html


Note You need to log in before you can comment on or make changes to this bug.