Bug 1396833
| Summary: | ovirt-shell: sporadic HTTP 500 errors | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | [oVirt] ovirt-engine | Reporter: | Markus Stockhausen <mst> | ||||||||||||
| Component: | AAA | Assignee: | Ravi Nori <rnori> | ||||||||||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Petr Matyáš <pmatyas> | ||||||||||||
| Severity: | medium | Docs Contact: | |||||||||||||
| Priority: | unspecified | ||||||||||||||
| Version: | 4.0.5.5 | CC: | bugs, christian.grundmann, juan.hernandez, lsvaty, mperina, mst, oourfali, rnori | ||||||||||||
| Target Milestone: | ovirt-4.1.0-rc | Keywords: | ZStream | ||||||||||||
| Target Release: | 4.1.0 | Flags: | rule-engine:
ovirt-4.1+
rule-engine: planning_ack+ mperina: devel_ack+ lsvaty: testing_ack+ |
||||||||||||
| Hardware: | Unspecified | ||||||||||||||
| OS: | Unspecified | ||||||||||||||
| Whiteboard: | |||||||||||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||||||||||
| Doc Text: |
Cause:
Using new HTTPS connection for each communication between engine and SSO module causes serious performance issues.
Consequence:
Fix:
This fix adds connection pool to reuse existing connection for engine <-> SSO communication.
Result:
By default we create a connection pool on engine side to communicate with SSO module. The size of this pool is defined by ENGINE_SSO_SERVICE_CLIENT_POOL_SIZE option (default size is 10 connections).
Similarly we create a connection pool on SSO side to communication with clients (only engine at the moment). The size of this pool is defined by SSO_CALLBACK_CLIENT_POOL_SIZE option (default size is 10 connections).
If needed those options can be customized by creating /etc/ovirt-engine/engine.conf.d/99-custom-connection-pool.conf file with following content:
ENGINE_SSO_SERVICE_CLIENT_POOL_SIZE=NNN
SSO_CALLBACK_CLIENT_POOL_SIZE=MMM
where NNN respectively MMM are new sizes of relevant connection pool.
|
Story Points: | --- | ||||||||||||
| Clone Of: | |||||||||||||||
| : | 1416308 (view as bug list) | Environment: | |||||||||||||
| Last Closed: | 2017-02-01 14:41:10 UTC | Type: | Bug | ||||||||||||
| Regression: | --- | Mount Type: | --- | ||||||||||||
| Documentation: | --- | CRM: | |||||||||||||
| Verified Versions: | Category: | --- | |||||||||||||
| oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: | |||||||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||||||
| Embargoed: | |||||||||||||||
| Bug Depends On: | |||||||||||||||
| Bug Blocks: | 1301067, 1416308 | ||||||||||||||
| Attachments: |
|
||||||||||||||
|
Description
Markus Stockhausen
2016-11-20 17:26:35 UTC
That error message is the expected reaction of the CLI when the engine fails with a 500 error. To determine what is happening we need to take a look at the /var/log/ovirt-engine/server.log and /var/log/ovirt-engine/engine.log files. Can you share them? Last error:
Di 22. Nov 20:52:50 CET 2016
============================= ERROR =======================================
[500] - Internal Server Error
=========================================================================
============================ ERROR =======================================
command "list" is not valid or not available while not connected.
==========================================================================
Di 22. Nov 20:52:50 CET 2016
server/engine logs attached
Created attachment 1222889 [details]
server log
Created attachment 1222890 [details]
engine log
*** Bug 1396835 has been marked as a duplicate of this bug. *** Can you attach /var/log/httpd/ssl_access_log? Created attachment 1222891 [details]
ssl log
Unfortunately I can't reproduce this problem, and the log messages aren't really useful. However, why trying to reproduce it, using the script provided by Markus, I see that the engine is killed by the out of memory killer of the kernel. The reason is that the RSS size grows without limit. This suggests that there is a memory leak, which may cause all kinds of strange results. I am focusing on that. Juan - should we backport for 4.0.6/4.0.7? I am not 100% sure of the cause of the problem that Markus detected. Currently I am focusing on a possible leak of native memory, caused by the creation of multiple SSL sockets. But the problem that I can reproduce, based on that hypothesis, is a complete crash of the engine, which isn't exactly what Markus sees. However the optimization to use persistent SSL connections is important enough to consider backporting it 4.0. It is also invasive enough to consider it very carefully. I'd prefer to wait a bit to see what Martin and Ravi think. Then we can decide what to do. By the way Markus, can you share what is the configuration of your engine? What is the amount of RAM of the machine? And the amount of RAM assigned to the engine (from /etc/ovirt-engine/engine.conf.d/10-setup-java.conf)? engine has 12GB RAM. config is: # cat /etc/ovirt-engine/engine.conf.d/10-setup-java.conf ENGINE_HEAP_MIN="4096M" ENGINE_HEAP_MAX="4096M" @Juan: you are right your bugfix for the SSL pool does not directly fix the HTTP 500 error. As you might have noticed: The error logs give no real idea of the root cause. But what I have noticed: The error occurs more often if you are running a live disk migration during the calls. To be more precise: I run one time: /usr/bin/ovirt-shell -c -E 'action disk <diskname> move --storage_domain-name <targetdomain> multiple time (to wait until migration finishes) rdy=0 while [ $rdy -eq 0 ]; do sleep 10 echo -n "." rdy=`/usr/bin/ovirt-shell -c -E 'list disks --query "storage = '<targetdomain>' and name = '<diskname>'"' | grep <diskname> | wc -l` done Thanks for the additional information Markus. Unfortunately I am not able to reproduce this in my environment, even with that example. I have the suspicion that the exception is generated in the authentication layer, as that is one of the parts of the application where we don't send to the log the complete exception stack traces. Can you enable the debug log for the AAA module? That, in general, is described here: https://www.ovirt.org/develop/developer-guide/engine/jmx-support In this particular case it should be a matter of running these commands: echo 'ENGINE_JMX_INTERFACE=loopback' > /etc/ovirt-engine/engine.conf.d/99-jmx.conf systemctl restart ovirt-engine export JBOSS_HOME=/usr/share/ovirt-engine-wildfy $JBOSS_HOME/bin/jboss-cli.sh --controller=localhost:8706 --connect --user=admin@internal Then, in the prompt of the 'jboss-cli' tool: /subsystem=logging/logger=org.ovirt.engine.core.aaa:add /subsystem=logging/logger=org.ovirt.engine.core.aaa:write-attribute(name=level,value=DEBUG) Once that is done the AAA module will start sending messages to the 'engine.log' file, hopefully including the complete stack trace of the exception. Once you are done collecting the results you can remove the '99-jmx.conf' file, and restart the engine, and everything will be back to the initial configuration. Created attachment 1226483 [details]
new server log
Created attachment 1226484 [details]
new engine log
Next error: Mi 30. Nov 21:08:20 CET 2016 ======================================= ERROR ======================== status: 500 reason: Internal Server Error detail: java.lang.IllegalStateException ==================================================================== Mi 30. Nov 21:08:22 CET 2016 logs attached This is a different error, but at least we have an stack trace for it. The relevant log message is the following:
2016-11-30 21:08:21,928 ERROR [org.ovirt.engine.core.bll.SearchQuery] (default task-62) [] Exception: java.lang.NullPointerException
at org.ovirt.engine.core.bll.QueriesCommandBase.validatePermissions(QueriesCommandBase.java:157) [bll.jar:]
That means that there is a null pointer exception here:
https://github.com/oVirt/ovirt-engine/blob/ovirt-engine-4.0.5.5/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/QueriesCommandBase.java#L157
The code there is like this:
return getUser().isAdmin();
The only way that can generate a null pointer exception is if the "getUser()" method returns null, and that should never happen for a an authenticated request, unless the backend session is removed in the middle of the command, which shouldn't happen either.
Ravi, can you take a look? Note that the logs contain the AAA debug info.
I was able to reproduce the issue, it happens when a session is created during the execution of cleanExpiredUsersSessions after the session status was retrieved from SSO. Please backport at least the session handling fix of Ravi Nori to 4.0.x. I agree with Markus that this should be backported to 4.0.z. Martin, can you re-target? I agree, let's finish the fix for master and when done decide how to fix that in 4.0.z *** Bug 1379690 has been marked as a duplicate of this bug. *** Backport to ovirt-engine-4.0 branch is needed 4.0.6 has been the last oVirt 4.0 release, please re-target this bug. Setting flags back to allow cloning Verified on 4.1.0-9 |