Description of problem:
At peak times of the day, logins to the User and Admin Portals are extremely slow, and once logged in, operations are also very slow. After the ovirt-engine service is restarted the problem clears up, but only for a short time. Various tuning options have been tried to no avail, e.g. the engine and DWH heap sizes were increased, and finally the DWH was stopped.

As an example, it can take 5 to 8 minutes to display the main landing page ("Welcome to Red Hat Virtualization Manager"), and the whole login process can then take up to 30 minutes. Once logged in, it can take minutes to move between tabs or display a subsequent page.

Version-Release number of selected component (if applicable):
RHV 4.1.8
I cross-referenced the data in the logs with the thread dumps, and found that the users cannot get their VMs because of the HTTPS call the engine makes to itself to get token information. It takes too long to connect to the web server and then talk to it, so VmLogon times out with a transaction timeout.

In summary:

--> VmLogonCommand
--> SsoUtils.getPassword
---> socketConnect (~1m)
---> socketRead0 (~4m)
--> VmLogonVDSCommand
---> fails with transaction aborted (300s max)

The full analysis is quite big, so I'll attach it here as a file.

I'm not sure what to look at next. Sounds like the HTTP server could be overloaded?
A few parallel efforts to look at:

1. Disable SSO_CALLBACK_PREFIX_CHECK: create /etc/ovirt-engine/engine.conf.d/99-custom-sso-setup.conf with the following content:

SSO_CALLBACK_PREFIX_CHECK=false

and restart the ovirt-engine service (see the sketch after this list).

2. Reduce the TCP timeout for closed connections (but we need to check whether that is actually an issue on this machine).

3. Is the HTTP server overloaded? Are you referring to the Apache side or the Engine side?

I assume DNS is correctly configured (it is worth checking that /etc/hosts is set up correctly, so the Engine doesn't need to resolve its own name via DNS).
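For reference, a minimal way to apply item 1 (the restart command is the standard systemd one, not something taken from this case):

echo 'SSO_CALLBACK_PREFIX_CHECK=false' > /etc/ovirt-engine/engine.conf.d/99-custom-sso-setup.conf
# restart the engine so the new setting is picked up
systemctl restart ovirt-engine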
(In reply to Germano Veit Michel from comment #8)
> I cross-referenced the data in the logs with the thread dumps, and found
> that the users cannot get their VMs because of the HTTPS call the engine
> makes to itself to get token information. It takes too long to connect to
> the web server and then talk to it, so VmLogon times out with a transaction
> timeout.
>
> In summary:
>
> --> VmLogonCommand
> --> SsoUtils.getPassword
> ---> socketConnect (~1m)
> ---> socketRead0 (~4m)
> --> VmLogonVDSCommand
> ---> fails with transaction aborted (300s max)
>
> The full analysis is quite big, so I'll attach it here as a file.
>
> I'm not sure what to look at next. Sounds like the HTTP server could be
> overloaded?

What is the I/O wait of the engine process? If this socket read takes too long, then the OS disk is under heavy load. Specifically, this would mean that the read of the keystore from /etc/ovirt-engine took 4 minutes.
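A couple of standard ways to look at that on the engine machine (generic sysstat tooling, nothing RHV-specific; <engine-pid> is a placeholder for the actual ovirt-engine Java process PID):

# overall CPU breakdown including %iowait, plus per-device await/utilization
iostat -x 2 5
# per-process disk I/O rates for the engine process
pidstat -d -p <engine-pid> 2 5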
We might also be suffering from low entropy since this is a VM. Please make sure the VM is started in VMware with a (virtual) random number generator device. You may want to check the entropy pool stats before that.
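One simple way to check the entropy pool on the engine VM (standard kernel interface, nothing RHV-specific):

# available entropy right now; values stuck in the low hundreds suggest starvation
cat /proc/sys/kernel/random/entropy_avail
# watch it while logins are in progress
watch -n1 cat /proc/sys/kernel/random/entropy_avail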
(In reply to Yaniv Kaul from comment #10)
> A few parallel efforts to look at:
> 1. Disable SSO_CALLBACK_PREFIX_CHECK: create
> /etc/ovirt-engine/engine.conf.d/99-custom-sso-setup.conf with the following
> content:
>
> SSO_CALLBACK_PREFIX_CHECK=false
>
> and restart the ovirt-engine service

This is not relevant; setting it to false just disables the check that the correct engine FQDN is used to access the engine.

(In reply to Germano Veit Michel from comment #8)
> I cross-referenced the data in the logs with the thread dumps, and found
> that the users cannot get their VMs because of the HTTPS call the engine
> makes to itself to get token information. It takes too long to connect to
> the web server and then talk to it, so VmLogon times out with a transaction
> timeout.
>
> In summary:
>
> --> VmLogonCommand
> --> SsoUtils.getPassword
> ---> socketConnect (~1m)
> ---> socketRead0 (~4m)

So please do check whether the host has enough entropy, because we use HTTPS to get the user details and we also decrypt the user password using the engine private key.

Another thing I'd check is whether the disk on the engine host is slowing down the whole operation.

I've also noticed some errors about being unable to get a database connection. Is the engine database on the same host or on a remote host? Is the DWH database on the same host or a remote host? Are there any errors/warnings in the PostgreSQL logs?
DWH is disabled - "Various tuning options have been tried to no avail, e.g. engine and DWH heap sizes have been increased, and finally the DWH was stopped."
Hi Roy,

(In reply to Roy Golan from comment #11)
> (In reply to Germano Veit Michel from comment #8)
> > I cross-referenced the data in the logs with the thread dumps, and found
> > that the users cannot get their VMs because of the HTTPS call the engine
> > makes to itself to get token information. It takes too long to connect to
> > the web server and then talk to it, so VmLogon times out with a
> > transaction timeout.
> >
> > In summary:
> >
> > --> VmLogonCommand
> > --> SsoUtils.getPassword
> > ---> socketConnect (~1m)
> > ---> socketRead0 (~4m)
> > --> VmLogonVDSCommand
> > ---> fails with transaction aborted (300s max)
> >
> > The full analysis is quite big, so I'll attach it here as a file.
> >
> > I'm not sure what to look at next. Sounds like the HTTP server could be
> > overloaded?
>
> What is the I/O wait of the engine process? If this socket read takes too
> long, then the OS disk is under heavy load.

But the TCP connect also seems to take ages, and it seems to fail, as I see a Retry in the traceback. A TCP connect doesn't require any disk I/O AFAIK. The connection is even reset (TCP RST) during this. It looks like one side (userspace) is not keeping up with the load.

I'm trying not to jump ahead, and to go step by step. So I've asked for details of the listening TCP socket (443) to check whether the backlog is too high; maybe Apache cannot handle all those requests (one way to check this is sketched after this comment). I've also requested the Apache logs to match against the timestamps from the logs we already have (SsoUtils.getPassword), to see when Apache actually received that request. That should tell us which direction to follow.

> Specifically, this would mean that the read of the keystore from
> /etc/ovirt-engine took 4 minutes.

Hmmm. AFAIK all the basic performance stuff has already been checked. If this is the case, I would like to see the engine stuck on it in the thread dumps to confirm it. We have something stuck for 4 minutes; it must show up somewhere.

How do I track this loopback request in the engine?

For example, assuming getPassword ends up on the other side (the request) in AuthenticationUtils, like this:

task AA : SsoUtils.getPassword
  ---> loopback request via HTTPS to the engine itself
    ---> task BB : core.sso.utils.AuthenticationUtils

1. Is the above correct?
2. Is there any logging I can enable to relate AA to BB? (Because there are too many at the same time.)

I would like to see the request (BB) as early as possible in the engine. What debug logs should I enable?

Thanks for your help!
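For the record, one way to look at the 443 listen backlog and socket state with plain ss (standard iproute2 syntax, nothing case-specific):

# listening socket on 443: Recv-Q is the current accept-queue depth,
# Send-Q is the configured backlog limit
ss -ltn 'sport = :443'
# per-connection details (timers, retransmits) for traffic to/from 443
ss -tni '( sport = :443 or dport = :443 )'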
(In reply to Germano Veit Michel from comment #23)

This all smells very, very OS related.
- How do other processes behave in this OS?
- What happens if you open a TLS connection to some other page on the Apache web server? (One way to time this is sketched after this comment.)
- What do sar and top show?
- Is the hypervisor CPU choked, or is there an SLA on the VM's CPU?
- Does a single login work well?
- Disable KPTI at runtime just to check the effect? In the VM and/or on the host.

> Hmmm. AFAIK all the basic performance stuff has already been checked. If
> this is the case, I would like to see the engine stuck on it in the thread
> dumps to confirm it. We have something stuck for 4 minutes; it must show up
> somewhere.

Sorry if I ask dumb questions, I'm just trying to get as many details as I can to help. I know you have probably already looked around and gathered a lot of info.

> How do I track this loopback request in the engine?
>
> For example, assuming getPassword ends up on the other side (the request) in
> AuthenticationUtils, like this:
>
> task AA : SsoUtils.getPassword
>   ---> loopback request via HTTPS to the engine itself
>     ---> task BB : core.sso.utils.AuthenticationUtils
>
> 1. Is the above correct?
> 2. Is there any logging I can enable to relate AA to BB? (Because there are
> too many at the same time.)
>
> I would like to see the request (BB) as early as possible in the engine.
> What debug logs should I enable?

A few ideas:
- Set the logging level of org.ovirt.engine.core to DEBUG at runtime (I have a contrib script to help with that [1]). Maybe also org.ovirt.engine.ui.
- Then track the correlation id on the log entries.
- If that doesn't help, then Byteman :)

[1] https://github.com/oVirt/ovirt-engine/tree/master/contrib

> Thanks for your help!
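For the TLS check, curl's built-in timing variables are one convenient option (standard curl; <engine-fqdn> is a placeholder, and the welcome page is just an example of a request that does not hit the SSO token endpoints):

# connect / TLS handshake / total time through Apache to the engine
curl -k -s -o /dev/null \
     -w 'connect=%{time_connect}s tls=%{time_appconnect}s total=%{time_total}s\n' \
     https://<engine-fqdn>/ovirt-engine/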
Hi Roy,

Thanks. But by the time you start, I'm finishing :( So just a few more things please, so I don't get blocked tomorrow ;)

(In reply to Roy Golan from comment #24)
> (In reply to Germano Veit Michel from comment #23)
>
> This all smells very, very OS related.
> - How do other processes behave in this OS?
> - What happens if you open a TLS connection to some other page on the Apache
> web server?
> - What do sar and top show?
> - Is the hypervisor CPU choked, or is there an SLA on the VM's CPU?
> - Does a single login work well?
> - Disable KPTI at runtime just to check the effect? In the VM and/or on the
> host.
>
> > Hmmm. AFAIK all the basic performance stuff has already been checked. If
> > this is the case, I would like to see the engine stuck on it in the thread
> > dumps to confirm it. We have something stuck for 4 minutes; it must show
> > up somewhere.
>
> Sorry if I ask dumb questions, I'm just trying to get as many details as I
> can to help. I know you have probably already looked around and gathered a
> lot of info.

No, it's not dumb at all. I did not check these things myself; I joined the case much later. I agree it's worth a second look given the symptoms we are seeing.

> > How do I track this loopback request in the engine?
> >
> > For example, assuming getPassword ends up on the other side (the request)
> > in AuthenticationUtils, like this:
> >
> > task AA : SsoUtils.getPassword
> >   ---> loopback request via HTTPS to the engine itself
> >     ---> task BB : core.sso.utils.AuthenticationUtils
> >
> > 1. Is the above correct?
> > 2. Is there any logging I can enable to relate AA to BB? (Because there
> > are too many at the same time.)
> >
> > I would like to see the request (BB) as early as possible in the engine.
> > What debug logs should I enable?
>
> A few ideas:
> - Set the logging level of org.ovirt.engine.core to DEBUG at runtime (I have
> a contrib script to help with that [1]). Maybe also org.ovirt.engine.ui.

I was hoping not to enable big chunks of logging, as the environment is already huge and busy. Does anything more specific come to mind?

> - Then track the correlation id on the log entries.
> - If that doesn't help, then Byteman :)

I'm failing to find the link between the HTTPS request for token-info and what is called inside the engine when this request arrives. I think this is it:

ovirt-engine/backend/manager/modules/enginesso/src/main/java/org/ovirt/engine/core/sso/servlets/OAuthTokenInfoServlet.java

Is its service method what gets called?

> [1] https://github.com/oVirt/ovirt-engine/tree/master/contrib

And again, thanks for your help!
To narrow it down further, you can set DEBUG on org.ovirt.engine.core.sso (see the sketch below).

Looking more closely at the VM command, I see that we call VDSM there:

VmLogonCommand
  ssoUtils.getPassword
  runVdsCommand   // this is the call to VDSM, which is probably hanging/stalling
  return

I'm going over the logs again.
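One possible way to flip that logger at runtime is through the WildFly CLI logging subsystem, using the same jboss-cli management endpoint as the other commands in this bug; this is a generic WildFly sketch, not something taken from the contrib script:

# enable DEBUG for the SSO code
${JBOSS_HOME}/bin/jboss-cli.sh --controller=remote+http://127.0.0.1:8706 --connect --user=admin@internal \
    --command="/subsystem=logging/logger=org.ovirt.engine.core.sso:add(level=DEBUG)"
# and remove it again when done
${JBOSS_HOME}/bin/jboss-cli.sh --controller=remote+http://127.0.0.1:8706 --connect --user=admin@internal \
    --command="/subsystem=logging/logger=org.ovirt.engine.core.sso:remove"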
So it looks like the engine is calling itself via the ENGINE_SSO_SERVICE URL. The engine keeps an internal HTTP client with a configurable number of connections.

See SsoOAuthServiceUtils.createClient.

If we have that many connections, maybe we need to tweak the pool. Possibly the pool is depleted. Too bad we don't have logs there.

All of those settings should be under /usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf.
For QE, we may need to check mass VM logon under stress. The internal client pool size is 10. If the problem here really is the performance of that pool under certain conditions, then we need to uncover those conditions and perhaps tweak the pool - for example, see whether we can expand and shrink the pool dynamically. A rough way to generate that kind of load is sketched below.
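A very rough, purely illustrative way to generate concurrent load that exercises the SSO path (the REST API entry point authenticates through SSO; <engine-fqdn> and the credentials are placeholders, and a real test would drive logins through the portals or the SDK):

# fire 50 parallel authenticated API requests
for i in $(seq 1 50); do
  curl -k -s -o /dev/null -u 'admin@internal:PASSWORD' \
       "https://<engine-fqdn>/ovirt-engine/api" &
done
wait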
(In reply to Roy Golan from comment #28)
> So it looks like the engine is calling itself via the ENGINE_SSO_SERVICE
> URL. The engine keeps an internal HTTP client with a configurable number of
> connections.
>
> See SsoOAuthServiceUtils.createClient.
>
> If we have that many connections, maybe we need to tweak the pool. Possibly
> the pool is depleted. Too bad we don't have logs there.
>
> All of those settings should be under
> /usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf.

Ravi, please provide additional details about that if needed.
Found a way to dump HTTP traffic using the jboss-cli:

cat << EOF > httpdumper.jbosscli
/subsystem=undertow/configuration=filter/custom-filter=request-dumper:add(class-name="io.undertow.server.handlers.RequestDumpingHandler", module="io.undertow.core")
/subsystem=undertow/server=default-server/host=default-host/filter-ref=request-dumper:add
EOF

${JBOSS_HOME}/bin/jboss-cli.sh --controller=remote+http://127.0.0.1:8706 --connect --user=admin@internal --file=httpdumper.jbosscli

In server.log:

2018-02-01 11:46:42,560+02 INFO [io.undertow.request.dump] (default task-20)
----------------------------REQUEST---------------------------
URI=/ovirt-engine/sso/oauth/token-info
characterEncoding=null
contentLength=92
contentType=[application/x-www-form-urlencoded; charset=UTF-8]
cookie=locale=en_US
header=Accept=application/json
header=Accept-Encoding=gzip,deflate
header=User-Agent=Apache-HttpClient/4.5.2 (Java/1.8.0_151)
header=Connection=Keep-Alive
header=Authorization=Basic b3ZpcnQtZW5naW5lLWNvcmU6Rnl6SkZ1TlhFbER6YldpeThNQ0VFWlFxSGFiUTJQZUk=
header=Content-Length=92
header=Content-Type=application/x-www-form-urlencoded; charset=UTF-8
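When done, the corresponding remove operations should take the dumper out again (same CLI invocation as above; a sketch, not verified on this setup):

cat << EOF > httpdumper-remove.jbosscli
/subsystem=undertow/server=default-server/host=default-host/filter-ref=request-dumper:remove
/subsystem=undertow/configuration=filter/custom-filter=request-dumper:remove
EOF

${JBOSS_HOME}/bin/jboss-cli.sh --controller=remote+http://127.0.0.1:8706 --connect --user=admin@internal --file=httpdumper-remove.jbosscli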
To increase the pool size for SSO connections, set the variable ENGINE_SSO_SERVICE_CLIENT_POOL_SIZE in a custom conf file, e.g. /etc/ovirt-engine/engine.conf.d/99-sso-setup.conf:

ENGINE_SSO_SERVICE_CLIENT_POOL_SIZE=20

The default value is 10, which is plenty unless we have an issue with the authentication domain. Before increasing the SSO client pool size I would try increasing the DB connections.
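A minimal way to apply and sanity-check this, assuming the usual engine.conf.d override mechanism:

# see the shipped default first
grep ENGINE_SSO_SERVICE_CLIENT_POOL_SIZE /usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf
# override it (survives upgrades because it lives in engine.conf.d)
cat << EOF > /etc/ovirt-engine/engine.conf.d/99-sso-setup.conf
ENGINE_SSO_SERVICE_CLIENT_POOL_SIZE=20
EOF
systemctl restart ovirt-engine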
OK, of the changes to the customer's configuration, only the change to jboss.as.management.blocking.timeout in ovirt-engine.xml.in is not properly exposed, and it will be lost during the next upgrade. So Ravi, could you please expose jboss.as.management.blocking.timeout using our engine.conf.d standard?
Hi Martin, could you please provide validation instructions for this bug?
(In reply to Daniel Gur from comment #42)
> Hi Martin, could you please provide validation instructions for this bug?

Here are the steps:

1) Create the new configuration file as mentioned in the Doc Text (see the example below).
2) Restart the ovirt-engine service.
3) Execute the command below to check whether the value is set in the runtime JBoss configuration:

/usr/share/ovirt-engine-wildfly/bin/jboss-cli.sh --controller=remote+http://127.0.0.1:8706 --connect --user=admin@internal -p=admin --command="/system-property=jboss.as.management.blocking.timeout:read-resource"
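For reference, based on the file name and value used in the verification below, step 1 would look something like:

cat << EOF > /etc/ovirt-engine/engine.conf.d/99-jboss-blocking-timeout.conf
ENGINE_JBOSS_BLOCKING_TIMEOUT=120
EOF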
Setting ENGINE_JBOSS_BLOCKING_TIMEOUT=120 in /etc/ovirt-engine/engine.conf.d/99-jboss-blocking-timeout.conf changed the JBoss configuration; otherwise the default of 300 is used.

# /opt/rh/eap7/root/usr/share/wildfly/bin/jboss-cli.sh --controller=127.0.0.1:8706 --connect --user=admin@internal '/system-property=jboss.as.management.blocking.timeout:read-resource'
{
    "outcome" => "success",
    "result" => {"value" => "120"}
}

Verified in ovirt-engine-4.2.2.1-0.1.el7.noarch.
INFO: Bug status (VERIFIED) wasn't changed, but the following should be fixed: [Tag 'ovirt-engine-4.2.2.4' doesn't contain patch 'https://gerrit.ovirt.org/87885']

gitweb: https://gerrit.ovirt.org/gitweb?p=ovirt-engine.git;a=shortlog;h=refs/tags/ovirt-engine-4.2.2.4

For more info please contact: rhv-devops
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:1488