Description of problem:
At peak times of the day, logins to the User and Admin Portals are extremely slow, and once logged in, operations are also very slow. After the ovirt-engine service is restarted the problem clears up, but only for a short time. Various tuning options have been tried to no avail, e.g. the engine and DWH heap sizes were increased, and finally the DWH was stopped.

As an example, it can take 5 to 8 minutes to display the main landing page ("Welcome to Red Hat Virtualization Manager"), and the whole login process can then take up to 30 minutes. Once logged in, it can take minutes to move between tabs or display a subsequent page.

Version-Release number of selected component (if applicable):
RHV 4.1.8
I cross-referenced the data in the logs with the thread dumps, and found that the users cannot get their VMs because of the HTTPS call the engine makes to itself to get token information. It takes too long to connect to the web server and then talk to it, so VmLogon times out with a transaction timeout.

In summary:

--> VmLogonCommand
--> SsoUtils.getPassword
---> socketConnect (~1m)
---> socketRead0 (~4m)
--> VmLogonVDSCommand
---> fails with transaction aborted (300s max)

The full analysis is quite big, so I'll attach it here as a file.

I'm not sure what to look at next. Sounds like the HTTP server could be overloaded?
A few parallel efforts to look at:

1. Disable SSO_CALLBACK_PREFIX_CHECK: create /etc/ovirt-engine/engine.conf.d/99-custom-sso-setup.conf with the following content:

SSO_CALLBACK_PREFIX_CHECK=false

and restart the ovirt-engine service (see the sketch after this list).

2. Reduce the TCP timeout for closed connections (but we need to check whether that is actually an issue on this machine).

3. Is the HTTP server overloaded? Are you referring to the Apache side or the Engine side?

I assume DNS is correctly configured (it is worth checking that /etc/hosts is set up correctly, so the Engine doesn't need to resolve its own name via DNS).
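For reference, a minimal way to apply item 1 (the restart command is the standard systemd one, not something taken from this case):

echo 'SSO_CALLBACK_PREFIX_CHECK=false' > /etc/ovirt-engine/engine.conf.d/99-custom-sso-setup.conf
# restart the engine so the new setting is picked up
systemctl restart ovirt-engine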
(In reply to Germano Veit Michel from comment #8)
> I cross-referenced the data in the logs with the thread dumps, and found
> that the users cannot get their VMs because of the HTTPS call the engine
> makes to itself to get token information. It takes too long to connect to
> the web server and then talk to it, so VmLogon times out with a transaction
> timeout.
>
> In summary:
>
> --> VmLogonCommand
> --> SsoUtils.getPassword
> ---> socketConnect (~1m)
> ---> socketRead0 (~4m)
> --> VmLogonVDSCommand
> ---> fails with transaction aborted (300s max)
>
> The full analysis is quite big, so I'll attach it here as a file.
>
> I'm not sure what to look at next. Sounds like the HTTP server could be
> overloaded?

What is the I/O wait of the engine process? If this socket read takes too long, then the OS disk is under heavy load. Specifically, this would mean that the read of the keystore from /etc/ovirt-engine took 4 minutes.
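A couple of standard ways to look at that on the engine machine (generic sysstat tooling, nothing RHV-specific; <engine-pid> is a placeholder for the actual ovirt-engine Java process PID):

# overall CPU breakdown including %iowait, plus per-device await/utilization
iostat -x 2 5
# per-process disk I/O rates for the engine process
pidstat -d -p <engine-pid> 2 5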
We might also be suffering from low entropy since this is a VM. Please make sure the VM is started in VMware with a (virtual) random number generator device. You may want to check the entropy pool stats before that.
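One simple way to check the entropy pool on the engine VM (standard kernel interface, nothing RHV-specific):

# available entropy right now; values stuck in the low hundreds suggest starvation
cat /proc/sys/kernel/random/entropy_avail
# watch it while logins are in progress
watch -n1 cat /proc/sys/kernel/random/entropy_avail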
(In reply to Yaniv Kaul from comment #10)
> A few parallel efforts to look at:
> 1. Disable SSO_CALLBACK_PREFIX_CHECK: create
> /etc/ovirt-engine/engine.conf.d/99-custom-sso-setup.conf with the following
> content:
>
> SSO_CALLBACK_PREFIX_CHECK=false
>
> and restart the ovirt-engine service

This is not relevant; setting it to false just disables the check that the correct engine FQDN is used to access the engine.

(In reply to Germano Veit Michel from comment #8)
> I cross-referenced the data in the logs with the thread dumps, and found
> that the users cannot get their VMs because of the HTTPS call the engine
> makes to itself to get token information. It takes too long to connect to
> the web server and then talk to it, so VmLogon times out with a transaction
> timeout.
>
> In summary:
>
> --> VmLogonCommand
> --> SsoUtils.getPassword
> ---> socketConnect (~1m)
> ---> socketRead0 (~4m)

So please do check whether the host has enough entropy, because we use HTTPS to get the user details and we also decrypt the user password using the engine private key.

Another thing I'd check is whether the disk on the engine host is slowing down the whole operation.

I've also noticed some errors about being unable to get a database connection. Is the engine database on the same host or on a remote host? Is the DWH database on the same host or a remote host? Are there any errors/warnings in the PostgreSQL logs?
DWH is disabled - "Various tuning options have been tried to no avail, e.g. engine and DWH heap sizes have been increased, and finally the DWH was stopped."
Hi Roy,

(In reply to Roy Golan from comment #11)
> (In reply to Germano Veit Michel from comment #8)
> > I cross-referenced the data in the logs with the thread dumps, and found
> > that the users cannot get their VMs because of the HTTPS call the engine
> > makes to itself to get token information. It takes too long to connect to
> > the web server and then talk to it, so VmLogon times out with a
> > transaction timeout.
> >
> > In summary:
> >
> > --> VmLogonCommand
> > --> SsoUtils.getPassword
> > ---> socketConnect (~1m)
> > ---> socketRead0 (~4m)
> > --> VmLogonVDSCommand
> > ---> fails with transaction aborted (300s max)
> >
> > The full analysis is quite big, so I'll attach it here as a file.
> >
> > I'm not sure what to look at next. Sounds like the HTTP server could be
> > overloaded?
>
> What is the I/O wait of the engine process? If this socket read takes too
> long, then the OS disk is under heavy load.

But the TCP connect also seems to take ages, and it seems to fail, as I see a Retry in the traceback. A TCP connect doesn't require any disk I/O AFAIK. The connection is even reset (TCP RST) during this. It looks like one side (userspace) is not keeping up with the load.

I'm trying not to jump ahead, and to go step by step. So I've asked for details of the listening TCP socket (443) to check whether the backlog is too high; maybe Apache cannot handle all those requests (one way to check this is sketched after this comment). I've also requested the Apache logs to match against the timestamps from the logs we already have (SsoUtils.getPassword), to see when Apache actually received that request. That should tell us which direction to follow.

> Specifically, this would mean that the read of the keystore from
> /etc/ovirt-engine took 4 minutes.

Hmmm. AFAIK all the basic performance stuff has already been checked. If this is the case, I would like to see the engine stuck on it in the thread dumps to confirm it. We have something stuck for 4 minutes; it must show up somewhere.

How do I track this loopback request in the engine?

For example, assuming getPassword ends up on the other side (the request) in AuthenticationUtils, like this:

task AA : SsoUtils.getPassword
  ---> loopback request via HTTPS to the engine itself
    ---> task BB : core.sso.utils.AuthenticationUtils

1. Is the above correct?
2. Is there any logging I can enable to relate AA to BB? (Because there are too many at the same time.)

I would like to see the request (BB) as early as possible in the engine. What debug logs should I enable?

Thanks for your help!
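For the record, one way to look at the 443 listen backlog and socket state with plain ss (standard iproute2 syntax, nothing case-specific):

# listening socket on 443: Recv-Q is the current accept-queue depth,
# Send-Q is the configured backlog limit
ss -ltn 'sport = :443'
# per-connection details (timers, retransmits) for traffic to/from 443
ss -tni '( sport = :443 or dport = :443 )'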
(In reply to Germano Veit Michel from comment #23)

This all smells very, very OS related.
- How do other processes behave in this OS?
- What happens if you open a TLS connection to some other page on the Apache web server? (One way to time this is sketched after this comment.)
- What do sar and top show?
- Is the hypervisor CPU choked, or is there an SLA on the VM's CPU?
- Does a single login work well?
- Disable KPTI at runtime just to check the effect? In the VM and/or on the host.

> Hmmm. AFAIK all the basic performance stuff has already been checked. If
> this is the case, I would like to see the engine stuck on it in the thread
> dumps to confirm it. We have something stuck for 4 minutes; it must show up
> somewhere.

Sorry if I ask dumb questions, I'm just trying to get as many details as I can to help. I know you have probably already looked around and gathered a lot of info.

> How do I track this loopback request in the engine?
>
> For example, assuming getPassword ends up on the other side (the request) in
> AuthenticationUtils, like this:
>
> task AA : SsoUtils.getPassword
>   ---> loopback request via HTTPS to the engine itself
>     ---> task BB : core.sso.utils.AuthenticationUtils
>
> 1. Is the above correct?
> 2. Is there any logging I can enable to relate AA to BB? (Because there are
> too many at the same time.)
>
> I would like to see the request (BB) as early as possible in the engine.
> What debug logs should I enable?

A few ideas:
- Set the logging level of org.ovirt.engine.core to DEBUG at runtime (I have a contrib script to help with that [1]). Maybe also org.ovirt.engine.ui.
- Then track the correlation id on the log entries.
- If that doesn't help, then Byteman :)

[1] https://github.com/oVirt/ovirt-engine/tree/master/contrib

> Thanks for your help!
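For the TLS check, curl's built-in timing variables are one convenient option (standard curl; <engine-fqdn> is a placeholder, and the welcome page is just an example of a request that does not hit the SSO token endpoints):

# connect / TLS handshake / total time through Apache to the engine
curl -k -s -o /dev/null \
     -w 'connect=%{time_connect}s tls=%{time_appconnect}s total=%{time_total}s\n' \
     https://<engine-fqdn>/ovirt-engine/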
Hi Roy,

Thanks. But by the time you start, I'm finishing :( So just a few more things please, so I don't get blocked tomorrow ;)

(In reply to Roy Golan from comment #24)
> (In reply to Germano Veit Michel from comment #23)
>
> This all smells very, very OS related.
> - How do other processes behave in this OS?
> - What happens if you open a TLS connection to some other page on the Apache
> web server?
> - What do sar and top show?
> - Is the hypervisor CPU choked, or is there an SLA on the VM's CPU?
> - Does a single login work well?
> - Disable KPTI at runtime just to check the effect? In the VM and/or on the
> host.
>
> > Hmmm. AFAIK all the basic performance stuff has already been checked. If
> > this is the case, I would like to see the engine stuck on it in the thread
> > dumps to confirm it. We have something stuck for 4 minutes; it must show
> > up somewhere.
>
> Sorry if I ask dumb questions, I'm just trying to get as many details as I
> can to help. I know you have probably already looked around and gathered a
> lot of info.

No, it's not dumb at all. I did not check these things myself; I joined the case much later. I agree it's worth a second look given the symptoms we are seeing.

> > How do I track this loopback request in the engine?
> >
> > For example, assuming getPassword ends up on the other side (the request)
> > in AuthenticationUtils, like this:
> >
> > task AA : SsoUtils.getPassword
> >   ---> loopback request via HTTPS to the engine itself
> >     ---> task BB : core.sso.utils.AuthenticationUtils
> >
> > 1. Is the above correct?
> > 2. Is there any logging I can enable to relate AA to BB? (Because there
> > are too many at the same time.)
> >
> > I would like to see the request (BB) as early as possible in the engine.
> > What debug logs should I enable?
>
> A few ideas:
> - Set the logging level of org.ovirt.engine.core to DEBUG at runtime (I have
> a contrib script to help with that [1]). Maybe also org.ovirt.engine.ui.

I was hoping not to enable big chunks of logging, as the environment is already huge and busy. Does anything more specific come to mind?

> - Then track the correlation id on the log entries.
> - If that doesn't help, then Byteman :)

I'm failing to find the link between the HTTPS request for token-info and what is called inside the engine when this request arrives. I think this is it:

ovirt-engine/backend/manager/modules/enginesso/src/main/java/org/ovirt/engine/core/sso/servlets/OAuthTokenInfoServlet.java

Is its service method what gets called?

> [1] https://github.com/oVirt/ovirt-engine/tree/master/contrib

And again, thanks for your help!
To narrow it down further, you can set DEBUG on org.ovirt.engine.core.sso (see the sketch below).

Looking more closely at the VM command, I see that we call VDSM there:

VmLogonCommand
  ssoUtils.getPassword
  runVdsCommand   // this is the call to VDSM, which is probably hanging/stalling
  return

I'm going over the logs again.
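One possible way to flip that logger at runtime is through the WildFly CLI logging subsystem, using the same jboss-cli management endpoint as the other commands in this bug; this is a generic WildFly sketch, not something taken from the contrib script:

# enable DEBUG for the SSO code
${JBOSS_HOME}/bin/jboss-cli.sh --controller=remote+http://127.0.0.1:8706 --connect --user=admin@internal \
    --command="/subsystem=logging/logger=org.ovirt.engine.core.sso:add(level=DEBUG)"
# and remove it again when done
${JBOSS_HOME}/bin/jboss-cli.sh --controller=remote+http://127.0.0.1:8706 --connect --user=admin@internal \
    --command="/subsystem=logging/logger=org.ovirt.engine.core.sso:remove"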
So it looks like the engine is calling itself via the ENGINE_SSO_SERVICE URL. The engine keeps an internal HTTP client with a configurable number of connections.

See SsoOAuthServiceUtils.createClient.

If we have that many connections, maybe we need to tweak the pool. Possibly the pool is depleted. Too bad we don't have logs there.

All of those settings should be under /usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf.
For QE, we may need to check mass VM logon under stress. The internal client pool size is 10. If the problem here really is the performance of that pool under certain conditions, then we need to uncover those conditions and perhaps tweak the pool - for example, see whether we can expand and shrink the pool dynamically. A rough way to generate that kind of load is sketched below.
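A very rough, purely illustrative way to generate concurrent load that exercises the SSO path (the REST API entry point authenticates through SSO; <engine-fqdn> and the credentials are placeholders, and a real test would drive logins through the portals or the SDK):

# fire 50 parallel authenticated API requests
for i in $(seq 1 50); do
  curl -k -s -o /dev/null -u 'admin@internal:PASSWORD' \
       "https://<engine-fqdn>/ovirt-engine/api" &
done
wait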
(In reply to Roy Golan from comment #28)
> So it looks like the engine is calling itself via the ENGINE_SSO_SERVICE
> URL. The engine keeps an internal HTTP client with a configurable number of
> connections.
>
> See SsoOAuthServiceUtils.createClient.
>
> If we have that many connections, maybe we need to tweak the pool. Possibly
> the pool is depleted. Too bad we don't have logs there.
>
> All of those settings should be under
> /usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf.

Ravi, please provide additional details about that if needed.
Found a way to dump HTTP traffic using the jboss-cli:

cat << EOF > httpdumper.jbosscli
/subsystem=undertow/configuration=filter/custom-filter=request-dumper:add(class-name="io.undertow.server.handlers.RequestDumpingHandler", module="io.undertow.core")
/subsystem=undertow/server=default-server/host=default-host/filter-ref=request-dumper:add
EOF

${JBOSS_HOME}/bin/jboss-cli.sh --controller=remote+http://127.0.0.1:8706 --connect --user=admin@internal --file=httpdumper.jbosscli

In server.log:

2018-02-01 11:46:42,560+02 INFO [io.undertow.request.dump] (default task-20)
----------------------------REQUEST---------------------------
URI=/ovirt-engine/sso/oauth/token-info
characterEncoding=null
contentLength=92
contentType=[application/x-www-form-urlencoded; charset=UTF-8]
cookie=locale=en_US
header=Accept=application/json
header=Accept-Encoding=gzip,deflate
header=User-Agent=Apache-HttpClient/4.5.2 (Java/1.8.0_151)
header=Connection=Keep-Alive
header=Authorization=Basic b3ZpcnQtZW5naW5lLWNvcmU6Rnl6SkZ1TlhFbER6YldpeThNQ0VFWlFxSGFiUTJQZUk=
header=Content-Length=92
header=Content-Type=application/x-www-form-urlencoded; charset=UTF-8
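When done, the corresponding remove operations should take the dumper out again (same CLI invocation as above; a sketch, not verified on this setup):

cat << EOF > httpdumper-remove.jbosscli
/subsystem=undertow/server=default-server/host=default-host/filter-ref=request-dumper:remove
/subsystem=undertow/configuration=filter/custom-filter=request-dumper:remove
EOF

${JBOSS_HOME}/bin/jboss-cli.sh --controller=remote+http://127.0.0.1:8706 --connect --user=admin@internal --file=httpdumper-remove.jbosscli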
To increase the pool size for SSO connections, set the variable ENGINE_SSO_SERVICE_CLIENT_POOL_SIZE in a custom conf file, e.g. /etc/ovirt-engine/engine.conf.d/99-sso-setup.conf:

ENGINE_SSO_SERVICE_CLIENT_POOL_SIZE=20

The default value is 10, which is plenty unless we have an issue with the authentication domain. Before increasing the SSO client pool size I would try increasing the DB connections.
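A minimal way to apply and sanity-check this, assuming the usual engine.conf.d override mechanism:

# see the shipped default first
grep ENGINE_SSO_SERVICE_CLIENT_POOL_SIZE /usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf
# override it (survives upgrades because it lives in engine.conf.d)
cat << EOF > /etc/ovirt-engine/engine.conf.d/99-sso-setup.conf
ENGINE_SSO_SERVICE_CLIENT_POOL_SIZE=20
EOF
systemctl restart ovirt-engine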
OK, of the changes to the customer's configuration, only the change to jboss.as.management.blocking.timeout in ovirt-engine.xml.in is not properly exposed, and it will be lost during the next upgrade. So Ravi, could you please expose jboss.as.management.blocking.timeout using our engine.conf.d standard?
Hi Martin, could you please provide validation instructions for this bug?
(In reply to Daniel Gur from comment #42)
> Hi Martin, could you please provide validation instructions for this bug?

Here are the steps:

1) Create the new configuration file as mentioned in the Doc Text (see the example below).
2) Restart the ovirt-engine service.
3) Execute the command below to check whether the value is set in the runtime JBoss configuration:

/usr/share/ovirt-engine-wildfly/bin/jboss-cli.sh --controller=remote+http://127.0.0.1:8706 --connect --user=admin@internal -p=admin --command="/system-property=jboss.as.management.blocking.timeout:read-resource"
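For reference, based on the file name and value used in the verification below, step 1 would look something like:

cat << EOF > /etc/ovirt-engine/engine.conf.d/99-jboss-blocking-timeout.conf
ENGINE_JBOSS_BLOCKING_TIMEOUT=120
EOF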
Setting ENGINE_JBOSS_BLOCKING_TIMEOUT=120 in /etc/ovirt-engine/engine.conf.d/99-jboss-blocking-timeout.conf changed the JBoss configuration; otherwise the default of 300 is used.

# /opt/rh/eap7/root/usr/share/wildfly/bin/jboss-cli.sh --controller=127.0.0.1:8706 --connect --user=admin@internal '/system-property=jboss.as.management.blocking.timeout:read-resource'
{
    "outcome" => "success",
    "result" => {"value" => "120"}
}

Verified in ovirt-engine-4.2.2.1-0.1.el7.noarch.
INFO: Bug status (VERIFIED) wasn't changed, but the following should be fixed: [Tag 'ovirt-engine-4.2.2.4' doesn't contain patch 'https://gerrit.ovirt.org/87885']

gitweb: https://gerrit.ovirt.org/gitweb?p=ovirt-engine.git;a=shortlog;h=refs/tags/ovirt-engine-4.2.2.4

For more info please contact: rhv-devops
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:1488