1352384 – Creating the DB on hosted engine takes more than 600 seconds - increase timeout to half an hour to compensate for slow network/storage

Bug 1352384 - Creating the DB on hosted engine takes more than 600 seconds - increase timeout to half an hour to compensate for slow network/storage

Summary: Creating the DB on hosted engine takes more than 600 seconds - increase timeo...

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	ovirt-hosted-engine-setup
Classification:	oVirt
Component:	General
Sub Component:
Version:	2.0.0.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	ovirt-4.1.0-beta
Target Release:	2.1.0
Assignee:	Rafael Martins
QA Contact:	Nikolai Sednev
Docs Contact:
URL:
Whiteboard:
Depends On:	1357246
Blocks:
TreeView+	depends on / blocked

Reported:	2016-07-04 03:35 UTC by Kevin Hung
Modified:	2017-05-11 09:23 UTC (History)
CC List:	9 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2017-02-01 14:36:15 UTC
oVirt Team:	Integration
Embargoed:
Dependent Products:
Flags:	rule-engine: ovirt-4.1+

Attachments	(Terms of Use)
ovirt-engine-setup.log (3.52 MB, text/plain) 2016-07-17 18:08 UTC, Kevin Hung	no flags	Details
sosreport from host (8.46 MB, application/x-xz) 2017-01-23 16:39 UTC, Nikolai Sednev	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
oVirt gerrit	69973	0	'None'	MERGED	engine: raise the default timeout on engine setup to 1800s	2020-02-12 11:46:05 UTC
oVirt gerrit	69974	0	'None'	MERGED	engine: raise the default timeout on engine setup to 1800s	2020-02-12 11:46:05 UTC

Description Kevin Hung 2016-07-04 03:35:17 UTC

Description of problem:
When you deploy the oVirt 4.0 hosted engine, steps where the database schemas are being generated can take longer than the predefined timeout of 600 seconds.

How reproducible:
Very frequently

Steps to Reproduce:
Run hosted-engine --deploy on a new oVirt deployment.

Actual results:
When the hosted engine setup reaches the step "Creating/refreshing Engine database schema" or "Creating/refreshing DWH database schema", and it takes longer than 600 seconds to complete, the following error is shown.
[ ERROR ] Engine setup got stuck on the appliance
[ ERROR ] Failed to execute stage 'Closing up': Engine setup is stalled on
the appliance since 600 seconds ago. Please check its log on the appliance.


Expected results:
A user prompt asking if you want to continue waiting.

Additional info:
This BZ was first suggested by Yedidyah Bar David [1], however I do not see an existing BZ.
[1] http://lists.ovirt.org/pipermail/users/2016-May/039546.html

Comment 1 Yaniv Kaul 2016-07-04 06:51:52 UTC

I'd actually look at why it's taking 10 minutes. This is very long.

Comment 2 Kevin Hung 2016-07-15 22:49:44 UTC

I noticed that the default CPU/RAM configuration for the engine appliance is rather large (4 cores/16 GB RAM). I set the values to the minimum requirements (2 cores/4 GB RAM) when deploying. Perhaps that may cause this issue?

Comment 3 Yedidyah Bar David 2016-07-17 06:40:55 UTC

(In reply to Kevin Hung from comment #2)
> I noticed that the default CPU/RAM configuration for the engine appliance is
> rather large (4 cores/16 GB RAM). I set the values to the minimum
> requirements (2 cores/4 GB RAM) when deploying. Perhaps that may cause this
> issue?

Unlikely. Min reqs are more than enough for the initial setup - they might not be enough eventually, depending on the actual use and load.

Please attach engine-setup logs. You can find them in the engine machine in:
/var/log/ovirt-engine/setup. Thanks.

Comment 4 Yedidyah Bar David 2016-07-17 06:43:51 UTC

(In reply to Yedidyah Bar David from comment #3)
> (In reply to Kevin Hung from comment #2)
> > I noticed that the default CPU/RAM configuration for the engine appliance is
> > rather large (4 cores/16 GB RAM). I set the values to the minimum
> > requirements (2 cores/4 GB RAM) when deploying. Perhaps that may cause this
> > issue?
> 
> Unlikely. Min reqs are more than enough for the initial setup - they might
> not be enough eventually, depending on the actual use and load.
> 
> Please attach engine-setup logs. You can find them in the engine machine in:
> /var/log/ovirt-engine/setup. Thanks.

Please ignore this. I see in the link to the list archive that we already found the root cause for this specific case, which was lack of entropy during ovirt-aaa-jdbc-tool run. Perhaps we should have a bz to make sure hosted-engine-setup+engine appliance (one of them, or both, need to think about this) make sure that the engine vm has enough entropy.

Comment 5 Yedidyah Bar David 2016-07-17 08:08:29 UTC

(In reply to Yedidyah Bar David from comment #4)
> Please ignore this. I see in the link to the list archive that we already
> found the root cause for this specific case, which was lack of entropy
> during ovirt-aaa-jdbc-tool run. Perhaps we should have a bz to make sure
> hosted-engine-setup+engine appliance (one of them, or both, need to think
> about this) make sure that the engine vm has enough entropy.

Now opened but 1357246 to track this. Still keeping current bug open, because in principle there might be other reasons for engine-setup to take a long time, and we might want to allow that.

Comment 6 Yedidyah Bar David 2016-07-17 08:09:33 UTC

(In reply to Yedidyah Bar David from comment #5)
> Now opened but 1357246 to track this.

bug 1357246 ...

Perhaps bugzilla should be extended to notice such typos and link anyway :-)

Comment 7 Kevin Hung 2016-07-17 18:08:57 UTC

Created attachment 1180825 [details]
ovirt-engine-setup.log

Here is the setup log as requested.

Comment 8 Yedidyah Bar David 2016-07-19 07:36:20 UTC

Thanks for the log, although I wrote it's not really needed, but it seems it is :-)

According to the log:

The entire run took almost 18 minutes.

aaajdbc.Plugin._setupAdminPassword took 215 seconds. Since I assume this is the main one to be affected by fixing bug 1357246, it would still leave us with around 14.5 minutes, which is significantly more than our current maximum, which is 10 minutes.

The two significant time consumers are, as usual, engine and dwh db schema creation, taking 414 and 308 seconds, respectively.

Since we do not log per-SQL times, we can't tell how long each call took there.

Generally speaking, I'd say these are long time, much longer than what we usually see, but are not unrealistic, especially with a slower/loaded network and storage.

Comment 9 Sandro Bonazzola 2016-09-01 07:11:06 UTC

(In reply to Yedidyah Bar David from comment #8)
> Thanks for the log, although I wrote it's not really needed, but it seems it
> is :-)
> 
> According to the log:
> 
> The entire run took almost 18 minutes.
> 
> aaajdbc.Plugin._setupAdminPassword took 215 seconds. Since I assume this is
> the main one to be affected by fixing bug 1357246, it would still leave us
> with around 14.5 minutes, which is significantly more than our current
> maximum, which is 10 minutes.
> 
> The two significant time consumers are, as usual, engine and dwh db schema
> creation, taking 414 and 308 seconds, respectively.
> 
> Since we do not log per-SQL times, we can't tell how long each call took
> there.
> 
> Generally speaking, I'd say these are long time, much longer than what we
> usually see, but are not unrealistic, especially with a slower/loaded
> network and storage.

So is this to be closed as not a bug, due to storage / network traffic, since the entropy part has been addressed in a separate bug?

Comment 10 Yedidyah Bar David 2016-09-01 09:42:38 UTC

(In reply to Sandro Bonazzola from comment #9)
> So is this to be closed as not a bug, due to storage / network traffic,
> since the entropy part has been addressed in a separate bug?

I do not think so. If we decide that these are "not unrealistic", we should do something. Kevin asked to prompt the user, which might make sense. Perhaps just making the maximum much higher is enough. Or something more complex, such as somehow adding to engine-setup output in the middle of schema.sh indicating it's still running (plus some kind of progress indicator would be cool), so that hosted-engine setup can notice the added output and understand it's still running.

Comment 11 Yedidyah Bar David 2016-09-14 14:02:33 UTC

I think that merely making the maximum half an hour or so, is zero work - just pick a version and push there. No reason to delay this, we already got more reports for this bug (both due to slowness and to low entropy). More complex (and user friendly) solutions can be added later on.

Comment 12 Sandro Bonazzola 2016-12-05 10:53:37 UTC

(In reply to Yedidyah Bar David from comment #11)
> I think that merely making the maximum half an hour or so, is zero work -
> just pick a version and push there. No reason to delay this, we already got
> more reports for this bug (both due to slowness and to low entropy). More
> complex (and user friendly) solutions can be added later on.

ok, let's go with the timeout increase.

Comment 13 Nikolai Sednev 2017-01-23 16:35:51 UTC

Looks like I'm getting the same issue on a new deployment of latest hosted engine:
rhevm-appliance-20161214.0-1.el7ev.noarch
ovirt-hosted-engine-ha-2.1.0-1.el7ev.noarch
ovirt-host-deploy-1.6.0-1.el7ev.noarch
ovirt-imageio-common-0.5.0-0.el7ev.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
qemu-kvm-rhev-2.6.0-28.el7_3.3.x86_64
libvirt-client-2.0.0-10.el7_3.4.x86_64
mom-0.5.8-1.el7ev.noarch
vdsm-4.19.2-2.el7ev.x86_64
ovirt-hosted-engine-setup-2.1.0-2.el7ev.noarch
ovirt-setup-lib-1.1.0-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
ovirt-imageio-daemon-0.5.0-0.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
rhevm-appliance-20161214.0-1.el7ev.noarch
sanlock-3.4.0-1.el7.x86_64
Linux version 3.10.0-514.6.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Dec 10 11:15:38 EST 2016
Linux 3.10.0-514.6.1.el7.x86_64 #1 SMP Sat Dec 10 11:15:38 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

Deployment was attempted over NFS storage.

Comment 14 Nikolai Sednev 2017-01-23 16:36:10 UTC

[ INFO  ] Running engine-setup on the appliance

[ ERROR ] Engine setup got stuck on the appliance
[ ERROR ] Failed to execute stage 'Closing up': Engine setup is stalled on the appliance since 1800 seconds ago. Please check its log on the appliance. 
[ INFO  ] Stage: Clean up
[ INFO  ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20170123181000.conf'
[ INFO  ] Stage: Pre-termination
[ INFO  ] Stage: Termination
[ ERROR ] Hosted Engine deployment failed: this system is not reliable, please check the issue,fix and redeploy
          Log file is located at /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20170123172939-91xjk7.log

Comment 15 Nikolai Sednev 2017-01-23 16:39:05 UTC

Created attachment 1243657 [details]
sosreport from host

Comment 16 Simone Tiraboschi 2017-01-24 08:48:48 UTC

(In reply to Nikolai Sednev from comment #14)
> [ INFO  ] Running engine-setup on the appliance
> 
> [ ERROR ] Engine setup got stuck on the appliance
> [ ERROR ] Failed to execute stage 'Closing up': Engine setup is stalled on
> the appliance since 1800 seconds ago. Please check its log on the appliance. 

We raised the default timeout from 600 seconds to 1800 as for the request;
the point is why it got stuck.
Did you do something special to kill engine setup? if no, you hit a different issue.

Comment 17 Nikolai Sednev 2017-01-24 09:26:54 UTC

(In reply to Simone Tiraboschi from comment #16)
> (In reply to Nikolai Sednev from comment #14)
> > [ INFO  ] Running engine-setup on the appliance
> > 
> > [ ERROR ] Engine setup got stuck on the appliance
> > [ ERROR ] Failed to execute stage 'Closing up': Engine setup is stalled on
> > the appliance since 1800 seconds ago. Please check its log on the appliance. 
> 
> We raised the default timeout from 600 seconds to 1800 as for the request;
> the point is why it got stuck.
> Did you do something special to kill engine setup? if no, you hit a
> different issue.

The one and the only specific configuration was the size of HE-VM's disk, which I've set to be at the size of 150GB, although even larger disk sizes worked well in previous deployments. 
Then I gues this very bug is verified forth to I've witnessed 1800 seconds was properly waited and then reported.
I'll open a separate bug on this issue if reproduced again.
Moving this bug to verified.

Note You need to log in before you can comment on or make changes to this bug.