Description of problem:
When you deploy the oVirt 4.0 hosted engine, steps where the database schemas are being generated can take longer than the predefined timeout of 600 seconds.

How reproducible:
Very frequently

Steps to Reproduce:
Run hosted-engine --deploy on a new oVirt deployment.

Actual results:
When the hosted engine setup reaches the step "Creating/refreshing Engine database schema" or "Creating/refreshing DWH database schema", and it takes longer than 600 seconds to complete, the following error is shown.

[ ERROR ] Engine setup got stuck on the appliance
[ ERROR ] Failed to execute stage 'Closing up': Engine setup is stalled on the appliance since 600 seconds ago. Please check its log on the appliance.

Expected results:
A user prompt asking if you want to continue waiting.

Additional info:
This BZ was first suggested by Yedidyah Bar David [1], however I do not see an existing BZ.

[1] http://lists.ovirt.org/pipermail/users/2016-May/039546.html
I'd actually look at why it's taking 10 minutes. This is very long.
I noticed that the default CPU/RAM configuration for the engine appliance is rather large (4 cores/16 GB RAM). I set the values to the minimum requirements (2 cores/4 GB RAM) when deploying. Perhaps that may cause this issue?
(In reply to Kevin Hung from comment #2) > I noticed that the default CPU/RAM configuration for the engine appliance is > rather large (4 cores/16 GB RAM). I set the values to the minimum > requirements (2 cores/4 GB RAM) when deploying. Perhaps that may cause this > issue? Unlikely. Min reqs are more than enough for the initial setup - they might not be enough eventually, depending on the actual use and load. Please attach engine-setup logs. You can find them in the engine machine in: /var/log/ovirt-engine/setup. Thanks.
(In reply to Yedidyah Bar David from comment #3) > (In reply to Kevin Hung from comment #2) > > I noticed that the default CPU/RAM configuration for the engine appliance is > > rather large (4 cores/16 GB RAM). I set the values to the minimum > > requirements (2 cores/4 GB RAM) when deploying. Perhaps that may cause this > > issue? > > Unlikely. Min reqs are more than enough for the initial setup - they might > not be enough eventually, depending on the actual use and load. > > Please attach engine-setup logs. You can find them in the engine machine in: > /var/log/ovirt-engine/setup. Thanks. Please ignore this. I see in the link to the list archive that we already found the root cause for this specific case, which was lack of entropy during ovirt-aaa-jdbc-tool run. Perhaps we should have a bz to make sure hosted-engine-setup+engine appliance (one of them, or both, need to think about this) make sure that the engine vm has enough entropy.
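For anyone wanting to check the entropy theory on their own engine VM, here is a minimal sketch that reads the kernel's entropy estimate. It assumes a Linux guest exposing /proc/sys/kernel/random/entropy_avail; the helper names and the threshold of 200 bits are hypothetical illustration values, not anything used by hosted-engine-setup or ovirt-aaa-jdbc-tool. Note that recent kernels report a constant value here, so this is only informative on older kernels such as the EL7 ones discussed in this bug.

```python
from pathlib import Path

# Standard Linux procfs location of the kernel's entropy estimate (in bits).
ENTROPY_PATH = Path("/proc/sys/kernel/random/entropy_avail")


def available_entropy(path=ENTROPY_PATH):
    """Return the entropy estimate read from the given file."""
    return int(Path(path).read_text().strip())


def has_enough_entropy(bits, threshold=200):
    """Compare against a threshold; 200 is an arbitrary, hypothetical value."""
    return bits >= threshold


if __name__ == "__main__":
    bits = available_entropy()
    print(f"entropy_avail: {bits} bits, enough: {has_enough_entropy(bits)}")
```

Running this inside the engine VM shortly before engine-setup would show whether the pool is close to empty.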
(In reply to Yedidyah Bar David from comment #4) > Please ignore this. I see in the link to the list archive that we already > found the root cause for this specific case, which was lack of entropy > during ovirt-aaa-jdbc-tool run. Perhaps we should have a bz to make sure > hosted-engine-setup+engine appliance (one of them, or both, need to think > about this) make sure that the engine vm has enough entropy. Now opened but 1357246 to track this. Still keeping current bug open, because in principle there might be other reasons for engine-setup to take a long time, and we might want to allow that.
(In reply to Yedidyah Bar David from comment #5) > Now opened but 1357246 to track this. bug 1357246 ... Perhaps bugzilla should be extended to notice such typos and link anyway :-)
Created attachment 1180825 [details] ovirt-engine-setup.log Here is the setup log as requested.
Thanks for the log. I wrote that it's not really needed, but it seems it is :-) According to the log: The entire run took almost 18 minutes. aaajdbc.Plugin._setupAdminPassword took 215 seconds. Since I assume this is the main one to be affected by fixing bug 1357246, it would still leave us with around 14.5 minutes, which is significantly more than our current maximum, which is 10 minutes. The two significant time consumers are, as usual, engine and dwh db schema creation, taking 414 and 308 seconds, respectively. Since we do not log per-SQL times, we can't tell how long each call took there. Generally speaking, I'd say these are long times, much longer than what we usually see, but they are not unrealistic, especially with a slower/loaded network and storage.
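The arithmetic behind those figures can be reproduced with a few lines. This is only a worked check of the numbers quoted above; the variable names are made up for illustration, and "almost 18 minutes" is rounded up to a full 18 minutes, so the result is an upper bound.

```python
# Durations quoted in the comment above, in seconds.
total_run = 18 * 60      # "almost 18 minutes", rounded up
admin_password = 215     # aaajdbc.Plugin._setupAdminPassword
engine_schema = 414      # engine db schema creation
dwh_schema = 308         # dwh db schema creation
current_maximum = 600    # the 10-minute limit discussed in this bug

# Time left if the entropy-related step took no time at all.
remaining = total_run - admin_password
# Combined time of the two schema creation steps.
schema_total = engine_schema + dwh_schema

print(f"remaining after entropy fix: {remaining / 60:.1f} min")
print(f"schema creation combined: {schema_total} s")
print(f"remaining exceeds current maximum: {remaining > current_maximum}")
```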
(In reply to Yedidyah Bar David from comment #8) > Thanks for the log, although I wrote it's not really needed, but it seems it > is :-) > > According to the log: > > The entire run took almost 18 minutes. > > aaajdbc.Plugin._setupAdminPassword took 215 seconds. Since I assume this is > the main one to be affected by fixing bug 1357246, it would still leave us > with around 14.5 minutes, which is significantly more than our current > maximum, which is 10 minutes. > > The two significant time consumers are, as usual, engine and dwh db schema > creation, taking 414 and 308 seconds, respectively. > > Since we do not log per-SQL times, we can't tell how long each call took > there. > > Generally speaking, I'd say these are long time, much longer than what we > usually see, but are not unrealistic, especially with a slower/loaded > network and storage. So is this to be closed as not a bug, due to storage / network traffic, since the entropy part has been addressed in a separate bug?
(In reply to Sandro Bonazzola from comment #9) > So is this to be closed as not a bug, due to storage / network traffic, > since the entropy part has been addressed in a separate bug? I do not think so. If we decide that these are "not unrealistic", we should do something. Kevin asked to prompt the user, which might make sense. Perhaps just making the maximum much higher is enough. Or something more complex, such as somehow adding to engine-setup output in the middle of schema.sh indicating it's still running (plus some kind of progress indicator would be cool), so that hosted-engine setup can notice the added output and understand it's still running.
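To illustrate the last option, here is a rough sketch of a stall check that is reset whenever new output is seen, so that periodic progress lines from schema creation would keep the setup from being declared stuck. The class and method names are hypothetical and this is not the actual hosted-engine-setup code; it only assumes that the caller can tell when a new output line arrives.

```python
import time


class StallDetector:
    """Tracks the time since the last observed output (hypothetical helper)."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout        # seconds of silence tolerated
        self.clock = clock            # injectable so it can be tested
        self.last_output = clock()

    def saw_output(self):
        # Any new line, e.g. a progress message, restarts the stall window.
        self.last_output = self.clock()

    def is_stalled(self):
        # Stalled only if nothing was seen for longer than the timeout.
        return self.clock() - self.last_output > self.timeout
```

With something like this, a 414-second schema step that prints a progress line every minute would never come close to the limit, whatever the limit is set to.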
I think that merely making the maximum half an hour or so, is zero work - just pick a version and push there. No reason to delay this, we already got more reports for this bug (both due to slowness and to low entropy). More complex (and user friendly) solutions can be added later on.
(In reply to Yedidyah Bar David from comment #11) > I think that merely making the maximum half an hour or so, is zero work - > just pick a version and push there. No reason to delay this, we already got > more reports for this bug (both due to slowness and to low entropy). More > complex (and user friendly) solutions can be added later on. ok, let's go with the timeout increase.
Looks like I'm getting the same issue on a new deployment of latest hosted engine:

rhevm-appliance-20161214.0-1.el7ev.noarch
ovirt-hosted-engine-ha-2.1.0-1.el7ev.noarch
ovirt-host-deploy-1.6.0-1.el7ev.noarch
ovirt-imageio-common-0.5.0-0.el7ev.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
qemu-kvm-rhev-2.6.0-28.el7_3.3.x86_64
libvirt-client-2.0.0-10.el7_3.4.x86_64
mom-0.5.8-1.el7ev.noarch
vdsm-4.19.2-2.el7ev.x86_64
ovirt-hosted-engine-setup-2.1.0-2.el7ev.noarch
ovirt-setup-lib-1.1.0-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
ovirt-imageio-daemon-0.5.0-0.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
rhevm-appliance-20161214.0-1.el7ev.noarch
sanlock-3.4.0-1.el7.x86_64
Linux version 3.10.0-514.6.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Dec 10 11:15:38 EST 2016
Linux 3.10.0-514.6.1.el7.x86_64 #1 SMP Sat Dec 10 11:15:38 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

Deployment was attempted over NFS storage.
[ INFO ] Running engine-setup on the appliance
[ ERROR ] Engine setup got stuck on the appliance
[ ERROR ] Failed to execute stage 'Closing up': Engine setup is stalled on the appliance since 1800 seconds ago. Please check its log on the appliance.
[ INFO ] Stage: Clean up
[ INFO ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20170123181000.conf'
[ INFO ] Stage: Pre-termination
[ INFO ] Stage: Termination
[ ERROR ] Hosted Engine deployment failed: this system is not reliable, please check the issue,fix and redeploy
Log file is located at /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20170123172939-91xjk7.log
Created attachment 1243657 [details] sosreport from host
(In reply to Nikolai Sednev from comment #14) > [ INFO ] Running engine-setup on the appliance > > [ ERROR ] Engine setup got stuck on the appliance > [ ERROR ] Failed to execute stage 'Closing up': Engine setup is stalled on > the appliance since 1800 seconds ago. Please check its log on the appliance. We raised the default timeout from 600 seconds to 1800 as per the request; the point is why it got stuck. Did you do something special to kill engine setup? If not, you hit a different issue.
(In reply to Simone Tiraboschi from comment #16) > (In reply to Nikolai Sednev from comment #14) > > [ INFO ] Running engine-setup on the appliance > > > > [ ERROR ] Engine setup got stuck on the appliance > > [ ERROR ] Failed to execute stage 'Closing up': Engine setup is stalled on > > the appliance since 1800 seconds ago. Please check its log on the appliance. > > We raised the default timeout from 600 seconds to 1800 as for the request; > the point is why it got stuck. > Did you do something special to kill engine setup? if no, you hit a > different issue. The one and only specific configuration was the size of the HE-VM's disk, which I set to 150GB, although even larger disk sizes worked well in previous deployments. I guess this very bug is then verified, since I witnessed that the full 1800 seconds were waited before the error was reported. I'll open a separate bug on this issue if it is reproduced again. Moving this bug to verified.