Bug 1299232 - Hosts are stuck in 'installing'
Status: CLOSED CURRENTRELEASE
Product: ovirt-engine
Classification: oVirt
Component: Host-Deploy
Version: 3.6.3
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ovirt-3.6.3
Target Release: 3.6.3
Assigned To: Moti Asayag
QA Contact: Meni Yakove
Keywords: AutomationBlocker, Regression
Duplicates: 1299961 1301377
Depends On:
Blocks:
Reported: 2016-01-17 11:04 EST by Nelly Credi
Modified: 2016-04-20 16:32 EDT

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-02-18 06:14:15 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Network
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
didi: needinfo-
rule-engine: ovirt-3.6.z+
rule-engine: blocker+
ylavi: planning_ack+
masayag: devel_ack+
rule-engine: testing_ack+


Attachments
engine logs (8.34 MB, text/plain)
2016-01-18 01:36 EST, Nadav Goldin
vdsm log (1.01 MB, application/x-gzip)
2016-01-19 01:48 EST, Nelly Credi


External Trackers
Tracker | ID | Priority | Status | Summary | Last Updated
oVirt gerrit | 52393 | ovirt-engine-3.6 | MERGED | host-deploy: attemptConnection() should not throw exceptions | 2016-01-19 03:03 EST
oVirt gerrit | 52394 | master | MERGED | host-deploy: attemptConnection() should not throw exceptions | 2016-01-19 02:33 EST
oVirt gerrit | 52406 | master | MERGED | Exception shouldn't be recursive | 2016-01-19 05:56 EST
oVirt gerrit | 52407 | ovirt-engine-3.6.2 | NEW | host-deploy: attemptConnection() should not throw exceptions | 2016-03-20 02:45 EDT
oVirt gerrit | 52411 | ovirt-engine-3.6 | MERGED | core: Ignore expected exception during connection attempts | 2016-01-19 09:44 EST
oVirt gerrit | 52412 | master | MERGED | core: Ignore expected exception during connection attempts | 2016-01-19 09:28 EST
oVirt gerrit | 52430 | master | MERGED | jsonrpc: bump version | 2016-01-20 07:12 EST
oVirt gerrit | 52431 | ovirt-engine-3.6 | MERGED | jsonrpc: bump version | 2016-01-25 04:34 EST
oVirt gerrit | 52432 | ovirt-engine-3.6.2 | ABANDONED | jsonrpc: bump version | 2016-01-26 10:43 EST

Description Nelly Credi 2016-01-17 11:04:08 EST
Description of problem:
When adding hosts to the engine, they sometimes get stuck in the 'installing' state.
If we restart the engine, the hosts may either succeed and show as 'up' or, in another case we have seen, become non-operational because they don't have the required network: "Host host_mixed_1 does not comply with the cluster golden_env_mixed_1 networks, the following networks are missing on host: 'ovirtmgmt'"

Version-Release number of selected component (if applicable):
ovirt-host-deploy-1.4.1-1.el6ev.noarch
rhevm-3.6.3-0.1000.121.5ef0c7a.master.el6ev.noarch

How reproducible:
We see it a lot in dev CI, but have seen it only once in QE, on a GE setup.

Steps to Reproduce:
1. Run the GE build
2. The build fails at the host installation step

Actual results:
hosts are stuck in 'installing'

Expected results:
hosts should be in 'up' state

Additional info:
https://ge-ci-coresystem-engine01.eng.lab.tlv.redhat.com/ovirt-engine/webadmin/?locale=en_US#hosts-events

In one case (QE env), everything worked fine after a restart.
In this case, the hosts are non-operational because of the required network.

in the logs there were no meaningful errors:
engine.log
2016-01-17 15:21:47,486 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.TimeBoundPollVDSCommand] (org.ovirt.thread.pool-7-thread-6) [490fcabd] Error: org.ovirt.engine.core.vdsbroker.xmlrpc.XmlRpcRunTimeException: Connection issues during send request
2016-01-17 15:21:47,505 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-6) [490fcabd] Correlation ID: 490fcabd, Job ID: 15217a40-5db9-4f98-a71c-598952750b97, Call Stack: null, Custom Event ID: -1, Message: Host host_mixed_3 installation failed. Please refer to /var/log/ovirt-engine/engine.log and log logs under /var/log/ovirt-engine/host-deploy/ for further details..

No errors in host_deploy log
but there is one file named:
/var/log/ovirt-engine/host-deploy/ovirt-host-mgmt-20160117175515-10.35.148.42-null.log


we are keeping this env up, so you can look at it
Comment 1 Nelly Credi 2016-01-17 11:12:31 EST
correction:
in both cases it required host reinstall after engine restart
Comment 2 Nadav Goldin 2016-01-18 01:36 EST
Created attachment 1115753
engine logs
Comment 3 Nadav Goldin 2016-01-18 01:41:30 EST
adding two more details:

the following exception appeared in host deploy logs:
2016-01-17 15:20:30 DEBUG otopi.plugins.otopi.packagers.dnfpackager dnfpackager._boot:178 Cannot initialize minidnf
Traceback (most recent call last):
  File "/tmp/ovirt-qS7Cl5Alvp/otopi-plugins/otopi/packagers/dnfpackager.py", line 165, in _boot
    constants.PackEnv.DNF_DISABLED_PLUGINS
  File "/tmp/ovirt-qS7Cl5Alvp/otopi-plugins/otopi/packagers/dnfpackager.py", line 75, in _getMiniDNF
    from otopi import minidnf
  File "/tmp/ovirt-qS7Cl5Alvp/pythonlib/otopi/minidnf.py", line 31, in <module>
    import dnf
ImportError: No module named dnf

as far as we can tell, first time this happened on GE was 12/01/2016:
https://rhev-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/GE-builder/945/consoleFull


engine is on el 6.7, logs are attached above.
Comment 4 Red Hat Bugzilla Rules Engine 2016-01-18 08:29:02 EST
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
Comment 5 Sandro Bonazzola 2016-01-18 08:30:58 EST
Didi, can you take a look at this? dnf plugin is not supposed to be enabled on el6.
Comment 6 Sandro Bonazzola 2016-01-18 08:31:59 EST
(In reply to Sandro Bonazzola from comment #5)
> Didi, can you take a look at this? dnf plugin is not supposed to be enabled
> on el6.

maybe it's just a "try to import and use if there" but better to check
Comment 7 Piotr Kliczewski 2016-01-18 08:47:27 EST
Please provide vdsm log.
Comment 8 Yedidyah Bar David 2016-01-18 09:36:10 EST
(In reply to Sandro Bonazzola from comment #6)
> (In reply to Sandro Bonazzola from comment #5)
> > Didi, can you take a look at this? dnf plugin is not supposed to be enabled
> > on el6.
> 
> maybe it's just a "try to import and use if there" but better to check

It is, and can be ignored.

If you only have dnf and not yum, you'll see the same error about yum.

If both are missing, you'll later see that the otopi packager will fail. See e.g. bug 1297835.
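As a rough illustration of the fallback described in this thread (use dnf when it is present, otherwise yum, and fail only when neither exists), here is a minimal sketch. otopi itself is written in Python; the Java class and method names below are hypothetical and are not otopi's API.

```java
import java.util.Set;

final class PackagerChoice {
    private PackagerChoice() {}

    // Hypothetical helper: picks a packager name from those usable on the host.
    // A missing dnf is not an error by itself; only "neither available" is.
    static String choose(Set<String> available) {
        if (available.contains("dnf")) {
            return "dnf";
        }
        if (available.contains("yum")) {
            return "yum";
        }
        throw new IllegalStateException("no supported packager (dnf or yum) is available");
    }
}
```

On an el6 host with only yum installed, `choose` would return "yum", which matches the statement that the ImportError for dnf can be ignored.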
Comment 9 Moti Asayag 2016-01-18 09:51:44 EST
(In reply to Nadav Goldin from comment #3)
> adding two more details:
> 
> the following exception appeared in host deploy logs:
> 2016-01-17 15:20:30 DEBUG otopi.plugins.otopi.packagers.dnfpackager
> dnfpackager._boot:178 Cannot initialize minidnf
> Traceback (most recent call last):
>   File "/tmp/ovirt-qS7Cl5Alvp/otopi-plugins/otopi/packagers/dnfpackager.py",
> line 165, in _boot
>     constants.PackEnv.DNF_DISABLED_PLUGINS
>   File "/tmp/ovirt-qS7Cl5Alvp/otopi-plugins/otopi/packagers/dnfpackager.py",
> line 75, in _getMiniDNF
>     from otopi import minidnf
>   File "/tmp/ovirt-qS7Cl5Alvp/pythonlib/otopi/minidnf.py", line 31, in
> <module>
>     import dnf
> ImportError: No module named dnf

This isn't a cause for any failure. If 'dnf' is available on the server, it will be used instead of 'yum' for managing packages, else 'yum' will be used.

> 
> as far as we can tell, first time this happened on GE was 12/01/2016:
> https://rhev-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/GE-builder/945/
> consoleFull
> 
> 
> engine is on el 6.7, logs are attached above.

The logs contain a StackOverflowError:
Caused by: java.lang.StackOverflowError
	at java.lang.Throwable.toString(Throwable.java:480) [rt.jar:1.7.0_91]
	at java.lang.String.valueOf(String.java:2849) [rt.jar:1.7.0_91]
	at java.lang.StringBuilder.append(StringBuilder.java:128) [rt.jar:1.7.0_91]
	at org.jboss.logmanager.formatters.Formatters$14.renderCause(Formatters.java:823) [jboss-logmanager.jar:1.5.4.Final-redhat-1]
	at org.jboss.logmanager.formatters.Formatters$14.renderCause(Formatters.java:841) [jboss-logmanager.jar:1.5.4.Final-redhat-1]
	at org.jboss.logmanager.formatters.Formatters$14.renderCause(Formatters.java:841) [jboss-logmanager.jar:1.5.4.Final-redhat-1]

The root cause of the stack overflow, which leaves the engine stuck, is not clear. Can we get the vdsm.log of the installed hosts for further debugging?
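The repeating renderCause frames are what one would expect if the logged exception's cause chain loops back on itself. The tracker entry "Exception shouldn't be recursive" points the same way, though this comment does not confirm it. A minimal Java sketch, under that assumption, of building such a cycle and walking it safely; CauseChain, render and cyclicExample are hypothetical names, not ovirt-engine or jboss-logmanager code:

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

final class CauseChain {
    private CauseChain() {}

    // Renders a cause chain, stopping as soon as a Throwable is seen twice.
    // A renderer that recurses on getCause() without this check never
    // terminates on a cyclic chain and ends in StackOverflowError.
    static String render(Throwable t) {
        Set<Throwable> seen =
                Collections.newSetFromMap(new IdentityHashMap<Throwable, Boolean>());
        StringBuilder sb = new StringBuilder();
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            if (!seen.add(cur)) {
                sb.append(" [cycle detected]");
                break;
            }
            if (sb.length() > 0) {
                sb.append(" <- caused by: ");
            }
            sb.append(cur.getClass().getSimpleName())
              .append(": ")
              .append(cur.getMessage());
        }
        return sb.toString();
    }

    // Builds a two-element cycle: a is caused by b, and b is caused by a.
    // initCause() rejects direct self-causation but allows this longer loop.
    static Throwable cyclicExample() {
        RuntimeException a = new RuntimeException("a");
        RuntimeException b = new RuntimeException("b", a);
        a.initCause(b);
        return a;
    }
}
```

Tracking visited exceptions by identity (rather than by equals) keeps the check independent of how a custom exception class might override equality.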
Comment 10 Dr. David Alan Gilbert 2016-01-18 13:28:01 EST
I also had a host stuck in 'installing'. The error I noticed in the install log, after adding/installing a host on my rhev-m failed, was:

'libvirt: Network Filter Driver error : Network filter not found: no nwfilter with matching name 'vdsm-no-mac-spoofing''

The bigger problem was that it stayed stuck in 'installing' until I rebooted the rhev-m box.
Comment 11 Moti Asayag 2016-01-18 15:01:18 EST
(In reply to Dr. David Alan Gilbert from comment #10)
> I also had a host stuck in 'installing'. The error I noticed in the install
> log, after adding/installing a host on my rhev-m failed, was:
> 
> 'libvirt: Network Filter Driver error : Network filter not found: no
> nwfilter with matching name 'vdsm-no-mac-spoofing''
> 
> The bigger problem was that it stayed stuck in 'installing' until I rebooted
> the rhev-m box.

Could you attach the engine.log and server.log from the rhevm server?
Comment 12 Nelly Credi 2016-01-19 01:48 EST
Created attachment 1116065
vdsm log
Comment 13 Moti Asayag 2016-01-19 02:03:21 EST
The exact reproducer for this bug is to halt vdsm after host-deploy has finished.
This produces an exception on the engine side that isn't handled, which causes the host installation action to fail with the described result.
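One way to read the fix titled "host-deploy: attemptConnection() should not throw exceptions" is that a connection attempt against a stopped vdsm should report failure rather than propagate an exception. A minimal Java sketch of that idea, assuming a generic probe; ConnectionProbe, its parameters and the retry policy are hypothetical and are not the engine's actual code:

```java
import java.util.concurrent.Callable;

final class ConnectionProbe {
    private ConnectionProbe() {}

    // Tries the probe up to maxAttempts times and returns true on the first
    // success. Exceptions thrown by the probe are treated as "not reachable
    // yet" and are never propagated to the caller.
    static boolean attemptConnection(Callable<Boolean> probe,
                                     int maxAttempts,
                                     long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (Boolean.TRUE.equals(probe.call())) {
                    return true;
                }
            } catch (Exception e) {
                // Expected while the agent on the host is down; try again.
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException ie) {
                    // Preserve the interrupt flag and give up quietly.
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }
}
```

With this shape, the caller can move the host to a failed or non-responsive state based on the boolean result, instead of an unhandled exception leaving it in 'installing'.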
Comment 14 Moti Asayag 2016-01-20 03:34:25 EST
*** Bug 1299961 has been marked as a duplicate of this bug. ***
Comment 15 Oved Ourfali 2016-01-20 03:35:28 EST
It didn't make it on time for 3.6.2, so pushing to 3.6.3.
Comment 16 Maor 2016-01-25 03:56:27 EST
*** Bug 1301377 has been marked as a duplicate of this bug. ***
Comment 17 Oved Ourfali 2016-01-25 08:03:26 EST
Moti - when will this be on MODIFIED?
Comment 18 Moti Asayag 2016-01-25 08:29:07 EST
(In reply to Oved Ourfali from comment #17)
> Moti - when will this be on MODIFIED?

It was merged to 3.6 today - so it can be moved to MODIFIED.
Comment 19 Alex Boyd 2016-04-06 05:29:21 EDT
I'm seeing very similar issues with a fresh install on RHEL 7 (hosted engine on RHEL 7 too). New hosts get stuck in 'installing', and restarting the hosted engine fixes it. I am also unable to add a second host to the hosted engine because the installer gets stuck, although the hosted-engine status shows the host in maintenance:
--== Host 2 status ==--

Status up-to-date                  : True
Hostname                           : omxakt01.oam.eeint.co.uk
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 0
stopped                            : False
Local maintenance                  : True
crc32                              : d476ad4a
Host timestamp                     : 40127
Comment 20 Kobi Hakimi 2016-04-20 16:32:36 EDT
We see the same issue with oVirt 4.0 on RHEL 7.2:
The first host got stuck in the 'installing' state,
but the other 2 hosts installed as expected and reached the 'up' state.
After restarting ovirt-engine, the host became non-operational and we could remove it.

NOTES:
 - The first host, which got stuck in the 'installing' state, also lost its IP a day later.
 - After a network restart, the host got its IP back.
 - The hosts and the engine are RHEL 7.2.
