Bug 1299232

Summary: Hosts are stuck in 'installing'
Product: [oVirt] ovirt-engine Reporter: Nelly Credi <ncredi>
Component: Host-Deploy    Assignee: Moti Asayag <masayag>
Status: CLOSED CURRENTRELEASE QA Contact: Meni Yakove <myakove>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 3.6.3    CC: alex.boyd, bugs, dgilbert, didi, khakimi, masayag, mlipchuk, ncredi, ngoldin, oourfali, pkliczew, pmatyas, sasundar, sbonazzo, ylavi
Target Milestone: ovirt-3.6.3    Keywords: AutomationBlocker, Regression
Target Release: 3.6.3    Flags: didi: needinfo-
rule-engine: ovirt-3.6.z+
rule-engine: blocker+
ylavi: planning_ack+
masayag: devel_ack+
rule-engine: testing_ack+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-02-18 11:14:15 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Network    RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description    Flags
engine logs    none
vdsm log       none

Description Nelly Credi 2016-01-17 16:04:08 UTC
Description of problem:
When adding hosts to the engine, they sometimes get stuck in the 'installing' state.
If we restart the engine, it may either succeed and show the hosts as 'up', or, in another case we have seen, the hosts become non-operational because they don't have the required network: "Host host_mixed_1 does not comply with the cluster golden_env_mixed_1 networks, the following networks are missing on host: 'ovirtmgmt'"

Version-Release number of selected component (if applicable):
ovirt-host-deploy-1.4.1-1.el6ev.noarch
rhevm-3.6.3-0.1000.121.5ef0c7a.master.el6ev.noarch

How reproducible:
We see it a lot in dev CI, but have only seen it once in QE on a GE setup.

Steps to Reproduce:
1. Run a GE build.
2. Fail on the host installation step.

Actual results:
hosts are stuck in 'installing'

Expected results:
hosts should be in 'up' state

Additional info:
https://ge-ci-coresystem-engine01.eng.lab.tlv.redhat.com/ovirt-engine/webadmin/?locale=en_US#hosts-events

In one case (the QE env), everything worked fine after a restart.
In this case the hosts are non-operational because of the required network.

The logs contain no meaningful errors beyond the following:
engine.log
2016-01-17 15:21:47,486 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.TimeBoundPollVDSCommand] (org.ovirt.thread.pool-7-thread-6) [490fcabd] Error: org.ovirt.engine.core.vdsbroker.xmlrpc.XmlRpcRunTimeException: Connection issues during send request
2016-01-17 15:21:47,505 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-6) [490fcabd] Correlation ID: 490fcabd, Job ID: 15217a40-5db9-4f98-a71c-598952750b97, Call Stack: null, Custom Event ID: -1, Message: Host host_mixed_3 installation failed. Please refer to /var/log/ovirt-engine/engine.log and log logs under /var/log/ovirt-engine/host-deploy/ for further details..

There are no errors in the host-deploy log,
but there is one file named:
/var/log/ovirt-engine/host-deploy/ovirt-host-mgmt-20160117175515-10.35.148.42-null.log


We are keeping this environment up so you can look at it.

Comment 1 Nelly Credi 2016-01-17 16:12:31 UTC
Correction:
in both cases a host reinstall was required after the engine restart.

Comment 2 Nadav Goldin 2016-01-18 06:36:23 UTC
Created attachment 1115753 [details]
engine logs

Comment 3 Nadav Goldin 2016-01-18 06:41:30 UTC
adding two more details:

the following exception appeared in host deploy logs:
2016-01-17 15:20:30 DEBUG otopi.plugins.otopi.packagers.dnfpackager dnfpackager._boot:178 Cannot initialize minidnf
Traceback (most recent call last):
  File "/tmp/ovirt-qS7Cl5Alvp/otopi-plugins/otopi/packagers/dnfpackager.py", line 165, in _boot
    constants.PackEnv.DNF_DISABLED_PLUGINS
  File "/tmp/ovirt-qS7Cl5Alvp/otopi-plugins/otopi/packagers/dnfpackager.py", line 75, in _getMiniDNF
    from otopi import minidnf
  File "/tmp/ovirt-qS7Cl5Alvp/pythonlib/otopi/minidnf.py", line 31, in <module>
    import dnf
ImportError: No module named dnf

As far as we can tell, the first time this happened on GE was 12/01/2016:
https://rhev-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/GE-builder/945/consoleFull


The engine is on EL 6.7; logs are attached above.

Comment 4 Red Hat Bugzilla Rules Engine 2016-01-18 13:29:02 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 5 Sandro Bonazzola 2016-01-18 13:30:58 UTC
Didi, can you take a look at this? dnf plugin is not supposed to be enabled on el6.

Comment 6 Sandro Bonazzola 2016-01-18 13:31:59 UTC
(In reply to Sandro Bonazzola from comment #5)
> Didi, can you take a look at this? dnf plugin is not supposed to be enabled
> on el6.

maybe it's just a "try to import and use if there" but better to check

Comment 7 Piotr Kliczewski 2016-01-18 13:47:27 UTC
Please provide vdsm log.

Comment 8 Yedidyah Bar David 2016-01-18 14:36:10 UTC
(In reply to Sandro Bonazzola from comment #6)
> (In reply to Sandro Bonazzola from comment #5)
> > Didi, can you take a look at this? dnf plugin is not supposed to be enabled
> > on el6.
> 
> maybe it's just a "try to import and use if there" but better to check

It is, and can be ignored.

If you only have dnf and not yum, you'll see the same error about yum.

If both are missing, you'll later see that the otopi packager will fail. See e.g. bug 1297835.

Comment 9 Moti Asayag 2016-01-18 14:51:44 UTC
(In reply to Nadav Goldin from comment #3)
> adding two more details:
> 
> the following exception appeared in host deploy logs:
> 2016-01-17 15:20:30 DEBUG otopi.plugins.otopi.packagers.dnfpackager
> dnfpackager._boot:178 Cannot initialize minidnf
> Traceback (most recent call last):
>   File "/tmp/ovirt-qS7Cl5Alvp/otopi-plugins/otopi/packagers/dnfpackager.py",
> line 165, in _boot
>     constants.PackEnv.DNF_DISABLED_PLUGINS
>   File "/tmp/ovirt-qS7Cl5Alvp/otopi-plugins/otopi/packagers/dnfpackager.py",
> line 75, in _getMiniDNF
>     from otopi import minidnf
>   File "/tmp/ovirt-qS7Cl5Alvp/pythonlib/otopi/minidnf.py", line 31, in
> <module>
>     import dnf
> ImportError: No module named dnf

This isn't the cause of any failure. If 'dnf' is available on the server it will be used instead of 'yum' for managing packages; otherwise 'yum' will be used.

> 
> as far as we can tell, first time this happened on GE was 12/01/2016:
> https://rhev-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/GE-builder/945/
> consoleFull
> 
> 
> engine is on el 6.7, logs are attached above.

The logs contain a StackOverflowError:
Caused by: java.lang.StackOverflowError
	at java.lang.Throwable.toString(Throwable.java:480) [rt.jar:1.7.0_91]
	at java.lang.String.valueOf(String.java:2849) [rt.jar:1.7.0_91]
	at java.lang.StringBuilder.append(StringBuilder.java:128) [rt.jar:1.7.0_91]
	at org.jboss.logmanager.formatters.Formatters$14.renderCause(Formatters.java:823) [jboss-logmanager.jar:1.5.4.Final-redhat-1]
	at org.jboss.logmanager.formatters.Formatters$14.renderCause(Formatters.java:841) [jboss-logmanager.jar:1.5.4.Final-redhat-1]
	at org.jboss.logmanager.formatters.Formatters$14.renderCause(Formatters.java:841) [jboss-logmanager.jar:1.5.4.Final-redhat-1]

The root cause of the StackOverflowError, which results in the engine getting stuck, is not clear. Can we get the vdsm.log of the installed hosts for further debugging?
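
The repeated renderCause frames in the trace above suggest the log formatter is recursing over an exception cause chain, possibly a cyclic one. As a minimal illustrative sketch (a Python analogue, not the actual JBoss logmanager code, and whether the engine really produced a cyclic chain is an assumption), a naive recursive cause renderer overflows the stack on such a chain, while tracking visited exceptions terminates cleanly:

```python
def render_cause(exc, depth=0):
    """Naive renderer: walks exc.__cause__ with no cycle detection,
    mirroring the recursive renderCause frames in the trace above."""
    lines = ["  " * depth + repr(exc)]
    if exc.__cause__ is not None:
        lines.extend(render_cause(exc.__cause__, depth + 1))
    return lines

def render_cause_safely(exc):
    """Defensive variant: remembers visited exceptions, so a cyclic
    cause chain terminates instead of overflowing the stack."""
    lines, seen = [], set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))
        lines.append(repr(exc))
        exc = exc.__cause__
    return lines

if __name__ == "__main__":
    a, b = ValueError("a"), RuntimeError("b")
    a.__cause__, b.__cause__ = b, a  # cyclic cause chain: a -> b -> a
    try:
        render_cause(a)
    except RecursionError:
        # the Python analogue of the StackOverflowError in the engine log
        print("naive renderer overflowed")
    print(render_cause_safely(a))  # two entries, then the cycle is cut
```

The vdsm.log requested above would help confirm what actually built the cause chain on the engine side.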

Comment 10 Dr. David Alan Gilbert 2016-01-18 18:28:01 UTC
I also had a host stuck in 'installing'; the error I noticed in the install log was:

'libvirt: Network Filter Driver error : Network filter not found: no nwfilter with matching name 'vdsm-no-mac-spoofing'' while adding/installing a host to my rhev-m,

but the bigger problem was that it stayed stuck in 'installing' until I rebooted the rhev-m box.

Comment 11 Moti Asayag 2016-01-18 20:01:18 UTC
(In reply to Dr. David Alan Gilbert from comment #10)
> I also had a host stuck in installing; the error I noticed in the install
> log was:
> 
> 'libvirt: Network Filter Driver error : Network filter not found: no
> nwfilter with matching name 'vdsm-no-mac-spoofing'' on a failed adding a
> host/installing a host to my rhev-m
> 
> but the bigger problem was that it stuck in installing until I rebooted the
> rhev-m box.

Could you attach the engine.log and server.log from the rhevm server?

Comment 12 Nelly Credi 2016-01-19 06:48:08 UTC
Created attachment 1116065 [details]
vdsm log

Comment 13 Moti Asayag 2016-01-19 07:03:21 UTC
The exact way to reproduce this bug is to halt vdsm after host-deploy has ended.
This produces an exception on the engine side which isn't handled, causing the host installation action to fail with the described result.
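
As an illustrative sketch of the failure mode just described (class and function names here are hypothetical, not the engine's actual code): if the post-deploy poll raises an exception that nothing handles, the host record is never moved out of 'installing', whereas catching the failure lets it transition to a terminal state:

```python
class Host:
    def __init__(self):
        self.status = "installing"  # initial state during deployment

def poll_host(reachable):
    """Stand-in for the post-deploy poll; fails if vdsm was halted."""
    if not reachable:
        raise ConnectionError("Connection issues during send request")
    return "up"

def finish_install_unsafe(host, reachable):
    # No handling: a poll failure propagates up and the status is
    # never updated, leaving the host stuck in 'installing'.
    host.status = poll_host(reachable)

def finish_install_safe(host, reachable):
    # Defensive variant: a poll failure moves the host to a terminal
    # state from which the user can retry or remove it.
    try:
        host.status = poll_host(reachable)
    except ConnectionError:
        host.status = "install_failed"

if __name__ == "__main__":
    h = Host()
    try:
        finish_install_unsafe(h, reachable=False)
    except ConnectionError:
        pass
    print(h.status)  # still 'installing' -- stuck

    h2 = Host()
    finish_install_safe(h2, reachable=False)
    print(h2.status)  # 'install_failed'
```

This only sketches the state-machine symptom; the actual fix merged for 3.6 handles the engine-side exception in its own code paths.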

Comment 14 Moti Asayag 2016-01-20 08:34:25 UTC
*** Bug 1299961 has been marked as a duplicate of this bug. ***

Comment 15 Oved Ourfali 2016-01-20 08:35:28 UTC
It didn't make it on time for 3.6.2, so pushing to 3.6.3.

Comment 16 Maor 2016-01-25 08:56:27 UTC
*** Bug 1301377 has been marked as a duplicate of this bug. ***

Comment 17 Oved Ourfali 2016-01-25 13:03:26 UTC
Moti - when will this be on MODIFIED?

Comment 18 Moti Asayag 2016-01-25 13:29:07 UTC
(In reply to Oved Ourfali from comment #17)
> Moti - when will this be on MODIFIED?

It was merged to 3.6 today - so it can be moved to MODIFIED.

Comment 19 Alex Boyd 2016-04-06 09:29:21 UTC
I'm seeing very similar issues with a fresh install on RHEL 7 (hosted engine on RHEL 7 too). New hosts get stuck at 'installing', and a restart of the hosted engine fixes it. I'm also unable to add a second host to the hosted engine because the installer gets stuck, although hosted-engine status shows the host in maintenance:
--== Host 2 status ==--

Status up-to-date                  : True
Hostname                           : omxakt01.oam.eeint.co.uk
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 0
stopped                            : False
Local maintenance                  : True
crc32                              : d476ad4a
Host timestamp                     : 40127

Comment 20 Kobi Hakimi 2016-04-20 20:32:36 UTC
We see the same issue with oVirt 4.0 on RHEL 7.2:
the first host is stuck in the 'installing' state,
but the other 2 hosts installed as expected and reached the 'up' state.
After a restart of ovirt-engine the host became non-operational and we could remove it.

NOTES:
 - The first host, which was stuck in the 'installing' state, also lost its IP a day later.
 - After a network restart we got the host IP back.
 - The hosts and the engine are RHEL 7.2.