Bug 1386293 - Host stuck in installing state during ovirt-hosted-engine-setup run
Summary: Host stuck in installing state during ovirt-hosted-engine-setup run
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-hosted-engine-setup
Classification: oVirt
Component: General
Version: 2.0.2.2
Hardware: x86_64
OS: Linux
Priority: high
Severity: unspecified
Target Milestone: ovirt-4.0.6
Target Release: 2.0.4
Assignee: Simone Tiraboschi
QA Contact: Nikolai Sednev
URL:
Whiteboard: integration
Depends On: 1393932
Blocks: RHEV-RHCI 1386256 1396051 1411435
 
Reported: 2016-10-18 15:08 UTC by Fabian von Feilitzsch
Modified: 2019-04-28 13:38 UTC
CC: 14 users

Fixed In Version:
Clone Of:
Clones: 1411435
Environment:
Last Closed: 2017-01-18 07:29:35 UTC
oVirt Team: Integration
Embargoed:
stirabos: needinfo+
rule-engine: ovirt-4.0.z+


Attachments
/var/log/messages (204.04 KB, text/plain)
2016-10-18 15:08 UTC, Fabian von Feilitzsch
no flags Details
/var/log/audit/audit.log (676.32 KB, text/plain)
2016-10-18 15:09 UTC, Fabian von Feilitzsch
no flags Details
/var/log/libvirt/qemu/HostedEngine.log (3.17 KB, text/plain)
2016-10-18 15:10 UTC, Fabian von Feilitzsch
no flags Details
/var/log/vdsm/mom.log (5.59 KB, text/plain)
2016-10-18 15:10 UTC, Fabian von Feilitzsch
no flags Details
/var/log/vdsm/supervdsm.log (47.62 KB, text/plain)
2016-10-18 15:11 UTC, Fabian von Feilitzsch
no flags Details
/var/log/vdsm/vdsm.log (1.13 MB, text/plain)
2016-10-18 15:11 UTC, Fabian von Feilitzsch
no flags Details
/var/log/yum.log (19.17 KB, text/plain)
2016-10-18 15:12 UTC, Fabian von Feilitzsch
no flags Details
answers file passed to hosted-engine installer (2.19 KB, text/plain)
2016-10-18 15:18 UTC, Fabian von Feilitzsch
no flags Details
/var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20161018143149-y3tpgi.log (443.63 KB, text/plain)
2016-10-20 15:02 UTC, Fabian von Feilitzsch
no flags Details
/tmp/ovirt-host-deploy-20161114171727-faui4d.log (397.30 KB, text/plain)
2016-11-14 18:43 UTC, Fabian von Feilitzsch
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1396032 0 high CLOSED libvirt fails applying an ebtables rule so a bridge goes down and so the VMs running on it 2021-02-22 00:41:40 UTC
oVirt gerrit 66853 0 master MERGED network: remove vdsm-no-mac-spoofing filter 2021-02-04 05:07:20 UTC
oVirt gerrit 66891 0 ovirt-hosted-engine-setup-2.0 MERGED network: remove vdsm-no-mac-spoofing filter 2021-02-04 05:07:20 UTC

Internal Links: 1396032

Description Fabian von Feilitzsch 2016-10-18 15:08:37 UTC
Created attachment 1211766 [details]
/var/log/messages

Description of problem:
When running the "hosted-engine --deploy" command on the RHEL 7.3 beta (to deploy RHV 4.0), the installation seems to hang while installing the host, after the engine is set up. In /var/log/messages I see several error messages from vdsm/libvirt (included below). It appears that VDSM shuts down (along with the engine VM) while the host is installing, and when it comes back the engine VM does not come with it. Manually (re)starting the engine (hosted-engine --vm-start) seems to allow the installer to continue.
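
For reference, the manual workaround looks roughly like this (from memory, not a verified recovery procedure):

# check what state the engine VM is in according to hosted-engine
hosted-engine --vm-status
# start the engine VM again so the installer can continue
hosted-engine --vm-start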

Logs that stood out to me (from journal):

Oct 18 14:50:39 mac5254002a783e.example.com libvirtd[26418]: libvirt version: 2.0.0, package: 4.el7 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2016-08-02-09:15:12, x86-034.build.eng.bos.redhat.com)
Oct 18 14:50:39 mac5254002a783e.example.com libvirtd[26418]: hostname: mac5254002a783e.example.com
Oct 18 14:50:39 mac5254002a783e.example.com libvirtd[26418]: internal error: Failed to apply firewall rules /usr/sbin/ebtables --concurrent -t nat -N libvirt-J-vnet0: The kernel doesn't support the ebtables 'nat' table.
...
Oct 18 14:50:44 mac5254002a783e.example.com vdsm[26957]: vdsm vds WARNING recovery: found 1 VMs from recovery files not reported by libvirt. This should not happen! Will try to recover them.
...
Oct 18 14:50:44 mac5254002a783e.example.com vdsm[26957]: vdsm vds WARNING Not ready yet, ignoring event u'|virt|VM_status|059ab810-5d7c-441c-8b52-2cade9e45fe1' args={u'059ab810-5d7c-441c-8b52-2cade9e45fe1': {'status': 'Down', 'exitReason': 9, 'exitMessage': 'Failed to find the libvirt domain', 'exitCode': 1}}


Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup : 2.0.2.2
libvirt-daemon : 2.0.0
vdsm : 4.18.13

How reproducible:
100%

Steps to Reproduce:
1. On a RHEL 7.3 beta host, install ovirt-hosted-engine-setup and rhevm-appliance
2. run hosted-engine --deploy and answer the questions

Actual results:
Installation hangs while installing the host

Expected results:
Installation succeeds 

Additional info:
Using NFS for storage

Here is the ansible playbook that is being used to deploy the self-hosted engine: https://github.com/fusor/ansible-ovirt/blob/master/self_hosted.yml

and the task file that runs the installer: https://github.com/fusor/ansible-ovirt/blob/master/roles/self_hosted_first_host/tasks/main.yml

I do not see this issue when running the installer with the upstream packages, and I did not see this issue until upgrading to the RHEL 7.3 beta.

Comment 1 Fabian von Feilitzsch 2016-10-18 15:09:20 UTC
Created attachment 1211767 [details]
/var/log/audit/audit.log

Comment 2 Fabian von Feilitzsch 2016-10-18 15:10:07 UTC
Created attachment 1211769 [details]
/var/log/libvirt/qemu/HostedEngine.log

Comment 3 Fabian von Feilitzsch 2016-10-18 15:10:44 UTC
Created attachment 1211770 [details]
/var/log/vdsm/mom.log

Comment 4 Fabian von Feilitzsch 2016-10-18 15:11:12 UTC
Created attachment 1211771 [details]
/var/log/vdsm/supervdsm.log

Comment 5 Fabian von Feilitzsch 2016-10-18 15:11:43 UTC
Created attachment 1211772 [details]
/var/log/vdsm/vdsm.log

Comment 6 Fabian von Feilitzsch 2016-10-18 15:12:09 UTC
Created attachment 1211773 [details]
/var/log/yum.log

Comment 7 Fabian von Feilitzsch 2016-10-18 15:18:14 UTC
Created attachment 1211774 [details]
answers file passed to hosted-engine installer

Comment 8 Simone Tiraboschi 2016-10-20 08:00:21 UTC
Fabian, can you please attach also hosted-engine-setup logs?

Comment 9 Fabian von Feilitzsch 2016-10-20 15:02:44 UTC
Created attachment 1212559 [details]
/var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20161018143149-y3tpgi.log

Of course I attached every log except for this one...

Comment 10 Simone Tiraboschi 2016-10-20 15:12:32 UTC
Thanks Fabian,
the issue seems to be here:

Oct 18 14:50:39 mac5254002a783e journal: internal error: Failed to apply firewall rules /usr/sbin/ebtables --concurrent -t nat -N libvirt-J-vnet0: The kernel doesn't support the ebtables 'nat' table.
Oct 18 14:50:39 mac5254002a783e kernel: ip_tables: (C) 2000-2006 Netfilter Core Team
Oct 18 14:50:39 mac5254002a783e iptables.init: iptables: Unloading modules: [  OK  ]
Oct 18 14:50:39 mac5254002a783e systemd: Stopped IPv4 firewall with iptables.
Oct 18 14:50:39 mac5254002a783e kernel: ovirtmgmt: port 2(vnet0) entered disabled state
Oct 18 14:50:39 mac5254002a783e kernel: device vnet0 left promiscuous mode
Oct 18 14:50:39 mac5254002a783e kernel: ovirtmgmt: port 2(vnet0) entered disabled state
Oct 18 14:50:39 mac5254002a783e systemd: Starting IPv4 firewall with iptables...
Oct 18 14:50:39 mac5254002a783e kvm: 0 guests now active
Oct 18 14:50:39 mac5254002a783e kernel: nf_conntrack version 0.5.0 (65536 buckets, 262144 max)
Oct 18 14:50:39 mac5254002a783e systemd-machined: Machine qemu-1-HostedEngine terminated.

When the engine tried to deploy the host, there was an error setting an ebtables rule.
iptables went down, the ovirtmgmt bridge went down as well, and the engine VM was terminated.
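
To confirm that on the failing host, something like the following should show the bridge losing its vnet0 port and the state of the nat table libvirt failed to modify (a diagnostic sketch, not an exact transcript):

# bridge ports; vnet0 should have disappeared from ovirtmgmt
brctl show ovirtmgmt
# whether the HostedEngine domain is still running according to libvirt
virsh -r list --all
# the ebtables nat table the failing rule was targeting
ebtables --concurrent -t nat -L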

Comment 11 Simone Tiraboschi 2016-10-20 16:17:35 UTC
I successfully deployed hosted-engine from 4.0.5-2 over a clean RHEL-7.3-20161010.n.0

No sign of the nat chain, since ovirtmgmt is a bridge.

Fabian, did you try configuring a NAT network somehow outside RHEV?

Comment 12 Fabian von Feilitzsch 2016-10-20 19:57:41 UTC
No, we leave the network configuration up to RHV.

Comment 13 Fabian von Feilitzsch 2016-10-25 15:14:28 UTC
Is there any info/log/command output about the networking setup that you need to see? This is currently breaking self-hosted deployments for QCI 1.1

Comment 14 Simone Tiraboschi 2016-10-25 15:41:17 UTC
Can you please attach the output of ebtables-save?

Comment 15 Fabian von Feilitzsch 2016-10-25 19:35:25 UTC
# Generated by ebtables-save v1.0 on Tue Oct 25 19:34:19 UTC 2016
*nat
:PREROUTING ACCEPT
:OUTPUT ACCEPT
:POSTROUTING ACCEPT
:J-vnet0-arp-mac ACCEPT

*filter
:INPUT ACCEPT
:FORWARD ACCEPT
:OUTPUT ACCEPT

Comment 16 Simone Tiraboschi 2016-10-25 20:01:50 UTC
(In reply to Fabian von Feilitzsch from comment #15)
> # Generated by ebtables-save v1.0 on Tue Oct 25 19:34:19 UTC 2016
> *nat
> :PREROUTING ACCEPT
> :OUTPUT ACCEPT
> :POSTROUTING ACCEPT
> :J-vnet0-arp-mac ACCEPT

The issue is due to the presence of this chain ^^^.
I tried on a clean 7.3 beta system with just hosted-engine and I didn't get it.
Maybe it's due to another QCI component.

> *filter
> :INPUT ACCEPT
> :FORWARD ACCEPT
> :OUTPUT ACCEPT

Comment 17 Fabian von Feilitzsch 2016-10-26 13:55:11 UTC
So is it expected that the presence of this chain causes this behavior?

Comment 18 Gil Klein 2016-10-31 13:15:54 UTC
Given the error message "The kernel doesn't support the ebtables 'nat' table", could you please share the output of:

1. "rpm -qa | grep ebtables"
2. "lsmod |grep ebtable"

Comment 19 Fabian von Feilitzsch 2016-10-31 14:58:06 UTC
$ rpm -qa | grep ebtables
ebtables-2.0.10-15.el7.x86_64

$ lsmod | grep ebtable
ebtable_filter         12827  0
ebtables               35009  1 ebtable_filter

Comment 20 Gil Klein 2016-10-31 20:34:56 UTC
(In reply to Fabian von Feilitzsch from comment #19)
> $ rpm -qa | grep ebtables
> ebtables-2.0.10-15.el7.x86_64
> 
> $ lsmod | grep ebtable
> ebtable_filter         12827  0
> ebtables               35009  1 ebtable_filter
I don't know why, but it seems the ebtable_nat kernel module is not loaded.

Could you please provide some more info:
1. uname -a
2. ls -l /usr/lib/modules/*/kernel/net/bridge/netfilter/ebtable_nat.ko
3. modprobe -v ebtable_nat

Comment 21 Fabian von Feilitzsch 2016-10-31 22:26:49 UTC
1. uname -a
Linux mac5254005ed267.example.com 3.10.0-493.el7.x86_64 #1 SMP Tue Aug 16 11:45:26 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux

2. ls -l /usr/lib/modules/*/kernel/net/bridge/netfilter/ebtable_nat.ko
-rw-r--r--. 1 root root 7361 Aug 16 16:15 /usr/lib/modules/3.10.0-493.el7.x86_64/kernel/net/bridge/netfilter/ebtable_nat.ko

3. modprobe -v ebtable_nat
( no output, return code of 0 )

Comment 22 Laine Stump 2016-11-01 16:06:16 UTC
Comment 20 is looking in the correct direction, but I wanted to dispel a misconception hinted at in Comment 11:

The fact that there is a rule applied to the ebtables "nat" table is unrelated to the connected network being a <forward mode='nat'> network - it just happens that the anti-mac-spoofing rules (applied by "<filterref filter='vdsm-no-mac-spoofing'/>" in the guest's <interface>) are implemented via ebtables' "nat" table. This is done regardless of the forward type of the network the guest is connected to.

The error message you're seeing is being relayed from the output of the ebtables command by libvirt and, as Gil points out, is due to the ebtable_nat module not being loaded. I'm not familiar enough with the internals of ebtables to know if that module is supposed to auto-load as the result of some event (similar to how the bridge module autoloads when you create a bridge device) or if there must be something explicit somewhere in system config to cause it to load. Definitely libvirt never explicitly loads it, though.
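
For reference, the filter is attached in the guest's interface definition with something like this (illustrative snippet, not taken from this particular host):

  <interface type='bridge'>
    <source bridge='ovirtmgmt'/>
    <filterref filter='vdsm-no-mac-spoofing'/>
  </interface>

and the rules it expands to can be inspected with "virsh nwfilter-dumpxml vdsm-no-mac-spoofing".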

Comment 23 Fabian von Feilitzsch 2016-11-01 20:17:37 UTC
Here is the result of ebtables -t nat -L:

Bridge table: nat

Bridge chain: PREROUTING, entries: 0, policy: ACCEPT

Bridge chain: OUTPUT, entries: 0, policy: ACCEPT

Bridge chain: POSTROUTING, entries: 0, policy: ACCEPT


I tried running this command before running the hosted-engine installer, in the hope that it would force the module to load if it weren't already, but it had no effect. I'm not sure why the system can see the nat table outside of the hosted-engine installer but not while it runs. This is very strange to me.

Comment 24 Fabian von Feilitzsch 2016-11-01 20:20:29 UTC
Also, running /usr/sbin/ebtables --concurrent -t nat -A J-vnet0-mac -s b2:29:9d:7a:32:2f -j RETURN by hand (after the failure was logged) succeeded with a return code of 0.

Comment 25 Gil Klein 2016-11-01 21:55:45 UTC
(In reply to Fabian von Feilitzsch from comment #23)
> Here is the result of ebtables -t nat -L:
> 
> Bridge table: nat
> 
> Bridge chain: PREROUTING, entries: 0, policy: ACCEPT
> 
> Bridge chain: OUTPUT, entries: 0, policy: ACCEPT
> 
> Bridge chain: POSTROUTING, entries: 0, policy: ACCEPT
> 
> 
> I tried running this command before running the hosted-engine installer, in
> the hopes that if it weren't loaded this would force it, but it had no
> effect. I'm not sure why the system sees the nat module outside of the
> hosted-engine installer. This is very strange to me.
Could you please try the following:
1. modprobe ebtable_nat
2. Confirm you see ebtable_nat loaded by checking the output of: "lsmod | grep ebtable_nat"
3. If you see it loaded run "hosted-engine --deploy" and follow the wizard
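4. (A general sketch, not yet verified to address this bug) If the manual modprobe helps, the module load can be made persistent across reboots with a modules-load.d entry, e.g.:

echo ebtable_nat > /etc/modules-load.d/ebtable_nat.conf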

Comment 26 Fabian von Feilitzsch 2016-11-01 23:19:17 UTC
1. modprobe ebtable_nat
( no output, return code of 0 )
2. Confirm you see ebtable_nat loaded by checking the output of: "lsmod | grep ebtable_nat"
ebtable_nat            12807  1 
ebtables               35009  2 ebtable_nat,ebtable_filter


3. If you see it loaded run "hosted-engine --deploy" and follow the wizard
Same result as in the included log file, and the ebtables error message appears in /var/log/messages as usual.

Comment 27 Fabian von Feilitzsch 2016-11-01 23:38:42 UTC
Interestingly, the engine VM eventually came back up (I would guess it was successfully recovered this time) and the host was in an up state, so the deployment eventually succeeded.

Comment 28 Fabian von Feilitzsch 2016-11-03 18:18:32 UTC
I am still able to replicate this with the 7.3 GA. I ran ebtables-save on a host, freshly provisioned through Satellite, and it matched exactly the output of ebtables-save on my non-Satellite managed host. Here is the output:

*nat
:PREROUTING ACCEPT
:OUTPUT ACCEPT
:POSTROUTING ACCEPT
:PREROUTING_direct RETURN
:POSTROUTING_direct RETURN
:OUTPUT_direct RETURN
-A PREROUTING -j PREROUTING_direct
-A OUTPUT -j OUTPUT_direct
-A POSTROUTING -j POSTROUTING_direct

*broute
:BROUTING ACCEPT
:BROUTING_direct RETURN
-A BROUTING -j BROUTING_direct

*filter
:INPUT ACCEPT
:FORWARD ACCEPT
:OUTPUT ACCEPT
:INPUT_direct RETURN
:OUTPUT_direct RETURN
:FORWARD_direct RETURN
-A INPUT -j INPUT_direct
-A FORWARD -j FORWARD_direct
-A OUTPUT -j OUTPUT_direct


I have not been able to replicate this on any host not provisioned with Satellite. We use the Satellite Kickstart Default provisioning template. I'm going to attach the generated kickstart in a private comment; maybe there's something in there I'm overlooking.

Comment 30 Fabian von Feilitzsch 2016-11-03 18:22:52 UTC
(In reply to Fabian von Feilitzsch from comment #28)
> I am still able to replicate this with the 7.3 GA. I ran ebtables-save on a
> host, freshly provisioned through Satellite, and it matched exactly the
> output of ebtables-save on my non-Satellite managed host.

Sorry, to be clear, this is before running the hosted-engine installer. They seem to have the same initial ebtables rules configured prior to running the installer.

Comment 31 James Olin Oden 2016-11-04 18:32:24 UTC
I ran into this problem. I decided to run journalctl to look for vdsm entries, because it had clearly shut down: I saw it receive a SIGTERM and shut down in /var/log/vdsm/vdsm.log. Anyway, the last entries concerning vdsm were:

Nov 04 16:34:20 mac525400bb29f6.b.b vdsmd_init_common.sh[23484]: vdsm: Running run_final_hooks
Nov 04 16:34:20 mac525400bb29f6.b.b systemd[1]: Stopped Virtual Desktop Server Manager.
Nov 04 16:34:20 mac525400bb29f6.b.b systemd[1]: Stopping Auxiliary vdsm service for running helper functions as root...
Nov 04 16:34:20 mac525400bb29f6.b.b python[23028]: DIGEST-MD5 client mech dispose
Nov 04 16:34:20 mac525400bb29f6.b.b python[23028]: DIGEST-MD5 common mech dispose
Nov 04 16:34:20 mac525400bb29f6.b.b daemonAdapter[23028]: Traceback (most recent call last):
Nov 04 16:34:20 mac525400bb29f6.b.b daemonAdapter[23028]: File "/usr/lib64/python2.7/multiprocessing/util.py", line 268, in _run_finalizers
Nov 04 16:34:20 mac525400bb29f6.b.b daemonAdapter[23028]: finalizer()
Nov 04 16:34:20 mac525400bb29f6.b.b daemonAdapter[23028]: File "/usr/lib64/python2.7/multiprocessing/util.py", line 201, in __call__
Nov 04 16:34:20 mac525400bb29f6.b.b daemonAdapter[23028]: res = self._callback(*self._args, **self._kwargs)
Nov 04 16:34:20 mac525400bb29f6.b.b daemonAdapter[23028]: OSError: [Errno 2] No such file or directory: '/var/run/vdsm/svdsm.sock'
Nov 04 16:34:20 mac525400bb29f6.b.b systemd[1]: Stopped Auxiliary vdsm service for running helper functions as root.
Nov 04 16:34:18 mac525400bb29f6.b.b systemd[1]: Time has been changed

Note that at the end of this there is an entry about the time changing, and it actually occurred before the vdsm messages. It happened at 16:34:18, while the vdsm messages all happened at 16:34:20. So the log is out of order, but it's clear a time change happened a few seconds before vdsm shut down.

Comment 34 Fabian von Feilitzsch 2016-11-08 14:49:01 UTC
(In reply to James Olin Oden from comment #31)

> Nov 04 16:34:18 mac525400bb29f6.b.b systemd[1]: Time has been changed
> 
> Note at the end of this there is an entry about the time changing, and it
> actually occurred before the vdsm messages.   It happened at 16:34:18 while
> the vdsm messages all happened at 16:34:20.   So the log is out of order,
> but it's clear a time changes happened a few seconds before vdsm shutdown.

I looked into this; I'm not sure what's causing it, but I see it in both successful and failing environments, so I don't think it's causing the issue we're seeing.

Comment 35 Fabian von Feilitzsch 2016-11-08 21:59:00 UTC
I think the issue might be caused by the installation of the @Core package group by the Satellite Default Kickstart.

Comment 37 Simone Tiraboschi 2016-11-09 13:57:41 UTC
(In reply to Fabian von Feilitzsch from comment #35)
> I think the issue might be caused by the installation of the @Core package
> group by the Satellite Default Kickstart.

Fabian, do you know why Satellite requires those ebtables rules, or do you think they are just a side effect of an existing template?
If so, we have to open a bug on Satellite to clean up that template.

Dan, is there something we can do on the VDSM side to handle those rules?

Comment 38 Laine Stump 2016-11-09 15:40:07 UTC
As far as I can understand, it isn't the adding of the ebtables rules that causes the problem - those are *always* there for RHV guests. The problem (at least the error message logged by libvirt) is caused by something preventing the ebtable_nat module from being autoloaded.

Comment 39 Fabian von Feilitzsch 2016-11-09 16:04:39 UTC
(In reply to Simone Tiraboschi from comment #37)

> Fabian, so you know why Satellite requires that ebtable rules or do you
> think that they are there just a side effect of an existing template?
> If so we have to open a bug on satellite to cleanup that template.

I don't think there's anything wrong with the template, unless the packages in the @Core package group are not compatible with RHV. Here is the list of packages in @Core: 

NetworkManager-1:0.9.9.1-25.git20140326.4dba720.el7_0.i686
NetworkManager-1:1.4.0-12.el7.x86_64
NetworkManager-config-server-1:1.4.0-12.el7.x86_64
NetworkManager-team-1:1.4.0-12.el7.x86_64
NetworkManager-tui-1:1.4.0-12.el7.x86_64
Red_Hat_Enterprise_Linux-Release_Notes-7-en-US-0:7-2.el7.noarch
aic94xx-firmware-0:30-6.el7.noarch
alsa-firmware-0:1.0.28-2.el7.noarch
audit-0:2.6.5-3.el7.x86_64
basesystem-0:10.0-7.el7.noarch
bash-0:4.2.46-20.el7_2.x86_64
biosdevname-0:0.7.2-1.el7.x86_64
btrfs-progs-0:4.4.1-1.el7.i686
btrfs-progs-0:4.4.1-1.el7.x86_64
coreutils-0:8.22-18.el7.x86_64
cronie-0:1.4.11-14.el7_2.1.x86_64
curl-0:7.29.0-35.el7.x86_64
dhclient-12:4.2.5-47.el7.x86_64
dracut-config-generic-0:033-463.el7.x86_64
dracut-config-rescue-0:033-463.el7.x86_64
dracut-fips-0:033-463.el7.x86_64
dracut-fips-aesni-0:033-463.el7.x86_64
dracut-network-0:033-463.el7.x86_64
e2fsprogs-0:1.42.9-9.el7.x86_64
filesystem-0:3.2-21.el7.x86_64
firewalld-0:0.4.3.2-8.el7.noarch
glibc-0:2.17-157.el7.i686
glibc-0:2.17-157.el7.x86_64
hostname-0:3.13-3.el7.x86_64
initscripts-0:9.49.37-1.el7.x86_64
iproute-0:3.10.0-74.el7.x86_64
iprutils-0:2.4.13.1-1.el7.x86_64
iptables-0:1.4.21-17.el7.i686
iptables-0:1.4.21-17.el7.x86_64
iputils-0:20160308-8.el7.x86_64
irqbalance-3:1.0.7-6.el7.x86_64
ivtv-firmware-2:20080701-26.el7.noarch
iwl100-firmware-0:39.31.5.1-49.el7.noarch
iwl1000-firmware-1:39.31.5.1-49.el7.noarch
iwl105-firmware-0:18.168.6.1-49.el7.noarch
iwl135-firmware-0:18.168.6.1-49.el7.noarch
iwl2000-firmware-0:18.168.6.1-49.el7.noarch
iwl2030-firmware-0:18.168.6.1-49.el7.noarch
iwl3160-firmware-0:22.0.7.0-49.el7.noarch
iwl3945-firmware-0:15.32.2.9-49.el7.noarch
iwl4965-firmware-0:228.61.2.24-49.el7.noarch
iwl5000-firmware-0:8.83.5.1_1-49.el7.noarch
iwl5150-firmware-0:8.24.2.2-49.el7.noarch
iwl6000-firmware-0:9.221.4.1-49.el7.noarch
iwl6000g2a-firmware-0:17.168.5.3-49.el7.noarch
iwl6000g2b-firmware-0:17.168.5.2-49.el7.noarch
iwl6050-firmware-0:41.28.5.1-49.el7.noarch
iwl7260-firmware-0:22.0.7.0-49.el7.noarch
iwl7265-firmware-0:22.0.7.0-49.el7.noarch
kbd-0:1.15.5-12.el7.x86_64
kernel-tools-0:3.10.0-514.el7.x86_64
kexec-tools-0:2.0.7-50.el7.x86_64
less-0:458-9.el7.x86_64
libertas-sd8686-firmware-0:20140804-0.1.git6bce2b0.el7_0.noarch
libertas-sd8787-firmware-0:20140804-0.1.git6bce2b0.el7_0.noarch
libertas-usb8388-firmware-2:20140804-0.1.git6bce2b0.el7_0.noarch
libsysfs-0:2.1.0-16.el7.i686
libsysfs-0:2.1.0-16.el7.x86_64
linux-firmware-0:20160830-49.git7534e19.el7.noarch
man-db-0:2.6.3-9.el7.x86_64
microcode_ctl-2:2.1-16.el7.x86_64
ncurses-0:5.9-13.20130511.el7.x86_64
openssh-clients-0:6.6.1p1-31.el7.x86_64
openssh-keycat-0:6.6.1p1-31.el7.x86_64
openssh-server-0:6.6.1p1-31.el7.x86_64
parted-0:3.1-28.el7.i686
parted-0:3.1-28.el7.x86_64
passwd-0:0.79-4.el7.x86_64
plymouth-0:0.8.9-0.26.20140113.el7.x86_64
policycoreutils-0:2.5-8.el7.x86_64
postfix-2:2.10.1-6.el7.x86_64
procps-ng-0:3.3.10-10.el7.i686
procps-ng-0:3.3.10-10.el7.x86_64
rdma-0:7.3_4.7_rc2-5.el7.noarch
redhat-support-tool-0:0.9.8-6.el7.noarch
rootfiles-0:8.1-11.el7.noarch
rpm-0:4.11.3-21.el7.x86_64
rsyslog-0:7.4.7-16.el7.x86_64
selinux-policy-mls-0:3.13.1-102.el7_3.4.noarch
selinux-policy-targeted-0:3.13.1-102.el7_3.4.noarch
setup-0:2.8.71-7.el7.noarch
shadow-utils-2:4.1.5.1-24.el7.x86_64
subscription-manager-0:1.17.15-1.el7.x86_64
sudo-0:1.8.6p7-20.el7.x86_64
systemd-0:219-30.el7_3.3.x86_64
tar-2:1.26-31.el7.x86_64
tboot-1:1.9.4-2.el7.x86_64
tuned-0:2.7.1-3.el7.noarch
util-linux-0:2.23.2-33.el7.i686
util-linux-0:2.23.2-33.el7.x86_64
vim-minimal-2:7.4.160-1.el7.x86_64
xfsprogs-0:4.5.0-8.el7.i686
xfsprogs-0:4.5.0-8.el7.x86_64
yum-0:3.4.3-150.el7.noarch
yum-rhn-plugin-0:2.0.1-6.el7.noarch


I think I've narrowed the failure down to one of these packages (testing in progress):

iptables-0:1.4.21-17.el7.i686
iptables-0:1.4.21-17.el7.x86_64
cronie-0:1.4.11-14.el7_2.1.x86_64
dhclient-12:4.2.5-47.el7.x86_64
firewalld-0:0.4.3.2-8.el7.noarch
iproute-0:3.10.0-74.el7.x86_64
iprutils-0:2.4.13.1-1.el7.x86_64
iputils-0:20160308-8.el7.x86_64
kernel-tools-0:3.10.0-514.el7.x86_64
kexec-tools-0:2.0.7-50.el7.x86_64
policycoreutils-0:2.5-8.el7.x86_64
selinux-policy-mls-0:3.13.1-102.el7_3.4.noarch
selinux-policy-targeted-0:3.13.1-102.el7_3.4.noarch
subscription-manager-0:1.17.15-1.el7.x86_64


Is there anything in there that should not be installed alongside RHV?

Comment 40 Simone Tiraboschi 2016-11-09 16:35:22 UTC
Did you try stopping and masking NetworkManager?

Comment 41 Fabian von Feilitzsch 2016-11-09 18:05:16 UTC
(In reply to Simone Tiraboschi from comment #40)
> Did you try stopping and masking NetworkManager?

We had NetworkManager stopped, but not masked. Just did a test run in the vagrant environment and it seems to have succeeded, going to try the same from QCI.

Comment 42 Fabian von Feilitzsch 2016-11-09 21:42:09 UTC
> We had NetworkManager stopped, but not masked. Just did a test run in the
> vagrant environment and it seems to have succeeded, going to try the same
> from QCI.

Not sure of the cause, but it still fails in QCI. I have pushed an update to the vagrant reproduction environment that can replicate this failure even with NetworkManager masked.

Comment 43 Simone Tiraboschi 2016-11-10 16:27:55 UTC
I also opened a related bug on Satellite to see if somebody could help on that front.

Comment 44 Fabian von Feilitzsch 2016-11-10 16:31:30 UTC
Good call. With NetworkManager masked, I can still replicate the failures consistently with Satellite, but I am having trouble consistently replicating it in my vagrant environment.

Comment 46 Fabian von Feilitzsch 2016-11-14 15:45:42 UTC
Would it be possible to see a kickstart that the RHV QE team is using for self-hosted testing? Would like to compare that to what we have.

Comment 47 Fabian von Feilitzsch 2016-11-14 18:43:03 UTC
Created attachment 1220501 [details]
/tmp/ovirt-host-deploy-20161114171727-faui4d.log

I was poking around one of my deployments and I found a reference to this log of the host deploy process. Not sure if there's anything you'll find useful in there. The error occurs at 17:17:50, then there are no logs until 17:33:19.

Comment 48 Fabian von Feilitzsch 2016-11-14 21:32:11 UTC
We just got access to a bare metal system and it seems like we aren't able to reproduce the issue there. Could this be an issue with RHEL 7.3 and nested virt?

Comment 49 Simone Tiraboschi 2016-11-15 08:31:43 UTC
(In reply to Fabian von Feilitzsch from comment #46)
> Would it be possible to see a kickstart that the RHV QE team is using for
> self-hosted testing? Would like to compare that to what we have.

I'm not using a kickstart file to initialize my hosts.
Not sure about QE.

Comment 50 Simone Tiraboschi 2016-11-15 08:33:58 UTC
(In reply to Fabian von Feilitzsch from comment #48)
> We just got access to a bare metal system and it seems like we aren't able
> to reproduce the issue there. Could this be an issue with RHEL 7.3 and
> nested virt?

I also tried to reproduce on RHEL 7.3 in a nested environment, but it worked as expected.

Comment 51 Yedidyah Bar David 2016-11-15 08:49:35 UTC
(In reply to Fabian von Feilitzsch from comment #46)
> Would it be possible to see a kickstart that the RHV QE team is using for
> self-hosted testing? Would like to compare that to what we have.

oVirt CI also uses Lago with ovirt-system-tests. You might have a look at those.

Not sure about RHV QE, but you can see some of their stuff here:
https://github.com/rhevm-qe-automation

Comment 52 Tasos Papaioannou 2016-11-15 21:51:15 UTC
FYI, I've verified that downgrading the ebtables rpm from ebtables-2.0.10-15.el7.x86_64 to ebtables-2.0.10-13.el7.x86_64 on the RHEL 7.3 host before running hosted-engine --deploy resolves the issue. The "kernel doesn't support the ebtables 'nat' table" error doesn't appear in the system logs, and the deployment completes successfully:

----
[ INFO  ] Waiting for the host to become operational in the engine. This may take several minutes...
[ INFO  ] Still waiting for VDSM host to become operational...
[ INFO  ] The VDSM Host is now operational
[ INFO  ] Saving hosted-engine configuration on the shared storage domain
[ INFO  ] Shutting down the engine VM
[ INFO  ] Enabling and starting HA services
[ INFO  ] Stage: Clean up
[ INFO  ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20161115214832.conf'
[ INFO  ] Generating answer file '/etc/ovirt-hosted-engine/answers.conf'
[ INFO  ] Stage: Pre-termination
[ INFO  ] Stage: Termination
[ INFO  ] Hosted Engine successfully deployed
----
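
For reference, the downgrade can be done with something like the following, assuming the older ebtables build is still available in the enabled repositories:

# downgrade ebtables to the previous build (workaround only, not a fix)
yum downgrade ebtables-2.0.10-13.el7.x86_64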

Comment 53 Artyom 2016-11-16 10:13:21 UTC
2016-11-16 08:29:00.919+0000: 20933: error : virFirewallApplyRuleDirect:732 : internal error: Failed to apply firewall rules /usr/sbin/ebtables --concurrent -t nat -N J-vnet0-mac: Chain J-vnet0-mac already exists.

I saw this bug on versions:
# rpm -qa | grep vdsm
vdsm-hook-ethtool-options-4.18.15.3-1.el7ev.noarch
vdsm-xmlrpc-4.18.15.3-1.el7ev.noarch
vdsm-api-4.18.15.3-1.el7ev.noarch
vdsm-yajsonrpc-4.18.15.3-1.el7ev.noarch
vdsm-infra-4.18.15.3-1.el7ev.noarch
vdsm-cli-4.18.15.3-1.el7ev.noarch
vdsm-hook-vmfex-dev-4.18.15.3-1.el7ev.noarch
vdsm-4.18.15.3-1.el7ev.x86_64
vdsm-python-4.18.15.3-1.el7ev.noarch
vdsm-jsonrpc-4.18.15.3-1.el7ev.noarch

# rpm -qa | grep ebtables
ebtables-2.0.10-15.el7.x86_64

# rpm -qa | grep libvirt
libvirt-daemon-config-nwfilter-2.0.0-10.el7.x86_64
libvirt-daemon-2.0.0-10.el7.x86_64
libvirt-daemon-driver-secret-2.0.0-10.el7.x86_64
libvirt-client-2.0.0-10.el7.x86_64
libvirt-daemon-driver-storage-2.0.0-10.el7.x86_64
libvirt-daemon-driver-lxc-2.0.0-10.el7.x86_64
libvirt-2.0.0-10.el7.x86_64
libvirt-python-2.0.0-2.el7.x86_64
libvirt-lock-sanlock-2.0.0-10.el7.x86_64
libvirt-daemon-driver-nwfilter-2.0.0-10.el7.x86_64
libvirt-daemon-config-network-2.0.0-10.el7.x86_64
libvirt-daemon-driver-nodedev-2.0.0-10.el7.x86_64
libvirt-daemon-kvm-2.0.0-10.el7.x86_64
libvirt-daemon-driver-network-2.0.0-10.el7.x86_64
libvirt-daemon-driver-interface-2.0.0-10.el7.x86_64
libvirt-daemon-driver-qemu-2.0.0-10.el7.x86_64

# rpm -qa | grep hosted
ovirt-hosted-engine-ha-2.0.4-1.el7ev.noarch
ovirt-hosted-engine-setup-2.0.3-2.el7ev.noarch

Comment 54 Simone Tiraboschi 2016-11-16 10:41:46 UTC
We found a similar issue installing hosted-engine on RHEL 7.3:

2016-11-16 08:29:00.884+0000: 20933: info : virFirewallApplyGroup:895 : Starting transaction for firewall=0x7f2a6c001820 group=0x7f2a6c004050 flags=1
2016-11-16 08:29:00.884+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -D PREROUTING -i vnet0 -j libvirt-J-vnet0'
2016-11-16 08:29:00.888+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -D POSTROUTING -o vnet0 -j libvirt-P-vnet0'
2016-11-16 08:29:00.891+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -L libvirt-J-vnet0'
2016-11-16 08:29:00.893+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -L libvirt-P-vnet0'
2016-11-16 08:29:00.896+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -F libvirt-J-vnet0'
2016-11-16 08:29:00.899+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -X libvirt-J-vnet0'
2016-11-16 08:29:00.900+0000: 20840: info : virEventPollDispatchHandles:506 : EVENT_POLL_DISPATCH_HANDLE: watch=9 events=1
2016-11-16 08:29:00.900+0000: 20840: info : virEventPollRunOnce:640 : EVENT_POLL_RUN: nhandles=12 timeout=-1
2016-11-16 08:29:00.901+0000: 20840: info : virEventPollDispatchHandles:506 : EVENT_POLL_DISPATCH_HANDLE: watch=12 events=1
2016-11-16 08:29:00.901+0000: 20840: info : virEventPollRunOnce:640 : EVENT_POLL_RUN: nhandles=12 timeout=-1
2016-11-16 08:29:00.902+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -F libvirt-P-vnet0'
2016-11-16 08:29:00.905+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -X libvirt-P-vnet0'
2016-11-16 08:29:00.907+0000: 20933: info : virFirewallApplyGroup:895 : Starting transaction for firewall=0x7f2a6c001820 group=0x7f2a6c006a10 flags=0
2016-11-16 08:29:00.907+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -N libvirt-J-vnet0'
2016-11-16 08:29:00.910+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -F J-vnet0-mac'
2016-11-16 08:29:00.913+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -X J-vnet0-mac'
2016-11-16 08:29:00.916+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -N J-vnet0-mac'
2016-11-16 08:29:00.919+0000: 20933: error : virFirewallApplyRuleDirect:732 : internal error: Failed to apply firewall rules /usr/sbin/ebtables --concurrent -t nat -N J-vnet0-mac: Chain J-vnet0-mac already exists.

2016-11-16 08:29:00.919+0000: 20933: info : virFirewallRollbackGroup:915 : Starting rollback for group 0x7f2a6c006a10
2016-11-16 08:29:00.919+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -D PREROUTING -i vnet0 -j libvirt-J-vnet0'
2016-11-16 08:29:00.921+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -D POSTROUTING -o vnet0 -j libvirt-P-vnet0'
2016-11-16 08:29:00.924+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -L libvirt-J-vnet0'
2016-11-16 08:29:00.927+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -L libvirt-P-vnet0'
2016-11-16 08:29:00.930+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -F libvirt-J-vnet0'
2016-11-16 08:29:00.933+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -X libvirt-J-vnet0'
2016-11-16 08:29:00.936+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -F libvirt-P-vnet0'
2016-11-16 08:29:00.938+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -X libvirt-P-vnet0'

libvirt fails applying an ebtables rule and so starts to roll back.
vnet0 goes down, taking ovirtmgmt and the engine VM down with it.

Nov 16 10:29:00 puma23 systemd: Starting Virtual Desktop Server Manager network restoration...
Nov 16 10:29:00 puma23 dnsmasq[4773]: read /etc/hosts - 2 addresses
Nov 16 10:29:00 puma23 dnsmasq[4773]: read /var/lib/libvirt/dnsmasq/default.addnhosts - 0 addresses
Nov 16 10:29:00 puma23 dnsmasq-dhcp[4773]: read /var/lib/libvirt/dnsmasq/default.hostsfile
Nov 16 10:29:00 puma23 kernel: type=1400 audit(1479284940.858:26): avc:  denied  { search } for  pid=19894 comm="systemd-machine" name="19893" dev="proc" ino=155551324 scontext=system_u:system_r:systemd_machined_t:s0 tcontext=system_u:system_r:svirt_t:s0:c431,c860 tclass=dir
Nov 16 10:29:01 puma23 kernel: ovirtmgmt: port 2(vnet0) entered disabled state
Nov 16 10:29:01 puma23 kernel: device vnet0 left promiscuous mode
Nov 16 10:29:01 puma23 kernel: ovirtmgmt: port 2(vnet0) entered disabled state
Nov 16 10:29:01 puma23 kvm: 0 guests now active
Nov 16 10:29:01 puma23 systemd-machined: Machine qemu-1-HostedEngine terminated.
Nov 16 10:29:01 puma23 systemd: Started Virtual Desktop Server Manager network restoration.
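
If a deployment gets stuck this way, the stale per-VM chain left behind by the failed transaction can probably be removed by hand before retrying, along these lines (a manual cleanup sketch, not a verified procedure):

# flush and delete the leftover chain reported as already existing
ebtables --concurrent -t nat -F J-vnet0-mac
ebtables --concurrent -t nat -X J-vnet0-mac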

Comment 55 Tasos Papaioannou 2016-11-16 16:02:02 UTC
I've verified that hosted-engine --deploy runs successfully with ebtables-2.0.10-15.el7.x86_64 and the changes in https://gerrit.ovirt.org/#/c/66853 / https://gerrit.ovirt.org/#/c/66891.

Comment 56 Simone Tiraboschi 2016-11-16 17:21:37 UTC
(In reply to Tasos Papaioannou from comment #55)
> I've verified that hosted-engine --deploy runs successfully with
> ebtables-2.0.10-15.el7.x86_64 and the changes in
> https://gerrit.ovirt.org/#/c/66853 / https://gerrit.ovirt.org/#/c/66891.

This is basically a workaround, since the issue with ebtables and vdsm-no-mac-spoofing is probably still there, but in the end we don't really need vdsm-no-mac-spoofing for the engine VM.
So without vdsm-no-mac-spoofing we don't have that ebtables rule, and so we don't see the issue, at least for hosted-engine-setup.

As far as I saw, the issue wasn't completely systematic, so I don't exclude a race condition between libvirt setting the ebtables rules and the kernel loading the required modules.
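
After deploying with the patched setup, the engine VM's libvirt domain XML should no longer reference the filter; a quick check, assuming the domain is named HostedEngine as in the logs above:

virsh -r dumpxml HostedEngine | grep filterref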

Comment 57 Laine Stump 2016-11-16 18:14:00 UTC
> I don't exclude a race condition between libvirt setting the ebtables
> rules and the kernel loading the required modules.

If there is a race then it is self-contained in the ebtables module - ebtables is run to add a rule, and that rule requires autoloading of the ebtable_nat module, so it is loaded in order to satisfy the rule.

Aside from that, since the ebtable_nat module is apparently still not loaded even after the failure, I would venture that a race is not the problem; the autoload simply isn't happening for whatever reason.

Comment 58 Nikolai Sednev 2016-11-29 11:16:37 UTC
Works for me on these components on hosts:
rhevm-appliance-20161116.0-1.el7ev.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
rhevm-appliance-20161116.0-1.el7ev.noarch
ovirt-setup-lib-1.0.2-1.el7ev.noarch
ovirt-imageio-daemon-0.4.0-0.el7ev.noarch
sanlock-3.4.0-1.el7.x86_64
ovirt-vmconsole-1.0.4-1.el7ev.noarch
mom-0.5.8-1.el7ev.noarch
vdsm-4.18.17-1.el7ev.x86_64
ovirt-hosted-engine-setup-2.0.4-2.el7ev.noarch
rhev-release-4.0.6-3-001.noarch
ovirt-host-deploy-1.5.3-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
ovirt-imageio-common-0.3.0-0.el7ev.noarch
libvirt-client-2.0.0-10.el7.x86_64
qemu-kvm-rhev-2.6.0-27.el7.x86_64
ovirt-hosted-engine-ha-2.0.5-1.el7ev.noarch
Linux version 3.10.0-514.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Wed Oct 19 11:24:13 EDT 2016
Linux 3.10.0-514.el7.x86_64 #1 SMP Wed Oct 19 11:24:13 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

Engine:
rhevm-setup-plugins-4.0.0.3-1.el7ev.noarch
rhevm-spice-client-x86-msi-4.0-3.el7ev.noarch
rhevm-dependencies-4.0.0-1.el7ev.noarch
rhev-guest-tools-iso-4.0-6.el7ev.noarch
rhevm-4.0.5.5-0.1.el7ev.noarch
rhevm-spice-client-x64-msi-4.0-3.el7ev.noarch
rhevm-doc-4.0.5-1.el7ev.noarch
rhevm-branding-rhev-4.0.0-5.el7ev.noarch
rhevm-guest-agent-common-1.0.12-3.el7ev.noarch
Linux version 3.10.0-514.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Wed Oct 19 11:24:13 EDT 2016
Linux 3.10.0-514.el7.x86_64 #1 SMP Wed Oct 19 11:24:13 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

I've successfully deployed rhevm-appliance-20161116.0-1.el7ev.noarch over NFS on the host.

