Created attachment 1211766 [details] /var/log/messages

Description of problem:

When running the "hosted-engine --deploy" command on the RHEL 7.3 beta (to deploy RHV 4.0), the installation seems to hang while installing the host, after the engine is set up. In /var/log/messages I see several error messages from vdsm/libvirt (included below). It appears that VDSM shuts down (along with the engine VM) while the host is installing, and when it comes back the engine VM does not come with it. Manually (re)starting the engine (hosted-engine --vm-start) seems to allow the installer to continue.

Logs that stood out to me (from the journal):

Oct 18 14:50:39 mac5254002a783e.example.com libvirtd[26418]: libvirt version: 2.0.0, package: 4.el7 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2016-08-02-09:15:12, x86-034.build.eng.bos.redhat.com)
Oct 18 14:50:39 mac5254002a783e.example.com libvirtd[26418]: hostname: mac5254002a783e.example.com
Oct 18 14:50:39 mac5254002a783e.example.com libvirtd[26418]: internal error: Failed to apply firewall rules /usr/sbin/ebtables --concurrent -t nat -N libvirt-J-vnet0: The kernel doesn't support the ebtables 'nat' table.
...
Oct 18 14:50:44 mac5254002a783e.example.com vdsm[26957]: vdsm vds WARNING recovery: found 1 VMs from recovery files not reported by libvirt. This should not happen! Will try to recover them.
...
Oct 18 14:50:44 mac5254002a783e.example.com vdsm[26957]: vdsm vds WARNING Not ready yet, ignoring event u'|virt|VM_status|059ab810-5d7c-441c-8b52-2cade9e45fe1' args={u'059ab810-5d7c-441c-8b52-2cade9e45fe1': {'status': 'Down', 'exitReason': 9, 'exitMessage': 'Failed to find the libvirt domain', 'exitCode': 1}}

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup : 2.0.2.2
libvirt-daemon : 2.0.0
vdsm : 4.18.13

How reproducible:
100%

Steps to Reproduce:
1. On a RHEL 7.3 beta host, install ovirt-hosted-engine-setup and rhevm-appliance
2. Run hosted-engine --deploy and answer the questions

Actual results:
Installation hangs while installing the host

Expected results:
Installation succeeds

Additional info:
Using NFS for storage

Here is the Ansible playbook that is being used to deploy the self-hosted engine:
https://github.com/fusor/ansible-ovirt/blob/master/self_hosted.yml
and the task file that runs the installer:
https://github.com/fusor/ansible-ovirt/blob/master/roles/self_hosted_first_host/tasks/main.yml

I do not see this issue when running the installer with the upstream packages, and I did not see this issue until upgrading to the RHEL 7.3 beta.
Created attachment 1211767 [details] /var/log/audit/audit.log
Created attachment 1211769 [details] /var/log/libvirt/qemu/HostedEngine.log
Created attachment 1211770 [details] /var/log/vdsm/mom.log
Created attachment 1211771 [details] /var/log/vdsm/supervdsm.log
Created attachment 1211772 [details] /var/log/vdsm/vdsm.log
Created attachment 1211773 [details] /var/log/yum.log
Created attachment 1211774 [details] answers file passed to hosted-engine installer
Fabian, can you please also attach the hosted-engine-setup logs?
Created attachment 1212559 [details] /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20161018143149-y3tpgi.log

Of course I attached every log except for this one...
Thanks Fabian, the issue seems to be here:

Oct 18 14:50:39 mac5254002a783e journal: internal error: Failed to apply firewall rules /usr/sbin/ebtables --concurrent -t nat -N libvirt-J-vnet0: The kernel doesn't support the ebtables 'nat' table.
Oct 18 14:50:39 mac5254002a783e kernel: ip_tables: (C) 2000-2006 Netfilter Core Team
Oct 18 14:50:39 mac5254002a783e iptables.init: iptables: Unloading modules: [ OK ]
Oct 18 14:50:39 mac5254002a783e systemd: Stopped IPv4 firewall with iptables.
Oct 18 14:50:39 mac5254002a783e kernel: ovirtmgmt: port 2(vnet0) entered disabled state
Oct 18 14:50:39 mac5254002a783e kernel: device vnet0 left promiscuous mode
Oct 18 14:50:39 mac5254002a783e kernel: ovirtmgmt: port 2(vnet0) entered disabled state
Oct 18 14:50:39 mac5254002a783e systemd: Starting IPv4 firewall with iptables...
Oct 18 14:50:39 mac5254002a783e kvm: 0 guests now active
Oct 18 14:50:39 mac5254002a783e kernel: nf_conntrack version 0.5.0 (65536 buckets, 262144 max)
Oct 18 14:50:39 mac5254002a783e systemd-machined: Machine qemu-1-HostedEngine terminated.

When the engine tried to deploy the host, there was an error setting an ebtables rule. iptables went down, then the ovirtmgmt bridge went down too, and the engine VM got terminated.
I successfully deployed hosted-engine from 4.0.5-2 over a clean RHEL-7.3-20161010.n.0. There was no sign of a nat chain, since ovirtmgmt is a bridge.

Fabian, did you try configuring a NAT network somehow outside RHEV?
No, we leave the network configuration up to RHV.
Is there any info/log/command output about the networking setup that you need to see? This is currently breaking self-hosted deployments for QCI 1.1.
Can you please attach the output of ebtables-save?
# Generated by ebtables-save v1.0 on Tue Oct 25 19:34:19 UTC 2016
*nat
:PREROUTING ACCEPT
:OUTPUT ACCEPT
:POSTROUTING ACCEPT
:J-vnet0-arp-mac ACCEPT
*filter
:INPUT ACCEPT
:FORWARD ACCEPT
:OUTPUT ACCEPT
(In reply to Fabian von Feilitzsch from comment #15)
> # Generated by ebtables-save v1.0 on Tue Oct 25 19:34:19 UTC 2016
> *nat
> :PREROUTING ACCEPT
> :OUTPUT ACCEPT
> :POSTROUTING ACCEPT
> :J-vnet0-arp-mac ACCEPT

The issue is due to the presence of this chain ^^^.
I tried on a clean 7.3 beta system with just hosted-engine and I didn't get it. Maybe it's due to another QCI component.

> *filter
> :INPUT ACCEPT
> :FORWARD ACCEPT
> :OUTPUT ACCEPT
So is it expected that the presence of this chain causes this behavior?
Given the error message "The kernel doesn't support the ebtables 'nat' table", could you please share the output of:

1. "rpm -qa | grep ebtables"
2. "lsmod | grep ebtable"
$ rpm -qa | grep ebtables
ebtables-2.0.10-15.el7.x86_64

$ lsmod | grep ebtable
ebtable_filter 12827 0
ebtables 35009 1 ebtable_filter
(In reply to Fabian von Feilitzsch from comment #19)
> $ rpm -qa | grep ebtables
> ebtables-2.0.10-15.el7.x86_64
>
> $ lsmod | grep ebtable
> ebtable_filter 12827 0
> ebtables 35009 1 ebtable_filter

I don't know why, but it seems the ebtable_nat kernel module is not loaded. Could you please provide some more info:

1. uname -a
2. ls -l /usr/lib/modules/*/kernel/net/bridge/netfilter/ebtable_nat.ko
3. modprobe -v ebtable_nat
1. uname -a
Linux mac5254005ed267.example.com 3.10.0-493.el7.x86_64 #1 SMP Tue Aug 16 11:45:26 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux

2. ls -l /usr/lib/modules/*/kernel/net/bridge/netfilter/ebtable_nat.ko
-rw-r--r--. 1 root root 7361 Aug 16 16:15 /usr/lib/modules/3.10.0-493.el7.x86_64/kernel/net/bridge/netfilter/ebtable_nat.ko

3. modprobe -v ebtable_nat
(no output, return code of 0)
Comment 20 is looking in the correct direction, but I wanted to dispel a misconception hinted at in Comment 11: The fact that there is a rule applied to the ebtables "nat" table is unrelated to the connected network being a <forward mode='nat'> network - it just happens that the way ebtables is used to implement the anti-mac-spoofing rules (applied by "<filterref filter='vdsm-no-mac-spoofing'/>" in the guest's <interface>) is via ebtables' "nat" table. This is done regardless of the forward type of the network the guest is connected to. The error message you're seeing is being relayed from the output of the ebtables command by libvirt and, as Gil points out, is due to the ebtable_nat module not being loaded. I'm not familiar enough with the internals of ebtables to know if that module is supposed to auto-load as the result of some event (similar to how the bridge module autoloads when you create a bridge device) or if there must be something explicit somewhere in system config to cause it to load. Definitely libvirt never explicitly loads it, though.
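As a quick sanity check for both halves of the above, the sketch below uses standard libvirt and kmod tools on the host; the filter name vdsm-no-mac-spoofing is taken from the comment above, and nothing here is specific to this bug's eventual fix:

# confirm the nwfilter that carries the anti-mac-spoofing rules is defined in libvirt
virsh nwfilter-list | grep vdsm-no-mac-spoofing
# confirm the kernel module exists on disk and check whether it is currently loaded
modinfo -n ebtable_nat
lsmod | grep ebtable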
Here is the result of ebtables -t nat -L:

Bridge table: nat

Bridge chain: PREROUTING, entries: 0, policy: ACCEPT
Bridge chain: OUTPUT, entries: 0, policy: ACCEPT
Bridge chain: POSTROUTING, entries: 0, policy: ACCEPT

I tried running this command before running the hosted-engine installer, in the hope that it would force the module to load if it weren't already, but it had no effect. I'm not sure why the system sees the nat table outside of the hosted-engine installer but not during it. This is very strange to me.
Also, running /usr/sbin/ebtables --concurrent -t nat -A J-vnet0-mac -s b2:29:9d:7a:32:2f -j RETURN by hand (after the failure was logged) succeeded with a return code of 0.
(In reply to Fabian von Feilitzsch from comment #23)
> Here is the result of ebtables -t nat -L:
>
> Bridge table: nat
>
> Bridge chain: PREROUTING, entries: 0, policy: ACCEPT
> Bridge chain: OUTPUT, entries: 0, policy: ACCEPT
> Bridge chain: POSTROUTING, entries: 0, policy: ACCEPT
>
> I tried running this command before running the hosted-engine installer, in
> the hope that it would force the module to load if it weren't already, but
> it had no effect. I'm not sure why the system sees the nat table outside of
> the hosted-engine installer but not during it. This is very strange to me.

Could you please try the following:

1. modprobe ebtable_nat
2. Confirm you see ebtable_nat loaded by checking the output of "lsmod | grep ebtable_nat"
3. If you see it loaded, run "hosted-engine --deploy" and follow the wizard
1. modprobe ebtable_nat
(no output, return code of 0)

2. Confirm you see ebtable_nat loaded by checking the output of "lsmod | grep ebtable_nat"
ebtable_nat 12807 1
ebtables 35009 2 ebtable_nat,ebtable_filter

3. If you see it loaded, run "hosted-engine --deploy" and follow the wizard
Same result as in the included log file, and the ebtables error message appears in /var/log/messages as usual.
Interestingly, the engine VM eventually came back up (I would guess it was successfully recovered this time) and the host was in an Up state, so the deployment eventually succeeded.
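A side note on the manual modprobe above: it does not persist across reboots. A hedged workaround sketch (not the fix that was eventually adopted in this bug), assuming root on a standard RHEL 7 host where systemd-modules-load reads /etc/modules-load.d/*.conf at boot:

# load the module now
modprobe ebtable_nat
# have systemd-modules-load pull it in at every boot
echo ebtable_nat > /etc/modules-load.d/ebtable_nat.conf
# verify
lsmod | grep ebtable_nat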
I am still able to replicate this with the 7.3 GA. I ran ebtables-save on a host freshly provisioned through Satellite, and it matched exactly the output of ebtables-save on my non-Satellite-managed host. Here is the output:

*nat
:PREROUTING ACCEPT
:OUTPUT ACCEPT
:POSTROUTING ACCEPT
:PREROUTING_direct RETURN
:POSTROUTING_direct RETURN
:OUTPUT_direct RETURN
-A PREROUTING -j PREROUTING_direct
-A OUTPUT -j OUTPUT_direct
-A POSTROUTING -j POSTROUTING_direct
*broute
:BROUTING ACCEPT
:BROUTING_direct RETURN
-A BROUTING -j BROUTING_direct
*filter
:INPUT ACCEPT
:FORWARD ACCEPT
:OUTPUT ACCEPT
:INPUT_direct RETURN
:OUTPUT_direct RETURN
:FORWARD_direct RETURN
-A INPUT -j INPUT_direct
-A FORWARD -j FORWARD_direct
-A OUTPUT -j OUTPUT_direct

I have not been able to replicate this on any host not provisioned with Satellite. We use the Satellite Kickstart Default provisioning template. I'm going to attach the generated kickstart in a private comment; maybe there's something in there I'm overlooking.
(In reply to Fabian von Feilitzsch from comment #28)
> I am still able to replicate this with the 7.3 GA. I ran ebtables-save on a
> host freshly provisioned through Satellite, and it matched exactly the
> output of ebtables-save on my non-Satellite-managed host.

Sorry, to be clear, this is before running the hosted-engine installer. They seem to have the same initial ebtables rules configured prior to running the installer.
I ran into this problem. I decided to run journalctl to look for vdsm entries, because it had clearly shut down: I saw it receive a SIGTERM and shut down in /var/log/vdsm/vdsm.log. Anyway, the last entries concerning vdsm were:

Nov 04 16:34:20 mac525400bb29f6.b.b vdsmd_init_common.sh[23484]: vdsm: Running run_final_hooks
Nov 04 16:34:20 mac525400bb29f6.b.b systemd[1]: Stopped Virtual Desktop Server Manager.
Nov 04 16:34:20 mac525400bb29f6.b.b systemd[1]: Stopping Auxiliary vdsm service for running helper functions as root...
Nov 04 16:34:20 mac525400bb29f6.b.b python[23028]: DIGEST-MD5 client mech dispose
Nov 04 16:34:20 mac525400bb29f6.b.b python[23028]: DIGEST-MD5 common mech dispose
Nov 04 16:34:20 mac525400bb29f6.b.b daemonAdapter[23028]: Traceback (most recent call last):
Nov 04 16:34:20 mac525400bb29f6.b.b daemonAdapter[23028]: File "/usr/lib64/python2.7/multiprocessing/util.py", line 268, in _run_finalizers
Nov 04 16:34:20 mac525400bb29f6.b.b daemonAdapter[23028]: finalizer()
Nov 04 16:34:20 mac525400bb29f6.b.b daemonAdapter[23028]: File "/usr/lib64/python2.7/multiprocessing/util.py", line 201, in __call__
Nov 04 16:34:20 mac525400bb29f6.b.b daemonAdapter[23028]: res = self._callback(*self._args, **self._kwargs)
Nov 04 16:34:20 mac525400bb29f6.b.b daemonAdapter[23028]: OSError: [Errno 2] No such file or directory: '/var/run/vdsm/svdsm.sock'
Nov 04 16:34:20 mac525400bb29f6.b.b systemd[1]: Stopped Auxiliary vdsm service for running helper functions as root.
Nov 04 16:34:18 mac525400bb29f6.b.b systemd[1]: Time has been changed

Note at the end of this there is an entry about the time changing, and it actually occurred before the vdsm messages. It happened at 16:34:18 while the vdsm messages all happened at 16:34:20. So the log is out of order, but it's clear a time change happened a few seconds before vdsm shut down.
(In reply to James Olin Oden from comment #31)
> Nov 04 16:34:18 mac525400bb29f6.b.b systemd[1]: Time has been changed
>
> Note at the end of this there is an entry about the time changing, and it
> actually occurred before the vdsm messages. It happened at 16:34:18 while
> the vdsm messages all happened at 16:34:20. So the log is out of order,
> but it's clear a time change happened a few seconds before vdsm shut down.

I looked into this. I'm not sure what's causing it, but I see it in both successful and failing environments, so I don't think it's causing the issue we see.
I think the issue might be caused by the installation of the @Core package group by the Satellite Default Kickstart.
(In reply to Fabian von Feilitzsch from comment #35)
> I think the issue might be caused by the installation of the @Core package
> group by the Satellite Default Kickstart.

Fabian, do you know why Satellite requires those ebtables rules, or do you think they are there just as a side effect of an existing template? If so, we have to open a bug on Satellite to clean up that template.

Dan, is there something we can do on the VDSM side to handle those rules?
As far as I can understand, it isn't the adding of the ebtables rules that causes the problem - those are *always* there for RHV guests. The problem (at least the error message logged by libvirt) is caused by something preventing the ebtable_nat module from being autoloaded.
(In reply to Simone Tiraboschi from comment #37)
> Fabian, do you know why Satellite requires those ebtables rules, or do you
> think they are there just as a side effect of an existing template? If so,
> we have to open a bug on Satellite to clean up that template.

I don't think there's anything wrong with the template, unless the packages in the @Core package group are not compatible with RHV. Here is the list of packages in @Core:

NetworkManager-1:0.9.9.1-25.git20140326.4dba720.el7_0.i686
NetworkManager-1:1.4.0-12.el7.x86_64
NetworkManager-config-server-1:1.4.0-12.el7.x86_64
NetworkManager-team-1:1.4.0-12.el7.x86_64
NetworkManager-tui-1:1.4.0-12.el7.x86_64
Red_Hat_Enterprise_Linux-Release_Notes-7-en-US-0:7-2.el7.noarch
aic94xx-firmware-0:30-6.el7.noarch
alsa-firmware-0:1.0.28-2.el7.noarch
audit-0:2.6.5-3.el7.x86_64
basesystem-0:10.0-7.el7.noarch
bash-0:4.2.46-20.el7_2.x86_64
biosdevname-0:0.7.2-1.el7.x86_64
btrfs-progs-0:4.4.1-1.el7.i686
btrfs-progs-0:4.4.1-1.el7.x86_64
coreutils-0:8.22-18.el7.x86_64
cronie-0:1.4.11-14.el7_2.1.x86_64
curl-0:7.29.0-35.el7.x86_64
dhclient-12:4.2.5-47.el7.x86_64
dracut-config-generic-0:033-463.el7.x86_64
dracut-config-rescue-0:033-463.el7.x86_64
dracut-fips-0:033-463.el7.x86_64
dracut-fips-aesni-0:033-463.el7.x86_64
dracut-network-0:033-463.el7.x86_64
e2fsprogs-0:1.42.9-9.el7.x86_64
filesystem-0:3.2-21.el7.x86_64
firewalld-0:0.4.3.2-8.el7.noarch
glibc-0:2.17-157.el7.i686
glibc-0:2.17-157.el7.x86_64
hostname-0:3.13-3.el7.x86_64
initscripts-0:9.49.37-1.el7.x86_64
iproute-0:3.10.0-74.el7.x86_64
iprutils-0:2.4.13.1-1.el7.x86_64
iptables-0:1.4.21-17.el7.i686
iptables-0:1.4.21-17.el7.x86_64
iputils-0:20160308-8.el7.x86_64
irqbalance-3:1.0.7-6.el7.x86_64
ivtv-firmware-2:20080701-26.el7.noarch
iwl100-firmware-0:39.31.5.1-49.el7.noarch
iwl1000-firmware-1:39.31.5.1-49.el7.noarch
iwl105-firmware-0:18.168.6.1-49.el7.noarch
iwl135-firmware-0:18.168.6.1-49.el7.noarch
iwl2000-firmware-0:18.168.6.1-49.el7.noarch
iwl2030-firmware-0:18.168.6.1-49.el7.noarch
iwl3160-firmware-0:22.0.7.0-49.el7.noarch
iwl3945-firmware-0:15.32.2.9-49.el7.noarch
iwl4965-firmware-0:228.61.2.24-49.el7.noarch
iwl5000-firmware-0:8.83.5.1_1-49.el7.noarch
iwl5150-firmware-0:8.24.2.2-49.el7.noarch
iwl6000-firmware-0:9.221.4.1-49.el7.noarch
iwl6000g2a-firmware-0:17.168.5.3-49.el7.noarch
iwl6000g2b-firmware-0:17.168.5.2-49.el7.noarch
iwl6050-firmware-0:41.28.5.1-49.el7.noarch
iwl7260-firmware-0:22.0.7.0-49.el7.noarch
iwl7265-firmware-0:22.0.7.0-49.el7.noarch
kbd-0:1.15.5-12.el7.x86_64
kernel-tools-0:3.10.0-514.el7.x86_64
kexec-tools-0:2.0.7-50.el7.x86_64
less-0:458-9.el7.x86_64
libertas-sd8686-firmware-0:20140804-0.1.git6bce2b0.el7_0.noarch
libertas-sd8787-firmware-0:20140804-0.1.git6bce2b0.el7_0.noarch
libertas-usb8388-firmware-2:20140804-0.1.git6bce2b0.el7_0.noarch
libsysfs-0:2.1.0-16.el7.i686
libsysfs-0:2.1.0-16.el7.x86_64
linux-firmware-0:20160830-49.git7534e19.el7.noarch
man-db-0:2.6.3-9.el7.x86_64
microcode_ctl-2:2.1-16.el7.x86_64
ncurses-0:5.9-13.20130511.el7.x86_64
openssh-clients-0:6.6.1p1-31.el7.x86_64
openssh-keycat-0:6.6.1p1-31.el7.x86_64
openssh-server-0:6.6.1p1-31.el7.x86_64
parted-0:3.1-28.el7.i686
parted-0:3.1-28.el7.x86_64
passwd-0:0.79-4.el7.x86_64
plymouth-0:0.8.9-0.26.20140113.el7.x86_64
policycoreutils-0:2.5-8.el7.x86_64
postfix-2:2.10.1-6.el7.x86_64
procps-ng-0:3.3.10-10.el7.i686
procps-ng-0:3.3.10-10.el7.x86_64
rdma-0:7.3_4.7_rc2-5.el7.noarch
redhat-support-tool-0:0.9.8-6.el7.noarch
rootfiles-0:8.1-11.el7.noarch
rpm-0:4.11.3-21.el7.x86_64
rsyslog-0:7.4.7-16.el7.x86_64
selinux-policy-mls-0:3.13.1-102.el7_3.4.noarch
selinux-policy-targeted-0:3.13.1-102.el7_3.4.noarch
setup-0:2.8.71-7.el7.noarch
shadow-utils-2:4.1.5.1-24.el7.x86_64
subscription-manager-0:1.17.15-1.el7.x86_64
sudo-0:1.8.6p7-20.el7.x86_64
systemd-0:219-30.el7_3.3.x86_64
tar-2:1.26-31.el7.x86_64
tboot-1:1.9.4-2.el7.x86_64
tuned-0:2.7.1-3.el7.noarch
util-linux-0:2.23.2-33.el7.i686
util-linux-0:2.23.2-33.el7.x86_64
vim-minimal-2:7.4.160-1.el7.x86_64
xfsprogs-0:4.5.0-8.el7.i686
xfsprogs-0:4.5.0-8.el7.x86_64
yum-0:3.4.3-150.el7.noarch
yum-rhn-plugin-0:2.0.1-6.el7.noarch

I think I've narrowed the failure down to one of these packages (testing in progress):

iptables-0:1.4.21-17.el7.i686
iptables-0:1.4.21-17.el7.x86_64
cronie-0:1.4.11-14.el7_2.1.x86_64
dhclient-12:4.2.5-47.el7.x86_64
firewalld-0:0.4.3.2-8.el7.noarch
iproute-0:3.10.0-74.el7.x86_64
iprutils-0:2.4.13.1-1.el7.x86_64
iputils-0:20160308-8.el7.x86_64
kernel-tools-0:3.10.0-514.el7.x86_64
kexec-tools-0:2.0.7-50.el7.x86_64
policycoreutils-0:2.5-8.el7.x86_64
selinux-policy-mls-0:3.13.1-102.el7_3.4.noarch
selinux-policy-targeted-0:3.13.1-102.el7_3.4.noarch
subscription-manager-0:1.17.15-1.el7.x86_64

Is there anything in there that should not be installed alongside RHV?
Did you try stopping and masking NetworkManager?
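For reference, a minimal sketch of stopping and masking the service with systemd on RHEL 7; masking prevents the unit from being started again, even as a dependency of another unit:

systemctl stop NetworkManager
systemctl mask NetworkManager
# should now report the unit as masked and inactive
systemctl status NetworkManager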
(In reply to Simone Tiraboschi from comment #40)
> Did you try stopping and masking NetworkManager?

We had NetworkManager stopped, but not masked. I just did a test run in the vagrant environment and it seems to have succeeded; going to try the same from QCI.
> We had NetworkManager stopped, but not masked. I just did a test run in the
> vagrant environment and it seems to have succeeded; going to try the same
> from QCI.

Not sure of the cause, but it still fails in QCI. I have pushed an update to the vagrant reproduction environment that can replicate this failure even with NetworkManager masked.
I also opened a related bug on Satellite to see if somebody could help on that front.
Good call. With NetworkManager masked, I can still replicate the failures consistently with Satellite, but I am having trouble consistently replicating it in my vagrant environment.
Would it be possible to see a kickstart that the RHV QE team is using for self-hosted testing? Would like to compare that to what we have.
Created attachment 1220501 [details] /tmp/ovirt-host-deploy-20161114171727-faui4d.log

I was poking around one of my deployments and found a reference to this log of the host deploy process. Not sure if there's anything you'll find useful in there. The error occurs at 17:17:50, then there are no logs until 17:33:19.
We just got access to a bare metal system and it seems like we aren't able to reproduce the issue there. Could this be an issue with RHEL 7.3 and nested virt?
(In reply to Fabian von Feilitzsch from comment #46)
> Would it be possible to see a kickstart that the RHV QE team is using for
> self-hosted testing? Would like to compare that to what we have.

I'm not using a kickstart file to initialize my hosts. Not sure about QE.
(In reply to Fabian von Feilitzsch from comment #48)
> We just got access to a bare metal system and it seems like we aren't able
> to reproduce the issue there. Could this be an issue with RHEL 7.3 and
> nested virt?

I also tried to reproduce this on RHEL 7.3 in a nested environment, but it worked as expected.
(In reply to Fabian von Feilitzsch from comment #46)
> Would it be possible to see a kickstart that the RHV QE team is using for
> self-hosted testing? Would like to compare that to what we have.

oVirt CI also uses lago with ovirt-system-tests; you might have a look at those. Not sure about RHV QE, but you can see some of their stuff here:
https://github.com/rhevm-qe-automation
FYI, I've verified that downgrading the ebtables rpm from ebtables-2.0.10-15.el7.x86_64 to ebtables-2.0.10-13.el7.x86_64 on the RHEL 7.3 host before running hosted-engine --deploy resolves the issue. The "kernel doesn't support the ebtables 'nat' table" error doesn't appear in the system logs, and the deployment completes successfully:

----
[ INFO ] Waiting for the host to become operational in the engine. This may take several minutes...
[ INFO ] Still waiting for VDSM host to become operational...
[ INFO ] The VDSM Host is now operational
[ INFO ] Saving hosted-engine configuration on the shared storage domain
[ INFO ] Shutting down the engine VM
[ INFO ] Enabling and starting HA services
[ INFO ] Stage: Clean up
[ INFO ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20161115214832.conf'
[ INFO ] Generating answer file '/etc/ovirt-hosted-engine/answers.conf'
[ INFO ] Stage: Pre-termination
[ INFO ] Stage: Termination
[ INFO ] Hosted Engine successfully deployed
----
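For anyone wanting to try the same workaround, a hedged sketch of the downgrade, assuming the older ebtables build is still available in an enabled repository:

# roll back to the previous ebtables build
yum downgrade ebtables-2.0.10-13.el7
# confirm the installed version
rpm -q ebtables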
2016-11-16 08:29:00.919+0000: 20933: error : virFirewallApplyRuleDirect:732 : internal error: Failed to apply firewall rules /usr/sbin/ebtables --concurrent -t nat -N J-vnet0-mac: Chain J-vnet0-mac already exists.

I saw this bug on these versions:

# rpm -qa | grep vdsm
vdsm-hook-ethtool-options-4.18.15.3-1.el7ev.noarch
vdsm-xmlrpc-4.18.15.3-1.el7ev.noarch
vdsm-api-4.18.15.3-1.el7ev.noarch
vdsm-yajsonrpc-4.18.15.3-1.el7ev.noarch
vdsm-infra-4.18.15.3-1.el7ev.noarch
vdsm-cli-4.18.15.3-1.el7ev.noarch
vdsm-hook-vmfex-dev-4.18.15.3-1.el7ev.noarch
vdsm-4.18.15.3-1.el7ev.x86_64
vdsm-python-4.18.15.3-1.el7ev.noarch
vdsm-jsonrpc-4.18.15.3-1.el7ev.noarch

# rpm -qa | grep ebtables
ebtables-2.0.10-15.el7.x86_64

# rpm -qa | grep libvirt
libvirt-daemon-config-nwfilter-2.0.0-10.el7.x86_64
libvirt-daemon-2.0.0-10.el7.x86_64
libvirt-daemon-driver-secret-2.0.0-10.el7.x86_64
libvirt-client-2.0.0-10.el7.x86_64
libvirt-daemon-driver-storage-2.0.0-10.el7.x86_64
libvirt-daemon-driver-lxc-2.0.0-10.el7.x86_64
libvirt-2.0.0-10.el7.x86_64
libvirt-python-2.0.0-2.el7.x86_64
libvirt-lock-sanlock-2.0.0-10.el7.x86_64
libvirt-daemon-driver-nwfilter-2.0.0-10.el7.x86_64
libvirt-daemon-config-network-2.0.0-10.el7.x86_64
libvirt-daemon-driver-nodedev-2.0.0-10.el7.x86_64
libvirt-daemon-kvm-2.0.0-10.el7.x86_64
libvirt-daemon-driver-network-2.0.0-10.el7.x86_64
libvirt-daemon-driver-interface-2.0.0-10.el7.x86_64
libvirt-daemon-driver-qemu-2.0.0-10.el7.x86_64

# rpm -qa | grep hosted
ovirt-hosted-engine-ha-2.0.4-1.el7ev.noarch
ovirt-hosted-engine-setup-2.0.3-2.el7ev.noarch
We found a similar issue installing hosted-engine on RHEL 7.3:

2016-11-16 08:29:00.884+0000: 20933: info : virFirewallApplyGroup:895 : Starting transaction for firewall=0x7f2a6c001820 group=0x7f2a6c004050 flags=1
2016-11-16 08:29:00.884+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -D PREROUTING -i vnet0 -j libvirt-J-vnet0'
2016-11-16 08:29:00.888+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -D POSTROUTING -o vnet0 -j libvirt-P-vnet0'
2016-11-16 08:29:00.891+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -L libvirt-J-vnet0'
2016-11-16 08:29:00.893+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -L libvirt-P-vnet0'
2016-11-16 08:29:00.896+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -F libvirt-J-vnet0'
2016-11-16 08:29:00.899+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -X libvirt-J-vnet0'
2016-11-16 08:29:00.900+0000: 20840: info : virEventPollDispatchHandles:506 : EVENT_POLL_DISPATCH_HANDLE: watch=9 events=1
2016-11-16 08:29:00.900+0000: 20840: info : virEventPollRunOnce:640 : EVENT_POLL_RUN: nhandles=12 timeout=-1
2016-11-16 08:29:00.901+0000: 20840: info : virEventPollDispatchHandles:506 : EVENT_POLL_DISPATCH_HANDLE: watch=12 events=1
2016-11-16 08:29:00.901+0000: 20840: info : virEventPollRunOnce:640 : EVENT_POLL_RUN: nhandles=12 timeout=-1
2016-11-16 08:29:00.902+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -F libvirt-P-vnet0'
2016-11-16 08:29:00.905+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -X libvirt-P-vnet0'
2016-11-16 08:29:00.907+0000: 20933: info : virFirewallApplyGroup:895 : Starting transaction for firewall=0x7f2a6c001820 group=0x7f2a6c006a10 flags=0
2016-11-16 08:29:00.907+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -N libvirt-J-vnet0'
2016-11-16 08:29:00.910+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -F J-vnet0-mac'
2016-11-16 08:29:00.913+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -X J-vnet0-mac'
2016-11-16 08:29:00.916+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -N J-vnet0-mac'
2016-11-16 08:29:00.919+0000: 20933: error : virFirewallApplyRuleDirect:732 : internal error: Failed to apply firewall rules /usr/sbin/ebtables --concurrent -t nat -N J-vnet0-mac: Chain J-vnet0-mac already exists.
2016-11-16 08:29:00.919+0000: 20933: info : virFirewallRollbackGroup:915 : Starting rollback for group 0x7f2a6c006a10
2016-11-16 08:29:00.919+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -D PREROUTING -i vnet0 -j libvirt-J-vnet0'
2016-11-16 08:29:00.921+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -D POSTROUTING -o vnet0 -j libvirt-P-vnet0'
2016-11-16 08:29:00.924+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -L libvirt-J-vnet0'
2016-11-16 08:29:00.927+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -L libvirt-P-vnet0'
2016-11-16 08:29:00.930+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -F libvirt-J-vnet0'
2016-11-16 08:29:00.933+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -X libvirt-J-vnet0'
2016-11-16 08:29:00.936+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -F libvirt-P-vnet0'
2016-11-16 08:29:00.938+0000: 20933: info : virFirewallApplyRule:838 : Applying rule '/usr/sbin/ebtables --concurrent -t nat -X libvirt-P-vnet0'

libvirt fails applying an ebtables rule and so starts to roll back. vnet0 goes down, and with it the ovirtmgmt bridge and the engine VM.

Nov 16 10:29:00 puma23 systemd: Starting Virtual Desktop Server Manager network restoration...
Nov 16 10:29:00 puma23 dnsmasq[4773]: read /etc/hosts - 2 addresses
Nov 16 10:29:00 puma23 dnsmasq[4773]: read /var/lib/libvirt/dnsmasq/default.addnhosts - 0 addresses
Nov 16 10:29:00 puma23 dnsmasq-dhcp[4773]: read /var/lib/libvirt/dnsmasq/default.hostsfile
Nov 16 10:29:00 puma23 kernel: type=1400 audit(1479284940.858:26): avc: denied { search } for pid=19894 comm="systemd-machine" name="19893" dev="proc" ino=155551324 scontext=system_u:system_r:systemd_machined_t:s0 tcontext=system_u:system_r:svirt_t:s0:c431,c860 tclass=dir
Nov 16 10:29:01 puma23 kernel: ovirtmgmt: port 2(vnet0) entered disabled state
Nov 16 10:29:01 puma23 kernel: device vnet0 left promiscuous mode
Nov 16 10:29:01 puma23 kernel: ovirtmgmt: port 2(vnet0) entered disabled state
Nov 16 10:29:01 puma23 kvm: 0 guests now active
Nov 16 10:29:01 puma23 systemd-machined: Machine qemu-1-HostedEngine terminated.
Nov 16 10:29:01 puma23 systemd: Started Virtual Desktop Server Manager network restoration.
I've verified that hosted-engine --deploy runs successfully with ebtables-2.0.10-15.el7.x86_64 and the changes in https://gerrit.ovirt.org/#/c/66853 / https://gerrit.ovirt.org/#/c/66891.
(In reply to Tasos Papaioannou from comment #55)
> I've verified that hosted-engine --deploy runs successfully with
> ebtables-2.0.10-15.el7.x86_64 and the changes in
> https://gerrit.ovirt.org/#/c/66853 / https://gerrit.ovirt.org/#/c/66891.

This is basically a workaround, since the issue with ebtables and vdsm-no-mac-spoofing is probably still there, but in the end we don't really need vdsm-no-mac-spoofing for the engine VM. Without vdsm-no-mac-spoofing we don't have that ebtables rule, and so we don't see the issue, at least for hosted-engine-setup.

As far as I saw, the issue wasn't completely systematic, so I don't exclude a race condition between libvirt setting the ebtables rules and the kernel loading the required modules.
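One way to confirm that the engine VM no longer carries the filter once the linked changes are applied, sketched under the assumption that the libvirt domain is named HostedEngine as in the attached logs:

# read-only dump of the running domain XML; with vdsm-no-mac-spoofing dropped,
# no filterref element should appear on the engine VM's interface
virsh -r dumpxml HostedEngine | grep -A 1 filterref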
> I don't exclude a race condition between libvirt setting the ebtables
> rules and the kernel loading the required modules.

If there is a race, then it is self-contained in the ebtables module: ebtables is run to add a rule, that rule requires the ebtable_nat module, and the module is autoloaded in order to satisfy the rule. Aside from that, since the ebtable_nat module is apparently still not loaded even after the failure, I would venture that a race is not the problem; the autoload simply isn't happening for whatever reason.

I'm not familiar enough with the internals of ebtables to know if that module is supposed to auto-load as the result of some event (similar to how the bridge module autoloads when you create a bridge device) or if there must be something explicit somewhere in the system config to cause it to load. libvirt definitely never loads it explicitly, though.
Works for me with these components.

Hosts:
rhevm-appliance-20161116.0-1.el7ev.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
rhevm-appliance-20161116.0-1.el7ev.noarch
ovirt-setup-lib-1.0.2-1.el7ev.noarch
ovirt-imageio-daemon-0.4.0-0.el7ev.noarch
sanlock-3.4.0-1.el7.x86_64
ovirt-vmconsole-1.0.4-1.el7ev.noarch
mom-0.5.8-1.el7ev.noarch
vdsm-4.18.17-1.el7ev.x86_64
ovirt-hosted-engine-setup-2.0.4-2.el7ev.noarch
rhev-release-4.0.6-3-001.noarch
ovirt-host-deploy-1.5.3-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
ovirt-imageio-common-0.3.0-0.el7ev.noarch
libvirt-client-2.0.0-10.el7.x86_64
qemu-kvm-rhev-2.6.0-27.el7.x86_64
ovirt-hosted-engine-ha-2.0.5-1.el7ev.noarch
Linux version 3.10.0-514.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Wed Oct 19 11:24:13 EDT 2016
Linux 3.10.0-514.el7.x86_64 #1 SMP Wed Oct 19 11:24:13 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

Engine:
rhevm-setup-plugins-4.0.0.3-1.el7ev.noarch
rhevm-spice-client-x86-msi-4.0-3.el7ev.noarch
rhevm-dependencies-4.0.0-1.el7ev.noarch
rhev-guest-tools-iso-4.0-6.el7ev.noarch
rhevm-4.0.5.5-0.1.el7ev.noarch
rhevm-spice-client-x64-msi-4.0-3.el7ev.noarch
rhevm-doc-4.0.5-1.el7ev.noarch
rhevm-branding-rhev-4.0.0-5.el7ev.noarch
rhevm-guest-agent-common-1.0.12-3.el7ev.noarch
Linux version 3.10.0-514.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Wed Oct 19 11:24:13 EDT 2016
Linux 3.10.0-514.el7.x86_64 #1 SMP Wed Oct 19 11:24:13 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

I've successfully deployed rhevm-appliance-20161116.0-1.el7ev.noarch over NFS on the host.