Bug 1463218

Summary: [downstream clone - 4.1.7] Adding rhvh-4.1-20170417.0 to engine failed with bond(active+backup) configured by cockpit
Product: Red Hat Enterprise Virtualization Manager Reporter: rhev-integ
Component: vdsm Assignee: Edward Haas <edwardh>
Status: CLOSED ERRATA QA Contact: dguo
Severity: urgent Docs Contact:
Priority: high    
Version: 4.1.0 CC: bazulay, bugs, cshao, danken, dfediuck, dguo, edwardh, eedri, fgiudici, huzhao, jiawu, leiwang, lsurette, mburman, pbrilla, qiyuan, rbarry, rhev-integ, sbonazzo, srevivo, weiwang, yaniwang, ycui, ykaul, ylavi, yzhao
Target Milestone: ovirt-4.1.7 Keywords: Regression, TestBlocker, ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1443347 Environment:
Last Closed: 2017-11-07 17:29:21 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Network RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1443347, 1472965    
Bug Blocks: 1491561    
Attachments:
Description | Flags
All the logs including engine.log vdsm.log host-deploy.log ifcfg-files | none
creating bond on cockpit | none
New logs where bond mac is not specified | none
/var/log/* | none
NM log in debug1 | none

Description rhev-integ 2017-06-20 11:49:31 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1443347 +++
======================================================================

Description of problem:
Adding rhvh-4.1-20170417.0 to engine failed with bond(active+backup) configured by cockpit

Version-Release number of selected component (if applicable):
Red Hat Virtualization Manager Version: 4.1.1.8-0.1.el7
redhat-virtualization-host-4.1-20170417.0.x86_64
imgbased-0.9.23-0.1.el7ev.noarch
vdsm-4.19.10.1-1.el7ev.x86_64
cockpit-ovirt-dashboard-0.10.7-0.0.17.el7ev.noarch
cockpit-system-135-4.el7.noarch

How reproducible:
100%

Steps to Reproduce:
1. Install a rhvh4.1
2. Configure bond0(active+backup) via cockpit on rhvh4.1
3. Add this host to engine4.1

Actual results:
1. After step #3, adding the host failed. During the installation, the IP address changed, and after the failure the IP address disappeared.

Expected results:
1. After step#3, the host can be added successfully

Additional info:
1. Regression, since there was no such issue in the previous build
2. Also tested with (vlan over bond) configured by cockpit; adding failed as well
3. A bond configured manually through ifcfg files can be added successfully

(Originally by Daijie Guo)

Comment 1 rhev-integ 2017-06-20 11:49:45 UTC
Created attachment 1272497 [details]
engine log

(Originally by Daijie Guo)

Comment 3 rhev-integ 2017-06-20 11:49:54 UTC
Is this the same version of rhvm?

Can you grab the engine log and the generated ifcfg files from the previous RHVH build and this one?

The problem is either engine or platform cockpit, but this information is needed for root cause analysis

(Originally by Ryan Barry)

Comment 4 rhev-integ 2017-06-20 11:50:03 UTC
Created attachment 1272500 [details]
network-scripts

(Originally by Daijie Guo)

Comment 5 rhev-integ 2017-06-20 11:50:12 UTC
Created attachment 1272501 [details]
vdsm logs and deploy log

(Originally by Daijie Guo)

Comment 6 rhev-integ 2017-06-20 11:50:21 UTC
NOTE for Regression & TestBlocker:
No such issue on the previous version (redhat-virtualization-host-4.1-20170413.), and this bug will block the "add RHVH to engine with bond configured" test scenario.

(Originally by Chen Shao)

Comment 7 rhev-integ 2017-06-20 11:50:31 UTC
Same engine version? This is critical.

Absolutely nothing changed in RHVH which would affect this (in general, but especially from 0413 to 0417). If the interface comes up properly in Cockpit, I'd also expect engine to work.

(Originally by Ryan Barry)

Comment 8 rhev-integ 2017-06-20 11:50:40 UTC
(In reply to Ryan Barry from comment #6)

Ryan,

> Same engine version? This is critical.
Yes, same version rhvm-4.1.1.8-0.1.el7

> 
> Absolutely nothing changed in RHVH which would affect this (in general, but
> especially from 0413 to 0417). If the interface comes up properly in
> Cockpit, I'd also expect engine to work.

I should correct the RHVH version in comment 5 to rhvh-4.1-20170403, not 0413. Because 0413 was a respin, I did not test this bond scenario there, so the issue was first found in 0417.

It should be noted that there is a big change between 0403 and 0417: the cockpit version.

On rhvh-4.1-20170403:
cockpit-shell-126-1.el7.noarch
cockpit-ovirt-dashboard-0.10.7-0.0.16.el7ev.noarch

On rhvh-4.1-20170417:
cockpit-ovirt-dashboard-0.10.7-0.0.17.el7ev.noarch
cockpit-system-135-4.el7.noarch

There is also a bug fix for a network issue in Cockpit 132, which might affect this:
https://bugzilla.redhat.com/show_bug.cgi?id=1395108
https://bugzilla.redhat.com/show_bug.cgi?id=1420708

(Originally by Daijie Guo)

Comment 9 rhev-integ 2017-06-20 11:50:47 UTC
Created attachment 1272520 [details]
network-scripts in previous build 0403

(Originally by Daijie Guo)

Comment 10 rhev-integ 2017-06-20 11:50:56 UTC
(In reply to dguo from comment #7)
> https://bugzilla.redhat.com/show_bug.cgi?id=1395108
> https://bugzilla.redhat.com/show_bug.cgi?id=1420708

Since this actually works until an attempt to register to engine is made, I expect that Cockpit is actually working here, and the problem is some confusion in the ifcfg scripts, but I'm looking

(Originally by Ryan Barry)

Comment 11 rhev-integ 2017-06-20 11:51:05 UTC
It appears that host-deploy is not adding the vlan to ovirtmgmt.

This makes comparison difficult, though, since the previous ifcfg scripts do not contain a VLAN config. Can you please attach new ifcfgs with a matching config?

If "network-scripts.after_add" is without a vlan (ifcfg-bond0 has no vlan config here), then the attachment is more confusing, since before_add has a vlan...

(Originally by Ryan Barry)

Comment 12 rhev-integ 2017-06-20 11:51:13 UTC
engine.log and vdsm.log both have messages about SSL handshake errors rather than 'no route to host', so networking is probably up.

Can you please provide the following:

Configure a system with a bond OR bond+vlan, but keep the configuration the same:

ifcfg files 0403 before and after add
ifcfg files 0417 before and after add

host-deploy, vdsm, and engine logs from the failed addition

(Originally by Ryan Barry)

Comment 13 rhev-integ 2017-06-20 11:51:20 UTC
Created attachment 1272833 [details]
vdsm.log, hosted-engine.log, ifcfg files

(Originally by Yihui Zhao)

Comment 14 rhev-integ 2017-06-20 11:51:29 UTC
Deploying the HE over a bond (bond+vlan) failed; the bond's IP changed during the deployment.

Uploaded the vdsm.log, hosted-engine.log, and ifcfg files (before setting up bond0, after setting up bond0, and after the HE deployment failed).

Attachment : https://bugzilla.redhat.com/attachment.cgi?id=1272833

(Originally by Yihui Zhao)

Comment 15 rhev-integ 2017-06-20 11:51:38 UTC
(In reply to Yihui Zhao from comment #13)
> Deploy the HE with bond(bond+vlan) during the bond's ip changed.
> 
> Upload the vdsm.log , hosted-engine.log, ifcfg files(before setup bond0),
> ifcfg files(setup bond0), ifcfg files(deploy HE failed).
> 
> Attachment : https://bugzilla.redhat.com/attachment.cgi?id=1272833

So, the bug will also block HE testing (HE with bond or bond+vlan).

(Originally by Yihui Zhao)

Comment 16 rhev-integ 2017-06-20 11:51:45 UTC
Created attachment 1272840 [details]
All files of 0417

(Originally by Daijie Guo)

Comment 17 rhev-integ 2017-06-20 11:51:53 UTC
Created attachment 1272841 [details]
All files of 0403

(Originally by Daijie Guo)

Comment 18 rhev-integ 2017-06-20 11:52:00 UTC
(In reply to Ryan Barry from comment #11)
> engine.log and vdsm.log both have messages about SSL handshake errors rather
> than 'no route to host', so networking is probably up.
> 
> Can you please provide the following:
> 
> Configure a system with a bond OR bond+vlan, but keep the configuration the
> same:
> 
> ifcfg files 0403 before and after add
> ifcfg files 0417 before and after add
> 
> host-deploy, vdsm, and engine logs from the failed addition

Ryan, I have attached all the required files, sorted into 0403 and 0417.

(Originally by Daijie Guo)

Comment 19 rhev-integ 2017-06-20 11:52:11 UTC
From all the tests done on 0417, we observed the following:
1. Create bond0 over em1 + em2 (em1 was set as the master slave). bond0 got em2's MAC, and its IP was 10.73.131.184.
2. Add the host over bond0. During the installation, bond0's MAC changed to em1's, and its IP became 10.73.131.65.
3. After adding failed, bond0's IP disappeared.

But in the tests done on 0403:
1. bond0 got the MAC of em1 (master), and its IP was 10.73.131.65.
2. Add the host over bond0. The MAC did not change, and the IP was always 10.73.131.65.

(Originally by Daijie Guo)

Comment 21 rhev-integ 2017-06-20 11:52:27 UTC
Reassigning to vdsm for tracking.

The cause of this seems to be a known problem with NM/cockpit changing IPs if the active MAC changes. There are workarounds for this.

(Originally by Ryan Barry)

Comment 22 rhev-integ 2017-06-20 11:52:36 UTC
The proposed patch (https://gerrit.ovirt.org/77933) should be suitable for RHVH, as VDSM is already installed on it together with the NM configuration file.

Note that the NM configuration that enables adding slaves to a bond in the order of the slave names (same as the initscripts order) will be available in RHEL 7.4, with NM version 1.8.
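
To illustrate why the enslaving order matters, here is a minimal sketch. It assumes the usual bonding behaviour where the bond inherits the MAC of the first device enslaved to it; the hardware addresses and the helper name bond_mac are hypothetical and only serve the illustration.

```python
def bond_mac(enslave_order, hw_addrs):
    """Return the MAC a bond would inherit, assuming it takes the
    address of the first enslaved device."""
    return hw_addrs[enslave_order[0]]


# Hypothetical hardware addresses, not taken from the reported host.
hw_addrs = {"em1": "00:00:5e:00:53:01", "em2": "00:00:5e:00:53:02"}

by_name = sorted(hw_addrs)        # name order, as initscripts would use: em1, em2
other_order = ["em2", "em1"]      # an order in which em2 happens to come first

mac_by_name = bond_mac(by_name, hw_addrs)
mac_other = bond_mac(other_order, hw_addrs)
```

With DHCP, a different bond MAC can lead to a different lease, which would match the IP change reported in comment 19.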

(Originally by edwardh)

Comment 24 rhev-integ 2017-07-07 12:24:41 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Tag 'v4.19.21' doesn't contain patch 'https://gerrit.ovirt.org/78362']
gitweb: https://gerrit.ovirt.org/gitweb?p=vdsm.git;a=shortlog;h=refs/tags/v4.19.21

For more info please contact: rhv-devops

Comment 25 Edward Haas 2017-07-08 12:42:21 UTC
(In reply to rhev-integ from comment #24)
> INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following
> reason:
> 
> [Tag 'v4.19.21' doesn't contain patch 'https://gerrit.ovirt.org/78362']
> gitweb:
> https://gerrit.ovirt.org/gitweb?p=vdsm.git;a=shortlog;h=refs/tags/v4.19.21
> 
> For more info please contact: rhv-devops

But it does contain it. (commit 168ebb7)

Comment 28 dguo 2017-07-18 03:34:03 UTC
Failed ON-QA on the latest rhvh build

Test version:
Red Hat Virtualization Manager Version: 4.1.4.1-0.1.el7
redhat-virtualization-host-4.1-20170714.1
vdsm-4.19.22-1.el7ev.x86_64
imgbased-0.9.33-0.1.el7ev.noarch
cockpit-ovirt-dashboard-0.10.7-0.0.21.el7ev.noarch
cockpit-ws-141-2.el7.x86_64

Test step:
1. Install a rhvh4.1
2. Configure bond0(active+backup) via cockpit on rhvh4.1
3. Add this host to engine4.1

Actual results:
1. After step #3, adding the host failed. During the installation, the IP address changed, and after the failure the IP address disappeared.

Expected results:
1. After step#3, the host can be added successfully


Additional info:
1. Please see the logs in the new attachment

Comment 29 dguo 2017-07-18 03:35:11 UTC
Created attachment 1300219 [details]
All the logs including engine.log vdsm.log host-deploy.log ifcfg-files

Comment 30 Edward Haas 2017-07-18 06:17:17 UTC
The bond0 interface, as described in its ifcfg file before VDSM takes over, has a MAC address statically set: MACADDR=08:9E:01:63:2C:B3
VDSM does not support such a configuration; it expects NM to automatically select the MAC address per the name order.

Please advise who is setting this. Is it cockpit? If so, a BZ should be opened against it.
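
A quick way to spot this condition in collected ifcfg files could look like the sketch below. It is only an illustration: the helper name static_mac is hypothetical, and the sample content is made up except for the MACADDR line quoted above.

```python
def static_mac(ifcfg_text):
    """Return the MACADDR value from ifcfg-style KEY=VALUE text, or None."""
    for line in ifcfg_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "MACADDR":
            return value.strip().strip("\"'") or None
    return None


# Hypothetical ifcfg-bond0 content; only the MACADDR line comes from this bug.
sample = """DEVICE=bond0
TYPE=Bond
BONDING_OPTS="mode=active-backup"
MACADDR=08:9E:01:63:2C:B3
BOOTPROTO=dhcp
"""

found = static_mac(sample)
```

A non-None result means the bond has a statically set MAC, which is the configuration described above as unsupported by VDSM.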

Comment 31 dguo 2017-07-18 08:40:03 UTC
(In reply to Edward Haas from comment #30)
> The bond0 interface as described in its ifcfg file, before VDSM takes over,
> has an mac address statically set: MACADDR=08:9E:01:63:2C:B3
> VDSM does not support such a configuration, it expects NM to automatically
> select the mac address per the name order.
> 
> Please advice who is setting this... Is it cockpit? If so, a BZ should be
> opened against it.

There is a new "MAC input box" when adding a bond in cockpit, as you can see in the attached picture.

I tried to create the bond in two different ways:
1. Specify the MAC address using the existing em2 MAC; the bond gets em2's IP.
2. Do not specify the MAC address and leave the "MAC input box" blank.

For #1, adding the host to the engine over this bond failed, as you pointed out.
For #2, it also failed; I will attach the new logs.

Comment 32 dguo 2017-07-18 08:41:05 UTC
Created attachment 1300324 [details]
creating bond on cockpit

Comment 33 dguo 2017-07-18 08:42:27 UTC
Created attachment 1300325 [details]
New logs where bond mac is not specified

Comment 34 Edward Haas 2017-07-18 09:15:18 UTC
(In reply to dguo from comment #31)
> 
> I try to create the bond with two different way:
> 1. Specify the mac address with the existing em2 mac, the bond will get the
> em2's ip
> 2. Do not specify the mac address and leave the "MAC input box" blank.
> 
> For #1, Failed to add the host to engine with this bond, as you pointed.
> And for #2, it also failed, I will attach the new logs.

Thanks for the input.
I see the IP has changed, but I cannot see which MAC address it changed to.
Could you please post the "ip link" output before the 120sec timeout is reached?

Comment 35 dguo 2017-07-19 06:48:59 UTC
(In reply to Edward Haas from comment #34)
> (In reply to dguo from comment #31)
> > 
> > I try to create the bond with two different way:
> > 1. Specify the mac address with the existing em2 mac, the bond will get the
> > em2's ip
> > 2. Do not specify the mac address and leave the "MAC input box" blank.
> > 
> > For #1, Failed to add the host to engine with this bond, as you pointed.
> > And for #2, it also failed, I will attach the new logs.
> 
> Thanks for the input.
> I see the IP has changed, but I cannot see to what mac address it changed to.
> Could you please post the "ip link" output before the 120sec timeout is
> reached?

Below is the output you requested:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: p1p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 256
    link/ether 00:c0:dd:20:13:e8 brd ff:ff:ff:ff:ff:ff
3: em1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP qlen 1000
    link/ether de:27:be:a6:a6:2d brd ff:ff:ff:ff:ff:ff
4: em2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP qlen 1000
    link/ether de:27:be:a6:a6:2d brd ff:ff:ff:ff:ff:ff
5: p3p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:1b:21:a6:3d:7a brd ff:ff:ff:ff:ff:ff
6: p3p2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:1b:21:a6:3d:7b brd ff:ff:ff:ff:ff:ff
7: p2p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:1b:21:a6:64:6c brd ff:ff:ff:ff:ff:ff
8: p2p2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN qlen 1000
    link/ether 00:1b:21:a6:64:6d brd ff:ff:ff:ff:ff:ff
24: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovirtmgmt state UP qlen 1000
    link/ether de:27:be:a6:a6:2d brd ff:ff:ff:ff:ff:ff
25: ;vdsmdummy;: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether c2:e5:c8:c3:c0:4a brd ff:ff:ff:ff:ff:ff
26: ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether de:27:be:a6:a6:2d brd ff:ff:ff:ff:ff:ff
    inet 10.73.75.189/22 brd 10.73.75.255 scope global dynamic ovirtmgmt
       valid_lft 43190sec preferred_lft 43190sec
    inet6 2620:52:0:4948:dc27:beff:fea6:a62d/64 scope global mngtmpaddr dynamic 
       valid_lft 2591990sec preferred_lft 604790sec
    inet6 fe80::dc27:beff:fea6:a62d/64 scope link 
       valid_lft forever preferred_lft forever
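
For anyone comparing such outputs across runs, the following sketch extracts the MAC per interface from text in the format shown above. The regular expressions and the helper name macs_by_interface are assumptions made for this illustration; the sample is an excerpt of the output in this comment.

```python
import re

# "24: bond0: <BROADCAST,..." style header lines and their link/ether lines.
HEADER = re.compile(r"^\d+:\s+([^:]+):\s+<")
ETHER = re.compile(r"^\s+link/ether\s+([0-9a-f:]{17})")


def macs_by_interface(ip_output):
    """Map interface name to MAC address for every link/ether entry."""
    macs, current = {}, None
    for line in ip_output.splitlines():
        header = HEADER.match(line)
        if header:
            current = header.group(1)
            continue
        ether = ETHER.match(line)
        if ether and current:
            macs[current] = ether.group(1)
    return macs


sample = """\
2: p1p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 256
    link/ether 00:c0:dd:20:13:e8 brd ff:ff:ff:ff:ff:ff
3: em1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP qlen 1000
    link/ether de:27:be:a6:a6:2d brd ff:ff:ff:ff:ff:ff
4: em2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP qlen 1000
    link/ether de:27:be:a6:a6:2d brd ff:ff:ff:ff:ff:ff
24: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovirtmgmt state UP qlen 1000
    link/ether de:27:be:a6:a6:2d brd ff:ff:ff:ff:ff:ff
"""

macs = macs_by_interface(sample)
# Interfaces that currently report the same MAC as bond0.
sharing = sorted(name for name, mac in macs.items() if mac == macs["bond0"])
```

Note that the slaves report the bond's current MAC here, so this output alone does not reveal the original hardware addresses of em1 and em2.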

Comment 36 Yihui Zhao 2017-07-19 10:02:27 UTC
Deploying HostedEngine also failed with bond+vlan (active+backup) configured by cockpit, because the IP of a NIC such as "bond0.20" changed.

Test version:
rhvh-4.1-0.20170714.0+1
vdsm-4.19.22-1.el7ev.x86_64
cockpit-ovirt-dashboard-0.10.7-0.0.21.el7ev.noarch
ovirt-hosted-engine-setup-2.1.3.4-1.el7ev.noarch
imgbased-0.9.33-0.1.el7ev.noarch
rhvm-appliance-4.1.20170709.3-1.el7.noarch


How to reproduce:
1. Install RHVH4.1
2. Configure network (bond+vlan) by cockpit
3. Deploy HostedEngine



Actual results:
1. After step 3, deploying failed. During the deployment, the IP address changed.

Expected results:
1. After step3, the host can deploy HostedEngine successfully

Additional info:
/var/log/* in the attachment.

Comment 37 Yihui Zhao 2017-07-19 10:03:13 UTC
Created attachment 1300958 [details]
/var/log/*

Comment 38 Edward Haas 2017-07-19 11:59:25 UTC
It looks like the mac assigned to the bond (de:27:be:a6:a6:2d) is not one of the nics original macs.
@mburman also has seen the same scenario in TLV lab this morning.

We currently suspect NM of overwriting the bond mac, although it does not manage it.

Waiting for some insights from NM team on this.
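
One check that is consistent with this observation: the address de:27:be:a6:a6:2d has the locally administered bit set (0x02 in the first octet), which vendor-assigned NIC addresses normally do not have. A minimal sketch of that check follows; the helper name is hypothetical, and the result says nothing about which component generated the address.

```python
def is_locally_administered(mac):
    """True if the locally administered bit (0x02 of the first octet) is set."""
    first_octet = int(mac.split(":")[0], 16)
    return bool(first_octet & 0x02)


bond_now = is_locally_administered("de:27:be:a6:a6:2d")    # bond0 in comment 35
static_cfg = is_locally_administered("08:9E:01:63:2C:B3")  # MACADDR in comment 30
```

Here bond_now is True and static_cfg is False, which fits a generated address rather than one taken from a physical NIC.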

Comment 39 Michael Burman 2017-07-19 13:08:07 UTC
Created attachment 1301090 [details]
NM log in debug1

Comment 40 Francesco Giudici 2017-07-19 17:15:55 UTC
Isolated the issue that is probably the root cause of this (thanks to Michael for sharing the setup and machine).
If a NetworkManager bond connection which is already up is brought up again, the MAC address will be reset to the "fake" one the bond interface had before enslaving any interface.

Tracked this with trace level logs here:
https://bugzilla.redhat.com/show_bug.cgi?id=1472965

Comment 41 Pavol Brilla 2017-07-24 06:29:18 UTC
changing summary to reflect re-target of the bug

Comment 44 Yaniv Kaul 2017-10-15 08:12:22 UTC
This is tracked for 7.4.z in bug 1490741 which is already VERIFIED.
Do we have anything specific to do here, besides wait for it to be released?

Comment 45 Michael Burman 2017-10-15 08:37:02 UTC
In order to test this report properly we still need:

1) rhv-h + cockpit build with NetworkManager-1.8.0-10.el7_4.x86_64 included. 
2) In order to consume a bond with the MACADDR= key, we need to wait for BZ 1422430. If a cloned MAC address is specified via cockpit, vdsm will currently fail to consume it.

Comment 46 Edward Haas 2017-10-15 09:22:54 UTC
(In reply to Yaniv Kaul from comment #44)
> This is tracked for 7.4.z in bug 1490741 which is already VERIFIED.
> Do we have anything specific to do here, besides wait for it to be released?

The release of the NM fix is expected on the 17th of Oct which should include NetworkManager-1.8.0-10.el7_4.

The first RHV-H image that includes it should resolve this BZ.
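
To check whether a given image already carries that build, something like the sketch below could be applied to the output of "rpm -q NetworkManager". It is a naive comparison of the numeric version and release only, not a full RPM version comparison, and it assumes 1.8.0-10 is the minimum as stated above; the function names are hypothetical.

```python
import re


def nm_version_release(nvr):
    """Extract ((major, minor, ...), release) from a string such as
    'NetworkManager-1.8.0-11.el7_4.x86_64'."""
    match = re.match(r"NetworkManager-(\d+(?:\.\d+)*)-(\d+)", nvr)
    if not match:
        raise ValueError("unrecognized package string: %r" % nvr)
    version = tuple(int(part) for part in match.group(1).split("."))
    return version, int(match.group(2))


# Assumed minimum, taken from the build named in this comment.
MINIMUM = nm_version_release("NetworkManager-1.8.0-10.el7_4")


def has_nm_fix(nvr):
    """Naive check: numeric version and release are at least the minimum."""
    return nm_version_release(nvr) >= MINIMUM


ok = has_nm_fix("NetworkManager-1.8.0-11.el7_4.x86_64")
```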

Comment 48 dguo 2017-10-25 08:53:09 UTC
Tested two scenarios on the latest rhvh-4.1-0.20171024.0, based on comment 31.

Test version:
[root@localhost ~]# rpm -q redhat-release-virtualization-host
redhat-release-virtualization-host-4.1-7.0.el7.x86_64
[root@localhost ~]# imgbase w
You are on rhvh-4.1-0.20171024.0+1
[root@localhost ~]# rpm -q NetworkManager
NetworkManager-1.8.0-11.el7_4.x86_64
[root@localhost ~]# rpm -q vdsm
vdsm-4.19.35-1.el7ev.x86_64

Test scenario:
1. Do not specify a MAC address while configuring the bond in cockpit

Adding rhvh to rhvm over this bond succeeded.

2. Specify the bond MAC address using the existing em2 MAC
Adding to rhvm failed, and the bond's IP still disappears after the failure.

It seems that we still need to wait for BZ 1422430 from comment 45.

Comment 49 Dan Kenigsberg 2017-10-25 21:38:48 UTC
Please do not fail this bug due to scenario 2, it should track only scenario 1.

We know well that scenario 2 fails. We have bug 1422430 tracking it, and we don't need a second bug for that. The most important bit is that *finally*, we can use cockpit to define a dhcp address over a bond, and add the host successfully to ovirt-engine. In my opinion, this merits a VERIFIED status, please reconsider.

Comment 50 dguo 2017-10-26 02:39:30 UTC
(In reply to Dan Kenigsberg from comment #49)
> Please do not fail this bug due to scenario 2, it should track only scenario
> 1.
> 
> We know well that scenario 2 fails. We have bug 1422430 tracking it, and we
> don't need a second bug for that. The most important bit is that *finally*,
> we can use cockpit to define a dhcp address over a bond, and add the host
> successfully to ovirt-engine. In my opinion, this merits a VERIFIED status,
> please reconsider.

Thanks for the clarification, I agree with that. 
From the customer's perspective, we also need to clarify that scenario #2 is not currently supported.

Comment 52 errata-xmlrpc 2017-11-07 17:29:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3139