Bug 1463218 - [downstream clone - 4.1.5] Adding rhvh-4.1-20170417.0 to engine failed with bond(active+backup) configured by cockpit
Status: ASSIGNED
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ovirt-4.1.6
Target Release: ---
Assigned To: Edward Haas
QA Contact: dguo
Keywords: Regression, TestBlocker, ZStream
Depends On: 1443347 1472965
Blocks:

Reported: 2017-06-20 07:49 EDT by rhev-integ
Modified: 2017-08-09 04:06 EDT
CC: 26 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1443347
Environment:
Last Closed:
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Network
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
All the logs including engine.log vdsm.log host-deploy.log ifcfg-files (118.48 KB, application/x-gzip)
2017-07-17 23:35 EDT, dguo
creating bond on cockpit (154.27 KB, image/png)
2017-07-18 04:41 EDT, dguo
New logs where bond mac is not specified (55.69 KB, application/x-gzip)
2017-07-18 04:42 EDT, dguo
/var/log/* (276.21 KB, application/x-bzip)
2017-07-19 06:03 EDT, Yihui Zhao
NM log in debug1 (382.57 KB, text/plain)
2017-07-19 09:08 EDT, Michael Burman


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 77933 master MERGED net,static: Configure NM with slaves-order=name 2017-06-20 07:52 EDT
oVirt gerrit 78362 ovirt-4.1 MERGED net,static: Configure NM with slaves-order=name 2017-06-23 01:06 EDT

Description rhev-integ 2017-06-20 07:49:31 EDT
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1443347 +++
======================================================================

Description of problem:
Adding rhvh-4.1-20170417.0 to engine failed with bond(active+backup) configured by cockpit

Version-Release number of selected component (if applicable):
Red Hat Virtualization Manager Version: 4.1.1.8-0.1.el7
redhat-virtualization-host-4.1-20170417.0.x86_64
imgbased-0.9.23-0.1.el7ev.noarch
vdsm-4.19.10.1-1.el7ev.x86_64
cockpit-ovirt-dashboard-0.10.7-0.0.17.el7ev.noarch
cockpit-system-135-4.el7.noarch

How reproducible:
100%

Steps to Reproduce:
1. Install a rhvh4.1
2. Configure bond0(active+backup) via cockpit on rhvh4.1
3. Add this host to engine4.1

Actual results:
1. After step #3, adding the host failed. During the installation process the IP address changed, and after the failed addition the IP address disappeared.

Expected results:
1. After step #3, the host should be added successfully.

Additional info:
1. Regression, since there was no such issue in the previous build.
2. Also tested with VLAN over bond configured by cockpit; adding failed as well.
3. A bond configured manually via ifcfg files can be added successfully (see the sketch after this comment).

(Originally by Daijie Guo)
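For item 3, a minimal sketch of the kind of manual ifcfg configuration that works; only the active-backup mode and the device names come from this report, the remaining lines are illustrative assumptions:

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
TYPE=Bond
BONDING_OPTS="mode=active-backup miimon=100"
BOOTPROTO=dhcp
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-em1 (ifcfg-em2 is the same apart from DEVICE)
DEVICE=em1
MASTER=bond0
SLAVE=yes
ONBOOT=yes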
Comment 1 rhev-integ 2017-06-20 07:49:45 EDT
Created attachment 1272497 [details]
engine log

(Originally by Daijie Guo)
Comment 3 rhev-integ 2017-06-20 07:49:54 EDT
Is this the same version of rhvm?

Can you grab the engine log and the generated ifcfg files from the previous RHVH build and this one?

The problem is either in the engine or in platform Cockpit, but this information is needed for root cause analysis.

(Originally by Ryan Barry)
Comment 4 rhev-integ 2017-06-20 07:50:03 EDT
Created attachment 1272500 [details]
network-scripts

(Originally by Daijie Guo)
Comment 5 rhev-integ 2017-06-20 07:50:12 EDT
Created attachment 1272501 [details]
vdsm logs and deploy log

(Originally by Daijie Guo)
Comment 6 rhev-integ 2017-06-20 07:50:21 EDT
NOTE for Regression & Testblocker:
No such issue on the previous version (redhat-virtualization-host-4.1-20170413.), and this bug will block the "add RHVH to engine with bond configured" test scenario.

(Originally by Chen Shao)
Comment 7 rhev-integ 2017-06-20 07:50:31 EDT
Same engine version? This is critical.

Absolutely nothing changed in RHVH which would affect this (in general, but especially from 0413 to 0417). If the interface comes up properly in Cockpit, I'd also expect engine to work.

(Originally by Ryan Barry)
Comment 8 rhev-integ 2017-06-20 07:50:40 EDT
(In reply to Ryan Barry from comment #6)

Ryan,

> Same engine version? This is critical.
Yes, same version rhvm-4.1.1.8-0.1.el7

> 
> Absolutely nothing changed in RHVH which would affect this (in general, but
> especially from 0413 to 0417). If the interface comes up properly in
> Cockpit, I'd also expect engine to work.

I should correct the RHVH version in comment 5 to rhvh-4.1-20170403, not 0413. Because of the respin in 0413 I did not test this bond scenario there, so the issue was finally found in 0417.

It should be noted that there is a big change between 0403 and 0417: the cockpit version.

On rhvh-4.1-20170403:
cockpit-shell-126-1.el7.noarch
cockpit-ovirt-dashboard-0.10.7-0.0.16.el7ev.noarch

On rhvh-4.1-20170417:
cockpit-ovirt-dashboard-0.10.7-0.0.17.el7ev.noarch
cockpit-system-135-4.el7.noarch

and there is a bug fix for a network issue in Cockpit 132, which might be related:
https://bugzilla.redhat.com/show_bug.cgi?id=1395108
https://bugzilla.redhat.com/show_bug.cgi?id=1420708

(Originally by Daijie Guo)
Comment 9 rhev-integ 2017-06-20 07:50:47 EDT
Created attachment 1272520 [details]
network-scripts in previous build 0403

(Originally by Daijie Guo)
Comment 10 rhev-integ 2017-06-20 07:50:56 EDT
(In reply to dguo from comment #7)
> https://bugzilla.redhat.com/show_bug.cgi?id=1395108
> https://bugzilla.redhat.com/show_bug.cgi?id=1420708

Since this actually works until an attempt to register to the engine is made, I expect that Cockpit is actually working here and the problem is some confusion in the ifcfg scripts, but I'm still looking.

(Originally by Ryan Barry)
Comment 11 rhev-integ 2017-06-20 07:51:05 EDT
It appears that host-deploy is not adding the vlan to ovirtmgmt.

This makes comparison difficult, though, since the previous ifcfg scripts do not contain a VLAN config. Can you please attach new ifcfgs with a matching config?

If "network-scripts.after_add" is without a vlan (ifcfg-bond0 has no vlan config here), then the attachment is more confusing, since before_add has a vlan...

(Originally by Ryan Barry)
Comment 12 rhev-integ 2017-06-20 07:51:13 EDT
engine.log and vdsm.log both have messages about SSL handshake errors rather than 'no route to host', so networking is probably up.

Can you please provide the following:

Configure a system with a bond OR bond+vlan, but keep the configuration the same:

ifcfg files 0403 before and after add
ifcfg files 0417 before and after add

host-deploy, vdsm, and engine logs from the failed addition

(Originally by Ryan Barry)
Comment 13 rhev-integ 2017-06-20 07:51:20 EDT
Created attachment 1272833 [details]
vdsm.log, hosted-engine.log, ifcfg files

(Originally by Yihui Zhao)
Comment 14 rhev-integ 2017-06-20 07:51:29 EDT
Deployed HE with a bond (bond+vlan); during the deployment the bond's IP changed.

Uploaded the vdsm.log, hosted-engine.log, and ifcfg files (before setting up bond0, after setting up bond0, and after the HE deployment failed).

Attachment : https://bugzilla.redhat.com/attachment.cgi?id=1272833

(Originally by Yihui Zhao)
Comment 15 rhev-integ 2017-06-20 07:51:38 EDT
(In reply to Yihui Zhao from comment #13)
> Deploy the HE with bond(bond+vlan) during the bond's ip changed.
> 
> Upload the vdsm.log , hosted-engine.log, ifcfg files(before setup bond0),
> ifcfg files(setup bond0), ifcfg files(deploy HE failed).
> 
> Attachment : https://bugzilla.redhat.com/attachment.cgi?id=1272833

So, the bug will also block HE testing (HE with bond or bond+vlan).

(Originally by Yihui Zhao)
Comment 16 rhev-integ 2017-06-20 07:51:45 EDT
Created attachment 1272840 [details]
All files of 0417

(Originally by Daijie Guo)
Comment 17 rhev-integ 2017-06-20 07:51:53 EDT
Created attachment 1272841 [details]
All files of 0403

(Originally by Daijie Guo)
Comment 18 rhev-integ 2017-06-20 07:52:00 EDT
(In reply to Ryan Barry from comment #11)
> engine.log and vdsm.log both have messages about SSL handshake errors rather
> than 'no route to host', so networking is probably up.
> 
> Can you please provide the following:
> 
> Configure a system with a bond OR bond+vlan, but keep the configuration the
> same:
> 
> ifcfg files 0403 before and after add
> ifcfg files 0417 before and after add
> 
> host-deploy, vdsm, and engine logs from the failed addition

Ryan, I attached all of the required files and sorted them into 0403 and 0417.

(Originally by Daijie Guo)
Comment 19 rhev-integ 2017-06-20 07:52:11 EDT
From all the tests done on 0417, we observed the following (a sketch of inspection commands follows this comment):
1. Created bond0 over em1 + em2 (em1 was set as the primary slave). bond0 got em2's MAC, and its IP was 10.73.131.184.
2. Added the host over bond0; during the installation, bond0's MAC changed to em1's, and its IP became 10.73.131.65.
3. After the addition failed, bond0's IP disappeared.

But for the tests done on 0403:
1. bond0 got em1's (the primary's) MAC, and its IP was 10.73.131.65.
2. Added the host over bond0; the MAC was not changed, and the IP stayed 10.73.131.65.

(Originally by Daijie Guo)
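For anyone reproducing the observations above, a short sketch of commands to watch which slave is active and which MAC/IP the bond carries (device names as in this report):

# Bonding mode and the currently active slave
cat /proc/net/bonding/bond0

# MAC and IP currently assigned to bond0
cat /sys/class/net/bond0/address
ip addr show bond0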
Comment 21 rhev-integ 2017-06-20 07:52:27 EDT
Reassigning to vdsm for tracking.

The cause of this seems to be a known problem with NM/cockpit changing IPs if the active MAC changes. There are workarounds for this.

(Originally by Ryan Barry)
Comment 22 rhev-integ 2017-06-20 07:52:36 EDT
The proposed patch (https://gerrit.ovirt.org/77933) should be suitable for RHVH, as VDSM is already installed on it together with the NM configuration file.

Note that the NM configuration option that enables adding slaves to a bond in the order of the slaves' names (the same order initscripts uses) will be available in RHEL 7.4, with NM version 1.8.

(Originally by edwardh)
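For reference, a minimal sketch of the NetworkManager setting the patch relies on (slaves-order=name); the drop-in file name used below is an assumption, not necessarily the one VDSM writes:

# Hypothetical drop-in path; the actual file installed by VDSM may differ
cat > /etc/NetworkManager/conf.d/99-slaves-order.conf <<'EOF'
[main]
# Enslave bond members in name order (em1 before em2), matching the initscripts order
slaves-order=name
EOF
systemctl restart NetworkManager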
Comment 24 rhev-integ 2017-07-07 08:24:41 EDT
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Tag 'v4.19.21' doesn't contain patch 'https://gerrit.ovirt.org/78362']
gitweb: https://gerrit.ovirt.org/gitweb?p=vdsm.git;a=shortlog;h=refs/tags/v4.19.21

For more info please contact: rhv-devops@redhat.com
Comment 25 Edward Haas 2017-07-08 08:42:21 EDT
(In reply to rhev-integ from comment #24)
> INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following
> reason:
> 
> [Tag 'v4.19.21' doesn't contain patch 'https://gerrit.ovirt.org/78362']
> gitweb:
> https://gerrit.ovirt.org/gitweb?p=vdsm.git;a=shortlog;h=refs/tags/v4.19.21
> 
> For more info please contact: rhv-devops@redhat.com

But it does contain it. (commit 168ebb7)
Comment 28 dguo 2017-07-17 23:34:03 EDT
Failed ON_QA on the latest RHVH build.

Test version:
Red Hat Virtualization Manager Version: 4.1.4.1-0.1.el7
redhat-virtualization-host-4.1-20170714.1
vdsm-4.19.22-1.el7ev.x86_64
imgbased-0.9.33-0.1.el7ev.noarch
cockpit-ovirt-dashboard-0.10.7-0.0.21.el7ev.noarch
cockpit-ws-141-2.el7.x86_64

Test step:
1. Install a rhvh4.1
2. Configure bond0(active+backup) via cockpit on rhvh4.1
3. Add this host to engine4.1

Actual results:
1. After step #3, adding the host failed. During the installation process the IP address changed, and after the failed addition the IP address disappeared.

Expected results:
1. After step #3, the host should be added successfully.


Additional info:
1. Please see the logs in the new attachment
Comment 29 dguo 2017-07-17 23:35 EDT
Created attachment 1300219 [details]
All the logs including engine.log vdsm.log host-deploy.log ifcfg-files
Comment 30 Edward Haas 2017-07-18 02:17:17 EDT
The bond0 interface, as described in its ifcfg file before VDSM takes over, has a MAC address statically set: MACADDR=08:9E:01:63:2C:B3
VDSM does not support such a configuration; it expects NM to automatically select the MAC address per the name order.

Please advise who is setting this... Is it cockpit? If so, a BZ should be opened against it.
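For clarity, a sketch of the difference in ifcfg-bond0; only the MACADDR value comes from the logs, the remaining lines are illustrative assumptions:

# As written before VDSM takes over: the MAC is pinned statically
DEVICE=bond0
TYPE=Bond
BONDING_OPTS="mode=active-backup miimon=100"
MACADDR=08:9E:01:63:2C:B3

# What VDSM expects: no MACADDR line, so NM selects the MAC from the
# first slave in name order
DEVICE=bond0
TYPE=Bond
BONDING_OPTS="mode=active-backup miimon=100"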
Comment 31 dguo 2017-07-18 04:40:03 EDT
(In reply to Edward Haas from comment #30)
> The bond0 interface as described in its ifcfg file, before VDSM takes over,
> has an mac address statically set: MACADDR=08:9E:01:63:2C:B3
> VDSM does not support such a configuration, it expects NM to automatically
> select the mac address per the name order.
> 
> Please advice who is setting this... Is it cockpit? If so, a BZ should be
> opened against it.

There is new "Mac input box" while adding a bond on cockpit, which you can see from attached picture. 

I try to create the bond with two different way:
1. Specify the mac address with the existing em2 mac, the bond will get the em2's ip
2. Do not specify the mac address and leave the "MAC input box" blank.

For #1, Failed to add the host to engine with this bond, as you pointed.
And for #2, it also failed, I will attach the new logs.
Comment 32 dguo 2017-07-18 04:41 EDT
Created attachment 1300324 [details]
creating bond on cockpit
Comment 33 dguo 2017-07-18 04:42 EDT
Created attachment 1300325 [details]
New logs where bond mac is not specified
Comment 34 Edward Haas 2017-07-18 05:15:18 EDT
(In reply to dguo from comment #31)
> 
> I try to create the bond with two different way:
> 1. Specify the mac address with the existing em2 mac, the bond will get the
> em2's ip
> 2. Do not specify the mac address and leave the "MAC input box" blank.
> 
> For #1, Failed to add the host to engine with this bond, as you pointed.
> And for #2, it also failed, I will attach the new logs.

Thanks for the input.
I see the IP has changed, but I cannot see which MAC address it changed to.
Could you please post the "ip link" output before the 120-second timeout is reached?
Comment 35 dguo 2017-07-19 02:48:59 EDT
(In reply to Edward Haas from comment #34)
> (In reply to dguo from comment #31)
> > 
> > I try to create the bond with two different way:
> > 1. Specify the mac address with the existing em2 mac, the bond will get the
> > em2's ip
> > 2. Do not specify the mac address and leave the "MAC input box" blank.
> > 
> > For #1, Failed to add the host to engine with this bond, as you pointed.
> > And for #2, it also failed, I will attach the new logs.
> 
> Thanks for the input.
> I see the IP has changed, but I cannot see to what mac address it changed to.
> Could you please post the "ip link" output before the 120sec timeout is
> reached?

Below is the output you requested:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: p1p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 256
    link/ether 00:c0:dd:20:13:e8 brd ff:ff:ff:ff:ff:ff
3: em1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP qlen 1000
    link/ether de:27:be:a6:a6:2d brd ff:ff:ff:ff:ff:ff
4: em2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP qlen 1000
    link/ether de:27:be:a6:a6:2d brd ff:ff:ff:ff:ff:ff
5: p3p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:1b:21:a6:3d:7a brd ff:ff:ff:ff:ff:ff
6: p3p2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:1b:21:a6:3d:7b brd ff:ff:ff:ff:ff:ff
7: p2p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:1b:21:a6:64:6c brd ff:ff:ff:ff:ff:ff
8: p2p2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN qlen 1000
    link/ether 00:1b:21:a6:64:6d brd ff:ff:ff:ff:ff:ff
24: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovirtmgmt state UP qlen 1000
    link/ether de:27:be:a6:a6:2d brd ff:ff:ff:ff:ff:ff
25: ;vdsmdummy;: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether c2:e5:c8:c3:c0:4a brd ff:ff:ff:ff:ff:ff
26: ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether de:27:be:a6:a6:2d brd ff:ff:ff:ff:ff:ff
    inet 10.73.75.189/22 brd 10.73.75.255 scope global dynamic ovirtmgmt
       valid_lft 43190sec preferred_lft 43190sec
    inet6 2620:52:0:4948:dc27:beff:fea6:a62d/64 scope global mngtmpaddr dynamic 
       valid_lft 2591990sec preferred_lft 604790sec
    inet6 fe80::dc27:beff:fea6:a62d/64 scope link 
       valid_lft forever preferred_lft forever
Comment 36 Yihui Zhao 2017-07-19 06:02:27 EDT
Deploying HostedEngine also failed with bond+vlan (active+backup) configured by cockpit, because the IP of the VLAN NIC ("bond0.20") changed.

Test version:
rhvh-4.1-0.20170714.0+1
vdsm-4.19.22-1.el7ev.x86_64
cockpit-ovirt-dashboard-0.10.7-0.0.21.el7ev.noarch
ovirt-hosted-engine-setup-2.1.3.4-1.el7ev.noarch
imgbased-0.9.33-0.1.el7ev.noarch
rhvm-appliance-4.1.20170709.3-1.el7.noarch


How to reproduce:
1. Install RHVH4.1
2. Configure network (bond+vlan) by cockpit
3. Deploy HostedEngine



Actual results:
1. After step 3, deployment failed. During the deployment process the IP address changed.

Expected results:
1. After step 3, the host should deploy HostedEngine successfully.

Additional info:
/var/log/* in the attachment.
Comment 37 Yihui Zhao 2017-07-19 06:03 EDT
Created attachment 1300958 [details]
/var/log/*
Comment 38 Edward Haas 2017-07-19 07:59:25 EDT
It looks like the MAC assigned to the bond (de:27:be:a6:a6:2d) is not one of the NICs' original MACs.
@mburman also saw the same scenario in the TLV lab this morning.

We currently suspect NM of overwriting the bond MAC, although it does not manage it.

Waiting for some insights from the NM team on this.
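One way to confirm this on the host (a sketch; device names follow the "ip link" output above):

# MAC currently assigned to the bond
cat /sys/class/net/bond0/address

# Permanent (hardware) MACs of the slaves; once enslaved, "ip link" shows
# the bond's MAC on them, so query the permanent address instead
ethtool -P em1
ethtool -P em2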
Comment 39 Michael Burman 2017-07-19 09:08 EDT
Created attachment 1301090 [details]
NM log in debug1
Comment 40 Francesco Giudici 2017-07-19 13:15:55 EDT
I isolated the issue that is probably the root cause of this (thanks to Michael for sharing the setup and machine).
If a NetworkManager bond connection which is already up is brought up again, the MAC address will be reset to the "fake" one the bond interface had before enslaving any interface.

Tracked this with trace level logs here:
https://bugzilla.redhat.com/show_bug.cgi?id=1472965
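A sketch of how the suspected behavior can be checked; the connection name "bond0" is an assumption (it is whatever profile cockpit created):

# With the bond already active and carrying a slave-derived MAC:
cat /sys/class/net/bond0/address

# Re-activate the already-active bond connection; per the analysis above,
# the MAC is then reset to the "fake" address the interface had before any
# slave was enslaved
nmcli connection up bond0
cat /sys/class/net/bond0/address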
Comment 41 Pavol Brilla 2017-07-24 02:29:18 EDT
Changing the summary to reflect the re-targeting of the bug.
