Bug 1556666 - Bond network and ovirtmgmt disappear when rebooting
Summary: Bond network and ovirtmgmt disappear when rebooting
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-node
Classification: oVirt
Component: Installation & Update
Version: 4.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.2.7
Assignee: Yuval Turgeman
QA Contact: Wei Wang
URL:
Whiteboard:
Depends On: 1548265
Blocks:
 
Reported: 2018-03-15 02:20 UTC by Huijuan Zhao
Modified: 2018-11-02 14:35 UTC
CC: 14 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2018-11-02 14:33:13 UTC
oVirt Team: Node
Embargoed:
rule-engine: ovirt-4.2+
cshao: testing_ack+


Attachments
All logs from host (11.65 MB, application/x-gzip)
2018-03-15 02:20 UTC, Huijuan Zhao
log from engine (511.74 KB, text/plain)
2018-03-15 02:41 UTC, Huijuan Zhao

Description Huijuan Zhao 2018-03-15 02:20:01 UTC
Created attachment 1408319 [details]
All logs from host

Description of problem:
Bond network and ovirtmgmt disappeared after upgrade to rhvh-4.1-0.20180307.0


Version-Release number of selected component (if applicable):
# imgbase layout
rhvh-4.1-0.20171207.0
 +- rhvh-4.1-0.20171207.0+1
rhvh-4.1-0.20180307.0
 +- rhvh-4.1-0.20180307.0+1


How reproducible:
100%


Steps to Reproduce:
1. Install RHVH 4.1 (rhvh-4.1-0.20171207.0)
2. Set up a bond0 (active-backup) network over two slave NICs (eno1 and eno2); bond0 can get a dhcp ip (10.73.130.225) normally
3. Register host to rhvm-4.1 with bond0 (10.73.130.225); the ovirtmgmt ip changes to 10.73.130.251 during adding to rhvm, so add the host to rhvm again with the new ovirtmgmt ip 10.73.130.251, then check host status in rhvm and network status on the host
4. Set up local repos, and upgrade the host to rhvh-4.1-0.20180307.0
5. Reboot the host and boot into rhvh-4.1-0.20180307.0, then check host status in rhvm and network status on the host
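
To check the network status on the host in steps 3 and 5, commands along these lines suffice (a minimal sketch; the report does not state the exact commands, but the outputs below match these):

# ip a s
# ls -l /etc/sysconfig/network-scripts/ifcfg-*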


Actual results:
1. After step3, the host is up in rhvm, the ovirtmgmt network is present on the host, and there are ifcfg-bond0, ifcfg-eno1, ifcfg-eno2, and ifcfg-ovirtmgmt in /etc/sysconfig/network-scripts
2. After step5, the host is down in rhvm and there is no ovirtmgmt network on the host. ifcfg-eno1, ifcfg-eno2, and ifcfg-ovirtmgmt have disappeared from /etc/sysconfig/network-scripts
# ip a s
2: eno1: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 08:94:ef:21:c0:4d brd ff:ff:ff:ff:ff:ff
3: eno2: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 08:94:ef:21:c0:4e brd ff:ff:ff:ff:ff:ff
       valid_lft forever preferred_lft forever
27: bond0: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether 08:94:ef:21:c0:4d brd ff:ff:ff:ff:ff:ff
28: ;vdsmdummy;: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 16:31:04:87:6f:36 brd ff:ff:ff:ff:ff:ff


Expected results:
After step5, the state should be the same as after step3:
The host should be up in rhvm, the ovirtmgmt network should be present on the host, and ifcfg-bond0, ifcfg-eno1, ifcfg-eno2, and ifcfg-ovirtmgmt should exist in /etc/sysconfig/network-scripts


Additional info:
If the host is not registered to rhvm with bond0, the network persists after the upgrade.

Comment 1 Huijuan Zhao 2018-03-15 02:41:03 UTC
Created attachment 1408323 [details]
log from engine

Comment 2 Ryan Barry 2018-03-15 04:08:46 UTC
I am already sure that I do not have the environment needed to reproduce this.

Can you please provide a test system? After 2nd rhvm registration but before update would be ideal

Comment 3 Huijuan Zhao 2018-03-15 05:40:32 UTC
(In reply to Ryan Barry from comment #2)
> I am already sure that I do not have the environment needed to reproduce
> this.
> 
> Can you please provide a test system? After 2nd rhvm registration but before
> update would be ideal

Sure. I will send the ENV info to you via email as soon as possible, maybe several hours from now; the machine is currently being used by another colleague.

Comment 4 Yaniv Kaul 2018-03-15 11:22:08 UTC
I find this part a bit strange:
3. Register host to rhvm-4.1 with bond0 (10.73.130.225); the ovirtmgmt ip changes to 10.73.130.251 during adding to rhvm, so add the host to rhvm again with the new ovirtmgmt ip 10.73.130.251, then check host status in rhvm and network status on the host

Why does it change IP address?!

Comment 5 Huijuan Zhao 2018-03-16 00:55:20 UTC
(In reply to Yaniv Kaul from comment #4)
> I find this part a bit strange:
> 3. Register host to rhvm-4.1 with bond0 (10.73.130.225); the ovirtmgmt ip
> changes to 10.73.130.251 during adding to rhvm, so add the host to rhvm again
> with the new ovirtmgmt ip 10.73.130.251, then check host status in rhvm and
> network status on the host
> 
> Why does it change IP address?!

I think this is an old bug related to Bug 1443347 and Bug 1422430.

Comment 6 Huijuan Zhao 2018-03-16 05:42:06 UTC
(In reply to Huijuan Zhao from comment #3)
> (In reply to Ryan Barry from comment #2)
> > I am already sure that I do not have the environment needed to reproduce
> > this.
> > 
> > Can you please provide a test system? After 2nd rhvm registration but before
> > update would be ideal
> 
> Sure. I will send the ENV info to you via email as soon as possible, maybe
> several hours from now; the machine is currently being used by another colleague.

Ryan, already sent ENV info to you via email, please check.

Comment 7 Ryan Barry 2018-03-18 22:37:11 UTC
Unfortunately, this continues to work for me after the one change of address --

Booting back into the old image and upgrading+rebooting repeatedly has not resulted in a change, and it 'sticks' with 10.73.131.63

It's possible that this is due to some NetworkManager change, since I've seen a couple of bugs filed by virt QE against NM with vlan+bond.

Comment 8 Ryan Barry 2018-03-20 21:58:20 UTC
There are a variety of problems here, and I don't think any of them are RHVH.

It's possible that there was an incomplete fix from either cockpit or vdsm, since a simple reboot of the system does not return the 2nd address. It is always the first.

vdsm-restore-net-config is also not saving us here.

Finally, NetworkManager itself is killing the configuration files, which is definitely a bug.

As before, if I reboot to repeat the process, everything works ok. It is only the first time with this "double registration" flow which is broken. It's possible that there's a race between vdsm and NM here, but there's definitely bad behavior from NM:

For reference, I ensured that RHVH actually kept these ifcfg files before rebooting (the first, broken time), and saved them off just in case.

2018-03-20 16:51:02,978 [DEBUG] (MainThread) Bases: [<Base rhvh-4.1-0.20171207.0 [<Layer rhvh-4.1-0.20171207.0+1 />] />, <Base rhvh-4.1-0.20180307.0 [<Layer rhvh-4.1-0.20180307.0+1 />] />]                                                  
2018-03-20 16:51:02,978 [INFO] (MainThread) No bases to free                                                                                                                                                                                  
[root@localhost ~]# ls -l /tmp/a/etc/sysconfig/network-scripts/ifcfg-*                                                                                                                                                                        
-rw-rw-r--. 1 root root 180 Mar 18 21:13 /tmp/a/etc/sysconfig/network-scripts/ifcfg-bond0                                                                                                                                                     
-rw-rw-r--. 1 root root 139 Mar 18 21:13 /tmp/a/etc/sysconfig/network-scripts/ifcfg-em1                                                                                                                                                       
-rw-rw-r--. 1 root root 139 Mar 18 21:13 /tmp/a/etc/sysconfig/network-scripts/ifcfg-em2                                                                                                                                                       
-rw-r--r--. 1 root root 275 Mar 18 20:34 /tmp/a/etc/sysconfig/network-scripts/ifcfg-em3                                                                                                                                                       
-rw-r--r--. 1 root root 275 Mar 18 20:34 /tmp/a/etc/sysconfig/network-scripts/ifcfg-em4                                                                                                                                                       
-rw-r--r--. 1 root root 254 Jan  2 11:29 /tmp/a/etc/sysconfig/network-scripts/ifcfg-lo                                                                                                                                                        
-rw-rw-r--. 1 root root 219 Mar 18 21:13 /tmp/a/etc/sysconfig/network-scripts/ifcfg-ovirtmgmt                                                                                                                                                 
-rw-r--r--. 1 root root 277 Mar 18 20:34 /tmp/a/etc/sysconfig/network-scripts/ifcfg-p5p1                                                                                                                                                      
-rw-r--r--. 1 root root 277 Mar 18 20:34 /tmp/a/etc/sysconfig/network-scripts/ifcfg-p5p2                                                                                                                                                      
-rw-r--r--. 1 root root 277 Mar 18 20:34 /tmp/a/etc/sysconfig/network-scripts/ifcfg-p7p1                                                                                                                                                      
-rw-r--r--. 1 root root 277 Mar 18 20:34 /tmp/a/etc/sysconfig/network-scripts/ifcfg-p7p2                                                                                                                                                                                                                                   
[root@localhost ~]# mkdir backup                                                                                                                                                                                                              
[root@localhost ~]# cp -rpv /etc/sysconfig/network-scripts backup/ 

After the reboot:

Mar 20 17:01:07 localhost.localdomain NetworkManager[1440]: <info>  [1521579667.2225] settings: loaded plugin ifcfg-rh: (c) 2007 - 2015 Red Hat, Inc.  To report bugs please use the NetworkManager mailing list. (/usr/lib64/NetworkManager/libnm-settings-plugin-ifcfg-rh.so)
Mar 20 17:01:07 localhost.localdomain NetworkManager[1440]: <info>  [1521579667.3416] ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-em1 (1dad842d-1912-ef5a-a43a-bc238fb267e7,"System em1")
Mar 20 17:01:07 localhost.localdomain NetworkManager[1440]: <info>  [1521579667.3417] ifcfg-rh: Ignoring connection /etc/sysconfig/network-scripts/ifcfg-em1 (1dad842d-1912-ef5a-a43a-bc238fb267e7,"System em1") due to NM_CONTROLLED=no. Unmanaged: interface-name:em1.
Mar 20 17:01:07 localhost.localdomain NetworkManager[1440]: <info>  [1521579667.3420] ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-em2 (0578038a-64e9-a2fd-0a28-e4cd0b553930,"System em2")
Mar 20 17:01:07 localhost.localdomain NetworkManager[1440]: <info>  [1521579667.3421] ifcfg-rh: Ignoring connection /etc/sysconfig/network-scripts/ifcfg-em2 (0578038a-64e9-a2fd-0a28-e4cd0b553930,"System em2") due to NM_CONTROLLED=no. Unmanaged: interface-name:em2.
Mar 20 17:01:07 localhost.localdomain NetworkManager[1440]: <info>  [1521579667.3426] ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-bond0 (ad33d8b0-1f7b-cab9-9447-ba07f855b143,"System bond0")
Mar 20 17:01:07 localhost.localdomain NetworkManager[1440]: <info>  [1521579667.3426] ifcfg-rh: Ignoring connection /etc/sysconfig/network-scripts/ifcfg-bond0 (ad33d8b0-1f7b-cab9-9447-ba07f855b143,"System bond0") due to NM_CONTROLLED=no. Unmanaged: interface-name:bond0.
Mar 20 17:01:07 localhost.localdomain NetworkManager[1440]: <info>  [1521579667.3439] ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt (9a0b07c0-2983-fe97-ec7f-ad2b51c3a3f0,"System ovirtmgmt")
Mar 20 17:01:07 localhost.localdomain NetworkManager[1440]: <info>  [1521579667.3439] ifcfg-rh: Ignoring connection /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt (9a0b07c0-2983-fe97-ec7f-ad2b51c3a3f0,"System ovirtmgmt") due to NM_CONTROLLED=no. Unmanaged: interface-name:ovirtmgmt.
Mar 20 17:01:07 localhost.localdomain NetworkManager[1440]: <info>  [1521579667.3648] ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-em3 (a9cf239b-a8d9-4013-8686-ebea720a79ea,"em3")
Mar 20 17:01:07 localhost.localdomain NetworkManager[1440]: <info>  [1521579667.3661] ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-em4 (298b80d1-d37c-4537-bfff-392760369184,"em4")
Mar 20 17:01:07 localhost.localdomain NetworkManager[1440]: <info>  [1521579667.3673] ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-p5p1 (20d95264-bf94-485b-90b0-0b28933b76b9,"p5p1")
Mar 20 17:01:07 localhost.localdomain NetworkManager[1440]: <info>  [1521579667.3686] ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-p5p2 (8bdd3eee-d96b-4c6f-9f05-ecd5d15a1158,"p5p2")
Mar 20 17:01:07 localhost.localdomain NetworkManager[1440]: <info>  [1521579667.3699] ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-p7p1 (ef0195fd-a49f-4a38-aec2-304cdb750457,"p7p1")
Mar 20 17:01:07 localhost.localdomain NetworkManager[1440]: <info>  [1521579667.3712] ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-p7p2 (2909cfd3-8bf7-40f2-977c-bc5795561f35,"p7p2")
Mar 20 17:01:39 dell-per730-34.lab.eng.pek2.redhat.com dracut[7186]: Executing: /usr/sbin/dracut --hostonly --hostonly-cmdline --hostonly-i18n -o "plymouth dash resume ifcfg" -a watchdog --mount "/dev/mapper/rhvh_dell--per730--34-var /kdumproot//var ext4 defaults,discard" --no-hostonly-default-device -f /boot/initramfs-3.10.0-858.el7.x86_64kdump.img 3.10.0-858.el7.x86_64
Mar 20 17:01:40 dell-per730-34.lab.eng.pek2.redhat.com NetworkManager[1440]: <info>  [1521579700.0096] ifcfg-rh: remove /etc/sysconfig/network-scripts/ifcfg-bond0 (ad33d8b0-1f7b-cab9-9447-ba07f855b143,"System bond0")
Mar 20 17:01:40 dell-per730-34.lab.eng.pek2.redhat.com NetworkManager[1440]: <info>  [1521579700.0097] ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-bond0 (a404fff1-2206-4fdb-80ac-1f899bae26ae,"bond0")

[root@dell-per730-34 ~]# diff /tmp/a/etc/sysconfig/network-scripts/ifcfg-bond0 /etc/sysconfig/network-scripts/ifcfg-bond0                                                                                                                     
1d0                                                                                                                                                                                                                                           
< # Generated by VDSM version 4.19.42-1.el7ev                                                                                                                                                                                                 
3,4c2,17
< BONDING_OPTS='mode=1 miimon=100 primary=em1'
< BRIDGE=ovirtmgmt
---
> BONDING_OPTS="miimon=100 updelay=0 downdelay=0 mode=active-backup primary=em1"
> TYPE=Bond
> BONDING_MASTER=yes
> MACADDR=24:6E:96:19:BB:70
> PROXY_METHOD=none
> BROWSER_ONLY=no
> BOOTPROTO=dhcp
> DEFROUTE=yes
> IPV4_FAILURE_FATAL=no
> IPV6INIT=yes
> IPV6_AUTOCONF=yes
> IPV6_DEFROUTE=yes
> IPV6_FAILURE_FATAL=no
> IPV6_ADDR_GEN_MODE=stable-privacy
> NAME=bond0
> UUID=a404fff1-2206-4fdb-80ac-1f899bae26ae
6,9c19
< MTU=1500
< DEFROUTE=no
< NM_CONTROLLED=no
< IPV6INIT=no
---
> AUTOCONNECT_SLAVES=yes
[root@dell-per730-34 ~]# cat /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
BONDING_OPTS="miimon=100 updelay=0 downdelay=0 mode=active-backup primary=em1"
TYPE=Bond
BONDING_MASTER=yes
MACADDR=24:6E:96:19:BB:70
PROXY_METHOD=none
BROWSER_ONLY=no
BOOTPROTO=dhcp
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=bond0
UUID=a404fff1-2206-4fdb-80ac-1f899bae26ae
ONBOOT=yes
AUTOCONNECT_SLAVES=yes


NetworkManager actually removes our bond0 here and overwrites it with something that looks like a stock config, managed by NM. It's not in the logs, but I suspect it's also removing em1 and em2, since those are slaves of a "bad" configuration.
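
While debugging, one way to rule NM out entirely is a conf.d drop-in that marks the devices unmanaged (standard NetworkManager configuration; a sketch, not something done in this report, since the ifcfg files here rely on NM_CONTROLLED=no instead):

# cat /etc/NetworkManager/conf.d/99-unmanaged.conf
[keyfile]
unmanaged-devices=interface-name:bond0;interface-name:em1;interface-name:em2
# systemctl restart NetworkManager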

It also seems that vdsm-restore-net-config loses the race this time (or wins, and NM doesn't like it -- it's impossible to tell).

Ben - why would NM overwrite ifcfg files? Especially ones which are not managed by NM?

I'm also curious if you have any idea why ifcfg files on an umounted LV are disappearing on a reboot -- they are present before, but gone after rebooting with whatever configuration QE is using.

Dan - any ideas why vdsm-restore-net-config wouldn't win here? We don't have millisecond timestamps in the journal, but it looks like it's racing:

supervdsm.log:restore-net::INFO::2018-03-20 17:01:37,849::vdsm-restore-net-config::470::root::(restore) starting network restoration.                                                                                                         
supervdsm.log:restore-net::DEBUG::2018-03-20 17:01:37,852::vdsm-restore-net-config::226::root::(_remove_networks_in_running_config) Not cleaning running configuration since it is empty.                                                     
supervdsm.log:restore-net::INFO::2018-03-20 17:01:37,862::ifcfg::548::root::(_loadBackupFiles) Loaded /var/lib/vdsm/netconfback/ifcfg-em2                                                                                                     
supervdsm.log:restore-net::INFO::2018-03-20 17:01:37,868::ifcfg::548::root::(_loadBackupFiles) Loaded /var/lib/vdsm/netconfback/ifcfg-em1                                                                                                     
supervdsm.log:restore-net::INFO::2018-03-20 17:01:37,868::ifcfg::548::root::(_loadBackupFiles) Loaded /var/lib/vdsm/netconfback/ifcfg-ovirtmgmt                                                                                               
supervdsm.log:restore-net::INFO::2018-03-20 17:01:37,873::ifcfg::548::root::(_loadBackupFiles) Loaded /var/lib/vdsm/netconfback/ifcfg-bond0                                                                                                   
supervdsm.log:restore-net::DEBUG::2018-03-20 17:01:37,874::commands::69::root::(execCmd) /usr/bin/taskset --cpu-list 0-23 /sbin/ifdown ovirtmgmt (cwd None)                                                                                   
superv
...
supervdsm.log:restore-net::DEBUG::2018-03-20 17:01:40,185::commands::93::root::(execCmd) SUCCESS: <err> = 'Running scope as unit 9c51e623-39af-475b-832f-05e62a26dba9.scope.\n'; <rc> = 0
supervdsm.log:restore-net::DEBUG::2018-03-20 17:01:40,199::vdsm-restore-net-config::382::root::(_wait_for_for_all_devices_up) All devices are up.
supervdsm.log:restore-net::WARNING::2018-03-20 17:01:40,248::bridges::42::root::(ports) ovirtmgmt is not a Linux bridge
supervdsm.log:restore-net::INFO::2018-03-20 17:01:40,249::cache::217::root::(_getNetInfo) Obtaining info for net ovirtmgmt.
supervdsm.log:restore-net::DEBUG::2018-03-20 17:01:40,285::commands::69::root::(execCmd) /usr/bin/taskset --cpu-list 0-23 /sbin/tc qdisc show (cwd None)
supervdsm.log:restore-net::DEBUG::2018-03-20 17:01:40,300::commands::93::root::(execCmd) SUCCESS: <err> = ''; <rc> = 0
supervdsm.log:restore-net::DEBUG::2018-03-20 17:01:40,303::commands::69::root::(execCmd) /usr/bin/taskset --cpu-list 0-23 /bin/systemctl --no-pager list-unit-files (cwd None)
supervdsm.log:restore-net::DEBUG::2018-03-20 17:01:40,406::commands::93::root::(execCmd) SUCCESS: <err> = ''; <rc> = 0
supervdsm.log:restore-net::DEBUG::2018-03-20 17:01:40,407::commands::69::root::(execCmd) /usr/bin/taskset --cpu-list 0-23 /bin/systemctl status openvswitch.service (cwd None)
supervdsm.log:restore-net::DEBUG::2018-03-20 17:01:40,445::commands::93::root::(execCmd) FAILED: <err> = ''; <rc> = 3
supervdsm.log:restore-net::INFO::2018-03-20 17:01:40,446::netconfpersistence::68::root::(setBonding) Adding bond0({'nics': [], 'switch': 'legacy', 'options': 'miimon=100 mode=1'})
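
One more note for future runs: the journal can print sub-second timestamps, which would make the suspected race easier to call (standard journalctl options; the excerpts above were captured without them):

[root@dell-per730-34 ~]# journalctl -b -o short-precise -u NetworkManager | grep ifcfg-bond0
[root@dell-per730-34 ~]# grep restore-net /var/log/vdsm/supervdsm.log | grep -i bond0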

Huijuan - 

Does this only happen if you use cockpit? I don't like this configuration in general, since the IP does not survive a reboot even without an update. It seems problematic, and like it may not work in general.

Comment 9 Beniamino Galvani 2018-03-21 07:52:18 UTC
(In reply to Ryan Barry from comment #8)

> NetworkManager actually removes our bond0 here and overwrites it with
> something that looks like a stock config, managed by NM. It's not in the
> logs, but I suspect it's also removing em1 and em2, since those are slaves
> of a "bad" configuration.
> 
> It also seems that vdsm-restore-net-config loses the race this time (or
> wins, and NM doesn't like it -- it's impossible to tell).
> 
> Ben - why would NM overwrite ifcfg files? Especially ones which are not
> managed by NM?

Hi,

NM never removes connection files on its own initiative. At lines:

[1521579700.0096] ifcfg-rh: remove /etc/sysconfig/network-scripts/ifcfg-bond0 (ad33d8b0-1f7b-cab9-9447-ba07f855b143,"System bond0")
[1521579700.0097] ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-bond0 (a404fff1-2206-4fdb-80ac-1f899bae26ae,"bond0")

it detects that the file disappeared and that a new file was created, but those changes were made externally by someone else.
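
If it would help to catch the external writer, an audit watch on the file would name the process (standard auditctl/ausearch usage; a suggestion only, this was not run here):

# auditctl -w /etc/sysconfig/network-scripts/ifcfg-bond0 -p wa -k ifcfg-bond0
# ausearch -k ifcfg-bond0 -i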

Comment 10 cshao 2018-03-21 12:05:27 UTC
> Huijuan - 
> 
> Does this only happen if you use cockpit? I don't like this configuration in
> general, since the IP does not survive a reboot even without an update. It
> seems problematic, and like it may not work in general.

Yes, it only happens with cockpit; there is no such issue through the Anaconda GUI.

Comment 11 Ryan Barry 2018-03-21 14:09:19 UTC
Ben - I actually tested removing files to see if NM would spit out the same message and did not see anything. Does this only happen at one point during init?

Huijuan - this may be a cockpit bug. If this is only reproducible through cockpit, I'd like to reduce the severity/priority (especially since the steps taken to reproduce are unusual). Since this is only reproducible one time, even on the provided system, and the provided network configuration does not come back after rebooting, I may need another test system.

Specifically, I'd like to try disabling cockpit on the updated system to see if this resolves. Clearly, not having cockpit enabled is not an option, but if this is isolated to cockpit, let's push it out until they resolve. This looks like a regression in https://bugzilla.redhat.com/show_bug.cgi?id=1443347

Comment 12 Beniamino Galvani 2018-03-21 15:18:25 UTC
(In reply to Ryan Barry from comment #11)
> Ben - I actually tested removing files to see if NM would spit out the same
> message and did not see anything. Does this only happen at one point during
> init?

It's because NM doesn't monitor ifcfg files at runtime; they are only loaded at startup or when the user calls 'nmcli connection reload' / 'nmcli connection load <file>'.

Note that the 'ifup' command issues an 'nmcli connection load' under the hood. A simple way to see the message (sketched concretely below) would be:

 - delete and recreate the ifcfg-bond0 file with a different UUID (or just change the UUID)
 - 'ifup bond0' or 'nmcli connection reload'
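
Concretely, the sequence below should reproduce the remove/new-connection pair in the journal (a sketch; the replacement UUID is an arbitrary example):

# sed -i 's/^UUID=.*/UUID=11111111-2222-3333-4444-555555555555/' /etc/sysconfig/network-scripts/ifcfg-bond0
# nmcli connection reload
# journalctl -u NetworkManager --since "1 min ago" | grep ifcfg-bond0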


When a connection file gets deleted through NetworkManager (which is not the case in the log from comment 8), you should see an audit entry like this in the journal:

 NetworkManager[15436]: <info>  [1521645358.6220] audit: op="connection-delete" uuid="2d74c40b-7514-467e-ad00-935d7155b3f6" name="bond0" pid=18210 uid=0 result="success"

Comment 13 cshao 2018-03-22 02:28:45 UTC
(In reply to Ryan Barry from comment #11)
> Ben - I actually tested removing files to see if NM would spit out the same
> message and did not see anything. Does this only happen at one point during
> init?
> 
> Huijuan - this may be a cockpit bug. If this is only reproducible through
> cockpit, I'd like to reduce the severity/priority (especially since the
> steps taken to reproduce are unusual). Since this is only reproducible one
> time, even on the provided system, and the provided network configuration
> does not come back after rebooting, I may need another test system.
> 
> Specifically, I'd like to try disabling cockpit on the updated system to see
> if this resolves. Clearly, not having cockpit enabled is not an option, but
> if this is isolated to cockpit, let's push it out until they resolve. This
> looks like a regression in
> https://bugzilla.redhat.com/show_bug.cgi?id=1443347

Since huzhao is on PTO today, I will redefine the reproduction steps and provide a test ENV for you later.

Comment 14 cshao 2018-03-22 08:49:35 UTC
Short Summary:
1. The upgrade step is not needed to reproduce this bug.
2. This bug is only reproducible through cockpit.
3. The bug is easy to reproduce by following the steps below.
4. I suspect the bug is related to bug 1422430.


===========
Scenario 1: Configure bond via cockpit (specify mac address and primary)

Test version:
redhat-virtualization-host-4.1-20171207.0

Test steps:
1. Install redhat-virtualization-host-4.1-20171207.0 via anaconda, configure one nic (eno1) up.
2. Key step: Log in to the cockpit UI and set up a dhcp bond (active-backup mode) over two nics (eno1 + eno2); specifying one mac address (eno1) is a must, and choose eno1 as primary as well.
3. Register host to rhvm-4.1 with bond0 (this step makes the IP change, so registration fails).
4. Add host to rhvm again with the new ip.
5. Reboot host.

Test result:
1. After step3, the RHVH host got a new IP, and registration failed due to the IP change.
2. After step4, RHVH can come up in RHVM.
3. After step5, the bond network and ovirtmgmt disappeared.
There are no ifcfg-eno1, ifcfg-eno2, or ifcfg-ovirtmgmt in /etc/sysconfig/network-scripts


===========
Scenario 2: Configure bond via cockpit (do not specify mac address and primary)

Test version:
redhat-virtualization-host-4.1-20171207.0

Test steps:
1. Install redhat-virtualization-host-4.1-20171207.0 via anaconda, configure one nic (eno1) up.
2. Log in to the cockpit UI and set up a dhcp bond (active-backup mode) over two nics (eno1 + eno2); do not specify a mac address, and do not set the primary option.
3. Register host to rhvm-4.1 with bond0 (this step makes the IP get lost, so registration fails).
4. Reboot host and check the IP.

Test result:
1. After step3, the RHVH host lost its IP, and registration failed.
2. After step4, the RHVH host still has no IP.
There are no ifcfg-eno1, ifcfg-eno2, or ifcfg-ovirtmgmt in /etc/sysconfig/network-scripts


===========
Scenario 3: Configure bond via cockpit (do not specify mac address but set primary)

Test version:
redhat-virtualization-host-4.1-20180314.0

Test steps:
1. Install redhat-virtualization-host-4.1-20180314.0 via anaconda, configure one nic (eno1) up.
2. Log in to the cockpit UI and set up a dhcp bond (active-backup mode) over two nics (eno1 + eno2); do not specify a mac address, but set the primary (eno1). The RHVH host obtains a new IP provided by eno2.
3. Register host to rhvm-4.1 with bond0.
4. Reboot host and check the IP.

Test result:
1. After step2, the RHVH host obtains a new IP provided by eno2.
2. After step3, registration succeeds.
3. After step4, the RHVH host's IP changes back to the previous one provided by eno1.
There are ifcfg-eno1, ifcfg-eno2, and ifcfg-ovirtmgmt in /etc/sysconfig/network-scripts


===========
Scenario 4: Configuring the bond via Anaconda works well.
1. Install redhat-virtualization-host-4.1-20171207.0 via anaconda and set up a dhcp bond (active-backup mode) over two nics (do not specify a mac addr, as there is no such option in the anaconda UI).
2. Register to RHVM.
3. Reboot.

Test result:
1. After step2, the IP did not change, and registration to RHVM succeeded.
2. After step3, everything is ok.
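
For completeness, the Anaconda bond setup in this scenario has a kickstart equivalent (standard RHEL 7 kickstart syntax; the option values here are assumed from the scenario, not taken from QE's actual installation):

network --device=bond0 --bondslaves=eno1,eno2 --bondopts=mode=active-backup,miimon=100 --bootproto=dhcp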

Comment 15 Ryan Barry 2018-03-24 01:53:12 UTC
Can you please check this on 4.2?

https://bugzilla.redhat.com/show_bug.cgi?id=1443347#c28 indicates that this may be fixed there, and not 7.5

Comment 16 Huijuan Zhao 2018-03-24 03:23:02 UTC
(In reply to Ryan Barry from comment #15)
> Can you please check this on 4.2?
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1443347#c28 indicates that this
> may be fixed there, and not 7.5

Still have issues on 4.2, but not exactly the same as comment 0.

Test version:
From: rhvh-4.1-0.20171207.0
To:   rhvh-4.2-0.20180322.0

Test steps:
Same as comment 0

Actual results:
After step5, the host is down in rhvm; the ovirtmgmt network is present on the host but has no dhcp IP, and ifcfg-ovirtmgmt has disappeared from /etc/sysconfig/network-scripts.


I will send ENV info to you via email, please check if needed.

Comment 17 Ryan Barry 2018-03-24 09:52:28 UTC
supervdsm.log:restore-net::ERROR::2018-03-23 22:56:50,211::restore_net_config:308::root::(_find_nets_with_available_devices) Bond "bond0" is not persisted and will not be configured. Network "ovirtmgmt" will not be configured as a consequence.

Huijuan -

Is this reproducible on RHEL?

Edward - any thoughts?

The test scenario in comment#1 looks wrong to me (double registration to RHVM).

Comment 18 Edward Haas 2018-03-25 08:53:06 UTC
(In reply to Huijuan Zhao from comment #16)
> 
> Still have issues on 4.2, but not exactly the same as comment 0.
> 
> Test version:
> From: rhvh-4.1-0.20171207.0
> To:   rhvh-4.2-0.20180322.0
> 
> Test steps:
> Same as comment 0

As a general note, the ability for VDSM to acquire an "external" bond and persist it in its config was fixed in 4.2.
If the external bond is not persisted by VDSM and there is a failure when attempting to do setupNetworks, it will not be persisted after reboot.
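
A quick way to see what VDSM has persisted is to look under its netconfpersistence directory (path as used by VDSM 4.x; worth double-checking on the installed build):

# ls /var/lib/vdsm/persistence/netconf/nets/ /var/lib/vdsm/persistence/netconf/bonds/
# cat /var/lib/vdsm/persistence/netconf/bonds/bond0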

1. Is the problem reported here a regression from a previous 4.1 version? Could you please confirm this and mention the working version? (including what RHEL version it is based on)

2. If there is a similar problem on 4.2 or something else is not working there, please add the relevant logs and mention the time each step has been taken (to interpret the log correctly).
If this is not like the problem described in this BZ, better open a new one.

Comment 19 Huijuan Zhao 2018-03-26 11:00:14 UTC
(In reply to Edward Haas from comment #18)
> (In reply to Huijuan Zhao from comment #16)
> > 
> > Still have issues on 4.2, but not exactly the same as comment 0.
> > 
> > Test version:
> > From: rhvh-4.1-0.20171207.0
> > To:   rhvh-4.2-0.20180322.0
> > 
> > Test steps:
> > Same as comment 0
> 
> As a general note, the ability for VDSM to acquire an "external" bond and
> persist it in its config was fixed in 4.2.
> If the external bond is not persisted by VDSM and there is a failure when
> attempting to do setupNetworks, it will not be persisted after reboot.
> 
> 1. Is the problem reported here a regression from a previous 4.1 version?
> Could you please confirm this and mention the working version? (including
> what RHEL version it is based on)
> 
> 2. If there is a similar problem on 4.2 or something else is not working
> there, please add the relevant logs and mention the time each step has been
> taken (to interpret the log correctly).
> If this is not like the problem described in this BZ, better open a new one.

This is not a regression from a previous 4.1 version (both for rhel_7.4 and rhel_7.5 based builds). Actually, this test scenario was blocked by Bug 1443347 and Bug 1422430 for a long time.

Comment 20 Huijuan Zhao 2018-03-26 11:05:54 UTC
(In reply to Ryan Barry from comment #17)
> supervdsm.log:restore-net::ERROR::2018-03-23
> 22:56:50,211::restore_net_config:308::root::
> (_find_nets_with_available_devices) Bond "bond0" is not persisted and will
> not be configured. Network "ovirtmgmt" will not be configured as a
> consequence.
> 
> Huijuan -
> 
> Is this reproducible on RHEL?
> 
This issue only occurs when registering the host to the engine. I think it is related to vdsm/NM, so in my opinion it may also be reproducible on RHEL. I will test it later.

Comment 21 Ryan Barry 2018-03-26 11:06:09 UTC
Just to clarify:

This is broken on both 7.4 and 7.5 for 4.1, correct?

Does this work in 4.2?

Comment 22 Huijuan Zhao 2018-03-26 11:12:26 UTC
(In reply to Ryan Barry from comment #21)
> Just to clarify:
> 
> This is broken on both 7.4 and 7.5 for 4.1, correct?

Yes.

> 
> Does this work in 4.2?

1. If upgrading from rhvh-4.1 to 4.2, the issue exists.
2. For rhvh-4.2 rebooting, or upgrading from 4.2 to 4.2, I have to test this later and will update the results here.

Comment 23 Huijuan Zhao 2018-03-28 02:42:31 UTC
(In reply to Huijuan Zhao from comment #22)
> (In reply to Ryan Barry from comment #21)
> > Just to clarify:
> > 
> > This is broken on both 7.4 and 7.5 for 4.1, correct?
> 
> Yes.
> 
> > 
> > Does this work in 4.2?
> 
> 1. If upgrading from rhvh-4.1 to 4.2, the issue exists.
> 2. For rhvh-4.2 rebooting, or upgrading from 4.2 to 4.2, I have to test this
> later and will update the results here.

I forgot that for 4.2, this scenario can not be tested currently due to Bug 1548265. 

And per the comment 14 Scenario 4 test results, there is a workaround that lets a bonded host register to rhvm successfully, so I will lower the Severity.

Comment 24 Huijuan Zhao 2018-03-29 10:43:48 UTC
(In reply to Huijuan Zhao from comment #20)
> (In reply to Ryan Barry from comment #17)
> > supervdsm.log:restore-net::ERROR::2018-03-23
> > 22:56:50,211::restore_net_config:308::root::
> > (_find_nets_with_available_devices) Bond "bond0" is not persisted and will
> > not be configured. Network "ovirtmgmt" will not be configured as a
> > consequence.
> > 
> > Huijuan -
> > 
> > Is this reproducible on RHEL?
> > 
> This issue only occurs when registering the host to the engine. I think it
> is related to vdsm/NM, so in my opinion it may also be reproducible on
> RHEL. I will test it later.

I tested on RHEL-7.4, which also has such an issue (not exactly the same as on RHVH).
1. ovirtmgmt can not get a DHCP ip after adding rhvh to rhvm, so rhvh can not be up in rhvm;
2. After reboot, ovirtmgmt disappeared and there is no ip for the bond.

So I think this bug is still related to Bug 1443347 and Bug 1422430; we may have to postpone testing this scenario until the old bugs are resolved.

Comment 25 Edward Haas 2018-04-01 07:47:08 UTC
(In reply to Huijuan Zhao from comments #23 and #24)
> 
> I forgot that for 4.2, this scenario can not be tested currently due to Bug
> 1548265. 

I do not understand how that bug is related to this case.
If you suspect the rhevm agent to be the problem, then we start from a machine that is correctly configured. So why not configure the bond manually first, then go to cockpit and re-apply it from there?

> I tested on RHEL-7.4, also have such issue(not exactly same with RHVH). 

I pretty much lost track of this BZ here.

Several problems have been fixed in VDSM 4.2, most (if not all) have not been back-ported to 4.1.
If we have a problem in 4.2, we need to fix it.
If you want to backport fixes from 4.2 to 4.1, it will be costly and we need good reasoning (like no workaround) to do so.

Comment 26 Huijuan Zhao 2018-04-02 02:37:58 UTC
(In reply to Edward Haas from comment #25)
> (In reply to Huijuan Zhao from comments #23 and #24)
> > 
> > I forgot that for 4.2, this scenario can not be tested currently due to Bug
> > 1548265. 
> 
> I do not understand how that bug is related to this case.
> If you suspect the rhevm agent to be the problem, then we start from a
> machine that is correctly configured. So why not configure the bond manually
> first, then go to cockpit and re-apply it from there?
> 

Due to Bug 1548265, a bond cannot be created via cockpit, so the host cannot be registered to rhvm via a bond network set up in cockpit.

Of course there are workarounds to avoid this bond bug, and the host can be registered to rhvm successfully via a bond network, but the problem here is: I am testing exactly this cockpit scenario, which has the issue. The other scenarios have no issue, so there is no need to test them.

> > I tested on RHEL-7.4, also have such issue(not exactly same with RHVH). 
> 
> I pretty much lost track of this BZ here.
> 
> Several problems have been fixed in VDSM 4.2, most (if not all) have not
> been back-ported to 4.1.
> If we have a problem in 4.2, we need to fix it.
> If you want to backport fixes from 4.2 to 4.1, it will be costly and we need
> good reasoning (like no workaround) to do so.

I do not expect the 4.2 fix to be back-ported to 4.1. I just reported this scenario issue; after analysis and testing, this scenario still seems blocked from testing by several bugs, so QE will test it in 4.2 after Bug 1548265 is resolved.

Comment 27 Edward Haas 2018-04-02 06:49:50 UTC
(In reply to Huijuan Zhao from comment #26)
> 
> Due to Bug 1548265, a bond cannot be created via cockpit, so the host cannot
> be registered to rhvm via a bond network set up in cockpit.
> 
> Of course there are workarounds to avoid this bond bug, and the host can be
> registered to rhvm successfully via a bond network, but the problem here is:
> I am testing exactly this cockpit scenario, which has the issue. The other
> scenarios have no issue, so there is no need to test them.
> 

I am trying to argue that you may not need to wait for BZ#1548265 to be resolved in order to proceed with this BZ and check problems in VDSM.

There are at least 3 stages here if I understand correctly: Anaconda, cockpit and adding rhevh to rhevm.
BZ#1548265 blocks stage 2 in a scenario where you attempt to configure a host with 2 dhcp based nics, adding a bond over them.
You could replace that stage's steps by removing dhcp from the nics and then adding the bond, or some other workaround, with cockpit or without it (if without it, then when you finish, go back and re-apply the config using cockpit).
The end result of stage 2 should be the same, just bypassing the mentioned BZ.
Then you can proceed and test if everything is working in stage3.
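
For example, a manual stand-in for stage 2 could look like this (a sketch of standard initscripts bonding config, reusing the nic names from comment 14; not a procedure taken from this BZ):

# cat /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
TYPE=Bond
BONDING_MASTER=yes
BONDING_OPTS="mode=active-backup miimon=100 primary=eno1"
BOOTPROTO=dhcp
ONBOOT=yes
NM_CONTROLLED=no
# cat /etc/sysconfig/network-scripts/ifcfg-eno1    (ifcfg-eno2 is identical apart from DEVICE)
DEVICE=eno1
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
ONBOOT=yes
NM_CONTROLLED=no
# systemctl restart network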

Is this what has been done?

Comment 28 Huijuan Zhao 2018-04-02 08:35:38 UTC
(In reply to Edward Haas from comment #27)
> (In reply to Huijuan Zhao from comment #26)
> > 
> > Due to Bug 1548265, a bond cannot be created via cockpit, so the host cannot
> > be registered to rhvm via a bond network set up in cockpit.
> > 
> > Of course there are workarounds to avoid this bond bug, and the host can be
> > registered to rhvm successfully via a bond network, but the problem here is:
> > I am testing exactly this cockpit scenario, which has the issue. The other
> > scenarios have no issue, so there is no need to test them.
> > 
> 
> I am trying to argue that you may not need to wait for BZ#1548265 to be
> resolved in order to proceed with this BZ and check problems in VDSM.
> 
> There are at least 3 stages here if I understand correctly: Anaconda,
> cockpit and adding rhevh to rhevm.
> BZ#1548265 blocks stage 2 in a scenario where you attempt to configure a
> host with 2 dhcp based nics, adding a bond over them.
> You could replace that stage's steps by removing dhcp from the nics and
> then adding the bond, or some other workaround, with cockpit or without it
> (if without it, then when you finish, go back and re-apply the config using
> cockpit).
> The end result of stage 2 should be the same, just bypassing the mentioned
> BZ.
> Then you can proceed and test if everything is working in stage3.
> 
> Is this what has been done?

I understand your point: you would like QE to verify the VDSM issue, and there is a workaround (see comment 14 scenario 4) that can confirm stage3 is ok (even in rhvh 4.1).

My point here is: stage2 has an issue now in rhvh 4.2, exactly this scenario cannot be tested now, and this scenario is what this bug is focused on.

Comment 29 Ryan Barry 2018-04-11 12:26:16 UTC
Deferring while we wait for platform to fix the parent bug.

Comment 32 Wei Wang 2018-06-26 08:28:07 UTC
Tested this bug according to comment 14 Scenarios 1~3.

Test Version:
rhvh-4.2.4.3-0.20180622.0
cockpit-system-169-1.el7.noarch
cockpit-ws-169-1.el7.x86_64
cockpit-dashboard-169-1.el7.x86_64
cockpit-ovirt-dashboard-0.11.28-1.el7ev.noarch
cockpit-169-1.el7.x86_64
cockpit-machines-ovirt-169-1.el7.noarch
cockpit-bridge-169-1.el7.x86_64
cockpit-storaged-169-1.el7.noarch

Test Results:
===========
Scenario 1: Configure bond via cockpit (specify mac address and primary) -- pass
1. After step3, the RHVH host doesn't get a new IP and registers successfully.
2. After step4, RHVH can come up in RHVM.
3. After step5, the bond network and ovirtmgmt are still normal.
There are ifcfg-eno1, ifcfg-eno2, and ifcfg-ovirtmgmt in /etc/sysconfig/network-scripts


===========
Scenario 2: Configure bond via cockpit (do not specify mac address and primary) -- fail
1. After step3, the RHVH host lost its IP, and registration failed.
2. After step4, the RHVH host got a bond IP provided by em1.
There is no ifcfg-ovirtmgmt in /etc/sysconfig/network-scripts


===========
Scenario 3: Configure bond via cockpit (do not specify mac address but set primary) -- pass
1. After step2, the RHVH host obtains a new IP provided by em1.
2. After step3, registration succeeds.
3. After step4, the IP of RHVH is still provided by em1.
There are ifcfg-eno1, ifcfg-eno2, and ifcfg-ovirtmgmt in /etc/sysconfig/network-scripts

Change the status to ASSIGNED.

Comment 33 Ryan Barry 2018-07-09 12:39:40 UTC
Moving out, since this was not resolved in the platform batch update, and we need a fix from Cockpit.

Comment 34 Sandro Bonazzola 2018-09-21 07:59:56 UTC
BZ#1548265 has been fixed and verified. Can you please re-test with latest 4.2.7/RHEL 7.6 build?

Comment 35 Wei Wang 2018-09-25 09:37:05 UTC
(In reply to Sandro Bonazzola from comment #34)
> BZ#1548265 has been fixed and verified. Can you please re-test with latest
> 4.2.7/RHEL 7.6 build?

Tested with RHVH-4.2-20180920.0-RHVH-x86_64-dvd1.iso; scenario 1 passes, and the others will be re-tested ASAP.

Comment 36 Wei Wang 2018-09-26 05:42:00 UTC
The environment server is broken; I have sent a ticket to the admin. Once it is fixed, I will verify this bug, so I am removing the needinfo flag for now.

Comment 37 Wei Wang 2018-09-28 03:17:23 UTC
Tested this bug according to comment 14 Scenarios 1~3.

Test Version:
rhvh-4.2.7.0-0.20180918
cockpit-bridge-173-6.el7.x86_64
cockpit-storaged-172-2.el7.noarch
cockpit-173-6.el7.x86_64
cockpit-ovirt-dashboard-0.11.34-1.el7ev.noarch
cockpit-system-173-6.el7.noarch
cockpit-ws-173-6.el7.x86_64
cockpit-machines-ovirt-172-2.el7.noarch
cockpit-dashboard-172-2.el7.x86_64
NetworkManager-1.12.0-6.el7.x86_64

Test Results:
===========
Scenario 1: Configure bond via cockpit (specify mac address and primary) -- pass
1. After step3, the RHVH host doesn't get a new IP and registers successfully.
2. After step4, RHVH can come up in RHVM.
3. After step5, the bond network and ovirtmgmt are still normal.
There are ifcfg-p7p1, ifcfg-p7p2, and ifcfg-ovirtmgmt in /etc/sysconfig/network-scripts


===========
Scenario 2: Configure bond via cockpit (do not specify mac address and primary) -- pass
1. After step3, the RHVH host IP is normal, and registration succeeds.
2. After step4, the RHVH host got a bond IP provided by p7p2.
There are ifcfg-p7p1, ifcfg-p7p2, and ifcfg-ovirtmgmt in /etc/sysconfig/network-scripts

===========
Scenario 3: Configure bond via cockpit (do not specify mac address but set primary) -- pass
1. After step2, the RHVH host obtains a new IP provided by p7p2.
2. After step3, registration succeeds.
3. After step4, the IP of RHVH is still provided by p7p2.
There are ifcfg-p7p1, ifcfg-p7p2, and ifcfg-ovirtmgmt in /etc/sysconfig/network-scripts

Change the status to VERIFIED.

Comment 38 Sandro Bonazzola 2018-11-02 14:33:13 UTC
This bugzilla is included in the oVirt 4.2.7 release, published on November 2nd 2018.

Since the problem described in this bug report should be
resolved in the oVirt 4.2.7 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

