Created attachment 1408319 [details]
All logs from host

Description of problem:
Bond network and ovirtmgmt disappeared after upgrade to rhvh-4.1-0.20180307.0

Version-Release number of selected component (if applicable):
# imgbase layout
rhvh-4.1-0.20171207.0
 +- rhvh-4.1-0.20171207.0+1
rhvh-4.1-0.20180307.0
 +- rhvh-4.1-0.20180307.0+1

How reproducible:
100%

Steps to Reproduce:
1. Install RHVH 4.1 (rhvh-4.1-0.20171207.0)
2. Set up a bond0 (active-backup) network over two slave NICs (eno1 and eno2); bond0 gets a DHCP IP (10.73.130.225) normally
3. Register host to rhvm-4.1 with bond0(10.73.130.225), the ovirtmgmt ip changes to 10.73.130.251 during adding to rhvm, then add host to rhvm again with new ovirtmgmt ip 10.73.130.251, check host status in rhvm and network status in host
4. Set up local repos and upgrade the host to rhvh-4.1-0.20180307.0
5. Reboot the host into rhvh-4.1-0.20180307.0, then check the host status in rhvm and the network status on the host

Actual results:
1. After step 3, the host is up in rhvm. The ovirtmgmt network is present on the host. ifcfg-bond0, ifcfg-eno1, ifcfg-eno2, and ifcfg-ovirtmgmt are present in /etc/sysconfig/network-scripts
2. After step 5, the host is down in rhvm and there is no ovirtmgmt network on the host. ifcfg-eno1, ifcfg-eno2, and ifcfg-ovirtmgmt have disappeared from /etc/sysconfig/network-scripts

# ip a s
2: eno1: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 08:94:ef:21:c0:4d brd ff:ff:ff:ff:ff:ff
3: eno2: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 08:94:ef:21:c0:4e brd ff:ff:ff:ff:ff:ff
27: bond0: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether 08:94:ef:21:c0:4d brd ff:ff:ff:ff:ff:ff
28: ;vdsmdummy;: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 16:31:04:87:6f:36 brd ff:ff:ff:ff:ff:ff

Expected results:
After step 5, the state should be the same as after step 3: the host should be up in rhvm, the ovirtmgmt network should be present on the host, and ifcfg-bond0, ifcfg-eno1, ifcfg-eno2, and ifcfg-ovirtmgmt should be present in /etc/sysconfig/network-scripts

Additional info:
If the host is not registered to rhvm with bond0, the network persists after the upgrade.
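For reference, a bond like the one in step 2 is normally expressed by ifcfg files along the lines of the sketch below. This is an assumed, minimal RHEL 7 active-backup example written for illustration, not copied from the affected host:

# cat /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
TYPE=Bond
BONDING_MASTER=yes
BONDING_OPTS="mode=active-backup miimon=100 primary=eno1"
BOOTPROTO=dhcp
ONBOOT=yes

# cat /etc/sysconfig/network-scripts/ifcfg-eno1      (ifcfg-eno2 differs only in DEVICE)
DEVICE=eno1
TYPE=Ethernet
MASTER=bond0
SLAVE=yes
ONBOOT=yes

Once the host is added to rhvm, VDSM rewrites these files and adds ifcfg-ovirtmgmt for the management bridge, which is why all four files are tracked throughout this bug.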
Created attachment 1408323 [details]
log from engine
I am already sure that I do not have the environment needed to reproduce this. Can you please provide a test system? After 2nd rhvm registration but before update would be ideal
(In reply to Ryan Barry from comment #2)
> I am already sure that I do not have the environment needed to reproduce
> this.
>
> Can you please provide a test system? After 2nd rhvm registration but before
> update would be ideal

Sure. I will send the ENV info to you via email as soon as possible, maybe in several hours, as the machine is currently being used by another colleague.
I find this part a bit strange:

3. Register host to rhvm-4.1 with bond0(10.73.130.225), the ovirtmgmt ip changes to 10.73.130.251 during adding to rhvm, then add host to rhvm again with new ovirtmgmt ip 10.73.130.251, check host status in rhvm and network status in host

Why does it change IP address?!
(In reply to Yaniv Kaul from comment #4)
> I find this part a bit strange:
> 3. Register host to rhvm-4.1 with bond0(10.73.130.225), the ovirtmgmt ip
> changes to 10.73.130.251 during adding to rhvm, then add host to rhvm again
> with new ovirtmgmt ip 10.73.130.251, check host status in rhvm and network
> status in host
>
> Why does it change IP address?!

I think this is an old bug related to Bug 1443347 and Bug 1422430.
(In reply to Huijuan Zhao from comment #3)
> (In reply to Ryan Barry from comment #2)
> > I am already sure that I do not have the environment needed to reproduce
> > this.
> >
> > Can you please provide a test system? After 2nd rhvm registration but before
> > update would be ideal
>
> Sure. I will send the ENV info to you via email as soon as possible, maybe
> in several hours, as the machine is currently being used by another
> colleague.

Ryan, I have already sent the ENV info to you via email; please check.
Unfortunately, this continues to work for me after the one change of address -- booting back into the old image and upgrading+rebooting repeatedly has not resulted in a change, and it 'sticks' with 10.73.131.63.

It's possible that this is due to some NetworkManager change, since I've seen a couple of bugs filed by virt QE against NM with vlan+bond.
There are a variety of problems here, and I don't think any of them are RHVH.

It's possible that there was an incomplete fix from either cockpit or vdsm, since a simple reboot of the system does not return the 2nd address. It is always the first. vdsm-restore-net-config is also not saving us here.

Finally, NetworkManager itself is killing the configuration files, which is definitely a bug.

As before, if I reboot to repeat the process, everything works ok. It is only the first time, with this "double registration" flow, which is broken. It's possible that there's a race between vdsm and NM here, but there's definitely bad behavior from NM.

For reference, I ensured that RHVH actually kept these files before rebooting (the first, broken time), and saved them off just in case.

2018-03-20 16:51:02,978 [DEBUG] (MainThread) Bases: [<Base rhvh-4.1-0.20171207.0 [<Layer rhvh-4.1-0.20171207.0+1 />] />, <Base rhvh-4.1-0.20180307.0 [<Layer rhvh-4.1-0.20180307.0+1 />] />]
2018-03-20 16:51:02,978 [INFO] (MainThread) No bases to free

[root@localhost ~]# ls -l /tmp/a/etc/sysconfig/network-scripts/ifcfg-*
-rw-rw-r--. 1 root root 180 Mar 18 21:13 /tmp/a/etc/sysconfig/network-scripts/ifcfg-bond0
-rw-rw-r--. 1 root root 139 Mar 18 21:13 /tmp/a/etc/sysconfig/network-scripts/ifcfg-em1
-rw-rw-r--. 1 root root 139 Mar 18 21:13 /tmp/a/etc/sysconfig/network-scripts/ifcfg-em2
-rw-r--r--. 1 root root 275 Mar 18 20:34 /tmp/a/etc/sysconfig/network-scripts/ifcfg-em3
-rw-r--r--. 1 root root 275 Mar 18 20:34 /tmp/a/etc/sysconfig/network-scripts/ifcfg-em4
-rw-r--r--. 1 root root 254 Jan 2 11:29 /tmp/a/etc/sysconfig/network-scripts/ifcfg-lo
-rw-rw-r--. 1 root root 219 Mar 18 21:13 /tmp/a/etc/sysconfig/network-scripts/ifcfg-ovirtmgmt
-rw-r--r--. 1 root root 277 Mar 18 20:34 /tmp/a/etc/sysconfig/network-scripts/ifcfg-p5p1
-rw-r--r--. 1 root root 277 Mar 18 20:34 /tmp/a/etc/sysconfig/network-scripts/ifcfg-p5p2
-rw-r--r--. 1 root root 277 Mar 18 20:34 /tmp/a/etc/sysconfig/network-scripts/ifcfg-p7p1
-rw-r--r--. 1 root root 277 Mar 18 20:34 /tmp/a/etc/sysconfig/network-scripts/ifcfg-p7p2
[root@localhost ~]# mkdir backup
[root@localhost ~]# cp -rpv /etc/sysconfig/network-scripts backup/

After the reboot:

Mar 20 17:01:07 localhost.localdomain NetworkManager[1440]: <info> [1521579667.2225] settings: loaded plugin ifcfg-rh: (c) 2007 - 2015 Red Hat, Inc. To report bugs please use the NetworkManager mailing list. (/usr/lib64/NetworkManager/libnm-settings-plugin-ifcfg-rh.so)
Mar 20 17:01:07 localhost.localdomain NetworkManager[1440]: <info> [1521579667.3416] ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-em1 (1dad842d-1912-ef5a-a43a-bc238fb267e7,"System em1")
Mar 20 17:01:07 localhost.localdomain NetworkManager[1440]: <info> [1521579667.3417] ifcfg-rh: Ignoring connection /etc/sysconfig/network-scripts/ifcfg-em1 (1dad842d-1912-ef5a-a43a-bc238fb267e7,"System em1") due to NM_CONTROLLED=no. Unmanaged: interface-name:em1.
Mar 20 17:01:07 localhost.localdomain NetworkManager[1440]: <info> [1521579667.3420] ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-em2 (0578038a-64e9-a2fd-0a28-e4cd0b553930,"System em2")
Mar 20 17:01:07 localhost.localdomain NetworkManager[1440]: <info> [1521579667.3421] ifcfg-rh: Ignoring connection /etc/sysconfig/network-scripts/ifcfg-em2 (0578038a-64e9-a2fd-0a28-e4cd0b553930,"System em2") due to NM_CONTROLLED=no. Unmanaged: interface-name:em2.
Mar 20 17:01:07 localhost.localdomain NetworkManager[1440]: <info> [1521579667.3426] ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-bond0 (ad33d8b0-1f7b-cab9-9447-ba07f855b143,"System bond0")
Mar 20 17:01:07 localhost.localdomain NetworkManager[1440]: <info> [1521579667.3426] ifcfg-rh: Ignoring connection /etc/sysconfig/network-scripts/ifcfg-bond0 (ad33d8b0-1f7b-cab9-9447-ba07f855b143,"System bond0") due to NM_CONTROLLED=no. Unmanaged: interface-name:bond0.
Mar 20 17:01:07 localhost.localdomain NetworkManager[1440]: <info> [1521579667.3439] ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt (9a0b07c0-2983-fe97-ec7f-ad2b51c3a3f0,"System ovirtmgmt")
Mar 20 17:01:07 localhost.localdomain NetworkManager[1440]: <info> [1521579667.3439] ifcfg-rh: Ignoring connection /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt (9a0b07c0-2983-fe97-ec7f-ad2b51c3a3f0,"System ovirtmgmt") due to NM_CONTROLLED=no. Unmanaged: interface-name:ovirtmgmt.
Mar 20 17:01:07 localhost.localdomain NetworkManager[1440]: <info> [1521579667.3648] ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-em3 (a9cf239b-a8d9-4013-8686-ebea720a79ea,"em3")
Mar 20 17:01:07 localhost.localdomain NetworkManager[1440]: <info> [1521579667.3661] ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-em4 (298b80d1-d37c-4537-bfff-392760369184,"em4")
Mar 20 17:01:07 localhost.localdomain NetworkManager[1440]: <info> [1521579667.3673] ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-p5p1 (20d95264-bf94-485b-90b0-0b28933b76b9,"p5p1")
Mar 20 17:01:07 localhost.localdomain NetworkManager[1440]: <info> [1521579667.3686] ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-p5p2 (8bdd3eee-d96b-4c6f-9f05-ecd5d15a1158,"p5p2")
Mar 20 17:01:07 localhost.localdomain NetworkManager[1440]: <info> [1521579667.3699] ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-p7p1 (ef0195fd-a49f-4a38-aec2-304cdb750457,"p7p1")
Mar 20 17:01:07 localhost.localdomain NetworkManager[1440]: <info> [1521579667.3712] ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-p7p2 (2909cfd3-8bf7-40f2-977c-bc5795561f35,"p7p2")
Mar 20 17:01:39 dell-per730-34.lab.eng.pek2.redhat.com dracut[7186]: Executing: /usr/sbin/dracut --hostonly --hostonly-cmdline --hostonly-i18n -o "plymouth dash resume ifcfg" -a watchdog --mount "/dev/mapper/rhvh_dell--per730--34-var /kdumproot//var ext4 defaults,discard" --no-hostonly-default-device -f /boot/initramfs-3.10.0-858.el7.x86_64kdump.img 3.10.0-858.el7.x86_64
Mar 20 17:01:40 dell-per730-34.lab.eng.pek2.redhat.com NetworkManager[1440]: <info> [1521579700.0096] ifcfg-rh: remove /etc/sysconfig/network-scripts/ifcfg-bond0 (ad33d8b0-1f7b-cab9-9447-ba07f855b143,"System bond0")
Mar 20 17:01:40 dell-per730-34.lab.eng.pek2.redhat.com NetworkManager[1440]: <info> [1521579700.0097] ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-bond0 (a404fff1-2206-4fdb-80ac-1f899bae26ae,"bond0")

[root@dell-per730-34 ~]# diff /tmp/a/etc/sysconfig/network-scripts/ifcfg-bond0 /etc/sysconfig/network-scripts/ifcfg-bond0
1d0
< # Generated by VDSM version 4.19.42-1.el7ev
3,4c2,17
< BONDING_OPTS='mode=1 miimon=100 primary=em1'
< BRIDGE=ovirtmgmt
---
> BONDING_OPTS="miimon=100 updelay=0 downdelay=0 mode=active-backup primary=em1"
> TYPE=Bond
> BONDING_MASTER=yes
> MACADDR=24:6E:96:19:BB:70
> PROXY_METHOD=none
> BROWSER_ONLY=no
> BOOTPROTO=dhcp
> DEFROUTE=yes
> IPV4_FAILURE_FATAL=no
> IPV6INIT=yes
> IPV6_AUTOCONF=yes
> IPV6_DEFROUTE=yes
> IPV6_FAILURE_FATAL=no
> IPV6_ADDR_GEN_MODE=stable-privacy
> NAME=bond0
> UUID=a404fff1-2206-4fdb-80ac-1f899bae26ae
6,9c19
< MTU=1500
< DEFROUTE=no
< NM_CONTROLLED=no
< IPV6INIT=no
---
> AUTOCONNECT_SLAVES=yes

[root@dell-per730-34 ~]# cat /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
BONDING_OPTS="miimon=100 updelay=0 downdelay=0 mode=active-backup primary=em1"
TYPE=Bond
BONDING_MASTER=yes
MACADDR=24:6E:96:19:BB:70
PROXY_METHOD=none
BROWSER_ONLY=no
BOOTPROTO=dhcp
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=bond0
UUID=a404fff1-2206-4fdb-80ac-1f899bae26ae
ONBOOT=yes
AUTOCONNECT_SLAVES=yes

NetworkManager actually removes our bond0 here and overwrites it with something that looks like a stock config, managed by NM. It's not in the logs, but I suspect it's also removing em1 and em2, since those are slaves of a "bad" configuration.

It also seems that vdsm-restore-net-config loses the race this time (or wins, and NM doesn't like it -- it's impossible to tell).

Ben - why would NM overwrite ifcfg files? Especially ones which are not managed by NM? I'm also curious if you have any idea why ifcfg files on an unmounted LV are disappearing on a reboot -- they are present before, but gone after rebooting with whatever configuration QE is using.

Dan - any ideas why vdsm-restore-net-config wouldn't win here? We don't have millisecond timestamps in the journal, but it looks like it's racing:

supervdsm.log:restore-net::INFO::2018-03-20 17:01:37,849::vdsm-restore-net-config::470::root::(restore) starting network restoration.
supervdsm.log:restore-net::DEBUG::2018-03-20 17:01:37,852::vdsm-restore-net-config::226::root::(_remove_networks_in_running_config) Not cleaning running configuration since it is empty.
supervdsm.log:restore-net::INFO::2018-03-20 17:01:37,862::ifcfg::548::root::(_loadBackupFiles) Loaded /var/lib/vdsm/netconfback/ifcfg-em2
supervdsm.log:restore-net::INFO::2018-03-20 17:01:37,868::ifcfg::548::root::(_loadBackupFiles) Loaded /var/lib/vdsm/netconfback/ifcfg-em1
supervdsm.log:restore-net::INFO::2018-03-20 17:01:37,868::ifcfg::548::root::(_loadBackupFiles) Loaded /var/lib/vdsm/netconfback/ifcfg-ovirtmgmt
supervdsm.log:restore-net::INFO::2018-03-20 17:01:37,873::ifcfg::548::root::(_loadBackupFiles) Loaded /var/lib/vdsm/netconfback/ifcfg-bond0
supervdsm.log:restore-net::DEBUG::2018-03-20 17:01:37,874::commands::69::root::(execCmd) /usr/bin/taskset --cpu-list 0-23 /sbin/ifdown ovirtmgmt (cwd None)
superv ...
supervdsm.log:restore-net::DEBUG::2018-03-20 17:01:40,185::commands::93::root::(execCmd) SUCCESS: <err> = 'Running scope as unit 9c51e623-39af-475b-832f-05e62a26dba9.scope.\n'; <rc> = 0
supervdsm.log:restore-net::DEBUG::2018-03-20 17:01:40,199::vdsm-restore-net-config::382::root::(_wait_for_for_all_devices_up) All devices are up.
supervdsm.log:restore-net::WARNING::2018-03-20 17:01:40,248::bridges::42::root::(ports) ovirtmgmt is not a Linux bridge
supervdsm.log:restore-net::INFO::2018-03-20 17:01:40,249::cache::217::root::(_getNetInfo) Obtaining info for net ovirtmgmt.
supervdsm.log:restore-net::DEBUG::2018-03-20 17:01:40,285::commands::69::root::(execCmd) /usr/bin/taskset --cpu-list 0-23 /sbin/tc qdisc show (cwd None)
supervdsm.log:restore-net::DEBUG::2018-03-20 17:01:40,300::commands::93::root::(execCmd) SUCCESS: <err> = ''; <rc> = 0
supervdsm.log:restore-net::DEBUG::2018-03-20 17:01:40,303::commands::69::root::(execCmd) /usr/bin/taskset --cpu-list 0-23 /bin/systemctl --no-pager list-unit-files (cwd None)
supervdsm.log:restore-net::DEBUG::2018-03-20 17:01:40,406::commands::93::root::(execCmd) SUCCESS: <err> = ''; <rc> = 0
supervdsm.log:restore-net::DEBUG::2018-03-20 17:01:40,407::commands::69::root::(execCmd) /usr/bin/taskset --cpu-list 0-23 /bin/systemctl status openvswitch.service (cwd None)
supervdsm.log:restore-net::DEBUG::2018-03-20 17:01:40,445::commands::93::root::(execCmd) FAILED: <err> = ''; <rc> = 3
supervdsm.log:restore-net::INFO::2018-03-20 17:01:40,446::netconfpersistence::68::root::(setBonding) Adding bond0({'nics': [], 'switch': 'legacy', 'options': 'miimon=100 mode=1'})

Huijuan -

Does this only happen if you use cockpit? I don't like this configuration in general, since the IP does not survive a reboot even without an update. It seems problematic, and like it may not work in general.
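For readability, here is the pre-reboot, VDSM-written ifcfg-bond0 reassembled from the diff above (the two lines the diff reports as unchanged, DEVICE=bond0 and ONBOOT=yes, are inferred from the hunk offsets rather than copied directly):

# Generated by VDSM version 4.19.42-1.el7ev
DEVICE=bond0
BONDING_OPTS='mode=1 miimon=100 primary=em1'
BRIDGE=ovirtmgmt
ONBOOT=yes
MTU=1500
DEFROUTE=no
NM_CONTROLLED=no
IPV6INIT=no

Note the NM_CONTROLLED=no line, which is why NetworkManager logged the connection as ignored/unmanaged at boot.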
(In reply to Ryan Barry from comment #8)
> NetworkManager actually removes our bond0 here and overwrites it with
> something that looks like a stock config, managed by NM. It's not in the
> logs, but I suspect it's also removing em1 and em2, since those are slaves
> of a "bad" configuration.
>
> It also seems that vdsm-restore-net-config loses the race this time (or
> wins, and NM doesn't like it -- it's impossible to tell).
>
> Ben - why would NM overwrite ifcfg files? Especially ones which are not
> managed by NM?

Hi, NM never removes connection files on its own initiative. At these lines:

 [1521579700.0096] ifcfg-rh: remove /etc/sysconfig/network-scripts/ifcfg-bond0 (ad33d8b0-1f7b-cab9-9447-ba07f855b143,"System bond0")
 [1521579700.0097] ifcfg-rh: new connection /etc/sysconfig/network-scripts/ifcfg-bond0 (a404fff1-2206-4fdb-80ac-1f899bae26ae,"bond0")

it detects that the file disappeared and that a new file was created, but those changes were done externally by someone else.
> Huijuan -
>
> Does this only happen if you use cockpit? I don't like this configuration in
> general, since the IP does not survive a reboot even without an update. It
> seems problematic, and like it may not work in general.

Yes, it only happens with cockpit; there is no such issue through the Anaconda GUI.
Ben - I actually tested removing files to see if NM would spit out the same message and did not see anything. Does this only happen at one point during init?

Huijuan - this may be a cockpit bug. If this is only reproducible through cockpit, I'd like to reduce the severity/priority (especially since the steps taken to reproduce are unusual). Since this is only reproducible one time, even on the provided system, and the provided network configuration does not come back after rebooting, I may need another test system.

Specifically, I'd like to try disabling cockpit on the updated system to see if this resolves. Clearly, not having cockpit enabled is not an option, but if this is isolated to cockpit, let's push it out until they resolve. This looks like a regression in https://bugzilla.redhat.com/show_bug.cgi?id=1443347
(In reply to Ryan Barry from comment #11)
> Ben - I actually tested removing files to see if NM would spit out the same
> message and did not see anything. Does this only happen at one point during
> init?

It's because NM doesn't monitor ifcfg files at runtime; they are only loaded at startup or when the user calls 'nmcli connection reload' / 'nmcli connection load <file>'. Note that the 'ifup' command issues an 'nmcli connection load' under the hood.

A simple way to see the message would be:

 - delete and recreate the ifcfg-bond0 file with a different UUID (or just change the UUID)
 - 'ifup bond0' or 'nmcli connection reload'

When a connection file gets deleted through NetworkManager (which is not the case in the log from comment 8), you should see an audit entry like this in the journal:

NetworkManager[15436]: <info> [1521645358.6220] audit: op="connection-delete" uuid="2d74c40b-7514-467e-ad00-935d7155b3f6" name="bond0" pid=18210 uid=0 result="success"
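Put concretely, a minimal sequence that should trigger the same remove/new pair seen in comment 8 (this is a sketch of the "just change the UUID" variant above, assuming an existing ifcfg-bond0 that contains a UUID= line):

# cp -p /etc/sysconfig/network-scripts/ifcfg-bond0 /tmp/ifcfg-bond0.bak              # keep a copy
# sed -i "s/^UUID=.*/UUID=$(uuidgen)/" /etc/sysconfig/network-scripts/ifcfg-bond0    # same file, new UUID
# nmcli connection reload                                                            # or: ifup bond0
# journalctl -b -u NetworkManager | grep ifcfg-bond0                                 # should show an ifcfg-rh "remove" followed by a "new connection"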
(In reply to Ryan Barry from comment #11)
> Ben - I actually tested removing files to see if NM would spit out the same
> message and did not see anything. Does this only happen at one point during
> init?
>
> Huijuan - this may be a cockpit bug. If this is only reproducible through
> cockpit, I'd like to reduce the severity/priority (especially since the
> steps taken to reproduce are unusual). Since this is only reproducible one
> time, even on the provided system, and the provided network configuration
> does not come back after rebooting, I may need another test system.
>
> Specifically, I'd like to try disabling cockpit on the updated system to see
> if this resolves. Clearly, not having cockpit enabled is not an option, but
> if this is isolated to cockpit, let's push it out until they resolve. This
> looks like a regression in
> https://bugzilla.redhat.com/show_bug.cgi?id=1443347

Since huzhao is on PTO today, I will redefine the reproduction steps and provide a test ENV for you later.
Short Summary:
1. The upgrade step is not needed to reproduce this bug.
2. This bug is only reproducible through cockpit.
3. The bug is easy to reproduce by following the steps below.
4. I suspect the bug is related to bug 1422430.

===========
Scenario 1: Configure bond via cockpit (specify mac address and primary)

Test version: redhat-virtualization-host-4.1-20171207.0

Test steps:
1. Install redhat-virtualization-host-4.1-20171207.0 via anaconda, and configure one NIC (eno1) up.
2. Key step: Log in to the cockpit UI and set up a DHCP bond (active-backup mode) over two NICs (eno1 + eno2); specifying one MAC address (eno1) is a must, and choose eno1 as primary as well.
3. Register the host to rhvm-4.1 with bond0 (this step makes the IP change, so registration fails)
4. Add the host to rhvm again with the new IP.
5. Reboot the host.

Test result:
1. After step 3, the RHVH host got a new IP, and registration failed due to the IP change.
2. After step 4, RHVH can come up in RHVM.
3. After step 5, the bond network and ovirtmgmt disappeared. There are no ifcfg-eno1, ifcfg-eno2, and ifcfg-ovirtmgmt in /etc/sysconfig/network-scripts

===========
Scenario 2: Configure bond via cockpit (do not specify mac address and primary)

Test version: redhat-virtualization-host-4.1-20171207.0

Test steps:
1. Install redhat-virtualization-host-4.1-20171207.0 via anaconda, and configure one NIC (eno1) up.
2. Log in to the cockpit UI and set up a DHCP bond (active-backup mode) over two NICs (eno1 + eno2); do not specify a MAC address and do not set the primary option.
3. Register the host to rhvm-4.1 with bond0 (this step makes the IP get lost, so registration fails)
4. Reboot the host and check the IP.

Test result:
1. After step 3, the RHVH host lost its IP, and registration failed.
2. After step 4, the RHVH host still has no IP. There are no ifcfg-eno1, ifcfg-eno2, and ifcfg-ovirtmgmt in /etc/sysconfig/network-scripts

===========
Scenario 3: Configure bond via cockpit (do not specify mac address but set primary)

Test version: redhat-virtualization-host-4.1-20180314.0

Test steps:
1. Install redhat-virtualization-host-4.1-20180314.0 via anaconda, and configure one NIC (eno1) up.
2. Log in to the cockpit UI and set up a DHCP bond (active-backup mode) over two NICs (eno1 + eno2); do not specify a MAC address, but set primary (eno1). The RHVH host obtains a new IP provided via eno2.
3. Register the host to rhvm-4.1 with bond0.
4. Reboot the host and check the IP.

Test result:
1. After step 2, the RHVH host obtains a new IP provided via eno2.
2. After step 3, registration can succeed.
3. After step 4, the RHVH host changes its IP back to the previous one provided via eno1. There are ifcfg-eno1, ifcfg-eno2, and ifcfg-ovirtmgmt in /etc/sysconfig/network-scripts

===========
Scenario 4: Configure bond via Anaconda -- works well

1. Install redhat-virtualization-host-4.1-20171207.0 via anaconda, and set up a DHCP bond (active-backup mode) over two NICs (a MAC address cannot be specified because there is no such option in the anaconda UI).
2. Register to RHVM.
3. Reboot.

Test result:
1. After step 2, the IP did not change, and registration to RHVM succeeded.
2. After step 3, everything is ok.
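For context on what the cockpit scenarios configure under the hood: cockpit drives NetworkManager, so the bond from Scenario 1 corresponds roughly to NM-managed connections like the nmcli sketch below (an illustration, not the exact calls cockpit makes; the MAC pinning from Scenario 1 is omitted for brevity):

# nmcli connection add type bond ifname bond0 con-name bond0 bond.options "mode=active-backup,miimon=100,primary=eno1" ipv4.method auto
# nmcli connection add type bond-slave ifname eno1 con-name bond0-slave-eno1 master bond0
# nmcli connection add type bond-slave ifname eno2 con-name bond0-slave-eno2 master bond0
# nmcli connection up bond0

Note that these are NM-managed connections, unlike the VDSM-written ifcfg files from comment 8, which carry NM_CONTROLLED=no.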
Can you please check this on 4.2? https://bugzilla.redhat.com/show_bug.cgi?id=1443347#c28 indicates that this may be fixed there, and not 7.5
(In reply to Ryan Barry from comment #15)
> Can you please check this on 4.2?
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1443347#c28 indicates that this
> may be fixed there, and not 7.5

Still have issues on 4.2, but not exactly the same as comment 0.

Test version:
From: rhvh-4.1-0.20171207.0
To: rhvh-4.2-0.20180322.0

Test steps:
Same as comment 0

Actual results:
After step 5, the host is down in rhvm; the ovirtmgmt network is present on the host, but it has no DHCP IP. ifcfg-ovirtmgmt has disappeared from /etc/sysconfig/network-scripts.

I will send the ENV info to you via email; please check it if needed.
supervdsm.log:restore-net::ERROR::2018-03-23 22:56:50,211::restore_net_config:308::root::(_find_nets_with_available_devices) Bond "bond0" is not persisted and will not be configured. Network "ovirtmgmt" will not be configured as a consequence.

Huijuan - Is this reproducible on RHEL?

Edward - any thoughts? The test scenario in comment#1 looks wrong to me (double registration to RHVM).
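A quick way to check whether VDSM actually has the bond in its persisted configuration (the persistence paths below are the usual VDSM locations and are an assumption on my part; only the netconfback directory is taken from the logs in comment 8):

# ls /var/lib/vdsm/persistence/netconf/bonds/ /var/lib/vdsm/persistence/netconf/nets/   # persisted bonds and networks, if any
# cat /var/lib/vdsm/persistence/netconf/bonds/bond0                                     # present only when the bond is persisted
# ls /var/lib/vdsm/netconfback/                                                         # ifcfg backups loaded by vdsm-restore-net-config

If bond0 is missing from the persistence directory, the restore error above is the expected outcome.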
(In reply to Huijuan Zhao from comment #16)
>
> Still have issues on 4.2, but not exactly the same as comment 0.
>
> Test version:
> From: rhvh-4.1-0.20171207.0
> To: rhvh-4.2-0.20180322.0
>
> Test steps:
> Same as comment 0

As a general note, the ability of VDSM to acquire an "external" bond and persist it in its config was fixed in 4.2.
If the external bond is not persisted by VDSM and there is a failure when attempting setupNetworks, it will not be persisted after reboot.

1. Is the problem reported here a regression from a previous 4.1 version? Could you please confirm this and mention the working version? (including what RHEL version it is based on)

2. If there is a similar problem on 4.2 or something else is not working there, please add the relevant logs and mention the time each step was taken (to interpret the log correctly).
If this is not like the problem described in this BZ, it is better to open a new one.
(In reply to Edward Haas from comment #18)
> (In reply to Huijuan Zhao from comment #16)
> >
> > Still have issues on 4.2, but not exactly the same as comment 0.
> >
> > Test version:
> > From: rhvh-4.1-0.20171207.0
> > To: rhvh-4.2-0.20180322.0
> >
> > Test steps:
> > Same as comment 0
>
> As a general note, the ability of VDSM to acquire an "external" bond and
> persist it in its config was fixed in 4.2.
> If the external bond is not persisted by VDSM and there is a failure when
> attempting setupNetworks, it will not be persisted after reboot.
>
> 1. Is the problem reported here a regression from a previous 4.1 version?
> Could you please confirm this and mention the working version? (including
> what RHEL version it is based on)
>
> 2. If there is a similar problem on 4.2 or something else is not working
> there, please add the relevant logs and mention the time each step was taken
> (to interpret the log correctly).
> If this is not like the problem described in this BZ, it is better to open a
> new one.

This is not a regression from a previous 4.1 version (both for rhel_7.4 and rhel_7.5). Actually, this test scenario was blocked by Bug 1443347 and Bug 1422430 for a long time.
(In reply to Ryan Barry from comment #17)
> supervdsm.log:restore-net::ERROR::2018-03-23
> 22:56:50,211::restore_net_config:308::root::
> (_find_nets_with_available_devices) Bond "bond0" is not persisted and will
> not be configured. Network "ovirtmgmt" will not be configured as a
> consequence.
>
> Huijuan -
>
> Is this reproducible on RHEL?
>
This issue only occurs when registering the host to the engine. I think it is related to vdsm/NM, so in my opinion it may also be reproducible on RHEL. I will test it later.
Just to clarify: This is broken on both 7.4 and 7.5 for 4.1, correct? Does this work in 4.2?
(In reply to Ryan Barry from comment #21)
> Just to clarify:
>
> This is broken on both 7.4 and 7.5 for 4.1, correct?

Yes.

>
> Does this work in 4.2?

1. If upgrading from rhvh-4.1 to 4.2, the issue exists.
2. For rhvh-4.2 rebooting, or upgrading from 4.2 to 4.2, I still have to test this and will update the results here later.
(In reply to Huijuan Zhao from comment #22)
> (In reply to Ryan Barry from comment #21)
> > Just to clarify:
> >
> > This is broken on both 7.4 and 7.5 for 4.1, correct?
>
> Yes.
>
> >
> > Does this work in 4.2?
>
> 1. If upgrading from rhvh-4.1 to 4.2, the issue exists.
> 2. For rhvh-4.2 rebooting, or upgrading from 4.2 to 4.2, I still have to
> test this and will update the results here later.

I forgot that for 4.2, this scenario cannot currently be tested due to Bug 1548265.

And per the comment 14 Scenario 4 test results, there is a workaround that allows a bonded host to register to rhvm successfully, so I will lower the Severity.
(In reply to Huijuan Zhao from comment #20)
> (In reply to Ryan Barry from comment #17)
> > supervdsm.log:restore-net::ERROR::2018-03-23
> > 22:56:50,211::restore_net_config:308::root::
> > (_find_nets_with_available_devices) Bond "bond0" is not persisted and will
> > not be configured. Network "ovirtmgmt" will not be configured as a
> > consequence.
> >
> > Huijuan -
> >
> > Is this reproducible on RHEL?
> >
> This issue only occurs when registering the host to the engine. I think it
> is related to vdsm/NM, so in my opinion it may also be reproducible on RHEL.
> I will test it later.

I tested on RHEL-7.4 and hit a similar issue (not exactly the same as on RHVH):
1. ovirtmgmt cannot get a DHCP IP after adding the host to rhvm, so the host cannot come up in rhvm;
2. After reboot, ovirtmgmt disappeared and the bond has no IP.

So I think this bug is still related to Bug 1443347 and Bug 1422430; maybe we have to wait to test this scenario until those older bugs are resolved.
(In reply to Huijuan Zhao from comment #23 , 24)
>
> I forgot that for 4.2, this scenario cannot currently be tested due to Bug
> 1548265.

I do not understand how that bug is related to this case.
If you suspect the rhevm agent to be the problem, then we start from a machine that is correctly configured. So why not first configure the bond manually, then go to cockpit and re-apply it from there?

> I tested on RHEL-7.4 and hit a similar issue (not exactly the same as on RHVH):

I pretty much lost track of this BZ here.

Several problems have been fixed in VDSM 4.2; most (if not all) have not been back-ported to 4.1.
If we have a problem in 4.2, we need to fix it.
If you want to backport fixes from 4.2 to 4.1, it will be costly and we need good reasoning (like no workaround) to do so.
(In reply to Edward Haas from comment #25)
> (In reply to Huijuan Zhao from comment #23 , 24)
> >
> > I forgot that for 4.2, this scenario cannot currently be tested due to Bug
> > 1548265.
>
> I do not understand how that bug is related to this case.
> If you suspect the rhevm agent to be the problem, then we start from a
> machine that is correctly configured. So why not first configure the bond
> manually, then go to cockpit and re-apply it from there?
>
Due to Bug 1548265, a bond cannot be created via cockpit, so the host cannot be registered to rhvm via a bond network set up by cockpit.

Of course there are workarounds to avoid this bond bug, and the host can then register to rhvm successfully via a bond network, but the point here is: I am only testing this cockpit scenario, which has the issue. Other scenarios have no issue, so there is no need to test them.

> > I tested on RHEL-7.4 and hit a similar issue (not exactly the same as on RHVH):
>
> I pretty much lost track of this BZ here.
>
> Several problems have been fixed in VDSM 4.2; most (if not all) have not
> been back-ported to 4.1.
> If we have a problem in 4.2, we need to fix it.
> If you want to backport fixes from 4.2 to 4.1, it will be costly and we need
> good reasoning (like no workaround) to do so.

I do not expect the 4.2 fix to be back-ported to 4.1; I just reported this scenario issue. After analysis and testing, this scenario may still be blocked from testing by several bugs, so QE will test this scenario on 4.2 after Bug 1548265 is resolved.
(In reply to Huijuan Zhao from comment #26)
>
> Due to Bug 1548265, a bond cannot be created via cockpit, so the host cannot
> be registered to rhvm via a bond network set up by cockpit.
>
> Of course there are workarounds to avoid this bond bug, and the host can
> then register to rhvm successfully via a bond network, but the point here
> is: I am only testing this cockpit scenario, which has the issue. Other
> scenarios have no issue, so there is no need to test them.
>

I am trying to argue that you may not need to wait for BZ#1548265 to be resolved in order to proceed with this BZ and check problems in VDSM.

There are at least 3 stages here, if I understand correctly: Anaconda, cockpit, and adding rhevh to rhevm.
BZ#1548265 blocks stage 2 in a scenario where you attempt to configure a host with 2 dhcp-based nics, adding a bond over them.
You could replace that stage's steps by removing the dhcp from the nics and then adding the bond, or some other workaround, with cockpit or without it (if without it, then when you finish, go back and re-apply the config using cockpit).
The end result of stage 2 should be the same, just bypassing the mentioned BZ.
Then you can proceed and test whether everything is working in stage 3.

Is this what has been done?
(In reply to Edward Haas from comment #27)
> (In reply to Huijuan Zhao from comment #26)
> >
> > Due to Bug 1548265, a bond cannot be created via cockpit, so the host cannot
> > be registered to rhvm via a bond network set up by cockpit.
> >
> > Of course there are workarounds to avoid this bond bug, and the host can
> > then register to rhvm successfully via a bond network, but the point here
> > is: I am only testing this cockpit scenario, which has the issue. Other
> > scenarios have no issue, so there is no need to test them.
> >
>
> I am trying to argue that you may not need to wait for BZ#1548265 to be
> resolved in order to proceed with this BZ and check problems in VDSM.
>
> There are at least 3 stages here, if I understand correctly: Anaconda,
> cockpit, and adding rhevh to rhevm.
> BZ#1548265 blocks stage 2 in a scenario where you attempt to configure a
> host with 2 dhcp-based nics, adding a bond over them.
> You could replace that stage's steps by removing the dhcp from the nics and
> then adding the bond, or some other workaround, with cockpit or without it
> (if without it, then when you finish, go back and re-apply the config using
> cockpit).
> The end result of stage 2 should be the same, just bypassing the mentioned
> BZ.
> Then you can proceed and test whether everything is working in stage 3.
>
> Is this what has been done?

I understand your point: you would like QE to verify the VDSM issue, and there is a workaround (see comment 14 scenario 4) that can confirm stage 3 is OK (even on rhvh 4.1). My point here is: stage 2 currently has an issue on rhvh 4.2, so only this scenario cannot be tested right now, and this scenario is exactly what this bug is focused on.
Deferring while we wait for platform to fix the parent bug.
Tested this bug according to comment 14 Scenarios 1~3.

Test Version:
rhvh-4.2.4.3-0.20180622.0
cockpit-system-169-1.el7.noarch
cockpit-ws-169-1.el7.x86_64
cockpit-dashboard-169-1.el7.x86_64
cockpit-ovirt-dashboard-0.11.28-1.el7ev.noarch
cockpit-169-1.el7.x86_64
cockpit-machines-ovirt-169-1.el7.noarch
cockpit-bridge-169-1.el7.x86_64
cockpit-storaged-169-1.el7.noarch

Test Results:

===========
Scenario 1: Configure bond via cockpit (specify mac address and primary) -- pass
1. After step 3, the RHVH host does not get a new IP and registers successfully.
2. After step 4, RHVH can come up in RHVM.
3. After step 5, the bond network and ovirtmgmt are still normal. There are ifcfg-eno1, ifcfg-eno2, and ifcfg-ovirtmgmt in /etc/sysconfig/network-scripts

===========
Scenario 2: Configure bond via cockpit (do not specify mac address and primary) -- fail
1. After step 3, the RHVH host lost its IP, and registration failed.
2. After step 4, the RHVH host got a bond IP provided via em1. There is no ifcfg-ovirtmgmt in /etc/sysconfig/network-scripts

===========
Scenario 3: Configure bond via cockpit (do not specify mac address but set primary) -- pass
1. After step 2, the RHVH host obtains a new IP provided via em1.
2. After step 3, registration can succeed.
3. After step 4, the IP of the RHVH host is still provided via em1. There are ifcfg-eno1, ifcfg-eno2, and ifcfg-ovirtmgmt in /etc/sysconfig/network-scripts

Changing the status to ASSIGNED.
Moving out, since this was not resolved in the platform batch update, and we need a fix from Cockpit.
BZ#1548265 has been fixed and verified. Can you please re-test with latest 4.2.7/RHEL 7.6 build?
(In reply to Sandro Bonazzola from comment #34)
> BZ#1548265 has been fixed and verified. Can you please re-test with latest
> 4.2.7/RHEL 7.6 build?

Tested with RHVH-4.2-20180920.0-RHVH-x86_64-dvd1.iso; scenario 1 passes, and the others will be re-tested ASAP.
The environment server is broken; I have sent a ticket to the admin. Once it is fixed, I will verify this bug soon, so I am removing the needinfo flag for now.
Tested this bug according to comment 14 Scenarios 1~3.

Test Version:
rhvh-4.2.7.0-0.20180918
cockpit-bridge-173-6.el7.x86_64
cockpit-storaged-172-2.el7.noarch
cockpit-173-6.el7.x86_64
cockpit-ovirt-dashboard-0.11.34-1.el7ev.noarch
cockpit-system-173-6.el7.noarch
cockpit-ws-173-6.el7.x86_64
cockpit-machines-ovirt-172-2.el7.noarch
cockpit-dashboard-172-2.el7.x86_64
NetworkManager-1.12.0-6.el7.x86_64

Test Results:

===========
Scenario 1: Configure bond via cockpit (specify mac address and primary) -- pass
1. After step 3, the RHVH host does not get a new IP and registers successfully.
2. After step 4, RHVH can come up in RHVM.
3. After step 5, the bond network and ovirtmgmt are still normal. There are ifcfg-p7p1, ifcfg-p7p2, and ifcfg-ovirtmgmt in /etc/sysconfig/network-scripts

===========
Scenario 2: Configure bond via cockpit (do not specify mac address and primary) -- pass
1. After step 3, the RHVH host IP is normal, and it registers successfully.
2. After step 4, the RHVH host got a bond IP provided via p7p2. There are ifcfg-p7p1, ifcfg-p7p2, and ifcfg-ovirtmgmt in /etc/sysconfig/network-scripts

===========
Scenario 3: Configure bond via cockpit (do not specify mac address but set primary) -- pass
1. After step 2, the RHVH host obtains a new IP provided via p7p2.
2. After step 3, registration can succeed.
3. After step 4, the IP of the RHVH host is still provided via p7p2. There are ifcfg-p7p1, ifcfg-p7p2, and ifcfg-ovirtmgmt in /etc/sysconfig/network-scripts

Changing the status to VERIFIED.
This bugzilla is included in oVirt 4.2.7 release, published on November 2nd 2018. Since the problem described in this bug report should be resolved in oVirt 4.2.7 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.