Bug 1289026

Summary: The bonding/vlan network is disabled after upgrade via TUI prior to Engine registration
Product: Red Hat Enterprise Virtualization Manager Reporter: Huijuan Zhao <huzhao>
Component: ovirt-node Assignee: Fabian Deutsch <fdeutsch>
Status: CLOSED NOTABUG QA Contact: Huijuan Zhao <huzhao>
Severity: high Docs Contact:
Priority: high    
Version: 3.6.0 CC: cshao, cwu, danken, ecohen, fdeutsch, gklein, huiwa, huzhao, ibarkan, leiwang, lsurette, lyi, mburman, mgoldboi, yaniwang, ycui
Target Milestone: ovirt-3.6.2 Keywords: Regression
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: node
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-12-10 10:56:08 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Node RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1285700    
Attachments:
Description Flags
screenshot bond upgrade fail none
bond log none

Description Huijuan Zhao 2015-12-07 09:23:34 UTC
Created attachment 1103082 [details]
screenshot bond upgrade fail

Description of problem:
The bonding network is disabled after upgrading from the publicly released RHEV-H 7.2/RHEV-H 7.1 versions to RHEV-H 7.2 for 3.6 beta2.

Version-Release number of selected component (if applicable):
RHEV-H 7.2-20151201.2.el7ev
ovirt-node-3.6.0-0.23.20151201git5eed7af.el7ev.noarch

How reproducible:
100%
Whiteboard: regression

Steps to Reproduce:
1. Install RHEV-H 7.2-20151129.1.el7ev via the TUI.
2. Log in to RHEV-H 7.2-20151129.1.el7ev and set up a bond network over two NICs via DHCP; the bond obtains a DHCP IP successfully.
3. Upgrade from RHEV-H 7.2-20151129.1.el7ev to RHEV-H 7.2-20151201.2.el7ev via the TUI.
4. Log in to RHEV-H 7.2-20151201.2.el7ev and check the bond network.

Actual results:
After step 4, the bond network is disabled.

Expected results:
After step 4, the bond network should be up and obtain a DHCP IP successfully.

Additional info:
No such issue on RHEV-H 7.2-20151112.1.el7ev, so this is a regression.

Comment 1 Huijuan Zhao 2015-12-07 09:26:00 UTC
Created attachment 1103084 [details]
bond log

Comment 3 Fabian Deutsch 2015-12-07 11:09:58 UTC
Ido, can you tell anything from the logs?

Comment 4 Ido Barkan 2015-12-07 15:15:20 UTC
From the supervdsm.log I see that nothing was persisted:

restore-net::DEBUG::2015-12-04 07:08:55,475::libvirtconnection::160::root::(get) trying to connect libvirt
restore-net::INFO::2015-12-04 07:08:55,520::vdsm-restore-net-config::385::root::(restore) starting network restoration.
restore-net::DEBUG::2015-12-04 07:08:55,520::vdsm-restore-net-config::183::root::(_remove_networks_in_running_config) Not cleaning running configuration since it is empty.
restore-net::INFO::2015-12-04 07:08:55,523::netconfpersistence::179::root::(_clearDisk) Clearing /var/run/vdsm/netconf/nets/ and /var/run/vdsm/netconf/bonds/
restore-net::DEBUG::2015-12-04 07:08:55,523::netconfpersistence::187::root::(_clearDisk) No existent config to clear.
restore-net::INFO::2015-12-04 07:08:55,524::netconfpersistence::129::root::(save) Saved new config RunningConfig({}, {}) to /var/run/vdsm/netconf/nets/ and /var/run/vdsm/netconf/bonds/
restore-net::DEBUG::2015-12-04 07:08:55,524::vdsm-restore-net-config::329::root::(_wait_for_for_all_devices_up) All devices are up.
restore-net::INFO::2015-12-04 07:08:55,529::netconfpersistence::71::root::(setBonding) Adding bond0({'nics': ['em1', 'p4p2'], 'options': 'miimon=100'})
restore-net::INFO::2015-12-04 07:08:55,530::vdsm-restore-net-config::396::root::(restore) restoration completed successfully.
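
To spot this condition in a longer supervdsm.log, the check can be sketched in Python. This is only a minimal sketch: the marker string and the "::"-separated field layout are inferred from the lines quoted above, and the function name `empty_saves` is made up for illustration.

```python
# Marker copied from the netconfpersistence "save" line quoted above.
# Assumption: an empty running config is always logged in exactly this form.
EMPTY_MARKER = "Saved new config RunningConfig({}, {})"


def empty_saves(lines):
    """Return timestamps of log lines that saved an empty running config."""
    stamps = []
    for line in lines:
        if EMPTY_MARKER not in line:
            continue
        # Fields are separated by "::"; the third field is the timestamp.
        parts = line.split("::")
        if len(parts) > 2:
            stamps.append(parts[2])
    return stamps
```

Called with the lines of the log above, it would report the 07:08:55,524 save, which is the point where an empty config was written.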

Comment 5 Fabian Deutsch 2015-12-07 17:52:02 UTC
*** Bug 1289028 has been marked as a duplicate of this bug. ***

Comment 6 Fabian Deutsch 2015-12-07 17:52:48 UTC
What persistence failed?

The node specific file persistence?
Or the unified persistence?

Comment 7 Fabian Deutsch 2015-12-08 10:21:33 UTC
Note that this is a regression between RHEV-H 7.2-20151112.1.el7ev (works) and RHEV-H 7.2-20151201.2.el7ev (does not work). The diff between the two is:
--- RHEV-H 7.2-20151112.1.el7ev 
+++ RHEV-H 7.2-20151201.2.el7ev
-glibc-2.17-105.el7.x86_64
-glibc-common-2.17-105.el7.x86_64
+glibc-2.17-106.el7_2.1.x86_64
+glibc-common-2.17-106.el7_2.1.x86_64
-gmp-6.0.0-11.el7.x86_64
+gmp-6.0.0-12.el7_1.x86_64
-ioprocess-0.14.0-4.el7ev.x86_64
+ioprocess-0.15.0-5.el7ev.x86_64
-ipxe-roms-qemu-20130517-7.gitc4bce43.el7.noarch
+ipxe-roms-qemu-20130517-7.1fm.gitc4bce43.el7sat.noarch
-librados2-0.94.1-19.el7cp.x86_64
-librbd1-0.94.1-19.el7cp.x86_64
+librados2-0.94.3-3.el7cp.x86_64
+librbd1-0.94.3-3.el7cp.x86_64
-libreport-filesystem-2.1.11-30.el7.x86_64
+libreport-filesystem-2.1.11-31.el7.x86_64
-libvirt-1.2.17-13.el7.x86_64
+libvirt-1.2.17-13.el7_2.2.x86_64
-libvirt-client-1.2.17-13.el7.x86_64
-libvirt-daemon-1.2.17-13.el7.x86_64
-libvirt-daemon-config-network-1.2.17-13.el7.x86_64
-libvirt-daemon-config-nwfilter-1.2.17-13.el7.x86_64
-libvirt-daemon-driver-interface-1.2.17-13.el7.x86_64
-libvirt-daemon-driver-lxc-1.2.17-13.el7.x86_64
-libvirt-daemon-driver-network-1.2.17-13.el7.x86_64
-libvirt-daemon-driver-nodedev-1.2.17-13.el7.x86_64
-libvirt-daemon-driver-nwfilter-1.2.17-13.el7.x86_64
-libvirt-daemon-driver-qemu-1.2.17-13.el7.x86_64
-libvirt-daemon-driver-secret-1.2.17-13.el7.x86_64
-libvirt-daemon-driver-storage-1.2.17-13.el7.x86_64
-libvirt-daemon-kvm-1.2.17-13.el7.x86_64
-libvirt-lock-sanlock-1.2.17-13.el7.x86_64
+libvirt-client-1.2.17-13.el7_2.2.x86_64
+libvirt-daemon-1.2.17-13.el7_2.2.x86_64
+libvirt-daemon-config-network-1.2.17-13.el7_2.2.x86_64
+libvirt-daemon-config-nwfilter-1.2.17-13.el7_2.2.x86_64
+libvirt-daemon-driver-interface-1.2.17-13.el7_2.2.x86_64
+libvirt-daemon-driver-lxc-1.2.17-13.el7_2.2.x86_64
+libvirt-daemon-driver-network-1.2.17-13.el7_2.2.x86_64
+libvirt-daemon-driver-nodedev-1.2.17-13.el7_2.2.x86_64
+libvirt-daemon-driver-nwfilter-1.2.17-13.el7_2.2.x86_64
+libvirt-daemon-driver-qemu-1.2.17-13.el7_2.2.x86_64
+libvirt-daemon-driver-secret-1.2.17-13.el7_2.2.x86_64
+libvirt-daemon-driver-storage-1.2.17-13.el7_2.2.x86_64
+libvirt-daemon-kvm-1.2.17-13.el7_2.2.x86_64
+libvirt-lock-sanlock-1.2.17-13.el7_2.2.x86_64
+lttng-ust-2.4.1-1.el7cp.x86_64
+OpenIPMI-2.0.19-11.el7.x86_64
+OpenIPMI-libs-2.0.19-11.el7.x86_64
-ovirt-host-deploy-1.4.1-0.0.master.el7ev.noarch
+ovirt-host-deploy-1.4.1-1.el7ev.noarch
-ovirt-hosted-engine-ha-1.3.2.1-1.el7ev.noarch
-ovirt-hosted-engine-setup-1.3.0-1.el7ev.noarch
-ovirt-node-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
-ovirt-node-branding-rhev-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
-ovirt-node-lib-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
-ovirt-node-lib-config-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
-ovirt-node-lib-legacy-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
-ovirt-node-plugin-cim-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
-ovirt-node-plugin-cim-logic-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
-ovirt-node-plugin-hosted-engine-0.3.0-3.el7ev.noarch
-ovirt-node-plugin-rhn-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
-ovirt-node-plugin-snmp-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
-ovirt-node-plugin-snmp-logic-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
-ovirt-node-plugin-vdsm-0.6.1-3.el7ev.noarch
-ovirt-node-selinux-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
+ovirt-hosted-engine-ha-1.3.3.1-1.el7ev.noarch
+ovirt-hosted-engine-setup-1.3.1.1-1.el7ev.noarch
+ovirt-node-3.6.0-0.23.20151201git5eed7af.el7ev.noarch
+ovirt-node-branding-rhev-3.6.0-0.23.20151201git5eed7af.el7ev.noarch
+ovirt-node-lib-3.6.0-0.23.20151201git5eed7af.el7ev.noarch
+ovirt-node-lib-config-3.6.0-0.23.20151201git5eed7af.el7ev.noarch
+ovirt-node-lib-legacy-3.6.0-0.23.20151201git5eed7af.el7ev.noarch
+ovirt-node-plugin-cim-3.6.0-0.23.20151201git5eed7af.el7ev.noarch
+ovirt-node-plugin-cim-logic-3.6.0-0.23.20151201git5eed7af.el7ev.noarch
+ovirt-node-plugin-hosted-engine-0.3.0-4.el7ev.noarch
+ovirt-node-plugin-rhn-3.6.0-0.23.20151201git5eed7af.el7ev.noarch
+ovirt-node-plugin-snmp-3.6.0-0.23.20151201git5eed7af.el7ev.noarch
+ovirt-node-plugin-snmp-logic-3.6.0-0.23.20151201git5eed7af.el7ev.noarch
+ovirt-node-plugin-vdsm-0.6.1-4.el7ev.noarch
+ovirt-node-selinux-3.6.0-0.23.20151201git5eed7af.el7ev.noarch
-python-ioprocess-0.14.0-4.el7ev.noarch
+python-ioprocess-0.15.0-5.el7ev.noarch
-python-rhsm-1.15.4-5.el7.x86_64
+python-rhsm-1.13.2-1.el7.x86_64
-rdma-7.2_4.1_rc6-1.el7.noarch
+rdma-7.2_4.1_rc6-2.el7.noarch
-screen-4.1.0-0.21.20120314git3c2946.el7.x86_64
+screen-4.1.0-0.22.20120314git3c2946.el7.x86_64
-subscription-manager-1.15.9-15.el7.x86_64
+subscription-manager-1.10.14-10.el7.x86_64
+userspace-rcu-0.7.9-2.el7rhs.x86_64
-vdsm-4.17.10.1-0.el7ev.noarch
-vdsm-cli-4.17.10.1-0.el7ev.noarch
-vdsm-hook-ethtool-options-4.17.10.1-0.el7ev.noarch
-vdsm-infra-4.17.10.1-0.el7ev.noarch
-vdsm-jsonrpc-4.17.10.1-0.el7ev.noarch
-vdsm-python-4.17.10.1-0.el7ev.noarch
-vdsm-xmlrpc-4.17.10.1-0.el7ev.noarch
-vdsm-yajsonrpc-4.17.10.1-0.el7ev.noarch
+vdsm-4.17.12-0.el7ev.noarch
+vdsm-cli-4.17.12-0.el7ev.noarch
+vdsm-hook-ethtool-options-4.17.12-0.el7ev.noarch
+vdsm-infra-4.17.12-0.el7ev.noarch
+vdsm-jsonrpc-4.17.12-0.el7ev.noarch
+vdsm-python-4.17.12-0.el7ev.noarch
+vdsm-xmlrpc-4.17.12-0.el7ev.noarch
+vdsm-yajsonrpc-4.17.12-0.el7ev.noarch
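
A package diff of this size is easier to review when removed and added entries are paired by name. The following Python sketch does that and flags packages whose version went down. It is a rough helper under stated assumptions: each entry is taken to have the form name-version-release.arch, only purely numeric dotted versions are compared, and this is not a replacement for RPM's own version comparison. The function names are hypothetical.

```python
def numeric_version(version):
    """Return the version as a tuple of ints, or None if it is not purely numeric."""
    parts = version.split(".")
    if all(p.isdigit() for p in parts):
        return tuple(int(p) for p in parts)
    return None


def find_downgrades(diff_lines):
    """Pair '-pkg' and '+pkg' lines by package name and list version downgrades."""
    removed, added = {}, {}
    for line in diff_lines:
        line = line.strip()
        # Skip the '---' / '+++' image header lines and empty lines.
        if len(line) < 2 or line.startswith(("---", "+++")):
            continue
        if line[0] == "-":
            target = removed
        elif line[0] == "+":
            target = added
        else:
            continue
        # Assumed layout: name-version-release.arch (name may contain hyphens).
        parts = line[1:].rsplit("-", 2)
        if len(parts) != 3:
            continue
        name, version, _release = parts
        target[name] = version
    downgrades = []
    for name in sorted(removed.keys() & added.keys()):
        old = numeric_version(removed[name])
        new = numeric_version(added[name])
        if old is not None and new is not None and new < old:
            downgrades.append((name, removed[name], added[name]))
    return downgrades
```

Whether any flagged downgrade is related to the bond problem is a separate question; the sketch only narrows down which entries deserve a second look.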

Comment 8 Fabian Deutsch 2015-12-08 10:25:33 UTC
    ifcfg: remove files properly on the node
    
    Since change-id I02ae28c345 we are always persisting ifcfg files on the
    node. This means that we should unpersist them on removal.
    
    Change-Id: I2ab83b3fad7679f8f3f459b682860a95e08d6b1e
    Bug-Url: https://bugzilla.redhat.com/1283628
    Signed-off-by: Dan Kenigsberg <danken>
    Reviewed-on: https://gerrit.ovirt.org/48841
    Reviewed-by: Ido Barkan <ibarkan>
    Reviewed-by: Fabian Deutsch <fabiand>
    Tested-by: Sagi Shnaidman <sshnaidm>
    Reviewed-by: Sagi Shnaidman <sshnaidm>
    (cherry picked from commit 1ae349016221c52e1a80971aac2e5080ad33fd11)
    Reviewed-on: https://gerrit.ovirt.org/49373
    Continuous-Integration: Jenkins CI

This patch was merged during that time, which could have an effect here.

Comment 9 Dan Kenigsberg 2015-12-08 13:40:19 UTC
Oddly, I see a DHCPOFFER at 07:02:46, but it is somehow ignored at 07:03:58.

Dec  4 07:06:53 localhost dhclient[2245]: DHCPDISCOVER on bond0 to 255.255.255.255 port 67 interval 7 (xid=0x1cf82057)
Dec  4 07:06:53 localhost dhclient[2245]: DHCPREQUEST on bond0 to 255.255.255.255 port 67 (xid=0x1cf82057)
Dec  4 07:06:53 localhost dhclient[2245]: DHCPOFFER from 10.66.73.254
Dec  4 07:07:01 localhost dhclient[2245]: DHCPREQUEST on bond0 to 255.255.255.255 port 67 (xid=0x1cf82057)
Dec  4 07:07:11 localhost dhclient[2245]: DHCPDISCOVER on bond0 to 255.255.255.255 port 67 interval 3 (xid=0x52479cbf)
Dec  4 07:07:14 localhost dhclient[2245]: DHCPDISCOVER on bond0 to 255.255.255.255 port 67 interval 4 (xid=0x52479cbf)
Dec  4 07:07:18 localhost dhclient[2245]: DHCPDISCOVER on bond0 to 255.255.255.255 port 67 interval 10 (xid=0x52479cbf)
Dec  4 07:07:28 localhost dhclient[2245]: DHCPDISCOVER on bond0 to 255.255.255.255 port 67 interval 15 (xid=0x52479cbf)
Dec  4 07:07:43 localhost dhclient[2245]: DHCPDISCOVER on bond0 to 255.255.255.255 port 67 interval 18 (xid=0x52479cbf)
Dec  4 07:08:01 localhost systemd: Created slice user-0.slice.
Dec  4 07:08:01 localhost systemd: Starting user-0.slice.
Dec  4 07:08:01 localhost systemd: Started Session 1 of user root.
Dec  4 07:08:01 localhost systemd: Starting Session 1 of user root.
Dec  4 07:08:01 localhost dhclient[2245]: DHCPDISCOVER on bond0 to 255.255.255.255 port 67 interval 11 (xid=0x52479cbf)
Dec  4 07:08:02 localhost systemd: Removed slice user-0.slice.
Dec  4 07:08:02 localhost systemd: Stopping user-0.slice.
Dec  4 07:08:12 localhost dhclient[2245]: No DHCPOFFERS received.
Dec  4 07:08:12 localhost network: Determining IP information for bond0... failed.
Dec  4 07:08:12 localhost network: [FAILED]

Could it be that your DHCP server is unfamiliar with p4p2's MAC address? Can you repeat the test with bond mode=4 (in case this is your switch's config)?
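
When repeating the test, the active bond mode and the slaves' MAC addresses can be read from the kernel's bonding status file (/proc/net/bonding/bond0 for this bond). A minimal Python sketch for parsing that text follows; the function name is hypothetical, and only the "Bonding Mode", "Slave Interface", "MII Status" and "Permanent HW addr" lines are handled.

```python
def parse_bonding_status(text):
    """Extract the bond mode and per-slave details from bonding status text."""
    mode = None
    slaves = {}
    current = None
    for raw in text.splitlines():
        key, sep, value = raw.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Bonding Mode":
            mode = value
        elif key == "Slave Interface":
            # Lines that follow belong to this slave until the next one starts.
            current = value
            slaves[current] = {}
        elif current is not None and key in ("MII Status", "Permanent HW addr"):
            slaves[current][key] = value
    return mode, slaves
```

For example, `parse_bonding_status(open("/proc/net/bonding/bond0").read())` would return the mode string together with a dictionary keyed by slave name, which can then be compared with what the DHCP server and switch expect.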

Comment 10 Michael Burman 2015-12-08 14:10:53 UTC
Hi

RHEV-H 7.2-20151129.1.el7ev has vdsm-4.16.30-1.el7ev.x86_64, right?

That means you had 'rhevm' under /var/lib/vdsm/persistence/netconf/nets/ and not 'ovirtmgmt'.

And didn't your RHEV-H 7.2-20151112.1.el7ev have vdsm 4.17.10 (3.6)?
If so, that explains why it worked.

I think it's all related to BZ 1271273.

Comment 11 Huijuan Zhao 2015-12-09 02:42:55 UTC
Hi Michael, RHEV-H was not registered to RHEV-M before the upgrade, so there is no "rhevm" or "ovirtmgmt" bridge; maybe this is not related to "rhevm" or "ovirtmgmt".

Additionally:
1. For the bonding network, there is no such issue on RHEV-H 7.2-20151112.1.el7ev, but for the VLAN network the issue also occurs on RHEV-H 7.2-20151112.1.el7ev.

2. No such issue occurs when upgrading via the command line.

Comment 12 Michael Burman 2015-12-09 05:42:00 UTC
Huijuan Hi,

Even if the host wasn't registered to rhev-m, the 'rhevm' bridge is created over the NIC.
You can verify this after step 2 with:
tree /var/lib/vdsm/persistence/netconf/nets/
├── rhevm

and brctl show command.
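
The same check can be scripted. The Python sketch below lists what is stored under a netconf directory; it assumes the directory contains "nets" and "bonds" subdirectories (as the paths in the restore-net log suggest), and the function name is made up for illustration.

```python
import os


def persisted_netconf(base):
    """Map 'nets' and 'bonds' to the sorted entry names found under base."""
    result = {}
    for kind in ("nets", "bonds"):
        path = os.path.join(base, kind)
        # A missing subdirectory is reported as an empty list.
        result[kind] = sorted(os.listdir(path)) if os.path.isdir(path) else []
    return result
```

Running `persisted_netconf("/var/lib/vdsm/persistence/netconf")` before and after the upgrade would show whether any network or bond definition was actually persisted.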

Danken, isn't this the same issue as BZ 1271273? The management network and the associated NICs (the bond in this case) weren't persisted.

When Huijuan ran an upgrade from:
- RHEV-H 7.2-20151129.1.el7ev (vdsm 3.5.6) >> RHEV-H 7.2-20151201.2.el7ev (vdsm 3.6.1.1 beta)
rhevm > ovirtmgmt
it failed.

- But when the upgrade was run from:
RHEV-H 7.2-20151112.1.el7ev (vdsm 3.6.1 beta 1) >> RHEV-H 7.2-20151201.2.el7ev (vdsm 3.6.1.1 beta)
ovirtmgmt > ovirtmgmt
it succeeded.

This looks like a 'rhevm'/'ovirtmgmt' management network issue (even if the host wasn't registered to rhev-m).

Comment 13 Dan Kenigsberg 2015-12-09 06:56:02 UTC
Are you sure that "rhevm" exists right after step 2? AFAIK rhev-h 3.5 creates the "rhevm" network only after the details of Engine are supplied to the TUI, and not right after step 2 of comment 0. Comment 0 did not supply an Engine IP, so bug 1271273 is unrelated.

Comment 14 Michael Burman 2015-12-09 09:00:33 UTC
Dan, Huijuan

You are right, guys; sorry, my mistake.
'rhevm' exists only after the details of Engine are supplied to the TUI.

Comment 15 Huijuan Zhao 2015-12-09 09:39:44 UTC
Michael, regarding comment 12, I ran the upgrade again from:

- RHEV-H 7.2-20151129.1.el7ev (vdsm 3.5.6) >> RHEV-H 7.2-20151201.2.el7ev (vdsm 3.6.1.1 beta)
Failed.

- RHEV-H 7.2-20151112.1.el7ev (vdsm 3.6.1 beta 1) >> RHEV-H 7.2-20151201.2.el7ev (vdsm 3.6.1.1 beta)
Succeeded. But in the previous run it failed, so this is not 100% reproducible.

Comment 16 Fabian Deutsch 2015-12-09 10:05:30 UTC
Huijuan, can you try to reproduce this bug on a machine with a dual- or quad-NIC card?
Please create the bond over two ports of such a dual or quad card.

I'd like to see if this problem is related to the NICs currently involved.

Comment 17 Michael Burman 2015-12-09 10:21:24 UTC
Fabian, Dan

Can someone please explain the use case for such an upgrade scenario, without involving the rhev-m engine? Why would someone run such an upgrade in the first place?
Thanks :)

Comment 18 Fabian Deutsch 2015-12-09 10:32:43 UTC
A valid point. There were cases where this was happening, but in RHEV, indeed, this should not happen too often.

Still, I suspect that this problem will also be encountered if RHEV-H was connected to RHEV-M, because I don't see any RHEV-H specific problem here.

Huijuan, can you please check if this bug also appears if the host is registered to RHEV-M, by adding a step between steps 2 and 3 of comment 0:

2.a Register to RHEV-M

In addition, it would be good to have the question in comment 16 answered.

Comment 19 Michael Burman 2015-12-09 12:54:58 UTC
You will be blocked by BZ 1271273 if you do this with
RHEV-H 7.2-20151129.1.el7ev (vdsm 3.5.6) >> RHEV-H 7.2-20151201.2.el7ev (vdsm 3.6.1.1 beta)

RHEV-H 7.2-20151112.1.el7ev (vdsm 3.6.1 beta 1) >> RHEV-H 7.2-20151201.2.el7ev (vdsm 3.6.1.1 beta) should succeed.

Comment 20 Huijuan Zhao 2015-12-10 02:55:04 UTC
Fabian, ycui, in both comment 0 and comment 15 I created the bond over two NICs.

Do you mean a bond over two ports on the same NIC, or just any two ports (on either two NICs or one NIC)?

Comment 21 Dan Kenigsberg 2015-12-10 07:22:44 UTC
Reducing urgency, since upgrade prior to registration is less important.

Comment 22 Huijuan Zhao 2015-12-10 10:28:08 UTC
Fabian, ycui, regarding comment 16: I tested a bond over two ports on the same card once; no such issue.

Comment 23 Fabian Deutsch 2015-12-10 10:56:08 UTC
Thanks Huijuan

Closing this bug according to comment 22.

Comment 24 Ying Cui 2015-12-17 10:02:09 UTC
Some things about this bug are still unclear.

1. Why did the regression happen? See the bug description and comment 11.

2. Why is a bond over two NIC cards disabled after upgrading? It should be a valid scenario.
   But a bond over _a_ dual-NIC card works well after upgrading.