Description of problem
======================

When I updated one of my gluster console managed clusters (the previous update was done on Sep 08 2015), the storage servers lost network access (the eth0 device was removed from the ovirtmgmt bridge).

Version-Release number of selected component (if applicable)
============================================================

Red Hat Gluster Storage Server 3.1 machines are registered into the following channels:

* rhel-6-server-rpms
* rhel-scalefs-for-rhel-6-server-rpms
* rhs-3-for-rhel-6-server-rpms

Related packages before update (state from Sep 08 2015):

~~~
glusterfs-3.7.1-11.el6rhs.x86_64
glusterfs-api-3.7.1-11.el6rhs.x86_64
glusterfs-cli-3.7.1-11.el6rhs.x86_64
glusterfs-client-xlators-3.7.1-11.el6rhs.x86_64
glusterfs-fuse-3.7.1-11.el6rhs.x86_64
glusterfs-geo-replication-3.7.1-11.el6rhs.x86_64
glusterfs-libs-3.7.1-11.el6rhs.x86_64
glusterfs-rdma-3.7.1-11.el6rhs.x86_64
glusterfs-server-3.7.1-11.el6rhs.x86_64
gluster-nagios-addons-0.2.4-4.el6rhs.x86_64
gluster-nagios-common-0.2.0-1.el6rhs.noarch
vdsm-4.16.20-1.2.el6rhs.x86_64
vdsm-cli-4.16.20-1.2.el6rhs.noarch
vdsm-gluster-4.16.20-1.2.el6rhs.noarch
vdsm-jsonrpc-4.16.20-1.2.el6rhs.noarch
vdsm-python-4.16.20-1.2.el6rhs.noarch
vdsm-python-zombiereaper-4.16.20-1.2.el6rhs.noarch
vdsm-reg-4.16.20-1.2.el6rhs.noarch
vdsm-xmlrpc-4.16.20-1.2.el6rhs.noarch
vdsm-yajsonrpc-4.16.20-1.2.el6rhs.noarch
~~~

Related packages after update (state from Nov 15 2015):

~~~
glusterfs-3.7.1-16.el6rhs.x86_64
glusterfs-api-3.7.1-16.el6rhs.x86_64
glusterfs-cli-3.7.1-16.el6rhs.x86_64
glusterfs-client-xlators-3.7.1-16.el6rhs.x86_64
glusterfs-fuse-3.7.1-16.el6rhs.x86_64
glusterfs-geo-replication-3.7.1-16.el6rhs.x86_64
glusterfs-libs-3.7.1-16.el6rhs.x86_64
glusterfs-rdma-3.7.1-16.el6rhs.x86_64
glusterfs-server-3.7.1-16.el6rhs.x86_64
gluster-nagios-addons-0.2.5-1.el6rhs.x86_64
gluster-nagios-common-0.2.2-1.el6rhs.noarch
python-gluster-3.7.1-16.el6rhs.x86_64
vdsm-4.16.20-1.3.el6rhs.x86_64
vdsm-cli-4.16.20-1.3.el6rhs.noarch
vdsm-gluster-4.16.20-1.3.el6rhs.noarch
vdsm-jsonrpc-4.16.20-1.3.el6rhs.noarch
vdsm-python-4.16.20-1.3.el6rhs.noarch
vdsm-python-zombiereaper-4.16.20-1.3.el6rhs.noarch
vdsm-reg-4.16.20-1.3.el6rhs.noarch
vdsm-xmlrpc-4.16.20-1.3.el6rhs.noarch
vdsm-yajsonrpc-4.16.20-1.3.el6rhs.noarch
~~~

How reproducible
================

100%. I observe this issue on a cluster of virtual machines, which allowed me to restore a snapshot from September (the last one I have) and retry multiple times.

Steps to Reproduce
==================

On a cluster managed by RH Gluster Storage Console (configured with nagios), which was last updated in September:

1. Stop all gluster volumes (in the console)
2. Move all storage servers into maintenance mode
3. Stop the glusterd daemons on all storage servers
4. Run yum update on all storage servers
5. Reboot all storage servers

Actual results
==============

Storage servers have no network access after boot. When I log in via the serial console, I see that:

* the only network device in up state is loopback
* both /etc/sysconfig/network-scripts/ifcfg-{eth0,ovirtmgmt} files are deleted
* vdsm (ovirt manager) seems to be involved (see below)

Checking /var/log/messages:

~~~
$ grep eth0 messages
Oct 15 14:57:03 dhcp-125-8 kernel: device eth0 entered promiscuous mode
Oct 15 14:57:03 dhcp-125-8 kernel: ovirtmgmt: port 1(eth0) entering forwarding state
Oct 15 14:57:55 dhcp-125-8 kernel: 8021q: adding VLAN 0 to HW filter on device eth0
Oct 15 14:57:56 dhcp-125-8 kernel: ovirtmgmt: port 1(eth0) entering disabled state
Oct 15 14:57:57 dhcp-125-8 kernel: device eth0 left promiscuous mode
Oct 15 14:57:57 dhcp-125-8 kernel: ovirtmgmt: port 1(eth0) entering disabled state
~~~

Last few lines:

~~~
Oct 15 14:57:48 dhcp-125-8 kernel: lo: Disabled Privacy Extensions
Oct 15 14:57:55 dhcp-125-8 kernel: Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Oct 15 14:57:55 dhcp-125-8 kernel: 802.1Q VLAN Support v1.8 Ben Greear <greearb>
Oct 15 14:57:55 dhcp-125-8 kernel: All bugs added by David S. Miller <davem>
Oct 15 14:57:55 dhcp-125-8 kernel: 8021q: adding VLAN 0 to HW filter on device eth0
Oct 15 14:57:56 dhcp-125-8 ntpd[6875]: ntpd exiting on signal 15
Oct 15 14:57:56 dhcp-125-8 ntpd[7735]: ntpd 4.2.6p5 Tue Apr 28 10:15:27 UTC 2015 (1)
Oct 15 14:57:56 dhcp-125-8 ntpd[7736]: proto: precision = 0.211 usec
Oct 15 14:57:56 dhcp-125-8 ntpd[7736]: 0.0.0.0 c01d 0d kern kernel time sync enabled
Oct 15 14:57:56 dhcp-125-8 ntpd[7736]: Listen and drop on 0 v4wildcard 0.0.0.0 UDP 123
Oct 15 14:57:56 dhcp-125-8 ntpd[7736]: Listen and drop on 1 v6wildcard :: UDP 123
Oct 15 14:57:56 dhcp-125-8 ntpd[7736]: Listen normally on 2 lo 127.0.0.1 UDP 123
Oct 15 14:57:56 dhcp-125-8 ntpd[7736]: Listen normally on 3 ovirtmgmt 10.34.125.8 UDP 123
Oct 15 14:57:56 dhcp-125-8 ntpd[7736]: Listening on routing socket on fd #20 for interface updates
Oct 15 14:57:56 dhcp-125-8 ntpd[7736]: Deferring DNS for 0.rhel.pool.ntp.org 1
Oct 15 14:57:56 dhcp-125-8 kernel: ovirtmgmt: port 1(eth0) entering disabled state
Oct 15 14:57:56 dhcp-125-8 ntpd[7736]: Deferring DNS for 1.rhel.pool.ntp.org 1
Oct 15 14:57:56 dhcp-125-8 ntpd[7736]: Deferring DNS for 2.rhel.pool.ntp.org 1
Oct 15 14:57:56 dhcp-125-8 ntpd[7736]: Deferring DNS for 3.rhel.pool.ntp.org 1
Oct 15 14:57:56 dhcp-125-8 ntpd[7736]: 0.0.0.0 c016 06 restart
Oct 15 14:57:56 dhcp-125-8 ntpd[7736]: 0.0.0.0 c012 02 freq_set kernel -10.082 PPM
Oct 15 14:57:57 dhcp-125-8 kernel: device eth0 left promiscuous mode
Oct 15 14:57:57 dhcp-125-8 kernel: ovirtmgmt: port 1(eth0) entering disabled state
Oct 15 14:57:57 dhcp-125-8 kernel: ovirtmgmt: mixed no checksumming and other settings.
Oct 15 14:57:58 dhcp-125-8 vdsm vds WARNING MOM is not available, fallback to KsmMonitor
Oct 15 14:57:58 dhcp-125-8 ntpd[7736]: Deleting interface #3 ovirtmgmt, 10.34.125.8#123, interface stats: received=0, sent=0, dropped=0, active_time=2 secs
~~~

Expected results
================

Network works as usual; the ifcfg files are not deleted.
Additional info
===============

When I boot into single user mode after the update and start the network manually, instead of doing a normal boot, the network works (no ifcfg files are deleted). This means the ifcfg files are deleted during the first normal multiuser boot after the update.
Created attachment 1083307 [details]
Output of yum update
This looks similar to Bug 1277951, which was closed due to lack of repro data. Triveni, Martin, can you check whether the 3.1.2 vdsm solves the issue? We rebased the 3.1.2 vdsm to 4.16.30 to solve some of the network issues.
Using my archive of libvirt snapshots, I restored the machines from Sep 08 2015, re-registered them into the current stable cdn channels and updated the storage servers (following the steps to reproduce from this BZ) to:

1) the current stable version from cdn (RHGS 3.1.1):
   vdsm-4.16.20-1.2.el6rhs -> vdsm-4.16.20-1.3.el6rhs
   (this is the same situation as described in this BZ)

2) the latest version from the qe puddle repo (RHGS 3.1.2):
   vdsm-4.16.20-1.2.el6rhs -> vdsm-4.16.30-1.3.el6rhs

I was able to reproduce the issue in both cases.
From supervdsm.log after reboot:

~~~
restore-net::INFO::2016-01-08 10:10:33,243::ifcfg::423::root::(_loadBackupFiles) Loaded /var/lib/vdsm/netconfback/ifcfg-eth0
restore-net::INFO::2016-01-08 10:10:33,243::ifcfg::423::root::(_loadBackupFiles) Loaded /var/lib/vdsm/netconfback/ifcfg-ovirtmgmt
~~~

But it looks like the contents of these files are empty, which causes the files to be removed further down in the log:

~~~
restore-net::DEBUG::2016-01-08 10:10:34,060::ifcfg::377::root::(restoreAtomicBackup) Removing empty configuration backup /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt
restore-net::DEBUG::2016-01-08 10:10:34,060::ifcfg::377::root::(restoreAtomicBackup) Removing empty configuration backup /etc/sysconfig/network-scripts/ifcfg-eth0
~~~

Dan, what would cause the backup files to be empty?
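The failure mode suggested by these log lines can be sketched outside vdsm. This is a hypothetical illustration, not vdsm source: a backup whose content is empty, or consists only of a marker comment, is treated as "the original file never existed", so restore deletes the live ifcfg file instead of restoring it.

```shell
# Hypothetical sketch, NOT vdsm's actual code: restore one ifcfg backup
# the way the restoreAtomicBackup log messages suggest it behaves.
MARKER='# original file did not exist'

restore_one() {
  # $1 = backup file, $2 = live ifcfg file
  content=$(cat "$1")
  if [ -z "$content" ] || [ "$content" = "$MARKER" ]; then
    # "Empty" backup: the original supposedly never existed,
    # so the live configuration file gets removed.
    rm -f "$2"
  else
    # Otherwise the backed-up content is restored.
    cp "$1" "$2"
  fi
}
```

With backup files holding only the marker line, this path deletes both ifcfg-eth0 and ifcfg-ovirtmgmt, which would match the "Removing empty configuration backup" messages above.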
Additional information, state of the vdsm netconfback files before update and reboot:

~~~
# ls -l /var/lib/vdsm/netconfback/
total 8
-rw-r--r--. 1 vdsm root 30 Jan 13 14:45 ifcfg-eth0
-rw-r--r--. 1 vdsm root 30 Jan 13 14:45 ifcfg-ovirtmgmt
# cat /var/lib/vdsm/netconfback/ifcfg-eth0
# original file did not exist
# cat /var/lib/vdsm/netconfback/ifcfg-ovirtmgmt
# original file did not exist
~~~

While the actual ifcfg files were there:

~~~
# cat /etc/sysconfig/network-scripts/ifcfg-eth0
# Generated by VDSM version 4.16.20-1.2.el6rhs
DEVICE=eth0
HWADDR=52:54:00:70:7a:8c
BRIDGE=ovirtmgmt
ONBOOT=yes
MTU=1500
NM_CONTROLLED=no
# cat /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt
# Generated by VDSM version 4.16.20-1.2.el6rhs
DEVICE=ovirtmgmt
TYPE=Bridge
DELAY=0
STP=off
ONBOOT=yes
BOOTPROTO=dhcp
MTU=1500
DEFROUTE=yes
NM_CONTROLLED=no
~~~
Additional information: using libvirt and guestfish tools, I scripted extraction of the entire /var/lib/vdsm/ directory into a tarball for each snapshot of node1 I have, then searched for netconfback files in each tarball. It turned out that this issue is likely related to the nagios configuration (at least in my case). See an excerpt of my snapshot list:

~~~
 Name                      Creation Time              State
------------------------------------------------------------
 ...
 w37_01_rhgsinstalled      2015-09-08 10:57:23 +0200  shutoff
 w37_02_volumedefined      2015-09-08 11:28:25 +0200  shutoff
 w37_03_nagiosconfigured   2015-09-08 17:21:14 +0200  shutoff
 ...
~~~

The almost empty netconfback files appeared in the w37_03_nagiosconfigured snapshot for the first time. For every snapshot before that, the directory was empty. Note that w37_03_nagiosconfigured is the snapshot I restored when I hit this issue for the first time (as I noted in comment 6).
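The extraction step described above can be sketched with libguestfs; the helper names are mine and the real script is not attached, so treat this as an illustration under those assumptions.

```shell
# Hypothetical sketch of the per-snapshot extraction: open a snapshot's
# disk image read-only and pull /var/lib/vdsm out as a tarball.
extract_vdsm_dir() {
  # $1 = disk image path, $2 = output tarball
  # --ro: never modify the image; -i: auto-mount the guest filesystems
  guestfish --ro -a "$1" -i tgz-out /var/lib/vdsm "$2"
}

list_netconfback() {
  # $1 = tarball produced above; print any netconfback entries it holds
  tar -tzf "$1" | grep netconfback
}
```

Running `list_netconfback` against each snapshot's tarball is then enough to spot the snapshot in which the netconfback files first appeared.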
I don't think Nagios has anything to do with network configuration. I feel there is something wrong with vdsm during the upgrade. Maybe edwardh or Dan can help us here. Have you faced similar issues with VDSM in RHEV-M?
Looks like this problem is caused by "net_persistence = ifcfg" in /etc/vdsm/vdsm.conf. Somehow we reach a state where /var/lib/vdsm/netconfback/ifcfg-eth0 and /var/lib/vdsm/netconfback/ifcfg-ovirtmgmt have the content '# original file did not exist' while "net_persistence = ifcfg" is set in /etc/vdsm/vdsm.conf. Rebooting the system in this state removes the ifcfg files in /var/lib/vdsm/netconfback/ and /etc/sysconfig/network-scripts/ and leaves the system without any network.

If I remove "net_persistence = ifcfg" from /etc/vdsm/vdsm.conf and reboot, the problem is not seen. This problem is reproducible with vdsm-4.16.20-1.2.el6rhs.x86_64.

Edward Haas: Do you have any idea why the network-config backup files in /var/lib/vdsm/netconfback/ don't contain any valid data? Do you see a similar issue in RHEV-M?
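The workaround of removing the setting can be expressed as a one-line edit. This is a sketch under the assumptions in this BZ (the helper name is mine; back up vdsm.conf before editing it on a real host):

```shell
# Drop any "net_persistence = ifcfg" override so vdsm falls back to its
# default persistence mode. Sketch only; take a backup of the file first.
remove_ifcfg_persistence() {
  # $1 = path to vdsm.conf (normally /etc/vdsm/vdsm.conf)
  sed -i '/^[[:space:]]*net_persistence[[:space:]]*=[[:space:]]*ifcfg/d' "$1"
}
```

The regex tolerates both the spaced form ("net_persistence = ifcfg") and the compact form ("net_persistence=ifcfg") seen in this report.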
A similar bug, bz#1263979, was fixed in vdsm-4.16.28-1.el6ev.x86_64 for RHEV-M 3.5.6, but I am able to reproduce this issue with 4.16.30-1.
We have added "net_persistence=ifcfg" as a workaround for bz#1203422. It was suggested by Dan in "https://bugzilla.redhat.com/show_bug.cgi?id=1215011#c2". But bz#1203422 is fixed now. So can we remove this config?

Regards,
Ramesh

We would prefer moving to the unified persistent mode (the default), so removing the config is recommended. If moving back to the unified persistent mode solves your issue, would it satisfy this bug? We would prefer not to touch the older ifcfg persistent mode too much, keeping our focus on the unified mode.
(In reply to Edward Haas from comment #16)
> We have added "net_persistence=ifcfg" as a workaround for bz#1203422. It
> was suggested by Dan in
> "https://bugzilla.redhat.com/show_bug.cgi?id=1215011#c2". But bz#1203422 is
> fixed now. So can we remove this config?
>
> Regards,
> Ramesh
>
> We would prefer moving to the unified persistent mode (the default), so
> removing the config is recommended.

We can move to the unified persistent mode; we don't see any issue with that. But I would like to understand the scenario in which /var/lib/vdsm/netconfback/ifcfg-eth0 and /var/lib/vdsm/netconfback/ifcfg-ovirtmgmt would have the content '# original file did not exist'. This will help me understand the seriousness of this issue.

Regards,
Ramesh

> If moving back to the unified persistent mode solves your issue, would it
> satisfy this bug? We would prefer not to touch the older ifcfg persistent
> mode too much, keeping our focus on the unified mode.
"net_persistence=ifcfg" was added as an workaround for bz#1203422. bz#1203422 is already fixed. So we can remove "net_persistence=ifcfg" from /etc/vdsm/vdsm.conf. Note: Though "net_persistence=ifcfg" was added during rpm install/update, it is getting removed during setupHostNetworks.
Reverted the patch as mentioned in comment 18
I have verified the bug and found no issues with the fixed-in version vdsm-4.16.30-1.4 on both RHEL 7 and RHEL 6 nodes.

Steps followed:
1. Installed an RHGS 3.1.1 node and added it to RHSC.
2. Confirmed that "net_persistence=ifcfg" is not present.
3. Updated the vdsm package to 3.1.3 and put the node in maintenance mode.
4. Rebooted the node and confirmed that the network configs are proper.
5. Activated the host on RHSC and found no issues.
6. The same steps were followed for 3.1.2 as well.

~~~
[root@dhcp35-139 ~]# cat /var/lib/vdsm/netconfback/ifcfg-ovirtmgmt
cat: /var/lib/vdsm/netconfback/ifcfg-ovirtmgmt: No such file or directory
[root@dhcp35-139 ~]# cat /etc/vdsm/vdsm.conf | grep ifcfg
[root@dhcp35-139 ~]#
[root@dhcp35-139 ~]# cat /etc/vdsm/vdsm.conf
[vars]
ssl = true

[addresses]
management_port = 54321
[root@dhcp35-139 ~]#
~~~
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1242