Description of problem
======================

When I updated one of my gluster console managed clusters (the previous update was done on Sep 08 2015), the storage servers lost network access (the eth0 device was removed from the ovirtmgmt bridge).

Version-Release number of selected component (if applicable)
============================================================

Red Hat Gluster Storage Server 3.1 machines are registered into the following channels:

* rhel-6-server-rpms
* rhel-scalefs-for-rhel-6-server-rpms
* rhs-3-for-rhel-6-server-rpms

Related packages before update (state from Sep 08 2015):

~~~
glusterfs-3.7.1-11.el6rhs.x86_64
glusterfs-api-3.7.1-11.el6rhs.x86_64
glusterfs-cli-3.7.1-11.el6rhs.x86_64
glusterfs-client-xlators-3.7.1-11.el6rhs.x86_64
glusterfs-fuse-3.7.1-11.el6rhs.x86_64
glusterfs-geo-replication-3.7.1-11.el6rhs.x86_64
glusterfs-libs-3.7.1-11.el6rhs.x86_64
glusterfs-rdma-3.7.1-11.el6rhs.x86_64
glusterfs-server-3.7.1-11.el6rhs.x86_64
gluster-nagios-addons-0.2.4-4.el6rhs.x86_64
gluster-nagios-common-0.2.0-1.el6rhs.noarch
vdsm-4.16.20-1.2.el6rhs.x86_64
vdsm-cli-4.16.20-1.2.el6rhs.noarch
vdsm-gluster-4.16.20-1.2.el6rhs.noarch
vdsm-jsonrpc-4.16.20-1.2.el6rhs.noarch
vdsm-python-4.16.20-1.2.el6rhs.noarch
vdsm-python-zombiereaper-4.16.20-1.2.el6rhs.noarch
vdsm-reg-4.16.20-1.2.el6rhs.noarch
vdsm-xmlrpc-4.16.20-1.2.el6rhs.noarch
vdsm-yajsonrpc-4.16.20-1.2.el6rhs.noarch
~~~

Related packages after update (state from Nov 15 2015):

~~~
glusterfs-3.7.1-16.el6rhs.x86_64
glusterfs-api-3.7.1-16.el6rhs.x86_64
glusterfs-cli-3.7.1-16.el6rhs.x86_64
glusterfs-client-xlators-3.7.1-16.el6rhs.x86_64
glusterfs-fuse-3.7.1-16.el6rhs.x86_64
glusterfs-geo-replication-3.7.1-16.el6rhs.x86_64
glusterfs-libs-3.7.1-16.el6rhs.x86_64
glusterfs-rdma-3.7.1-16.el6rhs.x86_64
glusterfs-server-3.7.1-16.el6rhs.x86_64
gluster-nagios-addons-0.2.5-1.el6rhs.x86_64
gluster-nagios-common-0.2.2-1.el6rhs.noarch
python-gluster-3.7.1-16.el6rhs.x86_64
vdsm-4.16.20-1.3.el6rhs.x86_64
vdsm-cli-4.16.20-1.3.el6rhs.noarch
vdsm-gluster-4.16.20-1.3.el6rhs.noarch
vdsm-jsonrpc-4.16.20-1.3.el6rhs.noarch
vdsm-python-4.16.20-1.3.el6rhs.noarch
vdsm-python-zombiereaper-4.16.20-1.3.el6rhs.noarch
vdsm-reg-4.16.20-1.3.el6rhs.noarch
vdsm-xmlrpc-4.16.20-1.3.el6rhs.noarch
vdsm-yajsonrpc-4.16.20-1.3.el6rhs.noarch
~~~

How reproducible
================

100%. I observe this issue on a cluster of virtual machines, which allowed me to restore a snapshot from September (the last one I have) and retry multiple times.

Steps to Reproduce
==================

On a cluster managed by RH Gluster Storage Console (configured with nagios), which was last updated in September:

1. Stop all gluster volumes (in the console)
2. Move all storage servers into maintenance mode
3. Stop the glusterd daemons on all storage servers
4. Run yum update on all storage servers
5. Reboot all storage servers

Actual results
==============

Storage servers have no network access after boot. When I log in via the serial console, I see that:

* the only network device in up state is loopback
* both /etc/sysconfig/network-scripts/ifcfg-{eth0,ovirtmgmt} files are deleted
* vdsm (ovirt manager) seems to be involved (see below)

Checking /var/log/messages:

~~~
$ grep eth0 messages
Oct 15 14:57:03 dhcp-125-8 kernel: device eth0 entered promiscuous mode
Oct 15 14:57:03 dhcp-125-8 kernel: ovirtmgmt: port 1(eth0) entering forwarding state
Oct 15 14:57:55 dhcp-125-8 kernel: 8021q: adding VLAN 0 to HW filter on device eth0
Oct 15 14:57:56 dhcp-125-8 kernel: ovirtmgmt: port 1(eth0) entering disabled state
Oct 15 14:57:57 dhcp-125-8 kernel: device eth0 left promiscuous mode
Oct 15 14:57:57 dhcp-125-8 kernel: ovirtmgmt: port 1(eth0) entering disabled state
~~~

Last few lines:

~~~
Oct 15 14:57:48 dhcp-125-8 kernel: lo: Disabled Privacy Extensions
Oct 15 14:57:55 dhcp-125-8 kernel: Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Oct 15 14:57:55 dhcp-125-8 kernel: 802.1Q VLAN Support v1.8 Ben Greear <greearb>
Oct 15 14:57:55 dhcp-125-8 kernel: All bugs added by David S. Miller <davem>
Oct 15 14:57:55 dhcp-125-8 kernel: 8021q: adding VLAN 0 to HW filter on device eth0
Oct 15 14:57:56 dhcp-125-8 ntpd[6875]: ntpd exiting on signal 15
Oct 15 14:57:56 dhcp-125-8 ntpd[7735]: ntpd 4.2.6p5 Tue Apr 28 10:15:27 UTC 2015 (1)
Oct 15 14:57:56 dhcp-125-8 ntpd[7736]: proto: precision = 0.211 usec
Oct 15 14:57:56 dhcp-125-8 ntpd[7736]: 0.0.0.0 c01d 0d kern kernel time sync enabled
Oct 15 14:57:56 dhcp-125-8 ntpd[7736]: Listen and drop on 0 v4wildcard 0.0.0.0 UDP 123
Oct 15 14:57:56 dhcp-125-8 ntpd[7736]: Listen and drop on 1 v6wildcard :: UDP 123
Oct 15 14:57:56 dhcp-125-8 ntpd[7736]: Listen normally on 2 lo 127.0.0.1 UDP 123
Oct 15 14:57:56 dhcp-125-8 ntpd[7736]: Listen normally on 3 ovirtmgmt 10.34.125.8 UDP 123
Oct 15 14:57:56 dhcp-125-8 ntpd[7736]: Listening on routing socket on fd #20 for interface updates
Oct 15 14:57:56 dhcp-125-8 ntpd[7736]: Deferring DNS for 0.rhel.pool.ntp.org 1
Oct 15 14:57:56 dhcp-125-8 kernel: ovirtmgmt: port 1(eth0) entering disabled state
Oct 15 14:57:56 dhcp-125-8 ntpd[7736]: Deferring DNS for 1.rhel.pool.ntp.org 1
Oct 15 14:57:56 dhcp-125-8 ntpd[7736]: Deferring DNS for 2.rhel.pool.ntp.org 1
Oct 15 14:57:56 dhcp-125-8 ntpd[7736]: Deferring DNS for 3.rhel.pool.ntp.org 1
Oct 15 14:57:56 dhcp-125-8 ntpd[7736]: 0.0.0.0 c016 06 restart
Oct 15 14:57:56 dhcp-125-8 ntpd[7736]: 0.0.0.0 c012 02 freq_set kernel -10.082 PPM
Oct 15 14:57:57 dhcp-125-8 kernel: device eth0 left promiscuous mode
Oct 15 14:57:57 dhcp-125-8 kernel: ovirtmgmt: port 1(eth0) entering disabled state
Oct 15 14:57:57 dhcp-125-8 kernel: ovirtmgmt: mixed no checksumming and other settings.
Oct 15 14:57:58 dhcp-125-8 vdsm vds WARNING MOM is not available, fallback to KsmMonitor
Oct 15 14:57:58 dhcp-125-8 ntpd[7736]: Deleting interface #3 ovirtmgmt, 10.34.125.8#123, interface stats: received=0, sent=0, dropped=0, active_time=2 secs
~~~

Expected results
================

Network works as usual; the ifcfg files are not deleted.
Additional info
===============

When I boot into single user mode after the update and start the network manually, instead of doing a normal boot, the network works (no ifcfg files are deleted). This means the ifcfg files are deleted during the first normal multiuser boot after the update.
Created attachment 1083307 [details]
Output of yum update
This looks similar to Bug 1277951, which was closed due to lack of repro data. Triveni, Martin, can you check whether the 3.1.2 vdsm solves the issue? We rebased the 3.1.2 vdsm to 4.16.30 to solve some of the network issues.
Using my archive of libvirt snapshots, I restored the machines from Sep 08 2015, re-registered them into the current stable cdn channels and updated the storage servers (following the steps to reproduce from this BZ) to:

1) the current stable version from cdn (RHGS 3.1.1):
   vdsm-4.16.20-1.2.el6rhs -> vdsm-4.16.20-1.3.el6rhs
   (this is the same situation as described in this BZ)

2) the latest version from the qe puddle repo (RHGS 3.1.2):
   vdsm-4.16.20-1.2.el6rhs -> vdsm-4.16.30-1.3.el6rhs

I was able to reproduce the issue in both cases.
From supervdsm.log after reboot:

~~~
restore-net::INFO::2016-01-08 10:10:33,243::ifcfg::423::root::(_loadBackupFiles) Loaded /var/lib/vdsm/netconfback/ifcfg-eth0
restore-net::INFO::2016-01-08 10:10:33,243::ifcfg::423::root::(_loadBackupFiles) Loaded /var/lib/vdsm/netconfback/ifcfg-ovirtmgmt
~~~

But it looks like the contents of these files are empty, which causes the files to be removed further down in the log:

~~~
restore-net::DEBUG::2016-01-08 10:10:34,060::ifcfg::377::root::(restoreAtomicBackup) Removing empty configuration backup /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt
restore-net::DEBUG::2016-01-08 10:10:34,060::ifcfg::377::root::(restoreAtomicBackup) Removing empty configuration backup /etc/sysconfig/network-scripts/ifcfg-eth0
~~~

Dan, what would cause the backup files to be empty?
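The failure mode suggested by these log lines can be sketched outside vdsm. This is a hypothetical illustration, not vdsm source: a backup whose content is empty, or consists only of a marker comment, is treated as "the original file never existed", so restore deletes the live ifcfg file instead of restoring it.

```shell
# Hypothetical sketch, NOT vdsm's actual code: restore one ifcfg backup
# the way the restoreAtomicBackup log messages suggest it behaves.
MARKER='# original file did not exist'

restore_one() {
  # $1 = backup file, $2 = live ifcfg file
  content=$(cat "$1")
  if [ -z "$content" ] || [ "$content" = "$MARKER" ]; then
    # "Empty" backup: the original supposedly never existed,
    # so the live configuration file gets removed.
    rm -f "$2"
  else
    # Otherwise the backed-up content is restored.
    cp "$1" "$2"
  fi
}
```

With backup files holding only the marker line, this path deletes both ifcfg-eth0 and ifcfg-ovirtmgmt, which would match the "Removing empty configuration backup" messages above.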
Additional information, state of the vdsm netconfback files before update and reboot:

~~~
# ls -l /var/lib/vdsm/netconfback/
total 8
-rw-r--r--. 1 vdsm root 30 Jan 13 14:45 ifcfg-eth0
-rw-r--r--. 1 vdsm root 30 Jan 13 14:45 ifcfg-ovirtmgmt
# cat /var/lib/vdsm/netconfback/ifcfg-eth0
# original file did not exist
# cat /var/lib/vdsm/netconfback/ifcfg-ovirtmgmt
# original file did not exist
~~~

While the actual ifcfg files were there:

~~~
# cat /etc/sysconfig/network-scripts/ifcfg-eth0
# Generated by VDSM version 4.16.20-1.2.el6rhs
DEVICE=eth0
HWADDR=52:54:00:70:7a:8c
BRIDGE=ovirtmgmt
ONBOOT=yes
MTU=1500
NM_CONTROLLED=no
# cat /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt
# Generated by VDSM version 4.16.20-1.2.el6rhs
DEVICE=ovirtmgmt
TYPE=Bridge
DELAY=0
STP=off
ONBOOT=yes
BOOTPROTO=dhcp
MTU=1500
DEFROUTE=yes
NM_CONTROLLED=no
~~~
Additional information: using libvirt and guestfish tools, I scripted extraction of the entire /var/lib/vdsm/ directory into a tarball for each snapshot of node1 I have, then searched for netconfback files in each tarball. It turned out that this issue is likely related to the nagios configuration (at least in my case). See an excerpt of my snapshot list:

~~~
 Name                      Creation Time              State
------------------------------------------------------------
 ...
 w37_01_rhgsinstalled      2015-09-08 10:57:23 +0200  shutoff
 w37_02_volumedefined      2015-09-08 11:28:25 +0200  shutoff
 w37_03_nagiosconfigured   2015-09-08 17:21:14 +0200  shutoff
 ...
~~~

The almost empty netconfback files appeared in the w37_03_nagiosconfigured snapshot for the first time. For every snapshot before that, the directory was empty. Note that w37_03_nagiosconfigured is the snapshot I restored when I hit this issue for the first time (as I noted in comment 6).
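The extraction step described above can be sketched with libguestfs; the helper names are mine and the real script is not attached, so treat this as an illustration under those assumptions.

```shell
# Hypothetical sketch of the per-snapshot extraction: open a snapshot's
# disk image read-only and pull /var/lib/vdsm out as a tarball.
extract_vdsm_dir() {
  # $1 = disk image path, $2 = output tarball
  # --ro: never modify the image; -i: auto-mount the guest filesystems
  guestfish --ro -a "$1" -i tgz-out /var/lib/vdsm "$2"
}

list_netconfback() {
  # $1 = tarball produced above; print any netconfback entries it holds
  tar -tzf "$1" | grep netconfback
}
```

Running `list_netconfback` against each snapshot's tarball is then enough to spot the snapshot in which the netconfback files first appeared.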
I don't think Nagios has anything to do with network configuration. I feel there is something wrong with vdsm during the upgrade. Maybe edwardh or Dan can help us here. Have you faced similar issues with VDSM in RHEV-M?
Looks like this problem is caused by "net_persistence = ifcfg" in /etc/vdsm/vdsm.conf. Somehow we reach a state where /var/lib/vdsm/netconfback/ifcfg-eth0 and /var/lib/vdsm/netconfback/ifcfg-ovirtmgmt have the content '# original file did not exist' while "net_persistence = ifcfg" is set in /etc/vdsm/vdsm.conf. Rebooting the system in this state removes the ifcfg files in /var/lib/vdsm/netconfback/ and /etc/sysconfig/network-scripts/ and leaves the system without any network.

If I remove "net_persistence = ifcfg" from /etc/vdsm/vdsm.conf and reboot, the problem is not seen. This problem is reproducible with vdsm-4.16.20-1.2.el6rhs.x86_64.

Edward Haas: Do you have any idea why the network-config backup files in /var/lib/vdsm/netconfback/ don't contain any valid data? Do you see a similar issue in RHEV-M?
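The workaround of removing the setting can be expressed as a one-line edit. This is a sketch under the assumptions in this BZ (the helper name is mine; back up vdsm.conf before editing it on a real host):

```shell
# Drop any "net_persistence = ifcfg" override so vdsm falls back to its
# default persistence mode. Sketch only; take a backup of the file first.
remove_ifcfg_persistence() {
  # $1 = path to vdsm.conf (normally /etc/vdsm/vdsm.conf)
  sed -i '/^[[:space:]]*net_persistence[[:space:]]*=[[:space:]]*ifcfg/d' "$1"
}
```

The regex tolerates both the spaced form ("net_persistence = ifcfg") and the compact form ("net_persistence=ifcfg") seen in this report.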
A similar bug, bz#1263979, was fixed in vdsm-4.16.28-1.el6ev.x86_64 for RHEV-M 3.5.6, but I am able to reproduce this issue with 4.16.30-1.
We have added "net_persistence=ifcfg" as a workaround for bz#1203422. It was suggested by Dan in "https://bugzilla.redhat.com/show_bug.cgi?id=1215011#c2". But bz#1203422 is fixed now. So can we remove this config?

Regards,
Ramesh

We would prefer moving to the unified persistent mode (the default), so removing the config is recommended. If moving back to the unified persistent mode solves your issue, would it satisfy this bug? We would prefer not to touch the older ifcfg persistent mode too much, keeping our focus on the unified mode.
(In reply to Edward Haas from comment #16)
> We have added "net_persistence=ifcfg" as a workaround for bz#1203422. It
> was suggested by Dan in
> "https://bugzilla.redhat.com/show_bug.cgi?id=1215011#c2". But bz#1203422 is
> fixed now. So can we remove this config?
>
> Regards,
> Ramesh
>
> We would prefer moving to the unified persistent mode (the default), so
> removing the config is recommended.

We can move to the unified persistent mode; we don't see any issue with that. But I would like to understand the scenario in which /var/lib/vdsm/netconfback/ifcfg-eth0 and /var/lib/vdsm/netconfback/ifcfg-ovirtmgmt would have the content '# original file did not exist'. This will help me understand the seriousness of this issue.

Regards,
Ramesh

> If moving back to the unified persistent mode solves your issue, would it
> satisfy this bug? We would prefer not to touch the older ifcfg persistent
> mode too much, keeping our focus on the unified mode.
"net_persistence=ifcfg" was added as an workaround for bz#1203422. bz#1203422 is already fixed. So we can remove "net_persistence=ifcfg" from /etc/vdsm/vdsm.conf. Note: Though "net_persistence=ifcfg" was added during rpm install/update, it is getting removed during setupHostNetworks.
Reverted the patch as mentioned in comment 18
I have verified the bug and found no issues with the fixed-in version vdsm-4.16.30-1.4 on both RHEL 7 and RHEL 6 nodes.

Steps followed:
1. Installed an RHGS 3.1.1 node and added it to RHSC.
2. Confirmed that "net_persistence=ifcfg" is not present.
3. Updated the vdsm package to 3.1.3 and put the node in maintenance mode.
4. Rebooted the node and confirmed that the network configs are proper.
5. Activated the host on RHSC and found no issues.
6. The same steps were followed for 3.1.2 as well.

~~~
[root@dhcp35-139 ~]# cat /var/lib/vdsm/netconfback/ifcfg-ovirtmgmt
cat: /var/lib/vdsm/netconfback/ifcfg-ovirtmgmt: No such file or directory
[root@dhcp35-139 ~]# cat /etc/vdsm/vdsm.conf | grep ifcfg
[root@dhcp35-139 ~]#
[root@dhcp35-139 ~]# cat /etc/vdsm/vdsm.conf
[vars]
ssl = true

[addresses]
management_port = 54321
[root@dhcp35-139 ~]#
~~~
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1242