Bug 1406478

Summary: openstack update fails with lost network connection
Product: Red Hat OpenStack
Reporter: Randy Perryman <randy_perryman>
Component: rhosp-director
Assignee: Sofer Athlan-Guyot <sathlang>
Status: CLOSED DUPLICATE
QA Contact: Omri Hochman <ohochman>
Severity: urgent
Priority: unspecified
Version: 8.0 (Liberty)
CC: arkady_kanevsky, audra_cooper, bfournie, bgalvani, cdevine, christopher_dearborn, dbecker, dcain, dsneddon, John_walsh, kazen, kurt_hey, lbezdick, mburns, mcornea, morazi, randy_perryman, rhel-osp-director-maint, smerrow, sreichar, sukulkar, thaller
Hardware: Unspecified
OS: Unspecified
Last Closed: 2017-01-03 19:37:50 UTC
Type: Bug
Bug Depends On: 1367580, 1388546
Bug Blocks: 1261979, 1295530, 1305654
Attachments:
  Description                            Flags
  SOS report from affected controller    randy_perryman: review-
  SOS report from affected Controller    none
  heat deployment-show                   none

Description Randy Perryman 2016-12-20 16:08:07 UTC
I am running the openstack update command and during the update of the controllers the network is dropping.

The controller has an Intel X710 NIC installed.
Looking at the logs, I see a yum update is running, and the disconnect occurs each time with the following packages:

Dec 20 15:52:26 Updated: 1:iwl1000-firmware-39.31.5.1-49.el7.noarch
Dec 20 15:52:26 Updated: iwl2000-firmware-18.168.6.1-49.el7.noarch
Dec 20 15:52:26 Updated: iwl5000-firmware-8.83.5.1_1-49.el7.noarch
Dec 20 15:52:27 Updated: iwl2030-firmware-18.168.6.1-49.el7.noarch
Dec 20 15:52:27 Updated: iwl5150-firmware-8.24.2.2-49.el7.noarch
Dec 20 15:52:27 Updated: iwl6000-firmware-9.221.4.1-49.el7.noarch
Dec 20 15:52:27 Updated: iwl3160-firmware-22.0.7.0-49.el7.noarch
Dec 20 15:52:28 Updated: iwl135-firmware-18.168.6.1-49.el7.noarch
Dec 20 15:52:28 Updated: iwl7260-firmware-22.0.7.0-49.el7.noarch
Dec 20 15:52:28 Updated: iwl3945-firmware-15.32.2.9-49.el7.noarch
Dec 20 15:52:28 Updated: iwl6050-firmware-41.28.5.1-49.el7.noarch
Dec 20 15:52:28 Updated: iwl100-firmware-39.31.5.1-49.el7.noarch
Dec 20 15:52:29 Updated: iwl6000g2b-firmware-17.168.5.2-49.el7.noarch
Dec 20 15:52:29 Updated: iwl6000g2a-firmware-17.168.5.3-49.el7.noarch
Dec 20 15:52:29 Updated: iwl4965-firmware-228.61.2.24-49.el7.noarch
Dec 20 15:52:30 Updated: iwl7265-firmware-22.0.7.0-49.el7.noarch
Dec 20 15:52:30 Updated: 1:NetworkManager-config-server-1.4.0-13.el7_3.x86_64
Dec 20 15:52:30 Updated: iwl105-firmware-18.168.6.1-49.el7.noarch

NetworkManager and the Intel wireless firmware are being updated.

Comment 1 Randy Perryman 2016-12-20 16:13:30 UTC
Created attachment 1233925 [details]
SOS report from affected controller

Comment 2 Randy Perryman 2016-12-20 16:15:42 UTC
Created attachment 1233927 [details]
SOS report from affected Controller

Comment 3 Randy Perryman 2016-12-20 16:20:06 UTC
This is the command I am running:

openstack overcloud update stack overcloud -i \
  --templates ~/pilot/templates/overcloud \
  -e ~/pilot/templates/overcloud/overcloud-resource-registry-puppet.yaml \
  -e ~/pilot/templates/overcloud/environments/network-isolation.yaml \
  -e ~/pilot/templates/overcloud/environments/storage-environment.yaml \
  -e ~/pilot/templates/overcloud/environments/puppet-pacemaker.yaml \
  -e ~/pilot/templates/dell-environment.yaml \
  -e ~/pilot/templates/network-environment.yaml

Needless to say, it fails when the controller network disappears.

I did ensure the network was set up with the correct OVS_EXTRA commands.

Comment 4 Randy Perryman 2016-12-20 16:34:55 UTC
Rebooted the node and ran yum update -y: no packages needed updating.

Comment 5 Randy Perryman 2016-12-21 13:29:17 UTC
Just did a remote update of the iwl* packages and the controller did not go offline.

Comment 6 Lukas Bezdicka 2016-12-21 16:33:38 UTC
Probably a duplicate of #1388546

Comment 7 Randy Perryman 2016-12-21 17:21:44 UTC
Just did another update and noticed that connectivity was lost during the NetworkManager update.

Comment 8 Randy Perryman 2016-12-22 15:26:25 UTC
Looking at the logs, NetworkManager seems to be manipulating the NICs/bonds/VLANs even though NM_CONTROLLED=no is set.

Comment 9 Randy Perryman 2016-12-22 15:27:28 UTC
Currently the only way to do an update is to watch every node and do a network restart from the console.

Comment 10 Randy Perryman 2016-12-22 15:35:55 UTC
This is from the section of the log when the update occurred. Until I did a network restart, the networks were down.

Why is NetworkManager touching these, when all ifcfg files have NM_CONTROLLED=no?

Dec 22 15:18:01 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info>  kernel firmware directory '/lib/firmware' changed
Dec 22 15:18:05 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info>  kernel firmware directory '/lib/firmware' changed
Dec 22 15:19:22 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info>  kernel firmware directory '/lib/firmware' changed
Dec 22 15:21:15 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info>  kernel firmware directory '/lib/firmware' changed
Dec 22 15:21:20 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info>  kernel firmware directory '/lib/firmware' changed
Dec 22 15:21:39 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info>  (br-bond0): device state change: activated -> unmanaged (reason 'removed') [100 10 36]
Dec 22 15:21:39 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info>  NetworkManager state is now CONNECTED_LOCAL
Dec 22 15:21:39 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <warn>  (br-bond0): failed to disable userspace IPv6LL address handling
Dec 22 15:21:39 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info>  (vlan170): device state change: activated -> unmanaged (reason 'removed') [100 10 36]
Dec 22 15:21:39 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <warn>  (vlan170): failed to disable userspace IPv6LL address handling
Dec 22 15:21:39 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <warn>  (26) failed to call dispatcher scripts: (dbus-glib-error-quark:16) Type of message, '(sa{sa{sv}}a{sv}a{sv}a{
Dec 22 15:21:39 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <warn>  (27) failed to call dispatcher scripts: (dbus-glib-error-quark:16) Type of message, '(sa{sa{sv}}a{sv}a{sv}a{
Dec 22 15:21:39 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <warn>  (br-bond1): failed to disable userspace IPv6LL address handling
Dec 22 15:21:39 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info>  (vlan180): device state change: activated -> unmanaged (reason 'removed') [100 10 36]
Dec 22 15:21:39 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <warn>  (vlan180): failed to disable userspace IPv6LL address handling
Dec 22 15:21:39 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <warn>  (28) failed to call dispatcher scripts: (dbus-glib-error-quark:16) Type of message, '(sa{sa{sv}}a{sv}a{sv}a{
Dec 22 15:21:40 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info>  (bond1): enslaved to non-master-type device ovs-system; ignoring
Dec 22 15:21:40 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info>  (vlan180): new Generic device (carrier: OFF, driver: 'openvswitch', ifindex: 13)
Dec 22 15:21:40 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info>  (br-bond1): new Generic device (carrier: OFF, driver: 'openvswitch', ifindex: 14)
Dec 22 15:21:40 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info>  (vlan170): new Generic device (carrier: OFF, driver: 'openvswitch', ifindex: 15)
Dec 22 15:21:40 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info>  (bond0): enslaved to non-master-type device ovs-system; ignoring
Dec 22 15:21:40 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info>  (br-bond0): new Generic device (carrier: OFF, driver: 'openvswitch', ifindex: 16)
Dec 22 15:22:05 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info>  wpa_supplicant stopped
Dec 22 15:22:05 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info>  wpa_supplicant running
Dec 22 15:22:16 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info>  wpa_supplicant die count reset
Dec 22 15:30:47 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info>  ifcfg-rh: new connection /etc/sysconfig/netwo
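
The claim that every ifcfg file carries NM_CONTROLLED=no can be checked mechanically. What follows is a minimal sketch, not something taken from this thread: the helper name list_nm_managed_ifcfg is hypothetical, and the default directory is assumed to be the usual RHEL location /etc/sysconfig/network-scripts.

```shell
# Hypothetical helper: print every ifcfg-* file in a directory that does
# NOT contain an NM_CONTROLLED=no line (optionally double-quoted,
# matched case-insensitively).
list_nm_managed_ifcfg() {
    local dir="${1:-/etc/sysconfig/network-scripts}"
    local f
    for f in "$dir"/ifcfg-*; do
        # Skip the literal glob when no ifcfg files exist.
        [ -f "$f" ] || continue
        if ! grep -Eiq '^NM_CONTROLLED="?no"?[[:space:]]*$' "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

Running list_nm_managed_ifcfg on an affected node and getting no output would mean every ifcfg file has the setting. The pattern expects no spaces around the equals sign, so a line written as "NM_CONTROLLED = no" would be reported as missing the setting.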

Comment 11 Bob Fournier 2016-12-22 19:22:06 UTC
Hi Randy,

I'm just trying to sequence the attached sosreport with what you are seeing and when
the problem occurs.   

Towards the end of var/log/messages in the sosreport I see:
Dec 20 16:07:34 overcloud-controller-0 os-collect-config: ---> Package NetworkManager.x86_64 1:1.0.6-29.el7_2 will be updated
Dec 20 16:07:34 overcloud-controller-0 os-collect-config: ---> Package NetworkManager.x86_64 1:1.4.0-13.el7_3 will be an update
Dec 20 16:07:34 overcloud-controller-0 os-collect-config: ---> Package NetworkManager-config-server.x86_64 1:1.0.6-29.el7_2 will be updated
Dec 20 16:07:34 overcloud-controller-0 os-collect-config: ---> Package NetworkManager-config-server.x86_64 1:1.4.0-13.el7_3 will be an update
Dec 20 16:07:34 overcloud-controller-0 os-collect-config: ---> Package NetworkManager-libnm.x86_64 1:1.0.6-29.el7_2 will be updated
Dec 20 16:07:34 overcloud-controller-0 os-collect-config: ---> Package NetworkManager-libnm.x86_64 1:1.4.0-13.el7_3 will be an update
Dec 20 16:07:34 overcloud-controller-0 os-collect-config: ---> Package NetworkManager-team.x86_64 1:1.0.6-29.el7_2 will be updated
Dec 20 16:07:34 overcloud-controller-0 os-collect-config: ---> Package NetworkManager-team.x86_64 1:1.4.0-13.el7_3 will be an update
Dec 20 16:07:34 overcloud-controller-0 os-collect-config: ---> Package NetworkManager-tui.x86_64 1:1.0.6-29.el7_2 will be updated
Dec 20 16:07:34 overcloud-controller-0 os-collect-config: ---> Package NetworkManager-tui.x86_64 1:1.4.0-13.el7_3 will be an update
Dec 20 16:07:34 overcloud-controller-0 os-collect-config: ---> Package os-net-config.noarch 0:0.2.3-2.el7ost will be updated
Dec 20 16:07:34 overcloud-controller-0 os-collect-config: ---> Package os-net-config.noarch 0:0.2.3-4.el7ost will be an update

Indicating NetworkManager has not been updated yet.

But earlier I see:
Dec 20 15:48:51 overcloud-controller-0 yum[117670]: Updated: 1:NetworkManager-tui-1.4.0-13.el7_3.x86_64
Dec 20 15:52:30 overcloud-controller-0 yum[117670]: Updated: 1:NetworkManager-config-server-1.4.0-13.el7_3.x86_64

It's after these earlier yum update messages that we see NM messages about the
links going up and down, although there is no indication that the links are managed by NM:
Dec 20 15:53:01 overcloud-controller-0 NetworkManager[1705]: <warn>  (br-ex): failed to disable userspace IPv6LL address handling
Dec 20 15:53:01 overcloud-controller-0 NetworkManager[1705]: <warn>  (vlan190): failed to disable userspace IPv6LL address handling
Dec 20 15:53:02 overcloud-controller-0 NetworkManager[1705]: <info>  (em4): link disconnected
Dec 20 15:53:02 overcloud-controller-0 NetworkManager[1705]: <warn>  (br-int): failed to disable userspace IPv6LL address handling
Dec 20 15:53:02 overcloud-controller-0 NetworkManager[1705]: <warn>  (br-tenant): failed to disable userspace IPv6LL address handling
Dec 20 15:53:02 overcloud-controller-0 NetworkManager[1705]: <warn>  (vlan170): failed to disable userspace IPv6LL address handling
Dec 20 15:53:02 overcloud-controller-0 NetworkManager[1705]: <info>  (em3): link disconnected
Dec 20 15:53:02 overcloud-controller-0 NetworkManager[1705]: <warn>  (vlan140): failed to disable userspace IPv6LL address handling
Dec 20 15:53:02 overcloud-controller-0 NetworkManager[1705]: <info>  (em4): link connected
Dec 20 15:53:02 overcloud-controller-0 NetworkManager[1705]: <info>  (bond1): link disconnected
Dec 20 15:53:03 overcloud-controller-0 NetworkManager[1705]: <info>  (bond1): link connected
Dec 20 15:53:03 overcloud-controller-0 NetworkManager[1705]: <info>  (em3): link connected


(Note that the "failed to disable userspace IPv6LL" messages can be safely ignored according to this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1323571)

So when in the sequence is the link disconnection occurring?  Thanks.
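
One way to line up the package updates with the link events is to filter var/log/messages down to just those two kinds of entries. This is a rough sketch only: the function name nm_timeline is hypothetical, and the patterns assume the syslog line shapes quoted in this comment (yum[PID]: Updated: ... and NetworkManager[PID]: ... link connected/disconnected).

```shell
# Hypothetical helper: print, in file order, the yum "Updated:" entries for
# NetworkManager packages and the NetworkManager link up/down entries.
nm_timeline() {
    local log="${1:-/var/log/messages}"
    grep -E 'yum\[[0-9]+\]: Updated: .*NetworkManager|NetworkManager\[[0-9]+\]: .*link (dis)?connected' "$log"
}
```

Since syslog entries are already in chronological order, the output reads as a timeline. The "failed to disable userspace IPv6LL" warnings are deliberately left out.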

Comment 14 Beniamino Galvani 2017-01-02 17:48:04 UTC
Hi,

I have analyzed logs in the sosreport and I haven't found any
indication that NM is touching the network interfaces. I see messages
like these:

 Dec 20 15:34:01 overcloud-controller-0 os-collect-config: [2016/12/20 03:34:01 PM] [INFO] running ifdown on bridge: br-tenant
 Dec 20 15:34:01 overcloud-controller-0 NetworkManager[1705]: <info>  (br-tenant): link disconnected
 Dec 20 15:34:01 overcloud-controller-0 NET[68678]: /etc/sysconfig/network-scripts/ifdown-post : updated /etc/resolv.conf
 Dec 20 15:34:01 overcloud-controller-0 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl -t 10 -- --if-exists del-br br-tenant
 Dec 20 15:34:01 overcloud-controller-0 kernel: device br-tenant left promiscuous mode
 Dec 20 15:34:01 overcloud-controller-0 NetworkManager[1705]: <warn>  (br-tenant): failed to disable userspace IPv6LL address handling

where NM simply recognizes that the interface changed state.

For a further analysis it would be useful to know approximately at
which time the network connection dropped.


Instead, logs from comment 10 (which have a date different from the
sosreport) show that NM is tracking the activation of devices:

 Dec 22 15:21:39 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info>  (br-bond0): device state change: activated -> unmanaged (reason 'removed') [100 10 36]

and there are some errors probably caused by a mismatch between the
NetworkManager-dispatcher-script and NetworkManager versions:

 Dec 22 15:21:39 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <warn>  (26) failed to call dispatcher scripts: (dbus-glib-error-quark:16) Type of message, '(sa{sa{sv}}a{sv}a{sv}a{

But it's difficult to understand what is happening without a sosreport
from this machine, or full logs.

Comment 15 Randy Perryman 2017-01-03 13:55:03 UTC
The network disappears right after yum.log shows NetworkManager updated.
So comment #0 is the yum tail. For one of the nodes I did a yum update iwl* beforehand, and then the network disappeared literally right after the NetworkManager update.

Comment 16 Randy Perryman 2017-01-03 14:45:07 UTC
I am attaching the output of heat deployment-show from a second cluster on which we hit this. We have now reproduced this on two separate clusters.

Steps to reproduce:

1. Run the OpenStack update command for the cluster.
2. The update starts on a controller.
3. Somewhere during the yum update the networks go offline and do not come back online.
4. The cluster then fails to start, with no interfaces defined.
5. The update fails.

If you watch the nodes as the update goes through each one and manually restart the network right after yum updates NetworkManager, you can catch it and the update will succeed. I had to do this for all controllers and computes.
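
The "right after yum updates NetworkManager" moment can be spotted in yum.log rather than by eye. This is a minimal sketch under stated assumptions: the function name nm_update_lines is hypothetical, and the pattern assumes the yum.log line format shown in the description (an optional epoch such as "1:" before the package name).

```shell
# Hypothetical helper: print the yum.log lines that record an update of a
# NetworkManager package (NetworkManager itself or a NetworkManager-* subpackage).
nm_update_lines() {
    local log="${1:-/var/log/yum.log}"
    grep -E 'Updated: ([0-9]+:)?NetworkManager-' "$log"
}
```

On a node being updated, a non-empty result from nm_update_lines would be the cue for the manual network restart described above. The sketch only detects the update; it does not restart anything.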

Comment 17 Randy Perryman 2017-01-03 14:46:04 UTC
Created attachment 1236906 [details]
heat deployment-show

Shows what the actual error is.

Comment 18 Lukas Bezdicka 2017-01-03 14:54:11 UTC
Have you tried applying the fix from https://bugzilla.redhat.com/show_bug.cgi?id=1388546 -> https://bugs.launchpad.net/tripleo/+bug/1635205 ?

Comment 19 Randy Perryman 2017-01-03 15:17:35 UTC
I can confirm that we are going from OVS 2.4 to OVS 2.5

[heat-admin@overcloud-controller-0 ~]$ rpm -qa | grep openv
openvswitch-2.4.0-2.el7_2.x86_64
openstack-neutron-openvswitch-7.0.1-15.el7ost.noarch
python-openvswitch-2.4.0-2.el7_2.noarch
[heat-admin@overcloud-controller-0 ~]$ exit
logout
Connection to 192.168.120.194 closed.
[osp_admin@director ~]$ ssh cntl1
Last login: Tue Jan  3 14:37:27 2017 from gateway
[heat-admin@overcloud-controller-1 ~]$ rpm -qa |grep openv
openvswitch-2.5.0-14.git20160727.el7fdp.x86_64
openstack-neutron-openvswitch-7.2.0-5.el7ost.noarch
python-openvswitch-2.5.0-14.git20160727.el7fdp.noarch
[heat-admin@overcloud-controller-1 ~]$


-----------
So the fix looks to be:
1. Stop the cluster in full.
2. Stop openvswitch (if all IPs are on openvswitch, that is an issue).
3. Update openvswitch.
4. Restart everything.
5. Run the update?
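
Whether a given node has already crossed the 2.4 to 2.5 boundary shown in the rpm output above can be read off the package string. A small sketch, assuming the package naming seen in this comment; the function name ovs_major_minor is hypothetical.

```shell
# Hypothetical helper: extract "MAJOR.MINOR" from an openvswitch package
# string as printed by rpm, e.g. openvswitch-2.4.0-2.el7_2.x86_64.
# Prints nothing for other packages such as python-openvswitch.
ovs_major_minor() {
    printf '%s\n' "$1" |
        sed -n 's/^openvswitch-\([0-9][0-9]*\.[0-9][0-9]*\)\..*/\1/p'
}
```

For example, ovs_major_minor "$(rpm -q openvswitch)" run on each controller would show which nodes are still on the older series.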

Comment 20 Sofer Athlan-Guyot 2017-01-03 18:55:00 UTC
Hi,

The workaround that has been implemented revolves around not
triggering the post hook during the install of openvswitch.

Could you verify that this code is indeed present in the tripleo heat
templates?  In particular in

    ~/pilot/templates/overcloud/extraconfig/tasks/yum_update.sh

(it's usually in /usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/yum_update.sh)

A simple:

    grep -r 'rpm.*nopostrun' ~/pilot/templates

should do.
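
A slightly more forgiving version of that check is sketched below. rpm documents the option that skips the post-uninstall scriptlet as --nopostun, so a pattern spelled "nopostrun" would not match a line that uses the documented flag. The sketch therefore accepts both spellings. The function name has_nopostun_workaround is hypothetical, and the exact line used in the templates is an assumption here, not something confirmed in this thread.

```shell
# Hypothetical helper: succeed (exit 0) if any of the given files contains
# an rpm invocation with --nopostun (or the "nopostrun" spelling used in
# the grep above). Returns 2 when called without file arguments.
has_nopostun_workaround() {
    [ "$#" -gt 0 ] || return 2
    grep -Eqs 'rpm .*--nopostr?un' "$@"
}
```

It could be pointed at both copies of the script, for example has_nopostun_workaround ~/pilot/templates/overcloud/extraconfig/tasks/yum_update.sh /usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/yum_update.sh, and its exit status checked.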

Comment 21 Randy Perryman 2017-01-03 19:16:02 UTC
[osp_admin@director ~]$ grep -r 'rpm.*nopostrun' ~/pilot/templates
[osp_admin@director ~]$
[osp_admin@director ~]$ find ./ -name yum_update.sh
./pilot/templates/overcloud/kilo/extraconfig/tasks/yum_update.sh
./pilot/templates/overcloud/extraconfig/tasks/yum_update.sh
[osp_admin@director ~]$
-----------
also here
[osp_admin@director tasks]$ pwd
/usr/share/openstack-tripleo-heat-templates/extraconfig/tasks
[osp_admin@director tasks]$ grep -r 'rpm.*nopostrun' ./*
[osp_admin@director tasks]$

Comment 22 Randy Perryman 2017-01-03 19:21:10 UTC
heat templates installed:

openstack-tripleo-heat-templates-0.8.14-21.el7ost.noarch

Comment 23 Randy Perryman 2017-01-03 19:30:16 UTC
Checking https://github.com/openstack/tripleo-heat-templates/blob/liberty-eol/extraconfig/tasks/yum_update.sh

I see that nopostrun is not part of the Liberty tag, but the Mitaka branch has it.

Comment 24 Randy Perryman 2017-01-03 19:37:50 UTC
Closing this bug as a duplicate of 1388546. Going to escalate 1388546.

*** This bug has been marked as a duplicate of bug 1388546 ***