Bug 1406478
Summary: openstack update fails with lost network connection
Product: Red Hat OpenStack
Component: rhosp-director
Version: 8.0 (Liberty)
Status: CLOSED DUPLICATE
Severity: urgent
Priority: unspecified
Reporter: Randy Perryman <randy_perryman>
Assignee: Sofer Athlan-Guyot <sathlang>
QA Contact: Omri Hochman <ohochman>
CC: arkady_kanevsky, audra_cooper, bfournie, bgalvani, cdevine, christopher_dearborn, dbecker, dcain, dsneddon, John_walsh, kazen, kurt_hey, lbezdick, mburns, mcornea, morazi, randy_perryman, rhel-osp-director-maint, smerrow, sreichar, sukulkar, thaller
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2017-01-03 19:37:50 UTC
Bug Depends On: 1367580, 1388546
Bug Blocks: 1261979, 1295530, 1305654
Description
Randy Perryman
2016-12-20 16:08:07 UTC
Created attachment 1233925 [details]
SOS report from affected controller
Created attachment 1233927 [details]
SOS report from affected Controller
This is the command I am running:

openstack overcloud update stack overcloud -i --templates ~/pilot/templates/overcloud -e ~/pilot/templates/overcloud/overcloud-resource-registry-puppet.yaml -e ~/pilot/templates/overcloud/environments/network-isolation.yaml -e ~/pilot/templates/overcloud/environments/storage-environment.yaml -e ~/pilot/templates/overcloud/environments/puppet-pacemaker.yaml -e ~/pilot/templates/dell-environment.yaml -e ~/pilot/templates/network-environment.yaml

Needless to say, it fails when the controller network disappears. I did ensure the network was set up with the correct OVS_EXTRA commands.

Rebooted the node and ran yum update -y: no packages needed.

Just did a remote update of the iwl* packages and the controller did not go offline.

Probably a duplicate of #1388546.

Just did another update and noticed that connectivity was lost during the NetworkManager update.

Looking at the logs, NetworkManager seems to be manipulating the NICs/bonds/VLANs even though NM_CONTROLLED=no is set. Currently the only way to do an update is to watch every node and do a network restart from the console.

This is from the section of the log when the update occurred. Until I did a network restart the networks were down.
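To double-check the NM_CONTROLLED=no claim on each node rather than eyeballing the files, something like the following could be used. This is only a minimal sketch: the function names are hypothetical, it assumes the ifcfg files live under /etc/sysconfig/network-scripts, and it assumes the key is written in the plain shell form NM_CONTROLLED=no (optionally quoted), with no spaces around the equals sign.

```python
import glob
import re

# Hypothetical helper names; only the NM_CONTROLLED key itself comes from the thread.
_ASSIGNMENT = re.compile(r'^NM_CONTROLLED=["\']?([^"\'\s#]*)')


def nm_controlled_value(ifcfg_text):
    """Return the NM_CONTROLLED value (lowercased) from ifcfg text, or None if unset."""
    value = None
    for raw in ifcfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and commented-out assignments
        match = _ASSIGNMENT.match(line)
        if match:
            value = match.group(1).lower()  # a later assignment overrides an earlier one
    return value


def files_not_opted_out(directory="/etc/sysconfig/network-scripts"):
    """List ifcfg-* files under directory whose NM_CONTROLLED is not 'no'."""
    offenders = []
    for path in sorted(glob.glob(directory + "/ifcfg-*")):
        with open(path, "r", errors="replace") as handle:
            if nm_controlled_value(handle.read()) != "no":
                offenders.append(path)
    return offenders
```

Calling files_not_opted_out() on a node returns the ifcfg files that are missing the setting or set it to something other than "no"; an empty list would support the statement that every interface is opted out.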
Why is NetworkManager touching these, when all ifcfg files have NM_CONTROLLED=no?

Dec 22 15:18:01 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info> kernel firmware directory '/lib/firmware' changed
Dec 22 15:18:05 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info> kernel firmware directory '/lib/firmware' changed
Dec 22 15:19:22 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info> kernel firmware directory '/lib/firmware' changed
Dec 22 15:21:15 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info> kernel firmware directory '/lib/firmware' changed
Dec 22 15:21:20 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info> kernel firmware directory '/lib/firmware' changed
Dec 22 15:21:39 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info> (br-bond0): device state change: activated -> unmanaged (reason 'removed') [100 10 36]
Dec 22 15:21:39 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info> NetworkManager state is now CONNECTED_LOCAL
Dec 22 15:21:39 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <warn> (br-bond0): failed to disable userspace IPv6LL address handling
Dec 22 15:21:39 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info> (vlan170): device state change: activated -> unmanaged (reason 'removed') [100 10 36]
Dec 22 15:21:39 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <warn> (vlan170): failed to disable userspace IPv6LL address handling
Dec 22 15:21:39 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <warn> (26) failed to call dispatcher scripts: (dbus-glib-error-quark:16) Type of message, '(sa{sa{sv}}a{sv}a{sv}a{
Dec 22 15:21:39 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <warn> (27) failed to call dispatcher scripts: (dbus-glib-error-quark:16) Type of message, '(sa{sa{sv}}a{sv}a{sv}a{
Dec 22 15:21:39 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <warn> (br-bond1): failed to disable userspace IPv6LL address handling
Dec 22 15:21:39 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info> (vlan180): device state change: activated -> unmanaged (reason 'removed') [100 10 36]
Dec 22 15:21:39 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <warn> (vlan180): failed to disable userspace IPv6LL address handling
Dec 22 15:21:39 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <warn> (28) failed to call dispatcher scripts: (dbus-glib-error-quark:16) Type of message, '(sa{sa{sv}}a{sv}a{sv}a{
Dec 22 15:21:40 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info> (bond1): enslaved to non-master-type device ovs-system; ignoring
Dec 22 15:21:40 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info> (vlan180): new Generic device (carrier: OFF, driver: 'openvswitch', ifindex: 13)
Dec 22 15:21:40 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info> (br-bond1): new Generic device (carrier: OFF, driver: 'openvswitch', ifindex: 14)
Dec 22 15:21:40 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info> (vlan170): new Generic device (carrier: OFF, driver: 'openvswitch', ifindex: 15)
Dec 22 15:21:40 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info> (bond0): enslaved to non-master-type device ovs-system; ignoring
Dec 22 15:21:40 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info> (br-bond0): new Generic device (carrier: OFF, driver: 'openvswitch', ifindex: 16)
Dec 22 15:22:05 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info> wpa_supplicant stopped
Dec 22 15:22:05 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info> wpa_supplicant running
Dec 22 15:22:16 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info> wpa_supplicant die count reset
Dec 22 15:30:47 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info> ifcfg-rh: new connection /etc/sysconfig/netwo

Hi Randy,

I'm just trying to sequence the attached sosreport with what you are seeing and when the problem occurs. Towards the end of var/log/messages in the sosreport I see:

Dec 20 16:07:34 overcloud-controller-0 os-collect-config: ---> Package NetworkManager.x86_64 1:1.0.6-29.el7_2 will be updated
Dec 20 16:07:34 overcloud-controller-0 os-collect-config: ---> Package NetworkManager.x86_64 1:1.4.0-13.el7_3 will be an update
Dec 20 16:07:34 overcloud-controller-0 os-collect-config: ---> Package NetworkManager-config-server.x86_64 1:1.0.6-29.el7_2 will be updated
Dec 20 16:07:34 overcloud-controller-0 os-collect-config: ---> Package NetworkManager-config-server.x86_64 1:1.4.0-13.el7_3 will be an update
Dec 20 16:07:34 overcloud-controller-0 os-collect-config: ---> Package NetworkManager-libnm.x86_64 1:1.0.6-29.el7_2 will be updated
Dec 20 16:07:34 overcloud-controller-0 os-collect-config: ---> Package NetworkManager-libnm.x86_64 1:1.4.0-13.el7_3 will be an update
Dec 20 16:07:34 overcloud-controller-0 os-collect-config: ---> Package NetworkManager-team.x86_64 1:1.0.6-29.el7_2 will be updated
Dec 20 16:07:34 overcloud-controller-0 os-collect-config: ---> Package NetworkManager-team.x86_64 1:1.4.0-13.el7_3 will be an update
Dec 20 16:07:34 overcloud-controller-0 os-collect-config: ---> Package NetworkManager-tui.x86_64 1:1.0.6-29.el7_2 will be updated
Dec 20 16:07:34 overcloud-controller-0 os-collect-config: ---> Package NetworkManager-tui.x86_64 1:1.4.0-13.el7_3 will be an update
Dec 20 16:07:34 overcloud-controller-0 os-collect-config: ---> Package os-net-config.noarch 0:0.2.3-2.el7ost will be updated
Dec 20 16:07:34 overcloud-controller-0 os-collect-config: ---> Package os-net-config.noarch 0:0.2.3-4.el7ost will be an update

This indicates that NetworkManager has not been updated yet.
But earlier I see:

Dec 20 15:48:51 overcloud-controller-0 yum[117670]: Updated: 1:NetworkManager-tui-1.4.0-13.el7_3.x86_64
Dec 20 15:52:30 overcloud-controller-0 yum[117670]: Updated: 1:NetworkManager-config-server-1.4.0-13.el7_3.x86_64

It is after these earlier yum update messages that we see NM messages about the links going up and down, although there is no indication that the links are managed by NM:

Dec 20 15:53:01 overcloud-controller-0 NetworkManager[1705]: <warn> (br-ex): failed to disable userspace IPv6LL address handling
Dec 20 15:53:01 overcloud-controller-0 NetworkManager[1705]: <warn> (vlan190): failed to disable userspace IPv6LL address handling
Dec 20 15:53:02 overcloud-controller-0 NetworkManager[1705]: <info> (em4): link disconnected
Dec 20 15:53:02 overcloud-controller-0 NetworkManager[1705]: <warn> (br-int): failed to disable userspace IPv6LL address handling
Dec 20 15:53:02 overcloud-controller-0 NetworkManager[1705]: <warn> (br-tenant): failed to disable userspace IPv6LL address handling
Dec 20 15:53:02 overcloud-controller-0 NetworkManager[1705]: <warn> (vlan170): failed to disable userspace IPv6LL address handling
Dec 20 15:53:02 overcloud-controller-0 NetworkManager[1705]: <info> (em3): link disconnected
Dec 20 15:53:02 overcloud-controller-0 NetworkManager[1705]: <warn> (vlan140): failed to disable userspace IPv6LL address handling
Dec 20 15:53:02 overcloud-controller-0 NetworkManager[1705]: <info> (em4): link connected
Dec 20 15:53:02 overcloud-controller-0 NetworkManager[1705]: <info> (bond1): link disconnected
Dec 20 15:53:03 overcloud-controller-0 NetworkManager[1705]: <info> (bond1): link connected
Dec 20 15:53:03 overcloud-controller-0 NetworkManager[1705]: <info> (em3): link connected

(Note that the "failed to disable userspace IPv6LL" messages can be safely ignored according to this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1323571)

So when in the sequence is the link disconnection occurring?

Thanks.
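One way to answer the sequencing question from var/log/messages is to pull out the first yum "Updated:" entry that mentions NetworkManager and then list every "link disconnected" event NetworkManager logged at or after that time. The sketch below is an illustration only: the function names are hypothetical, the year is assumed to be 2016 because syslog timestamps omit it, and month names are assumed to be in English.

```python
import re
from datetime import datetime

# Matches "Dec 20 15:53:02 host tag[pid]: message" style syslog lines.
_SYSLOG = re.compile(
    r"^([A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (\S+) ([^:\[]+)(?:\[\d+\])?: (.*)$"
)
_LINK_DOWN = re.compile(r"^<info>\s+\(([^)]+)\): link disconnected$")


def _parse(line, year):
    """Split a syslog line into (timestamp, tag, message), or None if it does not match."""
    match = _SYSLOG.match(line.strip())
    if match is None:
        return None
    when = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
    return when, match.group(3), match.group(4)


def disconnects_after_nm_update(lines, year=2016):
    """Return (first NetworkManager package update time, [(time, device), ...]).

    Only disconnects logged at or after the first update are returned.
    If no NetworkManager package update is found, returns (None, []).
    """
    first_update = None
    drops = []
    for line in lines:
        parsed = _parse(line, year)
        if parsed is None:
            continue
        when, tag, message = parsed
        if tag == "yum" and message.startswith("Updated:") and "NetworkManager" in message:
            if first_update is None or when < first_update:
                first_update = when
        elif tag == "NetworkManager":
            down = _LINK_DOWN.match(message)
            if down:
                drops.append((when, down.group(1)))
    if first_update is None:
        return None, []
    return first_update, [(when, dev) for when, dev in drops if when >= first_update]
```

Feeding it the lines of var/log/messages from the sosreport (for example via open(path).readlines()) would give the update time and the devices that dropped afterwards, which is the ordering being asked about here.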
Hi,

I have analyzed the logs in the sosreport and I haven't found any indication that NM is touching the network interfaces. I see messages like these:

Dec 20 15:34:01 overcloud-controller-0 os-collect-config: [2016/12/20 03:34:01 PM] [INFO] running ifdown on bridge: br-tenant
Dec 20 15:34:01 overcloud-controller-0 NetworkManager[1705]: <info> (br-tenant): link disconnected
Dec 20 15:34:01 overcloud-controller-0 NET[68678]: /etc/sysconfig/network-scripts/ifdown-post : updated /etc/resolv.conf
Dec 20 15:34:01 overcloud-controller-0 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl -t 10 -- --if-exists del-br br-tenant
Dec 20 15:34:01 overcloud-controller-0 kernel: device br-tenant left promiscuous mode
Dec 20 15:34:01 overcloud-controller-0 NetworkManager[1705]: <warn> (br-tenant): failed to disable userspace IPv6LL address handling

where NM simply recognizes that the interface changed state. For further analysis it would be useful to know approximately at what time the network connection dropped.

Instead, the logs from comment 10 (which have a different date from the sosreport) show that NM is tracking the activation of devices:

Dec 22 15:21:39 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <info> (br-bond0): device state change: activated -> unmanaged (reason 'removed') [100 10 36]

and there are some errors probably caused by a mismatch between the NetworkManager-dispatcher-script and NetworkManager versions:

Dec 22 15:21:39 overcloud-cephstorage-0.localdomain NetworkManager[1683]: <warn> (26) failed to call dispatcher scripts: (dbus-glib-error-quark:16) Type of message, '(sa{sa{sv}}a{sv}a{sv}a{

But it's difficult to understand what is happening without a sosreport from this machine, or full logs.

The network disappears right after yum.log shows NetworkManager updated. So Comment #0 is the yum tail.

For one of the nodes I did a yum update iwl* beforehand, and then the network disappeared literally right after the NetworkManager update.
(In reply to Randy Perryman from comment #0)
> I am running the openstack update command and during the update of the
> controllers the network is dropping.
>
> Controller Intel X710 Nic installed
> Looking at the logs I see a Yum update is running and the disconnect occurs
> each time with the following packages:
>
> Dec 20 15:52:26 Updated: 1:iwl1000-firmware-39.31.5.1-49.el7.noarch
> Dec 20 15:52:26 Updated: iwl2000-firmware-18.168.6.1-49.el7.noarch
> Dec 20 15:52:26 Updated: iwl5000-firmware-8.83.5.1_1-49.el7.noarch
> Dec 20 15:52:27 Updated: iwl2030-firmware-18.168.6.1-49.el7.noarch
> Dec 20 15:52:27 Updated: iwl5150-firmware-8.24.2.2-49.el7.noarch
> Dec 20 15:52:27 Updated: iwl6000-firmware-9.221.4.1-49.el7.noarch
> Dec 20 15:52:27 Updated: iwl3160-firmware-22.0.7.0-49.el7.noarch
> Dec 20 15:52:28 Updated: iwl135-firmware-18.168.6.1-49.el7.noarch
> Dec 20 15:52:28 Updated: iwl7260-firmware-22.0.7.0-49.el7.noarch
> Dec 20 15:52:28 Updated: iwl3945-firmware-15.32.2.9-49.el7.noarch
> Dec 20 15:52:28 Updated: iwl6050-firmware-41.28.5.1-49.el7.noarch
> Dec 20 15:52:28 Updated: iwl100-firmware-39.31.5.1-49.el7.noarch
> Dec 20 15:52:29 Updated: iwl6000g2b-firmware-17.168.5.2-49.el7.noarch
> Dec 20 15:52:29 Updated: iwl6000g2a-firmware-17.168.5.3-49.el7.noarch
> Dec 20 15:52:29 Updated: iwl4965-firmware-228.61.2.24-49.el7.noarch
> Dec 20 15:52:30 Updated: iwl7265-firmware-22.0.7.0-49.el7.noarch
> Dec 20 15:52:30 Updated: 1:NetworkManager-config-server-1.4.0-13.el7_3.x86_64
> Dec 20 15:52:30 Updated: iwl105-firmware-18.168.6.1-49.el7.noarch
>
> Network Manager and Intel Wireless firmware are being updated.

I am attaching the output from heat deployment-show from a second cluster on which we hit this. We have now successfully reproduced this on two separate clusters:

1. Run the OpenStack update command for the cluster.
2. It updates a controller.
3. Somewhere during the yum update the networks go offline and do not come back online.
4. The cluster then fails to start with no interfaces defined.
5. The update fails!

If you watch the nodes as the update goes through each one and manually restart the networks right after yum updates NetworkManager, you can catch it and the update will succeed. I had to do this for all Controllers and Computes.

Created attachment 1236906 [details]
heat deployment-show
Shows what the actual error is.
Have you tried applying the fix from https://bugzilla.redhat.com/show_bug.cgi?id=1388546 -> https://bugs.launchpad.net/tripleo/+bug/1635205 ?

I can confirm that we are going from OVS 2.4 to OVS 2.5:

[heat-admin@overcloud-controller-0 ~]$ rpm -qa | grep openv
openvswitch-2.4.0-2.el7_2.x86_64
openstack-neutron-openvswitch-7.0.1-15.el7ost.noarch
python-openvswitch-2.4.0-2.el7_2.noarch
[heat-admin@overcloud-controller-0 ~]$ exit
logout
Connection to 192.168.120.194 closed.
[osp_admin@director ~]$ ssh cntl1
Last login: Tue Jan 3 14:37:27 2017 from gateway
[heat-admin@overcloud-controller-1 ~]$ rpm -qa |grep openv
openvswitch-2.5.0-14.git20160727.el7fdp.x86_64
openstack-neutron-openvswitch-7.2.0-5.el7ost.noarch
python-openvswitch-2.5.0-14.git20160727.el7fdp.noarch
[heat-admin@overcloud-controller-1 ~]$
-----------
So the fix looks to be:

1. stop the cluster in full
2. stop openvswitch (if all IPs are on openvswitch, that is an issue)
3. update openvswitch
4. restart everything
5. run update?

Hi,

the workaround that has been implemented revolves around not triggering the post hook during the install of openvswitch. Could you verify that this code is indeed present in the tripleo heat templates? In particular in ~/pilot/templates/overcloud/extraconfig/tasks/yum_update.sh (it's usually in /usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/yum_update.sh).

A simple:

grep -r 'rpm.*nopostrun' ~/pilot/templates

should do.
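For reference, the same check can be scripted so it reports file and line number. The following is a rough sketch and not part of the workaround itself: the function names are hypothetical, and the pattern is deliberately loosened to accept both nopostrun (as typed in the grep above) and nopostun, on the assumption that rpm spells its post-uninstall scriptlet option --nopostun, so a literal search for nopostrun could come back empty even on a patched template.

```python
import os
import re

# Accepts "nopostrun" (as in the grep above) and "nopostun" (assumed rpm spelling).
_PATTERN = re.compile(r"rpm.*nopostr?un")


def matching_lines(text):
    """Return (line_number, line) pairs in text that look like the rpm workaround."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if _PATTERN.search(line)
    ]


def scan_templates(root):
    """Walk root and return (path, line_number, line) for every matching line."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", errors="replace") as handle:
                    text = handle.read()
            except OSError:
                continue  # unreadable file: skip it rather than abort the scan
            hits.extend((path, number, line) for number, line in matching_lines(text))
    return hits
```

Running scan_templates(os.path.expanduser("~/pilot/templates")) on the director and getting an empty list would mean the same thing as the grep printing nothing: the templates in use do not carry the workaround.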
[osp_admin@director ~]$ grep -r 'rpm.*nopostrun' ~/pilot/templates
[osp_admin@director ~]$
[osp_admin@director ~]$ find ./ -name yum_update.sh
./pilot/templates/overcloud/kilo/extraconfig/tasks/yum_update.sh
./pilot/templates/overcloud/extraconfig/tasks/yum_update.sh
[osp_admin@director ~]$
-----------
also here
[osp_admin@director tasks]$ pwd
/usr/share/openstack-tripleo-heat-templates/extraconfig/tasks
[osp_admin@director tasks]$ grep -r 'rpm.*nopostrun' ./*
[osp_admin@director tasks]$

heat templates installed: openstack-tripleo-heat-templates-0.8.14-21.el7ost.noarch

Checking https://github.com/openstack/tripleo-heat-templates/blob/liberty-eol/extraconfig/tasks/yum_update.sh I see that nopostrun is not part of the Liberty tag, but the Mitaka branch has it.

Closing this bug as a duplicate of 1388546. Going to escalate 1388546.

*** This bug has been marked as a duplicate of bug 1388546 ***