Bug 1459353 - rhosp-director: OSP9 'openstack undercloud upgrade' fails. br-ctlplane is down as a result of yum updating openvswitch-2.4.0-1.el7.x86_64 to openvswitch-2.5.0-14.git20160727.el7fdp.x86_64
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 9.0 (Mitaka)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: async
Target Release: 9.0 (Mitaka)
Assignee: Sofer Athlan-Guyot
QA Contact: Amit Ugol
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-06-06 22:19 UTC by Alexander Chuzhoy
Modified: 2018-08-22 13:07 UTC
10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-08-22 13:07:59 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 473548 0 None None None 2017-06-12 17:59:37 UTC

Description Alexander Chuzhoy 2017-06-06 22:19:46 UTC
rhosp-director: OSP9 'openstack undercloud upgrade' fails on undercloud. br-ctlplane is down


Environment:
openstack-tripleo-heat-templates-liberty-2.0.0-54.el7ost.noarch
python-neutron-lbaas-tests-8.1.0-2.el7ost.noarch
python-neutronclient-4.1.1-2.el7ost.noarch
python-neutron-vpnaas-8.0.0-1.el7ost.noarch
openvswitch-2.5.0-14.git20160727.el7fdp.x86_64
python-neutron-fwaas-tests-8.0.0-3.el7ost.noarch
python-neutron-8.3.0-5.el7ost.noarch
python-neutron-lbaas-8.1.0-2.el7ost.noarch
openstack-tripleo-heat-templates-2.0.0-54.el7ost.noarch
openstack-puppet-modules-8.1.10-2.el7ost.noarch
python-neutron-fwaas-8.0.0-3.el7ost.noarch
openstack-neutron-common-8.3.0-5.el7ost.noarch
openstack-neutron-ml2-8.3.0-5.el7ost.noarch
python-openvswitch-2.5.0-14.git20160727.el7fdp.noarch
openstack-neutron-openvswitch-8.3.0-5.el7ost.noarch
python-neutron-tests-8.3.0-5.el7ost.noarch
instack-undercloud-4.0.0-16.el7ost.noarch
python-neutron-vpnaas-tests-8.0.0-1.el7ost.noarch
openstack-neutron-8.3.0-5.el7ost.noarch
python-neutron-lib-0.0.2-1.el7ost.noarch



Steps to reproduce:
Follow the minor update procedure on the undercloud:
1. sudo systemctl stop 'openstack-*' 'neutron-*' httpd
2. sudo yum update python-tripleoclient
3. openstack undercloud upgrade

Result:
The process gets stuck and errors are shown:
Notice: /Stage[main]/Ironic::Inspector/Service[ironic-inspector]/ensure: ensure changed 'stopped' to 'running'
Error: Could not start Service[ironic-inspector-dnsmasq]: Execution of '/bin/systemctl start openstack-ironic-inspector-dnsmasq' returned 1: Job for openstack-ironic-inspector-dnsmasq.service failed because the control process exited with error code. See "systemctl status openstack-ironic-inspector-dnsmasq.service" and "journalctl -xe" for details.
Wrapped exception:
Execution of '/bin/systemctl start openstack-ironic-inspector-dnsmasq' returned 1: Job for openstack-ironic-inspector-dnsmasq.service failed because the control process exited with error code. See "systemctl status openstack-ironic-inspector-dnsmasq.service" and "journalctl -xe" for details.
Error: /Stage[main]/Ironic::Inspector/Service[ironic-inspector-dnsmasq]/ensure: change from stopped to running failed: Could not start Service[ironic-inspector-dnsmasq]: Execution of '/bin/systemctl start openstack-ironic-inspector-dnsmasq' returned 1: Job for openstack-ironic-inspector-dnsmasq.service failed because the control process exited with error code. See "systemctl status openstack-ironic-inspector-dnsmasq.service" and "journalctl -xe" for details.
Notice: /Stage[main]/Heat/Heat_config[trustee/project_domain_id]/value: value changed 'Default' to 'default'
Notice: /Stage[main]/Heat/Heat_config[trustee/user_domain_id]/value: value changed 'Default' to 'default'
Notice: /Stage[main]/Ironic::Db/Ironic_config[database/connection]/value: value changed '[old secret redacted]' to '[new secret redacted]'
mysql+pymysql://ceilometer:be1f7b1dd0ea5204e404d859397dcfd86284c9f1.120.101/ceilometer

Error: /Stage[main]/Ironic::Db::Sync/Exec[ironic-dbsync]: Failed to call refresh: Command exceeded timeout
Error: /Stage[main]/Ironic::Db::Sync/Exec[ironic-dbsync]: Command exceeded timeout
Wrapped exception:
execution expired
Notice: /Stage[main]/Ironic::Conductor/Service[ironic-conductor]/ensure: ensure changed 'stopped' to 'running'
Notice: /Stage[main]/Ironic::Api/Service[ironic-api]/ensure: ensure changed 'stopped' to 'running'
Notice: /Stage[main]/Rabbitmq::Service/Service[rabbitmq-server]: Triggered 'refresh' from 1 events
Notice: /Stage[main]/Aodh::Listener/Service[aodh-listener]/ensure: ensure changed 'stopped' to 'running'
Notice: /Stage[main]/Aodh::Notifier/Service[aodh-notifier]/ensure: ensure changed 'stopped' to 'running'
Notice: /Stage[main]/Main/File[/etc/keystone/ssl/private/signing_key.pem]/content: content changed '{md5}e5deff5063bd1a07f875e890979ae397' to '{md5}f184d11e42b9d7e5b0ea0e4c9ea7ec68'
Notice: /Stage[main]/Main/File[/etc/keystone/ssl/certs/ca.pem]/content: content changed '{md5}11fb3e3e8382b9789d2daadfe31604fe' to '{md5}ce2e87a7cec218b5843f5fbdcca17156'
Notice: /Stage[main]/Main/File[/etc/keystone/ssl/certs/signing_cert.pem]/content: content changed '{md5}cbec476ab13a22a40e893b8239ee26a8' to '{md5}cf7ed2b1405ec367800ee324a81e3ae7'
Notice: /Stage[main]/Swift::Proxy/Concat[/etc/swift/proxy-server.conf]/File[/etc/swift/proxy-server.conf]/mode: mode changed '0644' to '0640'
Notice: /Stage[main]/Swift::Storage::All/Swift::Storage::Server[6002]/Concat[/etc/swift/account-server.conf]/File[/etc/swift/account-server.conf]/mode: mode changed '0644' to '0640'
Notice: /Stage[main]/Swift::Storage::All/Swift::Storage::Server[6001]/Concat[/etc/swift/container-server.conf]/File[/etc/swift/container-server.conf]/mode: mode changed '0644' to '0640'
5382a5d036786fcf993343d60b720ddd48ae2d67
Notice: /Stage[main]/Swift::Storage::All/Swift::Storage::Server[6000]/Concat[/etc/swift/object-server.conf]/File[/etc/swift/object-server.conf]/mode: mode changed '0644' to '0640'
Notice: /File[/etc/ironic/ironic.conf]/seluser: seluser changed 'unconfined_u' to 'system_u'
Notice: /Stage[main]/Apache/File[/etc/httpd/conf.modules.d/00-proxy.conf]/ensure: removed
Notice: /Stage[main]/Apache/File[/etc/httpd/conf.modules.d/01-cgi.conf]/ensure: removed
Notice: /Stage[main]/Apache/File[/etc/httpd/conf.modules.d/00-mpm.conf]/ensure: removed
Notice: /Stage[main]/Apache/File[/etc/httpd/conf.modules.d/00-systemd.conf]/ensure: removed
Notice: /Stage[main]/Apache/File[/etc/httpd/conf.modules.d/00-base.conf]/ensure: removed
Notice: /Stage[main]/Apache/File[/etc/httpd/conf.modules.d/00-lua.conf]/ensure: removed
Notice: /Stage[main]/Apache/File[/etc/httpd/conf.modules.d/00-dav.conf]/ensure: removed
Notice: /Stage[main]/Main/Nova_config[DEFAULT/my_ip]/value: value changed '192.168.120.101' to '192.168.250.101'
Notice: /Stage[main]/Nova::Network::Neutron/Nova_config[DEFAULT/dhcp_domain]/value: value changed 'fv1dci.org' to ''
Notice: /Stage[main]/Nova::Deps/Anchor[nova::config::end]: Triggered 'refresh' from 2 events
Error: /Stage[main]/Nova::Db::Sync/Exec[nova-db-sync]: Failed to call refresh: Command exceeded timeout
Error: /Stage[main]/Nova::Db::Sync/Exec[nova-db-sync]: Command exceeded timeout
Wrapped exception:
execution expired




Error: /Stage[main]/Nova::Db::Sync_api/Exec[nova-db-sync-api]: Failed to call refresh: Command exceeded timeout
Error: /Stage[main]/Nova::Db::Sync_api/Exec[nova-db-sync-api]: Command exceeded timeout
Wrapped exception:
execution expired
Notice: /Stage[main]/Nova::Deps/Anchor[nova::service::begin]: Triggered 'refresh' from 1 events
Notice: /Stage[main]/Nova::Cert/Nova::Generic_service[cert]/Service[nova-cert]/ensure: ensure changed 'stopped' to 'running'
Notice: /Stage[main]/Nova::Api/Nova::Generic_service[api]/Service[nova-api]/ensure: ensure changed 'stopped' to 'running'
Error: Could not start Service[nova-scheduler]: Execution of '/bin/systemctl start openstack-nova-scheduler' returned 1: Job for openstack-nova-scheduler.service failed because the control process exited with error code. See "systemctl status openstack-nova-scheduler.service" and "journalctl -xe" for details.
Wrapped exception:
Execution of '/bin/systemctl start openstack-nova-scheduler' returned 1: Job for openstack-nova-scheduler.service failed because the control process exited with error code. See "systemctl status openstack-nova-scheduler.service" and "journalctl -xe" for details.
Error: /Stage[main]/Nova::Scheduler/Nova::Generic_service[scheduler]/Service[nova-scheduler]/ensure: change from stopped to running failed: Could not start Service[nova-scheduler]: Execution of '/bin/systemctl start openstack-nova-scheduler' returned 1: Job for openstack-nova-scheduler.service failed because the control process exited with error code. See "systemctl status openstack-nova-scheduler.service" and "journalctl -xe" for details.
Notice: /Stage[main]/Nova::Scheduler/Nova::Generic_service[scheduler]/Service[nova-scheduler]: Triggered 'refresh' from 1 events
Notice: /Stage[main]/Nova::Conductor/Nova::Generic_service[conductor]/Service[nova-conductor]/ensure: ensure changed 'stopped' to 'running'



[root@director ~]# systemctl status openstack-ironic-inspector-dnsmasq
● openstack-ironic-inspector-dnsmasq.service - PXE boot dnsmasq service for Ironic Inspector
   Loaded: loaded (/usr/lib/systemd/system/openstack-ironic-inspector-dnsmasq.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2017-06-06 15:34:04 CDT; 1h 44min ago
  Process: 17569 ExecStart=/sbin/dnsmasq --conf-file=/etc/ironic-inspector/dnsmasq.conf (code=exited, status=2)
 Main PID: 28349 (code=exited, status=0/SUCCESS)

Jun 06 15:34:04 director.fv1dci.org systemd[1]: Starting PXE boot dnsmasq service for Ironic Inspector...
Jun 06 15:34:04 director.fv1dci.org dnsmasq[17569]: dnsmasq: unknown interface br-ctlplane
Jun 06 15:34:04 director.fv1dci.org systemd[1]: openstack-ironic-inspector-dnsmasq.service: control process exited, code=exited status=2
Jun 06 15:34:04 director.fv1dci.org systemd[1]: Failed to start PXE boot dnsmasq service for Ironic Inspector.
Jun 06 15:34:04 director.fv1dci.org systemd[1]: Unit openstack-ironic-inspector-dnsmasq.service entered failed state.
Jun 06 15:34:04 director.fv1dci.org systemd[1]: openstack-ironic-inspector-dnsmasq.service failed.


Note the "dnsmasq: unknown interface br-ctlplane" line.



Expected result:
Minor update on the undercloud completes successfully.
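The dnsmasq failure above comes down to br-ctlplane being absent or down. A minimal check for that condition can be sketched as follows (the sample line is a hypothetical stand-in for real `ip -o link show br-ctlplane` output; on the undercloud you would capture it live):

```shell
# Check whether br-ctlplane is administratively up, based on the flags
# field of `ip -o link` output. The sample line below is a stand-in;
# on the undercloud, capture it with:
#   line=$(ip -o link show br-ctlplane 2>/dev/null)
line='8: br-ctlplane: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN'

if [ -z "$line" ]; then
    echo "br-ctlplane: interface missing"   # dnsmasq's failure mode above
elif grep -q '<[^>]*,UP[,>]' <<<"$line"; then
    echo "br-ctlplane: up"
else
    echo "br-ctlplane: down"
fi
```

The flags inside `<...>` reflect administrative state, which is what matters here; OVS internal ports often report `state UNKNOWN` even when up, so matching `state UP` would misfire.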

Comment 2 Alexander Chuzhoy 2017-06-06 22:47:56 UTC
Before update:

[stack@director ~]$ sudo ip -4 -o a
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: eth0    inet 192.168.250.101/23 brd 192.168.251.255 scope global eth0\       valid_lft forever preferred_lft forever
4: eth2    inet 192.168.110.101/24 brd 192.168.110.255 scope global eth2\       valid_lft forever preferred_lft forever
5: eth3    inet 192.168.140.101/24 brd 192.168.140.255 scope global eth3\       valid_lft forever preferred_lft forever
6: eth4    inet 192.168.190.101/24 brd 192.168.190.255 scope global eth4\       valid_lft forever preferred_lft forever
8: br-ctlplane    inet 192.168.120.101/24 brd 192.168.120.255 scope global br-ctlplane\       valid_lft forever preferred_lft forever

After update:

[stack@director ~]$ sudo ip -4 -o a
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: eth0    inet 192.168.250.101/23 brd 192.168.251.255 scope global eth0\       valid_lft forever preferred_lft forever
4: eth2    inet 192.168.110.101/24 brd 192.168.110.255 scope global eth2\       valid_lft forever preferred_lft forever
5: eth3    inet 192.168.140.101/24 brd 192.168.140.255 scope global eth3\       valid_lft forever preferred_lft forever
6: eth4    inet 192.168.190.101/24 brd 192.168.190.255 scope global eth4\       valid_lft forever preferred_lft forever

Comment 3 Assaf Muller 2017-06-06 23:11:02 UTC
Some random notes after looking at the sosreport for a bit:

OVS updated from 2.4 to 2.5:
yum.log: 15:23:45 Updated: openvswitch-2.5.0-14.git20160727.el7fdp.x86_64

OVS restart:
messages: 15:29:25 director systemd: Stopping Open vSwitch...

ovs-vsctl show output shows that br-int is set to fail_mode secure, but br-ctlplane isn't. br-ctlplane dump-flows shows the NORMAL action in place.

Unclear why eth1 and br-ctlplane are down and without an IP.
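The fail_mode and flow observations above can be reproduced with `ovs-vsctl get-fail-mode` and `ovs-ofctl dump-flows`. A sketch (bridge names from this report; the assigned values are hypothetical stand-ins for output you would capture on the undercloud):

```shell
# Sketch of the checks described above. On the undercloud the real
# values would come from:
#   ctlplane_mode=$(ovs-vsctl get-fail-mode br-ctlplane)
#   int_mode=$(ovs-vsctl get-fail-mode br-int)
#   ctlplane_flows=$(ovs-ofctl dump-flows br-ctlplane)
# The assignments below are stand-ins matching this comment.
ctlplane_mode=''      # empty = standalone: bridge keeps forwarding on its own
int_mode='secure'     # secure: flows dropped until a controller reinstalls them
ctlplane_flows=' cookie=0x0, duration=42.1s, table=0, n_packets=0, n_bytes=0, priority=0 actions=NORMAL'

[ -z "$ctlplane_mode" ] && echo 'br-ctlplane: standalone fail_mode'
[ "$int_mode" = secure ] && echo 'br-int: secure fail_mode'
grep -q 'actions=NORMAL' <<<"$ctlplane_flows" && echo 'br-ctlplane: NORMAL action present'
```

A standalone bridge keeps its NORMAL forwarding across an ovs-vswitchd restart, which is consistent with br-ctlplane still having flows but no IP after the update.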

Comment 4 Alexander Chuzhoy 2017-06-06 23:19:16 UTC
After rebooting the machine, the br-ctlplane was UP. 
Re-ran "openstack undercloud upgrade" and it completed successfully.

Comment 5 Alexander Chuzhoy 2017-06-07 19:27:51 UTC
Easier workaround:
Before running "openstack undercloud upgrade", run "sudo yum update openvswitch".
This will result in br-ctlplane being down.

Then run: "sudo ifup br-ctlplane"
Now run: "openstack undercloud upgrade".
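Put together, the workaround reads as a short script (a sketch based on this comment; it assumes the provisioning bridge is named br-ctlplane, as in this report):

```shell
#!/usr/bin/env bash
# Workaround sketch from comment 5 (assumes the provisioning bridge
# is br-ctlplane). Defined as a function; invoke it on the undercloud
# before the minor update.
apply_ovs_workaround() {
    # Update openvswitch by itself; its postun scriptlet restarts the
    # service, which is what leaves br-ctlplane down.
    sudo yum -y update openvswitch
    # Bring the bridge (and its IP) back up before upgrading.
    sudo ifup br-ctlplane
    # Proceed with the normal minor update.
    openstack undercloud upgrade
}

# On the undercloud, run:
#   apply_ovs_workaround
```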

Comment 6 Assaf Muller 2017-06-07 20:02:54 UTC
Hi Sofer,

It seems like we need the same workaround for updating OVS we have on overcloud nodes to be present on undercloud nodes.

Comment 7 Sofer Athlan-Guyot 2017-06-08 09:36:37 UTC
Hi Sasha,

could you try running this shell script in the $HOME directory before running "3. openstack undercloud upgrade":

cat > upgrade-ovs.sh <<'EOF'
if [[ -n $(rpm -q --scripts openvswitch | awk '/postuninstall/,/*/' | grep "systemctl.*try-restart") ]]; then
    echo "Manual upgrade of openvswitch - restart in postun detected"
    mkdir OVS_UPGRADE || true
    pushd OVS_UPGRADE
    echo "Attempting to downloading latest openvswitch with yumdownloader"
    yumdownloader --resolve openvswitch
    echo "Updating openvswitch with nopostun option"
    rpm -U --replacepkgs --nopostun ./*.rpm
    popd
else
    echo "Skipping manual upgrade of openvswitch - no restart in postun detected"
fi
EOF

bash -x upgrade-ovs.sh

and then do your update.

This is what we currently use for OSP9 on the overcloud.  I want to make sure it corrects the issue here as well.
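The detection step in the script can be isolated into a function and exercised against captured `rpm -q --scripts` output (a sketch; the sample text mirrors the scriptlet output shown in comment 8, and `awk '/postuninstall/,0'` is an explicit print-to-end-of-input range equivalent to the script's intent):

```shell
#!/usr/bin/env bash
# Sketch: the postun-restart detection from the script above, as a
# function taking captured `rpm -q --scripts openvswitch` output so
# it can be tested without touching the RPM database.
detects_postun_restart() {
    # Print from the "postuninstall" header to end of input, then
    # look for a systemctl try-restart of the service.
    awk '/postuninstall/,0' <<<"$1" | grep -q 'systemctl.*try-restart'
}

# Example, mirroring the scriptlet output shown in comment 8:
sample='postuninstall scriptlet (using /bin/sh):
        systemctl try-restart openvswitch.service >/dev/null 2>&1 || :'
if detects_postun_restart "$sample"; then
    echo 'restart in postun detected'
fi
```

When the restart is detected, the script above sidesteps it by installing the downloaded RPM with `rpm -U --nopostun`, so the postun scriptlet (and the service restart it triggers) never runs.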

Comment 8 Alexander Chuzhoy 2017-06-09 13:56:15 UTC
Previous version:
openvswitch-2.4.0-1.el7.x86_64

Ran:
[stack@director ~]$ cat > upgrade-ovs.sh <<'EOF'
> if [[ -n $(rpm -q --scripts openvswitch | awk '/postuninstall/,/*/' | grep "systemctl.*try-restart") ]]; then
>     echo "Manual upgrade of openvswitch - restart in postun detected"
>     mkdir OVS_UPGRADE || true
>     pushd OVS_UPGRADE
>     echo "Attempting to downloading latest openvswitch with yumdownloader"
>     yumdownloader --resolve openvswitch
>     echo "Updating openvswitch with nopostun option"
>     rpm -U --replacepkgs --nopostun ./*.rpm
>     popd
> else
>     echo "Skipping manual upgrade of openvswitch - no restart in postun detected"
> fi
> EOF
[stack@director ~]$ sudo bash -x upgrade-ovs.sh
++ rpm -q --scripts openvswitch
++ awk '/postuninstall/,/*/'
++ grep 'systemctl.*try-restart'
+ [[ -n         systemctl try-restart openvswitch.service >/dev/null 2>&1 || :  ]]
+ echo 'Manual upgrade of openvswitch - restart in postun detected'
Manual upgrade of openvswitch - restart in postun detected
+ mkdir OVS_UPGRADE
+ pushd OVS_UPGRADE
/home/stack/OVS_UPGRADE /home/stack
+ echo 'Attempting to downloading latest openvswitch with yumdownloader'
Attempting to downloading latest openvswitch with yumdownloader
+ yumdownloader --resolve openvswitch
Configuration file /etc/yum/pluginconf.d/versionlock.conf not found
Unable to find configuration file for plugin versionlock
Loaded plugins: product-id
--> Running transaction check
---> Package openvswitch.x86_64 0:2.5.0-14.git20160727.el7fdp will be installed
--> Finished Dependency Resolution
openvswitch-2.5.0-14.git20160727.el7fdp.x86_64.rpm                                                                                                                     | 3.6 MB  00:00:01     
+ echo 'Updating openvswitch with nopostun option'
Updating openvswitch with nopostun option
+ rpm -U --replacepkgs --nopostun ./openvswitch-2.5.0-14.git20160727.el7fdp.x86_64.rpm
+ popd
/home/stack
[stack@director ~]$ rpm -q openvswitch
openvswitch-2.5.0-14.git20160727.el7fdp.x86_64


Ran: openstack undercloud upgrade

Completed successfully and the issue with losing IP didn't occur.

Comment 9 Sofer Athlan-Guyot 2018-08-22 13:07:59 UTC
Hi,

So the disconnection issue is mainly tied to the test setup, where br-ctlplane is also used as a gateway.  This does not happen in real-world deployments, and no customers have reported this issue either.

So I'm closing this as WONTFIX.  The workaround provided in comment 7 can still be used.

If any of these assumptions are wrong, don't hesitate to re-open the BZ.

Regards,

