Bug 1459353 - rhosp-director: OSP9 'openstack undercloud upgrade' fails. br-ctlplane is down as a result of yum updating openvswitch-2.4.0-1.el7.x86_64 to openvswitch-2.5.0-14.git20160727.el7fdp.x86_64
Status: ON_DEV
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 9.0 (Mitaka)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: async
Target Release: 9.0 (Mitaka)
Assigned To: Sofer Athlan-Guyot
QA Contact: Amit Ugol
Keywords: Triaged, ZStream
Reported: 2017-06-06 18:19 EDT by Alexander Chuzhoy
Modified: 2017-06-12 13:59 EDT (History)
CC: 10 users

Type: Bug




External Trackers
Tracker ID Priority Status Summary Last Updated
OpenStack gerrit 473548 None None None 2017-06-12 13:59 EDT

Description Alexander Chuzhoy 2017-06-06 18:19:46 EDT
rhosp-director: OSP9 'openstack undercloud upgrade' fails on undercloud. br-ctlplane is down


Environment:
openstack-tripleo-heat-templates-liberty-2.0.0-54.el7ost.noarch
python-neutron-lbaas-tests-8.1.0-2.el7ost.noarch
python-neutronclient-4.1.1-2.el7ost.noarch
python-neutron-vpnaas-8.0.0-1.el7ost.noarch
openvswitch-2.5.0-14.git20160727.el7fdp.x86_64
python-neutron-fwaas-tests-8.0.0-3.el7ost.noarch
python-neutron-8.3.0-5.el7ost.noarch
python-neutron-lbaas-8.1.0-2.el7ost.noarch
openstack-tripleo-heat-templates-2.0.0-54.el7ost.noarch
openstack-puppet-modules-8.1.10-2.el7ost.noarch
python-neutron-fwaas-8.0.0-3.el7ost.noarch
openstack-neutron-common-8.3.0-5.el7ost.noarch
openstack-neutron-ml2-8.3.0-5.el7ost.noarch
python-openvswitch-2.5.0-14.git20160727.el7fdp.noarch
openstack-neutron-openvswitch-8.3.0-5.el7ost.noarch
python-neutron-tests-8.3.0-5.el7ost.noarch
instack-undercloud-4.0.0-16.el7ost.noarch
python-neutron-vpnaas-tests-8.0.0-1.el7ost.noarch
openstack-neutron-8.3.0-5.el7ost.noarch
python-neutron-lib-0.0.2-1.el7ost.noarch



Steps to reproduce:
Follow the minor update procedure on the undercloud:
1. sudo systemctl stop 'openstack-*' 'neutron-*' httpd
2. sudo yum update python-tripleoclient
3. openstack undercloud upgrade

Result:
The process gets stuck and errors are shown:
Notice: /Stage[main]/Ironic::Inspector/Service[ironic-inspector]/ensure: ensure changed 'stopped' to 'running'
Error: Could not start Service[ironic-inspector-dnsmasq]: Execution of '/bin/systemctl start openstack-ironic-inspector-dnsmasq' returned 1: Job for openstack-ironic-inspector-dnsmasq.service failed because the control process exited with error code. See "systemctl status openstack-ironic-inspector-dnsmasq.service" and "journalctl -xe" for details.
Wrapped exception:
Execution of '/bin/systemctl start openstack-ironic-inspector-dnsmasq' returned 1: Job for openstack-ironic-inspector-dnsmasq.service failed because the control process exited with error code. See "systemctl status openstack-ironic-inspector-dnsmasq.service" and "journalctl -xe" for details.
Error: /Stage[main]/Ironic::Inspector/Service[ironic-inspector-dnsmasq]/ensure: change from stopped to running failed: Could not start Service[ironic-inspector-dnsmasq]: Execution of '/bin/systemctl start openstack-ironic-inspector-dnsmasq' returned 1: Job for openstack-ironic-inspector-dnsmasq.service failed because the control process exited with error code. See "systemctl status openstack-ironic-inspector-dnsmasq.service" and "journalctl -xe" for details.
Notice: /Stage[main]/Heat/Heat_config[trustee/project_domain_id]/value: value changed 'Default' to 'default'
Notice: /Stage[main]/Heat/Heat_config[trustee/user_domain_id]/value: value changed 'Default' to 'default'
Notice: /Stage[main]/Ironic::Db/Ironic_config[database/connection]/value: value changed '[old secret redacted]' to '[new secret redacted]'
mysql+pymysql://ceilometer:be1f7b1dd0ea5204e404d859397dcfd86284c9f1@192.168.120.101/ceilometer

Error: /Stage[main]/Ironic::Db::Sync/Exec[ironic-dbsync]: Failed to call refresh: Command exceeded timeout
Error: /Stage[main]/Ironic::Db::Sync/Exec[ironic-dbsync]: Command exceeded timeout
Wrapped exception:
execution expired
Notice: /Stage[main]/Ironic::Conductor/Service[ironic-conductor]/ensure: ensure changed 'stopped' to 'running'
Notice: /Stage[main]/Ironic::Api/Service[ironic-api]/ensure: ensure changed 'stopped' to 'running'
Notice: /Stage[main]/Rabbitmq::Service/Service[rabbitmq-server]: Triggered 'refresh' from 1 events
Notice: /Stage[main]/Aodh::Listener/Service[aodh-listener]/ensure: ensure changed 'stopped' to 'running'
Notice: /Stage[main]/Aodh::Notifier/Service[aodh-notifier]/ensure: ensure changed 'stopped' to 'running'
Notice: /Stage[main]/Main/File[/etc/keystone/ssl/private/signing_key.pem]/content: content changed '{md5}e5deff5063bd1a07f875e890979ae397' to '{md5}f184d11e42b9d7e5b0ea0e4c9ea7ec68'
Notice: /Stage[main]/Main/File[/etc/keystone/ssl/certs/ca.pem]/content: content changed '{md5}11fb3e3e8382b9789d2daadfe31604fe' to '{md5}ce2e87a7cec218b5843f5fbdcca17156'
Notice: /Stage[main]/Main/File[/etc/keystone/ssl/certs/signing_cert.pem]/content: content changed '{md5}cbec476ab13a22a40e893b8239ee26a8' to '{md5}cf7ed2b1405ec367800ee324a81e3ae7'
Notice: /Stage[main]/Swift::Proxy/Concat[/etc/swift/proxy-server.conf]/File[/etc/swift/proxy-server.conf]/mode: mode changed '0644' to '0640'
Notice: /Stage[main]/Swift::Storage::All/Swift::Storage::Server[6002]/Concat[/etc/swift/account-server.conf]/File[/etc/swift/account-server.conf]/mode: mode changed '0644' to '0640'
Notice: /Stage[main]/Swift::Storage::All/Swift::Storage::Server[6001]/Concat[/etc/swift/container-server.conf]/File[/etc/swift/container-server.conf]/mode: mode changed '0644' to '0640'
5382a5d036786fcf993343d60b720ddd48ae2d67
Notice: /Stage[main]/Swift::Storage::All/Swift::Storage::Server[6000]/Concat[/etc/swift/object-server.conf]/File[/etc/swift/object-server.conf]/mode: mode changed '0644' to '0640'
Notice: /File[/etc/ironic/ironic.conf]/seluser: seluser changed 'unconfined_u' to 'system_u'
Notice: /Stage[main]/Apache/File[/etc/httpd/conf.modules.d/00-proxy.conf]/ensure: removed
Notice: /Stage[main]/Apache/File[/etc/httpd/conf.modules.d/01-cgi.conf]/ensure: removed
Notice: /Stage[main]/Apache/File[/etc/httpd/conf.modules.d/00-mpm.conf]/ensure: removed
Notice: /Stage[main]/Apache/File[/etc/httpd/conf.modules.d/00-systemd.conf]/ensure: removed
Notice: /Stage[main]/Apache/File[/etc/httpd/conf.modules.d/00-base.conf]/ensure: removed
Notice: /Stage[main]/Apache/File[/etc/httpd/conf.modules.d/00-lua.conf]/ensure: removed
Notice: /Stage[main]/Apache/File[/etc/httpd/conf.modules.d/00-dav.conf]/ensure: removed
Notice: /Stage[main]/Main/Nova_config[DEFAULT/my_ip]/value: value changed '192.168.120.101' to '192.168.250.101'
Notice: /Stage[main]/Nova::Network::Neutron/Nova_config[DEFAULT/dhcp_domain]/value: value changed 'fv1dci.org' to ''
Notice: /Stage[main]/Nova::Deps/Anchor[nova::config::end]: Triggered 'refresh' from 2 events
Error: /Stage[main]/Nova::Db::Sync/Exec[nova-db-sync]: Failed to call refresh: Command exceeded timeout
Error: /Stage[main]/Nova::Db::Sync/Exec[nova-db-sync]: Command exceeded timeout
Wrapped exception:
execution expired




Error: /Stage[main]/Nova::Db::Sync_api/Exec[nova-db-sync-api]: Failed to call refresh: Command exceeded timeout
Error: /Stage[main]/Nova::Db::Sync_api/Exec[nova-db-sync-api]: Command exceeded timeout
Wrapped exception:
execution expired
Notice: /Stage[main]/Nova::Deps/Anchor[nova::service::begin]: Triggered 'refresh' from 1 events
Notice: /Stage[main]/Nova::Cert/Nova::Generic_service[cert]/Service[nova-cert]/ensure: ensure changed 'stopped' to 'running'
Notice: /Stage[main]/Nova::Api/Nova::Generic_service[api]/Service[nova-api]/ensure: ensure changed 'stopped' to 'running'
Error: Could not start Service[nova-scheduler]: Execution of '/bin/systemctl start openstack-nova-scheduler' returned 1: Job for openstack-nova-scheduler.service failed because the control process exited with error code. See "systemctl status openstack-nova-scheduler.service" and "journalctl -xe" for details.
Wrapped exception:
Execution of '/bin/systemctl start openstack-nova-scheduler' returned 1: Job for openstack-nova-scheduler.service failed because the control process exited with error code. See "systemctl status openstack-nova-scheduler.service" and "journalctl -xe" for details.
Error: /Stage[main]/Nova::Scheduler/Nova::Generic_service[scheduler]/Service[nova-scheduler]/ensure: change from stopped to running failed: Could not start Service[nova-scheduler]: Execution of '/bin/systemctl start openstack-nova-scheduler' returned 1: Job for openstack-nova-scheduler.service failed because the control process exited with error code. See "systemctl status openstack-nova-scheduler.service" and "journalctl -xe" for details.
Notice: /Stage[main]/Nova::Scheduler/Nova::Generic_service[scheduler]/Service[nova-scheduler]: Triggered 'refresh' from 1 events
Notice: /Stage[main]/Nova::Conductor/Nova::Generic_service[conductor]/Service[nova-conductor]/ensure: ensure changed 'stopped' to 'running'



[root@director ~]# systemctl status openstack-ironic-inspector-dnsmasq
● openstack-ironic-inspector-dnsmasq.service - PXE boot dnsmasq service for Ironic Inspector
   Loaded: loaded (/usr/lib/systemd/system/openstack-ironic-inspector-dnsmasq.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2017-06-06 15:34:04 CDT; 1h 44min ago
  Process: 17569 ExecStart=/sbin/dnsmasq --conf-file=/etc/ironic-inspector/dnsmasq.conf (code=exited, status=2)
 Main PID: 28349 (code=exited, status=0/SUCCESS)

Jun 06 15:34:04 director.fv1dci.org systemd[1]: Starting PXE boot dnsmasq service for Ironic Inspector...
Jun 06 15:34:04 director.fv1dci.org dnsmasq[17569]: dnsmasq: unknown interface br-ctlplane
Jun 06 15:34:04 director.fv1dci.org systemd[1]: openstack-ironic-inspector-dnsmasq.service: control process exited, code=exited status=2
Jun 06 15:34:04 director.fv1dci.org systemd[1]: Failed to start PXE boot dnsmasq service for Ironic Inspector.
Jun 06 15:34:04 director.fv1dci.org systemd[1]: Unit openstack-ironic-inspector-dnsmasq.service entered failed state.
Jun 06 15:34:04 director.fv1dci.org systemd[1]: openstack-ironic-inspector-dnsmasq.service failed.


Note the "dnsmasq: unknown interface br-ctlplane" error.



Expected result:
Minor update on the undercloud completes successfully.
Comment 2 Alexander Chuzhoy 2017-06-06 18:47:56 EDT
Before update:

[stack@director ~]$ sudo ip -4 -o a
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: eth0    inet 192.168.250.101/23 brd 192.168.251.255 scope global eth0\       valid_lft forever preferred_lft forever
4: eth2    inet 192.168.110.101/24 brd 192.168.110.255 scope global eth2\       valid_lft forever preferred_lft forever
5: eth3    inet 192.168.140.101/24 brd 192.168.140.255 scope global eth3\       valid_lft forever preferred_lft forever
6: eth4    inet 192.168.190.101/24 brd 192.168.190.255 scope global eth4\       valid_lft forever preferred_lft forever
8: br-ctlplane    inet 192.168.120.101/24 brd 192.168.120.255 scope global br-ctlplane\       valid_lft forever preferred_lft forever

After update:

[stack@director ~]$ sudo ip -4 -o a
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: eth0    inet 192.168.250.101/23 brd 192.168.251.255 scope global eth0\       valid_lft forever preferred_lft forever
4: eth2    inet 192.168.110.101/24 brd 192.168.110.255 scope global eth2\       valid_lft forever preferred_lft forever
5: eth3    inet 192.168.140.101/24 brd 192.168.140.255 scope global eth3\       valid_lft forever preferred_lft forever
6: eth4    inet 192.168.190.101/24 brd 192.168.190.255 scope global eth4\       valid_lft forever preferred_lft forever
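Comparing the two listings, the only change is that br-ctlplane lost its IPv4 address. A minimal sketch of a check for this condition, run here against sample text copied from the listing above rather than a live system:

```shell
# Sample `ip -4 -o a` output after the update, taken from the listing
# above: br-ctlplane no longer appears.
ip_out='1: lo    inet 127.0.0.1/8 scope host lo
2: eth0    inet 192.168.250.101/23 brd 192.168.251.255 scope global eth0
4: eth2    inet 192.168.110.101/24 brd 192.168.110.255 scope global eth2'

# Report whether the control-plane bridge still holds an IPv4 address.
if printf '%s\n' "$ip_out" | grep -q 'br-ctlplane'; then
    echo "br-ctlplane has an IPv4 address"
else
    echo "br-ctlplane has no IPv4 address"
fi
```

On a live undercloud the same check would be `sudo ip -4 -o a | grep br-ctlplane`.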
Comment 3 Assaf Muller 2017-06-06 19:11:02 EDT
Some random notes after looking at the sosreport for a bit:

OVS updated from 2.4 to 2.5:
yum.log: 15:23:45 Updated: openvswitch-2.5.0-14.git20160727.el7fdp.x86_64

OVS restart:
messages: 15:29:25 director systemd: Stopping Open vSwitch...

ovs-vsctl show output shows that br-int is set to fail_mode secure, but br-ctlplane isn't. Dumping the flows on br-ctlplane shows the NORMAL action in place.

Unclear why eth1 and br-ctlplane are down and without an IP.
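The fail_mode difference noted above can be queried directly with `ovs-vsctl get-fail-mode <bridge>` on a live system. As a sketch of the distinction, here is the same check against an assumed excerpt of `ovs-vsctl show` output (the excerpt is illustrative, not taken from the sosreport):

```shell
# Assumed excerpt of `ovs-vsctl show` output: only br-int carries a fail_mode.
show_out='    Bridge br-int
        fail_mode: secure
        Port br-int
    Bridge br-ctlplane
        Port eth1'

# Print the fail_mode configured on br-int; no output would mean the
# bridge falls back to standalone (NORMAL) forwarding, as br-ctlplane does.
printf '%s\n' "$show_out" |
    awk '/Bridge br-int/{f=1; next} /Bridge/{f=0} f && /fail_mode/{print $2}'
```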
Comment 4 Alexander Chuzhoy 2017-06-06 19:19:16 EDT
After rebooting the machine, br-ctlplane was UP.
Re-ran "openstack undercloud upgrade" and it completed successfully.
Comment 5 Alexander Chuzhoy 2017-06-07 15:27:51 EDT
Easier workaround:
Before running "openstack undercloud upgrade", run "sudo yum update openvswitch". This will result in br-ctlplane going down.
Then run "sudo ifup br-ctlplane" to bring the bridge back up.
Finally, run "openstack undercloud upgrade".
Comment 6 Assaf Muller 2017-06-07 16:02:54 EDT
Hi Sofer,

It seems we need the same OVS update workaround we have on overcloud nodes to be applied on the undercloud as well.
Comment 7 Sofer Athlan-Guyot 2017-06-08 05:36:37 EDT
Hi Sasha,

could you try, before running step 3 ("openstack undercloud upgrade"), running this shell script from the $HOME directory:

cat > upgrade-ovs.sh <<'EOF'
if [[ -n $(rpm -q --scripts openvswitch | awk '/postuninstall/,/*/' | grep "systemctl.*try-restart") ]]; then
    echo "Manual upgrade of openvswitch - restart in postun detected"
    mkdir OVS_UPGRADE || true
    pushd OVS_UPGRADE
    echo "Attempting to download latest openvswitch with yumdownloader"
    yumdownloader --resolve openvswitch
    echo "Updating openvswitch with nopostun option"
    rpm -U --replacepkgs --nopostun ./*.rpm
    popd
else
    echo "Skipping manual upgrade of openvswitch - no restart in postun detected"
fi
EOF

bash -x upgrade-ovs.sh

and then do your update.

It's currently what we have for OSP9 for the overcloud.  I want to make sure that it corrects the issue.
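The script's detection step keys off the openvswitch package's %postun scriptlet: if the scriptlet runs `systemctl try-restart openvswitch.service`, a plain yum update would restart OVS and take the bridges down, so the script reinstalls with `rpm -U --nopostun` instead. A sketch of that detection against sample scriptlet text (the text below is assumed for illustration; on a real system it comes from `rpm -q --scripts openvswitch`, and the awk range is written here as `/postuninstall/,0`, a cleaner equivalent of the script's `/postuninstall/,/*/`):

```shell
# Assumed stand-in for `rpm -q --scripts openvswitch` output.
scripts='preinstall scriptlet (using /bin/sh):
getent passwd openvswitch >/dev/null || :
postuninstall scriptlet (using /bin/sh):
systemctl try-restart openvswitch.service >/dev/null 2>&1 || :'

# Same check as upgrade-ovs.sh: a try-restart after "postuninstall" means a
# yum update would bounce OVS, so the --nopostun path should be taken.
if printf '%s\n' "$scripts" | awk '/postuninstall/,0' | grep -q 'systemctl.*try-restart'; then
    echo "restart in postun detected"
else
    echo "no restart in postun"
fi
```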
Comment 8 Alexander Chuzhoy 2017-06-09 09:56:15 EDT
Previous version:
openvswitch-2.4.0-1.el7.x86_64

Ran:
[stack@director ~]$ cat > upgrade-ovs.sh <<'EOF'
> if [[ -n $(rpm -q --scripts openvswitch | awk '/postuninstall/,/*/' | grep "systemctl.*try-restart") ]]; then
>     echo "Manual upgrade of openvswitch - restart in postun detected"
>     mkdir OVS_UPGRADE || true
>     pushd OVS_UPGRADE
>     echo "Attempting to downloading latest openvswitch with yumdownloader"
>     yumdownloader --resolve openvswitch
>     echo "Updating openvswitch with nopostun option"
>     rpm -U --replacepkgs --nopostun ./*.rpm
>     popd
> else
>     echo "Skipping manual upgrade of openvswitch - no restart in postun detected"
> fi
> EOF
[stack@director ~]$ sudo bash -x upgrade-ovs.sh
++ rpm -q --scripts openvswitch
++ awk '/postuninstall/,/*/'
++ grep 'systemctl.*try-restart'
+ [[ -n         systemctl try-restart openvswitch.service >/dev/null 2>&1 || :  ]]
+ echo 'Manual upgrade of openvswitch - restart in postun detected'
Manual upgrade of openvswitch - restart in postun detected
+ mkdir OVS_UPGRADE
+ pushd OVS_UPGRADE
/home/stack/OVS_UPGRADE /home/stack
+ echo 'Attempting to downloading latest openvswitch with yumdownloader'
Attempting to downloading latest openvswitch with yumdownloader
+ yumdownloader --resolve openvswitch
Configuration file /etc/yum/pluginconf.d/versionlock.conf not found
Unable to find configuration file for plugin versionlock
Loaded plugins: product-id
--> Running transaction check
---> Package openvswitch.x86_64 0:2.5.0-14.git20160727.el7fdp will be installed
--> Finished Dependency Resolution
openvswitch-2.5.0-14.git20160727.el7fdp.x86_64.rpm                                                                                                                     | 3.6 MB  00:00:01     
+ echo 'Updating openvswitch with nopostun option'
Updating openvswitch with nopostun option
+ rpm -U --replacepkgs --nopostun ./openvswitch-2.5.0-14.git20160727.el7fdp.x86_64.rpm
+ popd
/home/stack
[stack@director ~]$ rpm -q openvswitch
openvswitch-2.5.0-14.git20160727.el7fdp.x86_64


Ran: openstack undercloud upgrade

Completed successfully and the issue with losing IP didn't occur.
