rhosp-director: OSP9 "openstack undercloud upgrade" fails on undercloud: br-ctlplane is down

Environment:
openstack-tripleo-heat-templates-liberty-2.0.0-54.el7ost.noarch
python-neutron-lbaas-tests-8.1.0-2.el7ost.noarch
python-neutronclient-4.1.1-2.el7ost.noarch
python-neutron-vpnaas-8.0.0-1.el7ost.noarch
openvswitch-2.5.0-14.git20160727.el7fdp.x86_64
python-neutron-fwaas-tests-8.0.0-3.el7ost.noarch
python-neutron-8.3.0-5.el7ost.noarch
python-neutron-lbaas-8.1.0-2.el7ost.noarch
openstack-tripleo-heat-templates-2.0.0-54.el7ost.noarch
openstack-puppet-modules-8.1.10-2.el7ost.noarch
python-neutron-fwaas-8.0.0-3.el7ost.noarch
openstack-neutron-common-8.3.0-5.el7ost.noarch
openstack-neutron-ml2-8.3.0-5.el7ost.noarch
python-openvswitch-2.5.0-14.git20160727.el7fdp.noarch
openstack-neutron-openvswitch-8.3.0-5.el7ost.noarch
python-neutron-tests-8.3.0-5.el7ost.noarch
instack-undercloud-4.0.0-16.el7ost.noarch
python-neutron-vpnaas-tests-8.0.0-1.el7ost.noarch
openstack-neutron-8.3.0-5.el7ost.noarch
python-neutron-lib-0.0.2-1.el7ost.noarch

Steps to reproduce:
Follow the minor update procedure on the undercloud:
1. sudo systemctl stop 'openstack-*' 'neutron-*' httpd
2. sudo yum update python-tripleoclient
3. openstack undercloud upgrade

Result:
The process gets stuck and errors are shown:

Notice: /Stage[main]/Ironic::Inspector/Service[ironic-inspector]/ensure: ensure changed 'stopped' to 'running'
Error: Could not start Service[ironic-inspector-dnsmasq]: Execution of '/bin/systemctl start openstack-ironic-inspector-dnsmasq' returned 1: Job for openstack-ironic-inspector-dnsmasq.service failed because the control process exited with error code. See "systemctl status openstack-ironic-inspector-dnsmasq.service" and "journalctl -xe" for details.
Wrapped exception: Execution of '/bin/systemctl start openstack-ironic-inspector-dnsmasq' returned 1: Job for openstack-ironic-inspector-dnsmasq.service failed because the control process exited with error code. See "systemctl status openstack-ironic-inspector-dnsmasq.service" and "journalctl -xe" for details.
Error: /Stage[main]/Ironic::Inspector/Service[ironic-inspector-dnsmasq]/ensure: change from stopped to running failed: Could not start Service[ironic-inspector-dnsmasq]: Execution of '/bin/systemctl start openstack-ironic-inspector-dnsmasq' returned 1: Job for openstack-ironic-inspector-dnsmasq.service failed because the control process exited with error code. See "systemctl status openstack-ironic-inspector-dnsmasq.service" and "journalctl -xe" for details.
Notice: /Stage[main]/Heat/Heat_config[trustee/project_domain_id]/value: value changed 'Default' to 'default'
Notice: /Stage[main]/Heat/Heat_config[trustee/user_domain_id]/value: value changed 'Default' to 'default'
Notice: /Stage[main]/Ironic::Db/Ironic_config[database/connection]/value: value changed '[old secret redacted]' to '[new secret redacted]'
Error: /Stage[main]/Ironic::Db::Sync/Exec[ironic-dbsync]: Failed to call refresh: Command exceeded timeout
Error: /Stage[main]/Ironic::Db::Sync/Exec[ironic-dbsync]: Command exceeded timeout
Wrapped exception: execution expired
Notice: /Stage[main]/Ironic::Conductor/Service[ironic-conductor]/ensure: ensure changed 'stopped' to 'running'
Notice: /Stage[main]/Ironic::Api/Service[ironic-api]/ensure: ensure changed 'stopped' to 'running'
Notice: /Stage[main]/Rabbitmq::Service/Service[rabbitmq-server]: Triggered 'refresh' from 1 events
Notice: /Stage[main]/Aodh::Listener/Service[aodh-listener]/ensure: ensure changed 'stopped' to 'running'
Notice: /Stage[main]/Aodh::Notifier/Service[aodh-notifier]/ensure: ensure changed 'stopped' to 'running'
Notice: /Stage[main]/Main/File[/etc/keystone/ssl/private/signing_key.pem]/content: content changed '{md5}e5deff5063bd1a07f875e890979ae397' to '{md5}f184d11e42b9d7e5b0ea0e4c9ea7ec68'
Notice: /Stage[main]/Main/File[/etc/keystone/ssl/certs/ca.pem]/content: content changed '{md5}11fb3e3e8382b9789d2daadfe31604fe' to '{md5}ce2e87a7cec218b5843f5fbdcca17156'
Notice: /Stage[main]/Main/File[/etc/keystone/ssl/certs/signing_cert.pem]/content: content changed '{md5}cbec476ab13a22a40e893b8239ee26a8' to '{md5}cf7ed2b1405ec367800ee324a81e3ae7'
Notice: /Stage[main]/Swift::Proxy/Concat[/etc/swift/proxy-server.conf]/File[/etc/swift/proxy-server.conf]/mode: mode changed '0644' to '0640'
Notice: /Stage[main]/Swift::Storage::All/Swift::Storage::Server[6002]/Concat[/etc/swift/account-server.conf]/File[/etc/swift/account-server.conf]/mode: mode changed '0644' to '0640'
Notice: /Stage[main]/Swift::Storage::All/Swift::Storage::Server[6001]/Concat[/etc/swift/container-server.conf]/File[/etc/swift/container-server.conf]/mode: mode changed '0644' to '0640'
Notice: /Stage[main]/Swift::Storage::All/Swift::Storage::Server[6000]/Concat[/etc/swift/object-server.conf]/File[/etc/swift/object-server.conf]/mode: mode changed '0644' to '0640'
Notice: /File[/etc/ironic/ironic.conf]/seluser: seluser changed 'unconfined_u' to 'system_u'
Notice: /Stage[main]/Apache/File[/etc/httpd/conf.modules.d/00-proxy.conf]/ensure: removed
Notice: /Stage[main]/Apache/File[/etc/httpd/conf.modules.d/01-cgi.conf]/ensure: removed
Notice: /Stage[main]/Apache/File[/etc/httpd/conf.modules.d/00-mpm.conf]/ensure: removed
Notice: /Stage[main]/Apache/File[/etc/httpd/conf.modules.d/00-systemd.conf]/ensure: removed
Notice: /Stage[main]/Apache/File[/etc/httpd/conf.modules.d/00-base.conf]/ensure: removed
Notice: /Stage[main]/Apache/File[/etc/httpd/conf.modules.d/00-lua.conf]/ensure: removed
Notice: /Stage[main]/Apache/File[/etc/httpd/conf.modules.d/00-dav.conf]/ensure: removed
Notice: /Stage[main]/Main/Nova_config[DEFAULT/my_ip]/value: value changed '192.168.120.101' to '192.168.250.101'
Notice: /Stage[main]/Nova::Network::Neutron/Nova_config[DEFAULT/dhcp_domain]/value: value changed 'fv1dci.org' to ''
Notice: /Stage[main]/Nova::Deps/Anchor[nova::config::end]: Triggered 'refresh' from 2 events
Error: /Stage[main]/Nova::Db::Sync/Exec[nova-db-sync]: Failed to call refresh: Command exceeded timeout
Error: /Stage[main]/Nova::Db::Sync/Exec[nova-db-sync]: Command exceeded timeout
Wrapped exception: execution expired
Error: /Stage[main]/Nova::Db::Sync_api/Exec[nova-db-sync-api]: Failed to call refresh: Command exceeded timeout
Error: /Stage[main]/Nova::Db::Sync_api/Exec[nova-db-sync-api]: Command exceeded timeout
Wrapped exception: execution expired
Notice: /Stage[main]/Nova::Deps/Anchor[nova::service::begin]: Triggered 'refresh' from 1 events
Notice: /Stage[main]/Nova::Cert/Nova::Generic_service[cert]/Service[nova-cert]/ensure: ensure changed 'stopped' to 'running'
Notice: /Stage[main]/Nova::Api/Nova::Generic_service[api]/Service[nova-api]/ensure: ensure changed 'stopped' to 'running'
Error: Could not start Service[nova-scheduler]: Execution of '/bin/systemctl start openstack-nova-scheduler' returned 1: Job for openstack-nova-scheduler.service failed because the control process exited with error code. See "systemctl status openstack-nova-scheduler.service" and "journalctl -xe" for details.
Wrapped exception: Execution of '/bin/systemctl start openstack-nova-scheduler' returned 1: Job for openstack-nova-scheduler.service failed because the control process exited with error code. See "systemctl status openstack-nova-scheduler.service" and "journalctl -xe" for details.
Error: /Stage[main]/Nova::Scheduler/Nova::Generic_service[scheduler]/Service[nova-scheduler]/ensure: change from stopped to running failed: Could not start Service[nova-scheduler]: Execution of '/bin/systemctl start openstack-nova-scheduler' returned 1: Job for openstack-nova-scheduler.service failed because the control process exited with error code. See "systemctl status openstack-nova-scheduler.service" and "journalctl -xe" for details.
Notice: /Stage[main]/Nova::Scheduler/Nova::Generic_service[scheduler]/Service[nova-scheduler]: Triggered 'refresh' from 1 events
Notice: /Stage[main]/Nova::Conductor/Nova::Generic_service[conductor]/Service[nova-conductor]/ensure: ensure changed 'stopped' to 'running'

[root@director ~]# systemctl status openstack-ironic-inspector-dnsmasq
● openstack-ironic-inspector-dnsmasq.service - PXE boot dnsmasq service for Ironic Inspector
   Loaded: loaded (/usr/lib/systemd/system/openstack-ironic-inspector-dnsmasq.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2017-06-06 15:34:04 CDT; 1h 44min ago
  Process: 17569 ExecStart=/sbin/dnsmasq --conf-file=/etc/ironic-inspector/dnsmasq.conf (code=exited, status=2)
 Main PID: 28349 (code=exited, status=0/SUCCESS)

Jun 06 15:34:04 director.fv1dci.org systemd[1]: Starting PXE boot dnsmasq service for Ironic Inspector...
Jun 06 15:34:04 director.fv1dci.org dnsmasq[17569]: dnsmasq: unknown interface br-ctlplane
Jun 06 15:34:04 director.fv1dci.org systemd[1]: openstack-ironic-inspector-dnsmasq.service: control process exited, code=exited status=2
Jun 06 15:34:04 director.fv1dci.org systemd[1]: Failed to start PXE boot dnsmasq service for Ironic Inspector.
Jun 06 15:34:04 director.fv1dci.org systemd[1]: Unit openstack-ironic-inspector-dnsmasq.service entered failed state.
Jun 06 15:34:04 director.fv1dci.org systemd[1]: openstack-ironic-inspector-dnsmasq.service failed.

Note the "dnsmasq: unknown interface br-ctlplane".

Expected result:
The minor update on the undercloud completes successfully.
Before update:

[stack@director ~]$ sudo ip -4 -o a
1: lo    inet 127.0.0.1/8 scope host lo\ valid_lft forever preferred_lft forever
2: eth0    inet 192.168.250.101/23 brd 192.168.251.255 scope global eth0\ valid_lft forever preferred_lft forever
4: eth2    inet 192.168.110.101/24 brd 192.168.110.255 scope global eth2\ valid_lft forever preferred_lft forever
5: eth3    inet 192.168.140.101/24 brd 192.168.140.255 scope global eth3\ valid_lft forever preferred_lft forever
6: eth4    inet 192.168.190.101/24 brd 192.168.190.255 scope global eth4\ valid_lft forever preferred_lft forever
8: br-ctlplane    inet 192.168.120.101/24 brd 192.168.120.255 scope global br-ctlplane\ valid_lft forever preferred_lft forever

After update:

[stack@director ~]$ sudo ip -4 -o a
1: lo    inet 127.0.0.1/8 scope host lo\ valid_lft forever preferred_lft forever
2: eth0    inet 192.168.250.101/23 brd 192.168.251.255 scope global eth0\ valid_lft forever preferred_lft forever
4: eth2    inet 192.168.110.101/24 brd 192.168.110.255 scope global eth2\ valid_lft forever preferred_lft forever
5: eth3    inet 192.168.140.101/24 brd 192.168.140.255 scope global eth3\ valid_lft forever preferred_lft forever
6: eth4    inet 192.168.190.101/24 brd 192.168.190.255 scope global eth4\ valid_lft forever preferred_lft forever

Note that after the update br-ctlplane no longer appears: it has lost its IP address.
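To confirm what the dnsmasq error points at, the bridge state can be checked directly. A minimal sketch, assuming the standard ifcfg-based undercloud network setup (the ifcfg path is an assumption, inferred from "ifup br-ctlplane" working in the workaround below):

    ip -d link show br-ctlplane                            # link exists but is DOWN with no address
    cat /etc/sysconfig/network-scripts/ifcfg-br-ctlplane   # config written by os-net-config should still be present
    sudo ifup br-ctlplane                                  # brings the bridge and its IP back up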
Some random notes after looking at the sosreport for a bit:

OVS was updated from 2.4 to 2.5. From yum.log:
15:23:45 Updated: openvswitch-2.5.0-14.git20160727.el7fdp.x86_64

OVS was restarted. From messages:
15:29:25 director systemd: Stopping Open vSwitch...

The ovs-vsctl show output shows that br-int is set to fail_mode secure, but br-ctlplane is not. A dump-flows on br-ctlplane shows the NORMAL action in place. It is unclear why eth1 and br-ctlplane are down and without an IP.
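For reference, these observations can be reproduced with standard OVS tooling (bridge names as in this report; the expected outputs are what the sosreport showed):

    sudo ovs-vsctl get-fail-mode br-int        # prints "secure"
    sudo ovs-vsctl get-fail-mode br-ctlplane   # prints nothing, i.e. standalone mode
    sudo ovs-ofctl dump-flows br-ctlplane      # expect a single flow with actions=NORMAL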
After rebooting the machine, br-ctlplane was UP. Re-running "openstack undercloud upgrade" then completed successfully.
Easier workaround (see the consolidated sequence below):
1. Before running "openstack undercloud upgrade", run "sudo yum update openvswitch". This will result in br-ctlplane being down.
2. Then run "sudo ifup br-ctlplane".
3. Now run "openstack undercloud upgrade".
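Put together, the workaround is the following sequence; the grep check is just an illustration to confirm the bridge got its address (192.168.120.101/24 in this setup) back before upgrading:

    sudo yum update openvswitch
    sudo ifup br-ctlplane
    ip -4 -o a | grep br-ctlplane   # should show 192.168.120.101/24 again
    openstack undercloud upgrade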
Hi Sofer, it seems we need the same workaround for updating OVS that we have on overcloud nodes to be present on undercloud nodes as well.
Hi Sasha, could you try, before running "3. openstack undercloud upgrade", to run this shell script in the $HOME directory:

cat > upgrade-ovs.sh <<'EOF'
if [[ -n $(rpm -q --scripts openvswitch | awk '/postuninstall/,/*/' | grep "systemctl.*try-restart") ]]; then
    echo "Manual upgrade of openvswitch - restart in postun detected"
    mkdir OVS_UPGRADE || true
    pushd OVS_UPGRADE
    echo "Attempting to download latest openvswitch with yumdownloader"
    yumdownloader --resolve openvswitch
    echo "Updating openvswitch with nopostun option"
    rpm -U --replacepkgs --nopostun ./*.rpm
    popd
else
    echo "Skipping manual upgrade of openvswitch - no restart in postun detected"
fi
EOF
bash -x upgrade-ovs.sh

and then do your update. It's currently what we have in OSP9 for the overcloud. I want to make sure that it corrects the issue.
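For context, the point of --nopostun: a plain "yum update openvswitch" runs the old package's %postun scriptlet, which try-restarts openvswitch.service and thereby takes br-ctlplane down; rpm -U --nopostun skips that scriptlet so the switch keeps running across the package swap. The check the script performs can also be run by hand (expected output taken from the trace in the next comment):

    rpm -q --scripts openvswitch | awk '/postuninstall/,/*/' | grep 'try-restart'
    # -> systemctl try-restart openvswitch.service >/dev/null 2>&1 || :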
Previous version: openvswitch-2.4.0-1.el7.x86_64

Ran:

[stack@director ~]$ cat > upgrade-ovs.sh <<'EOF'
> if [[ -n $(rpm -q --scripts openvswitch | awk '/postuninstall/,/*/' | grep "systemctl.*try-restart") ]]; then
> echo "Manual upgrade of openvswitch - restart in postun detected"
> mkdir OVS_UPGRADE || true
> pushd OVS_UPGRADE
> echo "Attempting to download latest openvswitch with yumdownloader"
> yumdownloader --resolve openvswitch
> echo "Updating openvswitch with nopostun option"
> rpm -U --replacepkgs --nopostun ./*.rpm
> popd
> else
> echo "Skipping manual upgrade of openvswitch - no restart in postun detected"
> fi
> EOF
[stack@director ~]$ sudo bash -x upgrade-ovs.sh
++ rpm -q --scripts openvswitch
++ awk '/postuninstall/,/*/'
++ grep 'systemctl.*try-restart'
+ [[ -n systemctl try-restart openvswitch.service >/dev/null 2>&1 || : ]]
+ echo 'Manual upgrade of openvswitch - restart in postun detected'
Manual upgrade of openvswitch - restart in postun detected
+ mkdir OVS_UPGRADE
+ pushd OVS_UPGRADE
/home/stack/OVS_UPGRADE /home/stack
+ echo 'Attempting to download latest openvswitch with yumdownloader'
Attempting to download latest openvswitch with yumdownloader
+ yumdownloader --resolve openvswitch
Configuration file /etc/yum/pluginconf.d/versionlock.conf not found
Unable to find configuration file for plugin versionlock
Loaded plugins: product-id
--> Running transaction check
---> Package openvswitch.x86_64 0:2.5.0-14.git20160727.el7fdp will be installed
--> Finished Dependency Resolution
openvswitch-2.5.0-14.git20160727.el7fdp.x86_64.rpm | 3.6 MB 00:00:01
+ echo 'Updating openvswitch with nopostun option'
Updating openvswitch with nopostun option
+ rpm -U --replacepkgs --nopostun ./openvswitch-2.5.0-14.git20160727.el7fdp.x86_64.rpm
+ popd
/home/stack

[stack@director ~]$ rpm -q openvswitch
openvswitch-2.5.0-14.git20160727.el7fdp.x86_64

Ran: openstack undercloud upgrade

It completed successfully, and the issue with losing the IP didn't occur.
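One way to verify that the --nopostun path really avoided the OVS restart is to compare the daemon's start timestamp before and after the package swap (a sketch; ActiveEnterTimestamp is a standard systemd unit property):

    systemctl show -p ActiveEnterTimestamp openvswitch.service
    # the timestamp should be unchanged after the rpm -U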
Hi,

So the disconnection issue is mainly tied to the test setup, where br-ctlplane is used as a gateway as well. This does not happen in real-world deployments, and no customer has reported this issue either. So I'm closing it as won't fix. The workaround provided in comment #7 can still be used. If one of the assumptions turns out to be wrong, don't hesitate to re-open the BZ.

Regards,