Description of problem:

OSP9 -> OSP10 with ovs 2.6 upgrade fails with neutron-openvswitch-agent unable to start during major-upgrade-pacemaker.yaml

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-compat-2.0.0-41.el7ost.noarch
openstack-tripleo-heat-templates-5.2.0-18.el7ost.noarch
openstack-neutron-openvswitch-9.2.0-2.el7ost.noarch
openvswitch-2.6.1-10.git20161206.el7fdp.x86_64
python-openvswitch-2.6.1-10.git20161206.el7fdp.noarch

How reproducible:
100%

Steps to Reproduce:
1. Upgrade undercloud including fix for bug 1431115 (openstack-tripleo-heat-templates-5.2.0-18.el7ost)
2. Upgrade overcloud nodes

Actual results:
Upgrade fails during major-upgrade-pacemaker.yaml

Expected results:
Step completes fine.

Additional info:
Adding sosreports. It looks like neutron-openvswitch-agent is unable to start because openvswitch is not started:

[root@overcloud-controller-0 heat-admin]# systemctl status openvswitch
● openvswitch.service - Open vSwitch
   Loaded: loaded (/usr/lib/systemd/system/openvswitch.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Mon 2017-05-22 23:09:44 UTC; 9h ago
 Main PID: 1020 (code=exited, status=0/SUCCESS)

May 22 23:25:10 overcloud-controller-0.localdomain systemd[1]: Dependency failed for Open vSwitch.
May 22 23:25:10 overcloud-controller-0.localdomain systemd[1]: Job openvswitch.service/start failed with result 'dependency'.
May 22 23:26:40 overcloud-controller-0.localdomain systemd[1]: Dependency failed for Open vSwitch.
May 22 23:26:40 overcloud-controller-0.localdomain systemd[1]: Job openvswitch.service/start failed with result 'dependency'.
May 22 23:28:15 overcloud-controller-0.localdomain systemd[1]: Dependency failed for Open vSwitch.
May 22 23:28:15 overcloud-controller-0.localdomain systemd[1]: Job openvswitch.service/start failed with result 'dependency'.
May 22 23:29:45 overcloud-controller-0.localdomain systemd[1]: Dependency failed for Open vSwitch.
May 22 23:29:45 overcloud-controller-0.localdomain systemd[1]: Job openvswitch.service/start failed with result 'dependency'.
May 22 23:37:01 overcloud-controller-0.localdomain systemd[1]: Dependency failed for Open vSwitch.
May 22 23:37:01 overcloud-controller-0.localdomain systemd[1]: Job openvswitch.service/start failed with result 'dependency'.

[root@overcloud-controller-0 heat-admin]# systemctl status neutron-openvswitch-agent
● neutron-openvswitch-agent.service - OpenStack Neutron Open vSwitch Agent
   Loaded: loaded (/usr/lib/systemd/system/neutron-openvswitch-agent.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

May 22 22:56:29 overcloud-controller-0.localdomain ovs-ofctl[29532]: ovs|00001|ofp_util|INFO|normalization changed ofp_match, details:
May 22 22:56:29 overcloud-controller-0.localdomain ovs-ofctl[29532]: ovs|00002|ofp_util|INFO| pre: in_port=6,nw_proto=58,tp_src=136
May 22 22:56:29 overcloud-controller-0.localdomain ovs-ofctl[29532]: ovs|00003|ofp_util|INFO|post: in_port=6
May 22 22:56:29 overcloud-controller-0.localdomain ovs-ofctl[29539]: ovs|00001|ofp_util|INFO|normalization changed ofp_match, details:
May 22 22:56:29 overcloud-controller-0.localdomain ovs-ofctl[29539]: ovs|00002|ofp_util|INFO| pre: in_port=7,nw_proto=58,tp_src=136
May 22 22:56:29 overcloud-controller-0.localdomain ovs-ofctl[29539]: ovs|00003|ofp_util|INFO|post: in_port=7
May 22 22:56:31 overcloud-controller-0.localdomain systemd[1]: Stopping OpenStack Neutron Open vSwitch Agent...
May 22 22:56:33 overcloud-controller-0.localdomain systemd[1]: Stopped OpenStack Neutron Open vSwitch Agent.
May 22 23:37:01 overcloud-controller-0.localdomain systemd[1]: Dependency failed for OpenStack Neutron Open vSwitch Agent.
May 22 23:37:01 overcloud-controller-0.localdomain systemd[1]: Job neutron-openvswitch-agent.service/start failed with result 'dependency'.

[stack@undercloud-0 ~]$ openstack stack failures list overcloud
overcloud.UpdateWorkflow.ControllerPacemakerUpgradeDeployment_Step6.1:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: 66c0e162-78aa-4721-b522-b0d7d14b8e93
  status: CREATE_FAILED
  status_reason: |
    Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 1
  deploy_stdout: |
    ...
    neutron-l3-agent is started
    Mon May 22 23:35:18 UTC 2017 4d45c76e-a87a-41a3-bc44-07edb44c86eb tripleo-upgrade overcloud-controller-1 Going to systemctl start neutron-metadata-agent
    Mon May 22 23:35:19 UTC 2017 4d45c76e-a87a-41a3-bc44-07edb44c86eb tripleo-upgrade overcloud-controller-1 Going to systemctl enable neutron-metadata-agent
    Mon May 22 23:35:19 UTC 2017 4d45c76e-a87a-41a3-bc44-07edb44c86eb tripleo-upgrade overcloud-controller-1 Going to check_resource_systemd for neutron-metadata-agent to be started
    neutron-metadata-agent is started
    Mon May 22 23:35:20 UTC 2017 4d45c76e-a87a-41a3-bc44-07edb44c86eb tripleo-upgrade overcloud-controller-1 Going to systemctl start neutron-netns-cleanup
    Mon May 22 23:35:21 UTC 2017 4d45c76e-a87a-41a3-bc44-07edb44c86eb tripleo-upgrade overcloud-controller-1 Going to systemctl enable neutron-netns-cleanup
    Mon May 22 23:35:21 UTC 2017 4d45c76e-a87a-41a3-bc44-07edb44c86eb tripleo-upgrade overcloud-controller-1 Going to check_resource_systemd for neutron-netns-cleanup to be started
    neutron-netns-cleanup is started
    Mon May 22 23:35:22 UTC 2017 4d45c76e-a87a-41a3-bc44-07edb44c86eb tripleo-upgrade overcloud-controller-1 Going to systemctl start neutron-openvswitch-agent
    (truncated, view all with --long)
  deploy_stderr: |
    Created symlink from /etc/systemd/system/multi-user.target.wants/httpd.service to /usr/lib/systemd/system/httpd.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/memcached.service to /usr/lib/systemd/system/memcached.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/mongod.service to /usr/lib/systemd/system/mongod.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/neutron-dhcp-agent.service to /usr/lib/systemd/system/neutron-dhcp-agent.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/neutron-l3-agent.service to /usr/lib/systemd/system/neutron-l3-agent.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/neutron-metadata-agent.service to /usr/lib/systemd/system/neutron-metadata-agent.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/neutron-netns-cleanup.service to /usr/lib/systemd/system/neutron-netns-cleanup.service.
    A dependency job for neutron-openvswitch-agent.service failed. See 'journalctl -xe' for details.
overcloud.UpdateWorkflow.ControllerPacemakerUpgradeDeployment_Step6.0:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: ef47c603-38e8-4e9e-bf3d-f254cbad751e
  status: CREATE_FAILED
  status_reason: |
    Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 1
  deploy_stdout: |
    ...
    neutron-l3-agent is started
    Mon May 22 23:35:28 UTC 2017 d5e1af6d-a7bb-4758-8649-df8414a17a81 tripleo-upgrade overcloud-controller-0 Going to systemctl start neutron-metadata-agent
    Mon May 22 23:35:28 UTC 2017 d5e1af6d-a7bb-4758-8649-df8414a17a81 tripleo-upgrade overcloud-controller-0 Going to systemctl enable neutron-metadata-agent
    Mon May 22 23:35:28 UTC 2017 d5e1af6d-a7bb-4758-8649-df8414a17a81 tripleo-upgrade overcloud-controller-0 Going to check_resource_systemd for neutron-metadata-agent to be started
    neutron-metadata-agent is started
    Mon May 22 23:35:29 UTC 2017 d5e1af6d-a7bb-4758-8649-df8414a17a81 tripleo-upgrade overcloud-controller-0 Going to systemctl start neutron-netns-cleanup
    Mon May 22 23:35:30 UTC 2017 d5e1af6d-a7bb-4758-8649-df8414a17a81 tripleo-upgrade overcloud-controller-0 Going to systemctl enable neutron-netns-cleanup
    Mon May 22 23:35:30 UTC 2017 d5e1af6d-a7bb-4758-8649-df8414a17a81 tripleo-upgrade overcloud-controller-0 Going to check_resource_systemd for neutron-netns-cleanup to be started
    neutron-netns-cleanup is started
    Mon May 22 23:35:31 UTC 2017 d5e1af6d-a7bb-4758-8649-df8414a17a81 tripleo-upgrade overcloud-controller-0 Going to systemctl start neutron-openvswitch-agent
    (truncated, view all with --long)
  deploy_stderr: |
    Created symlink from /etc/systemd/system/multi-user.target.wants/httpd.service to /usr/lib/systemd/system/httpd.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/memcached.service to /usr/lib/systemd/system/memcached.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/mongod.service to /usr/lib/systemd/system/mongod.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/neutron-dhcp-agent.service to /usr/lib/systemd/system/neutron-dhcp-agent.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/neutron-l3-agent.service to /usr/lib/systemd/system/neutron-l3-agent.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/neutron-metadata-agent.service to /usr/lib/systemd/system/neutron-metadata-agent.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/neutron-netns-cleanup.service to /usr/lib/systemd/system/neutron-netns-cleanup.service.
    A dependency job for neutron-openvswitch-agent.service failed. See 'journalctl -xe' for details.
overcloud.UpdateWorkflow.ControllerPacemakerUpgradeDeployment_Step6.2:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: f8f9b3fe-8c32-4489-a2bc-a734a282a882
  status: CREATE_FAILED
  status_reason: |
    Error: resources[2]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 1
  deploy_stdout: |
    ...
    neutron-l3-agent is started
    Mon May 22 23:35:33 UTC 2017 bf107d86-2041-4e92-899a-ed12291abc5e tripleo-upgrade overcloud-controller-2 Going to systemctl start neutron-metadata-agent
    Mon May 22 23:35:34 UTC 2017 bf107d86-2041-4e92-899a-ed12291abc5e tripleo-upgrade overcloud-controller-2 Going to systemctl enable neutron-metadata-agent
    Mon May 22 23:35:34 UTC 2017 bf107d86-2041-4e92-899a-ed12291abc5e tripleo-upgrade overcloud-controller-2 Going to check_resource_systemd for neutron-metadata-agent to be started
    neutron-metadata-agent is started
    Mon May 22 23:35:35 UTC 2017 bf107d86-2041-4e92-899a-ed12291abc5e tripleo-upgrade overcloud-controller-2 Going to systemctl start neutron-netns-cleanup
    Mon May 22 23:35:35 UTC 2017 bf107d86-2041-4e92-899a-ed12291abc5e tripleo-upgrade overcloud-controller-2 Going to systemctl enable neutron-netns-cleanup
    Mon May 22 23:35:36 UTC 2017 bf107d86-2041-4e92-899a-ed12291abc5e tripleo-upgrade overcloud-controller-2 Going to check_resource_systemd for neutron-netns-cleanup to be started
    neutron-netns-cleanup is started
    Mon May 22 23:35:36 UTC 2017 bf107d86-2041-4e92-899a-ed12291abc5e tripleo-upgrade overcloud-controller-2 Going to systemctl start neutron-openvswitch-agent
    (truncated, view all with --long)
  deploy_stderr: |
    Created symlink from /etc/systemd/system/multi-user.target.wants/httpd.service to /usr/lib/systemd/system/httpd.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/memcached.service to /usr/lib/systemd/system/memcached.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/mongod.service to /usr/lib/systemd/system/mongod.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/neutron-dhcp-agent.service to /usr/lib/systemd/system/neutron-dhcp-agent.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/neutron-l3-agent.service to /usr/lib/systemd/system/neutron-l3-agent.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/neutron-metadata-agent.service to /usr/lib/systemd/system/neutron-metadata-agent.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/neutron-netns-cleanup.service to /usr/lib/systemd/system/neutron-netns-cleanup.service.
    A dependency job for neutron-openvswitch-agent.service failed. See 'journalctl -xe' for details.
o/ mcornea, I just spent some time looking at the logs - I followed controller-0. The question really is why openvswitch was restarted in the first place. Indeed, the upgrade fails when neutron-openvswitch-agent is started by the upgrade workflow itself, at [0][1]. However, the real problem (that openvswitch is stopped) starts about half an hour before that, when os-net-config runs for vlan200 [2]. Did something change in the network configuration that might cause this?

For the bug you mention in comment #0: I do see the 'manual' upgrade of openvswitch executing fine, and there is no restart of openvswitch because of it, so no, I don't think this is related to bug 1431115.

So if there is nothing you can spot in terms of differences to the vlan200 config that might be causing [2], then we may need to get the ovs folks involved asap, since the whole point of the manual workaround we execute here is to avoid an openvswitch restart (as I said, that seems to be working/doing its job here - ovs is upgraded and other stuff happens after it):

May 22 19:09:03 localhost os-collect-config: openvswitch-2.5.0-14.git20160727.el7fdp.x86_64
May 22 19:09:03 localhost os-collect-config: Manual upgrade of openvswitch - ovs-2.5.0-14 or restart in postun detected
May 22 19:09:03 localhost os-collect-config: /var/lib/heat-config/heat-config-script/OVS_UPGRADE /var/lib/heat-config/heat-config-script
May 22 19:09:03 localhost os-collect-config: Attempting to downloading latest openvswitch with yumdownloader
May 22 19:09:03 localhost os-collect-config: Loaded plugins: product-id
May 22 19:09:03 localhost os-collect-config: --> Running transaction check
May 22 19:09:03 localhost os-collect-config: ---> Package openvswitch.x86_64 0:2.6.1-10.git20161206.el7fdp will be installed
May 22 19:09:03 localhost os-collect-config: --> Finished Dependency Resolution
May 22 19:09:03 localhost os-collect-config: Updating openvswitch-2.6.1-10.git20161206.el7fdp.x86_64.rpm with --nopostun --notriggerun
May 22 19:09:03 localhost os-collect-config: /var/lib/heat-config/heat-config-script
May 22 19:09:03 localhost os-collect-config: Loaded plugins: product-id, search-disabled-repos, subscription-manager
May 22 19:09:03 localhost os-collect-config: Installing:
May 22 19:09:03 localhost os-collect-config: mod_ssl x86_64 1:2.4.6-45.el7_3.4 rhelosp-rhel-7.3-server 105 k

[0] https://github.com/openstack/tripleo-heat-templates/blob/stable/newton/extraconfig/tasks/major_upgrade_controller_pacemaker_6.sh#L13
[1] https://github.com/openstack/tripleo-heat-templates/blob/stable/newton/extraconfig/tasks/pacemaker_common_functions.sh#L189

[2]
May 22 19:09:44 localhost os-collect-config: + os-net-config -c /etc/os-net-config/config.json -v --detailed-exit-codes
May 22 19:09:44 localhost os-collect-config: [2017/05/22 11:09:44 PM] [INFO] Using config file at: /etc/os-net-config/config.json
May 22 19:09:44 localhost os-collect-config: [2017/05/22 11:09:44 PM] [INFO] No changes required for vlan interface: vlan300
May 22 19:09:44 localhost os-collect-config: [2017/05/22 11:09:44 PM] [INFO] No changes required for vlan interface: vlan100
May 22 19:09:44 localhost os-collect-config: [2017/05/22 11:09:44 PM] [INFO] No changes required for vlan interface: vlan301
May 22 19:09:44 localhost os-collect-config: [2017/05/22 11:09:44 PM] [INFO] running ifdown on interface: vlan200
May 22 19:09:44 localhost systemd: Starting Open vSwitch Database Unit...
May 22 19:09:44 localhost ovs-ctl: ovsdb-server is already running.
May 22 19:09:44 localhost ovs-ctl: Enabling remote OVSDB managers [ OK ]
May 22 19:09:44 localhost systemd: Stopping Open vSwitch...
May 22 19:09:44 localhost systemd: Stopped Open vSwitch.
May 22 19:09:44 localhost ovs-ctl: Killing ovsdb-server (830) [ OK ]
May 22 19:09:44 localhost systemd: Stopped Open vSwitch Database Unit.
May 22 19:09:44 localhost ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl -t 10 -- --if-exists del-port br-infra vlan200
May 22 19:09:44 localhost ovs-vsctl: ovs|00002|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
May 22 19:09:44 localhost os-collect-config: [2017/05/22 11:09:44 PM] [INFO] running ifdown on interface: eth2
May 22 19:09:44 localhost systemd: Starting Open vSwitch Database Unit...
May 22 19:09:44 localhost ovs-ctl: Backing up database to /etc/openvswitch/conf.db.backup7.12.1-2211824403 [ OK ]
May 22 19:09:44 localhost ovs-ctl: Compacting database [ OK ]
May 22 19:09:44 localhost ovs-ctl: Converting database schema [ OK ]
May 22 19:09:44 localhost ovs-ctl: Starting ovsdb-server [ OK ]
May 22 19:09:48 localhost systemd: Starting Open vSwitch Forwarding Unit...
May 22 19:09:48 localhost ovs-vswitchd: ovs|00006|bridge|ERR|another ovs-vswitchd process is running, disabling this process (pid 29426) until it goes away
May 22 19:12:53 localhost systemd: ovs-vswitchd.service start operation timed out. Terminating.
May 22 19:12:53 localhost systemd: Failed to start Open vSwitch Forwarding Unit.
May 22 19:12:53 localhost systemd: Dependency failed for Open vSwitch.
May 22 19:12:53 localhost systemd: Job openvswitch.service/start failed with result 'dependency'.
May 22 19:12:53 localhost ovs-ctl: Starting ovs-vswitchd
May 22 19:12:53 localhost systemd: Unit ovs-vswitchd.service entered failed state.
May 22 19:12:53 localhost systemd: ovs-vswitchd.service failed.
May 22 19:12:53 localhost ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl -t 10 -- --may-exist add-br br-storage -- set bridge br-storage other-config:hwaddr=52:54:00:37:33:9d -- set bridge br-storage fail_mode=standalone
May 22 19:12:53 localhost kernel: device br-storage entered promiscuous mode
May 22 19:12:53 localhost NetworkManager[693]: <info> [1495494773.0222] manager: (br-storage): new Generic device (/org/freedesktop/NetworkManager/Devices/26)
May 22 19:12:53 localhost NetworkManager[693]: <info> [1495494773.0437] device (br-storage): link connected
May 22 19:12:53 localhost os-collect-config: [2017/05/22 11:12:53 PM] [INFO] running ifup on bridge: br-ex
May 22 19:12:53 localhost systemd: Starting Open vSwitch Forwarding Unit...
May 22 19:12:53 localhost ovs-vswitchd: ovs|00006|bridge|ERR|another ovs-vswitchd process is running, disabling this process (pid 29705) until it goes away
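For reference, the 'manual' OVS upgrade workaround that the 19:09:03 log lines above show running looks roughly like this (a sketch reconstructed from the log output and the scripts in [0]/[1], not the verbatim tripleo-heat-templates code):

  # Sketch only: upgrade openvswitch in place without letting the RPM
  # %postun/%triggerun scriptlets restart the service mid-upgrade.
  mkdir -p OVS_UPGRADE && pushd OVS_UPGRADE
  # download the target openvswitch package set locally
  yumdownloader --resolve openvswitch
  # install it while suppressing the postun/triggerun scriptlets,
  # which are what normally restart openvswitch on upgrade
  rpm -U --replacepkgs --nopostun --notriggerun ./*.rpm
  popd

The key point for this bug is that this package-level restart suppression is doing its job; the stop/start that breaks things comes later, from the network scripts.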
Forgot to ask: is this a recent regression as far as you know (e.g. when did this last work, if it was recently)?
(In reply to marios from comment #3)
> forgot to ask, is this a recent regression as far as you know (e.g. when did
> this last work if it was recently)

This showed up during the OSP9 to OSP10 testing with ovs 2.6.

The current vlan200 configuration (after the failed upgrade):

[root@overcloud-controller-0 heat-admin]# cat /etc/sysconfig/network-scripts/ifcfg-vlan200
# This file is autogenerated by os-net-config
DEVICE=vlan200
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
DEVICETYPE=ovs
TYPE=OVSIntPort
OVS_BRIDGE=br-infra
OVS_OPTIONS="tag=200"
BOOTPROTO=static
IPADDR=10.0.0.14
NETMASK=255.255.255.128

Will get back to see how it looks on an OSP9 deployment.
Fresh OSP9 deployment:

[root@overcloud-controller-0 heat-admin]# cat /etc/sysconfig/network-scripts/ifcfg-vlan200
# This file is autogenerated by os-net-config
DEVICE=vlan200
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
DEVICETYPE=ovs
TYPE=OVSIntPort
OVS_BRIDGE=br-infra
OVS_OPTIONS="tag=200"
BOOTPROTO=static
IPADDR=10.0.0.15
NETMASK=255.255.255.128
Assigning amuller and adding DFG:Networking, as we discussed during the upgrades scrum this afternoon (moving Upgrades to secondary for now).

@assaf, I assigned you as TC based on 'the mojo doc' - should it be assigned to fleitner instead? This is about testing the ovs 2.5 -> 2.6 upgrade in OSP10 (we are hitting it in the OSP9 -> OSP10 upgrade here). As in comment #2, the workaround (--notriggerun --nopostun) is present and working as expected, and ovs is not restarted because of the package update. However, something (os-net-config) is causing ovs to go down/up. This issue has only just shown up now that we have started testing the 2.5 -> 2.6 ovs upgrade as part of the OSP10 upgrade, as per mcornea's comment #4.
Can we get the network configuration files for this environment? As already indicated by marios, it seems that the interface data for vlan200 has changed, and that change also seems to be triggering an openvswitch restart in os-net-config. From inspecting the code, this shouldn't happen unless OVSDPDK appears somewhere in the interface data, which is not expected given the information provided.

In the meantime, we should examine the contents of the ovs-ctl script for that release of Open vSwitch to determine if there are any conditions that would cause it to restart the services.
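For reference, a couple of quick ways to do that inspection on one of the controllers (standard RHEL locations/commands; nothing here is specific to this environment):

  # init helper shipped with the installed openvswitch package
  less /usr/share/openvswitch/scripts/ovs-ctl
  # scriptlets and triggers the package runs on install/remove/upgrade
  rpm -q --scripts openvswitch
  rpm -q --triggers openvswitch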
These are the network related environments:

/home/stack/openstack_deployment/environments/network-environment.yaml

resource_registry:
  OS::TripleO::Compute::Net::SoftwareConfig: /home/stack/openstack_deployment/nic-configs/compute.yaml
  OS::TripleO::Controller::Net::SoftwareConfig: /home/stack/openstack_deployment/nic-configs/controller.yaml
  OS::TripleO::CephStorage::Net::SoftwareConfig: /home/stack/openstack_deployment/nic-configs/ceph-storage.yaml
  OS::TripleO::ObjectStorage::Net::SoftwareConfig: /home/stack/openstack_deployment/nic-configs/swift-storage.yaml

parameter_defaults:
  InternalApiNetCidr: 10.0.0.0/25
  InternalApiAllocationPools: [{'start': '10.0.0.10', 'end': '10.0.0.100'}]
  InternalApiNetworkVlanID: 200
  StorageNetCidr: 10.0.0.128/25
  StorageAllocationPools: [{'start': '10.0.0.138', 'end': '10.0.0.200'}]
  StorageNetworkVlanID: 300
  StorageMgmtNetCidr: 10.0.1.0/25
  StorageMgmtAllocationPools: [{'start': '10.0.1.10', 'end': '10.0.1.100'}]
  StorageMgmtNetworkVlanID: 301
  ExternalNetCidr: 172.16.18.0/25
  ExternalAllocationPools: [{'start': '172.16.18.25', 'end': '172.16.18.100'}]
  ExternalInterfaceDefaultRoute: 172.16.18.126
  ExternalNetworkVlanID: 100
  TenantNetCidr: 10.0.1.128/25
  TenantAllocationPools: [{'start': '10.0.1.138', 'end': '10.0.1.200'}]
  ManagementNetCidr: 172.16.17.128/25
  ManagementAllocationPools: [{'start': '172.16.17.181', 'end': '172.16.17.210'}]
  ManagementInterfaceDefaultRoute: 172.16.17.254
  ControlPlaneSubnetCidr: "25"
  ControlPlaneDefaultRoute: 192.168.0.1
  EC2MetadataIp: 192.168.0.1
  DnsServers: ["172.16.17.254","172.16.17.254"]
  NtpServer: ["clock.redhat.com","clock.redhat.com"]

/home/stack/openstack_deployment/environments/neutron-settings.yaml

parameter_defaults:
  NeutronExternalNetworkBridge: "''"
  NeutronBridgeMappings: 'datacentre:br-ex,tenantvlan:br-infra'
  NeutronEnableIsolatedMetadata: 'True'
  NeutronNetworkType: 'vxlan,gre,vlan,flat'
  NeutronTunnelTypes: 'vxlan,gre'
  NeutronNetworkVLANRanges: 'datacentre:100:199,tenantvlan:200:299'
  NeutronDhcpAgentsPerNetwork: 3
So there doesn't seem to be DPDK involved here. I will look to see if ovs-ctl could cause a restart for some other reason.
So looking at the log above again:

May 22 19:09:44 localhost systemd: Starting Open vSwitch Database Unit...
May 22 19:09:44 localhost ovs-ctl: ovsdb-server is already running.
May 22 19:09:44 localhost ovs-ctl: Enabling remote OVSDB managers [ OK ]
May 22 19:09:44 localhost systemd: Stopping Open vSwitch...
May 22 19:09:44 localhost systemd: Stopped Open vSwitch.
May 22 19:09:44 localhost ovs-ctl: Killing ovsdb-server (830) [ OK ]
May 22 19:09:44 localhost systemd: Stopped Open vSwitch Database Unit.

So the script is calling ovs-ctl, which is killing the OVS db server. Not sure if a later version of that script would be better at not doing this, as I noticed this upstream change:

commit 452a1f59c9ac25d15a76a0cc0ae617c95f95d5c7
Author: Markos Chandras <mchandras>
Date:   Mon Sep 12 10:07:57 2016 +0100

    ovs-ctl: Handle start up errors.

    Make sure we take the return values into consideration so we can break
    early in case of failures. This makes the ovs-ctl helper more accurate
    in reporting the real status of its managing processes.

Asking someone from the OVS team about this.
Hi,

I skimmed over the bz, and if you have upgraded OVS and not restarted the services, the systemd status of the OVS services is unknown. Then, if anything uses ifdown/ifup, the first thing ifup-ovs/ifdown-ovs does is query systemd for the service status. If it's not running (most probably), it will try to start it, causing all sorts of things.

HTH
fbl
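A rough sketch of the behaviour Flavio describes (an approximation of the ifup-ovs/ifdown-ovs logic, not the literal scripts):

  # Illustrative only - the OVS ifup/ifdown hooks gate on systemd state:
  if ! systemctl --quiet is-active openvswitch.service; then
      # systemd believes OVS is not running (even though ovsdb-server and
      # ovs-vswitchd may still be up after the --nopostun package upgrade),
      # so the hook tries to (re)start the service, which is what kicks off
      # the stop/start seen around the vlan200 ifdown in [2] above.
      systemctl start openvswitch.service
  fi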
From what I understand, our two options are:

1.) update os-net-config and run it before we update OVS
2.) don't allow os-net-config to restart interfaces, even if the interface configuration files have changed

As it is possible that updates to services might depend on changes to interfaces, I think option 1 is the safer, yet more awkward solution. Thoughts?
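A minimal sketch of what option 1 would mean for the controller upgrade script, using only commands that already appear in this bug (the ordering is the point here, not the exact invocations):

  # Sketch only - assumed ordering for option 1:
  # 1. pick up the (fixed) os-net-config package first
  yum -y update os-net-config
  # 2. re-apply the network configuration before the OVS packages are touched
  os-net-config -c /etc/os-net-config/config.json -v --detailed-exit-codes
  # 3. only then run the special-case (--nopostun --notriggerun) OVS upgrade
  special_case_ovs_upgrade_if_needed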
Yes, I think Option 1 could work. How can we test it out?
We are testing option 1 with a test patch proposed directly to newton as well as OSP 10. See: https://review.openstack.org/#/c/471381/

Due to changes in how network updates are performed, this appears to be unnecessary in Ocata.
After applying https://review.openstack.org/#/c/471381/ I was still able to reproduce the issue reported initially.
The test patch was missing several sites where os-net-config would need to be run before updating openvswitch. I've updated the upstream patch and we should re-test once upstream CI has had a chance to exercise it.
Marius - can you try Brent's latest patch? It's passed upstream jobs. Thanks.
(In reply to Brian Haley from comment #20)
> Marius - can you try Brent's latest patch? It's passed upstream jobs.
> Thanks.

I tested with the latest patch and it is still failing, with a different error this time:

[stack@undercloud-0 ~]$ openstack stack failures list overcloud
overcloud.UpdateWorkflow.ControllerPacemakerUpgradeDeployment_Step2.1:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: 6fe8001e-ecb1-4f51-b2a9-53c1ae48f190
  status: CREATE_FAILED
  status_reason: |
    Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
  deploy_stdout: |
    ...
    active
    active
    active
    active
    active
    active
    active
    inactive
    Delta RPMs disabled because /usr/bin/applydeltarpm not installed.
    yum update os-net-config return code: 0
    (truncated, view all with --long)
  deploy_stderr: |
    ...
    [2017/06/07 08:17:40 PM] [INFO] running ifup on bridge: br-infra
    [2017/06/07 08:17:44 PM] [INFO] running ifup on bridge: br-storage
    [2017/06/07 08:17:45 PM] [INFO] running ifup on bridge: br-ex
    [2017/06/07 08:17:45 PM] [INFO] running ifup on interface: vlan200
    [2017/06/07 08:17:50 PM] [INFO] running ifup on interface: eth2
    [2017/06/07 08:17:50 PM] [INFO] running ifup on interface: vlan301
    [2017/06/07 08:17:55 PM] [INFO] running ifup on interface: vlan300
    [2017/06/07 08:17:59 PM] [INFO] running ifup on interface: eth3
    [2017/06/07 08:18:00 PM] [INFO] running ifup on interface: vlan100
    [2017/06/07 08:18:04 PM] [INFO] running ifup on interface: eth1
    (truncated, view all with --long)
overcloud.UpdateWorkflow.ControllerPacemakerUpgradeDeployment_Step2.0:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: bd59dafe-7eb2-433e-9a11-c625af0eacd1
  status: CREATE_FAILED
  status_reason: |
    Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
  deploy_stdout: |
    ...
    Wed Jun 7 20:17:37 UTC 2017 8bf07210-53c6-4c4a-bde9-2414e13513dd tripleo-upgrade overcloud-controller-0 Going to systemctl stop openstack-swift-container-updater
    Wed Jun 7 20:17:37 UTC 2017 8bf07210-53c6-4c4a-bde9-2414e13513dd tripleo-upgrade overcloud-controller-0 Going to systemctl stop openstack-swift-container
    Wed Jun 7 20:17:37 UTC 2017 8bf07210-53c6-4c4a-bde9-2414e13513dd tripleo-upgrade overcloud-controller-0 Going to systemctl stop openstack-swift-object-auditor
    Wed Jun 7 20:17:38 UTC 2017 8bf07210-53c6-4c4a-bde9-2414e13513dd tripleo-upgrade overcloud-controller-0 Going to systemctl stop openstack-swift-object-replicator
    Wed Jun 7 20:17:38 UTC 2017 8bf07210-53c6-4c4a-bde9-2414e13513dd tripleo-upgrade overcloud-controller-0 Going to systemctl stop openstack-swift-object-updater
    Wed Jun 7 20:17:38 UTC 2017 8bf07210-53c6-4c4a-bde9-2414e13513dd tripleo-upgrade overcloud-controller-0 Going to systemctl stop openstack-swift-object
    Wed Jun 7 20:17:38 UTC 2017 8bf07210-53c6-4c4a-bde9-2414e13513dd tripleo-upgrade overcloud-controller-0 Going to systemctl stop openstack-swift-proxy
    inactive
    Delta RPMs disabled because /usr/bin/applydeltarpm not installed.
    yum update os-net-config return code: 0
    (truncated, view all with --long)
  deploy_stderr: |
    ...
    [2017/06/07 08:17:44 PM] [INFO] running ifup on bridge: br-infra
    [2017/06/07 08:17:48 PM] [INFO] running ifup on bridge: br-storage
    [2017/06/07 08:17:49 PM] [INFO] running ifup on bridge: br-ex
    [2017/06/07 08:17:49 PM] [INFO] running ifup on interface: vlan200
    [2017/06/07 08:17:54 PM] [INFO] running ifup on interface: eth2
    [2017/06/07 08:17:54 PM] [INFO] running ifup on interface: vlan301
    [2017/06/07 08:17:59 PM] [INFO] running ifup on interface: vlan300
    [2017/06/07 08:18:03 PM] [INFO] running ifup on interface: eth3
    [2017/06/07 08:18:04 PM] [INFO] running ifup on interface: vlan100
    [2017/06/07 08:18:08 PM] [INFO] running ifup on interface: eth1
    (truncated, view all with --long)
overcloud.UpdateWorkflow.ControllerPacemakerUpgradeDeployment_Step2.2:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: 334059cd-fac5-48c7-91a5-b18c52c6920d
  status: CREATE_FAILED
  status_reason: |
    Error: resources[2]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
  deploy_stdout: |
    ...
    active
    active
    active
    active
    active
    active
    active
    inactive
    Delta RPMs disabled because /usr/bin/applydeltarpm not installed.
    yum update os-net-config return code: 0
    (truncated, view all with --long)
  deploy_stderr: |
    ...
    [2017/06/07 08:17:42 PM] [INFO] running ifup on bridge: br-infra
    [2017/06/07 08:17:46 PM] [INFO] running ifup on bridge: br-storage
    [2017/06/07 08:17:47 PM] [INFO] running ifup on bridge: br-ex
    [2017/06/07 08:17:47 PM] [INFO] running ifup on interface: vlan200
    [2017/06/07 08:17:52 PM] [INFO] running ifup on interface: eth2
    [2017/06/07 08:17:52 PM] [INFO] running ifup on interface: vlan301
    [2017/06/07 08:17:57 PM] [INFO] running ifup on interface: vlan300
    [2017/06/07 08:18:01 PM] [INFO] running ifup on interface: eth3
    [2017/06/07 08:18:02 PM] [INFO] running ifup on interface: vlan100
    [2017/06/07 08:18:06 PM] [INFO] running ifup on interface: eth1
    (truncated, view all with --long)
Thanks Marius! The logs are very strange. It appears to get past the OVS problem, but then the os-net-config output indicates that the interfaces haven't changed, yet it restarts them anyway and, in doing so, errors out. In short, it looks like the fix has introduced new bugs.

Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] adding vlan: vlan100
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] adding custom route for interface: vlan100
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] adding bridge: br-infra
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] adding interface: eth2
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] adding vlan: vlan200
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] adding bridge: br-storage
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] adding interface: eth3
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] adding vlan: vlan300
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] adding vlan: vlan301
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] adding interface: eth4
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] applying network configs...
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] No changes required for interface: eth4
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] No changes required for interface: eth3
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] No changes required for interface: eth2
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] No changes required for interface: eth1
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] No changes required for interface: eth0
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] No changes required for vlan interface: vlan200
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] No changes required for vlan interface: vlan300
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] No changes required for vlan interface: vlan100
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] No changes required for vlan interface: vlan301
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] running ifdown on interface: vlan200
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] running ifdown on interface: eth2
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] running ifdown on interface: vlan301
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:42 PM] [INFO] running ifdown on interface: vlan300
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:42 PM] [INFO] running ifdown on interface: eth3
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:42 PM] [INFO] running ifdown on interface: vlan100
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:42 PM] [INFO] running ifdown on interface: eth1
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:43 PM] [INFO] running ifdown on bridge: br-infra
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:43 PM] [INFO] running ifdown on bridge: br-storage
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:43 PM] [INFO] running ifdown on bridge: br-ex
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:44 PM] [INFO] Writing config /etc/sysconfig/network-scripts/route6-br-ex
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:44 PM] [INFO] Writing config /etc/sysconfig/network-scripts/route-br-ex
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:44 PM] [INFO] Writing config /etc/sysconfig/network-scripts/ifcfg-br-infra
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:44 PM] [INFO] Writing config /etc/sysconfig/network-scripts/route-br-storage
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:44 PM] [INFO] Writing config /etc/sysconfig/network-scripts/ifcfg-br-storage
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:44 PM] [INFO] Writing config /etc/sysconfig/network-scripts/route6-br-infra
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:44 PM] [INFO] Writing config /etc/sysconfig/network-scripts/route6-br-storage
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:44 PM] [INFO] Writing config /etc/sysconfig/network-scripts/ifcfg-br-ex
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:44 PM] [INFO] Writing config /etc/sysconfig/network-scripts/route-br-infra
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:44 PM] [INFO] running ifup on bridge: br-infra
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:48 PM] [INFO] running ifup on bridge: br-storage
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:49 PM] [INFO] running ifup on bridge: br-ex
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:49 PM] [INFO] running ifup on interface: vlan200
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:54 PM] [INFO] running ifup on interface: eth2
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:54 PM] [INFO] running ifup on interface: vlan301
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:59 PM] [INFO] running ifup on interface: vlan300
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:18:03 PM] [INFO] running ifup on interface: eth3
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:18:04 PM] [INFO] running ifup on interface: vlan100
Jun 7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:18:08 PM] [INFO] running ifup on interface: eth1
Upgrade went past major-upgrade-pacemaker.yaml with the latest revision of the patch, but I'm seeing errors while upgrading compute and ceph nodes:

compute:
/root/tripleo_upgrade_node.sh: line 28: special_case_ovs_upgrade_if_needed: command not found

ceph:
/root/tripleo_upgrade_node.sh: line 17: special_case_ovs_upgrade_if_needed: command not found
The upstream patch has been updated to remove the "test only" commentary from the commit message and to address review feedback.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:1585