Bug 1454640 - OSP9 -> OSP10 with ovs 2.6 upgrade fails with neutron-openvswitch-agent unable to start during major-upgrade-pacemaker.yaml
Summary: OSP9 -> OSP10 with ovs 2.6 upgrade fails with neutron-openvswitch-agent unabl...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: z3
Target Release: 10.0 (Newton)
Assignee: Brent Eagles
QA Contact: Marius Cornea
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-05-23 08:50 UTC by Marius Cornea
Modified: 2023-02-22 23:02 UTC
CC List: 14 users

Fixed In Version: openstack-tripleo-heat-templates-5.2.0-20.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-06-28 14:50:50 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1695893 0 None None None 2017-06-05 12:42:40 UTC
OpenStack gerrit 471381 0 None MERGED Reconfigure interfaces before updating openvswitch 2019-11-27 05:28:48 UTC
Red Hat Product Errata RHBA-2017:1585 0 normal SHIPPED_LIVE Red Hat OpenStack Platform 10 director Bug Fix Advisory 2017-06-28 18:42:51 UTC

Description Marius Cornea 2017-05-23 08:50:00 UTC
Description of problem:
OSP9 -> OSP10 with ovs 2.6 upgrade fails with neutron-openvswitch-agent unable to start during major-upgrade-pacemaker.yaml

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-compat-2.0.0-41.el7ost.noarch
openstack-tripleo-heat-templates-5.2.0-18.el7ost.noarch
openstack-neutron-openvswitch-9.2.0-2.el7ost.noarch
openvswitch-2.6.1-10.git20161206.el7fdp.x86_64
python-openvswitch-2.6.1-10.git20161206.el7fdp.noarch

How reproducible:
100%

Steps to Reproduce:
1. Upgrade undercloud including fix for bug 1431115 (openstack-tripleo-heat-templates-5.2.0-18.el7ost)
2. Upgrade overcloud nodes

Actual results:
Upgrade fails during major-upgrade-pacemaker.yaml

Expected results:
Step completes fine.

Additional info:

Adding sosreports.

It looks like neutron-openvswitch-agent is unable to start because openvswitch is not running:

[root@overcloud-controller-0 heat-admin]# systemctl status openvswitch
● openvswitch.service - Open vSwitch
   Loaded: loaded (/usr/lib/systemd/system/openvswitch.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Mon 2017-05-22 23:09:44 UTC; 9h ago
 Main PID: 1020 (code=exited, status=0/SUCCESS)

May 22 23:25:10 overcloud-controller-0.localdomain systemd[1]: Dependency failed for Open vSwitch.
May 22 23:25:10 overcloud-controller-0.localdomain systemd[1]: Job openvswitch.service/start failed with result 'dependency'.
May 22 23:26:40 overcloud-controller-0.localdomain systemd[1]: Dependency failed for Open vSwitch.
May 22 23:26:40 overcloud-controller-0.localdomain systemd[1]: Job openvswitch.service/start failed with result 'dependency'.
May 22 23:28:15 overcloud-controller-0.localdomain systemd[1]: Dependency failed for Open vSwitch.
May 22 23:28:15 overcloud-controller-0.localdomain systemd[1]: Job openvswitch.service/start failed with result 'dependency'.
May 22 23:29:45 overcloud-controller-0.localdomain systemd[1]: Dependency failed for Open vSwitch.
May 22 23:29:45 overcloud-controller-0.localdomain systemd[1]: Job openvswitch.service/start failed with result 'dependency'.
May 22 23:37:01 overcloud-controller-0.localdomain systemd[1]: Dependency failed for Open vSwitch.
May 22 23:37:01 overcloud-controller-0.localdomain systemd[1]: Job openvswitch.service/start failed with result 'dependency'.


[root@overcloud-controller-0 heat-admin]# systemctl status neutron-openvswitch-agent
● neutron-openvswitch-agent.service - OpenStack Neutron Open vSwitch Agent
   Loaded: loaded (/usr/lib/systemd/system/neutron-openvswitch-agent.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

May 22 22:56:29 overcloud-controller-0.localdomain ovs-ofctl[29532]: ovs|00001|ofp_util|INFO|normalization changed ofp_match, details:
May 22 22:56:29 overcloud-controller-0.localdomain ovs-ofctl[29532]: ovs|00002|ofp_util|INFO| pre: in_port=6,nw_proto=58,tp_src=136
May 22 22:56:29 overcloud-controller-0.localdomain ovs-ofctl[29532]: ovs|00003|ofp_util|INFO|post: in_port=6
May 22 22:56:29 overcloud-controller-0.localdomain ovs-ofctl[29539]: ovs|00001|ofp_util|INFO|normalization changed ofp_match, details:
May 22 22:56:29 overcloud-controller-0.localdomain ovs-ofctl[29539]: ovs|00002|ofp_util|INFO| pre: in_port=7,nw_proto=58,tp_src=136
May 22 22:56:29 overcloud-controller-0.localdomain ovs-ofctl[29539]: ovs|00003|ofp_util|INFO|post: in_port=7
May 22 22:56:31 overcloud-controller-0.localdomain systemd[1]: Stopping OpenStack Neutron Open vSwitch Agent...
May 22 22:56:33 overcloud-controller-0.localdomain systemd[1]: Stopped OpenStack Neutron Open vSwitch Agent.
May 22 23:37:01 overcloud-controller-0.localdomain systemd[1]: Dependency failed for OpenStack Neutron Open vSwitch Agent.
May 22 23:37:01 overcloud-controller-0.localdomain systemd[1]: Job neutron-openvswitch-agent.service/start failed with result 'dependency'.
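The "Dependency failed" messages above mean a unit that openvswitch.service pulls in (ovs-vswitchd.service / ovsdb-server.service in this packaging) failed to start. A couple of illustrative triage commands, not taken from the sosreport:

    # List what openvswitch.service depends on, then read the journal of the
    # units that actually failed (ovs-vswitchd / ovsdb-server).
    systemctl list-dependencies openvswitch.service
    systemctl status ovs-vswitchd.service ovsdb-server.service
    journalctl -u ovs-vswitchd -u ovsdb-server --since "2017-05-22 23:00" --no-pager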


[stack@undercloud-0 ~]$ openstack stack failures list overcloud
overcloud.UpdateWorkflow.ControllerPacemakerUpgradeDeployment_Step6.1:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: 66c0e162-78aa-4721-b522-b0d7d14b8e93
  status: CREATE_FAILED
  status_reason: |
    Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 1
  deploy_stdout: |
    ...
    neutron-l3-agent is started
    Mon May 22 23:35:18 UTC 2017 4d45c76e-a87a-41a3-bc44-07edb44c86eb tripleo-upgrade overcloud-controller-1 Going to systemctl start neutron-metadata-agent
    Mon May 22 23:35:19 UTC 2017 4d45c76e-a87a-41a3-bc44-07edb44c86eb tripleo-upgrade overcloud-controller-1 Going to systemctl enable neutron-metadata-agent
    Mon May 22 23:35:19 UTC 2017 4d45c76e-a87a-41a3-bc44-07edb44c86eb tripleo-upgrade overcloud-controller-1 Going to check_resource_systemd for neutron-metadata-agent to be started
    neutron-metadata-agent is started
    Mon May 22 23:35:20 UTC 2017 4d45c76e-a87a-41a3-bc44-07edb44c86eb tripleo-upgrade overcloud-controller-1 Going to systemctl start neutron-netns-cleanup
    Mon May 22 23:35:21 UTC 2017 4d45c76e-a87a-41a3-bc44-07edb44c86eb tripleo-upgrade overcloud-controller-1 Going to systemctl enable neutron-netns-cleanup
    Mon May 22 23:35:21 UTC 2017 4d45c76e-a87a-41a3-bc44-07edb44c86eb tripleo-upgrade overcloud-controller-1 Going to check_resource_systemd for neutron-netns-cleanup to be started
    neutron-netns-cleanup is started
    Mon May 22 23:35:22 UTC 2017 4d45c76e-a87a-41a3-bc44-07edb44c86eb tripleo-upgrade overcloud-controller-1 Going to systemctl start neutron-openvswitch-agent
    (truncated, view all with --long)
  deploy_stderr: |
    Created symlink from /etc/systemd/system/multi-user.target.wants/httpd.service to /usr/lib/systemd/system/httpd.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/memcached.service to /usr/lib/systemd/system/memcached.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/mongod.service to /usr/lib/systemd/system/mongod.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/neutron-dhcp-agent.service to /usr/lib/systemd/system/neutron-dhcp-agent.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/neutron-l3-agent.service to /usr/lib/systemd/system/neutron-l3-agent.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/neutron-metadata-agent.service to /usr/lib/systemd/system/neutron-metadata-agent.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/neutron-netns-cleanup.service to /usr/lib/systemd/system/neutron-netns-cleanup.service.
    A dependency job for neutron-openvswitch-agent.service failed. See 'journalctl -xe' for details.
overcloud.UpdateWorkflow.ControllerPacemakerUpgradeDeployment_Step6.0:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: ef47c603-38e8-4e9e-bf3d-f254cbad751e
  status: CREATE_FAILED
  status_reason: |
    Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 1
  deploy_stdout: |
    ...
    neutron-l3-agent is started
    Mon May 22 23:35:28 UTC 2017 d5e1af6d-a7bb-4758-8649-df8414a17a81 tripleo-upgrade overcloud-controller-0 Going to systemctl start neutron-metadata-agent
    Mon May 22 23:35:28 UTC 2017 d5e1af6d-a7bb-4758-8649-df8414a17a81 tripleo-upgrade overcloud-controller-0 Going to systemctl enable neutron-metadata-agent
    Mon May 22 23:35:28 UTC 2017 d5e1af6d-a7bb-4758-8649-df8414a17a81 tripleo-upgrade overcloud-controller-0 Going to check_resource_systemd for neutron-metadata-agent to be started
    neutron-metadata-agent is started
    Mon May 22 23:35:29 UTC 2017 d5e1af6d-a7bb-4758-8649-df8414a17a81 tripleo-upgrade overcloud-controller-0 Going to systemctl start neutron-netns-cleanup
    Mon May 22 23:35:30 UTC 2017 d5e1af6d-a7bb-4758-8649-df8414a17a81 tripleo-upgrade overcloud-controller-0 Going to systemctl enable neutron-netns-cleanup
    Mon May 22 23:35:30 UTC 2017 d5e1af6d-a7bb-4758-8649-df8414a17a81 tripleo-upgrade overcloud-controller-0 Going to check_resource_systemd for neutron-netns-cleanup to be started
    neutron-netns-cleanup is started
    Mon May 22 23:35:31 UTC 2017 d5e1af6d-a7bb-4758-8649-df8414a17a81 tripleo-upgrade overcloud-controller-0 Going to systemctl start neutron-openvswitch-agent
    (truncated, view all with --long)
  deploy_stderr: |
    Created symlink from /etc/systemd/system/multi-user.target.wants/httpd.service to /usr/lib/systemd/system/httpd.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/memcached.service to /usr/lib/systemd/system/memcached.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/mongod.service to /usr/lib/systemd/system/mongod.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/neutron-dhcp-agent.service to /usr/lib/systemd/system/neutron-dhcp-agent.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/neutron-l3-agent.service to /usr/lib/systemd/system/neutron-l3-agent.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/neutron-metadata-agent.service to /usr/lib/systemd/system/neutron-metadata-agent.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/neutron-netns-cleanup.service to /usr/lib/systemd/system/neutron-netns-cleanup.service.
    A dependency job for neutron-openvswitch-agent.service failed. See 'journalctl -xe' for details.
overcloud.UpdateWorkflow.ControllerPacemakerUpgradeDeployment_Step6.2:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: f8f9b3fe-8c32-4489-a2bc-a734a282a882
  status: CREATE_FAILED
  status_reason: |
    Error: resources[2]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 1
  deploy_stdout: |
    ...
    neutron-l3-agent is started
    Mon May 22 23:35:33 UTC 2017 bf107d86-2041-4e92-899a-ed12291abc5e tripleo-upgrade overcloud-controller-2 Going to systemctl start neutron-metadata-agent
    Mon May 22 23:35:34 UTC 2017 bf107d86-2041-4e92-899a-ed12291abc5e tripleo-upgrade overcloud-controller-2 Going to systemctl enable neutron-metadata-agent
    Mon May 22 23:35:34 UTC 2017 bf107d86-2041-4e92-899a-ed12291abc5e tripleo-upgrade overcloud-controller-2 Going to check_resource_systemd for neutron-metadata-agent to be started
    neutron-metadata-agent is started
    Mon May 22 23:35:35 UTC 2017 bf107d86-2041-4e92-899a-ed12291abc5e tripleo-upgrade overcloud-controller-2 Going to systemctl start neutron-netns-cleanup
    Mon May 22 23:35:35 UTC 2017 bf107d86-2041-4e92-899a-ed12291abc5e tripleo-upgrade overcloud-controller-2 Going to systemctl enable neutron-netns-cleanup
    Mon May 22 23:35:36 UTC 2017 bf107d86-2041-4e92-899a-ed12291abc5e tripleo-upgrade overcloud-controller-2 Going to check_resource_systemd for neutron-netns-cleanup to be started
    neutron-netns-cleanup is started
    Mon May 22 23:35:36 UTC 2017 bf107d86-2041-4e92-899a-ed12291abc5e tripleo-upgrade overcloud-controller-2 Going to systemctl start neutron-openvswitch-agent
    (truncated, view all with --long)
  deploy_stderr: |
    Created symlink from /etc/systemd/system/multi-user.target.wants/httpd.service to /usr/lib/systemd/system/httpd.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/memcached.service to /usr/lib/systemd/system/memcached.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/mongod.service to /usr/lib/systemd/system/mongod.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/neutron-dhcp-agent.service to /usr/lib/systemd/system/neutron-dhcp-agent.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/neutron-l3-agent.service to /usr/lib/systemd/system/neutron-l3-agent.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/neutron-metadata-agent.service to /usr/lib/systemd/system/neutron-metadata-agent.service.
    Created symlink from /etc/systemd/system/multi-user.target.wants/neutron-netns-cleanup.service to /usr/lib/systemd/system/neutron-netns-cleanup.service.
    A dependency job for neutron-openvswitch-agent.service failed. See 'journalctl -xe' for details.

Comment 2 Marios Andreou 2017-05-23 12:39:20 UTC
o/ mcornea, I just spent some time looking at the logs - I followed controller-0. The real question is why openvswitch was restarted in the first place. Indeed, the upgrade fails when neutron-openvswitch-agent is started by the upgrade workflow itself, at [0][1]. However, the real problem (openvswitch being stopped) starts about half an hour before that, when os-net-config runs for vlan200 [2]. Did something change in the network configuration that might cause this?

As for the bug you mention in comment #0: the 'manual' upgrade of openvswitch executes fine and does not trigger an openvswitch restart, so I don't think this is related to bug 1431115.

So if you can't spot any difference in the vlan200 config that might be causing [2], we may need to get the OVS folks involved ASAP, since the whole point of the manual workaround we execute here is to avoid an openvswitch restart (as I said, that part seems to be working and doing its job: OVS is upgraded and other steps run after it):


        May 22 19:09:03 localhost os-collect-config: openvswitch-2.5.0-14.git20160727.el7fdp.x86_64
        May 22 19:09:03 localhost os-collect-config: Manual upgrade of openvswitch - ovs-2.5.0-14 or restart in postun detected
        May 22 19:09:03 localhost os-collect-config: /var/lib/heat-config/heat-config-script/OVS_UPGRADE /var/lib/heat-config/heat-config-script
        May 22 19:09:03 localhost os-collect-config: Attempting to downloading latest openvswitch with yumdownloader
        May 22 19:09:03 localhost os-collect-config: Loaded plugins: product-id
        May 22 19:09:03 localhost os-collect-config: --> Running transaction check
        May 22 19:09:03 localhost os-collect-config: ---> Package openvswitch.x86_64 0:2.6.1-10.git20161206.el7fdp will be installed
        May 22 19:09:03 localhost os-collect-config: --> Finished Dependency Resolution
        May 22 19:09:03 localhost os-collect-config: Updating openvswitch-2.6.1-10.git20161206.el7fdp.x86_64.rpm with --nopostun --notriggerun
        May 22 19:09:03 localhost os-collect-config: /var/lib/heat-config/heat-config-script
        May 22 19:09:03 localhost os-collect-config: Loaded plugins: product-id, search-disabled-repos, subscription-manager
        May 22 19:09:03 localhost os-collect-config: Installing:
        May 22 19:09:03 localhost os-collect-config: mod_ssl    x86_64    1:2.4.6-45.el7_3.4       rhelosp-rhel-7.3-server    105 k
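For context, a minimal sketch of the no-restart OVS upgrade this workaround performs - the --nopostun/--notriggerun flags match the log line above, but the surrounding script is illustrative, not the actual tripleo-heat-templates code:

    #!/bin/bash
    # Update the openvswitch RPM without running the %postun/%triggerun
    # scriptlets that would restart the daemons during the upgrade.
    set -eu
    workdir=$(mktemp -d /tmp/OVS_UPGRADE.XXXXXX)
    cd "$workdir"

    # Download the new openvswitch package only; installing it through yum
    # would run the scriptlets and bounce the service.
    yumdownloader openvswitch

    # Upgrade in place while suppressing the old package's uninstall-time
    # scriptlets, so ovsdb-server/ovs-vswitchd keep running across the upgrade.
    rpm -U --nopostun --notriggerun openvswitch-*.rpm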



[0] https://github.com/openstack/tripleo-heat-templates/blob/stable/newton/extraconfig/tasks/major_upgrade_controller_pacemaker_6.sh#L13

[1] https://github.com/openstack/tripleo-heat-templates/blob/stable/newton/extraconfig/tasks/pacemaker_common_functions.sh#L189

[2] 
May 22 19:09:44 localhost os-collect-config: + os-net-config -c /etc/os-net-config/config.json -v --detailed-exit-codes
May 22 19:09:44 localhost os-collect-config: [2017/05/22 11:09:44 PM] [INFO] Using config file at: /etc/os-net-config/config.json
May 22 19:09:44 localhost os-collect-config: [2017/05/22 11:09:44 PM] [INFO] No changes required for vlan interface: vlan300
May 22 19:09:44 localhost os-collect-config: [2017/05/22 11:09:44 PM] [INFO] No changes required for vlan interface: vlan100
May 22 19:09:44 localhost os-collect-config: [2017/05/22 11:09:44 PM] [INFO] No changes required for vlan interface: vlan301
May 22 19:09:44 localhost os-collect-config: [2017/05/22 11:09:44 PM] [INFO] running ifdown on interface: vlan200
May 22 19:09:44 localhost systemd: Starting Open vSwitch Database Unit...
May 22 19:09:44 localhost ovs-ctl: ovsdb-server is already running.
May 22 19:09:44 localhost ovs-ctl: Enabling remote OVSDB managers [  OK  ]
May 22 19:09:44 localhost systemd: Stopping Open vSwitch...
May 22 19:09:44 localhost systemd: Stopped Open vSwitch.
May 22 19:09:44 localhost ovs-ctl: Killing ovsdb-server (830) [  OK  ]
May 22 19:09:44 localhost systemd: Stopped Open vSwitch Database Unit.
May 22 19:09:44 localhost ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl -t 10 -- --if-exists del-port br-infra vlan200
May 22 19:09:44 localhost ovs-vsctl: ovs|00002|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
May 22 19:09:44 localhost os-collect-config: [2017/05/22 11:09:44 PM] [INFO] running ifdown on interface: eth2
May 22 19:09:44 localhost systemd: Starting Open vSwitch Database Unit...
May 22 19:09:44 localhost ovs-ctl: Backing up database to /etc/openvswitch/conf.db.backup7.12.1-2211824403 [  OK  ]
May 22 19:09:44 localhost ovs-ctl: Compacting database [  OK  ]
May 22 19:09:44 localhost ovs-ctl: Converting database schema [  OK  ]
May 22 19:09:44 localhost ovs-ctl: Starting ovsdb-server [  OK  ]

May 22 19:09:48 localhost systemd: Starting Open vSwitch Forwarding Unit...
May 22 19:09:48 localhost ovs-vswitchd: ovs|00006|bridge|ERR|another ovs-vswitchd process is running, disabling this process (pid 29426) until it goes away
May 22 19:12:53 localhost systemd: ovs-vswitchd.service start operation timed out. Terminating.
May 22 19:12:53 localhost systemd: Failed to start Open vSwitch Forwarding Unit.
May 22 19:12:53 localhost systemd: Dependency failed for Open vSwitch.
May 22 19:12:53 localhost systemd: Job openvswitch.service/start failed with result 'dependency'.
May 22 19:12:53 localhost ovs-ctl: Starting ovs-vswitchd
May 22 19:12:53 localhost systemd: Unit ovs-vswitchd.service entered failed state.
May 22 19:12:53 localhost systemd: ovs-vswitchd.service failed.
May 22 19:12:53 localhost ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl -t 10 -- --may-exist add-br br-storage -- set bridge br-storage other-config:hwaddr=52:54:00:37:33:9d -- set bridge br-storage fail_mode=standalone
May 22 19:12:53 localhost kernel: device br-storage entered promiscuous mode
May 22 19:12:53 localhost NetworkManager[693]: <info>  [1495494773.0222] manager: (br-storage): new Generic device (/org/freedesktop/NetworkManager/Devices/26)
May 22 19:12:53 localhost NetworkManager[693]: <info>  [1495494773.0437] device (br-storage): link connected
May 22 19:12:53 localhost os-collect-config: [2017/05/22 11:12:53 PM] [INFO] running ifup on bridge: br-ex
May 22 19:12:53 localhost systemd: Starting Open vSwitch Forwarding Unit...
May 22 19:12:53 localhost ovs-vswitchd: ovs|00006|bridge|ERR|another ovs-vswitchd process is running, disabling this process (pid 29705) until it goes away
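As an aside, when "another ovs-vswitchd process is running" shows up, these illustrative checks (not taken from the sosreport) expose the mismatch between the still-running daemon and what systemd believes:

    # Compare the processes actually running with the pidfile the new
    # instance checks and with systemd's view of the units.
    ps -C ovs-vswitchd -o pid,ppid,etime,args
    cat /var/run/openvswitch/ovs-vswitchd.pid
    systemctl status ovs-vswitchd ovsdb-server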

Comment 3 Marios Andreou 2017-05-23 12:43:29 UTC
Forgot to ask: is this a recent regression as far as you know (e.g. when did this last work, if recently)?

Comment 4 Marius Cornea 2017-05-23 12:53:32 UTC
(In reply to marios from comment #3)
> forgot to ask, is this a recent regression as far as you know (e.g. when did
> this last work if it was recently)

This showed up during the OSP9 to OSP10 upgrade testing with OVS 2.6.


The current vlan200 configuration (after failed upgrade):

[root@overcloud-controller-0 heat-admin]# cat /etc/sysconfig/network-scripts/ifcfg-vlan200 
# This file is autogenerated by os-net-config
DEVICE=vlan200
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
DEVICETYPE=ovs
TYPE=OVSIntPort
OVS_BRIDGE=br-infra
OVS_OPTIONS="tag=200"
BOOTPROTO=static
IPADDR=10.0.0.14
NETMASK=255.255.255.128

Will get back to see how it looks on an OSP9 deployment.

Comment 5 Marius Cornea 2017-05-23 14:28:59 UTC
Fresh OSP9 deployment:

[root@overcloud-controller-0 heat-admin]# cat /etc/sysconfig/network-scripts/ifcfg-vlan200 
# This file is autogenerated by os-net-config
DEVICE=vlan200
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
DEVICETYPE=ovs
TYPE=OVSIntPort
OVS_BRIDGE=br-infra
OVS_OPTIONS="tag=200"
BOOTPROTO=static
IPADDR=10.0.0.15
NETMASK=255.255.255.128

Comment 6 Marios Andreou 2017-05-24 14:34:29 UTC
Assigning amuller and adding DFG:Networking, as we discussed during the upgrades scrum this afternoon (moving Upgrades to secondary for now).

@assaf: I assigned you as TC based on 'the mojo doc' - should it be assigned to fleitner instead? This is about testing the OVS 2.5 to 2.6 upgrade in OSP10 (hit here during the OSP9 to OSP10 upgrade). As noted in comment #2, the workaround (--notriggerun --nopostun) is present and working as expected, and OVS is not restarted by the package upgrade. However, something (os-net-config) is causing OVS to go down/up.

This issue only showed up once we started testing the OVS 2.5 to 2.6 upgrade as part of the OSP10 upgrade, per mcornea's comment #4.

Comment 8 Brent Eagles 2017-05-25 19:26:13 UTC
Can we get the network configuration files for this environment? 

As already indicated by marios, it seems that the interface data for vlan200 has changed and that the change is triggering an openvswitch restart in os-net-config. From inspecting the code, this shouldn't happen unless OVSDPDK appears somewhere in the interface data, which is not expected given the information provided.

In the meantime, we should examine the contents of the ovs-ctl script for that release of Open vSwitch to determine if there are any conditions that would cause it to restart the services.
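For anyone following along, a quick way to locate and skim that script on an affected node - the path assumed below is the usual openvswitch RPM layout, so verify it locally:

    # Find where the installed openvswitch package puts ovs-ctl, then look for
    # code paths that stop/start the daemons.
    rpm -ql openvswitch | grep -E '/ovs-ctl$'
    grep -nE 'stop_|start_|restart|force-reload' /usr/share/openvswitch/scripts/ovs-ctl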

Comment 10 Marius Cornea 2017-05-25 19:36:11 UTC
These are the network related environments:

/home/stack/openstack_deployment/environments/network-environment.yaml
resource_registry:
  OS::TripleO::Compute::Net::SoftwareConfig: /home/stack/openstack_deployment/nic-configs/compute.yaml
  OS::TripleO::Controller::Net::SoftwareConfig: /home/stack/openstack_deployment/nic-configs/controller.yaml
  OS::TripleO::CephStorage::Net::SoftwareConfig: /home/stack/openstack_deployment/nic-configs/ceph-storage.yaml
  OS::TripleO::ObjectStorage::Net::SoftwareConfig: /home/stack/openstack_deployment/nic-configs/swift-storage.yaml

parameter_defaults:
  InternalApiNetCidr: 10.0.0.0/25
  InternalApiAllocationPools: [{'start': '10.0.0.10', 'end': '10.0.0.100'}]
  InternalApiNetworkVlanID: 200

  StorageNetCidr: 10.0.0.128/25
  StorageAllocationPools: [{'start': '10.0.0.138', 'end': '10.0.0.200'}]
  StorageNetworkVlanID: 300

  StorageMgmtNetCidr: 10.0.1.0/25
  StorageMgmtAllocationPools: [{'start': '10.0.1.10', 'end': '10.0.1.100'}]
  StorageMgmtNetworkVlanID: 301

  ExternalNetCidr: 172.16.18.0/25
  ExternalAllocationPools: [{'start': '172.16.18.25', 'end': '172.16.18.100'}]
  ExternalInterfaceDefaultRoute: 172.16.18.126
  ExternalNetworkVlanID: 100

  TenantNetCidr: 10.0.1.128/25
  TenantAllocationPools: [{'start': '10.0.1.138', 'end': '10.0.1.200'}]

  ManagementNetCidr: 172.16.17.128/25
  ManagementAllocationPools: [{'start': '172.16.17.181', 'end': '172.16.17.210'}]
  ManagementInterfaceDefaultRoute: 172.16.17.254

  ControlPlaneSubnetCidr: "25"
  ControlPlaneDefaultRoute: 192.168.0.1

  EC2MetadataIp: 192.168.0.1
  DnsServers: ["172.16.17.254","172.16.17.254"]
  NtpServer: ["clock.redhat.com","clock.redhat.com"]

/home/stack/openstack_deployment/environments/neutron-settings.yaml
parameter_defaults:
  NeutronExternalNetworkBridge: "''"
  NeutronBridgeMappings: 'datacentre:br-ex,tenantvlan:br-infra'
  NeutronEnableIsolatedMetadata: 'True'
  NeutronNetworkType: 'vxlan,gre,vlan,flat'
  NeutronTunnelTypes: 'vxlan,gre'
  NeutronNetworkVLANRanges: 'datacentre:100:199,tenantvlan:200:299'
  NeutronDhcpAgentsPerNetwork: 3

Comment 11 Brian Haley 2017-05-30 20:58:51 UTC
So there doesn't seem to be DPDK involved here.  I will look to see if ovs-ctl could cause a restart for some other reason.

Comment 12 Brian Haley 2017-06-01 15:53:49 UTC
So looking at the log above again:

May 22 19:09:44 localhost systemd: Starting Open vSwitch Database Unit...
May 22 19:09:44 localhost ovs-ctl: ovsdb-server is already running.
May 22 19:09:44 localhost ovs-ctl: Enabling remote OVSDB managers [  OK  ]
May 22 19:09:44 localhost systemd: Stopping Open vSwitch...
May 22 19:09:44 localhost systemd: Stopped Open vSwitch.
May 22 19:09:44 localhost ovs-ctl: Killing ovsdb-server (830) [  OK  ]
May 22 19:09:44 localhost systemd: Stopped Open vSwitch Database Unit.

So the script is calling ovs-ctl, which is killing the OVS db server.

Not sure if a later version of that script would avoid this; I noticed this upstream change:

commit 452a1f59c9ac25d15a76a0cc0ae617c95f95d5c7
Author: Markos Chandras <mchandras>
Date:   Mon Sep 12 10:07:57 2016 +0100

    ovs-ctl: Handle start up errors.
    
    Make sure we take the return values into consideration so we can
    break early in case of failures. This makes the ovs-ctl helper more
    accurate in reporting the real status of its managing processes.

Asking someone from the OVS team about this.

Comment 13 Flavio Leitner 2017-06-01 17:41:06 UTC
Hi,

I skimmed over the BZ: if you have upgraded OVS without restarting the services, systemd's view of the OVS services is stale/unknown.

Then, if something runs ifdown/ifup, the first thing ifup-ovs/ifdown-ovs does is query systemd for the service status. If systemd reports it as not running (most probably the case here), they will try to start it, causing all sorts of side effects.

HTH
fbl
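As a rough illustration of the pattern Flavio describes above (a simplified sketch, not the verbatim RHEL ifup-ovs script):

    # The ifup path asks systemd whether OVS is active and, if not, starts it -
    # which after the scriptlet-suppressed RPM upgrade conflicts with the
    # still-running ovs-vswitchd/ovsdb-server processes.
    if ! systemctl --quiet is-active openvswitch.service; then
        systemctl start openvswitch.service
    fi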

Comment 14 Brent Eagles 2017-06-02 14:11:13 UTC
From what I understand, our two options are:

1.) update os-net-config and run it before we update OVS
2.) don't allow os-net-config to restart interfaces, even if the interface configuration files have changed

As it is possible that updates to services might depend on changes to interfaces, I think option 1 is the safer, yet more awkward solution.

Thoughts?
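For illustration, a rough sketch of what option 1 could look like in the controller upgrade step - the config path matches the one used earlier in this report, but the function name and exact ordering are assumptions, not the merged patch:

    # Refresh os-net-config and re-apply the network config *before* the
    # special-cased OVS package upgrade, so any interface restarts happen
    # while the old OVS is still consistent with systemd's view.
    update_network_config_first() {
        yum -q -y update os-net-config
        os-net-config -c /etc/os-net-config/config.json -v --detailed-exit-codes
        rc=$?
        # With --detailed-exit-codes, 0 means "no changes" and 2 means
        # "changes applied"; both must be treated as success by the caller.
        [ $rc -eq 0 ] || [ $rc -eq 2 ] || return 1
    }

    update_network_config_first
    # ...then perform the --nopostun/--notriggerun openvswitch upgrade and
    # the rest of the pacemaker upgrade step.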

Comment 15 Brian Haley 2017-06-02 18:01:34 UTC
Yes, I think Option 1 could work.  How can we test it out?

Comment 16 Brent Eagles 2017-06-06 19:02:55 UTC
We are testing option 1 with a test patch proposed directly to Newton as well as OSP 10.

See: https://review.openstack.org/#/c/471381/

Due to changes in how network updates are performed, this appears to be unnecessary in Ocata.

Comment 17 Marius Cornea 2017-06-06 20:57:57 UTC
After applying https://review.openstack.org/#/c/471381/ I was still able to reproduce the issue reported initially.

Comment 19 Brent Eagles 2017-06-07 15:51:37 UTC
The test patch was missing several sites where os-net-config would need to be run before updating openvswitch. I've updated the upstream patch and we should re-test once upstream CI has had a chance to exercise it.

Comment 20 Brian Haley 2017-06-07 17:56:50 UTC
Marius - can you try Brent's latest patch?  It's passed upstream jobs.  Thanks.

Comment 21 Marius Cornea 2017-06-07 20:26:16 UTC
(In reply to Brian Haley from comment #20)
> Marius - can you try Brent's latest patch?  It's passed upstream jobs. 
> Thanks.

I tested with the latest patch and it is still failing, with a different error this time:


[stack@undercloud-0 ~]$ openstack stack failures list overcloud
overcloud.UpdateWorkflow.ControllerPacemakerUpgradeDeployment_Step2.1:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: 6fe8001e-ecb1-4f51-b2a9-53c1ae48f190
  status: CREATE_FAILED
  status_reason: |
    Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
  deploy_stdout: |
    ...
    active
    active
    active
    active
    active
    active
    active
    inactive
    Delta RPMs disabled because /usr/bin/applydeltarpm not installed.
    yum update os-net-config return code: 0
    (truncated, view all with --long)
  deploy_stderr: |
    ...
    [2017/06/07 08:17:40 PM] [INFO] running ifup on bridge: br-infra
    [2017/06/07 08:17:44 PM] [INFO] running ifup on bridge: br-storage
    [2017/06/07 08:17:45 PM] [INFO] running ifup on bridge: br-ex
    [2017/06/07 08:17:45 PM] [INFO] running ifup on interface: vlan200
    [2017/06/07 08:17:50 PM] [INFO] running ifup on interface: eth2
    [2017/06/07 08:17:50 PM] [INFO] running ifup on interface: vlan301
    [2017/06/07 08:17:55 PM] [INFO] running ifup on interface: vlan300
    [2017/06/07 08:17:59 PM] [INFO] running ifup on interface: eth3
    [2017/06/07 08:18:00 PM] [INFO] running ifup on interface: vlan100
    [2017/06/07 08:18:04 PM] [INFO] running ifup on interface: eth1
    (truncated, view all with --long)
overcloud.UpdateWorkflow.ControllerPacemakerUpgradeDeployment_Step2.0:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: bd59dafe-7eb2-433e-9a11-c625af0eacd1
  status: CREATE_FAILED
  status_reason: |
    Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
  deploy_stdout: |
    ...
    Wed Jun  7 20:17:37 UTC 2017 8bf07210-53c6-4c4a-bde9-2414e13513dd tripleo-upgrade overcloud-controller-0 Going to systemctl stop openstack-swift-container-updater
    Wed Jun  7 20:17:37 UTC 2017 8bf07210-53c6-4c4a-bde9-2414e13513dd tripleo-upgrade overcloud-controller-0 Going to systemctl stop openstack-swift-container
    Wed Jun  7 20:17:37 UTC 2017 8bf07210-53c6-4c4a-bde9-2414e13513dd tripleo-upgrade overcloud-controller-0 Going to systemctl stop openstack-swift-object-auditor
    Wed Jun  7 20:17:38 UTC 2017 8bf07210-53c6-4c4a-bde9-2414e13513dd tripleo-upgrade overcloud-controller-0 Going to systemctl stop openstack-swift-object-replicator
    Wed Jun  7 20:17:38 UTC 2017 8bf07210-53c6-4c4a-bde9-2414e13513dd tripleo-upgrade overcloud-controller-0 Going to systemctl stop openstack-swift-object-updater
    Wed Jun  7 20:17:38 UTC 2017 8bf07210-53c6-4c4a-bde9-2414e13513dd tripleo-upgrade overcloud-controller-0 Going to systemctl stop openstack-swift-object
    Wed Jun  7 20:17:38 UTC 2017 8bf07210-53c6-4c4a-bde9-2414e13513dd tripleo-upgrade overcloud-controller-0 Going to systemctl stop openstack-swift-proxy
    inactive
    Delta RPMs disabled because /usr/bin/applydeltarpm not installed.
    yum update os-net-config return code: 0
    (truncated, view all with --long)
  deploy_stderr: |
    ...
    [2017/06/07 08:17:44 PM] [INFO] running ifup on bridge: br-infra
    [2017/06/07 08:17:48 PM] [INFO] running ifup on bridge: br-storage
    [2017/06/07 08:17:49 PM] [INFO] running ifup on bridge: br-ex
    [2017/06/07 08:17:49 PM] [INFO] running ifup on interface: vlan200
    [2017/06/07 08:17:54 PM] [INFO] running ifup on interface: eth2
    [2017/06/07 08:17:54 PM] [INFO] running ifup on interface: vlan301
    [2017/06/07 08:17:59 PM] [INFO] running ifup on interface: vlan300
    [2017/06/07 08:18:03 PM] [INFO] running ifup on interface: eth3
    [2017/06/07 08:18:04 PM] [INFO] running ifup on interface: vlan100
    [2017/06/07 08:18:08 PM] [INFO] running ifup on interface: eth1
    (truncated, view all with --long)
overcloud.UpdateWorkflow.ControllerPacemakerUpgradeDeployment_Step2.2:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: 334059cd-fac5-48c7-91a5-b18c52c6920d
  status: CREATE_FAILED
  status_reason: |
    Error: resources[2]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
  deploy_stdout: |
    ...
    active
    active
    active
    active
    active
    active
    active
    inactive
    Delta RPMs disabled because /usr/bin/applydeltarpm not installed.
    yum update os-net-config return code: 0
    (truncated, view all with --long)
  deploy_stderr: |
    ...
    [2017/06/07 08:17:42 PM] [INFO] running ifup on bridge: br-infra
    [2017/06/07 08:17:46 PM] [INFO] running ifup on bridge: br-storage
    [2017/06/07 08:17:47 PM] [INFO] running ifup on bridge: br-ex
    [2017/06/07 08:17:47 PM] [INFO] running ifup on interface: vlan200
    [2017/06/07 08:17:52 PM] [INFO] running ifup on interface: eth2
    [2017/06/07 08:17:52 PM] [INFO] running ifup on interface: vlan301
    [2017/06/07 08:17:57 PM] [INFO] running ifup on interface: vlan300
    [2017/06/07 08:18:01 PM] [INFO] running ifup on interface: eth3
    [2017/06/07 08:18:02 PM] [INFO] running ifup on interface: vlan100
    [2017/06/07 08:18:06 PM] [INFO] running ifup on interface: eth1
    (truncated, view all with --long)

Comment 23 Brent Eagles 2017-06-08 16:23:40 UTC
Thanks Marius! 

The logs are very strange. The run appears to get past the OVS problem, but then the os-net-config output reports that the interfaces haven't changed, restarts them anyway, and errors out in doing so. In short, it looks like the fix has introduced new bugs.
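As an aside, one way to see what os-net-config intends to do without actually restarting interfaces is a dry run - assuming the --noop flag behaves as I recall, which is worth double-checking against the installed version:

    # Print what would be written/changed without applying it, so interface
    # restarts can be ruled in or out before an upgrade step.
    os-net-config -c /etc/os-net-config/config.json --noop --detailed-exit-codes -v
    # With --detailed-exit-codes: 0 = no changes, 2 = changes would be applied.
    echo "exit code: $?"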

Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] adding vlan: vlan100
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] adding custom route for interface: vlan100
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] adding bridge: br-infra
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] adding interface: eth2
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] adding vlan: vlan200
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] adding bridge: br-storage
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] adding interface: eth3
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] adding vlan: vlan300
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] adding vlan: vlan301
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] adding interface: eth4
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] applying network configs...
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] No changes required for interface: eth4
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] No changes required for interface: eth3
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] No changes required for interface: eth2
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] No changes required for interface: eth1
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] No changes required for interface: eth0
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] No changes required for vlan interface: vlan200
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] No changes required for vlan interface: vlan300
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] No changes required for vlan interface: vlan100
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] No changes required for vlan interface: vlan301
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] running ifdown on interface: vlan200
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] running ifdown on interface: eth2
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:41 PM] [INFO] running ifdown on interface: vlan301
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:42 PM] [INFO] running ifdown on interface: vlan300
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:42 PM] [INFO] running ifdown on interface: eth3
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:42 PM] [INFO] running ifdown on interface: vlan100
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:42 PM] [INFO] running ifdown on interface: eth1
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:43 PM] [INFO] running ifdown on bridge: br-infra
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:43 PM] [INFO] running ifdown on bridge: br-storage
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:43 PM] [INFO] running ifdown on bridge: br-ex
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:44 PM] [INFO] Writing config /etc/sysconfig/network-scripts/route6-br-ex
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:44 PM] [INFO] Writing config /etc/sysconfig/network-scripts/route-br-ex
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:44 PM] [INFO] Writing config /etc/sysconfig/network-scripts/ifcfg-br-infra
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:44 PM] [INFO] Writing config /etc/sysconfig/network-scripts/route-br-storage
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:44 PM] [INFO] Writing config /etc/sysconfig/network-scripts/ifcfg-br-storage
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:44 PM] [INFO] Writing config /etc/sysconfig/network-scripts/route6-br-infra
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:44 PM] [INFO] Writing config /etc/sysconfig/network-scripts/route6-br-storage
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:44 PM] [INFO] Writing config /etc/sysconfig/network-scripts/ifcfg-br-ex
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:44 PM] [INFO] Writing config /etc/sysconfig/network-scripts/route-br-infra
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:44 PM] [INFO] running ifup on bridge: br-infra
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:48 PM] [INFO] running ifup on bridge: br-storage
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:49 PM] [INFO] running ifup on bridge: br-ex
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:49 PM] [INFO] running ifup on interface: vlan200
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:54 PM] [INFO] running ifup on interface: eth2
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:54 PM] [INFO] running ifup on interface: vlan301
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:17:59 PM] [INFO] running ifup on interface: vlan300
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:18:03 PM] [INFO] running ifup on interface: eth3
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:18:04 PM] [INFO] running ifup on interface: vlan100
Jun  7 16:18:09 host-192-168-0-18 os-collect-config: [2017/06/07 08:18:08 PM] [INFO] running ifup on interface: eth1

Comment 26 Marius Cornea 2017-06-10 11:00:03 UTC
Upgrade went past major-upgrade-pacemaker.yaml with the latest revision of the patch but I'm seeing errors while upgrading compute and ceph nodes:

compute:

/root/tripleo_upgrade_node.sh: line 28: special_case_ovs_upgrade_if_needed: command not found


ceph:

/root/tripleo_upgrade_node.sh: line 17: special_case_ovs_upgrade_if_needed: command not found
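For reference, a quick diagnostic to confirm whether the helper is defined anywhere in the generated per-node upgrade script (the function itself comes from the tripleo-heat-templates upgrade scripts):

    # If this only matches the call site and not a function definition, the
    # template that builds the script omitted the helper for non-controller roles.
    grep -n 'special_case_ovs_upgrade_if_needed' /root/tripleo_upgrade_node.sh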

Comment 27 Brent Eagles 2017-06-14 13:18:54 UTC
The upstream patch has been updated to remove the "test only" commentary from the commit message and to address review feedback.

Comment 30 errata-xmlrpc 2017-06-28 14:50:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1585

