Bug 1491628 - OSP11 -> OSP12 upgrade: Unable to spawn instance post upgrade: Failed to allocate the network(s), not rescheduling.", "code": 500, "details": "
Summary: OSP11 -> OSP12 upgrade: Unable to spawn instance post upgrade: Failed to allo...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: beta
Target Release: 12.0 (Pike)
Assignee: Sofer Athlan-Guyot
QA Contact: Marius Cornea
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-09-14 10:04 UTC by Marius Cornea
Modified: 2021-06-10 13:04 UTC
CC: 16 users

Fixed In Version: openstack-tripleo-heat-templates-7.0.3-0.20171023134948.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-13 22:08:57 UTC
Target Upstream Version:


Attachments
logs.tar.gz (deleted), 2017-09-14 10:04 UTC, Marius Cornea
logs (1.60 MB, application/x-gzip), 2017-09-14 10:05 UTC, Marius Cornea


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1721073 0 None None None 2017-10-03 16:18:06 UTC
OpenStack gerrit 510577 0 'None' MERGED Special treatment for os-net-config upgrade. 2020-11-19 18:37:08 UTC
Red Hat Bugzilla 1434621 1 None None None 2021-01-20 06:05:38 UTC
Red Hat Bugzilla 1473763 0 urgent CLOSED openstack-neutron: after rebooting compute, br-isolated doesn't have any flows. 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1496468 0 high CLOSED OSP11 -> OSP12 upgrade: connectivity to floating IP gets is disrupted during major-upgrade-composable-steps-docker.yaml 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHEA-2017:3462 0 normal SHIPPED_LIVE Red Hat OpenStack Platform 12.0 Enhancement Advisory 2018-02-16 01:43:25 UTC

Internal Links: 1434621 1473763 1496468 1498639

Description Marius Cornea 2017-09-14 10:04:51 UTC
Description of problem:
OSP11 -> OSP12 upgrade: Unable to spawn instance post upgrade: Failed to allocate the network(s), not rescheduling.", "code": 500, "details": " 

(overcloud) [stack@undercloud-0 ~]$ openstack server show instance_dcb0e241b0 -f json
{
  "OS-EXT-STS:task_state": null, 
  "addresses": "", 
  "image": "fedora (694c014b-ea70-4680-8598-673363feadf8)", 
  "OS-EXT-STS:vm_state": "error", 
  "OS-EXT-SRV-ATTR:instance_name": "instance-0000011f", 
  "OS-SRV-USG:launched_at": null, 
  "flavor": "v1-1G-5G (a2e841d5-044f-4e69-98dc-20154a4aecf5)", 
  "id": "e7531d26-46a7-403d-a23a-7f8b74bcd83b", 
  "volumes_attached": "", 
  "user_id": "ea0acfdcc9194322a960862888413aca", 
  "OS-DCF:diskConfig": "MANUAL", 
  "accessIPv4": "", 
  "accessIPv6": "", 
  "OS-EXT-STS:power_state": "NOSTATE", 
  "OS-EXT-AZ:availability_zone": "nova", 
  "config_drive": "", 
  "status": "ERROR", 
  "updated": "2017-09-14T09:38:07Z", 
  "hostId": "", 
  "OS-EXT-SRV-ATTR:host": null, 
  "OS-SRV-USG:terminated_at": null, 
  "key_name": "userkey", 
  "properties": "", 
  "project_id": "5fcaac30c13544338171170ff452dbfb", 
  "OS-EXT-SRV-ATTR:hypervisor_hostname": null, 
  "name": "instance_dcb0e241b0", 
  "created": "2017-09-14T09:32:29Z", 
  "fault": {
    "message": "Build of instance e7531d26-46a7-403d-a23a-7f8b74bcd83b aborted: Failed to allocate the network(s), not rescheduling.", 
    "code": 500, 
    "details": "  File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\", line 1829, in _do_build_and_run_instance\n    filter_properties)\n  File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\", line 2035, in _build_and_run_instance\n    reason=msg)\n", 
    "created": "2017-09-14T09:38:06Z"
  }
}

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Deploy OSP11 with 3 controllers, 2 compute nodes, 3 ceph nodes
2. Upgrade to OSP12
3. Launch instance connected to vxlan network

Actual results:
Instance ends up in ERROR state.

Expected results:
Instance boots properly.

Additional info:

Attaching /var/log/containers/nova/nova-compute.log and /var/log/neutron logs.

Comment 1 Marius Cornea 2017-09-14 10:05:45 UTC
Created attachment 1325873 [details]
logs

Comment 2 Marius Cornea 2017-09-14 10:07:22 UTC
/var/log/neutron/openvswitch-agent.log on compute shows the following error:

2017-09-14 10:06:43.287 80621 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-accbdff4-4ced-4924-b9ff-c155e6f87818 - - - - -] Switch connection timeout: RuntimeError: Switch connection timeout

Comment 3 Marius Cornea 2017-09-14 10:13:21 UTC
Note: after rebooting the compute node I was able to successfully spawn an instance.

Comment 4 Jakub Libosvar 2017-09-18 13:44:50 UTC
I'm assigning it to myself for triage.

Comment 5 Jakub Libosvar 2017-09-18 15:10:03 UTC
Can you please attach a sosreport from the controller nodes and the affected node? Isn't nova supposed to reschedule an instance in case it fails on one node? Are all nodes affected?

Comment 6 Jakub Libosvar 2017-09-18 15:16:06 UTC
Some things I noticed in the agent logs:

1)
2017-09-14 03:19:57.236 80621 INFO neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_bridge [req-accbdff4-4ced-4924-b9ff-c155e6f87818 - - - - -] Bridge br-isolated changed its datapath-ID from 525400008148 to 0000525400008148

This comes from the openvswitch datapath; was the openvswitch package updated?

2) I see br-isolated, which per my previous experience carries management traffic; that has not been recommended since OSP 11. Does the issue also happen on the other provider bridges used in the deployment, or just on this bridge? Could you avoid the br-isolated bridge and replace it with Linux devices as per https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/11/html-single/advanced_overcloud_customization/#sect-Isolating_Networks ?

This figure can be helpful: https://docs.google.com/presentation/d/1QkqESJSYIIDxp9D11TaGazAaE5-ZunaobkYBy77dSsc/edit#slide=id.g1d2ba4a634_0_51

Comment 7 Marius Cornea 2017-09-18 16:26:01 UTC
(In reply to Jakub Libosvar from comment #6)
> Some things I noticed in the agent logs:
> 
> 1)
> 2017-09-14 03:19:57.236 80621 INFO
> neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_bridge
> [req-accbdff4-4ced-4924-b9ff-c155e6f87818 - - - - -] Bridge br-isolated
> changed its datapath-ID from 525400008148 to 0000525400008148
> 
> this comes from openvswitch datapath, was the openvswitch package updated?

Yes, openvswitch gets updated:

/var/log/yum.log
Sep 18 13:18:29 Updated: openvswitch-2.7.2-4.git20170719.el7fdp.x86_64

Note that I keep seeing these messages after upgrade:

2017-09-18 15:54:17.996 2548 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-af758045-8745-4972-8acc-d7e84b8ff0a0 - - - - -] Switch connection timeout: RuntimeError: Switch connection timeout
2017-09-18 15:54:17.997 2548 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn command(idx=0): DbGetCommand(column=datapath_id, table=Bridge, record=br-isolated) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2017-09-18 15:54:17.998 2548 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Transaction caused no change do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:110
2017-09-18 15:54:17.998 2548 INFO neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_bridge [req-af758045-8745-4972-8acc-d7e84b8ff0a0 - - - - -] Bridge br-isolated changed its datapath-ID from 525400f4311e to 0000525400f4311e
2017-09-18 15:54:48.014 2548 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-af758045-8745-4972-8acc-d7e84b8ff0a0 - - - - -] Switch connection timeout: RuntimeError: Switch connection timeout
2017-09-18 15:54:48.016 2548 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn command(idx=0): DbGetCommand(column=datapath_id, table=Bridge, record=br-isolated) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84


> 2) I see br-isolated that as per my previous experience is used for
> management traffic which is not recommended since OSP 11. Does the issue
> happen also for other provider bridges that are used in the deployment or is
> it just this bridge? Can you not use br-isolated bridge and replace it with
> linux devices as per
> https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/11/html-single/advanced_overcloud_customization/#sect-Isolating_Networks ?
> 
> This figure can be helpful:
> https://docs.google.com/presentation/d/1QkqESJSYIIDxp9D11TaGazAaE5-ZunaobkYBy77dSsc/edit#slide=id.g1d2ba4a634_0_51

I'll try to get such a topology tested, but it might take a while to recreate the existing templates we're currently using for testing. Nevertheless, I think we should cover this case as well, because in pre-11 environments this recommendation wasn't present. In past releases we ran workarounds as part of the upgrade workflow to prevent issues caused by the OVS upgrade, so maybe we can do the same here once we know what the cause is.

I'll get back with the sosreports after I reproduce the environment.

Comment 8 Jakub Libosvar 2017-09-18 16:34:55 UTC
(In reply to Marius Cornea from comment #7)
> (In reply to Jakub Libosvar from comment #6)
[...]
> 
> I'll get back with the sosreports after I reproduce the environment.

Thanks. To me it sounds like the minimal reproducer could be to create an OVS bridge with an interface on the old openvswitch. Check its datapath id:

  sudo ovs-ofctl show <the-bridge>
  OFPT_FEATURES_REPLY (xid=0x2): dpid:0000121147706d4a <--- THIS IS THE ID

then update ovs and check the dpid again. I could try it myself if you provide the original openvswitch version and the version it was updated to (openvswitch-2.7.2-4.git20170719.el7fdp.x86_64?).
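The before/after check above can be scripted; here is a minimal sketch (the helper name and the parsing are assumptions, and the optional `features_reply` argument exists only so the parser can be exercised without a live switch):

```python
import re
import subprocess

def bridge_dpid(bridge, features_reply=None):
    """Extract the dpid from `ovs-ofctl show <bridge>` output.

    If features_reply is given, parse that text instead of invoking
    ovs-ofctl (useful for offline testing against captured output).
    """
    if features_reply is None:
        features_reply = subprocess.check_output(
            ["ovs-ofctl", "show", bridge], text=True)
    m = re.search(r"dpid:([0-9a-f]+)", features_reply)
    return m.group(1) if m else None
```

Running it before the update, saving the value, and running it again afterwards would show whether the upgrade itself rewrites the dpid.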

Comment 10 Marius Cornea 2017-09-19 14:09:25 UTC
Checking br-isolated on one of the compute nodes:

before upgrade:

[root@compute-1 ~]# ovs-ofctl show br-isolated
OFPT_FEATURES_REPLY (xid=0x2): dpid:0000525400472864
n_tables:254, n_buffers:256
capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP
actions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst
 1(eth1): addr:52:54:00:47:28:64
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 2(vlan20): addr:02:6d:a5:c6:2d:94
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 3(vlan30): addr:c6:7a:8a:2c:a9:e2
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 4(vlan50): addr:32:ce:5e:24:81:9d
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 5(phy-br-isolated): addr:0a:54:b8:d3:56:6a
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 LOCAL(br-isolated): addr:52:54:00:47:28:64
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0


after upgrade:

[root@compute-1 ~]# ovs-ofctl show br-isolated
OFPT_FEATURES_REPLY (xid=0x2): dpid:0000525400472864
n_tables:254, n_buffers:256
capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP
actions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst
 1(vlan20): addr:c2:13:12:42:fd:63
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 2(vlan30): addr:aa:0d:47:50:bb:03
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 3(vlan50): addr:16:3f:5d:93:bf:f1
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 4(eth1): addr:52:54:00:47:28:64
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 LOCAL(br-isolated): addr:52:54:00:47:28:64
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0

We can see that the dpid remains the same, but /var/log/neutron/openvswitch-agent.log shows that it has changed:

2017-09-19 14:08:04.475 82448 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-86bca5ee-e327-4e82-889c-e42641ffb240 - - - - -] Switch connection timeout: RuntimeError: Switch connection timeout
2017-09-19 14:08:04.477 82448 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn command(idx=0): DbGetCommand(column=datapath_id, table=Bridge, record=br-isolated) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2017-09-19 14:08:04.477 82448 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Transaction caused no change do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:110
2017-09-19 14:08:04.478 82448 INFO neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_bridge [req-86bca5ee-e327-4e82-889c-e42641ffb240 - - - - -] Bridge br-isolated changed its datapath-ID from 525400472864 to 0000525400472864
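Interpreted as hex numbers, the "before" and "after" dpids in that log line are identical; only the zero-padding differs. A minimal check (a sketch for illustration, not the agent's actual comparison code):

```python
def same_dpid(a, b):
    """Compare two OVS datapath IDs numerically, ignoring hex zero-padding."""
    return int(a, 16) == int(b, 16)
```

So the logged "changed its datapath-ID" message reflects a difference in string representation, not a real change of the datapath.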

Comment 11 Marius Cornea 2017-09-19 14:10:20 UTC
Note: this is before rebooting the node so the old ovs is still running:

[root@compute-1 ~]# ovs-vsctl show | grep ovs_version
    ovs_version: "2.6.1"
[root@compute-1 ~]# rpm -qa | grep openvswitch
openvswitch-ovn-central-2.7.2-4.git20170719.el7fdp.x86_64
openvswitch-2.7.2-4.git20170719.el7fdp.x86_64
python-openvswitch-2.7.2-4.git20170719.el7fdp.noarch
openstack-neutron-openvswitch-11.0.1-0.20170913033853.6b26bc5.el7ost.noarch
openvswitch-ovn-host-2.7.2-4.git20170719.el7fdp.x86_64
openvswitch-ovn-common-2.7.2-4.git20170719.el7fdp.x86_64

Comment 12 Jakub Libosvar 2017-09-21 12:38:22 UTC
Marius provided me a failed environment yesterday. The debugging showed that the br-isolated provider bridge was correctly initialized by neutron-openvswitch-agent and worked for a while. Then, after the ovs-agent had already started, the br-isolated ifcfg files were executed, which put the bridge into standalone mode and removed its controller, so the ovs-agent couldn't communicate with the ovs-vswitchd process. This was not caused by running systemd unit files.

Comment 13 Jakub Libosvar 2017-09-21 15:20:33 UTC
More information from provided sosreports:

os-net-config package was updated after neutron packages:
Sep 19 11:46:55 Updated: 1:python-neutron-11.0.1-0.20170913033853.6b26bc5.el7ost.noarch
Sep 19 11:47:07 Updated: os-net-config-7.3.0-0.20170910153345.77fe592.el7ost.noarch
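That ordering can be verified mechanically from the yum.log timestamps; a minimal sketch (the year is an assumption, since yum.log omits it):

```python
from datetime import datetime

def update_time(line, year=2017):
    """Parse the timestamp from a yum.log 'Updated:' line."""
    stamp = " ".join(line.split()[:3])          # e.g. "Sep 19 11:47:07"
    return datetime.strptime(f"{stamp} {year}", "%b %d %H:%M:%S %Y")

neutron = update_time("Sep 19 11:46:55 Updated: 1:python-neutron-11.0.1-0.20170913033853.6b26bc5.el7ost.noarch")
onc = update_time("Sep 19 11:47:07 Updated: os-net-config-7.3.0-0.20170910153345.77fe592.el7ost.noarch")
```

Comparing the two confirms os-net-config was updated after the neutron packages.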

neutron-ovs-agent was started at 2017-09-19 11:52:15.002 and configured br-isolated:
2017-09-19 11:52:20.170 82448 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn command(idx=0): SetFailModeCommand(bridge=br-isolated, mode=secure)

br-isolated was taken down and brought back up 18 minutes later:
os-collect-config[3061]: [2017/09/19 12:14:21 PM] [INFO] running ifdown on bridge: br-isolated
os-collect-config[3061]: [2017/09/19 12:14:22 PM] [INFO] running ifup on bridge: br-isolated

neutron-ovs-agent tried to perform an operation on the br-isolated bridge 20 minutes later and failed due to the missing controller, which had been deleted by the ifcfg script:
2017-09-19 12:42:00.987 82448 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-86bca5ee-e327-4e82-889c-e42641ffb240 - - - - -] Switch connection timeout
2017-09-19 12:42:00.989 82448 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn command(idx=0): DbGetCommand(column=datapath_id, table=Bridge, record=br-isolated)
2017-09-19 12:42:00.989 82448 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Transaction caused no change
2017-09-19 12:42:00.990 82448 INFO neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_bridge [req-86bca5ee-e327-4e82-889c-e42641ffb240 - - - - -] Bridge br-isolated changed its datapath-ID from 525400472864 to 0000525400472864


The upgrade procedure should not touch network interfaces that are configured in Neutron after neutron-ovs-agent has started. Can anybody from the Upgrades DFG have a look at the procedure, please? This is outside of Neutron's scope.

Comment 14 Sofer Athlan-Guyot 2017-09-26 17:22:01 UTC
Hi,

In the log we can see that the configuration for ifcfg-br-isolated has changed:

    Sep 19 12:14:21 compute-0 os-collect-config[3059]: ++ os-apply-config --key os_net_config --type raw --key-default ''
    Sep 19 12:14:22 compute-0 os-collect-config[3059]: + NET_CONFIG='{"network_config": [{"dns_servers": ["10.0.0.1"], "addresses": [{"ip_netmask": "192.168.24.20/24"}], "routes": [{"default": true, "ip_netmask": "0.0.0.0/0", "next_hop": "192.168.24.1"}, {"ip_netmask": "169.254.169.254/32", "next_hop": "192.168.24.1"}], "use_dhcp": false, "type": "interface", "name": "nic1"}, {"use_dhcp": false, "type": "ovs_bridge", "name": "br-isolated", "members": [{"type": "interface", "name": "nic2", "primary": true}, {"type": "vlan", "addresses": [{"ip_netmask": "172.17.1.14/24"}], "vlan_id": 20}, {"type": "vlan", "addresses": [{"ip_netmask": "172.17.3.10/24"}], "vlan_id": 30}, {"type": "vlan", "addresses": [{"ip_netmask": "172.17.2.17/24"}], "vlan_id": 50}]}, {"use_dhcp": false, "type": "interface", "name": "nic3"}]}'
    Sep 19 12:14:22 compute-0 os-collect-config[3059]: + '[' -n '{"network_config": [{"dns_servers": ["10.0.0.1"], "addresses": [{"ip_netmask": "192.168.24.20/24"}], "routes": [{"default": true, "ip_netmask": "0.0.0.0/0", "next_hop": "192.168.24.1"}, {"ip_netmask": "169.254.169.254/32", "next_hop": "192.168.24.1"}], "use_dhcp": false, "type": "interface", "name": "nic1"}, {"use_dhcp": false, "type": "ovs_bridge", "name": "br-isolated", "members": [{"type": "interface", "name": "nic2", "primary": true}, {"type": "vlan", "addresses": [{"ip_netmask": "172.17.1.14/24"}], "vlan_id": 20}, {"type": "vlan", "addresses": [{"ip_netmask": "172.17.3.10/24"}], "vlan_id": 30}, {"type": "vlan", "addresses": [{"ip_netmask": "172.17.2.17/24"}], "vlan_id": 50}]}, {"use_dhcp": false, "type": "interface", "name": "nic3"}]}' ']'
    Sep 19 12:14:22 compute-0 os-collect-config[3059]: + trap configure_safe_defaults EXIT
    Sep 19 12:14:22 compute-0 os-collect-config[3059]: + os-net-config -c /etc/os-net-config/config.json -v --detailed-exit-codes
    Sep 19 12:14:22 compute-0 os-collect-config[3059]: [2017/09/19 12:14:22 PM] [INFO] Using config file at: /etc/os-net-config/config.json
    Sep 19 12:14:22 compute-0 os-collect-config[3059]: [2017/09/19 12:14:22 PM] [INFO] Using mapping file at: /etc/os-net-config/mapping.yaml
    Sep 19 12:14:22 compute-0 os-collect-config[3059]: [2017/09/19 12:14:22 PM] [INFO] Ifcfg net config provider created.
    Sep 19 12:14:22 compute-0 os-collect-config[3059]: [2017/09/19 12:14:22 PM] [INFO] nic3 mapped to: eth2
    Sep 19 12:14:22 compute-0 os-collect-config[3059]: [2017/09/19 12:14:22 PM] [INFO] nic2 mapped to: eth1
    Sep 19 12:14:22 compute-0 os-collect-config[3059]: [2017/09/19 12:14:22 PM] [INFO] nic1 mapped to: eth0
    Sep 19 12:14:22 compute-0 os-collect-config[3059]: [2017/09/19 12:14:22 PM] [INFO] adding interface: eth0
    Sep 19 12:14:22 compute-0 os-collect-config[3059]: [2017/09/19 12:14:22 PM] [INFO] adding custom route for interface: eth0
    Sep 19 12:14:22 compute-0 os-collect-config[3059]: [2017/09/19 12:14:22 PM] [INFO] adding bridge: br-isolated
    Sep 19 12:14:22 compute-0 os-collect-config[3059]: [2017/09/19 12:14:22 PM] [INFO] adding interface: eth1
    Sep 19 12:14:22 compute-0 os-collect-config[3059]: [2017/09/19 12:14:22 PM] [INFO] adding vlan: vlan20
    Sep 19 12:14:22 compute-0 os-collect-config[3059]: [2017/09/19 12:14:22 PM] [INFO] adding vlan: vlan30
    Sep 19 12:14:22 compute-0 os-collect-config[3059]: [2017/09/19 12:14:22 PM] [INFO] adding vlan: vlan50
    Sep 19 12:14:22 compute-0 os-collect-config[3059]: [2017/09/19 12:14:22 PM] [INFO] adding interface: eth2
    Sep 19 12:14:22 compute-0 os-collect-config[3059]: [2017/09/19 12:14:22 PM] [INFO] applying network configs...
    Sep 19 12:14:22 compute-0 os-collect-config[3059]: [2017/09/19 12:14:22 PM] [INFO] No changes required for interface: eth2
    Sep 19 12:14:22 compute-0 os-collect-config[3059]: [2017/09/19 12:14:22 PM] [INFO] No changes required for interface: eth1
    Sep 19 12:14:22 compute-0 os-collect-config[3059]: [2017/09/19 12:14:22 PM] [INFO] No changes required for interface: eth0
    Sep 19 12:14:22 compute-0 os-collect-config[3059]: [2017/09/19 12:14:22 PM] [INFO] No changes required for vlan interface: vlan20
    Sep 19 12:14:22 compute-0 os-collect-config[3059]: [2017/09/19 12:14:22 PM] [INFO] No changes required for vlan interface: vlan30
    Sep 19 12:14:22 compute-0 os-collect-config[3059]: [2017/09/19 12:14:22 PM] [INFO] No changes required for vlan interface: vlan50
    Sep 19 12:14:22 compute-0 os-collect-config[3059]: [2017/09/19 12:14:22 PM] [INFO] running ifdown on interface: vlan20
    Sep 19 12:14:23 compute-0 os-collect-config[3059]: [2017/09/19 12:14:23 PM] [INFO] running ifdown on interface: vlan30
    Sep 19 12:14:23 compute-0 os-collect-config[3059]: [2017/09/19 12:14:23 PM] [INFO] running ifdown on interface: vlan50
    Sep 19 12:14:24 compute-0 os-collect-config[3059]: [2017/09/19 12:14:24 PM] [INFO] running ifdown on interface: eth1
    Sep 19 12:14:24 compute-0 os-collect-config[3059]: [2017/09/19 12:14:24 PM] [INFO] running ifdown on bridge: br-isolated
    Sep 19 12:14:25 compute-0 os-collect-config[3059]: [2017/09/19 12:14:25 PM] [INFO] Writing config /etc/sysconfig/network-scripts/route-br-isolated
    Sep 19 12:14:25 compute-0 os-collect-config[3059]: [2017/09/19 12:14:25 PM] [INFO] Writing config /etc/sysconfig/network-scripts/ifcfg-br-isolated
    Sep 19 12:14:25 compute-0 os-collect-config[3059]: [2017/09/19 12:14:25 PM] [INFO] Writing config /etc/sysconfig/network-scripts/route6-br-isolated
    Sep 19 12:14:25 compute-0 os-collect-config[3059]: [2017/09/19 12:14:25 PM] [INFO] running ifup on bridge: br-isolated
    Sep 19 12:14:25 compute-0 os-collect-config[3059]: [2017/09/19 12:14:25 PM] [INFO] running ifup on interface: vlan20
    Sep 19 12:14:30 compute-0 os-collect-config[3059]: [2017/09/19 12:14:30 PM] [INFO] running ifup on interface: vlan30
    Sep 19 12:14:35 compute-0 os-collect-config[3059]: [2017/09/19 12:14:35 PM] [INFO] running ifup on interface: vlan50
    Sep 19 12:14:40 compute-0 os-collect-config[3059]: [2017/09/19 12:14:40 PM] [INFO] running ifup on interface: eth1

When looking at the new ifcfg-br-isolated, it looks like this:

    # This file is autogenerated by os-net-config
    DEVICE=br-isolated
    ONBOOT=yes
    HOTPLUG=no
    NM_CONTROLLED=no
    PEERDNS=no
    DEVICETYPE=ovs
    TYPE=OVSBridge
    OVS_EXTRA="set bridge br-isolated other-config:hwaddr=52:54:00:2c:28:d6 -- set bridge br-isolated fail_mode=standalone -- del-controller br-isolated"

Notice the "-- del-controller br-isolated".
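The OVS_EXTRA value is a series of ovs-vsctl sub-commands separated by ' -- '; splitting it makes the offending del-controller step explicit (an illustration only, not how the initscripts parse it internally):

```python
# OVS_EXTRA value from the generated ifcfg-br-isolated above
ovs_extra = ('set bridge br-isolated other-config:hwaddr=52:54:00:2c:28:d6'
             ' -- set bridge br-isolated fail_mode=standalone'
             ' -- del-controller br-isolated')

# Each segment becomes one ovs-vsctl sub-command when ifup runs
for cmd in ovs_extra.split(' -- '):
    print(cmd)
```

The last two sub-commands are exactly what breaks the agent: fail_mode flips from secure to standalone and the OpenFlow controller is removed.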

This was likely added by
https://review.openstack.org/#/c/496707/2/os_net_config/objects.py.

We basically need to forward-port this workaround
https://review.openstack.org/#/c/471381/10/extraconfig/tasks/pacemaker_common_functions.sh
to make sure that os-net-config is updated and run before everything else is updated, especially the ovs agent.

Comment 15 Sofer Athlan-Guyot 2017-09-29 12:39:22 UTC
Hi,

we need further help from os-net-config/networking. Basically there is no right place to run the os-net-config upgrade, given that it restarts all the network interfaces.

If we do it while the cluster is up, so that we can restart the ovs-agent, the disruption induced by restarting all the interfaces is often enough to cause all sorts of problems for the pacemaker cluster (nodes become unavailable, rabbitmq becomes and stays unreachable, ...).

If we do it while every service is down, then the interface restart cleans up the OVS database and floating IP reachability is lost.

If we use --no-activate for os-net-config, it begs for a restart later and, maybe, more trouble down the road.

So how can we restart all interfaces while:
 1. avoiding major service disruption;
 2. keeping floating IPs reachable?

Adding DFG:DF as they own os-net-config, and DFG:Network for insight into what exactly needs to be up and what needs to be restarted for floating IPs to survive an interface restart.

Thanks,

For reference, these two other bugs are affected by this restart-all-network-interfaces behavior:
https://bugzilla.redhat.com/show_bug.cgi?id=1434621
https://bugzilla.redhat.com/show_bug.cgi?id=1496468

This will recur each time an os-net-config update massively changes the network interface definitions.

Comment 16 Jakub Libosvar 2017-10-02 16:05:03 UTC
Additional info to comment 13: It's worth mentioning that the neutron-openvswitch-agent process initializes provider bridges (those from bridge_mappings) with OpenFlow rules based on information obtained from neutron-server. Once a bridge is initialized, the agent won't touch it unless an update comes from neutron-server. If 'ifup' is called on such a bridge, it's put into standalone mode, all flows configured by neutron-openvswitch-agent are lost, and the bridge behaves like a normal MAC-learning switch.

Comment 17 Dan Sneddon 2017-10-03 14:24:00 UTC
Sofer,

What would the behavior be if we did not have br-isolated? The isolated networks don't need to be on a bridge. The only interfaces that need to be on a bridge are the interfaces that carry the Neutron tenant and external VLANs. Even the Tenant network does not need to be on a bridge.

We are recommending to installers that their isolated networks should not be on a bridge, in order to reduce the possibility of downtime if the agent can't reach the controller.

I'm just trying to understand which topologies are affected by this bug.


Comment 18 Sofer Athlan-Guyot 2017-10-03 15:51:31 UTC
So after a meeting we agreed that the current set of changes in os-net-config isn't needed for a live environment; they are mostly about safe reboot of the overcloud.

Given this we agreed that we should
 - upgrade os-net-config on its own (doesn't have any dependency, so it's safe)
 - run it with the --no-activate option. 

This will add the new parameters to the configuration but will prevent any unwanted reboot of the interface.

Thanks Jakub, Dan, Brent for the help.
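The agreed sequence can be sketched as data; a minimal, hypothetical helper (the yum invocation and the function name are assumptions; only the standalone package upgrade and the --no-activate flag come from this comment):

```python
def upgrade_commands(config="/etc/os-net-config/config.json"):
    """Command sequence for the agreed approach (hypothetical helper).

    Step 1: upgrade the os-net-config package on its own.
    Step 2: write the new configuration without bouncing any interfaces.
    """
    return [
        ["yum", "-y", "update", "os-net-config"],          # assumption: yum-based upgrade
        ["os-net-config", "-c", config, "--no-activate"],  # flag named in this comment
    ]
```

Running the second step with --no-activate writes the new ifcfg files (including the del-controller OVS_EXTRA) without executing ifdown/ifup, so the running bridges keep their controller until the eventual reboot.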

Comment 19 Sofer Athlan-Guyot 2017-10-03 16:16:37 UTC
As an additional comment: br-isolated was removed from the recommended network topology some time after the 11 release, meaning that we may have br-isolated networks in the wild, forcing this path to be supported until we provide a way to move away from it. This should certainly be a documentation effort, done manually and outside of the upgrade.

Comment 20 Sofer Athlan-Guyot 2017-10-09 13:57:17 UTC
Backport to stable/pike

Comment 25 errata-xmlrpc 2017-12-13 22:08:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462

