Bug 1403080 - [rhel-osp-director-10] major-upgrade osp9 to osp10 fails during UPGRADE CONTROLLER AND BLOCKSTORAGE (neutron-ovs-cleanup.service failed)
Summary: [rhel-osp-director-10] major-upgrade osp9 to osp10 fails during UPGRADE CONTROLLER AND BLOCKSTORAGE (neutron-ovs-cleanup.service failed)
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 10.0 (Newton)
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ga
Target Release: 10.0 (Newton)
Assignee: Sofer Athlan-Guyot
QA Contact: Omri Hochman
URL:
Whiteboard:
Depends On:
Blocks: 1403945
 
Reported: 2016-12-09 02:24 UTC by Omri Hochman
Modified: 2016-12-29 16:56 UTC
CC: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1403945 (view as bug list)
Environment:
Last Closed: 2016-12-20 16:11:45 UTC
Target Upstream Version:
Embargoed:


Attachments
vswitch log. (69.13 KB, application/x-gzip)
2016-12-09 14:34 UTC, Sofer Athlan-Guyot
ovsdb server log. (353 bytes, application/x-gzip)
2016-12-09 14:35 UTC, Sofer Athlan-Guyot

Description Omri Hochman 2016-12-09 02:24:41 UTC
[rhel-osp-director-10] major-upgrade osp9 to osp10 fails during UPGRADE CONTROLLER AND BLOCKSTORAGE (neutron-ovs-cleanup.service failed) 


Environment: 
-------------
instack-undercloud-5.1.0-4.el7ost.noarch
instack-5.0.0-1.el7ost.noarch
openstack-tripleo-heat-templates-compat-2.0.0-41.el7ost.noarch
openstack-heat-templates-0-0.9.1e6015dgit.el7ost.noarch
openstack-heat-engine-7.0.0-7.el7ost.noarch
openstack-heat-api-cfn-7.0.0-7.el7ost.noarch
openstack-tripleo-heat-templates-5.1.0-7.el7ost.noarch
puppet-heat-9.4.1-1.el7ost.noarch
openstack-heat-common-7.0.0-7.el7ost.noarch
python-heat-tests-7.0.0-7.el7ost.noarch
heat-cfntools-1.3.0-2.el7ost.noarch
openstack-heat-api-7.0.0-7.el7ost.noarch
python-heat-agent-0-0.9.1e6015dgit.el7ost.noarch
python-heatclient-1.5.0-1.el7ost.noarch

[root@controller-0 ~]# rpm -qa | grep openv
---------------------------------------------
python-openvswitch-2.5.0-22.git20160727.el7fdp.noarch
openvswitch-2.5.0-22.git20160727.el7fdp.x86_64
openstack-neutron-openvswitch-9.1.0-8.el7ost.noarch



Steps:
------
Attempt to upgrade osp9 to osp10 

Results:
----------
upgrade osp9 to osp10 - UPDATE_FAILED

+--------------------------------------+------------+---------------+----------------------+----------------------+
| id                                   | stack_name | stack_status  | creation_time        | updated_time         |
+--------------------------------------+------------+---------------+----------------------+----------------------+
| 8012ded2-e534-4fc4-9712-26a5cf5657e1 | overcloud  | UPDATE_FAILED | 2016-12-08T17:21:03Z | 2016-12-08T18:24:03Z |
+--------------------------------------+------------+---------------+----------------------+----------------------+


[heat-admin@controller-0 ~]$ sudo su -
Last login: Thu Dec  8 21:25:06 UTC 2016 on pts/0
[root@controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-0 (version 1.1.15-11.el7_3.2-e174ec8) - partition WITHOUT quorum
Last updated: Thu Dec  8 21:36:52 2016          Last change: Thu Dec  8 18:48:58 2016 by root via crm_resource on controller-0

3 nodes and 19 resources configured

Online: [ controller-0 ]
OFFLINE: [ controller-1 controller-2 ]

Full list of resources:

 ip-172.17.1.10 (ocf::heartbeat:IPaddr2):       Stopped
 ip-192.0.2.6   (ocf::heartbeat:IPaddr2):       Stopped
 ip-172.17.4.10 (ocf::heartbeat:IPaddr2):       Stopped
 Clone Set: haproxy-clone [haproxy]
     Stopped: [ controller-0 controller-1 controller-2 ]
 Master/Slave Set: galera-master [galera]
     Stopped: [ controller-0 controller-1 controller-2 ]
 ip-172.17.3.10 (ocf::heartbeat:IPaddr2):       Stopped
 ip-10.0.0.101  (ocf::heartbeat:IPaddr2):       Stopped
 Clone Set: rabbitmq-clone [rabbitmq]
     Stopped: [ controller-0 controller-1 controller-2 ]
 Master/Slave Set: redis-master [redis]
     Stopped: [ controller-0 controller-1 controller-2 ]
 ip-172.17.1.11 (ocf::heartbeat:IPaddr2):       Stopped
 openstack-cinder-volume        (systemd:openstack-cinder-volume):      Stopped

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled


[stack@undercloud-0 ~]$ heat resource-list overcloud -n 5 | grep -v COMPLETE
WARNING (shell) "heat resource-list" is deprecated, please use "openstack stack resource list" instead
+--------------------------------------------+---------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------+-----------------+----------------------+----------------------------------------------------------------------------------------------------------------------+
| resource_name                              | physical_resource_id                                                            | resource_type                                                                                                       | resource_status | updated_time         | stack_name                                                                                                           |
+--------------------------------------------+---------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------+-----------------+----------------------+----------------------------------------------------------------------------------------------------------------------+
| UpdateWorkflow                             | ca4f39e5-ef26-4a5c-8341-4da6d0e27dc9                                            | OS::TripleO::Tasks::UpdateWorkflow                                                                                  | UPDATE_FAILED   | 2016-12-08T18:27:47Z | overcloud                                                                                                            |
| 0                                          | 19b2046b-81ef-4ed0-a591-6e2ee54961a9                                            | OS::Heat::SoftwareDeployment                                                                                        | CREATE_FAILED   | 2016-12-08T18:50:11Z | overcloud-UpdateWorkflow-kx4qtxuu7h5x-ControllerPacemakerUpgradeDeployment_Step6-ij3p2wxvddhg                        |
| 1                                          | a867b6eb-3ac1-4680-9b42-f5e3f95946d4                                            | OS::Heat::SoftwareDeployment                                                                                        | CREATE_FAILED   | 2016-12-08T18:50:11Z | overcloud-UpdateWorkflow-kx4qtxuu7h5x-ControllerPacemakerUpgradeDeployment_Step6-ij3p2wxvddhg                        |
| 2                                          | 81afcc28-c639-489b-86f3-b705dc371a86                                            | OS::Heat::SoftwareDeployment                                                                                        | CREATE_FAILED   | 2016-12-08T18:50:11Z | overcloud-UpdateWorkflow-kx4qtxuu7h5x-ControllerPacemakerUpgradeDeployment_Step6-ij3p2wxvddhg                        |
| ControllerPacemakerUpgradeDeployment_Step6 | 0349caa1-d92e-49df-ac66-18d0388752af                                            | OS::Heat::SoftwareDeploymentGroup                                                                                   | CREATE_FAILED   | 2016-12-08T18:50:11Z | overcloud-UpdateWorkflow-kx4qtxuu7h5x                                                                                |
+--------------------------------------------+---------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------+-----------------+----------------------+----------------------------------------------------------------------------------------------------------------------+
[stack@undercloud-0 ~]$ 



[stack@undercloud-0 ~]$ heat deployment-show 81afcc28-c639-489b-86f3-b705dc371a86
WARNING (shell) "heat deployment-show" is deprecated, please use "openstack software deployment show" instead
{
  "status": "FAILED", 
  "server_id": "3d3130d0-1222-4ba6-8bbb-ef00488b8965", 
  "config_id": "34116b17-ed3e-4f86-addb-17fde12f4e8e", 
  "output_values": {
    "deploy_stdout": "Thu Dec  8 18:50:53 UTC 2016 34116b17-ed3e-4f86-addb-17fde12f4e8e tripleo-upgrade controller-2 Going to systemctl start httpd\nThu Dec  8 18:50:53 UTC 2016 34116b17-ed3e-4f86-addb-17fde12f4e8e tripleo-upgrade controller-2 Going to check_resource_systemd for httpd to be started\nhttpd is started\nThu Dec  8 18:50:54 UTC 2016 34116b17-ed3e-4f86-addb-17fde12f4e8e tripleo-upgrade controller-2 Going to systemctl start memcached\nThu Dec  8 18:50:54 UTC 2016 34116b17-ed3e-4f86-addb-17fde12f4e8e tripleo-upgrade controller-2 Going to check_resource_systemd for memcached to be started\nmemcached is started\nThu Dec  8 18:50:55 UTC 2016 34116b17-ed3e-4f86-addb-17fde12f4e8e tripleo-upgrade controller-2 Going to systemctl start mongod\nThu Dec  8 18:50:55 UTC 2016 34116b17-ed3e-4f86-addb-17fde12f4e8e tripleo-upgrade controller-2 Going to check_resource_systemd for mongod to be started\nmongod is started\nThu Dec  8 18:50:56 UTC 2016 34116b17-ed3e-4f86-addb-17fde12f4e8e tripleo-upgrade controller-2 Going to systemctl start neutron-dhcp-agent\nThu Dec  8 18:50:56 UTC 2016 34116b17-ed3e-4f86-addb-17fde12f4e8e tripleo-upgrade controller-2 Going to check_resource_systemd for neutron-dhcp-agent to be started\nneutron-dhcp-agent is started\nThu Dec  8 18:50:57 UTC 2016 34116b17-ed3e-4f86-addb-17fde12f4e8e tripleo-upgrade controller-2 Going to systemctl start neutron-l3-agent\nThu Dec  8 18:50:57 UTC 2016 34116b17-ed3e-4f86-addb-17fde12f4e8e tripleo-upgrade controller-2 Going to check_resource_systemd for neutron-l3-agent to be started\nneutron-l3-agent is started\nThu Dec  8 18:50:58 UTC 2016 34116b17-ed3e-4f86-addb-17fde12f4e8e tripleo-upgrade controller-2 Going to systemctl start neutron-metadata-agent\nThu Dec  8 18:50:58 UTC 2016 34116b17-ed3e-4f86-addb-17fde12f4e8e tripleo-upgrade controller-2 Going to check_resource_systemd for neutron-metadata-agent to be started\nneutron-metadata-agent is started\nThu Dec  8 18:50:59 UTC 2016 34116b17-ed3e-4f86-addb-17fde12f4e8e tripleo-upgrade controller-2 Going to systemctl start neutron-netns-cleanup\nThu Dec  8 18:51:00 UTC 2016 34116b17-ed3e-4f86-addb-17fde12f4e8e tripleo-upgrade controller-2 Going to check_resource_systemd for neutron-netns-cleanup to be started\nneutron-netns-cleanup is started\nThu Dec  8 18:53:08 UTC 2016 34116b17-ed3e-4f86-addb-17fde12f4e8e tripleo-upgrade controller-2 Going to systemctl start neutron-openvswitch-agent\nThu Dec  8 18:53:08 UTC 2016 34116b17-ed3e-4f86-addb-17fde12f4e8e tripleo-upgrade controller-2 Going to check_resource_systemd for neutron-openvswitch-agent to be started\nneutron-openvswitch-agent is started\nThu Dec  8 18:53:12 UTC 2016 34116b17-ed3e-4f86-addb-17fde12f4e8e tripleo-upgrade controller-2 Going to systemctl start neutron-ovs-cleanup\n", 
    "deploy_stderr": "Job for neutron-ovs-cleanup.service failed because the control process exited with error code. See \"systemctl status neutron-ovs-cleanup.service\" and \"journalctl -xe\" for details.\n", 
    "deploy_status_code": 1
  }, 
  "creation_time": "2016-12-08T18:50:12Z", 
  "updated_time": "2016-12-08T18:53:16Z", 
  "input_values": {
    "update_identifier": "", 
    "deploy_identifier": "1481221429"
  }, 
  "action": "CREATE", 
  "status_reason": "deploy_status_code : Deployment exited with non-zero status code: 1", 
  "id": "81afcc28-c639-489b-86f3-b705dc371a86"


Checking the services on the controllers: 
-----------------------------------------
[root@controller-0 ~]# systemctl status neutron-server.service
● neutron-server.service - OpenStack Neutron Server
   Loaded: loaded (/usr/lib/systemd/system/neutron-server.service; disabled; vendor preset: disabled)
   Active: activating (start) since Thu 2016-12-08 21:38:00 UTC; 1min 9s ago
 Main PID: 26100 (neutron-server)
   CGroup: /system.slice/neutron-server.service
           └─26100 /usr/bin/python2 /usr/bin/neutron-server --config-file /usr/share/neutron/neutron-dist.conf --config-dir /usr/share/neutron/server --config-file /etc/neutron/neutron.conf --config-file /etc/n...

Dec 08 21:38:00 controller-0.localdomain systemd[1]: Starting OpenStack Neutron Server...
Dec 08 21:38:00 controller-0.localdomain neutron-server[26100]: Guru meditation now registers SIGUSR1 and SIGUSR2 by default for backward compatibility. SIGUSR1 will no longer be registered in a fut...ate reports.
Dec 08 21:38:00 controller-0.localdomain neutron-server[26100]: Option "verbose" from group "DEFAULT" is deprecated for removal.  Its value may be silently ignored in the future.
Dec 08 21:38:00 controller-0.localdomain neutron-server[26100]: Option "rpc_backend" from group "DEFAULT" is deprecated for removal.  Its value may be silently ignored in the future.
Dec 08 21:38:01 controller-0.localdomain neutron-server[26100]: Option "notification_driver" from group "DEFAULT" is deprecated. Use option "driver" from group "oslo_messaging_notifications".
Hint: Some lines were ellipsized, use -l to show in full.
-----------------------------------------------------------------------------
[root@controller-0 ~]# systemctl status neutron-ovs-cleanup.service -l
● neutron-ovs-cleanup.service - OpenStack Neutron Open vSwitch Cleanup Utility
   Loaded: loaded (/usr/lib/systemd/system/neutron-ovs-cleanup.service; disabled; vendor preset: disabled)
   Active: active (exited) since Thu 2016-12-08 18:51:00 UTC; 2h 56min ago
 Main PID: 10079 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/neutron-ovs-cleanup.service

Dec 08 18:50:59 controller-0.localdomain systemd[1]: Starting OpenStack Neutron Open vSwitch Cleanup Utility...
Dec 08 18:50:59 controller-0.localdomain neutron-ovs-cleanup[10079]: Guru meditation now registers SIGUSR1 and SIGUSR2 by default for backward compatibility. SIGUSR1 will no longer be registered in a future release, so please use SIGUSR2 to generate reports.
Dec 08 18:51:00 controller-0.localdomain neutron-ovs-cleanup[10079]: Option "verbose" from group "DEFAULT" is deprecated for removal.  Its value may be silently ignored in the future.
Dec 08 18:51:00 controller-0.localdomain systemd[1]: Started OpenStack Neutron Open vSwitch Cleanup Utility.



PCS status:
-------------
[heat-admin@controller-0 ~]$ sudo su -
Last login: Thu Dec  8 21:25:06 UTC 2016 on pts/0
[root@controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-0 (version 1.1.15-11.el7_3.2-e174ec8) - partition WITHOUT quorum
Last updated: Thu Dec  8 21:36:52 2016          Last change: Thu Dec  8 18:48:58 2016 by root via crm_resource on controller-0

3 nodes and 19 resources configured

Online: [ controller-0 ]
OFFLINE: [ controller-1 controller-2 ]

Full list of resources:

 ip-172.17.1.10 (ocf::heartbeat:IPaddr2):       Stopped
 ip-192.0.2.6   (ocf::heartbeat:IPaddr2):       Stopped
 ip-172.17.4.10 (ocf::heartbeat:IPaddr2):       Stopped
 Clone Set: haproxy-clone [haproxy]
     Stopped: [ controller-0 controller-1 controller-2 ]
 Master/Slave Set: galera-master [galera]
     Stopped: [ controller-0 controller-1 controller-2 ]
 ip-172.17.3.10 (ocf::heartbeat:IPaddr2):       Stopped
 ip-10.0.0.101  (ocf::heartbeat:IPaddr2):       Stopped
 Clone Set: rabbitmq-clone [rabbitmq]
     Stopped: [ controller-0 controller-1 controller-2 ]
 Master/Slave Set: redis-master [redis]
     Stopped: [ controller-0 controller-1 controller-2 ]
 ip-172.17.1.11 (ocf::heartbeat:IPaddr2):       Stopped
 openstack-cinder-volume        (systemd:openstack-cinder-volume):      Stopped

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled



heat-engine.log (on director) 
------------------------------
639-489b-86f3-b705dc371a86] Stack "overcloud-UpdateWorkflow-kx4qtxuu7h5x-ControllerPacemakerUpgradeDeployment_Step6-ij3p2wxvddhg" [0349caa1-d92e-49df-ac66-18d0388752af]
2016-12-08 13:53:17.330 23650 ERROR heat.engine.resource Traceback (most recent call last):
2016-12-08 13:53:17.330 23650 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 753, in _action_recorder
2016-12-08 13:53:17.330 23650 ERROR heat.engine.resource     yield
2016-12-08 13:53:17.330 23650 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 855, in _do_action
2016-12-08 13:53:17.330 23650 ERROR heat.engine.resource     yield self.action_handler_task(action, args=handler_args)
2016-12-08 13:53:17.330 23650 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 353, in wrapper
2016-12-08 13:53:17.330 23650 ERROR heat.engine.resource     step = next(subtask)
2016-12-08 13:53:17.330 23650 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 806, in action_handler_task
2016-12-08 13:53:17.330 23650 ERROR heat.engine.resource     done = check(handler_data)
2016-12-08 13:53:17.330 23650 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/openstack/heat/software_deployment.py", line 435, in check_create_complete
2016-12-08 13:53:17.330 23650 ERROR heat.engine.resource     return self._check_complete()
2016-12-08 13:53:17.330 23650 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/openstack/heat/software_deployment.py", line 301, in _check_complete
2016-12-08 13:53:17.330 23650 ERROR heat.engine.resource     raise exception.Error(message)
2016-12-08 13:53:17.330 23650 ERROR heat.engine.resource Error: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 1
2016-12-08 13:53:17.330 23650 ERROR heat.engine.resource 
2016-12-08 13:53:17.346 23650 DEBUG heat.engine.scheduler [req-75270ace-2119-436d-b350-e8eb2cde31f8 3f5e3fdfa6c649049ceb36df85a8e21a 57a76a543c3f4d97badd5f5c091084e4 - - -] Task create cancelled cancel /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:280

Comment 1 Omri Hochman 2016-12-09 07:10:49 UTC
The SOS reports should be available here:

http://rhos-release.virt.bos.redhat.com/log/bz1403080

Comment 2 Marios Andreou 2016-12-09 11:12:48 UTC
o/ spent a little time poking at the logs to start with. I see the error reported in comment #0 only happens on controller-2, which is weird (http://rhos-release.virt.bos.redhat.com/log/bz1403080/) - for example, from controller-1:

Dec  8 12:36:38 localhost crmd[7756]:  notice: Result of start operation for neutron-netns-cleanup on controller-1: 0 (ok)
Dec  8 12:36:39 localhost crmd[7756]:  notice: Result of start operation for neutron-ovs-cleanup on controller-1: 0 (ok)
Dec  8 12:36:40 localhost crmd[7756]:  notice: Result of start operation for openstack-heat-engine on controller-1: 0 (ok)

I think it may be worth checking with the neutron folks to see if there have been any changes to neutron-ovs-cleanup which may be causing this (anything landed recently?).

I don't think this is related to any of the openvswitch issues we have already dealt with, as alluded to in the email about this (the openvswitch 2.5 upgrade). I see openvswitch-2.5.0-22.git20160727.el7fdp.x86_64 in comment #0, and the assumption is you've already received that on the latest 9 before doing the upgrade to 10.

I guess we'll triage this on scrum later (added).

Comment 3 Mike Burns 2016-12-09 12:44:08 UTC
(In reply to marios from comment #2)
> o/ spent a little time poking at the logs to start with. I see the error
> reported in comment #0 only happens on controller-2 which is weird
> http://rhos-release.virt.bos.redhat.com/log/bz1403080/ - for example from
> controller1:
> 
> Dec  8 12:36:38 localhost crmd[7756]:  notice: Result of start operation for
> neutron-netns-cleanup on controller-1: 0 (ok)
> Dec  8 12:36:39 localhost crmd[7756]:  notice: Result of start operation for
> neutron-ovs-cleanup on controller-1: 0 (ok)
> Dec  8 12:36:40 localhost crmd[7756]:  notice: Result of start operation for
> openstack-heat-engine on controller-1: 0 (ok)
> 
> I think it may be worth checking with the neutron folks to see if there have
> been any changes to the neutron-ovs-cleanup which may be causing this?
> (anything landed recently?)

Assaf was digging into this yesterday with Omri from a neutron perspective.

https://review.openstack.org/#/c/405151/ is the only neutron change we picked up.

> 
> I don't think this is related to any of the openvswitch issues we have
> already dealt with as alluded to in the email about this (the openvswitch
> 2.5 upgrade... (I see the openvswitch-2.5.0-22.git20160727.el7fdp.x86_64 in
> comment #0 and the assumption is you've already received that on latest 9
> before doing the upgrade to 10)

At the moment, OSP 9 probably still has 2.5.0-14 for ovs, not -22, but we should let Omri confirm that.

> 
> I guess we'll triage this on scrum later (added).

Comment 4 Sofer Athlan-Guyot 2016-12-09 12:52:56 UTC
Looking at the environment, the errors are like this:

● dhcp-interface@br-ex.service       loaded    failed     failed          DHCP interface br-ex
● dhcp-interface@br-isolated.service loaded    failed     failed          DHCP interface br-isolated
● dhcp-interface@ovs-system.service  loaded    failed     failed          DHCP interface ovs-system
● haproxy.service                    loaded    failed     failed          HAProxy Load Balancer

Doing a status on one of these units gives us this:

sudo systemctl status dhcp-interface

Dec 08 17:24:20 controller-0.localdomain dhcp-all-interfaces.sh[9124]: cat: /sys/class/net/br/isolated/addr_assign_type: No such file or directory

The parameter br-isolated has been split in two to become
/br/isolated/, which doesn't exist.

Modifying /usr/lib/systemd/system/dhcp-interface\@.service like this
(notice that %I becomes %i)

diff -ruN ~/dhcp-interface\@.service /usr/lib/systemd/system/dhcp-interface\@.service
--- /home/heat-admin/dhcp-interface@.service    2016-12-09 12:26:54.017068606 +0000
+++ /usr/lib/systemd/system/dhcp-interface@.service     2016-12-09 12:27:10.827669175 +0000
@@ -1,14 +1,14 @@
 [Unit]
-Description=DHCP interface %I
+Description=DHCP interface %i
 After=network.service network.target

-ConditionPathExists=!/etc/sysconfig/network-scripts/ifcfg-%I
+ConditionPathExists=!/etc/sysconfig/network-scripts/ifcfg-%i

 [Service]
 Type=oneshot
 User=root
-ExecStartPre=/usr/local/sbin/dhcp-all-interfaces.sh %I
-ExecStart=/sbin/ifup %I
+ExecStartPre=/usr/local/sbin/dhcp-all-interfaces.sh %i
+ExecStart=/sbin/ifup %i
 RemainAfterExit=true

 [Install]

makes the script run properly.

Maybe this has already been changed in the latest puddle.
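
For clarity, here is a minimal illustration (not taken from this environment) of why %I mangles the instance name: systemd escapes "/" in instance names as "-", and %I expands to the unescaped form, while %i keeps the literal instance name.

$ systemd-escape --unescape 'br-isolated'
br/isolated
$ systemd-escape 'br/isolated'
br-isolated

# So for the instance dhcp-interface@br-isolated.service:
#   ExecStartPre=/usr/local/sbin/dhcp-all-interfaces.sh %i  -> argument "br-isolated"
#   ExecStartPre=/usr/local/sbin/dhcp-all-interfaces.sh %I  -> argument "br/isolated"
# which is why the script ends up probing /sys/class/net/br/isolated/...
# instead of /sys/class/net/br-isolated/...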

Comment 5 Marios Andreou 2016-12-09 13:11:25 UTC
(In reply to Mike Burns from comment #3)
> (In reply to marios from comment #2)
> > o/ spent a little time poking at the logs to start with. I see the error
> > reported in comment #0 only happens on controller-2 which is weird
> > http://rhos-release.virt.bos.redhat.com/log/bz1403080/ - for example from
> > controller1:
> > 
> > Dec  8 12:36:38 localhost crmd[7756]:  notice: Result of start operation for
> > neutron-netns-cleanup on controller-1: 0 (ok)
> > Dec  8 12:36:39 localhost crmd[7756]:  notice: Result of start operation for
> > neutron-ovs-cleanup on controller-1: 0 (ok)
> > Dec  8 12:36:40 localhost crmd[7756]:  notice: Result of start operation for
> > openstack-heat-engine on controller-1: 0 (ok)
> > 
> > I think it may be worth checking with the neutron folks to see if there have
> > been any changes to the neutron-ovs-cleanup which may be causing this?
> > (anything landed recently?)
> 
> Assaf was digging into this yesterday with Omri from a neutron perspective.
> 
> https://review.openstack.org/#/c/405151/ is the only neutron change we
> picked up.
> 
> > 
> > I don't think this is related to any of the openvswitch issues we have
> > already dealt with as alluded to in the email about this (the openvswitch
> > 2.5 upgrade... (I see the openvswitch-2.5.0-22.git20160727.el7fdp.x86_64 in
> > comment #0 and the assumption is you've already received that on latest 9
> > before doing the upgrade to 10)
> 
> At the moment, OSP 9 probably still has 2.5.0-14 for ovs, not -22, but we
> should let Omri confirm that.

thanks for the info @mburns... confirmed... at least I found from the logs the manual upgrade and it appears to have gone without issue, for example from controller-2 (where we are seeing the issue/error for this BZ):

Dec  8 13:44:25 localhost os-collect-config: Manual upgrade of openvswitch - restart in postun detected
Dec  8 13:44:25 localhost os-collect-config: /var/lib/heat-config/heat-config-script/OVS_UPGRADE /var/lib/heat-config/heat-config-script
Dec  8 13:44:25 localhost os-collect-config: Attempting to downloading latest openvswitch with yumdownloader
Dec  8 13:44:25 localhost os-collect-config: Loaded plugins: product-id
Dec  8 13:44:25 localhost os-collect-config: --> Running transaction check
Dec  8 13:44:25 localhost os-collect-config: ---> Package openvswitch.x86_64 0:2.5.0-22.git20160727.el7fdp will be installed
Dec  8 13:44:25 localhost os-collect-config: --> Finished Dependency Resolution
Dec  8 13:44:25 localhost os-collect-config: Updating openvswitch with nopostun option
Dec  8 13:44:25 localhost os-collect-config: /var/lib/heat-config/heat-config-script
...




> 
> > 
> > I guess we'll triage this on scrum later (added).

Comment 6 Sofer Athlan-Guyot 2016-12-09 13:25:50 UTC
I'll add to my own comment.  This code comes from openstack/diskimage-builder and this part hasn't been updated since 2014, so it's unlikely that it's the root cause.

Comment 7 Ihar Hrachyshka 2016-12-09 14:07:39 UTC
I am looking at controller-2 sos report.

1. In /var/log/messages, I see the following:

Dec  8 13:53:08 localhost systemd: Starting Open vSwitch...
Dec  8 13:53:08 localhost systemd: Starting Open vSwitch Database Unit...
Dec  8 13:53:08 localhost systemd: Started Open vSwitch.
Dec  8 13:53:08 localhost systemd: Starting OpenStack Neutron Open vSwitch Agent...
Dec  8 13:53:08 localhost neutron-enable-bridge-firewall.sh: net.bridge.bridge-nf-call-arptables = 1
Dec  8 13:53:08 localhost neutron-enable-bridge-firewall.sh: net.bridge.bridge-nf-call-iptables = 1
Dec  8 13:53:08 localhost neutron-enable-bridge-firewall.sh: net.bridge.bridge-nf-call-ip6tables = 1
Dec  8 13:53:08 localhost systemd: Started OpenStack Neutron Open vSwitch Agent.
Dec  8 13:53:08 localhost ovs-ctl: Starting ovsdb-server [  OK  ]
Dec  8 13:53:08 localhost ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait -- init -- set Open_vSwitch . db-version=7.12.1
Dec  8 13:53:08 localhost ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait set Open_vSwitch . ovs-version=2.5.0 "external-ids:system-id=\"ad902052-7ec6-4c54-831c-42cdda0779c9\"" "syst
em-type=\"RedHatEnterpriseServer\"" "system-version=\"7.3-Maipo\""
Dec  8 13:53:08 localhost ovs-ctl: Configuring Open vSwitch system IDs [  OK  ]
Dec  8 13:53:09 localhost systemd: Started Open vSwitch Database Unit.
Dec  8 13:53:09 localhost systemd: Starting Open vSwitch Forwarding Unit...
Dec  8 13:53:09 localhost ovs-vswitchd: ovs|00006|bridge|ERR|another ovs-vswitchd process is running, disabling this process (pid 6224) until it goes away
Dec  8 13:53:09 localhost neutron-openvswitch-agent: Guru meditation now registers SIGUSR1 and SIGUSR2 by default for backward compatibility. SIGUSR1 will no longer be registered in a future relea
se, so please use SIGUSR2 to generate reports.
Dec  8 13:53:09 localhost neutron-openvswitch-agent: Option "verbose" from group "DEFAULT" is deprecated for removal.  Its value may be silently ignored in the future.
Dec  8 13:53:09 localhost neutron-openvswitch-agent: Option "rpc_backend" from group "DEFAULT" is deprecated for removal.  Its value may be silently ignored in the future.
Dec  8 13:53:09 localhost neutron-openvswitch-agent: Option "notification_driver" from group "DEFAULT" is deprecated. Use option "driver" from group "oslo_messaging_notifications".
Dec  8 13:53:10 localhost ovs-vsctl: ovs|00001|vsctl|INFO|Called as /bin/ovs-vsctl set-manager ptcp:6640:127.0.0.1
Dec  8 13:53:12 localhost systemd: Starting OpenStack Neutron Open vSwitch Cleanup Utility...
Dec  8 13:53:12 localhost neutron-ovs-cleanup: Guru meditation now registers SIGUSR1 and SIGUSR2 by default for backward compatibility. SIGUSR1 will no longer be registered in a future release, so please use SIGUSR2 to generate reports.
Dec  8 13:53:13 localhost neutron-ovs-cleanup: Option "verbose" from group "DEFAULT" is deprecated for removal.  Its value may be silently ignored in the future.
Dec  8 13:53:13 localhost ovs-vsctl: ovs|00001|vsctl|INFO|Called as /bin/ovs-vsctl set-manager ptcp:6640:127.0.0.1
Dec  8 13:53:14 localhost systemd: neutron-ovs-cleanup.service: main process exited, code=exited, status=1/FAILURE
Dec  8 13:53:14 localhost systemd: Failed to start OpenStack Neutron Open vSwitch Cleanup Utility.
Dec  8 13:53:14 localhost systemd: Unit neutron-ovs-cleanup.service entered failed state.
Dec  8 13:53:14 localhost systemd: neutron-ovs-cleanup.service failed.

Note several things in the snippet:

- ovs agent is started @ 13:53:08;
- ovsdb unit is started LATER @ 13:53:09 (do we need to make ovs agent depend on the unit to make sure ovsdb is available when the agent is started? I can't find a relevant service in systemctl output, only 'openvswitch' that we already depend on);
- ovs agent fails to connect to 127.0.0.1:6640 to fetch ovsdb schema and hence requests a new local ovsdb manager @ 13:53:10;
- later, we start neutron-ovs-cleanup @ 13:53:12;
- neutron-ovs-cleanup also attempts to set the manager @ 13:53:13, and then fails.

2. If we look into the ovs agent logs, we see that it also failed, even before we started ovs-cleanup, so the -cleanup service cannot be the reason for the agent breakage:

2016-12-08 18:53:11.382 6080 ERROR ryu.lib.hub [-] hub: uncaught exception: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ryu/lib/hub.py", line 54, in _launch
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ovs_ryuapp.py", line 37, in agent_main_wrapper
    ovs_agent.main(bridge_classes)
  File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 2170, in main
    agent = OVSNeutronAgent(bridge_classes, cfg.CONF)
  File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 139, in __init__
    self.ovs = ovs_lib.BaseOVS()
  File "/usr/lib/python2.7/site-packages/neutron/agent/common/ovs_lib.py", line 107, in __init__
    self.ovsdb = ovsdb.API.get(self)
  File "/usr/lib/python2.7/site-packages/neutron/agent/ovsdb/api.py", line 89, in get
    return iface(context)
  File "/usr/lib/python2.7/site-packages/neutron/agent/ovsdb/impl_idl.py", line 196, in __init__
    OvsdbIdl.ovsdb_connection.start()
  File "/usr/lib/python2.7/site-packages/neutron/agent/ovsdb/native/connection.py", line 90, in start
    helper = do_get_schema_helper()
  File "/usr/lib/python2.7/site-packages/retrying.py", line 68, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/usr/lib/python2.7/site-packages/retrying.py", line 229, in call
    raise attempt.get()
  File "/usr/lib/python2.7/site-packages/retrying.py", line 261, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/lib/python2.7/site-packages/retrying.py", line 217, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/usr/lib/python2.7/site-packages/neutron/agent/ovsdb/native/connection.py", line 89, in do_get_schema_helper
    self.schema_name)
  File "/usr/lib/python2.7/site-packages/neutron/agent/ovsdb/native/idlutils.py", line 109, in get_schema_helper
    'err': os.strerror(err)})
Exception: Could not retrieve schema from tcp:127.0.0.1:6640: Connection refused

Observe that the agent crashes @ 18:53:11, ~1s after it set the :6640 manager for ovsdb. This corresponds to the fact that in the 'native' implementation, we retry for a second until finally bailing out:

https://github.com/openstack/neutron/blob/stable/newton/neutron/agent/ovsdb/native/connection.py#L84

The retry mechanism is added there because there is a potential race between the set-manager command and ovsdb actually opening the needed port. It could be that in our case it takes ovsdb longer to react to the set-manager request, so the 1s retry window does not help, or ovsdb just fails to process the request. Were there any changes in the OVS component lately?
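
A quick way to check that race by hand (a diagnostic sketch using standard tools, not something taken from the sos reports) is to set the manager and poll the port for about a second, mirroring the agent's retry window:

sudo ovs-vsctl set-manager ptcp:6640:127.0.0.1
sudo ovs-vsctl get-manager
# Poll 127.0.0.1:6640 for ~1s; if it never opens, the agent's retry window is
# too short (or ovsdb never processed the set-manager request at all).
for i in $(seq 1 10); do
    if timeout 1 bash -c 'exec 3<>/dev/tcp/127.0.0.1/6640' 2>/dev/null; then
        echo "ovsdb manager port 6640 is open"
        break
    fi
    sleep 0.1
done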

I was thinking it could be selinux denying the port open, but I don't see relevant errors in audit.log.

So the agent set the manager, then waited a second for the schema to become available, then failed; BEFORE ovs-cleanup started. I suspect ovs-cleanup failed for a similar reason.

(It's also interesting that we already assumed the agent started successfully, while in reality it failed. That's probably an issue with our success signaling. I believe we could make the agent notify on success only after the ovsdb schema is actually fetched.)

Sadly, the ovs logs are empty, so it's hard to say if it ever tried to fulfil the request to add a new management port for the native interface. We could maybe enable debug logs for OVS to see some details.
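
For reference, OVS log levels can be raised at runtime with standard tooling (a sketch, assuming default socket and log locations):

# Raise ovs-vswitchd and ovsdb-server logging to debug in their log files:
sudo ovs-appctl vlog/set ANY:file:dbg
sudo ovs-appctl -t ovsdb-server vlog/set ANY:file:dbg
# Then watch /var/log/openvswitch/ovsdb-server.log around the set-manager call.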

3. The fact that we start ovs-cleanup as part of the upgrade process while the ovs agent is running is concerning. The tool is destructive and is meant to be executed on a node fencing event, not e.g. during an upgrade. The tool basically removes all switch ports from OVS, making the data plane completely broken. The user is then lucky if their OVS agent is later capable of restoring connectivity. In the meantime, I expect downtime at the L2 layer.

I was told that we start -cleanup as part of the migration from pacemaker to systemd: https://github.com/openstack/tripleo-heat-templates/blob/master/extraconfig/tasks/major_upgrade_pacemaker_migrations.sh#L64 I don't know the details, but I am sure that triggering the tool in the middle of an upgrade is not correct. I was also told it's easy to remove the service from the upgrade process; but we may need to talk to the HA folks to understand why we may want to start the service. I don't want us to skip something potentially needed for fencing and such.

===

I have three main points to make:

- the OVS agent fails because OVSDB is allegedly slow to process the set-manager request, or just fails to do it. We may consider enabling debug logs for OVS to see what happened;

- the OVS agent should probably wait until OVSDB is up, but it's not clear how to achieve that (one possible systemd ordering drop-in is sketched below);

- starting neutron-ovs-cleanup is probably not what we want as part of the upgrade process. Same for netns-cleanup. HA expertise is needed to understand how to migrate those tools into the tripleo workflow.
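
Regarding the second point, a minimal systemd drop-in sketch (assuming the "Open vSwitch Database Unit" seen in the log is ovsdb-server.service on these nodes; this is an illustration, not a confirmed fix) could order the agent after ovsdb:

sudo mkdir -p /etc/systemd/system/neutron-openvswitch-agent.service.d
sudo tee /etc/systemd/system/neutron-openvswitch-agent.service.d/ovsdb-ordering.conf <<'EOF'
[Unit]
# Hypothetical ordering: wait for the OVS database unit before starting the agent.
After=ovsdb-server.service
Wants=ovsdb-server.service
EOF
sudo systemctl daemon-reload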

Comment 8 Sofer Athlan-Guyot 2016-12-09 14:34:46 UTC
Created attachment 1229993 [details]
vswitch log.

Comment 9 Sofer Athlan-Guyot 2016-12-09 14:35:35 UTC
Created attachment 1229994 [details]
ovsdb server log.

Adding the archived logs as they are not inside the sos report.

Comment 10 Omri Hochman 2016-12-09 16:35:09 UTC
Going to try to use: http://rhos-release.virt.bos.redhat.com/repos/rhos-release/rhos-release-1.1.9-1.noarch.rpm

which will provide an older openvswitch-2.5.0-14.git20160727.el7fdp.x86_64
instead of openvswitch-2.5.0-22.git20160727.el7fdp.x86_64

Comment 11 Omri Hochman 2016-12-10 03:07:06 UTC
(In reply to Omri Hochman from comment #10)
> going to try to use :
> http://rhos-release.virt.bos.redhat.com/repos/rhos-release/rhos-release-1.1.
> 9-1.noarch.rpm
> 
> which will provide the an older
> openvswitch-2.5.0-14.git20160727.el7fdp.x86_64
> instead of openvswitch-2.5.0-22.git20160727.el7fdp.x86_64

The upgrade completed successfully using rhos-release-1.1.9-1.noarch.rpm, which installed the older openvswitch-2.5.0-14.git20160727.el7fdp.x86_64 on the overcloud controllers.
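
For completeness, a hedged sketch of how one could confirm and pin the working build on a controller (the rhos-release repo setup itself is environment-specific and omitted here):

rpm -q openvswitch
# Downgrade to the known-good build if -22 is installed:
sudo yum downgrade -y openvswitch-2.5.0-14.git20160727.el7fdp
# Optionally hold it there (requires the versionlock plugin):
sudo yum install -y yum-plugin-versionlock
sudo yum versionlock add openvswitch-2.5.0-14.git20160727.el7fdp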

Comment 12 Sofer Athlan-Guyot 2016-12-12 09:55:48 UTC
I confirm that using openvswitch-2.5.0-14.git20160727.el7fdp.x86_64 solves the problem.

Comment 14 Sofer Athlan-Guyot 2016-12-12 16:14:46 UTC
Re-assigned to the right DFG.

Comment 15 Sofer Athlan-Guyot 2016-12-12 16:28:46 UTC
OK, so we keep this bug as is.  It is "solved" by pinning openvswitch to -14.  The root cause of the problem with openvswitch -22 will be investigated in another bugzilla, where this bug will be used as a reference.

Comment 16 Ihar Hrachyshka 2016-12-12 17:28:38 UTC
FYI I reported an OVS bug at: https://bugzilla.redhat.com/show_bug.cgi?id=1403958

Comment 17 Sofer Athlan-Guyot 2016-12-13 18:20:52 UTC
Tested the -23 package from that bz in an OSP 9 to OSP 10 upgrade and it worked.  Unfortunately it appears that it will never be in brew, let alone in a puddle (https://bugzilla.redhat.com/show_bug.cgi?id=1403958#c14).

The proper solution is still being discussed in the aforementioned bz.

As for the OSP-10 release, I'm putting this bug on qa, as OSP-10 is going to ship the -14 version, which works.

The root of the problem will be tracked in the other bz. It will block any

Comment 18 Ihar Hrachyshka 2016-12-14 14:31:28 UTC
I think the resolution to switch to ON_QA is not ideal. I think we would be better off removing the bug from the blockers list for GA. We still need a bug to track the integration issue with the new ovs package.

Comment 19 Omri Hochman 2016-12-15 16:01:24 UTC
Verified with the 10 GA puddle. The overcloud node images provide:

[root@overcloud-controller-0 ~]# rpm -qa | grep openvswitch
openvswitch-2.5.0-14.git20160727.el7fdp.x86_64

which resolves the issue.

