Bug 1342646 - osp-director-9: upgrade from 8.0-> 9.0 fails due to pcs cluster down on controllers (UPDATE_FAILED: ControllerPacemakerUpgradeDeployment_Step_1)
Summary: osp-director-9: upgrade from 8.0-> 9.0 fails due to pcs cluster down on contr...
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 9.0 (Mitaka)
Hardware: x86_64
OS: Linux
high
high
Target Milestone: ---
: ---
Assignee: Michele Baldessari
QA Contact: Arik Chernetsky
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-06-03 18:36 UTC by Omri Hochman
Modified: 2016-08-29 15:15 UTC (History)
9 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-29 15:15:33 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Messages from controller (7.49 MB, text/plain)
2016-06-03 18:41 UTC, Omri Hochman
no flags Details
keystone.log from controller (42.28 KB, text/plain)
2016-06-03 18:42 UTC, Omri Hochman
no flags Details

Description Omri Hochman 2016-06-03 18:36:17 UTC
osp-director-9: upgrade from 8.0-> 9.0 fails due to down pcs cluster on controllers (UPDATE_FAILED: ControllerPacemakerUpgradeDeployment_Step_1) 



Description: 
-------------
Upgrade attempt : 8.0 Async -> 9.0 puddle: 2016-06-02.1.  upgrade failed on ControllerPacemakerUpgradeDeployment_Step_1 , it seems that the pcs cluster was down on the controllers, when I tried to start the cluster manually I couldn't, because openstack-keystone service refuse to start ( see logs)     



Environment: 
-------------
openstack-puppet-modules-8.0.4-1.el7ost.noarch
openstack-tripleo-puppet-elements-2.0.0-2.el7ost.noarch
puppet-3.6.2-2.el7.noarch
instack-undercloud-4.0.0-2.el7ost.noarch
instack-0.0.8-3.el7ost.noarch
openstack-tripleo-heat-templates-2.0.0-8.el7ost.noarch
openstack-heat-engine-6.0.0-3.el7ost.noarch
openstack-tripleo-heat-templates-kilo-2.0.0-8.el7ost.noarch
python-heatclient-1.2.0-1.el7ost.noarch
openstack-heat-common-6.0.0-3.el7ost.noarch
openstack-heat-api-cfn-6.0.0-3.el7ost.noarch
openstack-heat-api-cloudwatch-6.0.0-3.el7ost.noarch
heat-cfntools-1.3.0-2.el7ost.noarch
openstack-heat-templates-0-0.8.20150605git.el7ost.noarch
openstack-tripleo-heat-templates-liberty-2.0.0-8.el7ost.noarch
openstack-heat-api-6.0.0-3.el7ost.noarch



After running the command : 
-----------------------------
openstack overcloud deploy  --templates /usr/share/openstack-tripleo-heat-templates -e   /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml  -e  /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml  -e  /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml  -e  /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /home/stack/network-environment.yaml -e  /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker.yaml

the console view:
------------------
2016-02-10 23:10:41 [ControllerDeployment]: SIGNAL_COMPLETE Unknown
2016-02-10 23:10:43 [NetworkDeployment]: SIGNAL_COMPLETE Unknown
2016-02-10 23:10:44 [0]: SIGNAL_COMPLETE Unknown
2016-02-10 23:10:46 [NetworkDeployment]: SIGNAL_COMPLETE Unknown
2016-02-10 23:10:46 [NetworkDeployment]: SIGNAL_COMPLETE Unknown
2016-02-10 23:10:47 [1]: SIGNAL_COMPLETE Unknown
2016-02-10 23:10:47 [2]: SIGNAL_COMPLETE Unknown
ERROR: Authentication failed: Authentication required
There was an error running Upgrade overcloud - step #3:upgrade. Exiting....


troubleshoot :
---------------
[stack@undercloud72 ~]$ openstack stack resource list -n5 overcloud | grep -v COMPLETE
+--------------------------------------------+-----------------------------------------------+---------------------------------------------------+--------------------+---------------------+-----------------------------------------------------------------------------------------------+
| resource_name                              | physical_resource_id                          | resource_type                                     | resource_status    | updated_time        | stack_name                                                                                    |
+--------------------------------------------+-----------------------------------------------+---------------------------------------------------+--------------------+---------------------+-----------------------------------------------------------------------------------------------+
| UpdateWorkflow                             | 5dfa85d2-27e6-404b-a9d9-cab976c9735f          | OS::TripleO::Tasks::UpdateWorkflow                | UPDATE_FAILED      | 2016-02-10T22:58:41 | overcloud                                                                                     |
| ControllerPacemakerUpgradeDeployment_Step1 | 305c60ab-fb0d-4b75-8333-e78d095fe012          | OS::Heat::SoftwareDeploymentGroup                 | CREATE_FAILED      | 2016-02-10T22:58:51 | overcloud-UpdateWorkflow-hoca4u7thveg                                                         |
| 1                                          | 81c45d3d-ce78-4a4b-8352-00b967a16ac9          | OS::Heat::SoftwareDeployment                      | CREATE_IN_PROGRESS | 2016-02-10T22:58:52 | overcloud-UpdateWorkflow-hoca4u7thveg-ControllerPacemakerUpgradeDeployment_Step1-cjtsjqiwxqyl |
| 0                                          | 01adb662-c21a-478e-b22d-128205676f9a          | OS::Heat::SoftwareDeployment                      | CREATE_IN_PROGRESS | 2016-02-10T22:58:52 | overcloud-UpdateWorkflow-hoca4u7thveg-ControllerPacemakerUpgradeDeployment_Step1-cjtsjqiwxqyl |
| 2                                          | 666147f7-e933-47ce-9868-127445043787          | OS::Heat::SoftwareDeployment                      | CREATE_IN_PROGRESS | 2016-02-10T22:58:52 | overcloud-UpdateWorkflow-hoca4u7thveg-ControllerPacemakerUpgradeDeployment_Step1-cjtsjqiwxqyl |
+--------------------------------------------+-----------------------------------------------+---------------------------------------------------+--------------------+---------------------+-----------------------------------------------------------------------------------------------+



pcs status on the controller:

[root@overcloud-controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Last updated: Fri Jun  3 18:34:34 2016		Last change: Fri Jun  3 17:30:30 2016 by root via crm_resource on overcloud-controller-0
Stack: corosync
Current DC: overcloud-controller-1 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with quorum
3 nodes and 112 resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Full list of resources:

 ip-10.19.184.210	(ocf::heartbeat:IPaddr2):	(target-role:Stopped) Stopped
 ip-192.168.200.10	(ocf::heartbeat:IPaddr2):	(target-role:Stopped) Stopped
 Clone Set: haproxy-clone [haproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 ip-192.168.0.6	(ocf::heartbeat:IPaddr2):	(target-role:Stopped) Stopped
 ip-10.19.105.10	(ocf::heartbeat:IPaddr2):	(target-role:Stopped) Stopped
 ip-10.19.104.11	(ocf::heartbeat:IPaddr2):	(target-role:Stopped) Stopped
 ip-10.19.104.10	(ocf::heartbeat:IPaddr2):	(target-role:Stopped) Stopped
 Master/Slave Set: redis-master [redis]
     Masters: [ overcloud-controller-1 ]
     Slaves: [ overcloud-controller-0 overcloud-controller-2 ]
 Master/Slave Set: galera-master [galera]
     Masters: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: mongod-clone [mongod]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: memcached-clone [memcached]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-scheduler-clone [openstack-nova-scheduler]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-l3-agent-clone [neutron-l3-agent]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-alarm-notifier-clone [openstack-ceilometer-alarm-notifier]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-engine-clone [openstack-heat-engine]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-api-clone [openstack-ceilometer-api]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-metadata-agent-clone [neutron-metadata-agent]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-ovs-cleanup-clone [neutron-ovs-cleanup]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-netns-cleanup-clone [neutron-netns-cleanup]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-api-clone [openstack-heat-api]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-cinder-scheduler-clone [openstack-cinder-scheduler]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-api-clone [openstack-nova-api]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-api-cloudwatch-clone [openstack-heat-api-cloudwatch]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-collector-clone [openstack-ceilometer-collector]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-keystone-clone [openstack-keystone]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-consoleauth-clone [openstack-nova-consoleauth]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-glance-registry-clone [openstack-glance-registry]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-notification-clone [openstack-ceilometer-notification]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-cinder-api-clone [openstack-cinder-api]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-dhcp-agent-clone [neutron-dhcp-agent]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-glance-api-clone [openstack-glance-api]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-openvswitch-agent-clone [neutron-openvswitch-agent]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-novncproxy-clone [openstack-nova-novncproxy]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: delay-clone [delay]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-server-clone [neutron-server]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: httpd-clone [httpd]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-central-clone [openstack-ceilometer-central]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-alarm-evaluator-clone [openstack-ceilometer-alarm-evaluator]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 openstack-cinder-volume	(systemd:openstack-cinder-volume):	Stopped
 Clone Set: openstack-nova-conductor-clone [openstack-nova-conductor]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Failed Actions:
* openstack-keystone_start_0 on overcloud-controller-0 'not running' (7): call=234, status=complete, exitreason='none',
    last-rc-change='Fri Jun  3 18:04:06 2016', queued=0ms, exec=2173ms
* httpd_start_0 on overcloud-controller-0 'not running' (7): call=211, status=complete, exitreason='none',
    last-rc-change='Fri Jun  3 18:02:32 2016', queued=0ms, exec=10541ms
* openstack-keystone_start_0 on overcloud-controller-1 'not running' (7): call=233, status=complete, exitreason='none',
    last-rc-change='Fri Jun  3 18:04:06 2016', queued=0ms, exec=2179ms
* httpd_start_0 on overcloud-controller-1 'not running' (7): call=209, status=complete, exitreason='none',
    last-rc-change='Fri Jun  3 18:02:32 2016', queued=0ms, exec=10495ms
* openstack-keystone_start_0 on overcloud-controller-2 'not running' (7): call=228, status=complete, exitreason='none',
    last-rc-change='Fri Jun  3 18:04:06 2016', queued=0ms, exec=2086ms
* httpd_start_0 on overcloud-controller-2 'not running' (7): call=209, status=complete, exitreason='none',
    last-rc-change='Fri Jun  3 18:02:32 2016', queued=0ms, exec=10511ms


PCSD Status:
  overcloud-controller-0: Online
  overcloud-controller-1: Online
  overcloud-controller-2: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled



keyston.log (after attempt to start the service manually )
----------------------
2016-06-03 18:04:08.571 29355 WARNING keystone.assignment.core [req-f3f23f35-da11-421b-88fd-f76f38208b8b - - - - -] Deprecated: Use of the identity driver config to automatically configure the same assignment driver has been deprecated, in the "O" release, the assignment driver will need to be expicitly configured if different than the default (SQL).
2016-06-03 18:04:08.591 29355 WARNING keystone.common.manager [req-f3f23f35-da11-421b-88fd-f76f38208b8b - - - - -] Deprecated: Direct import of driver 'keystone.catalog.backends.sql.Catalog' is deprecated as of Liberty in favor of its entrypoint from 'keystone.catalog' and may be removed in N.
2016-06-03 18:04:08.616 29355 WARNING oslo_log.versionutils [req-f3f23f35-da11-421b-88fd-f76f38208b8b - - - - -] Deprecated: keystone.contrib.federation.backends.sql.Federation is deprecated as of Mitaka in favor of sql and may be removed in O.
2016-06-03 18:04:08.621 29355 CRITICAL keystone [req-f3f23f35-da11-421b-88fd-f76f38208b8b - - - - -] ValueError: Empty module name
2016-06-03 18:04:08.621 29355 ERROR keystone Traceback (most recent call last):
2016-06-03 18:04:08.621 29355 ERROR keystone   File "/usr/bin/keystone-all", line 10, in <module>
2016-06-03 18:04:08.621 29355 ERROR keystone     sys.exit(main())
2016-06-03 18:04:08.621 29355 ERROR keystone   File "/usr/lib/python2.7/site-packages/keystone/cmd/all.py", line 38, in main
2016-06-03 18:04:08.621 29355 ERROR keystone     eventlet_server.run(possible_topdir)
2016-06-03 18:04:08.621 29355 ERROR keystone   File "/usr/lib/python2.7/site-packages/keystone/server/eventlet.py", line 155, in run
2016-06-03 18:04:08.621 29355 ERROR keystone     startup_application_fn=create_servers)
2016-06-03 18:04:08.621 29355 ERROR keystone   File "/usr/lib/python2.7/site-packages/keystone/server/common.py", line 49, in setup_backends
2016-06-03 18:04:08.621 29355 ERROR keystone     drivers = backends.load_backends()
2016-06-03 18:04:08.621 29355 ERROR keystone   File "/usr/lib/python2.7/site-packages/keystone/server/backends.py", line 62, in load_backends
2016-06-03 18:04:08.621 29355 ERROR keystone     shadow_users_api=identity.ShadowUsersManager(),
2016-06-03 18:04:08.621 29355 ERROR keystone   File "/usr/lib/python2.7/site-packages/keystone/common/dependency.py", line 96, in __wrapped_init__
2016-06-03 18:04:08.621 29355 ERROR keystone     init(self, *args, **kwargs)
2016-06-03 18:04:08.621 29355 ERROR keystone   File "/usr/lib/python2.7/site-packages/keystone/identity/core.py", line 1572, in __init__
2016-06-03 18:04:08.621 29355 ERROR keystone     super(ShadowUsersManager, self).__init__(CONF.shadow_users.driver)
2016-06-03 18:04:08.621 29355 ERROR keystone   File "/usr/lib/python2.7/site-packages/keystone/common/manager.py", line 183, in __init__
2016-06-03 18:04:08.621 29355 ERROR keystone     self.driver = load_driver(self.driver_namespace, driver_name)
2016-06-03 18:04:08.621 29355 ERROR keystone   File "/usr/lib/python2.7/site-packages/keystone/common/manager.py", line 80, in load_driver
2016-06-03 18:04:08.621 29355 ERROR keystone     driver = importutils.import_object(driver_name, *args)
2016-06-03 18:04:08.621 29355 ERROR keystone   File "/usr/lib/python2.7/site-packages/oslo_utils/importutils.py", line 44, in import_object
2016-06-03 18:04:08.621 29355 ERROR keystone     return import_class(import_str)(*args, **kwargs)
2016-06-03 18:04:08.621 29355 ERROR keystone   File "/usr/lib/python2.7/site-packages/oslo_utils/importutils.py", line 30, in import_class
2016-06-03 18:04:08.621 29355 ERROR keystone     __import__(mod_str)
2016-06-03 18:04:08.621 29355 ERROR keystone ValueError: Empty module name
2016-06-03 18:04:08.621 29355 ERROR keystone 
(END)

Comment 2 Omri Hochman 2016-06-03 18:41:50 UTC
Created attachment 1164621 [details]
Messages from controller

Comment 3 Omri Hochman 2016-06-03 18:42:20 UTC
Created attachment 1164622 [details]
keystone.log from controller

Comment 4 Marios Andreou 2016-06-30 11:44:01 UTC
Hi Omri, just had a poke at the logs as promised yesterday and since qe is blocking here. Thanks for the logs, especially /var/log/messages is helpful. I am not sure what the root is yet, but afaics it isn't the keystone error you paste in the description. 

From the logs I can see that the original upgrade was run at ~ Jun  3 01:12:54 - but i can also see when you tried to stop/start the cluster again manually at ~ Jun  3 13:29:59. During that second time you also may be hitting https://bugzilla.redhat.com/show_bug.cgi?id=1343905 as I see a trace like:

        Jun  3 13:29:47 localhost rabbitmq-cluster(rabbitmq)[30714]: INFO: RabbitMQ server is not running
        Jun  3 13:29:47 localhost lrmd[15185]:  notice: rabbitmq_stop_0:30714:stderr [ Error: {aborted,{no_exists,[rabbit_user, ]
        Jun  3 13:29:47 localhost lrmd[15185]:  notice: rabbitmq_stop_0:30714:stderr [                             [{{internal_user,'$1','_','_'}, ]
        Jun  3 13:29:47 localhost lrmd[15185]:  notice: rabbitmq_stop_0:30714:stderr [                               [{'/=','$1',<<"guest">>}], ]
        Jun  3 13:29:47 localhost lrmd[15185]:  notice: rabbitmq_stop_0:30714:stderr [                               ['$_']}]]}} ]
        Jun  3 13:29:47 localhost crmd[15188]:  notice: Operation rabbitmq_stop_0: ok (node=overcloud-controller-0, call=759, rc=0, cib-update=296, confirmed=true)
        Jun  3 13:29:57 localhost galera(galera)[32334]: INFO: MySQL is not running

I thought this was initially the root here but then I put the timestamps together. So, in any case, right now I think the mostly likely root is a problem stopping openstack-heat-engine, during the extraconfig/tasks/major_upgrade_controller_pacemaker_1.sh  ([1] for reference) and this concerns me as I haven't seen/noticed before (though let's first see if we can reproduce?). AFAICS:

Cluster being stopped correctly, line 41@[1]

        Jun  3 01:12:54 localhost systemd: Stopped Pacemaker High Availability Cluster Manager.
        Jun  3 01:12:55 localhost systemd: Stopping Corosync Cluster Engine...

The swift services being brought down, line 45@[1]
        
        Jun  3 01:12:56 localhost systemd: Stopped OpenStack Object Storage (swift) - Proxy Server

Packages being updated, line 58@[1]

        Jun  3 01:15:41 localhost yum[36749]: Updated: openstack-utils-2015.2-1.el7ost.noarch

So this confused me... looks like everything is being executed (not much left in [1] at this point apart from the crudinis). I spotted the heat-engine error, basically an error stopping heat-engine (which eventually does stop, the full trace here repeats a few times, but is failed):


        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.794 42827 ERROR heat.engine.service [-] Failed to stop engine service, Cannot switch to MAINLOOP from MAINLOOP

        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.833 42817 ERROR heat.engine.service [-] Failed to stop engine service, Cannot switch to MAINLOOP from MAINLOOP
        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.850 42794 ERROR oslo_db.sqlalchemy.exc_filters [req-823967b7-242e-4e4f-8c86-a6f441fd9728 - -] DB exception wrapped.
        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.850 42794 ERROR oslo_db.sqlalchemy.exc_filters Traceback (most recent call last):
        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.850 42794 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.850 42794 ERROR oslo_db.sqlalchemy.exc_filters     context)
        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.850 42794 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in do_execute
        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.850 42794 ERROR oslo_db.sqlalchemy.exc_filters     cursor.execute(statement, parameters)
        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.850 42794 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/cursors.py", line 146, in execute
        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.850 42794 ERROR oslo_db.sqlalchemy.exc_filters     result = self._query(query)
        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.850 42794 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/cursors.py", line 296, in _query
        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.850 42794 ERROR oslo_db.sqlalchemy.exc_filters     conn.query(q)
        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.850 42794 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 781, in query
        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.850 42794 ERROR oslo_db.sqlalchemy.exc_filters     self._affected_rows = self._read_query_result(unbuffered=unbuffered)
        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.850 42794 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 942, in _read_query_result
        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.850 42794 ERROR oslo_db.sqlalchemy.exc_filters     result.read()
        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.850 42794 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 1138, in read
        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.850 42794 ERROR oslo_db.sqlalchemy.exc_filters     first_packet = self.connection._read_packet()
        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.850 42794 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 895, in _read_packet
        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.850 42794 ERROR oslo_db.sqlalchemy.exc_filters     packet_header = self._read_bytes(4)
        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.850 42794 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 912, in _read_bytes
        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.850 42794 ERROR oslo_db.sqlalchemy.exc_filters     data = self._rfile.read(num_bytes)
        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.850 42794 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/_socketio.py", line 59, in readinto
        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.850 42794 ERROR oslo_db.sqlalchemy.exc_filters     return self._sock.recv_into(b)
        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.850 42794 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 346, in recv_into
        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.850 42794 ERROR oslo_db.sqlalchemy.exc_filters     timeout_exc=socket.timeout("timed out"))
        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.850 42794 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 201, in _trampoline
        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.850 42794 ERROR oslo_db.sqlalchemy.exc_filters     mark_as_closed=self._mark_as_closed)
        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.850 42794 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/eventlet/hubs/__init__.py", line 144, in trampoline
        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.850 42794 ERROR oslo_db.sqlalchemy.exc_filters     assert hub.greenlet is not current, 'do not call blocking functions from the mainloop'
        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.850 42794 ERROR oslo_db.sqlalchemy.exc_filters AssertionError: do not call blocking functions from the mainloop
        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.850 42794 ERROR oslo_db.sqlalchemy.exc_filters
        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.821 42827 ERROR oslo_db.sqlalchemy.exc_filters [req-8d154c6d-9552-4162-bea0-fab590b14e7d - -] DB exception wrapped.
        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.821 42827 ERROR oslo_db.sqlalchemy.exc_filters Traceback (most recent call last):
        Jun  3 01:10:26 localhost heat-engine: 2016-06-03 05:10:26.821 42827 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
        Jun  3 01:10:27 localhost heat-engine: 2016-06-03 05:10:27.067 42827 ERROR sqlalchemy.pool.QueuePool OperationalError: (2014, 'Command Out of Sync')
        Jun  3 01:10:27 localhost systemd: Stopped Openstack Heat Engine Service.
        Jun  3 01:10:27 localhost systemd: Unit openstack-heat-engine.service entered failed state.
        Jun  3 01:10:27 localhost systemd: openstack-heat-engine.service failed.
        Jun  3 01:10:27 localhost systemd: Stopping OpenStack Cinder Volume Server...


SO. Can you please see if you can reproduce this, particularly if we see the same heat-engine issue then it is something to investigate more. As mentioned there are/where other issues, like the rabbit not rejoining the cluster (which you did hit, and would likely have hit anyway had the upgrade proceeded here) that have since been fixed. 

thank, marios


[1] https://github.com/openstack/tripleo-heat-templates/blob/a47d580018625d1b6363228f4254ded317c2091e/extraconfig/tasks/major_upgrade_controller_pacemaker_1.sh

Comment 5 Marios Andreou 2016-06-30 11:49:51 UTC
sorry, one more thing, can you please include ALL commands used to this point so we can get a solid reproducer. I mean, the fail/bug here is about major_upgrade_controller_pacemaker_1.sh i.e. during the controller upgrade.

So was init, aodh migration, keystone migration etc completed correctly and can you please include all the cli for those.

Comment 6 Omri Hochman 2016-06-30 13:45:33 UTC
(In reply to marios from comment #4)
> Can you please see if you can reproduce this, particularly if we see the
> same heat-engine issue then it is something to investigate more. As
> mentioned there are/where other issues, like the rabbit not rejoining the
> cluster (which you did hit, and would likely have hit anyway had the upgrade
> proceeded here) that have since been fixed. 
> 
> thank, marios


Thanks Marios - I was trying to reproduce this issue and encountered this Bz : https://bugzilla.redhat.com/show_bug.cgi?id=1351338 

so it seems that we might had a progress, but now we are failing on a different step in the overcloud upgrade process  
(comment #7 on #1351338)



(In reply to marios from comment #5)
> sorry, one more thing, can you please include ALL commands used to this
> point so we can get a solid reproducer. I mean, the fail/bug here is about
> major_upgrade_controller_pacemaker_1.sh i.e. during the controller upgrade.
> 
> So was init, aodh migration, keystone migration etc completed correctly and
> can you please include all the cli for those.

Sure thing, we have automation script for it, and the steps that i took are documented in this etherpad : http://etherpad.corp.redhat.com/ospd-9-upgrade-ohochman


Note You need to log in before you can comment on or make changes to this bug.