rhel-osp-director: upgrade 7.3->8.0, following an update 7.2->7.3, times out during the "major-upgrade-pacemaker.yaml" step.

Steps to reproduce:
1. Deploy 7.2 HA with network isolation (deployment command: openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --ceph-storage-scale 1 --neutron-network-type vxlan --neutron-tunnel-types vxlan --ntp-server clock.redhat.com --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml)
2. Populate the setup.
3. Update to 7.3.
4. Add another instance to the setup.
5. Attempt to upgrade to 8.0.

Result:
The upgrade completed the major-upgrade-pacemaker-init.yaml step and failed after more than 4 hours during the major-upgrade-pacemaker.yaml step, with:

2016-05-12 22:55:07 [2]: SIGNAL_COMPLETE Unknown
2016-05-12 22:55:10 [2]: SIGNAL_COMPLETE Unknown
2016-05-12 22:55:10 [ControllerDeployment]: SIGNAL_COMPLETE Unknown
2016-05-12 22:55:11 [2]: SIGNAL_COMPLETE Unknown
2016-05-12 22:55:12 [NetworkDeployment]: SIGNAL_COMPLETE Unknown
There was an error running Upgrade overcloud - step #3:upgrade. Exiting....
ERROR: Authentication failed: Authentication required
ERROR: Authentication failed: Authentication required

Note: The cluster doesn't run on the controllers.
Environment:
openstack-tripleo-heat-templates-0.8.14-9.el7ost.noarch
instack-undercloud-2.2.7-4.el7ost.noarch
openstack-tripleo-heat-templates-kilo-0.8.14-9.el7ost.noarch
openstack-puppet-modules-7.0.17-1.el7ost.noarch
The initial investigation points at something related to pacemaker, possibly a bug in it, as it seems to behave strangely: one of the sub-commands is stuck in what looks like some form of busy waiting. The upgrade is stuck on controller-0 in a call to `pcs cluster stop --all`, which is still ongoing and blocks os-collect-config:

root      4223      1  0 May12 ?        00:00:46 /usr/bin/python /usr/bin/os-collect-config
root     14702   4223  0 May12 ?        00:00:00  \_ /usr/bin/python /usr/bin/os-refresh-config
root     15257  14702  0 May12 ?        00:00:00  \_ /bin/bash /usr/local/bin/dib-run-parts /usr/libexec/os-refresh-config/configure.d
root     20355  15257  0 May12 ?        00:00:00  \_ python /usr/libexec/os-refresh-config/configure.d/55-heat-config
root     20360  20355  0 May12 ?        00:00:00  \_ python /var/lib/heat-config/hooks/script
root     20361  20360  0 May12 ?        00:00:00  \_ /bin/bash /var/lib/heat-config/heat-config-script/2b8b02dc-c95c-4abc-83c9-10fa5b348335
root     25809  20361  0 May12 ?        00:05:49  \_ /usr/bin/python2 /usr/sbin/pcs cluster stop --all
root     26148  25809  0 May12 ?        00:00:00  \_ /usr/bin/ruby -I/usr/lib/pcsd/ /usr/lib/pcsd/pcsd-cli.rb read_tokens
root     26180  26148 99 May12 ?        3-15:49:04  \_ /usr/sbin/corosync-cmapctl totem.cluster_name

This further calls into:

corosync-cmapctl totem.cluster_name

...which seems stuck. Note the very high CPU usage of this process in the paste above. `top` also shows it is still consuming all the CPU it can get:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
26180 root      20   0   24440    840    672 R 100.0  0.0   5277:18 corosync-cmapct
19392 swift     20   0  401064  51200   2300 S   0.7  0.9   8:25.18 swift-proxy-ser
 1198 root      10 -10  271652  36164   9260 S   0.3  0.6   6:09.75 ovs-vswitchd
 8136 root      20   0  272556  41940   9568 S   0.3  0.7   6:59.05 ceph-mon
    1 root      20   0  191992   6776   3548 S   0.0  0.1   1:14.30 systemd
Actually I wonder if this might be a duplicate of bug 1330688.
The Ruby function get_cluster_name is being called, and from it a run with PCSAuth.getSuperuserSession, COROSYNC_CMAPCTL, "totem.cluster_name" is started. That should fail, since we shut down the cluster, and after that it should fall back to reading the cluster name from the config file. Instead it's looping in corosync-cmapctl in print_key, where it keeps getting CS_ERR_LIBRARY.
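To make the failure mode concrete, here is a minimal, self-contained sketch of a retry loop with the shape described above; it is not the real print_key() from corosync-cmapctl, and it assumes only the standard libcmap client API (cmap_initialize, cmap_get_string, cmap_finalize) from <corosync/cmap.h>. The point it illustrates: only CS_OK and CS_ERR_TRY_AGAIN are handled, so once corosync is gone and the library starts returning CS_ERR_LIBRARY, the loop neither sleeps nor exits and spins at 100% CPU.

/*
 * Illustrative sketch only -- not the actual corosync-cmapctl source.
 * Mimics the reported busy-wait: any error other than CS_ERR_TRY_AGAIN
 * (e.g. CS_ERR_LIBRARY after the cluster has been stopped) is neither
 * handled nor terminates the loop.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <corosync/cmap.h>

int main(void)
{
    cmap_handle_t handle;
    char *value = NULL;
    cs_error_t err;
    int end_loop = 0;

    err = cmap_initialize(&handle);
    if (err != CS_OK) {
        fprintf(stderr, "cmap_initialize failed: %d\n", (int)err);
        return EXIT_FAILURE;
    }

    do {
        err = cmap_get_string(handle, "totem.cluster_name", &value);
        if (err == CS_ERR_TRY_AGAIN) {
            sleep(1);            /* transient: corosync busy, wait and retry */
        } else if (err == CS_OK) {
            end_loop = 1;        /* got the value */
        }
        /* Any other error falls through and immediately retries -> busy wait. */
    } while (!end_loop);

    printf("totem.cluster_name = %s\n", value);
    free(value);
    cmap_finalize(handle);
    return EXIT_SUCCESS;
}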
(In reply to Jiri Stransky from comment #6)
> Actually I wonder if this might be a duplicate of bug 1330688.

I think the symptoms are a bit different, although the root cause might be the same. Let me loop in some corosync folks here.
Created attachment 1158240 [details]
Proposed patch

cmapctl: Handle corosync errors in print_key func

print_key handles only the CS_ERR_TRY_AGAIN error. If a different error is returned, print_key loops forever. The solution is to handle all errors.
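For reference, a hedged sketch of what the fix amounts to, as described in the patch summary: treat only CS_ERR_TRY_AGAIN as retryable (with a bounded number of attempts) and give up on any other error instead of looping. This is not the attached patch itself; the helper name get_key_with_retries and the retry limit are illustrative, and it again assumes the standard libcmap API from <corosync/cmap.h>.

/*
 * Sketch of the fixed retry pattern -- the real corosync change may differ
 * in details. Anything other than CS_OK or a bounded run of
 * CS_ERR_TRY_AGAIN retries terminates the loop with an error.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <corosync/cmap.h>

static cs_error_t get_key_with_retries(cmap_handle_t handle,
                                       const char *key, char **value)
{
    cs_error_t err;
    int retries = 0;

    do {
        err = cmap_get_string(handle, key, value);
        if (err != CS_ERR_TRY_AGAIN) {
            break;               /* CS_OK, CS_ERR_LIBRARY, ... all end the loop */
        }
        sleep(1);                /* transient error: wait and retry a few times */
    } while (++retries < 10);

    return err;
}

int main(void)
{
    cmap_handle_t handle;
    char *value = NULL;
    cs_error_t err;

    err = cmap_initialize(&handle);
    if (err != CS_OK) {
        fprintf(stderr, "cmap_initialize failed: %d\n", (int)err);
        return EXIT_FAILURE;
    }

    err = get_key_with_retries(handle, "totem.cluster_name", &value);
    if (err != CS_OK) {
        fprintf(stderr, "Can't get key totem.cluster_name: error %d\n", (int)err);
        cmap_finalize(handle);
        return EXIT_FAILURE;
    }

    printf("totem.cluster_name = %s\n", value);
    free(value);
    cmap_finalize(handle);
    return EXIT_SUCCESS;
}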
@Lukas: Thanks for the excellent description of the problem. It allowed me to create a patch in a few minutes.

Whitebox reproducer:
- Add "sleep(2)" into tools/corosync-cmapctl.c at line 190 (just before the "while (!end_loop) {" line).
- Start corosync.
- Run: while true; do ./corosync-cmapctl totem.cluster_name; done
- Stop corosync.

Watch cmapctl get stuck, eating 100% of one core.
Using the updated corosync RPMs, the upgrade passed the stage it used to fail in, and then failed after a 4-hour run during the major-upgrade-pacemaker-converge step with:

2016-05-19 03:28:25 [overcloud-CephStorageCephDeployment-yxak3hlzkhck]: UPDATE_COMPLETE Stack UPDATE completed successfully
2016-05-19 03:28:27 [ControllerCephDeployment]: UPDATE_COMPLETE state changed
2016-05-19 03:28:27 [CephStorageCephDeployment]: UPDATE_COMPLETE state changed
ERROR: Authentication failed: Authentication required

[stack@instack ~]$ heat resource-list -n5 overcloud|grep -v COMPLE
+------------------+--------------------------------------+------------------------------+-----------------+---------------------+-----------------------------------------------+
| resource_name    | physical_resource_id                 | resource_type                | resource_status | updated_time        | stack_name                                    |
+------------------+--------------------------------------+------------------------------+-----------------+---------------------+-----------------------------------------------+
| Compute          | 520b5036-48b2-4ff7-9e22-2c0d40d9d6f4 | OS::Heat::ResourceGroup      | UPDATE_FAILED   | 2016-05-19T03:24:56 | overcloud                                     |
| 0                | bbea6423-3e7f-41d2-b8d9-425f9c9bd088 | OS::TripleO::Compute         | UPDATE_FAILED   | 2016-05-19T03:25:05 | overcloud-Compute-jfxnp5zhnqu2                |
| UpdateDeployment | 2c3e65db-b54b-4c07-aa86-c1d77690747d | OS::Heat::SoftwareDeployment | CREATE_FAILED   | 2016-05-19T03:26:22 | overcloud-Compute-jfxnp5zhnqu2-0-gwvboa4eenug |
+------------------+--------------------------------------+------------------------------+-----------------+---------------------+-----------------------------------------------+

[stack@instack ~]$ heat deployment-show 2c3e65db-b54b-4c07-aa86-c1d77690747d
{
  "status": "IN_PROGRESS",
  "server_id": "0c6981c8-0ee7-4df2-be98-50d258992f01",
  "config_id": "573938c4-542d-4f49-a83a-4c825f6ff097",
  "output_values": null,
  "creation_time": "2016-05-19T03:26:29",
  "input_values": {},
  "action": "CREATE",
  "status_reason": "Deploy data available",
  "id": "2c3e65db-b54b-4c07-aa86-c1d77690747d"
}
The heat-engine.log (which is too big to attach) has this repeating block of errors:

2016-05-18 17:39:11.156 31851 ERROR oslo_service.service [-] Unhandled exception
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service Traceback (most recent call last):
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 377, in _child_wait_for_exit_or_signal
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service launcher.wait()
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 204, in wait
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service self.services.wait()
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 625, in wait
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service service.wait()
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 105, in wrapper
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service return f(*args, **kwargs)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 591, in wait
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service self._done.wait()
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/eventlet/event.py", line 121, in wait
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service return hubs.get_hub().switch()
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 294, in switch
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service return self.greenlet.switch()
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 346, in run
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service self.wait(sleep_time)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/eventlet/hubs/poll.py", line 85, in wait
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service presult = self.do_poll(seconds)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/eventlet/hubs/epolls.py", line 62, in do_poll
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service return self.poll.poll(seconds)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 160, in _handle_signals
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service handler(signo, frame)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 356, in _sigterm
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service self.launcher.stop()
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 196, in stop
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service self.services.stop()
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 612, in stop
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service service.stop()
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 105, in wrapper
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service return f(*args, **kwargs)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 408, in stop
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service service_objects.Service.delete(ctxt, self.service_id)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/heat/objects/service.py", line 78, in delete
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service db_api.service_delete(context, service_id, soft_delete)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/heat/db/api.py", line 378, in service_delete
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service return IMPL.service_delete(context, service_id, soft_delete)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/heat/db/sqlalchemy/api.py", line 1006, in service_delete
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service service = service_get(context, service_id)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/heat/db/sqlalchemy/api.py", line 1016, in service_get
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service result = model_query(context, models.Service).get(service_id)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 819, in get
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service return self._get_impl(ident, loading.load_on_ident)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 852, in _get_impl
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service return fallback_fn(self, key)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/loading.py", line 219, in load_on_ident
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service return q.one()
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2473, in one
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service ret = list(self)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/loading.py", line 86, in instances
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service util.raise_from_cause(err)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service reraise(type(exception), exception, tb=exc_tb)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/loading.py", line 67, in instances
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service fetch = cursor.fetchall()
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/result.py", line 968, in fetchall
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service self.cursor, self.context)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1337, in _handle_dbapi_exception
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service util.raise_from_cause(newraise, exc_info)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service reraise(type(exception), exception, tb=exc_tb)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/result.py", line 963, in fetchall
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service self._soft_close()
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/result.py", line 667, in _soft_close
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service self.connection.close()
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 826, in close
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service conn.close()
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 882, in close
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service self._checkin()
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 758, in _checkin
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service self._pool, None, self._echo, fairy=self)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 650, in _finalize_fairy
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service connection_record.checkin()
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 509, in checkin
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service pool.dispatch.checkin(connection, self)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/event/attr.py", line 256, in __call__
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service fn(*args, **kw)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/engines.py", line 50, in _thread_yield
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service time.sleep(0)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 31, in sleep
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service assert hub.greenlet is not current, 'do not call blocking functions from the mainloop'
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service DBError: do not call blocking functions from the mainloop
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service
Sasha, since this bug is very specific to corosync, it is best if we open a new one, so this one can take its course and end up in RHEL (and avoid the upgrade failing early on). Let's take this to a new BZ, as this issue is not pacemaker related (I just checked pcs status and it is 100% started and fine).
Filed BZ #1337624 for the issue reported in comment #13.
So the re-run of the test showed that the patch proposed by Michele actually worked. I was able to deploy 7.2, update to 7.3, and upgrade to 8.0.
Awesome, thanks everyone :)
Adding title to doc text for inclusion in 7.3 release notes.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-2463.html
closed, no need for needinfo.