Bug 1336462
Summary: | corosync: Handle corosync errors in print_key func (was rhel-osp-director: rhel-osp-director: upgrade 7.3->8.0, that follows update 7.2->7.3, times out during "major-upgrade-pacemaker.yaml" step.) | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Alexander Chuzhoy <sasha> | ||||
Component: | corosync | Assignee: | Jan Friesse <jfriesse> | ||||
Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> | ||||
Severity: | urgent | Docs Contact: | Steven J. Levine <slevine> | ||||
Priority: | urgent | ||||||
Version: | 7.2 | CC: | augol, ccaulfie, cfeist, cluster-maint, dbecker, djansa, fdinitto, jcoufal, jkortus, jruemker, jstransk, lbezdick, mburns, mdolezel, michele, mnavrati, morazi, rhel-osp-director-maint, rscarazz, sasha | ||||
Target Milestone: | pre-dev-freeze | Keywords: | ZStream | ||||
Target Release: | 7.2 | ||||||
Hardware: | Unspecified | ||||||
OS: | Unspecified | ||||||
Whiteboard: | |||||||
Fixed In Version: | corosync-2.3.6-1.el7 | Doc Type: | Bug Fix | ||||
Doc Text: |
The `corosync-cmapctl` utility correctly handles errors in the `print_key()` function
Previously, the `corosync-cmapctl` utility did not handle corosync errors in the `print_key()` function correctly. Consequently, `corosync-cmapctl` could enter an infinite loop if the `corosync` utility was killed. The provided fix makes sure all errors returned when Corosync exits are handled correctly. As a result, `corosync-cmapctl` leaves the loop and displays a relevant error message in this scenario.
|
Story Points: | --- | ||||
Clone Of: | |||||||
: | 1338667 (view as bug list) | Environment: | |||||
Last Closed: | 2016-11-04 06:49:49 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | |||||||
Bug Blocks: | 1338667 | ||||||
Attachments: |
|
Description
Alexander Chuzhoy
2016-05-16 14:29:39 UTC
Environment: openstack-tripleo-heat-templates-0.8.14-9.el7ost.noarch instack-undercloud-2.2.7-4.el7ost.noarch openstack-tripleo-heat-templates-kilo-0.8.14-9.el7ost.noarch openstack-puppet-modules-7.0.17-1.el7ost.noarch The initial investigation points at something related to pacemaker, could be a bug in it perhaps, as it seems to behave strange, one of the sub-commands is stuck in what seems like some form of busy waiting. The upgrade is stuck on controller-0 in a call to `pcs cluster stop --all`, which is still ongoing and blocks os-collect-config: root 4223 1 0 May12 ? 00:00:46 /usr/bin/python /usr/bin/os-collect-config root 14702 4223 0 May12 ? 00:00:00 \_ /usr/bin/python /usr/bin/os-refresh-config root 15257 14702 0 May12 ? 00:00:00 \_ /bin/bash /usr/local/bin/dib-run-parts /usr/libexec/os-refresh-config/configure.d root 20355 15257 0 May12 ? 00:00:00 \_ python /usr/libexec/os-refresh-config/configure.d/55-heat-config root 20360 20355 0 May12 ? 00:00:00 \_ python /var/lib/heat-config/hooks/script root 20361 20360 0 May12 ? 00:00:00 \_ /bin/bash /var/lib/heat-config/heat-config-script/2b8b02dc-c95c-4abc-83c9-10fa5b348335 root 25809 20361 0 May12 ? 00:05:49 \_ /usr/bin/python2 /usr/sbin/pcs cluster stop --all root 26148 25809 0 May12 ? 00:00:00 \_ /usr/bin/ruby -I/usr/lib/pcsd/ /usr/lib/pcsd/pcsd-cli.rb read_tokens root 26180 26148 99 May12 ? 3-15:49:04 \_ /usr/sbin/corosync-cmapctl totem.cluster_name This further calls into: corosync-cmapctl totem.cluster_name ...which seems stuck. Note the very high CPU usage of this process in the paste above. Also `top` shows it's still consuming all CPU it can get: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 26180 root 20 0 24440 840 672 R 100.0 0.0 5277:18 corosync-cmapct 19392 swift 20 0 401064 51200 2300 S 0.7 0.9 8:25.18 swift-proxy-ser 1198 root 10 -10 271652 36164 9260 S 0.3 0.6 6:09.75 ovs-vswitchd 8136 root 20 0 272556 41940 9568 S 0.3 0.7 6:59.05 ceph-mon 1 root 20 0 191992 6776 3548 S 0.0 0.1 1:14.30 systemd Actually i wonder if this might be a duplicate of bug 1330688. The ruby function get_cluster_name is being called from which PCSAuth.getSuperuserSession, COROSYNC_CMAPCTL, "totem.cluster_name" is started. That should fail as we shut down the cluster and after that it should read cluster name from config. Instead it's looping in corosync-cmapctl in print_key where it's getting CS_ERR_LIBRARY. (In reply to Jiri Stransky from comment #6) > Actually i wonder if this might be a duplicate of bug 1330688. I think the symptoms are a bit different, although the root cause might be the same. Let me loop in some corosync folks here Created attachment 1158240 [details]
Proposed patch
cmapctl: Handle corosync errors in print_key func
print_key handles only CS_ERR_TRY_AGAIN error. If different error is
returned, print_key loops forewer.
Solution is to handle all errors.
@Lukas: Thanks for excellent description of problem. It allowed me to create patch in few minutes. Whitebox reproducer: - Add "sleep(2)" into tools/corosync-cmapctl.c line 190 (just before "while (!end_loop) {" line). - Start corosync - while true;do ./corosync-cmapctl totem.cluster_name;done - Stop corosync Watch cmapctl stuck and eating 100% of one core. Using the updated corosync rpms, the upgrade passed the stage it used to fail in, and failed after 4 hours run during the major-upgrade-pacemaker-converge step with: 2016-05-19 03:28:25 [overcloud-CephStorageCephDeployment-yxak3hlzkhck]: UPDATE_COMPLETE Stack UPDATE completed successfully 2016-05-19 03:28:27 [ControllerCephDeployment]: UPDATE_COMPLETE state changed 2016-05-19 03:28:27 [CephStorageCephDeployment]: UPDATE_COMPLETE state changed ERROR: Authentication failed: Authentication required [stack@instack ~]$ [stack@instack ~]$ [stack@instack ~]$ [stack@instack ~]$ [stack@instack ~]$ [stack@instack ~]$ [stack@instack ~]$ [stack@instack ~]$ [stack@instack ~]$ heat resource-list -n5 overcloud|grep -v COMPLE +--------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+---------------------+-----------------------------------------------------------------------------------------------+ | resource_name | physical_resource_id | resource_type | resource_status | updated_time | stack_name | +--------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+---------------------+-----------------------------------------------------------------------------------------------+ | Compute | 520b5036-48b2-4ff7-9e22-2c0d40d9d6f4 | OS::Heat::ResourceGroup | UPDATE_FAILED | 2016-05-19T03:24:56 | overcloud | | 0 | bbea6423-3e7f-41d2-b8d9-425f9c9bd088 | OS::TripleO::Compute | UPDATE_FAILED | 2016-05-19T03:25:05 | overcloud-Compute-jfxnp5zhnqu2 | | UpdateDeployment | 2c3e65db-b54b-4c07-aa86-c1d77690747d | OS::Heat::SoftwareDeployment | CREATE_FAILED | 2016-05-19T03:26:22 | overcloud-Compute-jfxnp5zhnqu2-0-gwvboa4eenug | +--------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+---------------------+-----------------------------------------------------------------------------------------------+ [stack@instack ~]$ heat deployment-show 2c3e65db-b54b-4c07-aa86-c1d77690747d { "status": "IN_PROGRESS", "server_id": "0c6981c8-0ee7-4df2-be98-50d258992f01", "config_id": "573938c4-542d-4f49-a83a-4c825f6ff097", "output_values": null, "creation_time": "2016-05-19T03:26:29", "input_values": {}, "action": "CREATE", "status_reason": "Deploy data available", "id": "2c3e65db-b54b-4c07-aa86-c1d77690747d" } [stack@instack ~]$ The heat-engine.log (which is too big to attach) has these repeating block of erros: 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service [-] Unhandled exception 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service Traceback (most recent call last): 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 377, in _child_wait_for_exit_or_signal 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service launcher.wait() 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 204, in wait 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service self.services.wait() 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 625, in wait 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service service.wait() 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 105, in wrapper 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service return f(*args, **kwargs) 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 591, in wait 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service self._done.wait() 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/eventlet/event.py", line 121, in wait 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service return hubs.get_hub().switch() 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 294, in switch 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service return self.greenlet.switch() 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 346, in run 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service self.wait(sleep_time) 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/eventlet/hubs/poll.py", line 85, in wait 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service presult = self.do_poll(seconds) 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/eventlet/hubs/epolls.py", line 62, in do_poll 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service return self.poll.poll(seconds) 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 160, in _handle_signals 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service handler(signo, frame) 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 356, in _sigterm 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service self.launcher.stop() 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 196, in stop 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service self.services.stop() 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 612, in stop 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service service.stop() 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 105, in wrapper 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service return f(*args, **kwargs) 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 408, in stop 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service service_objects.Service.delete(ctxt, self.service_id) 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/heat/objects/service.py", line 78, in delete 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service db_api.service_delete(context, service_id, soft_delete) 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/heat/db/api.py", line 378, in service_delete 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service return IMPL.service_delete(context, service_id, soft_delete) 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/heat/db/sqlalchemy/api.py", line 1006, in service_delete 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service service = service_get(context, service_id) 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/heat/db/sqlalchemy/api.py", line 1016, in service_get 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service result = model_query(context, models.Service).get(service_id) 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 819, in get 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service return self._get_impl(ident, loading.load_on_ident) 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 852, in _get_impl 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service return fallback_fn(self, key) 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/loading.py", line 219, in load_on_ident 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service return q.one() 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2473, in one 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service ret = list(self) 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/loading.py", line 86, in instances 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service util.raise_from_cause(err) 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service reraise(type(exception), exception, tb=exc_tb) 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/loading.py", line 67, in instances 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service fetch = cursor.fetchall() 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/result.py", line 968, in fetchall 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service self.cursor, self.context) 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1337, in _handle_dbapi_exception 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service util.raise_from_cause(newraise, exc_info) 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service reraise(type(exception), exception, tb=exc_tb) 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/result.py", line 963, in fetchall 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service self._soft_close() 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/result.py", line 667, in _soft_close 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service self.connection.close() 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 826, in close 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service conn.close() 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 882, in close 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service self._checkin() 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 758, in _checkin 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service self._pool, None, self._echo, fairy=self) 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 650, in _finalize_fairy 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service connection_record.checkin() 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 509, in checkin 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service pool.dispatch.checkin(connection, self) 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib64/python2.7/site-packages/sqlalchemy/event/attr.py", line 256, in __call__ 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service fn(*args, **kw) 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/engines.py", line 50, in _thread_yield 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service time.sleep(0) 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 31, in sleep 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service assert hub.greenlet is not current, 'do not call blocking functions from the mainloop' 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service DBError: do not call blocking functions from the mainloop 2016-05-18 17:39:11.156 31851 ERROR oslo_service.service Sasha, since this bug is very specific to corosync. It is best if we open a new one, so this one can take its course and end up in RHEL (and avoid the upgrade to fail early on). Let's take this to a new BZ as this issue is not pacemaker related (I just checked the pcs status and it is 100% started and fine) Filed BZ #1337624 for the issue reportes in comment #13 So the re-run of the test showed that the patch proposed by Michele actually worked. I was able to deploy 7.2, update to 7.3 , upgrade to 8.0. Awesome, thanks everyone :) Adding title to doc text for inclusion in 7.3 release notes. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-2463.html closed, no need for needinfo. |