Bug 1336462

Summary: corosync: Handle corosync errors in print_key func (was rhel-osp-director: upgrade 7.3->8.0, that follows update 7.2->7.3, times out during "major-upgrade-pacemaker.yaml" step.)
Product: Red Hat Enterprise Linux 7 Reporter: Alexander Chuzhoy <sasha>
Component: corosync    Assignee: Jan Friesse <jfriesse>
Status: CLOSED ERRATA QA Contact: cluster-qe <cluster-qe>
Severity: urgent Docs Contact: Steven J. Levine <slevine>
Priority: urgent    
Version: 7.2CC: augol, ccaulfie, cfeist, cluster-maint, dbecker, djansa, fdinitto, jcoufal, jkortus, jruemker, jstransk, lbezdick, mburns, mdolezel, michele, mnavrati, morazi, rhel-osp-director-maint, rscarazz, sasha
Target Milestone: pre-dev-freeze    Keywords: ZStream
Target Release: 7.2   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: corosync-2.3.6-1.el7 Doc Type: Bug Fix
Doc Text:
The `corosync-cmapctl` utility correctly handles errors in the `print_key()` function. Previously, the `corosync-cmapctl` utility did not handle corosync errors in the `print_key()` function correctly. Consequently, `corosync-cmapctl` could enter an infinite loop if the `corosync` daemon was killed. The provided fix makes sure all errors returned when corosync exits are handled correctly. As a result, `corosync-cmapctl` leaves the loop and displays a relevant error message in this scenario.
Clones: 1338667 (view as bug list)    Environment:
Last Closed: 2016-11-04 06:49:49 UTC Type: Bug
Bug Blocks: 1338667    
Attachments:
Proposed patch (no flags)

Description Alexander Chuzhoy 2016-05-16 14:29:39 UTC
rhel-osp-director: upgrade 7.3->8.0, which follows an update 7.2->7.3, times out during the "major-upgrade-pacemaker.yaml" step.

Environment:


Steps to reproduce:
1. Deploy 7.2 HA with netiso (deployment command: openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --ceph-storage-scale 1   --neutron-network-type vxlan --neutron-tunnel-types vxlan  --ntp-server clock.redhat.com --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml)
2. Populate the setup.
3. Update to 7.3
4. Add another instance to the setup.
5. Attempt to upgrade to 8.0


Result:
The upgrade completed the major-upgrade-pacemaker-init.yaml step
and failed after more than 4 hours during the major-upgrade-pacemaker.yaml step, with:

2016-05-12 22:55:07 [2]: SIGNAL_COMPLETE  Unknown
2016-05-12 22:55:10 [2]: SIGNAL_COMPLETE  Unknown
2016-05-12 22:55:10 [ControllerDeployment]: SIGNAL_COMPLETE  Unknown
2016-05-12 22:55:11 [2]: SIGNAL_COMPLETE  Unknown
2016-05-12 22:55:12 [NetworkDeployment]: SIGNAL_COMPLETE  Unknown
There was an error running Upgrade overcloud - step #3:upgrade. Exiting....
ERROR: Authentication failed: Authentication required

ERROR: Authentication failed: Authentication required


Note: The cluster is not running on the controllers.

Comment 2 Alexander Chuzhoy 2016-05-16 14:34:19 UTC
Environment:
openstack-tripleo-heat-templates-0.8.14-9.el7ost.noarch
instack-undercloud-2.2.7-4.el7ost.noarch
openstack-tripleo-heat-templates-kilo-0.8.14-9.el7ost.noarch
openstack-puppet-modules-7.0.17-1.el7ost.noarch

Comment 4 Jiri Stransky 2016-05-16 15:30:15 UTC
The initial investigation points at something pacemaker-related, possibly a bug in pacemaker itself, as it seems to behave strangely: one of the sub-commands is stuck in what looks like some form of busy waiting.

The upgrade is stuck on controller-0 in a call to `pcs cluster stop --all`, which is still ongoing and blocks os-collect-config:

root      4223     1  0 May12 ?        00:00:46 /usr/bin/python /usr/bin/os-collect-config
root     14702  4223  0 May12 ?        00:00:00  \_ /usr/bin/python /usr/bin/os-refresh-config
root     15257 14702  0 May12 ?        00:00:00      \_ /bin/bash /usr/local/bin/dib-run-parts /usr/libexec/os-refresh-config/configure.d
root     20355 15257  0 May12 ?        00:00:00          \_ python /usr/libexec/os-refresh-config/configure.d/55-heat-config
root     20360 20355  0 May12 ?        00:00:00              \_ python /var/lib/heat-config/hooks/script
root     20361 20360  0 May12 ?        00:00:00                  \_ /bin/bash /var/lib/heat-config/heat-config-script/2b8b02dc-c95c-4abc-83c9-10fa5b348335
root     25809 20361  0 May12 ?        00:05:49                      \_ /usr/bin/python2 /usr/sbin/pcs cluster stop --all
root     26148 25809  0 May12 ?        00:00:00                          \_ /usr/bin/ruby -I/usr/lib/pcsd/ /usr/lib/pcsd/pcsd-cli.rb read_tokens
root     26180 26148 99 May12 ?        3-15:49:04                              \_ /usr/sbin/corosync-cmapctl totem.cluster_name

This further calls into:

corosync-cmapctl totem.cluster_name

...which seems stuck. Note the very high CPU usage of this process in the paste above. Also `top` shows it is still consuming all the CPU it can get:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
26180 root      20   0   24440    840    672 R 100.0  0.0   5277:18 corosync-cmapct
19392 swift     20   0  401064  51200   2300 S   0.7  0.9   8:25.18 swift-proxy-ser
 1198 root      10 -10  271652  36164   9260 S   0.3  0.6   6:09.75 ovs-vswitchd
 8136 root      20   0  272556  41940   9568 S   0.3  0.7   6:59.05 ceph-mon
    1 root      20   0  191992   6776   3548 S   0.0  0.1   1:14.30 systemd

Comment 6 Jiri Stransky 2016-05-16 16:30:36 UTC
Actually I wonder if this might be a duplicate of bug 1330688.

Comment 7 Lukas Bezdicka 2016-05-16 17:02:18 UTC
The ruby function get_cluster_name is called; it runs COROSYNC_CMAPCTL with "totem.cluster_name" under PCSAuth.getSuperuserSession. That call should fail because we shut down the cluster, and after that pcsd should read the cluster name from the config file instead. But corosync-cmapctl is looping in print_key, where it keeps getting CS_ERR_LIBRARY.
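
For illustration, here is a minimal C sketch of the failure mode described above. It is a paraphrase modeled on the cmap API (cmap_get is a real corosync call), not a copy of tools/corosync-cmapctl.c; the buffer handling and key typing are simplified:

#include <unistd.h>
#include <corosync/corotypes.h>
#include <corosync/cmap.h>

/* Paraphrase of the pre-fix retry loop in print_key(): only CS_OK and
 * CS_ERR_TRY_AGAIN are handled. Once corosync is gone, cmap_get()
 * starts returning CS_ERR_LIBRARY, which matches neither branch, so
 * end_loop is never set and the loop spins at 100% CPU. */
static void print_key_buggy(cmap_handle_t handle, const char *key_name)
{
    char value[256];
    size_t value_len;
    cmap_value_types_t type;
    cs_error_t err;
    int end_loop = 0;

    while (!end_loop) {
        value_len = sizeof(value);
        err = cmap_get(handle, key_name, value, &value_len, &type);
        if (err == CS_OK) {
            end_loop = 1;        /* success: done */
        } else if (err == CS_ERR_TRY_AGAIN) {
            sleep(1);            /* transient error: retry */
        }
        /* any other error, e.g. CS_ERR_LIBRARY: nothing sets end_loop,
         * so the loop runs forever */
    }
}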

Comment 8 Michele Baldessari 2016-05-17 09:25:19 UTC
(In reply to Jiri Stransky from comment #6)
> Actually I wonder if this might be a duplicate of bug 1330688.

I think the symptoms are a bit different, although the root cause might be the
same. Let me loop in some corosync folks here.

Comment 10 Jan Friesse 2016-05-17 10:13:16 UTC
Created attachment 1158240 [details]
Proposed patch

cmapctl: Handle corosync errors in print_key func

print_key handles only the CS_ERR_TRY_AGAIN error. If a different error is
returned, print_key loops forever.

The solution is to handle all errors.
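
A minimal sketch of what the fix amounts to, again paraphrased against the simplified loop shown under comment 7 rather than quoting the actual patch: any status other than CS_OK or CS_ERR_TRY_AGAIN now terminates the loop and is reported (cs_strerror() is corosync's error-to-string helper):

#include <stdio.h>
#include <unistd.h>
#include <corosync/corotypes.h>
#include <corosync/cmap.h>

static cs_error_t print_key_fixed(cmap_handle_t handle, const char *key_name)
{
    char value[256];
    size_t value_len;
    cmap_value_types_t type;
    cs_error_t err;
    int end_loop = 0;

    while (!end_loop) {
        value_len = sizeof(value);
        err = cmap_get(handle, key_name, value, &value_len, &type);
        if (err == CS_OK) {
            end_loop = 1;        /* success: done */
        } else if (err == CS_ERR_TRY_AGAIN) {
            sleep(1);            /* transient error: retry */
        } else {
            /* hard error, e.g. CS_ERR_LIBRARY after corosync was killed:
             * report it and leave the loop instead of spinning forever */
            fprintf(stderr, "Can't get key %s. Error %s\n",
                    key_name, cs_strerror(err));
            end_loop = 1;
        }
    }
    return err;
}

With this change, `corosync-cmapctl totem.cluster_name` exits with an error message instead of busy-looping when corosync has already been stopped.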

Comment 11 Jan Friesse 2016-05-17 10:19:48 UTC
@Lukas:
Thanks for the excellent description of the problem. It allowed me to create a patch in a few minutes.

Whitebox reproducer:
- Add "sleep(2)" into tools/corosync-cmapctl.c line 190 (just before "while (!end_loop) {" line).
- Start corosync
- while true;do ./corosync-cmapctl totem.cluster_name;done
- Stop corosync

Watch cmapctl get stuck, eating 100% of one core.

Comment 13 Alexander Chuzhoy 2016-05-19 15:02:31 UTC
Using the updated corosync rpms, the upgrade passed the stage where it used to fail, and then failed after a 4-hour run during the major-upgrade-pacemaker-converge step with:
2016-05-19 03:28:25 [overcloud-CephStorageCephDeployment-yxak3hlzkhck]: UPDATE_COMPLETE  Stack UPDATE completed successfully
2016-05-19 03:28:27 [ControllerCephDeployment]: UPDATE_COMPLETE  state changed
2016-05-19 03:28:27 [CephStorageCephDeployment]: UPDATE_COMPLETE  state changed



ERROR: Authentication failed: Authentication required
[stack@instack ~]$ heat resource-list -n5 overcloud|grep -v COMPLE
+--------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+---------------------+-----------------------------------------------------------------------------------------------+
| resource_name                              | physical_resource_id                          | resource_type                                     | resource_status | updated_time        | stack_name                                                                                    |
+--------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+---------------------+-----------------------------------------------------------------------------------------------+
| Compute                                    | 520b5036-48b2-4ff7-9e22-2c0d40d9d6f4          | OS::Heat::ResourceGroup                           | UPDATE_FAILED   | 2016-05-19T03:24:56 | overcloud                                                                                     |
| 0                                          | bbea6423-3e7f-41d2-b8d9-425f9c9bd088          | OS::TripleO::Compute                              | UPDATE_FAILED   | 2016-05-19T03:25:05 | overcloud-Compute-jfxnp5zhnqu2                                                                |
| UpdateDeployment                           | 2c3e65db-b54b-4c07-aa86-c1d77690747d          | OS::Heat::SoftwareDeployment                      | CREATE_FAILED   | 2016-05-19T03:26:22 | overcloud-Compute-jfxnp5zhnqu2-0-gwvboa4eenug                                                 |
+--------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+---------------------+-----------------------------------------------------------------------------------------------+
[stack@instack ~]$ heat deployment-show 2c3e65db-b54b-4c07-aa86-c1d77690747d
{
  "status": "IN_PROGRESS",
  "server_id": "0c6981c8-0ee7-4df2-be98-50d258992f01",
  "config_id": "573938c4-542d-4f49-a83a-4c825f6ff097",
  "output_values": null,
  "creation_time": "2016-05-19T03:26:29",
  "input_values": {},
  "action": "CREATE",
  "status_reason": "Deploy data available",
  "id": "2c3e65db-b54b-4c07-aa86-c1d77690747d"
}
[stack@instack ~]$

Comment 14 Alexander Chuzhoy 2016-05-19 15:19:05 UTC
The heat-engine.log (which is too big to attach) has this repeating block of errors:
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service [-] Unhandled exception
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service Traceback (most recent call last):
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 377, in _child_wait_for_exit_or_signal
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     launcher.wait()
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 204, in wait
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     self.services.wait()
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 625, in wait
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     service.wait()
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 105, in wrapper
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     return f(*args, **kwargs)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 591, in wait
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     self._done.wait()
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/eventlet/event.py", line 121, in wait
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     return hubs.get_hub().switch()
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 294, in switch
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     return self.greenlet.switch()
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 346, in run
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     self.wait(sleep_time)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/eventlet/hubs/poll.py", line 85, in wait
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     presult = self.do_poll(seconds)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/eventlet/hubs/epolls.py", line 62, in do_poll
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     return self.poll.poll(seconds)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 160, in _handle_signals
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     handler(signo, frame)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 356, in _sigterm
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     self.launcher.stop()
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 196, in stop
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     self.services.stop()
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 612, in stop
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     service.stop()
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 105, in wrapper
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     return f(*args, **kwargs)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 408, in stop
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     service_objects.Service.delete(ctxt, self.service_id)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/heat/objects/service.py", line 78, in delete
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     db_api.service_delete(context, service_id, soft_delete)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/heat/db/api.py", line 378, in service_delete
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     return IMPL.service_delete(context, service_id, soft_delete)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/heat/db/sqlalchemy/api.py", line 1006, in service_delete
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     service = service_get(context, service_id)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/heat/db/sqlalchemy/api.py", line 1016, in service_get
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     result = model_query(context, models.Service).get(service_id)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 819, in get
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     return self._get_impl(ident, loading.load_on_ident)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 852, in _get_impl
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     return fallback_fn(self, key)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/loading.py", line 219, in load_on_ident
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     return q.one()
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2473, in one
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     ret = list(self)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/loading.py", line 86, in instances
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     util.raise_from_cause(err)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     reraise(type(exception), exception, tb=exc_tb)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/loading.py", line 67, in instances
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     fetch = cursor.fetchall()
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/result.py", line 968, in fetchall
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     self.cursor, self.context)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1337, in _handle_dbapi_exception
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     util.raise_from_cause(newraise, exc_info)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     reraise(type(exception), exception, tb=exc_tb)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/result.py", line 963, in fetchall
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     self._soft_close()
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/result.py", line 667, in _soft_close
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     self.connection.close()
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 826, in close
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     conn.close()
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 882, in close
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     self._checkin()
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 758, in _checkin
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     self._pool, None, self._echo, fairy=self)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 650, in _finalize_fairy
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     connection_record.checkin()
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 509, in checkin
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     pool.dispatch.checkin(connection, self)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib64/python2.7/site-packages/sqlalchemy/event/attr.py", line 256, in __call__
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     fn(*args, **kw)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/engines.py", line 50, in _thread_yield
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     time.sleep(0)
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 31, in sleep
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service     assert hub.greenlet is not current, 'do not call blocking functions from the mainloop'
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service DBError: do not call blocking functions from the mainloop
2016-05-18 17:39:11.156 31851 ERROR oslo_service.service

Comment 15 Michele Baldessari 2016-05-19 15:36:09 UTC
Sasha, since this bug is very specific to corosync, it is best if we open a
new one, so this one can take its course and end up in RHEL (and avoid the
upgrade failing early on). Let's take this to a new BZ, as this issue is not
pacemaker-related (I just checked the pcs status and it is 100% started and fine).

Comment 16 Alexander Chuzhoy 2016-05-19 16:19:45 UTC
Filed BZ #1337624 for the issue reported in comment #13.

Comment 18 Alexander Chuzhoy 2016-05-19 21:21:21 UTC
So the re-run of the test showed that the patch proposed by Michele actually worked. I was able to deploy 7.2, update to 7.3, and upgrade to 8.0.

Comment 19 Jiri Stransky 2016-05-20 08:20:53 UTC
Awesome, thanks everyone :)

Comment 27 Steven J. Levine 2016-10-21 17:18:49 UTC
Adding a title to the doc text for inclusion in the 7.3 release notes.

Comment 29 errata-xmlrpc 2016-11-04 06:49:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2463.html

Comment 31 Amit Ugol 2018-05-02 10:49:36 UTC
Closed, no need for needinfo.