Description by Christine Caulfield, 2019-06-10 08:01:16 UTC
This bug was initially created as a copy of Bug #1710988
See https://github.com/ClusterLabs/libqb/pull/352/files
+++ This bug was initially created as a clone of Bug #1625671 +++
Description of problem: Large deployments can suffer from timeouts in communication between lrmd and stonith-ng during cluster startup, while stonithd's CPU usage approaches 100%. So far we have detected this problem with the lrmd_rsc_info, lrmd_rsc_register, and st_device_remove operations.
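For context, here is a minimal sketch of the libqb client call that sits underneath Pacemaker's crm_ipc_send(): the caller blocks for up to ms_timeout waiting for the server's reply and gets -ETIMEDOUT (-110) back when a busy stonithd cannot answer in time. The payload and message id below are illustrative placeholders, not Pacemaker's real wire format; only the server name "stonith-ng" and the 60000ms timeout are taken from the logs in this report.

/* Hypothetical client: send one request to the "stonith-ng" IPC server and
 * wait up to 60 seconds for the reply, as lrmd effectively does here. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <qb/qbipcc.h>
#include <qb/qbipc_common.h>

int main(void)
{
    struct {
        struct qb_ipc_request_header hdr;
        char payload[256];
    } req;
    char reply[8192];
    struct iovec iov;

    qb_ipcc_connection_t *c = qb_ipcc_connect("stonith-ng", 64 * 1024);
    if (c == NULL) {
        perror("qb_ipcc_connect");
        return 1;
    }

    memset(&req, 0, sizeof(req));
    req.hdr.id = 1;                        /* illustrative message id */
    req.hdr.size = sizeof(req);
    snprintf(req.payload, sizeof(req.payload), "example-request");

    iov.iov_base = &req;
    iov.iov_len = req.hdr.size;

    /* 60000 ms matches the timeout reported by lrmd in this bug. */
    ssize_t rc = qb_ipcc_sendv_recv(c, &iov, 1, reply, sizeof(reply), 60000);
    if (rc == -ETIMEDOUT) {
        fprintf(stderr, "request timed out after 60000ms (rc=%zd)\n", rc);
    }

    qb_ipcc_disconnect(c);
    return (rc < 0) ? 1 : 0;
}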
Version-Release number of selected component (if applicable):
corosync-2.4.3-2.el7_5.1.x86_64
pacemaker-1.1.18-11.el7_5.3.x86_64
pcs-0.9.162-5.el7_5.1.x86_64
resource-agents-3.9.5-124.el7.x86_64
How reproducible: Easily with large deployments during cluster startup (16 nodes; resources: DLM, CLVMD, and 145 cloned GFS2 resources)
Steps to Reproduce: Start the cluster
Actual results:
- lrmd times out waiting for stonithd during cluster start: "lrmd: warning: crm_ipc_send: Request 3 to stonith-ng (0x563971c8e6c0) failed: Connection timed out (-110) after 60000ms" (for example during the lrmd_rsc_info, lrmd_rsc_register, and st_device_remove operations)
Aug 20 10:54:16 [22289] node crmd: ( ipc.c:1309 ) trace: crm_ipc_send: Response not received: rc=-110, errno=110
Aug 20 10:54:16 [22289] node crmd: ( ipc.c:1318 ) warning: crm_ipc_send: Request 997 to lrmd (0x562e78270560) failed: Connection timed out (-110) after 5000ms
Aug 20 10:54:16 [22289] node crmd: (lrmd_client.:828 ) error: lrmd_send_command: Couldn't perform lrmd_rsc_info operation (timeout=0): -110: Connection timed out (110)
Aug 20 10:54:16 [22289] node crmd: (lrmd_client.:812 ) trace: lrmd_send_command: sending lrmd_rsc_register op to lrmd
Aug 20 10:54:16 [22289] node crmd: ( ipc.c:1225 ) trace: crm_ipc_send: Trying again to obtain pending reply from lrmd
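The "Trying again to obtain pending reply" trace above reflects a retry-until-deadline pattern: after sending a request, the client keeps asking for the pending reply in short slices until the overall timeout expires, then reports -ETIMEDOUT (-110). The fragment below illustrates that pattern against the public libqb API; it is not a copy of crm_ipc_send()'s implementation, and the 500 ms slice is an arbitrary value chosen for the example.

/* Wait for a pending reply, polling in short slices until a deadline. */
#include <errno.h>
#include <time.h>
#include <sys/types.h>
#include <qb/qbipcc.h>

ssize_t
wait_for_reply(qb_ipcc_connection_t *c, void *buf, size_t buf_len,
               long total_timeout_ms)
{
    struct timespec start, now;
    clock_gettime(CLOCK_MONOTONIC, &start);

    for (;;) {
        /* Ask libqb for the reply, waiting at most 500 ms this round. */
        ssize_t rc = qb_ipcc_recv(c, buf, buf_len, 500);
        if (rc != -EAGAIN && rc != -ETIMEDOUT) {
            return rc;                /* reply received, or a hard error */
        }

        clock_gettime(CLOCK_MONOTONIC, &now);
        long elapsed_ms = (now.tv_sec - start.tv_sec) * 1000
                          + (now.tv_nsec - start.tv_nsec) / 1000000;
        if (elapsed_ms >= total_timeout_ms) {
            return -ETIMEDOUT;        /* what crmd/lrmd log as -110 */
        }
        /* otherwise: "Trying again to obtain pending reply" */
    }
}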
- stonithd is at 100% CPU usage at those times:
### stonithd consistently uses 100% (or nearly 100%) CPU during the cluster start:
top - 15:32:34 up 1 day, 23:27, 2 users, load average: 0.88, 0.75, 0.88
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13401 root 20 0 334360 96004 44032 R 97.2 0.0 0:20.26 stonithd
top - 15:32:54 up 1 day, 23:27, 2 users, load average: 0.76, 0.74, 0.87
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13401 root 20 0 334360 96100 44128 R 100.0 0.0 0:40.32 stonithd
top - 15:33:14 up 1 day, 23:28, 2 users, load average: 0.89, 0.77, 0.88
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13401 root 20 0 334360 96180 44208 R 100.0 0.0 1:00.38 stonithd
top - 15:33:34 up 1 day, 23:28, 2 users, load average: 0.99, 0.80, 0.89
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13401 root 20 0 334360 96236 44264 R 100.0 0.0 1:20.45 stonithd
top - 15:33:54 up 1 day, 23:28, 2 users, load average: 1.00, 0.81, 0.89
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13401 root 20 0 334360 96324 44352 R 100.0 0.0 1:40.53 stonithd
----------------------
top - 16:13:42 up 2 days, 8 min, 2 users, load average: 1.90, 1.44, 1.30
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13401 root 20 0 442432 166444 95316 R 99.5 0.1 41:13.41 stonithd
top - 16:14:02 up 2 days, 9 min, 2 users, load average: 1.82, 1.45, 1.31
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13401 root 20 0 442704 166968 95448 R 99.9 0.1 41:33.45 stonithd
top - 16:14:22 up 2 days, 9 min, 2 users, load average: 1.66, 1.43, 1.31
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13401 root 20 0 442972 167336 95536 R 100.0 0.1 41:53.53 stonithd
top - 16:14:42 up 2 days, 9 min, 2 users, load average: 1.32, 1.37, 1.29
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13401 root 20 0 443104 167748 95780 S 53.8 0.1 42:04.33 stonithd
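Purely as an illustration of why a startup burst can pin a daemon at 100% CPU for minutes (this is hypothetical code, not stonithd's, and says nothing about the actual root cause here): if each incoming registration triggers work proportional to the number of items already registered, N near-simultaneous registrations cost O(N^2) in total, and clients queue up behind that work until their IPC timeouts fire.

/* Hypothetical sketch: O(n) duplicate scan per registration -> O(n^2) for a
 * burst of n registrations, e.g. 145 cloned resources x 16 nodes at startup. */
#include <stdio.h>
#include <string.h>

#define MAX_ITEMS 4096

static char registered[MAX_ITEMS][64];
static int n_registered;

static void register_item(const char *id)
{
    for (int i = 0; i < n_registered; i++) {   /* linear scan per request */
        if (strcmp(registered[i], id) == 0) {
            return;                            /* already known */
        }
    }
    if (n_registered < MAX_ITEMS) {
        snprintf(registered[n_registered++], sizeof(registered[0]), "%s", id);
    }
}

int main(void)
{
    char id[64];

    for (int i = 0; i < 145 * 16; i++) {       /* simulated startup burst */
        snprintf(id, sizeof(id), "rsc-%d", i);
        register_item(id);
    }
    printf("registered %d items\n", n_registered);
    return 0;
}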
Expected results: lrmd requests do not time out and stonithd does not generate substantial CPU load during cluster start (operations lrmd_rsc_info, st_device_remove etc.)
Additional info: pacemaker debug logs in attachment
--- Additional comment from Red Hat Bugzilla Rules Engine on 2018-09-05 13:33:42 UTC ---
Since this bug report was entered in Red Hat Bugzilla, the release flag has been set to ? to ensure that it is properly evaluated for this release.
--- Additional comment from Ondrej Benes on 2018-09-05 13:35:23 UTC ---
There is an ongoing discussion on this topic upstream:
https://github.com/ClusterLabs/pacemaker/pull/1573
--- Additional comment from Ondrej Benes on 2018-09-05 13:45:50 UTC ---
The attachment is too large to upload. It is located at
http://file.brq.redhat.com/~obenes/02159148/bz02159148.tar.xz
The archive has two directories, one for each of the lrmd_rsc_info and st_device_remove operations. Refer to the notes files for a summary and examples. Unprocessed logs are available in the *corosync.log files.
/cases/02159148 $ tree bz
bz
├── bz-text
├── lrmd_rsc_info
│ ├── cppra84a0156.corosync.log
│ ├── cppra85a0156.corosync.log
│ ├── cppra86a0156.corosync.log
│ ├── cppra87a0156.corosync.log
│ ├── cppra88a0156.corosync.log
│ ├── cppra89a0156.corosync.log
│ ├── cppra90a0156.corosync.log
│ ├── cppra93a0156.corosync.log
│ ├── cppra94a0156.corosync.log
│ ├── cppra95a0156.corosync.log
│ ├── cppra96a0156.corosync.log
│ ├── cppra97a0156.corosync.log
│ ├── cppra98a0156.corosync.log
│ ├── cppra99a0156.corosync.log
│ └── lrmd_rsc_info-notes
└── st_device_remove
├── cppra84a0156.corosync.log
├── cppra85a0156.corosync.log
├── cppra87a0156.corosync.log
├── cppra88a0156.corosync.log
├── cppra89a0156.corosync.log
├── cppra90a0156.corosync.log
└── st_device_remove-notes
2 directories, 23 files
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2019:3610