Bug 1718773 - Avoid libqb deadlock due to not using selected socket poll priority
Summary: Avoid libqb deadlock due to not using selected socket poll priority
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: libqb
Version: 8.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 8.1
Assignee: Christine Caulfield
QA Contact: michal novacek
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-06-10 08:01 UTC by Christine Caulfield
Modified: 2020-11-14 09:46 UTC
CC List: 5 users

Fixed In Version: libqb-1.0.3-10.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-11-05 22:27:43 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System                  ID              Private  Priority  Status  Summary  Last Updated
Red Hat Product Errata  RHSA-2019:3610  0        None      None    None     2019-11-05 22:27:46 UTC

Description Christine Caulfield 2019-06-10 08:01:16 UTC
This bug was initially created as a copy of Bug #1710988


See https://github.com/ClusterLabs/libqb/pull/352/files

+++ This bug was initially created as a clone of Bug #1625671 +++

Description of problem: Large deployments can suffer from timeouts in communication between lrmd and stonith-ng during cluster startup. stonithd's CPU usage nears 100% at those moments. So far, we have detected this problem with the lrmd_rsc_info, lrmd_rsc_register, and st_device_remove operations.


Version-Release number of selected component (if applicable):

corosync-2.4.3-2.el7_5.1.x86_64
pacemaker-1.1.18-11.el7_5.3.x86_64
pcs-0.9.162-5.el7_5.1.x86_64
resource-agents-3.9.5-124.el7.x86_64



How reproducible: Easily during cluster startup with large deployments (16 nodes; resources: DLM + CLVMD + 145 cloned gfs2 resources)


Steps to Reproduce: Start cluster 
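For illustration, a rough sketch of the kind of deployment described under "How reproducible" above (resource names, volume paths, and mount points are placeholders, not taken from this case; the agents match the RHEL 7 packages listed):

# DLM and CLVMD clones plus many cloned GFS2 Filesystem resources
pcs resource create dlm ocf:pacemaker:controld op monitor interval=30s clone interleave=true ordered=true
pcs resource create clvmd ocf:heartbeat:clvm op monitor interval=30s clone interleave=true ordered=true
pcs constraint order start dlm-clone then clvmd-clone
for i in $(seq 1 145); do
    pcs resource create "gfs2-$i" ocf:heartbeat:Filesystem \
        device="/dev/vg_cluster/lv$i" directory="/mnt/gfs2-$i" fstype="gfs2" \
        op monitor interval=10s clone interleave=true
done
# restarting the whole cluster is when the timeouts show up
pcs cluster stop --all && pcs cluster start --all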

Actual results: 

- lrmd times out waiting for stonithd during a cluster start: "lrmd:  warning: crm_ipc_send:  Request 3 to stonith-ng (0x563971c8e6c0) failed: Connection timed out (-110) after 60000ms" (for example, during the lrmd_rsc_info, lrmd_rsc_register, and st_device_remove operations)


Aug 20 10:54:16 [22289] node       crmd: (       ipc.c:1309  )   trace: crm_ipc_send: Response not received: rc=-110, errno=110
Aug 20 10:54:16 [22289] node       crmd: (       ipc.c:1318  ) warning: crm_ipc_send: Request 997 to lrmd (0x562e78270560) failed: Connection timed out (-110) after 5000ms
Aug 20 10:54:16 [22289] node       crmd: (lrmd_client.:828   )   error: lrmd_send_command:  Couldn't perform lrmd_rsc_info operation (timeout=0): -110: Connection timed out (110)
Aug 20 10:54:16 [22289] node       crmd: (lrmd_client.:812   )   trace: lrmd_send_command:  sending lrmd_rsc_register op to lrmd
Aug 20 10:54:16 [22289] node       crmd: (       ipc.c:1225  )   trace: crm_ipc_send: Trying again to obtain pending reply from lrmd


- stonithd is at 100% CPU usage at those times:


### stonithd consistently uses 100% (or almost 100%) CPU during the cluster start:

top - 15:32:34 up 1 day, 23:27,  2 users,  load average: 0.88, 0.75, 0.88
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13401 root      20   0  334360  96004  44032 R  97.2  0.0   0:20.26 stonithd
top - 15:32:54 up 1 day, 23:27,  2 users,  load average: 0.76, 0.74, 0.87
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13401 root      20   0  334360  96100  44128 R 100.0  0.0   0:40.32 stonithd
top - 15:33:14 up 1 day, 23:28,  2 users,  load average: 0.89, 0.77, 0.88
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13401 root      20   0  334360  96180  44208 R 100.0  0.0   1:00.38 stonithd
top - 15:33:34 up 1 day, 23:28,  2 users,  load average: 0.99, 0.80, 0.89
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13401 root      20   0  334360  96236  44264 R 100.0  0.0   1:20.45 stonithd
top - 15:33:54 up 1 day, 23:28,  2 users,  load average: 1.00, 0.81, 0.89
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13401 root      20   0  334360  96324  44352 R 100.0  0.0   1:40.53 stonithd
----------------------
top - 16:13:42 up 2 days, 8 min,  2 users,  load average: 1.90, 1.44, 1.30
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13401 root      20   0  442432 166444  95316 R  99.5  0.1  41:13.41 stonithd
top - 16:14:02 up 2 days, 9 min,  2 users,  load average: 1.82, 1.45, 1.31
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13401 root      20   0  442704 166968  95448 R  99.9  0.1  41:33.45 stonithd
top - 16:14:22 up 2 days, 9 min,  2 users,  load average: 1.66, 1.43, 1.31
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13401 root      20   0  442972 167336  95536 R 100.0  0.1  41:53.53 stonithd
top - 16:14:42 up 2 days, 9 min,  2 users,  load average: 1.32, 1.37, 1.29
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13401 root      20   0  443104 167748  95780 S  53.8  0.1  42:04.33 stonithd
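
Similar samples can be taken with top in batch mode; this is an assumption about how such output might be captured, not necessarily how the reporter did it:

top -b -d 20 -n 12 | grep -E '^top -|stonithd'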


Expected results: lrmd requests do not time out and stonithd does not generate substantial CPU load during cluster start (operations lrmd_rsc_info, st_device_remove etc.)


Additional info: pacemaker debug logs in attachment

--- Additional comment from Red Hat Bugzilla Rules Engine on 2018-09-05 13:33:42 UTC ---

Since this bug report was entered in Red Hat Bugzilla, the release flag has been set to ? to ensure that it is properly evaluated for this release.

--- Additional comment from Ondrej Benes on 2018-09-05 13:35:23 UTC ---

There is an ongoing discussion on this topic upstream:

https://github.com/ClusterLabs/pacemaker/pull/1573

--- Additional comment from Ondrej Benes on 2018-09-05 13:45:50 UTC ---

Attachment too large to upload. It is located at
http://file.brq.redhat.com/~obenes/02159148/bz02159148.tar.xz

The archive has two directories, one for each of the operations lrmd_rsc_info and st_device_remove. Refer to the notes files for a summary and examples; unprocessed logs are in the *corosync.log files.

/cases/02159148 $ tree bz
bz
├── bz-text
├── lrmd_rsc_info
│   ├── cppra84a0156.corosync.log
│   ├── cppra85a0156.corosync.log
│   ├── cppra86a0156.corosync.log
│   ├── cppra87a0156.corosync.log
│   ├── cppra88a0156.corosync.log
│   ├── cppra89a0156.corosync.log
│   ├── cppra90a0156.corosync.log
│   ├── cppra93a0156.corosync.log
│   ├── cppra94a0156.corosync.log
│   ├── cppra95a0156.corosync.log
│   ├── cppra96a0156.corosync.log
│   ├── cppra97a0156.corosync.log
│   ├── cppra98a0156.corosync.log
│   ├── cppra99a0156.corosync.log
│   └── lrmd_rsc_info-notes
└── st_device_remove
    ├── cppra84a0156.corosync.log
    ├── cppra85a0156.corosync.log
    ├── cppra87a0156.corosync.log
    ├── cppra88a0156.corosync.log
    ├── cppra89a0156.corosync.log
    ├── cppra90a0156.corosync.log
    └── st_device_remove-notes

2 directories, 23 files

Comment 1 Patrik Hagara 2019-06-10 14:00:56 UTC
qa_ack+

reproducer in description

Comment 3 michal novacek 2019-09-10 13:52:25 UTC
I have verified that a 16-node cluster with ~150 gfs2 resources starts
without any timeout errors with libqb-1.0.3-10.el8.x86_64

---

> [root@virt-152 ~]# pcs cluster status
Cluster Status:
 Stack: corosync
 Current DC: virt-164 (version 2.0.2-3.el8-744a30d655) - partition with quorum
 Last updated: Tue Sep 10 15:45:55 2019
 Last change: Tue Sep 10 15:21:50 2019 by root via cibadmin on virt-152
 16 nodes configured
 144 resources configured

PCSD Status:
  virt-152: Online
  virt-153: Online
  virt-154: Online
  virt-155: Online
  virt-156: Online
  virt-157: Online
  virt-158: Online
  virt-159: Online
  virt-160: Online
  virt-161: Online
  virt-162: Online
  virt-163: Online
  virt-164: Online
  virt-165: Online
  virt-166: Online
  virt-167: Online

> [root@virt-152 ~]# pcs cluster start --all
...

> [root@virt-152 ~]# rpm -q libqb corosync pacemaker
libqb-1.0.3-10.el8.x86_64
corosync-3.0.2-3.el8.x86_64
pacemaker-2.0.2-3.el8.x86_64

> enable debugging output on all nodes (appended to /etc/sysconfig/pacemaker):

echo "PCMK_debug=yes" >> /etc/sysconfig/pacemaker
echo "PCMK_trace_functions=crm_ipc_send,lrmd_send_command,stonith_send_command,cib_action_update" >> /etc/sysconfig/pacemaker

> [root@virt-152 ~]# pcs cluster start --all --wait && crm_resource --wait
...

No error or timeout message showed up in log files on any of the cluster nodes.
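
A quick way to double-check this on each node is to search the pacemaker detail log for the timeout signature from the original report (the log path is an assumption; on RHEL 8 the detail log is typically /var/log/pacemaker/pacemaker.log):

grep -E 'crm_ipc_send:.*Connection timed out' /var/log/pacemaker/pacemaker.log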

Comment 5 errata-xmlrpc 2019-11-05 22:27:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:3610

