Bug 1718773

Summary: Avoid libqb deadlock due to not using selected socket poll priority
Product: Red Hat Enterprise Linux 8
Component: libqb
Reporter: Christine Caulfield <ccaulfie>
Assignee: Christine Caulfield <ccaulfie>
QA Contact: michal novacek <mnovacek>
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Version: 8.1
CC: cluster-maint, jfriesse, kgaillot, mnovacek, phagara
Target Milestone: rc
Target Release: 8.1
Hardware: Unspecified
OS: Unspecified
Fixed In Version: libqb-1.0.3-10.el8
Last Closed: 2019-11-05 22:27:43 UTC
Type: Bug

Description Christine Caulfield 2019-06-10 08:01:16 UTC
This bug was initially created as a copy of Bug #1710988


See https://github.com/ClusterLabs/libqb/pull/352/files

+++ This bug was initially created as a clone of Bug #1625671 +++

Description of problem: Large deployments can suffer from timeouts in communication between lrmd and stonith-ng during cluster startup. stonithd's CPU usage nears 100% at those moments. So far, we have detected this problem with the lrmd_rsc_info, lrmd_rsc_register, and st_device_remove operations.


Version-Release number of selected component (if applicable):

corosync-2.4.3-2.el7_5.1.x86_64
pacemaker-1.1.18-11.el7_5.3.x86_64
pcs-0.9.162-5.el7_5.1.x86_64
resource-agents-3.9.5-124.el7.x86_64



How reproducible: During cluster startup, easily with large deployments (16 nodes; resources: DLM + CLVMD + 145 cloned gfs2 resources)


Steps to Reproduce: Start cluster 

Actual results: 

- lrmd times out waiting for stonithd during a cluster start: "lrmd:  warning: crm_ipc_send:  Request 3 to stonith-ng (0x563971c8e6c0) failed: Connection timed out (-110) after 60000ms" (operations lrmd_rsc_info, lrmd_rsc_register, st_device_remove, for example)


Aug 20 10:54:16 [22289] node       crmd: (       ipc.c:1309  )   trace: crm_ipc_send: Response not received: rc=-110, errno=110
Aug 20 10:54:16 [22289] node       crmd: (       ipc.c:1318  ) warning: crm_ipc_send: Request 997 to lrmd (0x562e78270560) failed: Connection timed out (-110) after 5000ms
Aug 20 10:54:16 [22289] node       crmd: (lrmd_client.:828   )   error: lrmd_send_command:  Couldn't perform lrmd_rsc_info operation (timeout=0): -110: Connection timed out (110)
Aug 20 10:54:16 [22289] node       crmd: (lrmd_client.:812   )   trace: lrmd_send_command:  sending lrmd_rsc_register op to lrmd
Aug 20 10:54:16 [22289] node       crmd: (       ipc.c:1225  )   trace: crm_ipc_send: Trying again to obtain pending reply from lrmd


- stonithd at 100% of CPU usage at those times:


### stonithd consistently uses 100% (or nearly 100%) CPU during cluster start:

top - 15:32:34 up 1 day, 23:27,  2 users,  load average: 0.88, 0.75, 0.88
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13401 root      20   0  334360  96004  44032 R  97.2  0.0   0:20.26 stonithd
top - 15:32:54 up 1 day, 23:27,  2 users,  load average: 0.76, 0.74, 0.87
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13401 root      20   0  334360  96100  44128 R 100.0  0.0   0:40.32 stonithd
top - 15:33:14 up 1 day, 23:28,  2 users,  load average: 0.89, 0.77, 0.88
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13401 root      20   0  334360  96180  44208 R 100.0  0.0   1:00.38 stonithd
top - 15:33:34 up 1 day, 23:28,  2 users,  load average: 0.99, 0.80, 0.89
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13401 root      20   0  334360  96236  44264 R 100.0  0.0   1:20.45 stonithd
top - 15:33:54 up 1 day, 23:28,  2 users,  load average: 1.00, 0.81, 0.89
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13401 root      20   0  334360  96324  44352 R 100.0  0.0   1:40.53 stonithd
----------------------
top - 16:13:42 up 2 days, 8 min,  2 users,  load average: 1.90, 1.44, 1.30
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13401 root      20   0  442432 166444  95316 R  99.5  0.1  41:13.41 stonithd
top - 16:14:02 up 2 days, 9 min,  2 users,  load average: 1.82, 1.45, 1.31
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13401 root      20   0  442704 166968  95448 R  99.9  0.1  41:33.45 stonithd
top - 16:14:22 up 2 days, 9 min,  2 users,  load average: 1.66, 1.43, 1.31
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13401 root      20   0  442972 167336  95536 R 100.0  0.1  41:53.53 stonithd
top - 16:14:42 up 2 days, 9 min,  2 users,  load average: 1.32, 1.37, 1.29
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13401 root      20   0  443104 167748  95780 S  53.8  0.1  42:04.33 stonithd


Expected results: lrmd requests do not time out and stonithd does not generate substantial CPU load during cluster start (operations lrmd_rsc_info, st_device_remove etc.)


Additional info: pacemaker debug logs in attachment

--- Additional comment from Red Hat Bugzilla Rules Engine on 2018-09-05 13:33:42 UTC ---

Since this bug report was entered in Red Hat Bugzilla, the release flag has been set to ? to ensure that it is properly evaluated for this release.

--- Additional comment from Ondrej Benes on 2018-09-05 13:35:23 UTC ---

There is an ongoing discussion on this topic upstream:

https://github.com/ClusterLabs/pacemaker/pull/1573

--- Additional comment from Ondrej Benes on 2018-09-05 13:45:50 UTC ---

Attachment too large to upload. It is located in 
http://file.brq.redhat.com/~obenes/02159148/bz02159148.tar.xz

The archive has two directories, one for each of the operations lrmd_rsc_info, st_device_remove. Refer to notes for summary and examples. Unprocessed logs available in *corosync.log.

/cases/02159148 $ tree bz
bz
├── bz-text
├── lrmd_rsc_info
│   ├── cppra84a0156.corosync.log
│   ├── cppra85a0156.corosync.log
│   ├── cppra86a0156.corosync.log
│   ├── cppra87a0156.corosync.log
│   ├── cppra88a0156.corosync.log
│   ├── cppra89a0156.corosync.log
│   ├── cppra90a0156.corosync.log
│   ├── cppra93a0156.corosync.log
│   ├── cppra94a0156.corosync.log
│   ├── cppra95a0156.corosync.log
│   ├── cppra96a0156.corosync.log
│   ├── cppra97a0156.corosync.log
│   ├── cppra98a0156.corosync.log
│   ├── cppra99a0156.corosync.log
│   └── lrmd_rsc_info-notes
└── st_device_remove
    ├── cppra84a0156.corosync.log
    ├── cppra85a0156.corosync.log
    ├── cppra87a0156.corosync.log
    ├── cppra88a0156.corosync.log
    ├── cppra89a0156.corosync.log
    ├── cppra90a0156.corosync.log
    └── st_device_remove-notes

2 directories, 23 files

Comment 1 Patrik Hagara 2019-06-10 14:00:56 UTC
qa_ack+

reproducer in description

Comment 3 michal novacek 2019-09-10 13:52:25 UTC
I have verified that a 16-node cluster with ~150 gfs2 resources will start
without any timeout errors with libqb-1.0.3-10.el8.x86_64.

---

> [root@virt-152 ~]# pcs cluster status
Cluster Status:
 Stack: corosync
 Current DC: virt-164 (version 2.0.2-3.el8-744a30d655) - partition with quorum
 Last updated: Tue Sep 10 15:45:55 2019
 Last change: Tue Sep 10 15:21:50 2019 by root via cibadmin on virt-152
 16 nodes configured
 144 resources configured

PCSD Status:
  virt-152: Online
  virt-153: Online
  virt-154: Online
  virt-155: Online
  virt-156: Online
  virt-157: Online
  virt-158: Online
  virt-159: Online
  virt-160: Online
  virt-161: Online
  virt-162: Online
  virt-163: Online
  virt-164: Online
  virt-165: Online
  virt-166: Online
  virt-167: Online

> [root@virt-152 ~]# pcs cluster start --all
...

> [root@virt-152 ~]# rpm -q libqb corosync pacemaker
libqb-1.0.3-10.el8.x86_64
corosync-3.0.2-3.el8.x86_64
pacemaker-2.0.2-3.el8.x86_64

> enable debugging output on all nodes:

cat >> /etc/sysconfig/pacemaker <<'EOF'
PCMK_debug=yes
PCMK_trace_functions=crm_ipc_send,lrmd_send_command,stonith_send_command,cib_action_update
EOF

> [root@virt-152 ~]# pcs cluster start --all --wait && crm_resource --wait
...

No error or timeout message showed up in log files on any of the cluster nodes.

Comment 5 errata-xmlrpc 2019-11-05 22:27:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:3610