Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1710988

Summary: Avoid libqb deadlock due to not using selected socket poll priority
Product: Red Hat Enterprise Linux 7
Reporter: Ken Gaillot <kgaillot>
Component: libqb
Assignee: Christine Caulfield <ccaulfie>
Status: CLOSED WONTFIX
QA Contact: cluster-qe <cluster-qe>
Severity: medium
Docs Contact:
Priority: high
Version: 7.6
CC: cfeist, cluster-maint, cluster-qe, cww, jfriesse, kgaillot, obenes, phagara
Target Milestone: rc
Target Release: 7.8
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1625671
Environment:
Last Closed: 2021-03-15 07:36:03 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1625671

Description Ken Gaillot 2019-05-16 17:33:53 UTC
See https://github.com/ClusterLabs/libqb/pull/352/files
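
[Background, added for context and not part of the original report: libqb lets the hosting application wire its IPC service sockets into the application's own main loop through struct qb_ipcs_poll_handlers, and it passes the poll priority it has selected for each descriptor into the dispatch_add/dispatch_mod callbacks. The summary refers to that priority being honoured rather than dropped. The following is a minimal, illustrative sketch of the pattern against the public libqb API; the service name "example" and the trivial handlers are invented for the example and are not taken from the pull request.]

#include <sys/types.h>
#include <qb/qbloop.h>
#include <qb/qbipcs.h>

static qb_loop_t *main_loop;

/* Glue between the IPC service and the main loop.  The key detail is
 * that the priority 'p' selected by libqb for each socket is forwarded
 * into qb_loop_poll_add()/qb_loop_poll_mod().  Ignoring 'p' and
 * hard-coding one priority is the class of problem this report is
 * about: all connections then poll at the same level, and one busy
 * client can starve the rest. */
static int32_t
my_job_add(enum qb_loop_priority p, void *data, qb_loop_job_dispatch_fn fn)
{
    return qb_loop_job_add(main_loop, p, data, fn);
}

static int32_t
my_dispatch_add(enum qb_loop_priority p, int32_t fd, int32_t events,
                void *data, qb_ipcs_dispatch_fn_t fn)
{
    return qb_loop_poll_add(main_loop, p, fd, events, data, fn);
}

static int32_t
my_dispatch_mod(enum qb_loop_priority p, int32_t fd, int32_t events,
                void *data, qb_ipcs_dispatch_fn_t fn)
{
    return qb_loop_poll_mod(main_loop, p, fd, events, data, fn);
}

static int32_t
my_dispatch_del(int32_t fd)
{
    return qb_loop_poll_del(main_loop, fd);
}

/* Trivial service handlers, for illustration only. */
static int32_t
my_connection_accept(qb_ipcs_connection_t *c, uid_t uid, gid_t gid)
{
    return 0;                     /* accept every connection */
}

static int32_t
my_msg_process(qb_ipcs_connection_t *c, void *data, size_t size)
{
    return 0;                     /* ignore the request payload */
}

int
main(void)
{
    struct qb_ipcs_service_handlers sh = {
        .connection_accept = my_connection_accept,
        .msg_process       = my_msg_process,
    };
    struct qb_ipcs_poll_handlers ph = {
        .job_add      = my_job_add,
        .dispatch_add = my_dispatch_add,
        .dispatch_mod = my_dispatch_mod,
        .dispatch_del = my_dispatch_del,
    };

    main_loop = qb_loop_create();

    qb_ipcs_service_t *service = qb_ipcs_create("example", 0, QB_IPC_SHM, &sh);
    qb_ipcs_poll_handlers_set(service, &ph);
    qb_ipcs_run(service);         /* registers the listening socket via ph */

    qb_loop_run(main_loop);
    return 0;
}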

+++ This bug was initially created as a clone of Bug #1625671 +++

Description of problem: Large deployments can suffer from timeouts in communication between lrmd and stonith-ng during cluster startup, with stonithd's CPU usage nearing 100% at those moments. So far, we have detected this problem with the lrmd_rsc_info, lrmd_rsc_register, and st_device_remove operations.
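
[Context, added here and not part of the original report: lrmd talks to stonith-ng over pacemaker's IPC layer, which is built on libqb's client API. The "Connection timed out (-110)" failures quoted later in this report are ETIMEDOUT (errno 110); in a plain libqb client the same condition is expected to surface as a negative errno from the blocking send/receive call, roughly as sketched below. The service name "some-service", the message id, and the payload are illustrative only; the 60-second timeout mirrors the 60000ms in the lrmd messages.]

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <qb/qbipcc.h>
#include <qb/qbipc_common.h>

int
main(void)
{
    /* "some-service" is a placeholder; it is not necessarily the name
     * that stonith-ng registers for its IPC service. */
    qb_ipcc_connection_t *conn = qb_ipcc_connect("some-service", 128 * 1024);
    if (conn == NULL) {
        perror("qb_ipcc_connect");
        return 1;
    }

    /* Every libqb request begins with a qb_ipc_request_header. */
    struct {
        struct qb_ipc_request_header hdr;
        char payload[256];
    } request;

    memset(&request, 0, sizeof(request));
    request.hdr.id = 1;                     /* application-defined id */
    request.hdr.size = sizeof(request);
    strncpy(request.payload, "example request", sizeof(request.payload) - 1);

    struct iovec iov = { .iov_base = &request, .iov_len = sizeof(request) };
    char response[1024];

    /* Block up to 60s for the reply.  Errors come back as negative
     * errno values; a timeout is expected to appear as -ETIMEDOUT
     * (-110), matching the lrmd log messages in this report. */
    ssize_t rc = qb_ipcc_sendv_recv(conn, &iov, 1,
                                    response, sizeof(response), 60000);
    if (rc == -ETIMEDOUT) {
        fprintf(stderr, "request timed out waiting for the server\n");
    } else if (rc < 0) {
        fprintf(stderr, "request failed: %zd\n", rc);
    }

    qb_ipcc_disconnect(conn);
    return 0;
}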


Version-Release number of selected component (if applicable):

corosync-2.4.3-2.el7_5.1.x86_64
pacemaker-1.1.18-11.el7_5.3.x86_64
pcs-0.9.162-5.el7_5.1.x86_64
resource-agents-3.9.5-124.el7.x86_64



How reproducible: Easily with large deployments during cluster startup (16 nodes; resources: DLM, CLVMD, and 145 cloned gfs2 resources)


Steps to Reproduce: Start cluster 

Actual results: 

- lrmd times out waiting for stonithd during cluster start: "lrmd:  warning: crm_ipc_send:  Request 3 to stonith-ng (0x563971c8e6c0) failed: Connection timed out (-110) after 60000ms" (for example, on the lrmd_rsc_info, lrmd_rsc_register, and st_device_remove operations)


Aug 20 10:54:16 [22289] node       crmd: (       ipc.c:1309  )   trace: crm_ipc_send: Response not received: rc=-110, errno=110
Aug 20 10:54:16 [22289] node       crmd: (       ipc.c:1318  ) warning: crm_ipc_send: Request 997 to lrmd (0x562e78270560) failed: Connection timed out (-110) after 5000ms
Aug 20 10:54:16 [22289] node       crmd: (lrmd_client.:828   )   error: lrmd_send_command:  Couldn't perform lrmd_rsc_info operation (timeout=0): -110: Connection timed out (110)
Aug 20 10:54:16 [22289] node       crmd: (lrmd_client.:812   )   trace: lrmd_send_command:  sending lrmd_rsc_register op to lrmd
Aug 20 10:54:16 [22289] node       crmd: (       ipc.c:1225  )   trace: crm_ipc_send: Trying again to obtain pending reply from lrmd


- stonithd is at 100% CPU usage at those times:


### stonithd consistently uses 100% (or nearly 100%) CPU during cluster start:

top - 15:32:34 up 1 day, 23:27,  2 users,  load average: 0.88, 0.75, 0.88
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13401 root      20   0  334360  96004  44032 R  97.2  0.0   0:20.26 stonithd
top - 15:32:54 up 1 day, 23:27,  2 users,  load average: 0.76, 0.74, 0.87
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13401 root      20   0  334360  96100  44128 R 100.0  0.0   0:40.32 stonithd
top - 15:33:14 up 1 day, 23:28,  2 users,  load average: 0.89, 0.77, 0.88
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13401 root      20   0  334360  96180  44208 R 100.0  0.0   1:00.38 stonithd
top - 15:33:34 up 1 day, 23:28,  2 users,  load average: 0.99, 0.80, 0.89
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13401 root      20   0  334360  96236  44264 R 100.0  0.0   1:20.45 stonithd
top - 15:33:54 up 1 day, 23:28,  2 users,  load average: 1.00, 0.81, 0.89
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13401 root      20   0  334360  96324  44352 R 100.0  0.0   1:40.53 stonithd
----------------------
top - 16:13:42 up 2 days, 8 min,  2 users,  load average: 1.90, 1.44, 1.30
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13401 root      20   0  442432 166444  95316 R  99.5  0.1  41:13.41 stonithd
top - 16:14:02 up 2 days, 9 min,  2 users,  load average: 1.82, 1.45, 1.31
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13401 root      20   0  442704 166968  95448 R  99.9  0.1  41:33.45 stonithd
top - 16:14:22 up 2 days, 9 min,  2 users,  load average: 1.66, 1.43, 1.31
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13401 root      20   0  442972 167336  95536 R 100.0  0.1  41:53.53 stonithd
top - 16:14:42 up 2 days, 9 min,  2 users,  load average: 1.32, 1.37, 1.29
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13401 root      20   0  443104 167748  95780 S  53.8  0.1  42:04.33 stonithd


Expected results: lrmd requests do not time out, and stonithd does not generate substantial CPU load during cluster start (for operations such as lrmd_rsc_info and st_device_remove).


Additional info: pacemaker debug logs in attachment

--- Additional comment from Red Hat Bugzilla Rules Engine on 2018-09-05 13:33:42 UTC ---

Since this bug report was entered in Red Hat Bugzilla, the release flag has been set to ? to ensure that it is properly evaluated for this release.

--- Additional comment from Ondrej Benes on 2018-09-05 13:35:23 UTC ---

There is an ongoing discussion on this topic upstream:

https://github.com/ClusterLabs/pacemaker/pull/1573
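
[Context, added here and not part of the original comment: pacemaker daemons run their libqb IPC services from a GLib main loop rather than from qb_loop, so the dispatch_add glue has to translate the libqb poll priority into a GLib source priority. The sketch below shows the general pattern being discussed; the names (watch_ctx, my_dispatch_add, qb_to_glib_priority) are invented for illustration, and this is not the code from the pull request or from pacemaker's mainloop.]

#include <poll.h>
#include <glib.h>
#include <qb/qbipcs.h>

/* Carry the libqb dispatch callback and its argument through GLib. */
struct watch_ctx {
    qb_ipcs_dispatch_fn_t fn;
    void *data;
    int32_t fd;
};

static gboolean
gio_dispatch(GIOChannel *source, GIOCondition cond, gpointer user_data)
{
    struct watch_ctx *ctx = user_data;
    int32_t revents = 0;

    if (cond & G_IO_IN)  revents |= POLLIN;
    if (cond & G_IO_HUP) revents |= POLLHUP;
    if (cond & G_IO_ERR) revents |= POLLERR;

    /* Returning FALSE removes the GLib source if libqb reports an error. */
    return (ctx->fn(ctx->fd, revents, ctx->data) == 0);
}

/* Translate the priority libqb selected for this socket instead of
 * registering every IPC descriptor at G_PRIORITY_DEFAULT. */
static gint
qb_to_glib_priority(enum qb_loop_priority p)
{
    switch (p) {
    case QB_LOOP_HIGH: return G_PRIORITY_HIGH;
    case QB_LOOP_LOW:  return G_PRIORITY_LOW;
    default:           return G_PRIORITY_DEFAULT;
    }
}

static int32_t
my_dispatch_add(enum qb_loop_priority p, int32_t fd, int32_t events,
                void *data, qb_ipcs_dispatch_fn_t fn)
{
    struct watch_ctx *ctx = g_new0(struct watch_ctx, 1);
    GIOChannel *channel = g_io_channel_unix_new(fd);
    GIOCondition cond = G_IO_HUP | G_IO_ERR;

    if (events & POLLIN) {
        cond |= G_IO_IN;
    }

    ctx->fn = fn;
    ctx->data = data;
    ctx->fd = fd;

    g_io_add_watch_full(channel, qb_to_glib_priority(p), cond,
                        gio_dispatch, ctx, g_free);
    g_io_channel_unref(channel);   /* the watch holds its own reference */
    return 0;
}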

--- Additional comment from Ondrej Benes on 2018-09-05 13:45:50 UTC ---

The attachment is too large to upload. It is located at
http://file.brq.redhat.com/~obenes/02159148/bz02159148.tar.xz

The archive has two directories, one for each of the operations lrmd_rsc_info and st_device_remove. Refer to the notes files for a summary and examples. Unprocessed logs are available in the *corosync.log files.

/cases/02159148 $ tree bz
bz
├── bz-text
├── lrmd_rsc_info
│   ├── cppra84a0156.corosync.log
│   ├── cppra85a0156.corosync.log
│   ├── cppra86a0156.corosync.log
│   ├── cppra87a0156.corosync.log
│   ├── cppra88a0156.corosync.log
│   ├── cppra89a0156.corosync.log
│   ├── cppra90a0156.corosync.log
│   ├── cppra93a0156.corosync.log
│   ├── cppra94a0156.corosync.log
│   ├── cppra95a0156.corosync.log
│   ├── cppra96a0156.corosync.log
│   ├── cppra97a0156.corosync.log
│   ├── cppra98a0156.corosync.log
│   ├── cppra99a0156.corosync.log
│   └── lrmd_rsc_info-notes
└── st_device_remove
    ├── cppra84a0156.corosync.log
    ├── cppra85a0156.corosync.log
    ├── cppra87a0156.corosync.log
    ├── cppra88a0156.corosync.log
    ├── cppra89a0156.corosync.log
    ├── cppra90a0156.corosync.log
    └── st_device_remove-notes

2 directories, 23 files

Comment 6 RHEL Program Management 2021-03-15 07:36:03 UTC
After evaluating this issue, we have determined there are no plans to address it further or fix it in an upcoming release, and it is therefore being closed. If plans change such that this issue will be fixed in an upcoming release, the bug can be reopened.