Bug 1507013 - Pacemaker LVM monitor causing service restarts due to flock() delays in vgscan/vgs commands
Summary: Pacemaker LVM monitor causing service restarts due to flock() delays in vgscan/vgs commands
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: resource-agents
Version: 7.4
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Oyvind Albrigtsen
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-10-27 12:11 UTC by Greg Charles
Modified: 2021-08-19 07:31 UTC
CC List: 12 users

Fixed In Version: resource-agents-4.1.1-1.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-19 07:31:11 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System                             ID       Private  Priority  Status  Summary  Last Updated
Red Hat Bugzilla                   1470840  1        None      None    None     2022-03-13 14:21:10 UTC
Red Hat Knowledge Base (Solution)  3110711  0        None      None    None     2018-06-26 14:12:55 UTC
Red Hat Knowledge Base (Solution)  3110971  0        None      None    None     2018-06-26 14:11:50 UTC

Internal Links: 1470840 1531465

Description Greg Charles 2017-10-27 12:11:15 UTC
Description of problem:
Pacemaker LVM monitors usually take 1 or 2 seconds to complete but periodically can take more than 90 seconds, which exceeds common timeout thresholds and causes services to be restarted in place.  We have tested on systems where we verified there are no storage-related issues that would contribute to this behavior.

Investigation has found that flock() calls within the vgs and vgscan commands used by the Pacemaker LVM monitor appear to be responsible for the long delays. Increasing the number of active services on a cluster node raises the likelihood that more than one LVM monitor runs at the same time, creating the potential for flock() collisions between them.

Reducing the frequency of Pacemaker LVM monitor runs mitigates the problem but cannot fully avoid it. The Pacemaker LVM monitor should instead verify volume group availability using methods that do not cause this kind of flock() contention.
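(Not part of the original report: the kind of flock() contention described above can be illustrated outside the cluster with the flock(1) utility. The lock file below is a scratch stand-in for /run/lock/lvm/P_global:aux, and the sleep simply simulates a slow vgs/vgscan holding the lock.)

#!/bin/bash
# Minimal sketch of two "monitors" competing for one exclusive lock.
LOCK=/tmp/lvm-lock-demo            # stand-in for /run/lock/lvm/P_global:aux

# First "monitor": takes the lock and holds it for 20 seconds,
# playing the role of a long-running vgs/vgscan.
flock -x "$LOCK" -c 'echo "holder: lock acquired"; sleep 20' &

sleep 1

# Second "monitor": tries to take the same lock; the elapsed time printed
# here mirrors the multi-second flock() waits seen in the strace output below.
time flock -x "$LOCK" -c 'echo "waiter: lock acquired"'
wait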

Version-Release number of selected component (if applicable):

lvm2-libs-2.02.171-8.el7.x86_64
lvm2-2.02.171-8.el7.x86_64
pcs-0.9.158-6.el7.x86_64

How reproducible:
The Pacemaker clusters we are testing on consist of three nodes, all ProLiant DL380 G9 with 16 CPUs and 256 GB of memory.  The entire cluster runs 19 Oracle database services; the node we are testing on is running just 5 of those services.

Process accounting is enabled to track vgs and vgscan process time when monitoring is initiated by the cluster.  

Strace is used to track vgscan processes (strace -T) as called by Pacemaker.
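(For reference, a sketch of how the two measurement steps above might be set up on RHEL 7; the psacct package and the exact strace invocation are assumptions here, not copied from the customer environment.)

# Process accounting: record per-command run times for vgs/vgscan.
yum install -y psacct
systemctl enable --now psacct
# After the monitors have run for a while, list recent executions:
lastcomm vgs vgscan

# Attach strace to a vgscan spawned by the LVM monitor, timing each
# syscall (-T) and following child processes (-f):
strace -f -T -p "$(pgrep -n vgscan)" -o /tmp/vgscan.strace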

Volume group monitoring intervals are set to 60 seconds and timeouts to 90 seconds.  All other resource monitoring is left at the defaults.
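(The interval and timeout above would typically be applied to the monitor operation of the LVM resource with pcs; the resource name my_vg below is a placeholder, not taken from the report.)

# Set a 60-second monitor interval and 90-second timeout on the VG resource:
pcs resource update my_vg op monitor interval=60s timeout=90s
# Verify the operation settings:
pcs resource show my_vg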

Steps to Reproduce:
1. Create several database services utilizing SAN storage (IP resource, volume group resource, 4 lvm resources in the volume group, and the LSB script); a rough pcs sketch follows these steps.
2. Start services and let Pacemaker initiate its own built-in LVM monitoring.
3. Observe process times via accounting on time taken to complete vgs and vgscan processes.
4. Timeouts start being registered in pacemaker.log and are reflected in pcs status output for the given service.
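(A rough pcs sketch of the service layout from step 1; every name, address, device path, and the LSB script are placeholders, and the real configuration may differ.)

pcs resource create db1_ip IPaddr2 ip=192.0.2.10 cidr_netmask=24 --group db1
pcs resource create db1_vg LVM volgrpname=db1vg exclusive=true --group db1
pcs resource create db1_fs1 Filesystem device=/dev/db1vg/lv1 directory=/ora/db1/u01 fstype=xfs --group db1
# ...repeat for the remaining logical volumes in the volume group...
pcs resource create db1_app lsb:oracledb1 --group db1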

Actual results:
Eventually, timeouts against volume group monitoring show up under Failed Actions in pcs output, and pacemaker.log shows 90000msec timeout failures.  After only a few of these, the monitored service is restarted in place because the cluster has determined there is an issue.
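(The symptoms above can be checked with something like the following; the log path is the RHEL 7 default and may vary, and the grep pattern is only a rough filter.)

# Failed monitor actions appear in cluster status:
pcs status --full
# Corresponding timeout messages land in the Pacemaker log:
grep -iE 'timed out|timeout' /var/log/pacemaker.log | tail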

Expected results:
No timeouts should be registered, since no SAN or other storage issues are observed in the cluster.


Additional info:
The following comes from a "strace -T" of vgscan called by a Pacemaker LVM monitor. Elapsed time in seconds is shown within <>; these show the extended-duration flock() calls. The preceding open() calls are included to show the filenames involved.

12561 open("/run/lock/lvm/P_global:aux", O_RDWR|O_CREAT|O_APPEND, 0777) = 3 <0.000012>
12561 flock(3, LOCK_EX)                 = 0 <21.466415>


12561 open("/run/lock/lvm/P_global:aux", O_RDWR|O_CREAT|O_APPEND, 0777) = 3 <0.000019>
12561 flock(3, LOCK_EX)                 = 0 <17.003831>


12561 open("/run/lock/lvm/P_global:aux", O_RDWR|O_CREAT|O_APPEND, 0777) = 3 <0.000013>
12561 flock(3, LOCK_EX)                 = 0 <10.702100>


12561 open("/run/lock/lvm/P_global:aux", O_RDWR|O_CREAT|O_APPEND, 0777) = 3 <0.000019>
12561 flock(3, LOCK_EX)                 = 0 <7.790907>


12561 open("/run/lock/lvm/P_global:aux", O_RDWR|O_CREAT|O_APPEND, 0777) = 3 <0.000013>
12561 flock(3, LOCK_EX)                 = 0 <10.532846>


12561 open("/run/lock/lvm/P_global", O_RDWR|O_CREAT|O_APPEND, 0777) = 4 <0.000010>
12561 flock(4, LOCK_EX)                 = 0 <2.100148>

Comment 2 Ken Gaillot 2017-10-27 14:58:26 UTC
Good work tracking down the issue, and thanks for the extensive context. Re-assigning to the resource-agents component for further investigation.

Comment 3 John Ruemker 2017-10-27 17:14:26 UTC
These observations are in line with what we've seen recently in several other environments where LVM monitor operations were timing out, most frequently when several LVM resources are managed on the same node and fire their monitors around the same time.  We've never had an opportunity to get good tracing data, though, so this new insight is definitely useful.

I do wonder what the LVM team's expectation would be for response times of concurrent vgs operations: should this be seen as a typical result, or is there some improvement we can pursue in lvm2 to address it?
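(Not from the report, but one quick way to get a feel for that question is to time several vgs invocations launched in parallel; under the global LVM file lock they largely serialize, so the later ones take noticeably longer.)

# Launch five vgs commands at once and report each one's elapsed time:
for i in 1 2 3 4 5; do
    ( /usr/bin/time -f "vgs #$i: %e s" vgs >/dev/null ) &
done
wait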

Comment 4 Bradley Frank 2017-11-10 13:50:05 UTC
Customer uploaded more data surrounding this issue in case 01957915.

Comment 5 Bradley Frank 2017-11-17 20:13:27 UTC
Is there any update for this issue?
Customer in case 01957915 mentioned the following:

"Is there a way to speed up the response time? Will a business justification help? UPS are really digging into the issue and I would not think they are willing to wait for 15 days before a response may or may not be provided."

Comment 7 Curtis Taylor 2017-12-14 00:08:07 UTC
The vgscans were added to the heartbeat LVM agent as a result of:
https://bugzilla.redhat.com/show_bug.cgi?id=1454699 and
https://github.com/ClusterLabs/resource-agents/pull/981 , correct?

That change ended up being reverted from the current upstream git because "_all_ lvm commands were at one point in time removed from the status action exactly because timeouts were reported." ref: https://github.com/ClusterLabs/resource-agents/pull/981#issuecomment-336949397, correct?

If upstream has not accepted adding the vgscans because of timeouts during the status action, why hasn't a new solution to BZ#1454699 been developed, or why haven't clear and precise limitations on the number of devices and the latency requirements needed to avoid such timeouts been documented?

Comment 18 RHEL Program Management 2021-08-19 07:31:11 UTC
After evaluating this issue, we have concluded that there are no plans to address it further or fix it in an upcoming release; therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, the bug can be reopened.

