Bug 1507013

Summary:	Pacemaker LVM monitor causing service restarts due to flock() delays in vgscan/vgs commands
Product:	Red Hat Enterprise Linux 7	Reporter:	Greg Charles <gcharles>
Component:	resource-agents	Assignee:	Oyvind Albrigtsen <oalbrigt>
Status:	CLOSED WONTFIX	QA Contact:	cluster-qe <cluster-qe>
Severity:	high	Docs Contact:
Priority:	high
Version:	7.4	CC:	agk, bfrank, cfeist, c.handel, cluster-maint, fdinitto, hlinder, kcleveng, mlisik, oalbrigt, pzimek, sbradley
Target Milestone:	rc	Flags:	pm-rhel: mirror+
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:	resource-agents-4.1.1-1.el7	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-08-19 07:31:11 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Greg Charles 2017-10-27 12:11:15 UTC

Description of problem:
Pacemaker LVM monitors usually take 1 or 2 seconds to complete but periodically can take more than 90 seconds, which exceeds common timeout thresholds and causes services to be restarted in place.  We have tested on systems where we verified there are no storage-related issues that would contribute to this behavior.

Investigation has found flock() calls within the vgs and vgscan commands used by the Pacemaker LVM monitor appear to be responsible for the long delays. Increasing the number of active services on a cluster node raises the likelihood that more than one LVM monitor could run at the same time, creating the potential for flock() collisions between them.
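The contention described above can be illustrated with a minimal sketch. It assumes only that each worker opens the same lock file separately and takes an exclusive flock(), as separate commands would. The lock file name, hold time, and worker count are hypothetical and are not taken from LVM.

```python
import fcntl
import os
import tempfile
import threading
import time


def timed_flock(path, hold_seconds, results, name):
    """Take an exclusive flock on path and record how long acquisition took."""
    # Each call opens its own file description, so the locks conflict
    # even inside one process, as they would between separate commands.
    with open(path, "a") as handle:
        started = time.monotonic()
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks while another holder exists
        results[name] = time.monotonic() - started
        time.sleep(hold_seconds)  # stand-in for work done while holding the lock
        fcntl.flock(handle, fcntl.LOCK_UN)


def simulate_contention(workers=3, hold_seconds=0.2):
    """Start several workers at once and return each worker's lock wait in seconds."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        lock_path = os.path.join(tmp, "P_global_demo")  # hypothetical lock file
        threads = [
            threading.Thread(
                target=timed_flock, args=(lock_path, hold_seconds, results, i)
            )
            for i in range(workers)
        ]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
    return results


if __name__ == "__main__":
    for worker, waited in sorted(simulate_contention().items()):
        print(f"worker {worker} waited {waited:.3f}s for the lock")
```

Because the lock is exclusive, the waits serialize: the last worker to acquire it waits for every earlier holder, so the worst-case wait grows with the number of concurrent holders.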

Reducing the frequency of Pacemaker LVM monitor runs mitigates this problem but cannot fully avoid it. The Pacemaker LVM monitor should instead verify volume group availability and related state using methods that do not cause flock() contention in this way.
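As a rough illustration of a check that takes no LVM lock, the sketch below only looks for device nodes under /dev/<vg>. This is an assumption made for illustration, not the resource agent's actual implementation: the function name and the dev_root parameter are hypothetical, and such a check confirms only that nodes exist, not that the volume group metadata is healthy.

```python
import os


def vg_has_device_nodes(vg_name, dev_root="/dev"):
    """Return True if dev_root/vg_name exists and holds at least one entry.

    No LVM command is run, so no lock under /run/lock/lvm is taken.
    """
    vg_dir = os.path.join(dev_root, vg_name)
    try:
        with os.scandir(vg_dir) as entries:
            return any(True for _ in entries)
    except (FileNotFoundError, NotADirectoryError):
        return False


if __name__ == "__main__":
    # "example_vg" is a placeholder volume group name.
    print(vg_has_device_nodes("example_vg"))
```

The trade-off is coverage: a directory listing is cheap and cannot block on flock(), but it will not notice problems that only an LVM metadata scan would reveal.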

Version-Release number of selected component (if applicable):

lvm2-libs-2.02.171-8.el7.x86_64
lvm2-2.02.171-8.el7.x86_64
pcs-0.9.158-6.el7.x86_64

How reproducible:
The Pacemaker clusters we are testing on each consist of three nodes, all ProLiant DL380 G9 with 16 CPUs and 256 GB of memory.  The entire cluster runs 19 Oracle database services; the node we are testing on is running just 5 of those services.

Process accounting is enabled to track vgs and vgscan process time when monitoring is initiated by the cluster.  

Strace is used to track vgscan processes (strace -T) as called by Pacemaker.

Volume group monitoring intervals are set to 60 seconds, timeouts are set to 90 seconds.  All other resource monitoring is set to defaults.

Steps to Reproduce:
1. Create several database services utilizing SAN storage (IP resource, volume group resource, 4 lvm resources in the volume group, and the LSB script)
2. Start services and let Pacemaker initiate its own built-in LVM monitoring.
3. Observe process times via accounting on time taken to complete vgs and vgscan processes.
4. Timeouts start being registered in pacemaker.log and are reflected in pcs status output for the given service.

Actual results:
Eventually, timeouts against volume group monitoring show up under Failed Actions in pcs status output, and the logs (pacemaker.log) show 90000 ms timeout failures.  After only a few of these, the monitored service is restarted in place because the cluster has determined there is an issue.

Expected results:
There should be no timeouts registered since there are no SAN or other storage issues being observed in the cluster.


Additional info:
The following comes from an "strace -T" of vgscan called by a Pacemaker LVM monitor. Elapsed time in seconds is within <>; this shows extended-duration flock() calls. Preceding open() calls are included to show the filenames involved.

12561 open("/run/lock/lvm/P_global:aux", O_RDWR|O_CREAT|O_APPEND, 0777) = 3 <0.000012>
12561 flock(3, LOCK_EX)                 = 0 <21.466415>


12561 open("/run/lock/lvm/P_global:aux", O_RDWR|O_CREAT|O_APPEND, 0777) = 3 <0.000019>
12561 flock(3, LOCK_EX)                 = 0 <17.003831>


12561 open("/run/lock/lvm/P_global:aux", O_RDWR|O_CREAT|O_APPEND, 0777) = 3 <0.000013>
12561 flock(3, LOCK_EX)                 = 0 <10.702100>


12561 open("/run/lock/lvm/P_global:aux", O_RDWR|O_CREAT|O_APPEND, 0777) = 3 <0.000019>
12561 flock(3, LOCK_EX)                 = 0 <7.790907>


12561 open("/run/lock/lvm/P_global:aux", O_RDWR|O_CREAT|O_APPEND, 0777) = 3 <0.000013>
12561 flock(3, LOCK_EX)                 = 0 <10.532846>


12561 open("/run/lock/lvm/P_global", O_RDWR|O_CREAT|O_APPEND, 0777) = 4 <0.000010>
12561 flock(4, LOCK_EX)                 = 0 <2.100148>
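To put the excerpt in perspective, the flock() durations can be totalled with a short script. This is a minimal sketch that assumes "strace -T" output in the format shown above; the function name and the regular expression are illustrative.

```python
import re

# Matches lines such as: 12561 flock(3, LOCK_EX)   = 0 <21.466415>
FLOCK_LINE = re.compile(r"^\d+\s+flock\([^)]*\)\s+=\s+-?\d+.*<(\d+\.\d+)>$")

# One open() line and the six flock() lines from the excerpt above.
SAMPLE = """\
12561 open("/run/lock/lvm/P_global:aux", O_RDWR|O_CREAT|O_APPEND, 0777) = 3 <0.000012>
12561 flock(3, LOCK_EX)                 = 0 <21.466415>
12561 flock(3, LOCK_EX)                 = 0 <17.003831>
12561 flock(3, LOCK_EX)                 = 0 <10.702100>
12561 flock(3, LOCK_EX)                 = 0 <7.790907>
12561 flock(3, LOCK_EX)                 = 0 <10.532846>
12561 flock(4, LOCK_EX)                 = 0 <2.100148>
"""


def flock_wait_seconds(strace_text):
    """Return the elapsed time of every flock() call found in strace -T output."""
    waits = []
    for line in strace_text.splitlines():
        match = FLOCK_LINE.match(line.strip())
        if match:
            waits.append(float(match.group(1)))
    return waits


if __name__ == "__main__":
    waits = flock_wait_seconds(SAMPLE)
    print(f"{len(waits)} flock() calls, {sum(waits):.6f} seconds in total")
```

The six flock() calls excerpted above add up to about 69.6 seconds within a single vgscan process, which is most of the 90-second monitor timeout.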

Comment 2 Ken Gaillot 2017-10-27 14:58:26 UTC

Good work tracking down the issue, and thanks for the extensive context. Re-assigning to the resource-agents component for further investigation.

Comment 3 John Ruemker 2017-10-27 17:14:26 UTC

These observations are in line with what we've seen in several other environments recently where LVM monitor ops were timing out, most frequently when there are several LVM resources managed on that node and firing monitors around the same time.  We've never had an opportunity to get great tracing data though, so definitely this new insight is very useful.

I do wonder what the expectation would be from the lvm team around response times of concurrent vgs operations, as far as whether this should be seen as a typical result or if there is some improvement we can pursue in lvm2 to address this.

Comment 4 Bradley Frank 2017-11-10 13:50:05 UTC

Customer uploaded more data surrounding this issue in case 01957915.

Comment 5 Bradley Frank 2017-11-17 20:13:27 UTC

Is there any update for this issue?
Customer in case 01957915 mentioned the following:

"Is there a way to speed up the response time? Will a business justification help? UPS are really digging into the issue and I would not think they are willing to wait for 15 days before a response may or may not be provided."

Comment 7 Curtis Taylor 2017-12-14 00:08:07 UTC

The vgscan calls were added to the heartbeat LVM agent as a result of:
https://bugzilla.redhat.com/show_bug.cgi?id=1454699 and 
https://github.com/ClusterLabs/resource-agents/pull/981 , correct?

That git change ended up revoked from the current upstream git due to "_all_ lvm commands were at one point in time removed from the status action exactly because timeouts were reported." ref: https://github.com/ClusterLabs/resource-agents/pull/981#issuecomment-336949397, correct?

If the upstream code has not accepted adding the vgscans due to timeouts during status, why hasn't a new solution to BZ#1454699 been developed, or why haven't clear and precise limits on the number of devices and on the latency requirements needed to avoid such timeouts been documented?

Comment 18 RHEL Program Management 2021-08-19 07:31:11 UTC

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.