Bug 1507013
Summary: | Pacemaker LVM monitor causing service restarts due to flock() delays in vgscan/vgs commands | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Greg Charles <gcharles> |
Component: | resource-agents | Assignee: | Oyvind Albrigtsen <oalbrigt> |
Status: | CLOSED WONTFIX | QA Contact: | cluster-qe <cluster-qe> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 7.4 | CC: | agk, bfrank, cfeist, c.handel, cluster-maint, fdinitto, hlinder, kcleveng, mlisik, oalbrigt, pzimek, sbradley |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | resource-agents-4.1.1-1.el7 | Doc Type: | If docs needed, set a value |
Last Closed: | 2021-08-19 07:31:11 UTC | Type: | Bug |
Description
Greg Charles
2017-10-27 12:11:15 UTC
Good work tracking down the issue, and thanks for the extensive context. Re-assigning to the resource-agents component for further investigation. These observations are in line with what we've seen in several other environments recently where LVM monitor ops were timing out, most frequently when several LVM resources are managed on the same node and fire their monitors around the same time. We've never had an opportunity to get good tracing data, though, so this new insight is very useful. I do wonder what the lvm team would expect for response times of concurrent vgs operations: should this be seen as a typical result, or is there some improvement we can pursue in lvm2 to address it?

Customer uploaded more data surrounding this issue in case 01957915.

Is there any update for this issue? Customer in case 01957915 mentioned the following: "Is there a way to speed up the response time? Will a business justification help? UPS are really digging into the issue and I would not think they are willing to wait for 15 days before a response may or may not be provided."

The vgscan calls were added to the heartbeat LVM agent as a result of https://bugzilla.redhat.com/show_bug.cgi?id=1454699 and https://github.com/ClusterLabs/resource-agents/pull/981 , correct? That git change ended up removed from the current upstream git because "_all_ lvm commands were at one point in time removed from the status action exactly because timeouts were reported." (ref: https://github.com/ClusterLabs/resource-agents/pull/981#issuecomment-336949397), correct? If the upstream code has not accepted adding the vgscan calls because of timeouts during status, why hasn't a new solution to BZ#1454699 been developed, or why haven't clear and precise limits on the number of devices and the latency requirements needed to avoid such timeouts been documented?

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.
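
For readers trying to reason about the timeouts discussed in this report, the effect of several monitors firing at once and waiting on flock() can be illustrated with a minimal sketch. It assumes that every command takes one exclusive lock and holds it for a fixed time. The names `locked_command` and `run_concurrent`, the temporary lock file, and the hold time are all hypothetical, and this is not the code of the LVM resource agent or of lvm2.

```python
import fcntl
import os
import tempfile
import threading
import time


def locked_command(lock_path, hold_seconds, latencies, index):
    """Hypothetical stand-in for one monitor's vgs/vgscan call."""
    start = time.monotonic()
    # Each call opens the file itself, so each gets its own open file
    # description and the flock() calls contend with one another.
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            time.sleep(hold_seconds)  # simulated work done under the lock
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    latencies[index] = time.monotonic() - start


def run_concurrent(n_commands, hold_seconds):
    """Start n_commands at once and return each one's wall-clock latency."""
    fd, lock_path = tempfile.mkstemp()
    os.close(fd)
    latencies = [0.0] * n_commands
    threads = [
        threading.Thread(target=locked_command,
                         args=(lock_path, hold_seconds, latencies, i))
        for i in range(n_commands)
    ]
    try:
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    finally:
        os.unlink(lock_path)
    return latencies


if __name__ == "__main__":
    for n in (1, 4, 8):
        worst = max(run_concurrent(n, hold_seconds=0.1))
        print(f"{n} concurrent commands: worst latency {worst:.2f}s")
```

Under these assumptions the slowest command waits for all the others, so its latency grows roughly linearly with the number of commands started together. If real monitors behave similarly, a per-operation timeout that is adequate for one LVM resource may be too short once several resources on the same node are monitored at the same moment.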