Bug 2109662

Summary: [RHCS][GSS][OCS/ODF] Cephfs file lock is not being released after pod restart
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Anton Mark <amark>
Component: csi-driver
Assignee: Nobody <nobody>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: krishnaram Karthick <kramdoss>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.10
CC: ceph-eng-bugs, cephqe-warriors, gfarnum, gjose, hnallurv, muagarwa, ndevos, ocs-bugs, odf-bz-bot, pdonnell, vshankar, xiubli
Target Milestone: ---
Flags: mrajanna: needinfo? (amark)
Target Release: ---
Hardware: All
OS: All
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-06-06 13:23:46 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Anton Mark 2022-07-21 18:26:58 UTC
Description of problem:
Possible stale file lock issue with CephFS PV used by an IBM MQ Queue Manager running in OpenShift Container Platform.

Version-Release number of selected component (if applicable):
OCS/ODF 4.8, 4.9 and 4.10
RHCS 5.0 (16.2.0-152.el8cp)
RHCS 4.2 (14.2.11-208.el8cp)

How reproducible:
Can be reproduced by killing the mds, mon and osd pods. It also occurs during an OCS/ODF operator upgrade, but not predictably.

Steps to Reproduce:
- Provision/Install OCP 4.8, 4.9 or 4.10 cluster.
- Install corresponding version of OCS/ODF (details of how we configured OCS included below)
- Install ibm-operator-catalog and the IBM MQ operator: https://www.ibm.com/docs/en/ibm-mq/9.3?topic=iumorho-installing-mq-operator-using-red-hat-openshift-web-console
- Install at least 2 multi-instance (MI) queue managers (QMs). One should suffice; more only increase the chance of reproducing the issue.
- Upgrade OCS/ODF operator.



Actual results:
When the error occurs, the active container has restarted but the active lock has not been released. Neither running container holds the active lock, yet neither can acquire it. The standby remains the standby because it sees that the active lock is taken. The container that was previously active comes up, fails to get any of the 3 locks, restarts, and tries again, forever.
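
To help confirm whether a lock is stale from outside the queue manager containers, a small probe along the following lines may be useful. This is only a sketch: it assumes the locks in question are POSIX advisory (fcntl) locks on files in the shared CephFS volume, which this report does not state explicitly, and the function name lock_is_free is made up for illustration. Any path passed to it has to be filled in for the actual environment.

```python
import errno
import fcntl
import os


def lock_is_free(path: str) -> bool:
    """Return True if an exclusive fcntl lock on `path` can be taken right now.

    Assumption: the application under test uses POSIX advisory (fcntl) locks.
    The probe briefly takes the lock itself when it is free, so run it only
    from a separate diagnostic process, never inside the process that is
    supposed to own the lock (closing a descriptor drops that process's
    fcntl locks on the file).
    """
    fd = os.open(path, os.O_RDWR)
    try:
        try:
            # Non-blocking attempt: fails immediately if another owner holds it.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            if exc.errno in (errno.EACCES, errno.EAGAIN):
                return False
            raise
        # Lock was free; release it again straight away.
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    finally:
        os.close(fd)
```

If lock_is_free returns False for a lock file while no running container claims to hold that lock, that would be consistent with the stale lock state described above.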


Expected results:
The active lock should be released so that the standby container can assume the master role when needed.


Additional info:
Reproducer can be provided.

Comment 19 Madhu Rajanna 2022-08-23 07:44:16 UTC
> From the supportshell, I cannot see the PV mentioned above. Can you point me in the right direction or upload the latest ODF must-gather?

local-pv-5ac92a54.yaml                         pvc-42b13dbb-eaa4-437d-8090-b810f3ec61fd.yaml  pvc-ab97441e-51b7-4027-b6a7-bfa427fdc67b.yaml
local-pv-b1da38b.yaml                          pvc-591c34f6-ba4d-4d1b-bc24-e83a6f0262e9.yaml  pvc-b10cb295-eb30-4b58-81b1-6811937bf313.yaml
local-pv-cf57db52.yaml                         pvc-611d68c6-d612-4c0a-a045-80e8d4200267.yaml  pvc-baa3af25-226f-45ec-b01a-e6b05acf08d9.yaml
local-pv-e88b8859.yaml                         pvc-7054dbc1-4704-40b3-8b51-d5605796ef5d.yaml  pvc-c586c046-4736-424f-9aeb-54fbc4b29fab.yaml
local-pv-f7989845.yaml                         pvc-8616bd57-3de9-4f8a-aad1-238632c5010c.yaml  pvc-e730a504-70d7-4e0b-b0fe-4059ac71ba38.yaml
pvc-1413481c-4c76-4d5f-8479-748aab8a9304.yaml  pvc-87e7d939-dfbb-4f36-bcb9-804563b0df15.yaml  pvc-eeaeb83b-a8b8-4e92-ad6d-8abda7d73119.yaml
pvc-2eecbaf0-952e-4a8c-8da7-69a40f23c746.yaml  pvc-a68972e9-5d33-497b-9a4b-885355615e58.yaml  pvc-f8cf7550-1387-483d-a316-c53e76579556.yaml
pvc-3d7b7f22-c45f-4c72-85a5-635d27062fd5.yaml  pvc-ab0bdea1-4c3d-4d07-91a2-cd3e9d3b1216.yaml  registry-storage.yaml

cat 0070-410-must-gather-2.tar.gz/must-gather/registry-redhat-io-odf4-ocs-must-gather-rhel8-sha256-65c4b98de58f052d1e2650fc356a3e828364b048289a38186564c95b1a5f7a85/cluster-scoped-resources/core/persistentvolumes/
local-pv-4ffb47ae.yaml                         pvc-572312e4-67a3-4317-8355-10cf0e4b6a87.yaml  pvc-b1650d3f-69f0-4f90-86f0-59e8839490e3.yaml
local-pv-7e58ce2d.yaml                         pvc-602cc60a-57c6-4b38-887d-67546cb8c518.yaml  pvc-ce340ea4-edb1-4067-9b56-eb28dddf2d70.yaml
local-pv-acf29df.yaml                          pvc-64fe8bf7-4ef6-4802-8f08-83d7b53554fd.yaml  pvc-da12c4bc-2653-46e4-ad77-901082034ffa.yaml
local-pv-d0fc14c8.yaml                         pvc-7d542bc5-d8e5-4a87-b5e8-a181d7b70d83.yaml  registry-storage.yaml
local-pv-decf0004.yaml                         pvc-9c801387-126e-41a9-8f87-7df921f5847e.yaml  
pvc-1922b6e2-f884-420c-a15e-0fdb44f11e8d.yaml  pvc-b0019ba7-889e-4e07-acbc-090875686531.yaml