Bug 2282533

Summary: [CephFS - Consistency Group] - quiesce may time out or crash due to an interlock with exporting and other inter-rank operations
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Leonid Usov <lusov>
Component: CephFS Assignee: Leonid Usov <lusov>
Status: CLOSED ERRATA QA Contact: sumr
Severity: medium Docs Contact:
Priority: unspecified    
Version: 7.1 CC: ceph-eng-bugs, cephqe-warriors, jcaratza, ngangadh, sumr, tserlin
Target Milestone: --- Flags: ngangadh: needinfo+
Target Release: 7.1   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ceph-18.2.1-193 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2024-06-13 14:32:56 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Leonid Usov 2024-05-22 14:04:34 UTC
Description of problem:
Quiesce may time out or crash due to an interlock with exporting and other inter-rank operations.

How reproducible:
Hard to reproduce, but can be caught with high probability given the right workload.

See the linked upstream tickets

Comment 5 sumr 2024-06-03 05:04:51 UTC
Test Plan:
1. Run functional and systemic regression tests for CG quiesce
2. Repeatedly perform the operations below:
    > Set authrules on the subvolume, pin the subvolume's test_dir to an MDS rank, and rename a directory
    > Issue parallel quiesce calls to the same set
3. Verify that the debug params for quiesce cmds have been removed
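
The "parallel quiesce calls to the same set" step could be driven by a small script. The following is only a sketch under stated assumptions: the linked test logs are not reproduced in this report, so the exact commands QA ran are unknown. The flags are the ones that appear in the command in comment 6; the helper names `quiesce_cmd` and `run_parallel` and the set id `cg_parallel_demo` are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def quiesce_cmd(fs_name, set_id, members, timeout=300, expiration=300):
    """Build the argv for one quiesce call.

    Only flags that appear elsewhere in this report are used
    (--set-id, --timeout, --expiration).
    """
    return ["ceph", "fs", "quiesce", fs_name, "--set-id", set_id, *members,
            "--timeout", str(timeout), "--expiration", str(expiration)]


def run_parallel(cmds, runner=None, max_workers=4):
    """Run all commands concurrently; return their results in input order."""
    if runner is None:
        # Default runner calls the real CLI and therefore needs a Ceph cluster.
        def runner(argv):
            return subprocess.run(argv, capture_output=True).returncode
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(runner, cmds))


# Dry run: three identical calls against the same (hypothetical) set id,
# with a stand-in runner so nothing is executed on a cluster.
cmds = [quiesce_cmd("cephfs", "cg_parallel_demo", ["sv1", "sv2", "sv3"])] * 3
print(run_parallel(cmds, runner=lambda argv: 0))
```

Passing the runner in as a parameter keeps the concurrency logic testable without a cluster; dropping the `runner=` argument makes the same code call the `ceph` binary.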

Comment 6 sumr 2024-06-05 12:40:07 UTC
Verified fix on ceph build 18.2.1-194.el9cp.

FUNCTIONAL REGRESSION TESTS
---------------------------

http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-BN7JSV/

SYSTEMIC REGRESSION TESTS
-------------------------

SCALE TEST: http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-68WUGA
STRESS TEST: http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-1QDVX1/cg_snap_system_test_0.log

PERFORM FS OPS in parallel to Quiesce
-------------------------------------

Set authrules on the subvolume, pin the subvolume's test_dir to an MDS rank, and rename a directory: http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-3XROAN

Parallel quiesce calls to same set
----------------------------------

http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-Z1HB2V/cg_snap_test_0.log

Verify that the debug param ?q=<secs> no longer takes effect
------------------------------------------------------------

[root@ceph-sumar-regression-narjcq-node8 ~]# ceph fs quiesce cephfs --set-id cg_dbg_params1 sv1?q=5 sv2?q=5 sv3?q=5 --timeout 300 --expiration 300 
{
    "epoch": 290,
    "leader": 44133,
    "set_version": 2234,
    "sets": {
        "cg_dbg_params1": {
            "version": 2234,
            "age_ref": 0.0,
            "state": {
                "name": "QUIESCING",
                "age": 0.0
            },
            "timeout": 300.0,
            "expiration": 300.0,
            "members": {
                "file:/volumes/_nogroup/sv3/02b68c49-4327-4309-9417-cd85f629f8a5?q=5": {
                    "excluded": false,
                    "state": {
                        "name": "QUIESCING",
                        "age": 0.0
                    }
                },
                "file:/volumes/_nogroup/sv2/a7cc3735-6a4d-4ebd-9e67-b60bf9b80e10?q=5": {
                    "excluded": false,
                    "state": {
                        "name": "QUIESCING",
                        "age": 0.0
                    }
                },
                "file:/volumes/_nogroup/sv1/1936ca82-f30e-4e88-94f6-fed218be72d2?q=5": {
                    "excluded": false,
                    "state": {
                        "name": "QUIESCING",
                        "age": 0.0
                    }
                }
            }
        }
    }
}
[root@ceph-sumar-regression-narjcq-node8 ~]# ceph fs quiesce cephfs --query --set-id cg_dbg_params1
{
    "epoch": 290,
    "leader": 44133,
    "set_version": 2237,
    "sets": {
        "cg_dbg_params1": {
            "version": 2237,
            "age_ref": 0.0,
            "state": {
                "name": "QUIESCED",
                "age": 2.5
            },
            "timeout": 300.0,
            "expiration": 300.0,
            "members": {
                "file:/volumes/_nogroup/sv3/02b68c49-4327-4309-9417-cd85f629f8a5?q=5": {
                    "excluded": false,
                    "state": {
                        "name": "QUIESCED",
                        "age": 2.5
                    }
                },
                "file:/volumes/_nogroup/sv2/a7cc3735-6a4d-4ebd-9e67-b60bf9b80e10?q=5": {
                    "excluded": false,
                    "state": {
                        "name": "QUIESCED",
                        "age": 2.5
                    }
                },
                "file:/volumes/_nogroup/sv1/1936ca82-f30e-4e88-94f6-fed218be72d2?q=5": {
                    "excluded": false,
                    "state": {
                        "name": "QUIESCED",
                        "age": 2.5
                    }
                }
            }
        }
    }
}
[root@ceph-sumar-regression-narjcq-node8 ~]# 
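
The manual comparison of `state` values in the two outputs above can also be scripted. A minimal sketch, assuming the JSON layout printed by the `--query` call in this comment; the function name `set_fully_quiesced` and the trimmed sample document are illustrative and not part of Ceph.

```python
import json


def set_fully_quiesced(status_json: str, set_id: str) -> bool:
    """Return True if the set and every non-excluded member report QUIESCED.

    Assumes the layout shown in this comment:
    sets -> <set-id> -> state.name, and members -> <path> -> state.name.
    """
    status = json.loads(status_json)
    qset = status["sets"][set_id]
    if qset["state"]["name"] != "QUIESCED":
        return False
    return all(
        member["state"]["name"] == "QUIESCED"
        for member in qset["members"].values()
        if not member["excluded"]
    )


# Trimmed-down sample in the same shape as the query output above.
sample = json.dumps({
    "sets": {
        "cg_dbg_params1": {
            "state": {"name": "QUIESCED", "age": 2.5},
            "members": {
                "file:/volumes/_nogroup/sv1/1936ca82-f30e-4e88-94f6-fed218be72d2?q=5": {
                    "excluded": False,
                    "state": {"name": "QUIESCED", "age": 2.5},
                },
            },
        }
    }
})

print(set_fully_quiesced(sample, "cg_dbg_params1"))
```

Against a live cluster, the output of `ceph fs quiesce cephfs --query --set-id cg_dbg_params1` would be passed in place of `sample`.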

Marking the BZ as Verified.

Comment 7 errata-xmlrpc 2024-06-13 14:32:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Critical: Red Hat Ceph Storage 7.1 security, enhancements, and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:3925