Bug 2282533 - [CephFS - Consistency Group] - quiesce may time out or crash due to an interlock with exporting and other inter-rank operations
Summary: [CephFS - Consistency Group] - quiesce may time out or crash due to an interl...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 7.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 7.1
Assignee: Leonid Usov
QA Contact: sumr
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2024-05-22 14:04 UTC by Leonid Usov
Modified: 2024-06-13 14:32 UTC (History)

Fixed In Version: ceph-18.2.1-193
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-06-13 14:32:56 UTC
Embargoed:
ngangadh: needinfo+


Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 66123 0 None None None 2024-05-22 14:04:34 UTC
Ceph Project Bug Tracker 66152 0 None None None 2024-05-22 14:04:34 UTC
Ceph Project Bug Tracker 66208 0 None None None 2024-05-23 15:13:25 UTC
Ceph Project Bug Tracker 66219 0 None None None 2024-05-26 09:30:22 UTC
Ceph Project Bug Tracker 66225 0 None None None 2024-05-27 12:33:07 UTC
Red Hat Issue Tracker RHCEPH-9073 0 None None None 2024-05-22 14:28:18 UTC
Red Hat Product Errata RHSA-2024:3925 0 None None None 2024-06-13 14:32:59 UTC

Description Leonid Usov 2024-05-22 14:04:34 UTC
Description of problem:
Quiesce may time out or crash due to an interlock with exporting and other inter-rank operations.

How reproducible:
Hard to reproduce, but can be caught with high probability given the right workload.

See the linked upstream tickets for details.

Comment 5 sumr 2024-06-03 05:04:51 UTC
Test Plan:
1. Run functional and systemic regression tests for CG quiesce.
2. Repeatedly perform the operations below:
    > Set auth rules on the subvolume, pin subvolume test_dir to an MDS rank, perform a dir rename
    > Issue parallel quiesce calls to the same set
3. Verify that the debug params to the quiesce commands have been removed.
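
The "parallel quiesce calls to the same set" step could be driven by something like the sketch below. This is only an illustration, not the script used in the test runs: the helper names `quiesce_cmd` and `run_parallel` are hypothetical, and the flags are limited to the ones that appear in the command line quoted in this bug (`--set-id`, `--timeout`, `--expiration`). It assumes the `ceph` CLI is on the PATH when the default runner is used.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def quiesce_cmd(fs_name, set_id, members, timeout=300, expiration=300):
    """Build the argv for one quiesce call (hypothetical helper).

    Mirrors the shape of the command quoted in this bug:
    ceph fs quiesce <fs> --set-id <id> <members...> --timeout N --expiration N
    """
    return [
        "ceph", "fs", "quiesce", fs_name,
        "--set-id", set_id,
        *members,
        "--timeout", str(timeout),
        "--expiration", str(expiration),
    ]


def run_parallel(cmd, n, runner=None):
    """Issue the same command n times concurrently and collect return codes.

    `runner` is injectable so the logic can be exercised without a cluster;
    by default it shells out via subprocess (assumes `ceph` is installed).
    """
    if runner is None:
        def runner(argv):
            return subprocess.run(argv, capture_output=True, text=True).returncode
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(runner, [cmd] * n))
```

On a test cluster this might be called as `run_parallel(quiesce_cmd("cephfs", "cg1", ["sv1", "sv2", "sv3"]), 4)`; whether the actual test issued identical commands in parallel is an assumption here.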

Comment 6 sumr 2024-06-05 12:40:07 UTC
Verified fix on ceph build 18.2.1-194.el9cp.

FUNCTIONAL REGRESSION TESTS
---------------------------

http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-BN7JSV/

SYSTEMIC REGRESSION TESTS
-------------------------

SCALE TEST: http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-68WUGA
STRESS TEST: http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-1QDVX1/cg_snap_system_test_0.log

PERFORM FS OPS IN PARALLEL WITH QUIESCE
---------------------------------------

Set auth rules on the subvolume, pin subvolume test_dir to an MDS rank, perform a dir rename: http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-3XROAN

PARALLEL QUIESCE CALLS TO THE SAME SET
--------------------------------------

http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-Z1HB2V/cg_snap_test_0.log

Verify that the debug param ?q=<secs> no longer works:
[root@ceph-sumar-regression-narjcq-node8 ~]# ceph fs quiesce cephfs --set-id cg_dbg_params1 sv1?q=5 sv2?q=5 sv3?q=5 --timeout 300 --expiration 300 
{
    "epoch": 290,
    "leader": 44133,
    "set_version": 2234,
    "sets": {
        "cg_dbg_params1": {
            "version": 2234,
            "age_ref": 0.0,
            "state": {
                "name": "QUIESCING",
                "age": 0.0
            },
            "timeout": 300.0,
            "expiration": 300.0,
            "members": {
                "file:/volumes/_nogroup/sv3/02b68c49-4327-4309-9417-cd85f629f8a5?q=5": {
                    "excluded": false,
                    "state": {
                        "name": "QUIESCING",
                        "age": 0.0
                    }
                },
                "file:/volumes/_nogroup/sv2/a7cc3735-6a4d-4ebd-9e67-b60bf9b80e10?q=5": {
                    "excluded": false,
                    "state": {
                        "name": "QUIESCING",
                        "age": 0.0
                    }
                },
                "file:/volumes/_nogroup/sv1/1936ca82-f30e-4e88-94f6-fed218be72d2?q=5": {
                    "excluded": false,
                    "state": {
                        "name": "QUIESCING",
                        "age": 0.0
                    }
                }
            }
        }
    }
}
[root@ceph-sumar-regression-narjcq-node8 ~]# ceph fs quiesce cephfs --query --set-id cg_dbg_params1
{
    "epoch": 290,
    "leader": 44133,
    "set_version": 2237,
    "sets": {
        "cg_dbg_params1": {
            "version": 2237,
            "age_ref": 0.0,
            "state": {
                "name": "QUIESCED",
                "age": 2.5
            },
            "timeout": 300.0,
            "expiration": 300.0,
            "members": {
                "file:/volumes/_nogroup/sv3/02b68c49-4327-4309-9417-cd85f629f8a5?q=5": {
                    "excluded": false,
                    "state": {
                        "name": "QUIESCED",
                        "age": 2.5
                    }
                },
                "file:/volumes/_nogroup/sv2/a7cc3735-6a4d-4ebd-9e67-b60bf9b80e10?q=5": {
                    "excluded": false,
                    "state": {
                        "name": "QUIESCED",
                        "age": 2.5
                    }
                },
                "file:/volumes/_nogroup/sv1/1936ca82-f30e-4e88-94f6-fed218be72d2?q=5": {
                    "excluded": false,
                    "state": {
                        "name": "QUIESCED",
                        "age": 2.5
                    }
                }
            }
        }
    }
}
[root@ceph-sumar-regression-narjcq-node8 ~]# 
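
A check like the one above, confirming that the set and every member reached QUIESCED, could be automated by parsing the JSON that `ceph fs quiesce ... --query` prints. The following is a minimal sketch, not part of the test suite: the function name `all_members_in_state` is hypothetical, and the sample document is trimmed from the query output shown in this comment, keeping only the fields the check reads.

```python
import json


def all_members_in_state(status_json, set_id, expected="QUIESCED"):
    """Return True if the set and all of its members report `expected`.

    Relies only on the fields visible in the output quoted in this bug:
    sets.<set_id>.state.name and sets.<set_id>.members.<path>.state.name.
    """
    status = json.loads(status_json)
    qset = status["sets"][set_id]
    if qset["state"]["name"] != expected:
        return False
    return all(
        member["state"]["name"] == expected
        for member in qset["members"].values()
    )


# Trimmed from the query output in this comment (one member kept for brevity).
SAMPLE = """
{
  "sets": {
    "cg_dbg_params1": {
      "state": {"name": "QUIESCED", "age": 2.5},
      "members": {
        "file:/volumes/_nogroup/sv1/1936ca82-f30e-4e88-94f6-fed218be72d2?q=5": {
          "excluded": false,
          "state": {"name": "QUIESCED", "age": 2.5}
        }
      }
    }
  }
}
"""

if __name__ == "__main__":
    print(all_members_in_state(SAMPLE, "cg_dbg_params1"))
```

The same function can be pointed at the first output in this comment with `expected="QUIESCING"` to assert the intermediate state.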

Marking the BZ as Verified.

Comment 7 errata-xmlrpc 2024-06-13 14:32:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Critical: Red Hat Ceph Storage 7.1 security, enhancements, and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:3925

