Bug 2094822

Summary: [CephFS] Clone operations are failing with Assertion Error
Product: [Red Hat Storage] Red Hat Ceph Storage
Component: CephFS
Version: 5.2
Status: CLOSED ERRATA
Severity: high
Priority: medium
Reporter: Amarnath <amk>
Assignee: Kotresh HR <khiremat>
QA Contact: Amarnath <amk>
CC: ceph-eng-bugs, hyelloji, khiremat, tserlin, vshankar
Target Release: 5.3z1
Hardware: Unspecified
OS: Unspecified
Fixed In Version: ceph-16.2.10-100.el8cp
Type: Bug
Last Closed: 2023-02-28 10:05:13 UTC
Bug Blocks: 2088698, 2182962

Description Amarnath 2022-06-08 11:12:29 UTC
Description of problem:
Clone operations are failing with an Assertion Error.
This happens when a large number of clones are created; in this case, 130 clones were created, after which querying the status of a clone fails:

[root@ceph-amk-bz-2-qa3ps0-node7 _nogroup]# ceph fs clone status cephfs clone_status_142
Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/volumes/fs/operations/versions/__init__.py", line 96, in get_subvolume_object
    self.upgrade_to_v2_subvolume(subvolume)
  File "/usr/share/ceph/mgr/volumes/fs/operations/versions/__init__.py", line 57, in upgrade_to_v2_subvolume
    version = int(subvolume.metadata_mgr.get_global_option('version'))
  File "/usr/share/ceph/mgr/volumes/fs/operations/versions/metadata_manager.py", line 144, in get_global_option
    return self.get_option(MetadataManager.GLOBAL_SECTION, key)
  File "/usr/share/ceph/mgr/volumes/fs/operations/versions/metadata_manager.py", line 138, in get_option
    raise MetadataMgrException(-errno.ENOENT, "section '{0}' does not exist".format(section))
volumes.fs.exception.MetadataMgrException: -2 (section 'GLOBAL' does not exist)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1446, in _handle_command
    return self.handle_command(inbuf, cmd)
  File "/usr/share/ceph/mgr/volumes/module.py", line 437, in handle_command
    return handler(inbuf, cmd)
  File "/usr/share/ceph/mgr/volumes/module.py", line 34, in wrap
    return f(self, inbuf, cmd)
  File "/usr/share/ceph/mgr/volumes/module.py", line 682, in _cmd_fs_clone_status
    vol_name=cmd['vol_name'], clone_name=cmd['clone_name'],  group_name=cmd.get('group_name', None))
  File "/usr/share/ceph/mgr/volumes/fs/volume.py", line 622, in clone_status
    with open_subvol(self.mgr, fs_handle, self.volspec, group, clonename, SubvolumeOpType.CLONE_STATUS) as subvolume:
  File "/lib64/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/share/ceph/mgr/volumes/fs/operations/subvolume.py", line 72, in open_subvol
    subvolume = loaded_subvolumes.get_subvolume_object(mgr, fs, vol_spec, group, subvolname)
  File "/usr/share/ceph/mgr/volumes/fs/operations/versions/__init__.py", line 101, in get_subvolume_object
    self.upgrade_legacy_subvolume(fs, subvolume)
  File "/usr/share/ceph/mgr/volumes/fs/operations/versions/__init__.py", line 78, in upgrade_legacy_subvolume
    assert subvolume.legacy_mode
AssertionError
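
From the traceback, the failure path looks like this: get_subvolume_object() first calls upgrade_to_v2_subvolume(), which raises MetadataMgrException(-ENOENT) because the clone's .meta file has no GLOBAL section (plausibly because the metadata file is still empty or only partially written while this many clones are in flight); the code then falls back to upgrade_legacy_subvolume(), which asserts subvolume.legacy_mode, and since a v2-style clone is not in legacy mode the assertion fails and surfaces to the user. A minimal, self-contained sketch of that control flow (class and stub bodies are approximations for illustration, not the actual mgr/volumes code):

# Simplified reconstruction of the failing code path seen in the traceback.
# Function names mirror the traceback; the bodies are illustrative stubs.
import errno


class MetadataMgrException(Exception):
    def __init__(self, error, msg):
        super().__init__(msg)
        self.error = error


class Subvolume(object):
    def __init__(self, has_global_section, legacy_mode):
        self.has_global_section = has_global_section
        self.legacy_mode = legacy_mode


def upgrade_to_v2_subvolume(subvolume):
    # The real code reads the 'version' key from the GLOBAL section of the
    # clone's .meta file and raises when that section is missing (for example
    # when the metadata file is still empty or only partially written).
    if not subvolume.has_global_section:
        raise MetadataMgrException(-errno.ENOENT, "section 'GLOBAL' does not exist")


def upgrade_legacy_subvolume(subvolume):
    # Fallback for pre-v1 "legacy" subvolumes. A half-initialized v2 clone is
    # not in legacy mode, so this assertion fires -- the AssertionError above.
    assert subvolume.legacy_mode


def get_subvolume_object(subvolume):
    try:
        upgrade_to_v2_subvolume(subvolume)
    except MetadataMgrException:
        upgrade_legacy_subvolume(subvolume)


# A clone whose .meta has no GLOBAL section and which is not a legacy
# subvolume reproduces the AssertionError.
get_subvolume_object(Subvolume(has_global_section=False, legacy_mode=False))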




Version-Release number of selected component (if applicable):
[root@ceph-amk-bz-1-wu0ar7-node7 ~]# ceph versions
{
    "mon": {
        "ceph version 16.2.8-27.el8cp (b0bd3a6c6f24d3ac855dde96982871257bef866f) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.8-27.el8cp (b0bd3a6c6f24d3ac855dde96982871257bef866f) pacific (stable)": 2
    },
    "osd": {
        "ceph version 16.2.8-27.el8cp (b0bd3a6c6f24d3ac855dde96982871257bef866f) pacific (stable)": 12
    },
    "mds": {
        "ceph version 16.2.8-27.el8cp (b0bd3a6c6f24d3ac855dde96982871257bef866f) pacific (stable)": 3
    },
    "overall": {
        "ceph version 16.2.8-27.el8cp (b0bd3a6c6f24d3ac855dde96982871257bef866f) pacific (stable)": 20
    }
}
[root@ceph-amk-bz-1-wu0ar7-node7 ~]# 


How reproducible:
1/1


Steps to Reproduce:

1. Create a subvolume group:
    ceph fs subvolumegroup create cephfs subvolgroup_clone_status_1
2. Create a subvolume:
    ceph fs subvolume create cephfs subvol_clone_status --size 5368706371 --group_name subvolgroup_clone_status_1
3. Kernel-mount the volume and fill it with data.
4. Create a snapshot:
    ceph fs subvolume snapshot create cephfs subvol_clone_status snap_1 --group_name subvolgroup_clone_status_1
5. Create 200 clones from the above snapshot (a scripted loop is sketched after these steps):
    ceph fs subvolume snapshot clone cephfs subvol_clone_status snap_1 clone_status_1 --group_name subvolgroup_clone_status_1
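
For reference, step 5 can be driven with a small script along these lines; the use of Python/subprocess and the clone_status_1..clone_status_200 naming are illustrative assumptions, not part of the original reproducer:

# Issue the clone command from step 5 for 200 clones of the same snapshot.
import subprocess

for i in range(1, 201):
    subprocess.check_call([
        "ceph", "fs", "subvolume", "snapshot", "clone",
        "cephfs", "subvol_clone_status", "snap_1", "clone_status_{}".format(i),
        "--group_name", "subvolgroup_clone_status_1",
    ])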

Actual results:
The "ceph fs clone status" command fails with an unhandled AssertionError (see the traceback above).

Expected results:
The command should fail gracefully with a clear error message instead of an unhandled AssertionError.
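
One possible shape of that graceful behaviour, sketched under the assumption that the metadata error is caught and mapped to a clean, retryable error code (this is an illustration of the expected behaviour only, not the fix that actually shipped in ceph-16.2.10-100.el8cp):

# Sketch of graceful handling: if the clone's metadata cannot be read yet,
# return a clean error tuple to the caller instead of hitting an assertion.
import errno


class MetadataMgrException(Exception):
    def __init__(self, error, msg):
        super().__init__(msg)
        self.error = error


def clone_status(read_clone_metadata, clone_name):
    # read_clone_metadata is a hypothetical callable standing in for the
    # mgr/volumes metadata lookup; it raises MetadataMgrException when the
    # clone's .meta file is not readable yet.
    try:
        return 0, read_clone_metadata(clone_name), ""
    except MetadataMgrException as e:
        if e.error == -errno.ENOENT:
            # Metadata not readable yet (clone still being set up): return a
            # clean, retryable error instead of an unhandled AssertionError.
            return -errno.EAGAIN, "", "clone '{}' is not ready yet, try again later".format(clone_name)
        raise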

Additional info:

Comment 9 Amarnath 2023-01-24 18:08:18 UTC
Tested with up to 122 clones and the issue is not seen.

[root@ceph-amk-bootstrap-clx3kj-node7 ~]# ceph fs clone status cephfs clone_status_122
{
  "status": {
    "state": "complete"
  }
}
[root@ceph-amk-bootstrap-clx3kj-node7 ~]# ceph fs subvolume snapshot info cephfs subvol_clone_status snap_1 --group_name subvolgroup_clone_status_1
{
    "created_at": "2023-01-23 08:28:43.445930",
    "data_pool": "cephfs.cephfs.data",
    "has_pending_clones": "no"
}
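
The per-clone checks above can be scripted; an illustrative polling loop (the subprocess/json usage here is an assumption, not the original test code) that walks clone_status_1 through clone_status_122:

# Query "ceph fs clone status" for each clone and print its reported state,
# confirming the command no longer trips the AssertionError.
import json
import subprocess

for i in range(1, 123):
    out = subprocess.check_output(
        ["ceph", "fs", "clone", "status", "cephfs", "clone_status_{}".format(i)]
    )
    state = json.loads(out)["status"]["state"]
    print("clone_status_{}: {}".format(i, state))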

The cluster reached full capacity with all the clones created:
[root@ceph-amk-bootstrap-clx3kj-node7 ~]# ceph -s
  cluster:
    id:     fa611392-9af1-11ed-be7c-fa163e696b4f
    health: HEALTH_ERR
            2 backfillfull osd(s)
            1 full osd(s)
            3 nearfull osd(s)
            Degraded data redundancy: 632/595707 objects degraded (0.106%), 81 pgs degraded
            Full OSDs blocking recovery: 81 pgs recovery_toofull
            4 pool(s) full
 
  services:
    mon: 3 daemons, quorum ceph-amk-bootstrap-clx3kj-node1-installer,ceph-amk-bootstrap-clx3kj-node2,ceph-amk-bootstrap-clx3kj-node3 (age 42m)
    mgr: ceph-amk-bootstrap-clx3kj-node1-installer.aqnxiy(active, since 34h), standbys: ceph-amk-bootstrap-clx3kj-node2.glqgyi
    mds: 2/2 daemons up, 1 standby
    osd: 12 osds: 12 up (since 42m), 12 in (since 42m)
 
  data:
    volumes: 1/1 healthy
    pools:   4 pools, 321 pgs
    objects: 198.57k objects, 49 GiB
    usage:   150 GiB used, 30 GiB / 180 GiB avail
    pgs:     632/595707 objects degraded (0.106%)
             240 active+clean
             81  active+recovery_toofull+degraded
 
  io:
    client:   85 B/s rd, 341 B/s wr, 0 op/s rd, 0 op/s wr
 
  progress:
    Global Recovery Event (23h)
      [====================........] (remaining: 7h)

Versions 
[root@ceph-amk-bootstrap-clx3kj-node7 ~]# ceph versions
{
    "mon": {
        "ceph version 16.2.10-103.el8cp (4a5dd59c2e6616f05cc94e6aab2bddf1339ca4f4) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.10-103.el8cp (4a5dd59c2e6616f05cc94e6aab2bddf1339ca4f4) pacific (stable)": 2
    },
    "osd": {
        "ceph version 16.2.10-103.el8cp (4a5dd59c2e6616f05cc94e6aab2bddf1339ca4f4) pacific (stable)": 12
    },
    "mds": {
        "ceph version 16.2.10-103.el8cp (4a5dd59c2e6616f05cc94e6aab2bddf1339ca4f4) pacific (stable)": 3
    },
    "overall": {
        "ceph version 16.2.10-103.el8cp (4a5dd59c2e6616f05cc94e6aab2bddf1339ca4f4) pacific (stable)": 20
    }
}
[root@ceph-amk-bootstrap-clx3kj-node7 ~]#

Comment 11 errata-xmlrpc 2023-02-28 10:05:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 5.3 Bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:0980