Description of problem:
Scheduled snapshot rotation stops after reaching the retention limit when the mds_max_snaps_per_dir MDS config option is set to a non-default value lower than the default of 100. Creation of the next scheduled snapshot then fails with the error "Too many links [Errno 31]".

Version-Release number of selected component (if applicable):
17.2.6-167.el9cp quincy (stable)

How reproducible:
Consistent
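The failing operation is an ordinary snapshot mkdir under .snap, so the error can be reproduced directly with the cephfs Python bindings once the directory already holds mds_max_snaps_per_dir snapshots. A minimal sketch, assuming python3-cephfs, a readable /etc/ceph/ceph.conf and admin keyring, five existing snapshots as in the logs below, and an illustrative snapshot name:

import errno
import cephfs

fs = cephfs.LibCephFS(conffile='/etc/ceph/ceph.conf')
fs.mount(b'/', b'cephfs')     # mount the root of the "cephfs" filesystem
try:
    # With mds_max_snaps_per_dir=5 and five snapshots already present,
    # the MDS refuses to create a sixth snapshot directory.
    fs.mkdir(b'/.snap/manual-repro-snap', 0o755)   # hypothetical snap name
except cephfs.OSError as e:
    print(e)                          # "... Too many links [Errno 31]"
    assert e.errno == errno.EMLINK    # EMLINK == 31
finally:
    fs.shutdown()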
Venky, I tried a ceph-mgr restart when snap rotation fails; the logs are below. This does not fix the issue for an existing schedule.

Logs:

[root@ceph-sumar-bz-verify-ti5ilr-node8 .snap]# ceph config set mds mds_max_snaps_per_dir 5
[root@ceph-sumar-bz-verify-ti5ilr-node8 .snap]#
[root@ceph-sumar-bz-verify-ti5ilr-node8 .snap]# ls -l
total 3
drwxr-xr-x. 4 root root 4730193114 Dec 15 03:10 scheduled-2023-12-19-05_57_00_UTC
drwxr-xr-x. 4 root root 4730193114 Dec 15 03:10 scheduled-2023-12-19-05_58_00_UTC
drwxr-xr-x. 4 root root 4730193114 Dec 15 03:10 scheduled-2023-12-19-05_59_00_UTC
drwxr-xr-x. 4 root root 4730193114 Dec 15 03:10 scheduled-2023-12-19-06_00_00_UTC
drwxr-xr-x. 4 root root 4730193114 Dec 15 03:10 scheduled-2023-12-19-06_01_01_UTC
[root@ceph-sumar-bz-verify-ti5ilr-node8 .snap]# date
Tue Dec 19 01:08:23 EST 2023
[root@ceph-sumar-bz-verify-ti5ilr-node8 .snap]# date
Tue Dec 19 01:09:29 EST 2023
[root@ceph-sumar-bz-verify-ti5ilr-node8 .snap]# ls -l
total 3
drwxr-xr-x. 4 root root 4730193114 Dec 15 03:10 scheduled-2023-12-19-05_57_00_UTC
drwxr-xr-x. 4 root root 4730193114 Dec 15 03:10 scheduled-2023-12-19-05_58_00_UTC
drwxr-xr-x. 4 root root 4730193114 Dec 15 03:10 scheduled-2023-12-19-05_59_00_UTC
drwxr-xr-x. 4 root root 4730193114 Dec 15 03:10 scheduled-2023-12-19-06_00_00_UTC
drwxr-xr-x. 4 root root 4730193114 Dec 15 03:10 scheduled-2023-12-19-06_01_01_UTC

[root@ceph-sumar-bz-verify-ti5ilr-node1-installer ~]# systemctl list-units --type=service | grep mgr | awk '{print $1}'
ceph-7ec04396-9b1f-11ee-99b5-fa163ebdcde7.ncutlx.service
[root@ceph-sumar-bz-verify-ti5ilr-node1-installer ~]# systemctl restart ceph-7ec04396-9b1f-11ee-99b5-fa163ebdcde7.ncutlx.service
[root@ceph-sumar-bz-verify-ti5ilr-node1-installer ~]# systemctl list-units --type=service | grep mgr | awk '{print $1}'
ceph-7ec04396-9b1f-11ee-99b5-fa163ebdcde7.ncutlx.service
[root@ceph-sumar-bz-verify-ti5ilr-node1-installer ~]#

[root@ceph-sumar-bz-verify-ti5ilr-node8 .snap]# ceph status
  cluster:
    id:     7ec04396-9b1f-11ee-99b5-fa163ebdcde7
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph-sumar-bz-verify-ti5ilr-node1-installer,ceph-sumar-bz-verify-ti5ilr-node2,ceph-sumar-bz-verify-ti5ilr-node3 (age 3d)
    mgr: ceph-sumar-bz-verify-ti5ilr-node1-installer.ncutlx(active, since 16s), standbys: ceph-sumar-bz-verify-ti5ilr-node2.tiuias
    mds: 3/3 daemons up, 2 standby
    osd: 16 osds: 16 up (since 3d), 16 in (since 3d)

  data:
    volumes: 2/2 healthy
    pools:   6 pools, 177 pgs
    objects: 14.46k objects, 7.9 GiB
    usage:   33 GiB used, 207 GiB / 240 GiB avail
    pgs:     177 active+clean

  io:
    client: 85 B/s rd, 170 B/s wr, 0 op/s rd, 0 op/s wr

[root@ceph-sumar-bz-verify-ti5ilr-node8 .snap]# ceph fs snap-schedule status /
{"fs": "cephfs", "subvol": null, "path": "/", "rel_path": "/", "schedule": "1M", "retention": {}, "start": "2023-12-19T00:00:00", "created": "2023-12-19T05:56:34", "first": "2023-12-19T05:57:00", "last": "2023-12-19T06:01:01", "last_pruned": null, "created_count": 5, "pruned_count": 0, "active": false}
[root@ceph-sumar-bz-verify-ti5ilr-node8 .snap]# ceph fs snap-schedule activate /
Schedule activated for path /
[root@ceph-sumar-bz-verify-ti5ilr-node8 .snap]# ceph fs snap-schedule status /
{"fs": "cephfs", "subvol": null, "path": "/", "rel_path": "/", "schedule": "1M", "retention": {}, "start": "2023-12-19T00:00:00", "created": "2023-12-19T05:56:34", "first": "2023-12-19T05:57:00", "last": "2023-12-19T06:01:01", "last_pruned": null, "created_count": 5, "pruned_count": 0, "active": true}
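After reactivation the schedule briefly reports "active": true. A small helper to check whether it has flipped back to inactive after the next trigger time, parsing the same status JSON shown above (a sketch; assumes the ceph CLI and an admin keyring are available on the node):

import json
import subprocess

def schedule_active(path='/'):
    # `ceph fs snap-schedule status <path>` prints the JSON seen above;
    # "active" flips back to false once snapshot creation starts failing.
    out = subprocess.run(['ceph', 'fs', 'snap-schedule', 'status', path],
                         check=True, capture_output=True, text=True).stdout
    status = json.loads(out)
    # Hedged: with several schedules on one path the module may emit a JSON
    # list; only a single object appears in the logs above.
    entries = status if isinstance(status, list) else [status]
    return all(s.get('active', False) for s in entries)

print(schedule_active('/'))   # True right after activate, False after the 06:15 failure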
The mgr log then shows the next scheduled snapshot creation failing in the same way:

2023-12-19T06:15:00.776+0000 7f167709d640 0 [snap_schedule ERROR snap_schedule.fs.schedule_client] create_scheduled_snapshot raised an exception:
2023-12-19T06:15:00.777+0000 7f167709d640 0 [snap_schedule ERROR snap_schedule.fs.schedule_client] Traceback (most recent call last):
  File "/usr/share/ceph/mgr/snap_schedule/fs/schedule_client.py", line 293, in create_scheduled_snapshot
    fs_handle.mkdir(snap_name, 0o755)
  File "cephfs.pyx", line 1027, in cephfs.LibCephFS.mkdir
cephfs.OSError: error in mkdir //.snap/scheduled-2023-12-19-06_15_00_UTC: Too many links [Errno 31]
2023-12-19T06:15:00.844+0000 7f167709d640 0 [snap_schedule INFO snap_schedule.fs.schedule_client] no retention set, assuming n: 99

[root@ceph-sumar-bz-verify-ti5ilr-node8 .snap]# ceph fs snap-schedule status /
{"fs": "cephfs", "subvol": null, "path": "/", "rel_path": "/", "schedule": "1M", "retention": {}, "start": "2023-12-19T00:00:00", "created": "2023-12-19T05:56:34", "first": "2023-12-19T05:57:00", "last": "2023-12-19T06:01:01", "last_pruned": null, "created_count": 5, "pruned_count": 0, "active": false}
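The final INFO line ("no retention set, assuming n: 99") points at the likely root cause: with an empty retention policy the snap_schedule module assumes it may accumulate one less than the default mds_max_snaps_per_dir of 100 snapshots before pruning, rather than consulting the configured value (5 here). A simplified sketch of that assumption (illustrative names, not the actual schedule_client.py source):

ASSUMED_MAX_SNAPS_PER_DIR = 100   # built-in default; the live config value (5) is not consulted

def effective_retention(retention):
    # With retention == {} (see the status output above), pruning would only
    # start at 99 snapshots -- but the MDS already rejects the 6th mkdir with
    # EMLINK when mds_max_snaps_per_dir=5, so pruning never runs and the
    # schedule ends up deactivated after the failure.
    if not retention:
        return {'n': ASSUMED_MAX_SNAPS_PER_DIR - 1}   # logged as "assuming n: 99"
    return retention

If that reading is right, setting an explicit retention count below the MDS limit (for example, something like `ceph fs snap-schedule retention add / n 4`) should let pruning run before the limit is hit, independent of the actual fix.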
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 7.1 security, bug fix, enhancement, and known issue updates), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2025:1770