Description of problem:
Scheduled snapshot rotation stops after reaching the retention limit when the mds_max_snaps_per_dir MDS config option is set to a non-default value lower than the default of 100. Creation of the next scheduled snapshot then fails with the error "Too many links [Errno 31]".

Version-Release number of selected component (if applicable):
17.2.6-167.el9cp quincy (stable)

How reproducible:
Consistent
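The failing operation is an ordinary snapshot mkdir under .snap, so the error can be reproduced directly with the cephfs Python bindings once the directory already holds mds_max_snaps_per_dir snapshots. A minimal sketch, assuming python3-cephfs, a readable /etc/ceph/ceph.conf and admin keyring, five existing snapshots as in the logs below, and an illustrative snapshot name:

import errno
import cephfs

fs = cephfs.LibCephFS(conffile='/etc/ceph/ceph.conf')
fs.mount(b'/', b'cephfs')     # mount the root of the "cephfs" filesystem
try:
    # With mds_max_snaps_per_dir=5 and five snapshots already present,
    # the MDS refuses to create a sixth snapshot directory.
    fs.mkdir(b'/.snap/manual-repro-snap', 0o755)   # hypothetical snap name
except cephfs.OSError as e:
    print(e)                          # "... Too many links [Errno 31]"
    assert e.errno == errno.EMLINK    # EMLINK == 31
finally:
    fs.shutdown()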
Venky, I tried a ceph-mgr restart when snap rotation fails; the logs are below. This does not fix the issue for an existing schedule.

Logs:

[root@ceph-sumar-bz-verify-ti5ilr-node8 .snap]# ceph config set mds mds_max_snaps_per_dir 5
[root@ceph-sumar-bz-verify-ti5ilr-node8 .snap]#
[root@ceph-sumar-bz-verify-ti5ilr-node8 .snap]# ls -l
total 3
drwxr-xr-x. 4 root root 4730193114 Dec 15 03:10 scheduled-2023-12-19-05_57_00_UTC
drwxr-xr-x. 4 root root 4730193114 Dec 15 03:10 scheduled-2023-12-19-05_58_00_UTC
drwxr-xr-x. 4 root root 4730193114 Dec 15 03:10 scheduled-2023-12-19-05_59_00_UTC
drwxr-xr-x. 4 root root 4730193114 Dec 15 03:10 scheduled-2023-12-19-06_00_00_UTC
drwxr-xr-x. 4 root root 4730193114 Dec 15 03:10 scheduled-2023-12-19-06_01_01_UTC
[root@ceph-sumar-bz-verify-ti5ilr-node8 .snap]# date
Tue Dec 19 01:08:23 EST 2023
[root@ceph-sumar-bz-verify-ti5ilr-node8 .snap]# date
Tue Dec 19 01:09:29 EST 2023
[root@ceph-sumar-bz-verify-ti5ilr-node8 .snap]# ls -l
total 3
drwxr-xr-x. 4 root root 4730193114 Dec 15 03:10 scheduled-2023-12-19-05_57_00_UTC
drwxr-xr-x. 4 root root 4730193114 Dec 15 03:10 scheduled-2023-12-19-05_58_00_UTC
drwxr-xr-x. 4 root root 4730193114 Dec 15 03:10 scheduled-2023-12-19-05_59_00_UTC
drwxr-xr-x. 4 root root 4730193114 Dec 15 03:10 scheduled-2023-12-19-06_00_00_UTC
drwxr-xr-x. 4 root root 4730193114 Dec 15 03:10 scheduled-2023-12-19-06_01_01_UTC

[root@ceph-sumar-bz-verify-ti5ilr-node1-installer ~]# systemctl list-units --type=service | grep mgr | awk '{print $1}'
ceph-7ec04396-9b1f-11ee-99b5-fa163ebdcde7.ncutlx.service
[root@ceph-sumar-bz-verify-ti5ilr-node1-installer ~]# systemctl restart ceph-7ec04396-9b1f-11ee-99b5-fa163ebdcde7.ncutlx.service
[root@ceph-sumar-bz-verify-ti5ilr-node1-installer ~]# systemctl list-units --type=service | grep mgr | awk '{print $1}'
ceph-7ec04396-9b1f-11ee-99b5-fa163ebdcde7.ncutlx.service
[root@ceph-sumar-bz-verify-ti5ilr-node1-installer ~]#

[root@ceph-sumar-bz-verify-ti5ilr-node8 .snap]# ceph status
  cluster:
    id:     7ec04396-9b1f-11ee-99b5-fa163ebdcde7
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph-sumar-bz-verify-ti5ilr-node1-installer,ceph-sumar-bz-verify-ti5ilr-node2,ceph-sumar-bz-verify-ti5ilr-node3 (age 3d)
    mgr: ceph-sumar-bz-verify-ti5ilr-node1-installer.ncutlx(active, since 16s), standbys: ceph-sumar-bz-verify-ti5ilr-node2.tiuias
    mds: 3/3 daemons up, 2 standby
    osd: 16 osds: 16 up (since 3d), 16 in (since 3d)

  data:
    volumes: 2/2 healthy
    pools:   6 pools, 177 pgs
    objects: 14.46k objects, 7.9 GiB
    usage:   33 GiB used, 207 GiB / 240 GiB avail
    pgs:     177 active+clean

  io:
    client: 85 B/s rd, 170 B/s wr, 0 op/s rd, 0 op/s wr

[root@ceph-sumar-bz-verify-ti5ilr-node8 .snap]# ceph fs snap-schedule status /
{"fs": "cephfs", "subvol": null, "path": "/", "rel_path": "/", "schedule": "1M", "retention": {}, "start": "2023-12-19T00:00:00", "created": "2023-12-19T05:56:34", "first": "2023-12-19T05:57:00", "last": "2023-12-19T06:01:01", "last_pruned": null, "created_count": 5, "pruned_count": 0, "active": false}
[root@ceph-sumar-bz-verify-ti5ilr-node8 .snap]# ceph fs snap-schedule activate /
Schedule activated for path /
[root@ceph-sumar-bz-verify-ti5ilr-node8 .snap]# ceph fs snap-schedule status /
{"fs": "cephfs", "subvol": null, "path": "/", "rel_path": "/", "schedule": "1M", "retention": {}, "start": "2023-12-19T00:00:00", "created": "2023-12-19T05:56:34", "first": "2023-12-19T05:57:00", "last": "2023-12-19T06:01:01", "last_pruned": null, "created_count": 5, "pruned_count": 0, "active": true}
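After reactivation the schedule briefly reports "active": true. A small helper to check whether it has flipped back to inactive after the next trigger time, parsing the same status JSON shown above (a sketch; assumes the ceph CLI and an admin keyring are available on the node):

import json
import subprocess

def schedule_active(path='/'):
    # `ceph fs snap-schedule status <path>` prints the JSON seen above;
    # "active" flips back to false once snapshot creation starts failing.
    out = subprocess.run(['ceph', 'fs', 'snap-schedule', 'status', path],
                         check=True, capture_output=True, text=True).stdout
    status = json.loads(out)
    # Hedged: with several schedules on one path the module may emit a JSON
    # list; only a single object appears in the logs above.
    entries = status if isinstance(status, list) else [status]
    return all(s.get('active', False) for s in entries)

print(schedule_active('/'))   # True right after activate, False after the 06:15 failure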
The mgr log then shows the next scheduled snapshot creation failing in the same way:

2023-12-19T06:15:00.776+0000 7f167709d640 0 [snap_schedule ERROR snap_schedule.fs.schedule_client] create_scheduled_snapshot raised an exception:
2023-12-19T06:15:00.777+0000 7f167709d640 0 [snap_schedule ERROR snap_schedule.fs.schedule_client] Traceback (most recent call last):
  File "/usr/share/ceph/mgr/snap_schedule/fs/schedule_client.py", line 293, in create_scheduled_snapshot
    fs_handle.mkdir(snap_name, 0o755)
  File "cephfs.pyx", line 1027, in cephfs.LibCephFS.mkdir
cephfs.OSError: error in mkdir //.snap/scheduled-2023-12-19-06_15_00_UTC: Too many links [Errno 31]
2023-12-19T06:15:00.844+0000 7f167709d640 0 [snap_schedule INFO snap_schedule.fs.schedule_client] no retention set, assuming n: 99

[root@ceph-sumar-bz-verify-ti5ilr-node8 .snap]# ceph fs snap-schedule status /
{"fs": "cephfs", "subvol": null, "path": "/", "rel_path": "/", "schedule": "1M", "retention": {}, "start": "2023-12-19T00:00:00", "created": "2023-12-19T05:56:34", "first": "2023-12-19T05:57:00", "last": "2023-12-19T06:01:01", "last_pruned": null, "created_count": 5, "pruned_count": 0, "active": false}
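The final INFO line ("no retention set, assuming n: 99") points at the likely root cause: with an empty retention policy the snap_schedule module assumes it may accumulate one less than the default mds_max_snaps_per_dir of 100 snapshots before pruning, rather than consulting the configured value (5 here). A simplified sketch of that assumption (illustrative names, not the actual schedule_client.py source):

ASSUMED_MAX_SNAPS_PER_DIR = 100   # built-in default; the live config value (5) is not consulted

def effective_retention(retention):
    # With retention == {} (see the status output above), pruning would only
    # start at 99 snapshots -- but the MDS already rejects the 6th mkdir with
    # EMLINK when mds_max_snaps_per_dir=5, so pruning never runs and the
    # schedule ends up deactivated after the failure.
    if not retention:
        return {'n': ASSUMED_MAX_SNAPS_PER_DIR - 1}   # logged as "assuming n: 99"
    return retention

If that reading is right, setting an explicit retention count below the MDS limit (for example, something like `ceph fs snap-schedule retention add / n 4`) should let pruning run before the limit is hit, independent of the actual fix.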
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 7.1 security, bug fix, enhancement, and known issue updates), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2025:1770