Bug 2268179 - ceph fs snap-schedule command is erroring with EIO: disk I/O error
Summary: ceph fs snap-schedule command is erroring with EIO: disk I/O error
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 7.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 7.1z5
Assignee: Milind Changire
QA Contact: Hemanth Kumar
Docs Contact: Rivka Pollack
URL:
Whiteboard:
Duplicates: 2268545
Depends On:
Blocks:
 
Reported: 2024-03-06 14:08 UTC by Amarnath
Modified: 2025-06-23 02:51 UTC
CC: 9 users

Fixed In Version: ceph-18.2.1-335.el9cp
Doc Type: Bug Fix
Doc Text:
.Improved Handling of fs_map Notifications After File-System Removal

Previously, after a file-system was removed from the cluster, the fs_map notification about the change was not handled properly. This oversight caused the snap_schedule Manager Module to continue accessing the associated snap_schedule SQLite database in the metadata pool, which in turn resulted in disk I/O errors.

With this fix, all timers related to the file-system are now canceled and the SQLite database connection is closed after deletion, helping ensure no invalid metadata pool references remain.

NOTE: A small window still exists between file-system deletion and notification processing, during which a snapshot schedule could run for a recently deleted file-system and occasionally report disk I/O errors in the Manager logs or at the console.
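For illustration, a minimal sketch of the shape of such a fix in a ceph-mgr module. The schedule_timers and db_connections maps are hypothetical stand-ins for the real snap_schedule internals; notify() and self.get('fs_map') are standard MgrModule hooks:

```
def notify(self, notify_type, notify_id):
    # Only react to file-system map changes.
    if notify_type != 'fs_map':
        return
    live = {fs['mdsmap']['fs_name']
            for fs in self.get('fs_map')['filesystems']}
    for fs_name in list(self.schedule_timers):
        if fs_name not in live:
            # Cancel every pending snapshot timer for the deleted fs ...
            for timer in self.schedule_timers.pop(fs_name):
                timer.cancel()
            # ... and close its SQLite connection so no reference into
            # the (now gone) metadata pool remains.
            db = self.db_connections.pop(fs_name, None)
            if db is not None:
                db.close()
```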
Clone Of:
Environment:
Last Closed: 2025-06-23 02:51:40 UTC
Embargoed:
hyelloji: needinfo-




Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 63999 0 None None None 2024-03-12 08:09:31 UTC
Red Hat Issue Tracker RHCEPH-8454 0 None None None 2024-03-06 14:12:13 UTC
Red Hat Product Errata RHBA-2025:9335 0 None None None 2025-06-23 02:51:43 UTC

Description Amarnath 2024-03-06 14:08:56 UTC
Description of problem:
ceph fs snap-schedule command is erroring with EIO: disk I/O error

As part of the test case, we create a file system named cephfs_snap_1, enable
snap-schedule (which works fine), and then delete the file system.
http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-TCN23X/snap_schedule_test_0.log

If we rerun the same test case on the same setup, we see the following error:
[root@ceph-amk-nfs-n308a3-node7 ~]# ceph fs snap-schedule add /dir_kernel 1m --fs cephfs_snap_1
Error EIO: disk I/O error

Log : http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-RQWN0B/snap_schedule_test_0.log

mgr Log : 
2024-03-06T14:05:53.442+0000 7fd68f399640  0 [rbd_support INFO root] MirrorSnapshotScheduleHandler: load_schedules
2024-03-06T14:05:53.454+0000 7fd68ab90640  0 [rbd_support INFO root] TrashPurgeScheduleHandler: load_schedules
2024-03-06T14:05:54.007+0000 7fd67b532640  0 [volumes INFO mgr_util] scanning for idle connections..
2024-03-06T14:05:54.007+0000 7fd67b532640  0 [volumes INFO mgr_util] cleaning up connections: []
2024-03-06T14:05:54.387+0000 7fd6b2d6e640  0 log_channel(audit) log [DBG] : from='client.25922 -' entity='client.admin' cmd=[{"prefix": "fs snap-schedule add", "path": "/dir_kernel", "snap_schedule": "1m", "fs": "cephfs_snap_1", "target": ["mon-mgr", ""]}]: dispatch
2024-03-06T14:05:54.389+0000 7fd67dd37640 -1 client.14706: SimpleRADOSStriper: lock: snap_db_v0.db:  lock failed: (2) No such file or directory
2024-03-06T14:05:54.390+0000 7fd67dd37640 -1 mgr.server reply reply (5) Input/output error disk I/O error

Logs : http://magna002.ceph.redhat.com/ceph-qe-logs/amar/snap-scedule/ceph-mgr.ceph-amk-nfs-n308a3-node1-installer.ddwlwo.log 


Version-Release number of selected component (if applicable):
[root@ceph-amk-nfs-n308a3-node7 ~]# ceph versions
{
    "mon": {
        "ceph version 18.2.1-46.el9cp (141acb8d05e675ccf507f89585369b2c90c6d4a9) reef (stable)": 3
    },
    "mgr": {
        "ceph version 18.2.1-46.el9cp (141acb8d05e675ccf507f89585369b2c90c6d4a9) reef (stable)": 2
    },
    "osd": {
        "ceph version 18.2.1-46.el9cp (141acb8d05e675ccf507f89585369b2c90c6d4a9) reef (stable)": 12
    },
    "mds": {
        "ceph version 18.2.1-46.el9cp (141acb8d05e675ccf507f89585369b2c90c6d4a9) reef (stable)": 5
    },
    "overall": {
        "ceph version 18.2.1-46.el9cp (141acb8d05e675ccf507f89585369b2c90c6d4a9) reef (stable)": 22
    }
}


How reproducible:
Always, when the test case is rerun on the same cluster after the file system has been deleted.

Steps to Reproduce:
1. Create a file system named cephfs_snap_1 and add a snapshot schedule: ceph fs snap-schedule add /dir_kernel 1m --fs cephfs_snap_1 (this works fine).
2. Delete the file system.
3. Recreate the file system with the same name and run the same snap-schedule add command again.

Actual results:
The command fails with "Error EIO: disk I/O error", and the mgr log shows a SimpleRADOSStriper lock failure (ENOENT) on snap_db_v0.db.

Expected results:
The snap-schedule add command succeeds on the recreated file system.

Additional info:

Comment 1 RHEL Program Management 2024-03-06 14:09:07 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 2 Venky Shankar 2024-03-12 06:45:10 UTC
(In reply to Amarnath from comment #0)
> [...]
> 2024-03-06T14:05:54.389+0000 7fd67dd37640 -1 client.14706:
> SimpleRADOSStriper: lock: snap_db_v0.db:  lock failed: (2) No such file or
> directory

That's the legacy schedules database being loaded, where we do handle ENOENT:

```
            with open_ioctx(self, pool_param) as ioctx:
                try:
                    size, _mtime = ioctx.stat(SNAP_DB_OBJECT_NAME)
                    dump = ioctx.read(SNAP_DB_OBJECT_NAME, size).decode('utf-8')
                    db.executescript(dump)
                    ioctx.remove_object(SNAP_DB_OBJECT_NAME)
                except rados.ObjectNotFound:
                    log.debug(f'No legacy schedule DB found in {fs}')
```

Milind?
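
The failing lock, however, sits a layer below the quoted handler: the snap_schedule database is opened through the libcephsqlite 'ceph' VFS, whose object access goes through SimpleRADOSStriper, so the ENOENT on the deleted file system's objects reaches SQLite as a plain disk I/O error rather than as rados.ObjectNotFound. A hedged sketch of that failure mode (the URI is a hypothetical placeholder; the real module derives pool and database name itself):

```
import logging
import sqlite3

log = logging.getLogger(__name__)

# Hypothetical libcephsqlite-style URI for the schedule DB.
db_uri = 'file:///<metadata-pool>:/snap_db_v0.db?vfs=ceph'

try:
    # If the file system and its metadata-pool objects are gone, the
    # striper's lock fails with ENOENT underneath the VFS, and SQLite
    # can only surface that as a generic disk I/O error -- the EIO the
    # CLI reports. rados.ObjectNotFound is never raised on this path.
    db = sqlite3.connect(db_uri, uri=True)
    db.execute('SELECT 1')
except sqlite3.OperationalError as e:
    log.error(f'snap_db access failed (file system deleted?): {e}')
```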

Comment 21 errata-xmlrpc 2025-06-23 02:51:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 7.1 security and bug fix updates), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2025:9335

