Bug 2268179

Summary: ceph fs snap-schedule command is erroring with EIO: disk I/O error
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Amarnath <amk>
Component: CephFS
Assignee: Milind Changire <mchangir>
Status: CLOSED ERRATA
QA Contact: Hemanth Kumar <hyelloji>
Severity: medium
Docs Contact: Rivka Pollack <rpollack>
Priority: unspecified
Version: 7.1
CC: ceph-eng-bugs, cephqe-warriors, gfarnum, mchangir, ngangadh, rpollack, sumr, tserlin, vshankar
Target Milestone: ---
Flags: hyelloji: needinfo-
       hyelloji: needinfo-
Target Release: 7.1z5
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: ceph-18.2.1-335.el9cp
Doc Type: Bug Fix
Doc Text:
.Improved Handling of fs_map Notifications After File-System Removal
Previously, after a file-system was removed from the cluster, the fs_map notification about the change was not handled properly. This oversight caused the snap_schedule Manager Module to continue accessing the associated snap_schedule SQLite Database in the metadata pool, which in turn resulted in disk I/O errors. With this fix, all timers related to the file-system are now canceled and the SQLite Database connection is closed after deletion, helping ensure no invalid metadata pool references remain. (A sketch of this notification handling follows the metadata fields below.)

NOTE: A small window still exists between file-system deletion and notification processing, during which a snapshot schedule could run for a recently deleted file-system and occasionally report disk I/O errors in the Manager logs or at the console.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2025-06-23 02:51:40 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
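The Doc Text above describes the fix at a high level: when an fs_map update shows that a file system has been removed, its timers are canceled and its schedule-DB connection is closed. The following is a minimal sketch of that cleanup logic, not the shipped patch; the class, method, and attribute names (SnapScheduleCleanup, handle_fs_map, sqlite_connections) are illustrative assumptions.

```
# Illustrative sketch only -- not the actual snap_schedule code. It models the
# cleanup described in the Doc Text: when an fs_map update shows that a file
# system is gone, cancel its timers and close its schedule-DB connection.

class SnapScheduleCleanup:
    def __init__(self):
        self.known_fs = set()             # fs names seen in the previous fs_map
        self.timers = {}                  # fs name -> list of pending timer objects
        self.sqlite_connections = {}      # fs name -> sqlite3.Connection

    def handle_fs_map(self, fs_map):
        # fs_map is the dict a mgr module receives for an 'fs_map' notification.
        current = {fs['mdsmap']['fs_name'] for fs in fs_map.get('filesystems', [])}
        for fs_name in self.known_fs - current:
            # Cancel any pending snapshot-schedule timers for the deleted fs ...
            for timer in self.timers.pop(fs_name, []):
                timer.cancel()
            # ... and close the SQLite connection so no stale reference to the
            # (now deleted) metadata pool remains.
            conn = self.sqlite_connections.pop(fs_name, None)
            if conn is not None:
                conn.close()
        self.known_fs = current
```

As the NOTE in the Doc Text points out, a schedule can still fire in the window between file-system deletion and this cleanup running, which is why an occasional disk I/O error may still appear in the Manager logs.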

Description Amarnath 2024-03-06 14:08:56 UTC
Description of problem:
ceph fs snap-schedule command is erroring with EIO: disk I/O error

As part of the test case we create a file system named cephfs_snap_1 and enable a
snap-schedule on it, which works fine, and then we delete the file system.
http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-TCN23X/snap_schedule_test_0.log

If we rerun the same test case on the same setup, the snap-schedule command fails with the error below:
[root@ceph-amk-nfs-n308a3-node7 ~]# ceph fs snap-schedule add /dir_kernel 1m --fs cephfs_snap_1
Error EIO: disk I/O error

Log : http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-RQWN0B/snap_schedule_test_0.log

mgr Log : 
2024-03-06T14:05:53.442+0000 7fd68f399640  0 [rbd_support INFO root] MirrorSnapshotScheduleHandler: load_schedules
2024-03-06T14:05:53.454+0000 7fd68ab90640  0 [rbd_support INFO root] TrashPurgeScheduleHandler: load_schedules
2024-03-06T14:05:54.007+0000 7fd67b532640  0 [volumes INFO mgr_util] scanning for idle connections..
2024-03-06T14:05:54.007+0000 7fd67b532640  0 [volumes INFO mgr_util] cleaning up connections: []
2024-03-06T14:05:54.387+0000 7fd6b2d6e640  0 log_channel(audit) log [DBG] : from='client.25922 -' entity='client.admin' cmd=[{"prefix": "fs snap-schedule add", "path": "/dir_kernel", "snap_schedule": "1m", "fs": "cephfs_snap_1", "target": ["mon-mgr", ""]}]: dispatch
2024-03-06T14:05:54.389+0000 7fd67dd37640 -1 client.14706: SimpleRADOSStriper: lock: snap_db_v0.db:  lock failed: (2) No such file or directory
2024-03-06T14:05:54.390+0000 7fd67dd37640 -1 mgr.server reply reply (5) Input/output error disk I/O error

Logs : http://magna002.ceph.redhat.com/ceph-qe-logs/amar/snap-scedule/ceph-mgr.ceph-amk-nfs-n308a3-node1-installer.ddwlwo.log 


Version-Release number of selected component (if applicable):
[root@ceph-amk-nfs-n308a3-node7 ~]# ceph versions
{
    "mon": {
        "ceph version 18.2.1-46.el9cp (141acb8d05e675ccf507f89585369b2c90c6d4a9) reef (stable)": 3
    },
    "mgr": {
        "ceph version 18.2.1-46.el9cp (141acb8d05e675ccf507f89585369b2c90c6d4a9) reef (stable)": 2
    },
    "osd": {
        "ceph version 18.2.1-46.el9cp (141acb8d05e675ccf507f89585369b2c90c6d4a9) reef (stable)": 12
    },
    "mds": {
        "ceph version 18.2.1-46.el9cp (141acb8d05e675ccf507f89585369b2c90c6d4a9) reef (stable)": 5
    },
    "overall": {
        "ceph version 18.2.1-46.el9cp (141acb8d05e675ccf507f89585369b2c90c6d4a9) reef (stable)": 22
    }
}


How reproducible:
Reproducible by rerunning the test case on the same setup after the file system from the previous run has been deleted.

Steps to Reproduce:
1. Create a file system named cephfs_snap_1 and add a snap-schedule for a path on it.
2. Delete the file system.
3. Recreate cephfs_snap_1 and run: ceph fs snap-schedule add /dir_kernel 1m --fs cephfs_snap_1

Actual results:
Error EIO: disk I/O error

Expected results:
The snap-schedule add command succeeds and the schedule is created.

Additional info:

Comment 1 RHEL Program Management 2024-03-06 14:09:07 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 2 Venky Shankar 2024-03-12 06:45:10 UTC
(In reply to Amarnath from comment #0)
> ceph fs snap-schedule command is erroring with EIO: disk I/O error
> [...]
> 2024-03-06T14:05:54.389+0000 7fd67dd37640 -1 client.14706:
> SimpleRADOSStriper: lock: snap_db_v0.db:  lock failed: (2) No such file or
> directory

That's the schedules database being loaded, where we do handle ENOENT:

```
            with open_ioctx(self, pool_param) as ioctx:
                try:
                    size, _mtime = ioctx.stat(SNAP_DB_OBJECT_NAME)
                    dump = ioctx.read(SNAP_DB_OBJECT_NAME, size).decode('utf-8')
                    db.executescript(dump)
                    ioctx.remove_object(SNAP_DB_OBJECT_NAME)
                except rados.ObjectNotFound:
                    log.debug(f'No legacy schedule DB found in {fs}')
```
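
The lock failure in the mgr log, though, is on snap_db_v0.db via SimpleRADOSStriper, i.e. the per-filesystem schedule database itself rather than this legacy-dump path, which suggests a schedule-DB connection that outlived the deleted file system (consistent with the fix later described in the Doc Text above). A rough sketch of that cached-connection pattern is below; the URI format, helper name, and caching structure are illustrative assumptions, not the exact module code:

```
import sqlite3

# Rough illustration (not the actual module code) of how a per-filesystem
# schedule-DB connection can outlive the file system it belongs to.

connections = {}  # fs name -> sqlite3.Connection, cached across commands


def get_schedule_db(fs_name, metadata_pool):
    if fs_name not in connections:
        # The schedule DB is stored as RADOS objects (snap_db_v0.db.*) in the
        # fs metadata pool and opened through the libcephsqlite "ceph" VFS,
        # which must be loaded for this connect call to succeed.
        uri = f'file:///{metadata_pool}:/snap_db_v0.db?vfs=ceph'
        connections[fs_name] = sqlite3.connect(uri, uri=True,
                                               check_same_thread=False)
    return connections[fs_name]

# If cephfs_snap_1 is deleted and later recreated, a cached connection still
# points at objects in the old metadata pool; the next access fails to take
# the striper lock on snap_db_v0.db (ENOENT), which SQLite reports as a
# disk I/O error -- the EIO seen at the CLI.
```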

Milind?

Comment 21 errata-xmlrpc 2025-06-23 02:51:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 7.1 security and bug fix updates), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2025:9335