Bug 2276862

Summary: [CephFS-Mirror] - Traceback error seen while running - "ceph fs snapshot mirror daemon status"
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Hemanth Kumar <hyelloji>
Component: CephFS Assignee: Jos Collin <jcollin>
Status: CLOSED ERRATA QA Contact: Hemanth Kumar <hyelloji>
Severity: high Docs Contact: Akash Raj <akraj>
Priority: unspecified    
Version: 7.1 CC: akraj, ceph-eng-bugs, cephqe-warriors, jcollin, tserlin, vshankar
Target Milestone: ---   
Target Release: 7.1z1   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ceph-18.2.1-205.el9cp Doc Type: Bug Fix
Doc Text:
Previously, the `directory_count` key was intermittently missing from the JSON returned by `self.mgr.get_daemon_status()` when `m_listener.handle_mirroring_enabled()` was delayed in updating `directory_count`. As a result, `ServiceDaemon::update_status()` created JSON without the `directory_count` key/value pair. The issue occurred intermittently when mirroring was disabled and enabled repeatedly and the daemon status was checked in between, causing `ceph fs snapshot mirror daemon status` to fail with `KeyError: 'directory_count'`. With this fix, `daemon_status()` sets a default value of 0 for `directory_count`, and the `KeyError` no longer occurs in `ceph fs snapshot mirror daemon status`.
Story Points: ---
Clone Of: Environment:
Last Closed: 2024-08-07 11:21:52 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Comment 1 Jos Collin 2024-04-24 15:39:26 UTC
Hemanth,

Could you please attach the mirror logs and mgr logs so I can check?

Thanks.

Comment 2 Hemanth Kumar 2024-04-24 17:56:23 UTC
(In reply to Jos Collin from comment #1)
> Hemanth,
> 
> Could you please attach the mirror logs and mgr logs to check ?
> 
> Thanks.

All requested logs are uploaded here:

http://magna002.ceph.redhat.com/ceph-qe-logs/hemanth_k/Bug_2276862/

Comment 3 Jos Collin 2024-04-25 11:29:22 UTC
'ceph fs snapshot mirror daemon status' didn't show KeyError for me, by following the steps from the mgr logs.

It was still showing:

[{"daemon_id": 4161, "filesystems": [{"filesystem_id": 1, "name": "a", "directory_count": 2, "peers": [{"uuid": "f2b6795b-7d35-4ec0-8507-0713676fae2b", "remote": {"client_name": "client.mirror_remote", "cluster_name": "ceph", "fs_name": "remotefs"}, "stats": {"failure_count": 0, "recovery_count": 0}}]}]}]

So I couldn't reproduce this issue; it must be intermittent.
@Hemanth, could you please check and provide steps to reproduce the KeyError, so that I can fix the root cause instead of adding a workaround?

Comment 4 Venky Shankar 2024-04-29 13:36:39 UTC
(In reply to Hemanth Kumar from comment #2)
> (In reply to Jos Collin from comment #1)
> > Hemanth,
> > 
> > Could you please attach the mirror logs and mgr logs to check ?
> > 
> > Thanks.
> 
> All requested logs are uploaded here:
> 
> http://magna002.ceph.redhat.com/ceph-qe-logs/hemanth_k/Bug_2276862/

`SERVICE_DAEMON_DIR_COUNT_KEY` is only updated when a directory is added. See: FSMirror::handle_acquire_directory(). Jos, please try to reproduce on a fresh cluster without any dirs added.

Comment 5 Jos Collin 2024-05-06 05:33:57 UTC
It's not always reproducible; it's intermittent. But when it errored, self.mgr.get_daemon_status returned the JSON below. When the daemon status was run again after some time, the KeyError was gone.

mgr.x.log:2024-05-03T18:34:57.208+0530 7f18d3a046c0  0 [mirroring DEBUG mirroring.fs.snapshot_mirror] daemon_status: {'status_json': '{}'}
mgr.x.log:2024-05-03T18:35:05.077+0530 7f18d3a046c0  0 [mirroring DEBUG mirroring.fs.snapshot_mirror] daemon_status: {'status_json': '{"1":{"name":"a","peers":{}}}'}
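
The second payload above is enough to illustrate the failure. The following is a minimal sketch, assuming the status handler indexes `directory_count` directly on each filesystem entry. The function name `build_filesystems_strict` is hypothetical and is not the actual mgr module code.

```python
import json

# Payload copied from the second mgr log line above.
status_json = '{"1":{"name":"a","peers":{}}}'

def build_filesystems_strict(status_json):
    """Hypothetical strict reader: assumes every entry carries directory_count."""
    filesystems = []
    for fs_id, fs_desc in json.loads(status_json).items():
        filesystems.append({
            "filesystem_id": int(fs_id),
            "name": fs_desc["name"],
            # Direct indexing raises KeyError when the key is absent.
            "directory_count": fs_desc["directory_count"],
        })
    return filesystems

missing_key = None
try:
    build_filesystems_strict(status_json)
except KeyError as e:
    missing_key = e.args[0]
    print(f"KeyError: {missing_key!r}")
```

The empty payload from the first log line (`'{}'`) does not trigger the error in this sketch, because there are no filesystem entries to iterate over.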

Comment 6 Venky Shankar 2024-05-06 07:36:10 UTC
(In reply to Jos Collin from comment #5)
> it's not always reproducible. It's intermittent. But when it errored,
> self.mgr.get_daemon_status returns the below json. When the daemon status 
> ran again after sometime, the KeyError is gone.
> 
> mgr.x.log:2024-05-03T18:34:57.208+0530 7f18d3a046c0  0 [mirroring DEBUG
> mirroring.fs.snapshot_mirror] daemon_status: {'status_json': '{}'}
> mgr.x.log:2024-05-03T18:35:05.077+0530 7f18d3a046c0  0 [mirroring DEBUG
> mirroring.fs.snapshot_mirror] daemon_status: {'status_json':
> '{"1":{"name":"a","peers":{}}}'}

Have you identified the case where the keys go missing from the JSON output? The fix is likely to be verifying that the key exists before accessing it, but I would still like to know under which circumstances the keys are missing from the daemon status JSON.
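
The "verify the key exists, then access it" approach can be sketched as below, using the default of 0 mentioned in the Doc Text. This is only an illustration under assumptions: the function name `build_filesystems` and the output layout are hypothetical, loosely modelled on the JSON shown in comment 3, and do not reproduce the actual patch.

```python
import json

def build_filesystems(status_json):
    """Hypothetical tolerant reader: missing directory_count defaults to 0."""
    filesystems = []
    for fs_id, fs_desc in json.loads(status_json).items():
        filesystems.append({
            "filesystem_id": int(fs_id),
            "name": fs_desc["name"],
            # dict.get() returns the default instead of raising KeyError.
            "directory_count": fs_desc.get("directory_count", 0),
            "peers": fs_desc.get("peers", {}),
        })
    return filesystems

# Payloads copied from the mgr log lines in comment 5.
print(build_filesystems('{}'))
print(build_filesystems('{"1":{"name":"a","peers":{}}}'))
```

Defaulting to 0 keeps the status command usable during the window in which the daemon has not yet reported a directory count, but it does not explain why the key is absent, which is the open question here.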

Comment 16 errata-xmlrpc 2024-08-07 11:21:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 7.1 security and bug fix update.), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:5080