Bug 2149564

Summary:	ceph orch ps has inconsistent information - refresh might not be happening correctly
Product:	[Red Hat Storage] Red Hat Ceph Storage	Reporter:	Vasishta <vashastr>
Component:	Cephadm	Assignee:	Adam King <adking>
Status:	CLOSED DUPLICATE	QA Contact:	Manisha Saini <msaini>
Severity:	high	Docs Contact:	Anjana Suparna Sriram <asriram>
Priority:	unspecified
Version:	5.3	CC:	cephqe-warriors
Target Milestone:	---
Target Release:	6.1
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-12-08 15:43:55 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Vasishta 2022-11-30 08:17:18 UTC

Description of problem:
Two of the five monitors in one of the clusters were down due to low disk space. The situation was left in place for more than a day to preserve OSD logs.

ceph status and the systemctl service status/daemon logs clearly show that the monitors had not been running for more than a day.

ceph orch ps reports that the service is running even though the service was stopped.
ceph orch ps also says that the status was refreshed just 10h ago.

So it appears that when the cache was refreshed, the orchestrator failed to recognise that the daemon was down.

Version-Release number of selected component (if applicable):
16.2.10-78.el8cp

How reproducible:
Happening on two different monitors in the same cluster.

Steps to Reproduce:
1. Configured the cluster with separate public network and cluster network.
2. Enabled debug logs and log_to_file
3. Daemon hosts got their disks full
4. Monitors went down.
5. Observed ceph orch ps after two days.
6. ceph orch ps reported that the monitors were running even though they had been down for two days.

Actual results:
ceph orch ps returning inaccurate data

Expected results:
ceph orch ps to return accurate data with respect to mgr/cephadm/daemon_cache_timeout
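
The expectation above can be pictured as a staleness check on the REFRESHED column. The following is only a rough sketch, not part of cephadm: the function names, the `slack` factor and the assumed timeout of 600 seconds are illustrative assumptions (check the actual value of mgr/cephadm/daemon_cache_timeout on the cluster), and the parser only handles single-unit ages such as `3m ago` or `2d ago`.

```python
import re

# ASSUMPTION: 600 s is used here as the cache timeout; verify the real value
# of mgr/cephadm/daemon_cache_timeout on your cluster before relying on it.
ASSUMED_CACHE_TIMEOUT_S = 600

# Seconds per unit for the single-unit ages this sketch understands.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def age_to_seconds(age):
    """Convert a REFRESHED value such as '10h ago' or '3m' to seconds."""
    match = re.fullmatch(r"\s*(\d+)([smhdw])(?:\s+ago)?\s*", age)
    if match is None:
        raise ValueError(f"unrecognised age: {age!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def stale_daemons(rows, timeout_s=ASSUMED_CACHE_TIMEOUT_S, slack=2.0):
    """Return daemon names whose refresh age exceeds slack * timeout_s.

    rows is an iterable of (daemon_name, refreshed_text) pairs, for example
    copied by hand from `ceph orch ps` output.
    """
    limit = timeout_s * slack
    return [name for name, refreshed in rows if age_to_seconds(refreshed) > limit]
```

With these assumptions, a daemon showing `10h ago` would be flagged as stale, while one showing `3m ago` would not.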

Additional info:

Current time on system is Wed Nov 30 08:13:38

# ceph orch ps --service-name mon
NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID

mon.f12-h10-000-1029u f12-h10-000-1029u.rdu2.scalelab.redhat.com running (14h) 10h ago 8d 1184M 2048M 16.2.10-78.el8cp f602dce7d85c 54f18020b1bc

[root@e22-h20-b02-fc640 75d6646a-697d-11ed-8d63-004e013de69e]# ssh f12-h11-000-1029u.rdu2.xxxxxxxxxxxx "systemctl -l| grep ceph"
root.xxxxxxxx's password:
..
● ceph-75d6646a-697d-11ed-8d63-004e013de69e.service loaded failed failed Ceph mon.f12-h11-000-1029u for 75d6646a-697d-11ed-8d63-004e013de69e

(Last timestamped line in the daemon log; the daemon crashed after that because of low disk space)
2022-11-29T03:09:52.884+0000 7fd3a426c700 2 rocksdb: [e/sst_file_manager_impl.cc:290] free space [20480 bytes] is less than required disk buffer [33554432 bytes]

Comment 2 Adam King 2022-11-30 15:23:20 UTC

Is this the same cluster as https://bugzilla.redhat.com/show_bug.cgi?id=2149606? In https://bugzilla.redhat.com/show_bug.cgi?id=2149606#c1 I was actually asking about this exact type of thing. Typically the refresh not happening implies cephadm has crashed or got stuck somehow. If it crashed, `ceph -s` and `ceph health detail` should report on it. Otherwise, if it's stuck, usually the best thing to do is to try `ceph mgr fail` to get cephadm to restart. After that, wait a few minutes and check `ceph orch ps` and `ceph orch device ls` again and see which things got refreshed. If everything did it should be "unstuck", otherwise which things didn't get refreshed will give us some info on where it got stuck.
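
The sequence described in this comment could be scripted roughly as follows. This is a minimal sketch under stated assumptions: it only shells out to the commands named above, the helper names `run` and `unstick_cephadm` and the 180-second wait are hypothetical, and it assumes the `ceph` CLI is available on the host where it runs.

```python
import subprocess
import time


def run(cmd):
    """Run a command and return its stdout as text; raises on non-zero exit."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def unstick_cephadm(runner=run, wait_s=180, sleep=time.sleep):
    """Collect crash info, fail over the mgr, wait, then collect refresh state.

    runner and sleep are injectable so the flow can be exercised without a
    cluster. wait_s stands in for "a few minutes" and is an arbitrary choice.
    """
    report = {}
    # If cephadm crashed, these two outputs should mention it.
    report["status"] = runner(["ceph", "-s"])
    report["health"] = runner(["ceph", "health", "detail"])
    # Fail over the active mgr so that the cephadm module restarts.
    runner(["ceph", "mgr", "fail"])
    sleep(wait_s)
    # Compare the REFRESHED columns here to see which items were updated.
    report["orch_ps"] = runner(["ceph", "orch", "ps"])
    report["devices"] = runner(["ceph", "orch", "device", "ls"])
    return report
```

Anything still showing an old refresh time in the `orch_ps` or `devices` output after the wait points at where the refresh is stuck.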

Comment 3 Vasishta 2022-12-01 07:16:46 UTC

(In reply to Adam King from comment #2)
> Is this the same cluster as
> https://bugzilla.redhat.com/show_bug.cgi?id=2149606? 
Yes it is

> In https://bugzilla.redhat.com/show_bug.cgi?id=2149606#c1 I was actually asking
> about this exact type of thing. Typically the refresh not happening implies
> cephadm has crashed or got stuck somehow. If it crashed, `ceph -s` and `ceph
> health detail` should report on it. 
I don't see any crashes.

> Otherwise, if it's stuck, usually the
> best thing to do is to try `ceph mgr fail` to get cephadm to restart. After
> that, wait a few minutes and check `ceph orch ps` and `ceph orch device ls`
> again and see which things got refreshed. If everything did it should be
> "unstuck", otherwise which things didn't get refreshed will give us some
> info on where it got stuck.

Tried mgr fail.
Restarted the standby mgr manually (the one that was active before the mgr fail).
Tried mgr fail again so that the initial mgr became active again.

orch ps got refreshed for the other nodes, but not for the mon that went down 4 hours before the current time.

# ceph orch ps --service-name mon
NAME                                            HOST                                        PORTS  STATUS         REFRESHED
mon.e22-h20-b02-fc640.rdu2.scalelab.redhat.com  e22-h20-b02-fc640.rdu2.scalelab.redhat.com         running (2d)      3m ago
mon.e22-h20-b03-fc640                           e22-h20-b03-fc640.rdu2.scalelab.redhat.com         running (2d)      3m ago
mon.e22-h20-b04-fc640                           e22-h20-b04-fc640.rdu2.scalelab.redhat.com         running (2d)      3m ago
******* mon.f12-h10-000-1029u                           f12-h10-000-1029u.rdu2.scalelab.redhat.com         running (2d)      2d ago ********
mon.f12-h11-000-1029u                           f12-h11-000-1029u.rdu2.scalelab.redhat.com         running (19h)     2m ago


Is there an open BZ tracking the need for **ceph mgr fail** to unblock the refresh?

Comment 4 Vasishta 2022-12-08 15:43:55 UTC

Learned that there are no open BZs tracking the need for **ceph mgr fail** to unblock the refresh. Will follow up on it offline.

Closing this BZ as a duplicate of 2151908.
Cephadm commands were hung with the following logging error:
..
..
--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib64/python3.6/logging/__init__.py", line 998, in emit
    self.flush()
  File "/usr/lib64/python3.6/logging/__init__.py", line 978, in flush
    self.stream.flush()
OSError: [Errno 28] No space left on device

As this is a configuration issue, cephadm should not get stuck indefinitely; it should report back to the cluster that the operation is not possible, along with the likely reason (x/y/z).
This is being tracked under 2151908.
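
As a rough illustration of the behaviour requested above (not how cephadm is implemented, and not the fix tracked in 2151908), a command runner can check free space first and bound the run time, returning a reason instead of blocking. The function names are hypothetical; the required-bytes default is simply the disk buffer value from the rocksdb log line quoted in this bug.

```python
import shutil
import subprocess

# Disk buffer that rocksdb asked for in the mon log quoted in this bug (bytes).
# Using it as a general threshold is an assumption made for illustration.
REQUIRED_FREE_BYTES = 33554432


def preflight(path, required=REQUIRED_FREE_BYTES, disk_usage=shutil.disk_usage):
    """Return None if path has enough free space, else a readable reason."""
    free = disk_usage(path).free
    if free < required:
        return f"{path}: free space {free} bytes is below required {required} bytes"
    return None


def run_bounded(cmd, timeout_s, path="/"):
    """Run cmd with a time limit; return (ok, detail) instead of hanging."""
    reason = preflight(path)
    if reason is not None:
        return False, reason
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False, f"{cmd!r} timed out after {timeout_s}s"
    except OSError as exc:
        # Covers a missing binary as well as errors such as ENOSPC at start-up.
        return False, f"{cmd!r} failed to start: {exc}"
    if done.returncode != 0:
        return False, done.stderr.strip() or f"exit code {done.returncode}"
    return True, done.stdout
```

The caller always gets a result it can surface to the operator, whether the command succeeded, failed, timed out, or was never started because the disk was full.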

*** This bug has been marked as a duplicate of bug 2151908 ***