Bug 2149564
| Summary: | ceph orch ps has inconsistent information - refresh might not be happening correctly | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Vasishta <vashastr> |
| Component: | Cephadm | Assignee: | Adam King <adking> |
| Status: | CLOSED DUPLICATE | QA Contact: | Manisha Saini <msaini> |
| Severity: | high | Docs Contact: | Anjana Suparna Sriram <asriram> |
| Priority: | unspecified | ||
| Version: | 5.3 | CC: | cephqe-warriors |
| Target Milestone: | --- | ||
| Target Release: | 6.1 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-12-08 15:43:55 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Vasishta
2022-11-30 08:17:18 UTC
Is this the same cluster as https://bugzilla.redhat.com/show_bug.cgi?id=2149606? In https://bugzilla.redhat.com/show_bug.cgi?id=2149606#c1 I was actually asking about this exact type of thing.

Typically the refresh not happening implies cephadm has crashed or got stuck somehow. If it crashed, `ceph -s` and `ceph health detail` should report on it. Otherwise, if it's stuck, usually the best thing to do is to try `ceph mgr fail` to get cephadm to restart. After that, wait a few minutes and check `ceph orch ps` and `ceph orch device ls` again and see which things got refreshed. If everything did, it should be "unstuck"; otherwise, which things didn't get refreshed will give us some info on where it got stuck.

(In reply to Adam King from comment #2)
> Is this the same cluster as
> https://bugzilla.redhat.com/show_bug.cgi?id=2149606?

Yes it is.

> In https://bugzilla.redhat.com/show_bug.cgi?id=2149606#c1 I was actually asking
> about this exact type of thing. Typically the refresh not happening implies
> cephadm has crashed or got stuck somehow. If it crashed, `ceph -s` and `ceph
> health detail` should report on it.

I don't see any crashes.

> Otherwise, if it's stuck, usually the
> best thing to do is to try `ceph mgr fail` to get cephadm to restart. After
> that, wait a few minutes and check `ceph orch ps` and `ceph orch device ls`
> again and see which things got refreshed. If everything did it should be
> "unstuck", otherwise which things didn't get refreshed will give us some
> info on where it got stuck.

Tried mgr fail.
Restarted the standby mgr manually (it was the active mgr before the mgr fail).
Tried mgr fail again so that the initial mgr got back to active.

`ceph orch ps` got refreshed for the other nodes, but not for the mon that went off 4 hours before the current time.

    # ceph orch ps --service-name mon
    NAME                                            HOST                                        PORTS  STATUS         REFRESHED
    mon.e22-h20-b02-fc640.rdu2.scalelab.redhat.com  e22-h20-b02-fc640.rdu2.scalelab.redhat.com         running (2d)   3m ago
    mon.e22-h20-b03-fc640                           e22-h20-b03-fc640.rdu2.scalelab.redhat.com         running (2d)   3m ago
    mon.e22-h20-b04-fc640                           e22-h20-b04-fc640.rdu2.scalelab.redhat.com         running (2d)   3m ago
    *******
    mon.f12-h10-000-1029u                           f12-h10-000-1029u.rdu2.scalelab.redhat.com         running (2d)   2d ago
    ********
    mon.f12-h11-000-1029u                           f12-h11-000-1029u.rdu2.scalelab.redhat.com         running (19h)  2m ago

Is there any open BZ tracking the need for **ceph mgr fail** to unblock the refresh?

Got to know that there are no open BZs tracking the need for **ceph mgr fail** to unblock the refresh. Will follow it up offline.
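For anyone wanting to spot this kind of stale entry without eyeballing the REFRESHED column, here is a minimal Python sketch. It is not part of cephadm; `age_to_seconds`, `stale_daemons`, and the 600-second threshold are hypothetical names and values chosen for illustration, and the unit suffixes are assumed from the "3m ago" / "2d ago" style in the output above. The REFRESHED values are copied from that output.

```python
import re

# Seconds per unit suffix. The s/m/h/d/w set is an assumption based on the
# "3m ago", "19h" and "2d ago" values seen in the output above.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def age_to_seconds(text):
    """Convert a relative age such as '3m ago' or '2d ago' to seconds."""
    match = re.fullmatch(r"\s*(\d+)([smhdw])(?:\s+ago)?\s*", text)
    if match is None:
        raise ValueError(f"unrecognised age: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]


def stale_daemons(refreshed_by_daemon, max_age_seconds=600):
    """Return the sorted daemon names whose REFRESHED age exceeds the threshold.

    max_age_seconds=600 is an arbitrary illustrative default, not a cephadm setting.
    """
    return sorted(
        name
        for name, age in refreshed_by_daemon.items()
        if age_to_seconds(age) > max_age_seconds
    )


# REFRESHED values copied from the `ceph orch ps --service-name mon` output above.
refreshed = {
    "mon.e22-h20-b02-fc640.rdu2.scalelab.redhat.com": "3m ago",
    "mon.e22-h20-b03-fc640": "3m ago",
    "mon.e22-h20-b04-fc640": "3m ago",
    "mon.f12-h10-000-1029u": "2d ago",
    "mon.f12-h11-000-1029u": "2m ago",
}

print(stale_daemons(refreshed))  # ['mon.f12-h10-000-1029u']
```

Running the same check before and after `ceph mgr fail` would show which daemons were picked up by the restarted refresh loop and which were not.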
Closing this BZ as a duplicate of 2151908.
Cephadm commands were hung with
..
..
--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib64/python3.6/logging/__init__.py", line 998, in emit
    self.flush()
  File "/usr/lib64/python3.6/logging/__init__.py", line 978, in flush
    self.stream.flush()
OSError: [Errno 28] No space left on device
As this is a configuration issue, cephadm should not get stuck indefinitely; it should instead report back to the cluster that the operation is not possible, along with the likely reason.
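As a rough illustration of the "report back with a reason" behaviour asked for here, the sketch below checks free space up front and returns a message instead of hanging. It is only an assumption about how such a guard could look, not how cephadm behaves or is implemented; `check_free_space` and its parameters are hypothetical, and the path and threshold in the usage lines are placeholders.

```python
import shutil


def check_free_space(path, min_free_bytes):
    """Return (ok, message) describing whether `path` has enough free space.

    Hypothetical helper: a caller could refuse to start an operation and
    surface `message` to the operator when `ok` is False.
    """
    usage = shutil.disk_usage(path)  # named tuple with total, used, free (bytes)
    if usage.free < min_free_bytes:
        return False, (
            f"operation not possible: only {usage.free} bytes free on {path}, "
            f"need at least {min_free_bytes}"
        )
    return True, f"{usage.free} bytes free on {path}"


# Placeholder path and threshold, for illustration only.
ok, message = check_free_space(".", 100 * 1024 * 1024)
print(ok, message)
```

A check like this only covers the "No space left on device" case from the traceback; other causes would need their own checks.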
This is being tracked under 2151908.
*** This bug has been marked as a duplicate of bug 2151908 ***