Bug 2163697 - cephadm operations get stuck due to zombie process of ceph on certain cluster nodes
Keywords:
Status: NEW
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Cephadm
Version: 5.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 6.2
Assignee: Adam King
QA Contact: Mohit Bisht
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-01-24 09:49 UTC by Vasishta
Modified: 2023-07-06 17:45 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHCEPH-6018 0 None None None 2023-01-24 09:52:45 UTC

Description Vasishta 2023-01-24 09:49:34 UTC
Description of problem:
cephadm operations randomly get stuck due to zombie ceph processes.
Another prominent observation is that podman ps also hangs on the affected nodes.

Version-Release number of selected component (if applicable):
cephadm-16.2.10-94.el9cp.noarch
The cluster is running 16.2.10-87.el8cp

How reproducible:
Observed about 15 times so far.

Steps to Reproduce:
There are no known steps to reproduce.
We are observing it on a cluster that uses separate public and private networks.

Actual results:
cephadm gets stuck performing operations on ANY node, even when only a single node in a large cluster has a zombie ceph process.

Expected results:
Either the zombie processes need to be fixed, or cephadm should handle this situation so that user operations can still be performed.

Additional info:
Workaround: find and kill the zombie process, or restart the node.
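
As a rough illustration of the "find" step in the workaround, the sketch below lists zombie processes whose command name contains "ceph". It is not part of cephadm; the helper names (find_zombies, list_zombies_on_host) are hypothetical, and it assumes a Linux host where ps accepts the -eo pid,ppid,stat,comm options.

```python
import subprocess


def find_zombies(ps_output, name_filter="ceph"):
    """Parse `ps -eo pid,ppid,stat,comm` output.

    Returns a list of (pid, ppid, command) tuples for processes whose
    state starts with "Z" (zombie) and whose command contains name_filter.
    """
    zombies = []
    for line in ps_output.splitlines():
        fields = line.split(None, 3)
        # Skip the header row and any malformed or blank lines.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        pid, ppid, stat, comm = fields
        if stat.startswith("Z") and name_filter in comm:
            zombies.append((int(pid), int(ppid), comm.strip()))
    return zombies


def list_zombies_on_host(name_filter="ceph"):
    """Run ps locally (hypothetical helper) and return matching zombies."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,stat,comm"],
        capture_output=True,
        text=True,
        check=True,
    )
    return find_zombies(result.stdout, name_filter)


if __name__ == "__main__":
    for pid, ppid, comm in list_zombies_on_host():
        print(f"zombie pid={pid} parent={ppid} command={comm}")
```

The parent PID is reported because a zombie has already exited and is normally cleared only when its parent reaps it or exits, so the parent is usually the process to inspect.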

