Bug 2163697

Summary: cephadm operations get stuck due to zombie process of ceph on certain cluster nodes
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Vasishta <vashastr>
Component: Cephadm Assignee: Adam King <adking>
Status: NEW --- QA Contact: Mohit Bisht <mobisht>
Severity: high Docs Contact:
Priority: unspecified    
Version: 5.3 CC: cephqe-warriors, saraut
Target Milestone: ---   
Target Release: 6.2   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Vasishta 2023-01-24 09:49:34 UTC
Description of problem:
cephadm operations randomly get stuck due to zombie ceph processes.
Another clearly visible symptom is that podman ps also hangs on these nodes.
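
To check whether podman ps hangs on a node without blocking the shell, a timeout wrapper along these lines may help. This is only a sketch: the helper name command_hangs and the 30-second default are assumptions for illustration, not part of cephadm or podman.

```python
import subprocess


def command_hangs(cmd, timeout_s=30.0):
    """Return True if cmd does not exit within timeout_s seconds.

    Hypothetical helper: the name and default timeout are assumptions.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        proc.wait(timeout=timeout_s)
        return False
    except subprocess.TimeoutExpired:
        # Best-effort cleanup. Only wait briefly afterwards, because a child
        # blocked in the kernel may not go away even after SIGKILL.
        proc.kill()
        try:
            proc.wait(timeout=1.0)
        except subprocess.TimeoutExpired:
            pass
        return True


if __name__ == "__main__":
    print("podman ps hangs:", command_hangs(["podman", "ps"]))
```

Running this on each node gives a quick yes/no answer per host; a node that reports True is a candidate for the zombie check.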

Version-Release number of selected component (if applicable):
cephadm-16.2.10-94.el9cp.noarch
Cluster is running 16.2.10-87.el8cp

How reproducible:
Have observed ~15 times

Steps to Reproduce:
There are no known steps to reproduce.
We are observing it on a cluster with different public and private networks.

Actual results:
cephadm gets stuck performing operations on ANY node, even if a zombie ceph process exists on only one node of a large cluster.

Expected results:
Either the zombie processes need to be fixed, or cephadm should handle this situation so that user operations can still be performed.

Additional info:
Workaround: find and kill the zombie process, or restart the node.
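
The "find the zombie process" step of the workaround can be sketched as a scan of /proc for processes in state Z. This is a minimal sketch, not a cephadm feature: the function names and the "ceph" substring filter are assumptions, and it relies only on the Linux /proc/<pid>/stat layout (pid, command name in parentheses, state, parent pid).

```python
import os


def parse_stat(stat_text):
    """Parse the content of /proc/<pid>/stat into (pid, comm, state, ppid)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    lpar = stat_text.index("(")
    rpar = stat_text.rindex(")")
    pid = int(stat_text[:lpar].strip())
    comm = stat_text[lpar + 1:rpar]
    fields = stat_text[rpar + 1:].split()
    return pid, comm, fields[0], int(fields[1])


def find_zombies(name_filter="ceph", proc_root="/proc"):
    """Return (pid, comm, ppid) for zombies whose name contains name_filter.

    name_filter="ceph" is an assumption; adjust it to the process observed.
    """
    zombies = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path, errors="replace") as f:
                pid, comm, state, ppid = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or entry unreadable while scanning
        if state == "Z" and name_filter in comm:
            zombies.append((pid, comm, ppid))
    return zombies


if __name__ == "__main__":
    for pid, comm, ppid in find_zombies():
        print(f"zombie pid={pid} comm={comm} parent={ppid}")
```

The parent PID is printed because a zombie has already exited and is only removed once its parent reaps it (or the parent itself exits), so the parent is usually the process to inspect.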