Bug 2152053

Summary: ceph orchestrator affected by ceph-volume inventory commands that hang and stay in D state
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Vasishta <vashastr>
Component: Ceph-Volume
Assignee: Guillaume Abrioux <gabrioux>
Status: CLOSED ERRATA
QA Contact: Manisha Saini <msaini>
Severity: high
Priority: unspecified
Version: 5.3
CC: ceph-eng-bugs, cephqe-warriors, gabrioux, tserlin
Target Release: 5.3z1
Hardware: Unspecified
OS: Unspecified
Fixed In Version: ceph-16.2.10-98.el8cp
Doc Type: If docs needed, set a value
Last Closed: 2023-02-28 10:06:16 UTC
Type: Bug

Description Vasishta 2022-12-09 03:16:19 UTC
Description of problem:
When a network mount is present in /proc/mounts but the corresponding
server is down for any reason, the ceph-volume code that walks
/proc/mounts hangs forever. In a cluster deployed with cephadm, the
consequence is that the ceph-volume inventory commands it triggers hang
and stay in D state.
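
For illustration only, a minimal sketch of the pattern that causes the hang; this is not the actual ceph-volume code and the function name is made up. Any stat-family syscall against the mount point of a hard-mounted network filesystem whose server is unreachable never returns, leaving the process in uninterruptible sleep (D state):

    import os

    def scan_proc_mounts():
        """Illustrative only: map devices to their mount points."""
        mounts = {}
        with open('/proc/mounts') as f:
            for line in f:
                fields = line.split()
                if len(fields) < 3:
                    continue
                device, mountpoint = fields[0], fields[1]
                # os.path.exists() calls stat() on the mount point; against a
                # stale NFS/CephFS mount whose server is gone, that stat()
                # never returns and the process sits in D state.
                if os.path.exists(mountpoint):
                    mounts.setdefault(device, []).append(mountpoint)
        return mounts

Reading /proc/mounts itself is harmless; it is only touching the mount point of the dead filesystem that blocks.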

Downstream Context:
In our environment, ceph orch upgrade was stuck indefinitely. On examination we found that 1 of 12 nodes might have had stale CephFS mounts, which caused operations touching those mounts to hang (df -h, df -l, strace -o df.errors df). The upgrade blockage is likely due to the same root cause, since both the ceph-volume inventory check and ceph orch upgrade were blocked.
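
For diagnosis, a hypothetical helper (not part of ceph-volume, cephadm, or any Red Hat tooling) that identifies stale mount points without hanging the checker itself: because a task blocked in D state cannot be interrupted by signals, each probe runs in a child process and is simply abandoned after a timeout.

    import multiprocessing
    import os

    def _probe(path):
        # Blocks indefinitely if the backing server is unreachable.
        os.statvfs(path)

    def find_stale_mounts(timeout=5):
        mountpoints = []
        with open('/proc/mounts') as f:
            for line in f:
                fields = line.split()
                if len(fields) >= 2:
                    mountpoints.append(fields[1])
        stale = []
        for mountpoint in mountpoints:
            p = multiprocessing.Process(target=_probe, args=(mountpoint,))
            p.start()
            p.join(timeout)
            if p.is_alive():
                # The child is stuck in an uninterruptible syscall; terminate()
                # sends SIGTERM, which it may never receive, but we have
                # already learned which mount point is unresponsive.
                p.terminate()
                stale.append(mountpoint)
        return stale

    if __name__ == '__main__':
        print(find_stale_mounts())

On the affected node, a check like this would report the stale CephFS mount points, whereas df -h simply hangs.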

Contextual Steps to Reproduce:
1. Configure 5.x ceph cluster
2. Have some stale mounts on one of the cluster nodes
3. Run ceph orch upgrade; observe that the cluster does not get upgraded and gives no clue why, and check that ceph-volume inventory is stuck.

Version-Release number of selected component (if applicable):
5.3
16.2.10-75

How reproducible:
Once


Actual results:
ceph-volume inventory gets stuck.

Expected results:
ceph-volume should avoid touching stale mounts instead of hanging.
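
One way to get this behavior, sketched here as an assumption rather than a description of the actual upstream fix, is to classify /proc/mounts entries by filesystem type and never stat mount points that belong to network filesystems (the type list below is illustrative, not exhaustive):

    NETWORK_FS = {'nfs', 'nfs4', 'cifs', 'smbfs', 'ceph', 'fuse.ceph-fuse', 'glusterfs'}

    def scan_local_mounts():
        """Illustrative only: like the earlier sketch, but skips network filesystems."""
        mounts = {}
        with open('/proc/mounts') as f:
            for line in f:
                fields = line.split()
                if len(fields) < 3:
                    continue
                device, mountpoint, fstype = fields[:3]
                if fstype in NETWORK_FS:
                    continue  # never stat these; a dead server would hang us
                mounts.setdefault(device, []).append(mountpoint)
        return mounts

Since the inventory only cares about local block devices, skipping network filesystem entries outright avoids the blocking syscall entirely.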

Additional info:
Fix is already present in quincy.

Comment 1 Vasishta 2022-12-09 03:23:31 UTC
The fix has also been backported to pacific upstream; this BZ is a tracker for downstream inclusion of the fix.
Since this issue is one of the reasons the upgrade process gets blocked, I created this tracker.

[Workaround is to reboot the affected node; will try and update further.]

Comment 9 Manisha Saini 2023-02-09 09:48:41 UTC
Hi @Guillaume Abrioux, could you please let us know the verification steps for this?

From the description, it looks like we need to upgrade a cluster that has stale mounts in /proc/mounts. I have a few questions:

1. Does verification of this BZ require an upgrade, or can it be tested some other way?

2. If it needs to be tested with an upgrade, how do we create a stale entry for a CephFS volume for verification?

3. Does the upgrade need to be performed from 5.3z1 to 6.0 for verification, and to reproduce this issue do we need to upgrade from 5.3 (LIVE) to 5.3z1 builds?

Comment 14 Manisha Saini 2023-02-14 10:23:42 UTC
Based on comment #11, comment #12, and comment #13, moving this BZ to the verified state.

Comment 15 errata-xmlrpc 2023-02-28 10:06:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 5.3 Bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:0980