Description of problem: When a network mount is present in /proc/mounts but the corresponding server is down for any reason, this function hangs forever. In a cluster deployed with cephadm, the consequence is that ceph-volume inventory commands hang and stay in D state.

Downstream Context: In our environment, ceph orch upgrade was stuck indefinitely. Upon examining, we found that 1/12 nodes *might* have had some stale cephfs mounts, which caused stuck operations (df -h, df -l, strace -o df.errors df). The upgrade blocker could be due to the same reason, since the ceph-volume inventory check and ceph orch upgrade are both blocked.

Contextual Steps to Reproduce:
1. Configure a 5.x ceph cluster
2. Have some stale mounts on one of the cluster nodes
3. Try ceph orch upgrade; observe that the cluster doesn't get upgraded without giving a clue, and check that ceph-volume inventory gets stuck.

Version-Release number of selected component (if applicable): 5.3 16.2.10-75

How reproducible: Once

Actual results: ceph-volume inventory gets stuck.

Expected results: ceph-volume should avoid stale mounts.

Additional info: Fix is already present in quincy.
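For context, here is a minimal sketch of the kind of guard that avoids the hang. This is illustrative only, not the actual ceph-volume patch; the names mount_is_responsive and usable_mounts are hypothetical. The idea is to run the potentially blocking syscall (os.statvfs on the mount point) in a child process with a timeout, so a stale network mount is skipped instead of leaving the caller stuck in D state.

    import multiprocessing
    import os
    from queue import Empty


    def _probe(path, result_queue):
        # This is the call that can sleep in D state forever when the
        # backing server (NFS/CephFS) is unreachable.
        try:
            os.statvfs(path)
            result_queue.put(True)
        except OSError:
            result_queue.put(False)


    def mount_is_responsive(path, timeout=5):
        """Return True only if the mount point answers within `timeout` seconds."""
        result_queue = multiprocessing.Queue()
        proc = multiprocessing.Process(target=_probe, args=(path, result_queue))
        proc.start()
        proc.join(timeout)
        if proc.is_alive():
            # A child stuck in uninterruptible sleep may ignore SIGTERM and
            # linger until the server recovers; don't block the caller on it.
            proc.terminate()
            proc.join(timeout=1)
            return False
        try:
            return result_queue.get(timeout=1)
        except Empty:
            return False


    def usable_mounts(proc_mounts='/proc/mounts', timeout=5):
        """Parse /proc/mounts, skipping entries whose mount point does not respond."""
        mounts = []
        with open(proc_mounts) as f:
            for line in f:
                fields = line.split()
                if len(fields) < 2:
                    continue
                device, mount_point = fields[0], fields[1]
                if mount_is_responsive(mount_point, timeout):
                    mounts.append((device, mount_point))
        return mounts


    if __name__ == '__main__':
        for device, mount_point in usable_mounts():
            print(device, mount_point)

The trade-off is that a probe stuck in uninterruptible sleep cannot actually be killed; the sketch simply stops waiting for it, which is enough to keep an inventory-style scan from blocking on a dead mount.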
The fix has also been backported to pacific; this is a tracker for downstream inclusion of the fix. Since the issue is one of the reasons the upgrade process gets stuck, created this tracker. [Workaround is to reboot the node; will try and update further]
Hi @Guillaume Abrioux, could you please let us know the verification steps for the same? From the description, it looks like we need to upgrade the cluster with stale mounts in /proc/mounts. I have a few questions:
1. Does verification of this BZ require an upgrade, or can it be tested some other way?
2. If it needs to be tested with an upgrade, how do we create a stale entry for a cephfs volume for verification?
3. Does the upgrade need to be performed from 5.3z1 to 6.0 for verification, and to reproduce this issue do we need to perform an upgrade from 5.3 (LIVE) to 5.3z1 builds?
Based on comment #11, comment #12, and comment #13, moving this BZ to Verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat Ceph Storage 5.3 Bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:0980