Description of problem (please be as detailed as possible and provide log snippets):

[Tracker #15887] Linux reboot / shutdown hung by CephFS

First off, someone smarter than me needs to ensure I have the correct Component / Subcomponent.

From the upstream tracker:

~~~
Description

1) Mount cephfs on client
2) Shutdown osd+mon node or make it not reachable
3) While client is accessing the mount (simple ls on dir), reboot the client

It will be stuck forever until it can reach the ceph nodes, or unless a hard reset is done.

Expected behavior: Reboot should work when ceph nodes are not reachable
~~~

We also see this on GitHub: https://github.com/rook/rook/issues/2517

That GitHub issue is now closed, but notice that people keep updating it to complain that this is still an issue. I have worked cases where this issue existed, and I was surprised to find neither a BZ nor a KCS for it. I will fix the missing-KCS part after finishing this BZ.

The main point here: customers are bitter about this issue. Imagine hitting a problem that prompts you to reboot a node; you are already having a "bad day", and now even the reboot is a problem.

The case I'm working (SFDC #03494123) has a Must Gather attached; feel free to look at that.

This is what is being spewed in `dmesg -T` while the node is hung in the Linux shutdown (error -101 is -ENETUNREACH, network unreachable):

~~~
[Thu Apr 27 17:10:56 2023] libceph: osd9 (1)192.168.65.51:6801 connect error
[Thu Apr 27 17:10:56 2023] libceph: connect (1)192.168.65.62:6801 error -101
[Thu Apr 27 17:10:56 2023] libceph: osd0 (1)192.168.65.62:6801 connect error
[Thu Apr 27 17:10:56 2023] libceph: connect (1)192.168.65.72:6801 error -101
[Thu Apr 27 17:10:56 2023] libceph: osd16 (1)192.168.65.72:6801 connect error
[Thu Apr 27 17:10:57 2023] libceph: connect (1)192.168.65.46:6801 error -101
[Thu Apr 27 17:10:57 2023] libceph: osd2 (1)192.168.65.46:6801 connect error
[Thu Apr 27 17:10:57 2023] libceph: connect (1)192.168.65.76:6801 error -101
[Thu Apr 27 17:10:57 2023] libceph: osd22 (1)192.168.65.76:6801 connect error
[Thu Apr 27 17:10:58 2023] libceph: connect (1)192.168.65.77:6801 error -101
[Thu Apr 27 17:10:58 2023] libceph: osd3 (1)192.168.65.77:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.74:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd14 (1)192.168.65.74:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.52:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd12 (1)192.168.65.52:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.38:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd18 (1)192.168.65.38:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.49:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd8 (1)192.168.65.49:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.56:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd23 (1)192.168.65.56:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.66:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd6 (1)192.168.65.66:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.44:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd7 (1)192.168.65.44:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.11:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd19 (1)192.168.65.11:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.3:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd1 (1)192.168.65.3:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.80:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd10 (1)192.168.65.80:6801 connect error
[Thu Apr 27 17:11:01 2023] libceph: connect (1)172.30.54.55:6789 error -101
[Thu Apr 27 17:11:01 2023] libceph: mon1 (1)172.30.54.55:6789 connect error
[Thu Apr 27 17:11:02 2023] libceph: connect (1)192.168.65.45:6801 error -101
[Thu Apr 27 17:11:02 2023] libceph: osd21 (1)192.168.65.45:6801 connect error
[Thu Apr 27 17:11:02 2023] libceph: connect (1)192.168.65.60:6801 error -101
[Thu Apr 27 17:11:02 2023] libceph: osd13 (1)192.168.65.60:6801 connect error
[Thu Apr 27 17:11:02 2023] libceph: connect (1)192.168.65.43:6801 error -101
[Thu Apr 27 17:11:02 2023] libceph: osd11 (1)192.168.65.43:6801 connect error
[Thu Apr 27 17:11:02 2023] libceph: connect (1)192.168.65.42:6801 error -101
[Thu Apr 27 17:11:02 2023] libceph: osd5 (1)192.168.65.42:6801 connect error
[Thu Apr 27 17:11:04 2023] libceph: connect (1)192.168.65.61:6801 error -101
[Thu Apr 27 17:11:04 2023] libceph: osd17 (1)192.168.65.61:6801 connect error
~~~

Version of all relevant components (if applicable):

~~~
-bash-4.2$ cat ./0010-ocs-must-gather-apr21.tar.gz/must-gather.local.1973180322653989223/registry-redhat-io-odf4-ocs-must-gather-rhel8-sha256-38db379f113fcc5a12a9801926ac0db55a4e613311cd13af1f3c373951b5de6b/namespaces/openshift-storage/oc_output/csv
NAME                                           DISPLAY                       VERSION                 REPLACES                             PHASE
cert-manager.v1.11.0                           cert-manager                  1.11.0                  cert-manager.v1.10.2                 Succeeded
cert-utils-operator.v1.3.10                    Cert Utils Operator           1.3.10                  cert-utils-operator.v1.3.9           Succeeded
devworkspace-operator.v0.19.1-0.1679521112.p   DevWorkspace Operator         0.19.1+0.1679521112.p   devworkspace-operator.v0.19.1        Succeeded
mcg-operator.v4.11.4                           NooBaa Operator               4.11.4                  mcg-operator.v4.11.3                 Succeeded
node-maintenance-operator.v5.0.0               Node Maintenance Operator     5.0.0                   node-maintenance-operator.v4.11.1    Succeeded
ocs-operator.v4.10.11                          OpenShift Container Storage   4.10.11                 ocs-operator.v4.10.10                Succeeded
odf-csi-addons-operator.v4.10.11               CSI Addons                    4.10.11                 odf-csi-addons-operator.v4.10.10     Succeeded
odf-operator.v4.10.11                          OpenShift Data Foundation     4.10.11                 odf-operator.v4.10.10                Succeeded
volume-expander-operator.v0.3.6                Volume Expander Operator      0.3.6                   volume-expander-operator.v0.3.5      Succeeded
web-terminal.v1.6.0                            Web Terminal                  1.6.0                   web-terminal.v1.5.0-0.1657220207.p   Succeeded

-bash-4.2$ find ./ -type f -iname get_clusterversion
./0010-ocs-must-gather-apr21.tar.gz/must-gather.local.1973180322653989223/registry-redhat-io-odf4-ocs-must-gather-rhel8-sha256-38db379f113fcc5a12a9801926ac0db55a4e613311cd13af1f3c373951b5de6b/cluster-scoped-resources/oc_output/get_clusterversion
-bash-4.2$
-bash-4.2$ cat ./0010-ocs-must-gather-apr21.tar.gz/must-gather.local.1973180322653989223/registry-redhat-io-odf4-ocs-must-gather-rhel8-sha256-38db379f113fcc5a12a9801926ac0db55a4e613311cd13af1f3c373951b5de6b/cluster-scoped-resources/oc_output/get_clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.4    True        False         225d    Error while reconciling 4.11.4: some cluster operators have not yet rolled out
~~~

Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproduced?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce (per the upstream tracker; rough commands are in the sketch below):
1. Mount CephFS on a client with the kernel client.
2. Shut down the OSD+MON nodes, or otherwise make them unreachable from the client.
3. While the client is accessing the mount (a simple `ls` on a directory is enough), reboot the client.
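Purely for reference, this is roughly how I would expect to reproduce it on a plain RHEL node with the kernel CephFS client; the mount point, monitor address, secret, and subnet below are placeholders I picked, not values from the customer environment:

~~~
# Mount CephFS with the kernel client (placeholder monitor address and credentials)
mkdir -p /mnt/cephfs
mount -t ceph 192.168.65.10:6789:/ /mnt/cephfs -o name=admin,secret=<key>

# Make the ceph nodes unreachable, e.g. by dropping all traffic to the ceph public network
iptables -A OUTPUT -d 192.168.65.0/24 -j DROP

# Give the kernel client outstanding work against the now-unreachable cluster, then reboot
ls /mnt/cephfs &
systemctl reboot
~~~

At that point the reboot should hang and `dmesg -T` should show the same libceph "connect error -101" spam as above.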
Actual results:
A Linux reboot hangs due to CephFS not being able to reach an OSD.

Expected results:
A Linux reboot *would not* hang due to CephFS not being able to reach an OSD.

Additional info:
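In case it helps triage, this is roughly how I confirm on a still-responsive console that the shutdown is blocked in the ceph kernel client rather than somewhere else. These are generic RHEL commands, nothing ODF-specific, and `<pid>` is a placeholder for whichever task shows up as blocked:

~~~
# Dump blocked (uninterruptible) tasks to the kernel log and read them back
echo w > /proc/sysrq-trigger
dmesg -T | tail -n 200

# Or list D-state processes and the kernel function they are sleeping in;
# if this is the same hang, the stacks should point at ceph/libceph code paths
ps -eo pid,stat,wchan:40,cmd | awk '$2 ~ /D/'
cat /proc/<pid>/stack
~~~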