Description of problem (please be detailed as possible and provide log snippets):

[Tracker #15887] Linux reboot / shutdown hung by CephFS

First off, someone smarter than me needs to ensure I have the correct Component / Subcomponent.

From the upstream tracker:
~~~
Description
1) Mount cephfs on client
2) Shutdown osd+mon node or make it not reachable
3) While client is accessing the mount (simple ls on dir), reboot the client

It will be stuck forever until it can reach the ceph nodes or unless a hard reset is done.

Expected behavior: Reboot should work when ceph nodes are not reachable
~~~
(A shell sketch of these steps, with placeholder values, is included below after the version details.)

We also see this on GitHub: https://github.com/rook/rook/issues/2517
That issue is now closed, but note that people keep commenting on it because the problem still occurs.

I have worked cases where this issue existed and was surprised to see no BZ and no KCS for it. I will fix the missing KCS after finishing this BZ.

The main point here: customers are bitter about this issue. Imagine having a problem that prompts you to reboot a node; you are already having a "bad day" and now even the reboot is a problem.

The case I'm working (SFDC #03494123) has a must-gather attached, feel free to look at that.

This is what is being spewed in `dmesg -T` while the node is hung in the Linux shutdown:
~~~
[Thu Apr 27 17:10:56 2023] libceph: osd9 (1)192.168.65.51:6801 connect error
[Thu Apr 27 17:10:56 2023] libceph: connect (1)192.168.65.62:6801 error -101
[Thu Apr 27 17:10:56 2023] libceph: osd0 (1)192.168.65.62:6801 connect error
[Thu Apr 27 17:10:56 2023] libceph: connect (1)192.168.65.72:6801 error -101
[Thu Apr 27 17:10:56 2023] libceph: osd16 (1)192.168.65.72:6801 connect error
[Thu Apr 27 17:10:57 2023] libceph: connect (1)192.168.65.46:6801 error -101
[Thu Apr 27 17:10:57 2023] libceph: osd2 (1)192.168.65.46:6801 connect error
[Thu Apr 27 17:10:57 2023] libceph: connect (1)192.168.65.76:6801 error -101
[Thu Apr 27 17:10:57 2023] libceph: osd22 (1)192.168.65.76:6801 connect error
[Thu Apr 27 17:10:58 2023] libceph: connect (1)192.168.65.77:6801 error -101
[Thu Apr 27 17:10:58 2023] libceph: osd3 (1)192.168.65.77:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.74:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd14 (1)192.168.65.74:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.52:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd12 (1)192.168.65.52:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.38:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd18 (1)192.168.65.38:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.49:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd8 (1)192.168.65.49:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.56:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd23 (1)192.168.65.56:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.66:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd6 (1)192.168.65.66:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.44:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd7 (1)192.168.65.44:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.11:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd19 (1)192.168.65.11:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.3:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd1 (1)192.168.65.3:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.80:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd10 (1)192.168.65.80:6801 connect error
[Thu Apr 27 17:11:01 2023] libceph: connect (1)172.30.54.55:6789 error -101
[Thu Apr 27 17:11:01 2023] libceph: mon1 (1)172.30.54.55:6789 connect error
[Thu Apr 27 17:11:02 2023] libceph: connect (1)192.168.65.45:6801 error -101
[Thu Apr 27 17:11:02 2023] libceph: osd21 (1)192.168.65.45:6801 connect error
[Thu Apr 27 17:11:02 2023] libceph: connect (1)192.168.65.60:6801 error -101
[Thu Apr 27 17:11:02 2023] libceph: osd13 (1)192.168.65.60:6801 connect error
[Thu Apr 27 17:11:02 2023] libceph: connect (1)192.168.65.43:6801 error -101
[Thu Apr 27 17:11:02 2023] libceph: osd11 (1)192.168.65.43:6801 connect error
[Thu Apr 27 17:11:02 2023] libceph: connect (1)192.168.65.42:6801 error -101
[Thu Apr 27 17:11:02 2023] libceph: osd5 (1)192.168.65.42:6801 connect error
[Thu Apr 27 17:11:04 2023] libceph: connect (1)192.168.65.61:6801 error -101
[Thu Apr 27 17:11:04 2023] libceph: osd17 (1)192.168.65.61:6801 connect error
~~~

Version of all relevant components (if applicable):
~~~
-bash-4.2$ cat ./0010-ocs-must-gather-apr21.tar.gz/must-gather.local.1973180322653989223/registry-redhat-io-odf4-ocs-must-gather-rhel8-sha256-38db379f113fcc5a12a9801926ac0db55a4e613311cd13af1f3c373951b5de6b/namespaces/openshift-storage/oc_output/csv
NAME                                           DISPLAY                       VERSION                 REPLACES                             PHASE
cert-manager.v1.11.0                           cert-manager                  1.11.0                  cert-manager.v1.10.2                 Succeeded
cert-utils-operator.v1.3.10                    Cert Utils Operator           1.3.10                  cert-utils-operator.v1.3.9           Succeeded
devworkspace-operator.v0.19.1-0.1679521112.p   DevWorkspace Operator         0.19.1+0.1679521112.p   devworkspace-operator.v0.19.1        Succeeded
mcg-operator.v4.11.4                           NooBaa Operator               4.11.4                  mcg-operator.v4.11.3                 Succeeded
node-maintenance-operator.v5.0.0               Node Maintenance Operator     5.0.0                   node-maintenance-operator.v4.11.1    Succeeded
ocs-operator.v4.10.11                          OpenShift Container Storage   4.10.11                 ocs-operator.v4.10.10                Succeeded
odf-csi-addons-operator.v4.10.11               CSI Addons                    4.10.11                 odf-csi-addons-operator.v4.10.10     Succeeded
odf-operator.v4.10.11                          OpenShift Data Foundation     4.10.11                 odf-operator.v4.10.10                Succeeded
volume-expander-operator.v0.3.6                Volume Expander Operator      0.3.6                   volume-expander-operator.v0.3.5      Succeeded
web-terminal.v1.6.0                            Web Terminal                  1.6.0                   web-terminal.v1.5.0-0.1657220207.p   Succeeded

-bash-4.2$ find ./ -type f -iname get_clusterversion
./0010-ocs-must-gather-apr21.tar.gz/must-gather.local.1973180322653989223/registry-redhat-io-odf4-ocs-must-gather-rhel8-sha256-38db379f113fcc5a12a9801926ac0db55a4e613311cd13af1f3c373951b5de6b/cluster-scoped-resources/oc_output/get_clusterversion
-bash-4.2$
-bash-4.2$ cat ./0010-ocs-must-gather-apr21.tar.gz/must-gather.local.1973180322653989223/registry-redhat-io-odf4-ocs-must-gather-rhel8-sha256-38db379f113fcc5a12a9801926ac0db55a4e613311cd13af1f3c373951b5de6b/cluster-scoped-resources/oc_output/get_clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.4    True        False         225d    Error while reconciling 4.11.4: some cluster operators have not yet rolled out
~~~

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.
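Restating the upstream reproduction steps quoted at the top as shell commands, here is a minimal sketch assuming a kernel CephFS mount; the monitor address, secret file, mount point, and the subnet being blocked are placeholders, not values taken from this cluster:
~~~
# 1) Mount CephFS on the client (monitor address, secret file and mount point are placeholders)
mount -t ceph 192.0.2.10:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret

# 2) Make the mon/OSD nodes unreachable, e.g. by dropping all traffic to the Ceph subnet
iptables -A OUTPUT -d 192.0.2.0/24 -j DROP

# 3) Access the mount, then try to reboot the client -- this is where the shutdown hangs
ls /mnt/cephfs
reboot
~~~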
Actual results:
A Linux reboot hangs due to CephFS not being able to reach an OSD.

Expected results:
A Linux reboot *would not* hang due to CephFS not being able to reach an OSD.

Additional info:
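If it helps while triaging, these are generic checks (not taken from the must-gather) to confirm a node is in this state: kernel CephFS mounts still attached, plus the libceph connect errors shown above.
~~~
# List kernel CephFS mounts still attached on the node
findmnt -t ceph

# Look for the libceph connect errors that accompany the hang
dmesg -T | grep libceph | tail -n 20
~~~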
The customer (Nokia) is running into this issue, will not accept this behavior, and is requesting a code change or fix for case 03587087.
Case #03587087. Hello Venky! I know you are busy and I don't want to take too much of your time, but this customer (Nokia) does not want any workaround or temporary fix. They want to know why shutdown takes more than 20 minutes on v4.13.6 after they upgraded from v4.13.4, where it took 10 minutes. It was the same when they were on v4.12 (10 minutes). What has changed? As a newcomer in the company I really don't know what else to do. To make sure I was on the right path, I reached out to some senior teammates to help with this case, but we could not find what was causing the issue; I then reached out to Shift, and they asked me to contact you to explain the situation. I also had some conversations with the TAM for help. I know this case is closed as 'WONTFIX', but is there any way we can take a look at it one more time? If not, what exactly do you want me to tell the customer so they are okay with my answer? I have all the documents (sosreports from different nodes, ...). So please advise me in the right direction; I just want to learn. I saw your previous comment where you said "one needs to do a `umount -f` to throw away the page cache before initiating a reboot." Is that the solution that I need to tell the customer?
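For reference, this is roughly what that `umount -f` suggestion would look like on the client; the mount point is a placeholder, and the lazy-unmount fallback is my own addition, not part of the original comment:
~~~
# Force-unmount the CephFS mount so the client throws away its cached state,
# falling back to a lazy detach if the mount point is still busy,
# then reboot once nothing is left holding the mount.
umount -f /mnt/cephfs || umount -l /mnt/cephfs
reboot
~~~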