Bug 2190503

Summary: [Tracker #15887] Linux reboot / shutdown hung by CephFS
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Manny <mcaldeir>
Component: ceph
ceph sub component: CephFS
Assignee: Venky Shankar <vshankar>
QA Contact: Elad <ebenahar>
Docs Contact:
Status: CLOSED WONTFIX
Severity: medium
Priority: unspecified
CC: bniver, gfarnum, hnallurv, muagarwa, ocs-bugs, odf-bz-bot, sostapov, vshankar, xiubli
Version: 4.11
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-05-17 01:31:27 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Manny 2023-04-28 15:08:09 UTC
Description of problem (please be as detailed as possible and provide log snippets):  [Tracker #15887] Linux reboot / shutdown hung by CephFS

First off, someone smarter than me needs to ensure I have the correct Component / Subcomponent.

From Upstream Tracker:
~~~
Description

1) Mount cephfs on client,
2) Shutdown osd+mon node or make it not reachable
3) While client is accessing the mount (simple ls on dir), reboot the client
   It will be stuck forever until it can reach the ceph nodes or unless hard reset is done

Expected behavior:
Reboot should work when ceph nodes are not reachable
~~~
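
For reference, a minimal reproduction sketch of the above using the kernel CephFS client (the MON address, secret file, and mount point below are placeholders, not values from this case):

~~~
# Hypothetical repro sketch; adjust MON address, credentials, and mount point.
# 1) Mount CephFS with the kernel client
mount -t ceph 192.168.65.10:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret

# 2) Make the Ceph MONs/OSDs unreachable from this client
#    (or shut down the mon+osd node itself)
iptables -A OUTPUT -p tcp --dport 6789 -j DROP
iptables -A OUTPUT -p tcp --dport 6800:7300 -j DROP

# 3) Access the mount so there is in-flight I/O, then reboot
ls /mnt/cephfs &
reboot   # the shutdown hangs until the cluster is reachable again or the node is hard reset
~~~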

We also see this reported upstream on GitHub: https://github.com/rook/rook/issues/2517

That GitHub issue is now closed, but note that people are still commenting on it to say this is still a problem. I have worked cases where this issue occurred, and I was surprised to find no BZ and no KCS article for it. I will fix the missing-KCS part after finishing this BZ.

The main point here is that customers are bitter about this issue. Imagine having a problem that prompts you to reboot a node: you are already having a "bad day", and now even the reboot is a problem.

The case I'm working (SFDC #03494123) has a Must Gather; feel free to look at that.

This is what is being spewed in `dmesg -T` while the node is hung during the Linux shutdown:

~~~
[Thu Apr 27 17:10:56 2023] libceph: osd9 (1)192.168.65.51:6801 connect error
[Thu Apr 27 17:10:56 2023] libceph: connect (1)192.168.65.62:6801 error -101
[Thu Apr 27 17:10:56 2023] libceph: osd0 (1)192.168.65.62:6801 connect error
[Thu Apr 27 17:10:56 2023] libceph: connect (1)192.168.65.72:6801 error -101
[Thu Apr 27 17:10:56 2023] libceph: osd16 (1)192.168.65.72:6801 connect error
[Thu Apr 27 17:10:57 2023] libceph: connect (1)192.168.65.46:6801 error -101
[Thu Apr 27 17:10:57 2023] libceph: osd2 (1)192.168.65.46:6801 connect error
[Thu Apr 27 17:10:57 2023] libceph: connect (1)192.168.65.76:6801 error -101
[Thu Apr 27 17:10:57 2023] libceph: osd22 (1)192.168.65.76:6801 connect error
[Thu Apr 27 17:10:58 2023] libceph: connect (1)192.168.65.77:6801 error -101
[Thu Apr 27 17:10:58 2023] libceph: osd3 (1)192.168.65.77:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.74:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd14 (1)192.168.65.74:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.52:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd12 (1)192.168.65.52:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.38:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd18 (1)192.168.65.38:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.49:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd8 (1)192.168.65.49:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.56:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd23 (1)192.168.65.56:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.66:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd6 (1)192.168.65.66:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.44:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd7 (1)192.168.65.44:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.11:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd19 (1)192.168.65.11:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.3:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd1 (1)192.168.65.3:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.80:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd10 (1)192.168.65.80:6801 connect error
[Thu Apr 27 17:11:01 2023] libceph: connect (1)172.30.54.55:6789 error -101
[Thu Apr 27 17:11:01 2023] libceph: mon1 (1)172.30.54.55:6789 connect error
[Thu Apr 27 17:11:02 2023] libceph: connect (1)192.168.65.45:6801 error -101
[Thu Apr 27 17:11:02 2023] libceph: osd21 (1)192.168.65.45:6801 connect error
[Thu Apr 27 17:11:02 2023] libceph: connect (1)192.168.65.60:6801 error -101
[Thu Apr 27 17:11:02 2023] libceph: osd13 (1)192.168.65.60:6801 connect error
[Thu Apr 27 17:11:02 2023] libceph: connect (1)192.168.65.43:6801 error -101
[Thu Apr 27 17:11:02 2023] libceph: osd11 (1)192.168.65.43:6801 connect error
[Thu Apr 27 17:11:02 2023] libceph: connect (1)192.168.65.42:6801 error -101
[Thu Apr 27 17:11:02 2023] libceph: osd5 (1)192.168.65.42:6801 connect error
[Thu Apr 27 17:11:04 2023] libceph: connect (1)192.168.65.61:6801 error -101
[Thu Apr 27 17:11:04 2023] libceph: osd17 (1)192.168.65.61:6801 connect error
~~~
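
The error -101 repeated in these messages is -ENETUNREACH ("Network is unreachable"), i.e. the kernel client keeps retrying MON/OSD connections it can no longer route to. A quick way to confirm the errno mapping on any Linux host:

~~~
python3 -c 'import errno, os; print(errno.errorcode[101], "-", os.strerror(101))'
# ENETUNREACH - Network is unreachable
~~~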


Version of all relevant components (if applicable):

-bash-4.2$ cat ./0010-ocs-must-gather-apr21.tar.gz/must-gather.local.1973180322653989223/registry-redhat-io-odf4-ocs-must-gather-rhel8-sha256-38db379f113fcc5a12a9801926ac0db55a4e613311cd13af1f3c373951b5de6b/namespaces/openshift-storage/oc_output/csv
NAME                                           DISPLAY                       VERSION                 REPLACES                             PHASE
cert-manager.v1.11.0                           cert-manager                  1.11.0                  cert-manager.v1.10.2                 Succeeded
cert-utils-operator.v1.3.10                    Cert Utils Operator           1.3.10                  cert-utils-operator.v1.3.9           Succeeded
devworkspace-operator.v0.19.1-0.1679521112.p   DevWorkspace Operator         0.19.1+0.1679521112.p   devworkspace-operator.v0.19.1        Succeeded
mcg-operator.v4.11.4                           NooBaa Operator               4.11.4                  mcg-operator.v4.11.3                 Succeeded
node-maintenance-operator.v5.0.0               Node Maintenance Operator     5.0.0                   node-maintenance-operator.v4.11.1    Succeeded
ocs-operator.v4.10.11                          OpenShift Container Storage   4.10.11                 ocs-operator.v4.10.10                Succeeded
odf-csi-addons-operator.v4.10.11               CSI Addons                    4.10.11                 odf-csi-addons-operator.v4.10.10     Succeeded
odf-operator.v4.10.11                          OpenShift Data Foundation     4.10.11                 odf-operator.v4.10.10                Succeeded
volume-expander-operator.v0.3.6                Volume Expander Operator      0.3.6                   volume-expander-operator.v0.3.5      Succeeded
web-terminal.v1.6.0                            Web Terminal                  1.6.0                   web-terminal.v1.5.0-0.1657220207.p   Succeeded

-bash-4.2$ find ./ -type f -iname get_clusterversion
./0010-ocs-must-gather-apr21.tar.gz/must-gather.local.1973180322653989223/registry-redhat-io-odf4-ocs-must-gather-rhel8-sha256-38db379f113fcc5a12a9801926ac0db55a4e613311cd13af1f3c373951b5de6b/cluster-scoped-resources/oc_output/get_clusterversion
-bash-4.2$ 
-bash-4.2$ cat ./0010-ocs-must-gather-apr21.tar.gz/must-gather.local.1973180322653989223/registry-redhat-io-odf4-ocs-must-gather-rhel8-sha256-38db379f113fcc5a12a9801926ac0db55a4e613311cd13af1f3c373951b5de6b/cluster-scoped-resources/oc_output/get_clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.4    True        False         225d    Error while reconciling 4.11.4: some cluster operators have not yet rolled out




Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Mount CephFS on a client.
2. Shut down the OSD+MON node, or otherwise make the Ceph nodes unreachable.
3. While the client is accessing the mount (a simple ls on a directory), reboot the client.


Actual results:  A Linux reboot hangs because the CephFS kernel client cannot reach the OSDs/MONs


Expected results:  A Linux reboot *would not* hang when CephFS cannot reach the OSDs/MONs


Additional info:
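
Not a fix, but a mitigation sometimes used in the field for hung network filesystems is to detach the CephFS mount before rebooting, so the shutdown sequence does not block on unmounting it. This is an untested sketch against this case (the mount point is a placeholder), and the unmount itself may still block if the client is waiting to flush dirty data:

~~~
# Untested mitigation sketch; /mnt/cephfs is a placeholder mount point.
umount -f /mnt/cephfs || umount -l /mnt/cephfs   # try a forced unmount, fall back to a lazy unmount
reboot
~~~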