Bug 2190503 - [Tracker #15887] Linux reboot / shutdown hung by CephFS
Summary: [Tracker #15887] Linux reboot / shutdown hung by CephFS
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Venky Shankar
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-04-28 15:08 UTC by Manny
Modified: 2023-08-09 16:37 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-05-17 01:31:27 UTC
Embargoed:


Links
System                     ID     Private  Priority  Status  Summary  Last Updated
Ceph Project Bug Tracker   15887  0        None      None    None     2023-04-28 16:55:49 UTC

Description Manny 2023-04-28 15:08:09 UTC
Description of problem (please be as detailed as possible and provide log snippets):  [Tracker #15887] Linux reboot / shutdown hung by CephFS

First off, someone smarter than me needs to ensure I have the correct Component / Subcomponent.

From Upstream Tracker:
~~~
Description

1) Mount CephFS on a client
2) Shut down the OSD+MON node or make it unreachable
3) While the client is accessing the mount (a simple ls on a dir), reboot the client
   It will be stuck forever until it can reach the Ceph nodes, unless a hard reset is done

Expected behavior:
Reboot should work when ceph nodes are not reachable
~~~
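
For context (this is not part of the upstream tracker text), the quoted steps translate roughly into the commands below. The mount point, monitor address, secret, and iptables rule are illustrative placeholders, not values from this case:

~~~
# Illustrative reproduction sketch -- all paths, IPs, and the secret are placeholders.

# 1) Mount CephFS on the client (kernel client)
mount -t ceph 192.168.65.10:6789:/ /mnt/cephfs -o name=admin,secret=<client-key>

# 2) Make the OSD/MON nodes unreachable from the client
#    (power them off, or simply drop the traffic on the client side)
iptables -A OUTPUT -d 192.168.65.0/24 -j DROP

# 3) Reboot the client while something is accessing the mount
ls /mnt/cephfs &
reboot
# The shutdown hangs while libceph keeps retrying the unreachable OSDs/MONs
# (the "connect error" / "error -101" messages shown in dmesg below).
~~~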

We also see this reported on GitHub: https://github.com/rook/rook/issues/2517

That GitHub issue is now closed, but note that people keep commenting on it that this is still a problem. I have worked cases where this issue came up, and I was surprised to find neither a BZ nor a KCS article for it. I will fix the missing KCS article after finishing this BZ.

The main point here is that customers are bitter about this issue. Imagine hitting a problem that prompts you to reboot a node: you are already having a "bad day", and now even the reboot is a problem.

The case I'm working (SFDC #03494123) has a must-gather; feel free to look at that.

This is what is being spewed in `dmesg -T` while the node is hung during the Linux shutdown:

~~~
[Thu Apr 27 17:10:56 2023] libceph: osd9 (1)192.168.65.51:6801 connect error
[Thu Apr 27 17:10:56 2023] libceph: connect (1)192.168.65.62:6801 error -101
[Thu Apr 27 17:10:56 2023] libceph: osd0 (1)192.168.65.62:6801 connect error
[Thu Apr 27 17:10:56 2023] libceph: connect (1)192.168.65.72:6801 error -101
[Thu Apr 27 17:10:56 2023] libceph: osd16 (1)192.168.65.72:6801 connect error
[Thu Apr 27 17:10:57 2023] libceph: connect (1)192.168.65.46:6801 error -101
[Thu Apr 27 17:10:57 2023] libceph: osd2 (1)192.168.65.46:6801 connect error
[Thu Apr 27 17:10:57 2023] libceph: connect (1)192.168.65.76:6801 error -101
[Thu Apr 27 17:10:57 2023] libceph: osd22 (1)192.168.65.76:6801 connect error
[Thu Apr 27 17:10:58 2023] libceph: connect (1)192.168.65.77:6801 error -101
[Thu Apr 27 17:10:58 2023] libceph: osd3 (1)192.168.65.77:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.74:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd14 (1)192.168.65.74:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.52:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd12 (1)192.168.65.52:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.38:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd18 (1)192.168.65.38:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.49:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd8 (1)192.168.65.49:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.56:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd23 (1)192.168.65.56:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.66:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd6 (1)192.168.65.66:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.44:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd7 (1)192.168.65.44:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.11:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd19 (1)192.168.65.11:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.3:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd1 (1)192.168.65.3:6801 connect error
[Thu Apr 27 17:11:00 2023] libceph: connect (1)192.168.65.80:6801 error -101
[Thu Apr 27 17:11:00 2023] libceph: osd10 (1)192.168.65.80:6801 connect error
[Thu Apr 27 17:11:01 2023] libceph: connect (1)172.30.54.55:6789 error -101
[Thu Apr 27 17:11:01 2023] libceph: mon1 (1)172.30.54.55:6789 connect error
[Thu Apr 27 17:11:02 2023] libceph: connect (1)192.168.65.45:6801 error -101
[Thu Apr 27 17:11:02 2023] libceph: osd21 (1)192.168.65.45:6801 connect error
[Thu Apr 27 17:11:02 2023] libceph: connect (1)192.168.65.60:6801 error -101
[Thu Apr 27 17:11:02 2023] libceph: osd13 (1)192.168.65.60:6801 connect error
[Thu Apr 27 17:11:02 2023] libceph: connect (1)192.168.65.43:6801 error -101
[Thu Apr 27 17:11:02 2023] libceph: osd11 (1)192.168.65.43:6801 connect error
[Thu Apr 27 17:11:02 2023] libceph: connect (1)192.168.65.42:6801 error -101
[Thu Apr 27 17:11:02 2023] libceph: osd5 (1)192.168.65.42:6801 connect error
[Thu Apr 27 17:11:04 2023] libceph: connect (1)192.168.65.61:6801 error -101
[Thu Apr 27 17:11:04 2023] libceph: osd17 (1)192.168.65.61:6801 connect error
~~~
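
Not part of the original report, but for reference: error -101 is ENETUNREACH, so libceph is endlessly retrying OSD/MON peers it cannot route to. As a rough, unverified mitigation sketch (the mount point is a placeholder), force/lazy-unmounting the CephFS mount before issuing the reboot may let the shutdown proceed without waiting on libceph:

~~~
# Illustrative mitigation sketch -- /mnt/cephfs is a placeholder mount point.

# Show kernel CephFS mounts on the client
mount -t ceph

# Try a forced unmount first; fall back to a lazy unmount, which detaches
# the mount from the namespace immediately even if it is still busy
umount -f /mnt/cephfs || umount -l /mnt/cephfs

# With the mount detached, the reboot is no longer blocked by libceph retries
systemctl reboot
~~~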


Version of all relevant components (if applicable):

-bash-4.2$ cat ./0010-ocs-must-gather-apr21.tar.gz/must-gather.local.1973180322653989223/registry-redhat-io-odf4-ocs-must-gather-rhel8-sha256-38db379f113fcc5a12a9801926ac0db55a4e613311cd13af1f3c373951b5de6b/namespaces/openshift-storage/oc_output/csv
NAME                                           DISPLAY                       VERSION                 REPLACES                             PHASE
cert-manager.v1.11.0                           cert-manager                  1.11.0                  cert-manager.v1.10.2                 Succeeded
cert-utils-operator.v1.3.10                    Cert Utils Operator           1.3.10                  cert-utils-operator.v1.3.9           Succeeded
devworkspace-operator.v0.19.1-0.1679521112.p   DevWorkspace Operator         0.19.1+0.1679521112.p   devworkspace-operator.v0.19.1        Succeeded
mcg-operator.v4.11.4                           NooBaa Operator               4.11.4                  mcg-operator.v4.11.3                 Succeeded
node-maintenance-operator.v5.0.0               Node Maintenance Operator     5.0.0                   node-maintenance-operator.v4.11.1    Succeeded
ocs-operator.v4.10.11                          OpenShift Container Storage   4.10.11                 ocs-operator.v4.10.10                Succeeded
odf-csi-addons-operator.v4.10.11               CSI Addons                    4.10.11                 odf-csi-addons-operator.v4.10.10     Succeeded
odf-operator.v4.10.11                          OpenShift Data Foundation     4.10.11                 odf-operator.v4.10.10                Succeeded
volume-expander-operator.v0.3.6                Volume Expander Operator      0.3.6                   volume-expander-operator.v0.3.5      Succeeded
web-terminal.v1.6.0                            Web Terminal                  1.6.0                   web-terminal.v1.5.0-0.1657220207.p   Succeeded

-bash-4.2$ find ./ -type f -iname get_clusterversion
./0010-ocs-must-gather-apr21.tar.gz/must-gather.local.1973180322653989223/registry-redhat-io-odf4-ocs-must-gather-rhel8-sha256-38db379f113fcc5a12a9801926ac0db55a4e613311cd13af1f3c373951b5de6b/cluster-scoped-resources/oc_output/get_clusterversion
-bash-4.2$ 
-bash-4.2$ cat ./0010-ocs-must-gather-apr21.tar.gz/must-gather.local.1973180322653989223/registry-redhat-io-odf4-ocs-must-gather-rhel8-sha256-38db379f113fcc5a12a9801926ac0db55a4e613311cd13af1f3c373951b5de6b/cluster-scoped-resources/oc_output/get_clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.4    True        False         225d    Error while reconciling 4.11.4: some cluster operators have not yet rolled out




Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:  A Linux reboot hangs due to CephFS not being able to reach an OSD


Expected results:  A Linux reboot *would not* hang due to CephFS not being able to reach an OSD


Additional info:

