Bug 1973256 - [Tracker for BZ #1975608] [Mon Recovery testing(bz1965768)] After replacing degraded cephfs with new cephfs, the cephfs app-pod created before mon corruption is not accessible
Summary: [Tracker for BZ #1975608] [Mon Recovery testing(bz1965768)] After replacing degraded cephfs with new cephfs, the cephfs app-pod created before mon corruption is not accessible
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.9.0
Assignee: Patrick Donnelly
QA Contact: Parikshith
URL:
Whiteboard:
Depends On: 1975608 2002314 2003219 2004666
Blocks:
 
Reported: 2021-06-17 13:53 UTC by Persona non grata
Modified: 2023-08-09 16:37 UTC
CC: 16 users

Fixed In Version: v4.9.0-164.ci
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-12-13 17:44:31 UTC
Embargoed:




Links:
System                  ID              Private  Priority  Status  Summary                                            Last Updated
Red Hat Bugzilla        1975608         1        high      CLOSED  Allow recreating file system with specific fscid   2021-11-02 16:41:52 UTC
Red Hat Product Errata  RHSA-2021:5086  0        None      None    None                                               2021-12-13 17:44:50 UTC

Internal Links: 1975608

Description Persona non grata 2021-06-17 13:53:41 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
Following bz1965768, we corrupted the DBs of all 3 MONs and then recovered them (the MONs recovered successfully). The RBD app-pods created before the MON DB corruption came back running and their old data was retrieved.
In the case of CephFS, however, the cluster reported:
 
    health: HEALTH_ERR
            1 filesystem is offline
            1 filesystem is online with fewer MDS than max_mds
            3 daemons have recently crashed
 
Later, we removed the existing CephFS and created a new one using https://access.redhat.com/solutions/5441711, after which ceph health was OK.
However, we could not get the old CephFS app-pod (created before the MON DB corruption) back.
From the CephFS app-pod describe output:

Events:
  Type     Reason       Age                    From     Message
  ----     ------       ----                   ----     -------
  Warning  FailedMount  9m1s (x502 over 19h)   kubelet  Unable to attach or mount volumes: unmounted volumes=[fedora-vol], unattached volumes=[fedora-vol]: timed out waiting for the condition
  Warning  FailedMount  3m37s (x565 over 19h)  kubelet  MountVolume.MountDevice failed for volume "pvc-d9d90f00-3222-4259-941b-271888506638" : rpc error: code = Internal desc = pool not found: fscID (1) not found in Ceph cluster

Creation of new CephFS PVCs and pods works.

Patrick analysed the cluster and suggested:

```
The CephFS recovery completed okay using [1], but old PVCs won't bind because ceph-csi remembers the old fscid (the unique integer assigned to each file system) when remounting. A procedure needs to be in place to update that (new ceph-csi BZ TBC).

[1] https://access.redhat.com/solutions/5441711
```
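To see the mismatch concretely, the fscid encoded in the ceph-csi volume handle of the old PV can be compared with the fscid of the file system that exists after recovery. A minimal sketch, assuming the default ODF cluster ID "openshift-storage", the PV name from the event above, and ceph CLI access via the toolbox pod:

```
# The CephFS volume handle produced by ceph-csi carries the fscid the volume
# was provisioned against (shown here as a zero-padded 16-digit hex field;
# the exact layout is an assumption based on the error above).
oc get pv pvc-d9d90f00-3222-4259-941b-271888506638 \
  -o jsonpath='{.spec.csi.volumeHandle}{"\n"}'
# e.g. 0001-0011-openshift-storage-0000000000000001-<uuid>   <- old fscid 1

# fscid of the file system present after the recovery:
ceph fs dump | grep '^Filesystem'
# e.g. Filesystem 'ocs-storagecluster-cephfilesystem' (2)    <- new fscid 2
```

If the two numbers differ, ceph-csi keeps failing the mount with "fscID (<old>) not found in Ceph cluster".
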
Version of all relevant components (if applicable):

ocs-operator.v4.6.4-323.ci

ocp 4.6.0-0.nightly-2021-06-16-061653

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Yes, the old CephFS pod data is not accessible.

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
Yes, reproduced 2/2 times.

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. As in https://bugzilla.redhat.com/show_bug.cgi?id=1965768, corrupt the MON DBs, then perform the recovery procedure from https://access.redhat.com/solutions/6100031/
2. After the MON recovery, since CephFS is offline, perform the CephFS recovery from https://access.redhat.com/solutions/5441711
3. The new CephFS becomes active, but the data of the old CephFS app-pod cannot be accessed (see the check sketched below)
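
A quick way to confirm step 3, sketched with the default ODF file system name and hypothetical app pod/namespace names (fedora-app-pod, my-app-ns):

```
# The recreated file system itself looks healthy and active...
ceph -s
ceph fs status ocs-storagecluster-cephfilesystem

# ...while the pre-corruption app pod (hypothetical names) never gets past
# ContainerCreating and keeps logging FailedMount events:
oc get pod fedora-app-pod -n my-app-ns
oc describe pod fedora-app-pod -n my-app-ns | grep -A 1 FailedMount
```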


Actual results:
Old CephFS app-pod is stuck in ContainerCreating state:

Events:
  Type     Reason       Age                    From     Message
  ----     ------       ----                   ----     -------
  Warning  FailedMount  9m1s (x502 over 19h)   kubelet  Unable to attach or mount volumes: unmounted volumes=[fedora-vol], unattached volumes=[fedora-vol]: timed out waiting for the condition
  Warning  FailedMount  3m37s (x565 over 19h)  kubelet  MountVolume.MountDevice failed for volume "pvc-d9d90f00-3222-4259-941b-271888506638" : rpc error: code = Internal desc = pool not found: fscID (1) not found in Ceph cluster

Expected results:
The old CephFS app-pod should be running and the old data in the pod should be retrievable.

Additional info:

Comment 24 Mudit Agarwal 2021-09-20 07:54:40 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1975608 is fixed in 5.0z1
We will get a fix in ODF as soon as there is a ceph container build with the fix.
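
For reference, a hedged sketch of what the Ceph-side fix from bz1975608 ("Allow recreating file system with specific fscid") is meant to enable during the CephFS recovery: recreating the file system with its original fscid so that the volume handles already stored by ceph-csi resolve again. The file system and pool names below are the ODF defaults, the fscid (1) is the one from the error in this bug, and the exact option syntax should be confirmed against the `ceph fs new` help on the fixed build:

```
# Sketch only, not the published recovery procedure: recreate the file system
# reusing the original fscid (the --fscid option is the one added by the fix
# and is expected to require --force).
ceph fs new ocs-storagecluster-cephfilesystem \
    ocs-storagecluster-cephfilesystem-metadata \
    ocs-storagecluster-cephfilesystem-data0 \
    --force --fscid 1
```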

Comment 39 errata-xmlrpc 2021-12-13 17:44:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:5086

