Bug 1973256

Summary: [Tracker for BZ #1975608] [Mon Recovery testing(bz1965768)] After replacing degraded cephfs with new cephfs, the cephfs app-pod created before mon corruption is not accessible
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Persona non grata <nobody+410372>
Component: ceph
Assignee: Patrick Donnelly <pdonnell>
Status: CLOSED ERRATA
QA Contact: Parikshith <pbyregow>
Severity: urgent
Priority: unspecified
Version: 4.6
CC: assingh, bkunal, bniver, hchiramm, hnallurv, jdurgin, madam, mrajanna, muagarwa, nojha, ocs-bugs, odf-bz-bot, pdhange, pdonnell, srangana, vumrao
Target Milestone: ---
Keywords: AutomationBackLog
Target Release: ODF 4.9.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: v4.9.0-164.ci
Doc Type: No Doc Update
Last Closed: 2021-12-13 17:44:31 UTC
Type: Bug
Bug Depends On: 1975608, 2002314, 2003219, 2004666    
Bug Blocks:    

Description Persona non grata 2021-06-17 13:53:41 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
Following bz1965768, we corrupted the DBs of all 3 MONs and then recovered them (the MONs recovered successfully). The rbd app-pods created before the mon DB corruption came back running and the old data was retrieved.
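
For context, the recovery article presumably follows the general shape of the upstream "rebuild the monitor store from OSDs" procedure. The sketch below is illustrative only; paths, OSD IDs and mon names are examples, and the authoritative ODF-specific steps are in https://access.redhat.com/solutions/6100031/.

```
# Illustrative sketch only (assumed upstream-style mon store rebuild); the
# supported ODF steps are in https://access.redhat.com/solutions/6100031/.
# Paths, OSD IDs and mon names below are examples, not from this cluster.

# 1. With the OSDs stopped, collect cluster maps from every OSD into a temp store.
for osd in /var/lib/ceph/osd/ceph-*; do
  ceph-objectstore-tool --data-path "$osd" \
    --op update-mon-db --mon-store-path /tmp/mon-store
done

# 2. Rebuild the monitor store from the collected maps using the admin keyring.
ceph-monstore-tool /tmp/mon-store rebuild -- --keyring /etc/ceph/ceph.client.admin.keyring

# 3. Back up the corrupted store on a mon and move the rebuilt one into place.
mv /var/lib/ceph/mon/ceph-a/store.db /var/lib/ceph/mon/ceph-a/store.db.corrupted
cp -r /tmp/mon-store/store.db /var/lib/ceph/mon/ceph-a/store.db
```
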
In the case of cephfs, the cluster was left in:
 
    health: HEALTH_ERR
            1 filesystem is offline
            1 filesystem is online with fewer MDS than max_mds
            3 daemons have recently crashed
 
Later, we removed the existing cephfs and created a new cephfs using https://access.redhat.com/solutions/5441711, after which ceph health was OK.
However, we could not get the old cephfs app-pod (created before the mon DB corruption) back.
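
For reference, the cephfs replacement amounts to something like the following. This is a sketch only: the authoritative steps are in https://access.redhat.com/solutions/5441711, and the filesystem/pool names shown are the usual ODF defaults (assumed here, not taken from this cluster).

```
# Sketch only; follow https://access.redhat.com/solutions/5441711 for the real steps.
# Filesystem/pool names are the usual ODF defaults and may differ.
FS=ocs-storagecluster-cephfilesystem

# Take the damaged filesystem down and remove it (pools and their data are kept).
ceph fs fail "$FS"
ceph fs rm "$FS" --yes-i-really-mean-it

# Recreate the filesystem on the existing metadata/data pools.
# Note: the recreated filesystem gets a NEW fscid, which is exactly why the
# pre-existing CSI volumes (which still reference the old fscid) stop mounting.
ceph fs new "$FS" \
  ocs-storagecluster-cephfilesystem-metadata \
  ocs-storagecluster-cephfilesystem-data0 --force
```
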
From cephfs app-pod describe:

Events:
  Type     Reason       Age                    From     Message
  ----     ------       ----                   ----     -------
  Warning  FailedMount  9m1s (x502 over 19h)   kubelet  Unable to attach or mount volumes: unmounted volumes=[fedora-vol], unattached volumes=[fedora-vol]: timed out waiting for the condition
  Warning  FailedMount  3m37s (x565 over 19h)  kubelet  MountVolume.MountDevice failed for volume "pvc-d9d90f00-3222-4259-941b-271888506638" : rpc error: code = Internal desc = pool not found: fscID (1) not found in Ceph cluster

Creation of new cephfs PVCs and pods works fine.

Patrick analysed the cluster and suggested:
```
The CephFS recovery completed okay using [1] but old PVCs won't bind because ceph-csi remembers the old fscid (unique integer assigned to file systems) when remounting. A procedure needs to be in place to update that (new ceph-csi BZ TBC).

[1] https://access.redhat.com/solutions/5441711
```
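
To see the mismatch Patrick describes, the current fscid can be read from ceph fs dump, while the fscid ceph-csi recorded for the old volume is embedded in the PV's CSI volumeHandle. The volumeHandle parsing below assumes the standard ceph-csi composite volume ID layout (a 16-hex-digit filesystem ID after the cluster ID); that layout is an assumption on my part, not something stated in this bug.

```
# Current fscid of the (re)created filesystem,
# e.g. "Filesystem 'ocs-storagecluster-cephfilesystem' (2)"
ceph fs dump | grep "^Filesystem"

# fscid ceph-csi recorded for the old volume, embedded in the PV's volumeHandle
# (assumption: standard ceph-csi composite volume ID; the 16-hex-digit segment
# after the cluster ID is the filesystem ID).
oc get pv pvc-d9d90f00-3222-4259-941b-271888506638 \
  -o jsonpath='{.spec.csi.volumeHandle}{"\n"}'
# e.g. 0001-0011-openshift-storage-0000000000000001-<uuid>  ->  old fscid 1
```
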
Version of all relevant components (if applicable):

ocs-operator.v4.6.4-323.ci

ocp 4.6.0-0.nightly-2021-06-16-061653

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, old cephfs pod data is not accessible

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Can this issue be reproduced?
2/2

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. As in https://bugzilla.redhat.com/show_bug.cgi?id=1965768, corrupt the MON DBs, then perform the recovery procedure from https://access.redhat.com/solutions/6100031/
2. After MON recovery, since CephFS was offline, perform the CephFS recovery from https://access.redhat.com/solutions/5441711
3. The new CephFS becomes active, but the old cephfs app-pod data cannot be accessed (see the quick check after this list)
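
A quick check for step 3 (command sketch only; the pod name is a placeholder):

```
# New filesystem is active again and the cluster reports healthy
ceph fs status
ceph health detail

# Old app-pod is still stuck; its events show the stale fscID error
oc describe pod <old-cephfs-app-pod> | grep -A2 FailedMount
```
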


Actual results:
Old cephfs app-pod is stuck in ContainerCreating state:

Events:
  Type     Reason       Age                    From     Message
  ----     ------       ----                   ----     -------
  Warning  FailedMount  9m1s (x502 over 19h)   kubelet  Unable to attach or mount volumes: unmounted volumes=[fedora-vol], unattached volumes=[fedora-vol]: timed out waiting for the condition
  Warning  FailedMount  3m37s (x565 over 19h)  kubelet  MountVolume.MountDevice failed for volume "pvc-d9d90f00-3222-4259-941b-271888506638" : rpc error: code = Internal desc = pool not found: fscID (1) not found in Ceph cluster

Expected results:
The old cephfs app-pod should be running and the old data should be retrievable from the pod.

Additional info:

Comment 24 Mudit Agarwal 2021-09-20 07:54:40 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1975608 is fixed in 5.0z1
We will get a fix in ODF as soon as there is a ceph container build with the fix.

Comment 39 errata-xmlrpc 2021-12-13 17:44:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:5086