[Tracker for BZ #1975608] [Mon Recovery testing (bz1965768)] After replacing degraded cephfs with new cephfs, the cephfs app-pod created before mon corruption is not accessible
Product:
[Red Hat Storage] Red Hat OpenShift Data Foundation
Description
Persona non grata 2021-06-17 13:53:41 UTC
Description of problem (please be as detailed as possible and provide log snippets):
As in bug bz1965768, we corrupted the DBs of all 3 MONs and recovered them (the MONs recovered successfully). We were able to get the RBD app-pods (created before the MON DB corruption) running and to retrieve the old data.
In the case of CephFS, the cluster was in the following state:
health: HEALTH_ERR
1 filesystem is offline
1 filesystem is online with fewer MDS than max_mds
3 daemons have recently crashed
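The state above can be inspected from the rook-ceph toolbox. A minimal sketch, assuming the default openshift-storage namespace and the usual app=rook-ceph-tools label (adjust for your deployment):
```
# Locate the toolbox pod (namespace and label are the usual OCS defaults, adjust as needed)
TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)

# Show the reasons behind HEALTH_ERR
oc -n openshift-storage exec "$TOOLS" -- ceph health detail

# Show the offline filesystem and MDS ranks
oc -n openshift-storage exec "$TOOLS" -- ceph fs status

# List the recently crashed daemons
oc -n openshift-storage exec "$TOOLS" -- ceph crash ls
```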
Later, we removed the existing CephFS and created a new CephFS using https://access.redhat.com/solutions/5441711; Ceph health then returned to OK.
However, we could not get the old CephFS app-pod (created before the MON DB corruption) back.
From cephfs app-pod describe:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 9m1s (x502 over 19h) kubelet Unable to attach or mount volumes: unmounted volumes=[fedora-vol], unattached volumes=[fedora-vol]: timed out waiting for the condition
Warning FailedMount 3m37s (x565 over 19h) kubelet MountVolume.MountDevice failed for volume "pvc-d9d90f00-3222-4259-941b-271888506638" : rpc error: code = Internal desc = pool not found: fscID (1) not found in Ceph cluster
Creation of new CephFS PVCs and pods is working.
Patrick analysed the cluster and suggested:
```
The CephFS recovery completed okay using [1] but old PVCs won't bind because ceph-csi remembers the old fscid (unique integer assigned to file systems) when remounting. A procedure needs to be in place to update that (new ceph-csi BZ TBC).
[1] https://access.redhat.com/solutions/5441711
```
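The error in the events above matches this analysis: the PV's CSI volume handle still references the fscid of the deleted filesystem, while the recreated filesystem gets a new one. A minimal diagnostic sketch, assuming the default openshift-storage namespace and toolbox label, and assuming the 16-character hex field inside the volume handle carries the filesystem/pool id (ceph-csi's internal encoding):
```
# fscid that ceph-csi will look up, embedded in the PV's volumeHandle
# (assumed to be the 16-character hex field, e.g. 0000000000000001 for fscid 1)
oc get pv pvc-d9d90f00-3222-4259-941b-271888506638 \
  -o jsonpath='{.spec.csi.volumeHandle}{"\n"}'

# fscid of the filesystem that exists after the recovery;
# the number in parentheses on the "Filesystem '<name>' (<id>)" line is the fscid
TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)
oc -n openshift-storage exec "$TOOLS" -- ceph fs dump | grep '^Filesystem'
```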
Version of all relevant components (if applicable):
ocs-operator.v4.6.4-323.ci
ocp 4.6.0-0.nightly-2021-06-16-061653
Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, old cephfs pod data is not accessible
Is there any workaround available to the best of your knowledge?
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3
Is this issue reproducible?
2/2
Can this issue be reproduced from the UI?
NA
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
1. As in https://bugzilla.redhat.com/show_bug.cgi?id=1965768, corrupt the MON DBs and perform the recovery procedure from https://access.redhat.com/solutions/6100031/
2. After MON recovery, since CephFS was offline, perform the CephFS recovery from https://access.redhat.com/solutions/5441711
3. The new CephFS is active, but the old CephFS app-pod's data cannot be accessed (a verification sketch follows this list)
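A rough verification sketch for steps 2 and 3, assuming the default openshift-storage namespace and toolbox label; the pod name fedora-cephfs-pod is a placeholder for the pre-corruption app-pod:
```
# Step 2: confirm the recreated filesystem is active and the cluster is healthy
TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)
oc -n openshift-storage exec "$TOOLS" -- ceph -s
oc -n openshift-storage exec "$TOOLS" -- ceph fs status

# Step 3: the pre-corruption app-pod (placeholder name) stays in ContainerCreating;
# the fscID error shows up in its mount events
oc describe pod fedora-cephfs-pod | grep -A1 FailedMount
oc get events --field-selector involvedObject.name=fedora-cephfs-pod
```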
Actual results:
Old CephFS app-pod is stuck in the ContainerCreating state:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 9m1s (x502 over 19h) kubelet Unable to attach or mount volumes: unmounted volumes=[fedora-vol], unattached volumes=[fedora-vol]: timed out waiting for the condition
Warning FailedMount 3m37s (x565 over 19h) kubelet MountVolume.MountDevice failed for volume "pvc-d9d90f00-3222-4259-941b-271888506638" : rpc error: code = Internal desc = pool not found: fscID (1) not found in Ceph cluster
Expected results:
The old CephFS app-pod should be running and the old data in the pod should be retrievable.
Additional info:
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2021:5086