Bug 1889866
Summary: | Post node power off/on, an unused MON PVC still remains in the cluster | | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Neha Berry <nberry> |
Component: | rook | Assignee: | Travis Nielsen <tnielsen> |
Status: | CLOSED ERRATA | QA Contact: | Martin Bukatovic <mbukatov> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | 4.5 | CC: | assingh, ebenahar, madam, muagarwa, ocs-bugs, ratamir, tnielsen |
Target Milestone: | --- | Keywords: | AutomationBackLog |
Target Release: | OCS 4.6.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | 4.6.0-148.ci | Doc Type: | No Doc Update |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2020-12-17 06:24:47 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Neha Berry
2020-10-20 18:17:59 UTC
The deletion of unused (orphaned) mon PVCs is currently only done after a successful mon failover. The orphaned mon PVC remains in this case because the mon failover was cancelled when the original mon came back up.

Agreed that Rook should clean up the PVC sooner, but moving this to 4.7 since it doesn't affect functionality.

As discussed in the leads meeting, the fix doesn't seem to be risky, so we will first fix it in 4.7 and follow with the same in 4.6. Proposing as a blocker for now, but in case @Travis updates that the fix is not that intuitive, feel free to move back to 4.7. Thanks.

The fix is low risk: move the check for orphaned resources to every mon reconcile instead of running it only after a successful mon failover. https://github.com/rook/rook/pull/6493

Testing with
============
OCP 4.6.0-0.nightly-2020-11-05-024238
OCS ocs-operator.v4.6.0-624.ci
On GCP (a cloud, IPI platform).

Full version report
===================
```
storage namespace openshift-cluster-storage-operator
image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9a2d75eb606e8cbf2fa0d203bfbc92e3db822286357c46d039ba74080c2dc08f * quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9a2d75eb606e8cbf2fa0d203bfbc92e3db822286357c46d039ba74080c2dc08f
image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:423e5b0624ed0bb736c5320c37611b72dcbb2094e785c2ab588f584f65157289 * quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:423e5b0624ed0bb736c5320c37611b72dcbb2094e785c2ab588f584f65157289
image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c02fd4013a52b3d3047ae566f4e7e50c82c1087cb3acc59945cd01d718235e94 * quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c02fd4013a52b3d3047ae566f4e7e50c82c1087cb3acc59945cd01d718235e94

storage namespace openshift-kube-storage-version-migrator
image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:da72171372d59ebbd8319073640716c7777a945848a39538224354b1566a0b02 * quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:da72171372d59ebbd8319073640716c7777a945848a39538224354b1566a0b02

storage namespace openshift-kube-storage-version-migrator-operator
image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:60ff0a413ba64ee38c13f13902071fc7306f24eb46edcacc8778507cf78f15ef * quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:60ff0a413ba64ee38c13f13902071fc7306f24eb46edcacc8778507cf78f15ef

storage namespace openshift-storage
image quay.io/rhceph-dev/cephcsi@sha256:3b2fff211845eab398d66262a4c47eb5eadbcd982de80387aa47dd23f6572b22 * quay.io/rhceph-dev/cephcsi@sha256:3b2fff211845eab398d66262a4c47eb5eadbcd982de80387aa47dd23f6572b22
image quay.io/rhceph-dev/ose-csi-node-driver-registrar@sha256:4cf9fb2d021b0ce409ef7fdf2d4b182f655950ba28cb822ffc4549de422d4184 * quay.io/rhceph-dev/ose-csi-node-driver-registrar@sha256:30b3c4f21074d323f5d62500af63251f41f96193907b953a742bfb9067d05114
image quay.io/rhceph-dev/ose-csi-external-attacher@sha256:87db9cca0c2e58343e1ca60e9ae4294f115515e7724984de30207b1205ed3611 * quay.io/rhceph-dev/ose-csi-external-attacher@sha256:79d85b1739ef751175cc33ca15e5d979f4bdf0fa5f41b9b7e66d58015b9af6b8
image quay.io/rhceph-dev/ose-csi-external-provisioner@sha256:376ee9cf355554a3174e12329545d1a89ed0228ac2597adbd282ae513dbb84e8 * quay.io/rhceph-dev/ose-csi-external-provisioner@sha256:376ee9cf355554a3174e12329545d1a89ed0228ac2597adbd282ae513dbb84e8
image quay.io/rhceph-dev/ose-csi-external-resizer@sha256:136a81c87028a8f7e6c1c579923548b36dbf034e4dd24215e1739ac484e7382b * quay.io/rhceph-dev/ose-csi-external-resizer@sha256:136a81c87028a8f7e6c1c579923548b36dbf034e4dd24215e1739ac484e7382b
image quay.io/rhceph-dev/ose-csi-external-snapshotter@sha256:90f9dd56fa26339f6d4ff81c7e94794c237ba0963f660480d129c67becdc5e5f * quay.io/rhceph-dev/ose-csi-external-snapshotter@sha256:612307360e8c6bb8994087fc1c44d0e8a35a9e6d5d45b5803d77dd32820484ad
image quay.io/rhceph-dev/mcg-core@sha256:01975cd563b7e802973a8dc4f0b79b43df070f666c7993ab51cf3aefda39002a * quay.io/rhceph-dev/mcg-core@sha256:01975cd563b7e802973a8dc4f0b79b43df070f666c7993ab51cf3aefda39002a
image registry.redhat.io/rhscl/mongodb-36-rhel7@sha256:6abfa44b8b4d7b45d83b1158865194cb64481148701977167e900e5db4e1eba3 * registry.redhat.io/rhscl/mongodb-36-rhel7@sha256:6abfa44b8b4d7b45d83b1158865194cb64481148701977167e900e5db4e1eba3
image quay.io/rhceph-dev/mcg-operator@sha256:a293f3c5933a28812b84e2fe90de40ad64ad0207660787b66e168303b0aafaac * quay.io/rhceph-dev/mcg-operator@sha256:4ac7bc0e54d6190ece9cbc4c81e0644711f1adbb65fda48a2b43a9ab3b256aa1
image quay.io/rhceph-dev/ocs-operator@sha256:7ba5917c82bd08472a221c4bc12f054fdc66fb02fc36ff59270554ca61379da1 * quay.io/rhceph-dev/ocs-operator@sha256:7ba5917c82bd08472a221c4bc12f054fdc66fb02fc36ff59270554ca61379da1
image quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9 * quay.io/rhceph-dev/rhceph@sha256:22ea8ee38cd8283f636c2eeb640eb4a1bb744efb18abee114517926f4a03bff9
image quay.io/rhceph-dev/rook-ceph@sha256:c14792c0e59cf7866b6a19c970513071d0ea106b28e79733a2d26240adb507cd * quay.io/rhceph-dev/rook-ceph@sha256:c14792c0e59cf7866b6a19c970513071d0ea106b28e79733a2d26240adb507cd
```

Verification
============
1. Deployed a 3-node OCS 4.6 cluster on GCP.
2. Stopped one worker node from the GCP Console, chosen so that rook-ceph-operator and ocs-operator were not affected.
3. Waited for about half an hour; saw a new mon-canary pod and a new mon PVC, both in Pending state.
4. Started the node again.
5. A new mon pod was deployed, and the PVC of the removed pod is no longer present.
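The low-risk fix referenced earlier (rook/rook PR #6493) moves the orphaned-resource check so it runs on every mon reconcile rather than only after a successful failover. The following is a minimal sketch of that control-flow change using hypothetical, simplified types (the `cluster`, `removeOrphanMonResources`, and `reconcileMons` names here are illustrative, not Rook's actual API):

```go
package main

import "fmt"

// Hypothetical, simplified model of the fix: orphaned mon PVC cleanup used
// to be gated on a successful failover; the fix runs it unconditionally on
// every mon reconcile.

// cluster tracks mon PVCs and whether a running mon still owns each one.
type cluster struct {
	monPVCs map[string]bool // PVC name -> still owned by a mon
}

// removeOrphanMonResources deletes PVCs no mon owns and returns their names.
func (c *cluster) removeOrphanMonResources() []string {
	var removed []string
	for name, owned := range c.monPVCs {
		if !owned {
			delete(c.monPVCs, name)
			removed = append(removed, name)
		}
	}
	return removed
}

// reconcileMons models the fixed flow: the orphan check no longer depends on
// the failover outcome, so a cancelled failover (failoverSucceeded == false)
// can no longer leave a stale PVC behind.
func (c *cluster) reconcileMons(failoverSucceeded bool) []string {
	_ = failoverSucceeded // before the fix, cleanup was gated on this flag
	return c.removeOrphanMonResources()
}

func main() {
	c := &cluster{monPVCs: map[string]bool{
		"rook-ceph-mon-a": true,
		"rook-ceph-mon-b": false, // orphan left behind by a cancelled failover
	}}
	fmt.Println(c.reconcileMons(false)) // prints [rook-ceph-mon-b]
}
```

The sketch only illustrates the structural change described in the comment above: making the orphan check a normal part of every reconcile pass instead of a post-failover step.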
Final state:

```
$ oc get pods -n openshift-storage | grep mon-
rook-ceph-mon-a-775c887788-5fs6d   1/1   Running   0   77m
rook-ceph-mon-c-5fcf9dbc58-mdz8s   1/1   Running   0   76m
rook-ceph-mon-d-54d89bd4d9-r9wpm   1/1   Running   0   10m

$ oc get pvc -n openshift-storage | grep mon-
rook-ceph-mon-a   Bound   pvc-61977a0b-1f49-4632-a63f-44539bc3c26a   10Gi   RWO   standard   80m
rook-ceph-mon-c   Bound   pvc-53b08463-97e3-43c0-b933-29924bfef914   10Gi   RWO   standard   80m
rook-ceph-mon-d   Bound   pvc-4369dbac-b6db-4887-bd2e-cc8912bdae20   10Gi   RWO   standard   42m
```

So on the one hand there is no pending mon PVC, but on the other hand I seem to observe a successful mon failover, which means I haven't reproduced the bug following the original reproducer. Asking the original dev contact and reporter to update the reproducer to inflict an unsuccessful mon failover on OCS 4.6.

Given another improvement in the mon failover, there isn't a good way to get a mon failover to fail. The whole point of the operator is to succeed when it performs actions, so apparently the operator is getting too good to simulate failure. The way I verified the fix was to manually create a PVC with labels similar to the other mon PVCs (but mounted by no pod) so that the operator would find it at the next reconcile and delete it.

The 1st option plus the regression cycle of disruptive testing should suffice.

In comment 11 I mentioned a way to simulate an orphaned PVC from a failed mon failover. If simulating the repro isn't valid, then we must go with option 1, since a reliable way to reproduce a failed mon failover in a real cluster is so difficult. Retracting the fix should not be done IMO unless a regression is found.

Based on comments 15, 13, 12 and 8, marking as verified (with limitations described in comment 12).

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
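The simulation described in the verification comments (creating a mon-labeled PVC that no pod mounts) could look roughly like the manifest below. This is an illustrative sketch only: the mon ID `z`, the label set, and the storage class are assumptions, and the real labels should be copied verbatim from an existing mon PVC in the cluster (e.g. `oc get pvc rook-ceph-mon-a -n openshift-storage -o yaml`).

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  # "z" is assumed to be a mon ID no deployment uses, so the operator
  # should see this PVC as orphaned at its next mon reconcile.
  name: rook-ceph-mon-z
  namespace: openshift-storage
  labels:
    app: rook-ceph-mon   # illustrative; copy labels from a real mon PVC
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard
```

After applying such a manifest with `oc apply -f`, the operator is expected to find and delete the PVC on its next mon reconcile, which is the behavior the fix was verified against.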
For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.