Bug 1886859
| Summary: | OCS 4.6: Uninstall stuck indefinitely if any Ceph pods are in Pending state before uninstall | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Neha Berry <nberry> |
| Component: | rook | Assignee: | Santosh Pillai <sapillai> |
| Status: | CLOSED ERRATA | QA Contact: | Oded <oviner> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.6 | CC: | ebenahar, madam, muagarwa, nigoyal, ocs-bugs, oviner, ratamir, sapillai, shan, sostapov, tdesala, tnielsen |
| Target Milestone: | --- | ||
| Target Release: | OCS 4.6.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | 4.6.0-144.ci | Doc Type: | No Doc Update |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2020-12-17 06:24:47 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Neha Berry
2020-10-09 14:30:44 UTC
Proposing as a blocker, since uninstall is getting stuck in exactly the situation where it is most needed to work.

We have a root cause. When MDS pods are not ready, the cephCluster status is set to HEALTH_ERR and Rook stops reconciling. We have a couple of solutions but are still debating which approach to use. We do think this needs to be fixed in 4.6 because failed installs are one of the important scenarios that uninstall has to handle. Moving this to ASSIGNED and giving devel ack; please consider this a blocker for 4.6.

SetUp:
Provider: AWS_IPI
Instance type: m4.xlarge
OCP Version: 4.6.0-0.nightly-2020-10-20-172149

Test Process:
1. Check OCP status via UI.
2. Install OCS via UI (4.6.0-169.ci).
3. Check pod status; some pods move to Pending state (including the rook-ceph-osd-0 pod):

$ oc get pods -n openshift-storage | grep -i Pending
noobaa-core-0 0/1 Pending 0 28m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-56447bf8bxngt 0/1 Pending 0 28m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-645f6488584rm 0/1 Pending 0 28m
rook-ceph-osd-0-5fb754b877-f8tsf 0/1 Pending 0 28m

4. Uninstall OCS:
   a. Check PVC and OBC status:

$ oc get pvc -n openshift-image-registry
No resources found in openshift-image-registry namespace.
$ oc get pvc -n -n openshift-monitoring
Error from server (NotFound): namespaces "-n" not found
$ oc get pvc -n openshift-logging
No resources found in openshift-logging namespace.
$ oc get obc -A
No resources found

   b. Delete storagecluster [stuck]:

$ oc delete -n openshift-storage storagecluster --all --wait=true
storagecluster.ocs.openshift.io "ocs-storagecluster" deleted   [stuck]

Detailed test procedure: https://docs.google.com/document/d/1MFRQ3j65uBm3CirM6M6uL4-ybVoFcMmRGI2SFtqvbzI/edit
must-gather: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-1886859/

Uninstall is stuck because the MDS pods and the osd-0 pod were in Pending state before the uninstall.

This doesn't look like a high-severity issue to me:
1. It's uninstall.
2. It's a negative scenario in that flow.
Perhaps I'm missing the severity here?

Talur's fix to change the ordering of the cephcluster first has fixed the original issue. Now a new issue is showing up where the uninstall of the cephcluster is failing if any of the ceph pods are in pending state (and not assigned to a node). This is observed by the following entry in the operator log [1]:

2020-11-26T18:37:59.510065295Z 2020-11-26 18:37:59.510012 E | ceph-cluster-controller: failed to reconcile. failed to find valid ceph hosts in the cluster "openshift-storage": failed to get hostname from node "": resource name may not be empty

This would only affect clusters with ceph pods stuck pending. Uninstall of a normal cluster would proceed as expected. Since this is an uninstall issue in a failed scenario, as mentioned by Yaniv, agreed it's not a blocker.

Rook should ignore the error of a pod being in pending state and continue with the uninstall for any pods that are not in pending state. Moving to 4.6.z since this would be a simple and low risk fix.

[1] http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-1886859/must-gather.local.676632254385058491/quay-io-rhceph-dev-ocs-must-gather-sha256-e25ccd49c5519f2fc4a4c3c1a57f31737a6539a0e5afbd0ea927341c500d9e1b/namespaces/openshift-storage/pods/rook-ceph-operator-776564f669-zkhn4/rook-ceph-operator/rook-ceph-operator/logs/current.log
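On a cluster in this state, the following checks (a hedged sketch, not part of the original report) surface the condition described above: Pending pods with no node assignment, and the reconcile failure in the Rook operator log.

```
# Pending pods and their (empty) node assignment
$ oc get pods -n openshift-storage --field-selector=status.phase=Pending -o wide

# Reconcile failures in the Rook operator log
$ oc logs -n openshift-storage deploy/rook-ceph-operator | grep 'failed to reconcile'
```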
(In reply to Travis Nielsen from comment #17)
> Talur's fix to change the ordering of the cephcluster first has fixed the
> original issue. Now a new issue is showing up where the uninstall of the
> cephcluster is failing if any of the ceph pods are in pending state (and not
> assigned to a node). This is observed by the following entry in the operator
> log [1].
>
> 2020-11-26T18:37:59.510065295Z 2020-11-26 18:37:59.510012 E |
> ceph-cluster-controller: failed to reconcile. failed to find valid ceph
> hosts in the cluster "openshift-storage": failed to get hostname from node
> "": resource name may not be empty
>
> This would only affect clusters with ceph pods stuck pending. Uninstall of a
> normal cluster would proceed as expected. Since this is an uninstall issue
> in a failed scenario as mentioned by Yaniv, agreed it's not a blocker.
>
> Rook should ignore the error of a pod being in pending state and continue
> with the uninstall for any pods that are not in pending state. Moving to
> 4.6.z since this would be a simple and low risk fix.

What we discussed yesterday was that *if* we need another RC for 4.6.0 anyway and *if* the fix is simple and can be provided quickly, *then* we can consider adding it to 4.6.0 itself.

@Santosh - is a fix in sight or will it take a few more days?
> @Santosh - is a fix in sight or will it take a few more days?
PR with a fix should be ready today.
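Once a build containing the fix is available, a hedged way to confirm which OCS/Rook build a cluster is actually running (these are standard OLM/OpenShift queries, not commands taken from this report):

```
# Installed OCS operator CSV and version (OLM)
$ oc get csv -n openshift-storage

# Image actually running in the Rook operator deployment
$ oc get deploy rook-ceph-operator -n openshift-storage \
    -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
```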
Rook PR to ignore pending ceph daemon pods - https://github.com/rook/rook/pull/6719

Moving it back to 4.6.0 as discussed in the program meeting. Santosh, please merge the backport PR to 4.6.

Bug fixed.

Install/Uninstall doc: https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.6/html-single/deploying_openshift_container_storage_using_amazon_web_services/index?lb_target=preview#assembly_uninstalling-openshift-container-storage_rhocs

SetUp:
Provider: AWS_IPI
Instance type: m4.xlarge
OCP Version: 4.6.0-0.nightly-2020-12-14-082246

Test Process:
1. Check OCP status via UI.
2. Deploy OLM 4.6.0-195.ci (on OCP 4.6, the OCS Operator is not shown in OperatorHub without this command):

$ oc create -f install_olm.yaml
namespace/openshift-storage created
operatorgroup.operators.coreos.com/openshift-storage-operatorgroup created
catalogsource.operators.coreos.com/ocs-catalogsource created

3. Install OCS 4.6.
4. Check pod status:

$ oc get pods -n openshift-storage | grep -i Pending
noobaa-core-0 0/1 Pending 0 6m12s
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-68b94b44wsjbj 0/1 Pending 0 5m54s
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6bbd878bwcgvq 0/1 Pending 0 5m53s
rook-ceph-osd-1-d67f9486f-sv85f 0/1 Pending 0 7m14s

$ oc get pods -n openshift-storage
NAME READY STATUS RESTARTS AGE
csi-cephfsplugin-66r7n 3/3 Running 0 11m
csi-cephfsplugin-g6jmn 3/3 Running 0 11m
csi-cephfsplugin-provisioner-56d4c79c8-54rhh 6/6 Running 0 11m
csi-cephfsplugin-provisioner-56d4c79c8-f9mlx 6/6 Running 0 11m
csi-cephfsplugin-t6dbc 3/3 Running 0 11m
csi-rbdplugin-5xmzk 3/3 Running 0 11m
csi-rbdplugin-p42km 3/3 Running 0 11m
csi-rbdplugin-provisioner-85c448cfc-fnmzq 6/6 Running 0 11m
csi-rbdplugin-provisioner-85c448cfc-wpmqx 6/6 Running 0 11m
csi-rbdplugin-rr68d 3/3 Running 0 11m
noobaa-core-0 0/1 Pending 0 6m48s
noobaa-db-0 1/1 Running 0 6m48s
noobaa-operator-75b79c46d7-qlnls 1/1 Running 0 14m
ocs-metrics-exporter-d47cd54ff-mr6jf 1/1 Running 0 14m
ocs-operator-7cdfc88b6d-ttszl 0/1 Running 0 14m
rook-ceph-crashcollector-ip-10-0-159-51-5576c4dc59-hd7jg 1/1 Running 0 9m38s
rook-ceph-crashcollector-ip-10-0-185-92-75dcdc855c-fbnrn 1/1 Running 0 9m15s
rook-ceph-crashcollector-ip-10-0-197-98-94c8b747b-gpb88 1/1 Running 0 8m45s
rook-ceph-drain-canary-2b829a34cbf20b0892bae76197d5649e-57qjm4n 1/1 Running 0 6m50s
rook-ceph-drain-canary-f473cf35a5491c9aceb4df0f63881604-76v7l2j 1/1 Running 0 7m58s
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-68b94b44wsjbj 0/1 Pending 0 6m30s
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6bbd878bwcgvq 0/1 Pending 0 6m29s
rook-ceph-mgr-a-7f5d849db7-j9qlc 1/1 Running 0 8m24s
rook-ceph-mon-a-6f6bd7dd67-vv5jp 1/1 Running 0 9m39s
rook-ceph-mon-b-78998fb464-8hfsd 1/1 Running 0 9m16s
rook-ceph-mon-c-7f8547897c-vbl48 1/1 Running 0 8m45s
rook-ceph-operator-5df7cd94d6-grwlb 1/1 Running 0 14m
rook-ceph-osd-0-7d4fdbd474-2vgnl 1/1 Running 0 7m58s
rook-ceph-osd-1-d67f9486f-sv85f 0/1 Pending 0 7m50s
rook-ceph-osd-2-5d8c4c4fbc-xv5g9 1/1 Running 0 6m50s
rook-ceph-osd-prepare-ocs-deviceset-gp2-0-data-0-mtnj6-zkdbg 0/1 Completed 0 8m23s
rook-ceph-osd-prepare-ocs-deviceset-gp2-1-data-0-c9fcb-mzgkp 0/1 Completed 0 8m23s
rook-ceph-osd-prepare-ocs-deviceset-gp2-2-data-0-dhhwk-xdjrd 0/1 Completed 0 8m22s

$ oc get storagecluster -n openshift-storage
NAME AGE PHASE EXTERNAL CREATED AT VERSION
ocs-storagecluster 12m Progressing 2020-12-14T18:56:43Z 4.6.0
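As in the original report, several pods stay Pending. A hedged sketch of how one could check why the scheduler is not placing them (the pod name is taken from the listing above; the node label key is the one removed later during uninstall):

```
# Scheduling events for one of the Pending pods listed above
$ oc describe pod rook-ceph-osd-1-d67f9486f-sv85f -n openshift-storage | sed -n '/^Events:/,$p'

# Allocatable vs. requested resources on the labelled storage nodes
$ oc describe nodes -l cluster.ocs.openshift.io/openshift-storage | grep -A 8 'Allocated resources'
```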
Uninstall:
1. Get PVCs:

$ oc get pvc -n openshift-image-registry
No resources found in openshift-image-registry namespace.
$ oc get pvc -n openshift-monitoring
No resources found in openshift-monitoring namespace.
$ oc get pvc -n openshift-logging
No resources found in openshift-logging namespace.

2. Delete the storagecluster:

$ oc delete -n openshift-storage storagecluster --all --wait=true
storagecluster.ocs.openshift.io "ocs-storagecluster" deleted   [takes 2 minutes]
$ oc get storagecluster -n openshift-storage
No resources found in openshift-storage namespace.

3. Check for cleanup pods:

$ oc get pods -n openshift-storage | grep -i cleanup
cluster-cleanup-job-ip-10-0-159-51-zbm9q 0/1 Completed 0 4m18s
cluster-cleanup-job-ip-10-0-185-92-4npmh 0/1 Completed 0 4m18s
cluster-cleanup-job-ip-10-0-197-98-vjw9q 0/1 Completed 0 4m18s

4. Confirm that the directory /var/lib/rook is now empty (example commands for steps 4, 6, and 8 are sketched at the end of this report).
5. Unlabel the storage nodes:

$ oc label nodes --all cluster.ocs.openshift.io/openshift-storage-
label "cluster.ocs.openshift.io/openshift-storage" not found.
node/ip-10-0-144-130.us-east-2.compute.internal not labeled
node/ip-10-0-159-51.us-east-2.compute.internal labeled
node/ip-10-0-185-92.us-east-2.compute.internal labeled
label "cluster.ocs.openshift.io/openshift-storage" not found.
node/ip-10-0-187-77.us-east-2.compute.internal not labeled
label "cluster.ocs.openshift.io/openshift-storage" not found.
node/ip-10-0-193-88.us-east-2.compute.internal not labeled
node/ip-10-0-197-98.us-east-2.compute.internal labeled

$ oc label nodes --all topology.rook.io/rack-
label "topology.rook.io/rack" not found.
node/ip-10-0-144-130.us-east-2.compute.internal not labeled
label "topology.rook.io/rack" not found.
node/ip-10-0-159-51.us-east-2.compute.internal not labeled
label "topology.rook.io/rack" not found.
node/ip-10-0-185-92.us-east-2.compute.internal not labeled
label "topology.rook.io/rack" not found.
node/ip-10-0-187-77.us-east-2.compute.internal not labeled
label "topology.rook.io/rack" not found.
node/ip-10-0-193-88.us-east-2.compute.internal not labeled
label "topology.rook.io/rack" not found.
node/ip-10-0-197-98.us-east-2.compute.internal not labeled

6. Remove the OpenShift Container Storage taint if the nodes were tainted.
7. Confirm that all PVs provisioned using OpenShift Container Storage are deleted:

$ oc get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-2e6c0aaa-959b-4b22-bf5e-fb1902c2b57e 512Gi RWO Delete Bound openshift-storage/ocs-deviceset-gp2-0-data-0-mtnj6 gp2 24m
pvc-51771e31-6098-4931-80dc-0ed96401904d 512Gi RWO Delete Bound openshift-storage/ocs-deviceset-gp2-2-data-0-dhhwk gp2 24m
pvc-e3d72c50-d478-40ac-b1d7-6e064c733577 512Gi RWO Delete Bound openshift-storage/ocs-deviceset-gp2-1-data-0-c9fcb gp2 24m

8. Delete the Multicloud Object Gateway storageclass.
9. Remove CustomResourceDefinitions:
$ oc delete crd backingstores.noobaa.io bucketclasses.noobaa.io cephblockpools.ceph.rook.io cephclusters.ceph.rook.io cephfilesystems.ceph.rook.io cephnfses.ceph.rook.io cephobjectstores.ceph.rook.io cephobjectstoreusers.ceph.rook.io noobaas.noobaa.io ocsinitializations.ocs.openshift.io storageclusterinitializations.ocs.openshift.io storageclusters.ocs.openshift.io cephclients.ceph.rook.io cephobjectrealms.ceph.rook.io, cephobjectzonegroups.ceph.rook.io cephobjectzones.ceph.rook.io cephrbdmirrors.ceph.rook.io --wait=true --timeout=5m
customresourcedefinition.apiextensions.k8s.io "backingstores.noobaa.io" deleted
customresourcedefinition.apiextensions.k8s.io "bucketclasses.noobaa.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephblockpools.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephclusters.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephfilesystems.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephnfses.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephobjectstores.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephobjectstoreusers.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "noobaas.noobaa.io" deleted
customresourcedefinition.apiextensions.k8s.io "ocsinitializations.ocs.openshift.io" deleted
customresourcedefinition.apiextensions.k8s.io "storageclusters.ocs.openshift.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephclients.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephobjectzonegroups.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephobjectzones.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephrbdmirrors.ceph.rook.io" deleted
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "storageclusterinitializations.ocs.openshift.io" not found
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "cephobjectrealms.ceph.rook.io," not found

10. Delete the namespace and wait till the deletion is complete:

$ oc project default
Now using project "default" on server "https://api.oviner-awsbug14.qe.rh-ocs.com:6443".
$ oc delete project openshift-storage --wait=true --timeout=5m
project.project.openshift.io "openshift-storage" deleted
$ oc get project openshift-storage
Error from server (NotFound): namespaces "openshift-storage" not found

11. Ensure that OpenShift Container Storage is uninstalled completely (via UI).

Detailed test procedure: https://docs.google.com/document/d/1MFRQ3j65uBm3CirM6M6uL4-ybVoFcMmRGI2SFtqvbzI/edit

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605
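For completeness, a hedged sketch of the commands behind steps 4, 6, and 8 of the uninstall procedure above, which this log does not show. The taint key and the storage class name are assumptions based on common OCS 4.6 conventions rather than values captured in this report; the node name is one of the storage nodes from the log.

```
# Step 4: confirm /var/lib/rook is empty on each former storage node
$ oc debug node/ip-10-0-159-51.us-east-2.compute.internal -- chroot /host ls -l /var/lib/rook

# Step 6: remove the OCS taint from all nodes, if it was applied (taint key is an assumption)
$ oc adm taint nodes --all node.ocs.openshift.io/storage-

# Step 8: delete the Multicloud Object Gateway storage class (name is an assumption)
$ oc delete storageclass openshift-storage.noobaa.io --wait=true --timeout=5m
```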