Description of problem (please be detailed as possible and provide log snippets):
After a failover operation, we unfence the primary cluster and run the graceful-reboot script as part of the DR steps. On completion, some nodes are reported as NotReady and pods are in CrashLoopBackOff status.

Version of all relevant components (if applicable):
oc version
Client Version: 4.14.0-202309181402.p0.g795bf1a.assembly.stream-795bf1a
Kustomize Version: v5.0.1
Server Version: 4.14.0-0.nightly-2023-10-23-223425
Kubernetes Version: v1.27.6+f67aeb3

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?
Yes. Manually uncordoning the NotReady nodes remedies the problem.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
Yes

Can this issue be reproduced from the UI?
Combination of UI and CLI

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
After a failover of my application, I unfenced the primary cluster and ran the graceful-reboot script. I had to manually uncordon the control-plane-1 node again, as a node was reported as NotReady and some pods weren't running. I then checked the secondary cluster and found a node NotReady and many pods terminating. This was due to a resource crunch on the DR environments. When this happens, we need to alert the user that a cluster is not ready, recommend remedial action, and prevent any further Failover or Relocate operations.

oc get nodes
NAME              STATUS     ROLES                  AGE   VERSION
compute-0         Ready      worker                 37d   v1.27.6+f67aeb3
compute-1         Ready      worker                 37d   v1.27.6+f67aeb3
compute-2         Ready      worker                 37d   v1.27.6+f67aeb3
control-plane-0   NotReady   control-plane,master   37d   v1.27.6+f67aeb3
control-plane-1   Ready      control-plane,master   37d   v1.27.6+f67aeb3
control-plane-2   Ready      control-plane,master   37d   v1.27.6+f67aeb3

[kgoldbla@fedora OC-Openshift]$ oc get pods -A |grep -v Running |grep -v Completed
NAMESPACE                                         NAME                                                       READY  STATUS       RESTARTS       AGE
openshift-apiserver                               apiserver-64f44d88fc-gbt59                                 1/2    Terminating  2 (23h ago)    43h
openshift-apiserver                               apiserver-64f44d88fc-vbclc                                 0/2    Pending      0              34m
openshift-authentication                          oauth-openshift-6c4666584c-2vqq8                           0/1    Pending      0              36m
openshift-authentication                          oauth-openshift-6d9d54994d-bqb7t                           1/1    Terminating  0              91m
openshift-cloud-controller-manager-operator       cluster-cloud-controller-manager-operator-68b6c7d49-kbkb5  2/2    Terminating  5 (7h32m ago)  43h
openshift-cloud-controller-manager                vsphere-cloud-controller-manager-778f44fb4f-hfgv9          1/1    Terminating  0              43h
openshift-cloud-credential-operator               cloud-credential-operator-675b4c7779-phdrp                 2/2    Terminating  0              43h
openshift-cluster-csi-drivers                     vmware-vsphere-csi-driver-controller-86c4ff44b4-mdwrr      13/13  Terminating  23 (93m ago)   43h
openshift-cluster-csi-drivers                     vmware-vsphere-csi-driver-controller-86c4ff44b4-mhnlv      13/13  Terminating  25 (41m ago)   43h
openshift-cluster-csi-drivers                     vmware-vsphere-csi-driver-webhook-bfdc8fccc-9nkrz          1/1    Terminating  0              43h
openshift-cluster-machine-approver                machine-approver-754894d4c-6948l                           2/2    Terminating  3 (29h ago)    43h
openshift-cluster-node-tuning-operator            cluster-node-tuning-operator-7cdbfc46cd-k7bwr              1/1    Terminating  2 (29h ago)    43h
openshift-cluster-samples-operator                cluster-samples-operator-7b7bdfb78f-kkmtm                  2/2    Terminating  4 (3h55m ago)  43h
openshift-cluster-storage-operator                cluster-storage-operator-758866db79-jk6kv                  1/1    Terminating  1 (34h ago)    43h
openshift-cluster-storage-operator                csi-snapshot-controller-799dbd6b57-99b8q                   1/1    Terminating  2 (29h ago)    43h
openshift-cluster-storage-operator                csi-snapshot-controller-799dbd6b57-msqkq                   1/1    Terminating  0              43h
openshift-cluster-storage-operator                csi-snapshot-webhook-5cd46fd9f9-hpdwk                      1/1    Terminating  0              43h
openshift-cluster-storage-operator                csi-snapshot-webhook-5cd46fd9f9-mcn7p                      1/1    Terminating  0              43h
openshift-cluster-version                         cluster-version-operator-64bc7b4f46-tsnz5                  1/1    Terminating  1 (34h ago)    43h
openshift-cnv                                     kubemacpool-cert-manager-                                  1/1    Terminating  0              43h
openshift-config-operator                         openshift-config-operator-85865499b6-9pn28                 1/1    Terminating  11 (18h ago)   43h
openshift-console-operator                        console-operator-64655d8457-g96x5                          2/2    Terminating  7 (18h ago)    43h
openshift-console                                 console-79644b56d7-jrsf5                                   1/1    Terminating  3 (29h ago)    43h
openshift-controller-manager-operator             openshift-controller-manager-operator-7cb69858c7-q2hw6     1/1    Terminating  2 (29h ago)    43h
openshift-controller-manager                      controller-manager-68bdcd4b4f-pt62f                        0/1    Pending      0              34m
openshift-controller-manager                      controller-manager-68bdcd4b4f-z5czk                        1/1    Terminating  3 (29h ago)    43h
openshift-dns-operator                            dns-operator-594dc99798-h4gqt                              2/2    Terminating  0              43h
openshift-etcd-operator                           etcd-operator-77fcb89d6-xczzt                              1/1    Terminating  7 (7h33m ago)  43h
openshift-image-registry                          cluster-image-registry-operator-666fdd4cf9-8xsqp           1/1    Terminating  4 (26h ago)    43h
openshift-ingress-operator                        ingress-operator-6cbb759465-6bpwm                          2/2    Terminating  0              43h
openshift-insights                                insights-operator-6c68997698-h5r8t                         1/1    Terminating  0              43h
openshift-kube-controller-manager-operator        kube-controller-manager-operator-bfd4bdb9b-6c9xx           1/1    Terminating  4 (29h ago)    43h
openshift-kube-scheduler-operator                 openshift-kube-scheduler-operator-7c8468ffdc-4b97l         1/1    Terminating  4 (7h32m ago)  43h
openshift-kube-storage-version-migrator-operator  kube-storage-version-migrator-operator-57b6bf9cbc-2xdwd    1/1    Terminating  3 (29h ago)    43h
openshift-machine-api                             cluster-autoscaler-operator-747b555bd8-gmgtj               2/2    Terminating  3 (29h ago)    43h
openshift-machine-api                             cluster-baremetal-operator-7cbc79b5bf-d4f5q                2/2    Terminating  1 (43h ago)    43h
openshift-machine-api                             control-plane-machine-set-operator-6969d8fd95-5fnpb        1/1    Terminating  2 (29h ago)    43h
openshift-machine-api                             machine-api-operator-59bb5844bb-k9ttf                      2/2    Terminating  3 (29h ago)    43h
openshift-machine-config-operator                 machine-config-controller-5649bc7bd8-r47qd                 2/2    Terminating  2 (34h ago)    43h
openshift-machine-config-operator                 machine-config-operator-d8b7b866-whzns                     2/2    Terminating  1 (34h ago)    43h
openshift-marketplace                             marketplace-operator-75bcf6f64b-qv9c4                      1/1    Terminating  9 (7h32m ago)  43h
openshift-monitoring                              cluster-monitoring-operator-6cd87666fd-qwmv5               1/1    Terminating  0              43h
openshift-monitoring                              prometheus-operator-59c99bbf7c-rzk6z                       2/2    Terminating  0              43h
openshift-multus                                  multus-admission-controller-7bdf5df49d-7wrdh               2/2    Terminating  0              43h
openshift-multus                                  multus-admission-controller-7bdf5df49d-c9zkt               2/2    Terminating  0              43h
openshift-network-operator                        network-operator-77f69dddcc-nt779                          1/1    Terminating  2 (29h ago)    43h
openshift-oauth-apiserver                         apiserver-9c578dc4f-5rv5g                                  0/1    Pending      0              34m
openshift-oauth-apiserver                         apiserver-9c578dc4f-fnhr6                                  1/1    Terminating  0              43h
openshift-operator-lifecycle-manager              package-server-manager-5797c4b465-kcwng                    1/1    Terminating  4 (26h ago)    43h
openshift-operator-lifecycle-manager              packageserver-8749b7667-lxmt4                              1/1    Terminating  3 (18h ago)    43h
openshift-route-controller-manager                route-controller-manager-85c64f8f8d-8p2jg                  0/1    Pending      0              34m
openshift-route-controller-manager                route-controller-manager-85c64f8f8d-npktf                  1/1    Terminating  7 (18h ago)    43h
openshift-service-ca-operator                     service-ca-operator-7c5d678d5b-gmvfh                       1/1    Terminating  4 (29h ago)    43h

Actual results:

Expected results:

Additional info:

Metrics and alerts for cluster and node health should come from OpenShift. Since DR is a user operation, users should rely on OpenShift monitoring data to make operational decisions. This is not in our scope.
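
For anyone who wants a quick manual check before a Failover or Relocate, here is a minimal sketch in Python that parses the text output of `oc get nodes` and lists every node whose STATUS is not plain `Ready`. Cordoned nodes, which show as `Ready,SchedulingDisabled`, are flagged too, since the workaround in this report involved uncordoning. The function names `nodes_needing_attention` and `fetch_node_table` are hypothetical and not part of any OpenShift or DR tooling; the sketch assumes the default column layout shown in the report and is not a substitute for OpenShift monitoring alerts.

```python
import subprocess
from typing import List


def nodes_needing_attention(oc_get_nodes_output: str) -> List[str]:
    """Return names of nodes whose STATUS column is anything other than 'Ready'.

    Assumes the default `oc get nodes` layout with NAME and STATUS columns.
    """
    lines = [ln for ln in oc_get_nodes_output.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    name_col = header.index("NAME")
    status_col = header.index("STATUS")
    flagged = []
    for row in lines[1:]:
        fields = row.split()
        # STATUS can be compound, e.g. "Ready,SchedulingDisabled" for a cordoned node.
        conditions = fields[status_col].split(",")
        if conditions != ["Ready"]:
            flagged.append(fields[name_col])
    return flagged


def fetch_node_table() -> str:
    """Run `oc get nodes` against the current context and return its stdout."""
    result = subprocess.run(
        ["oc", "get", "nodes"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

A possible use is to call `nodes_needing_attention(fetch_node_table())` on each managed cluster and hold off on the DR action if the returned list is not empty.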