2252687 – Flaky pods due to resource crunch after running graceful reboot - should prevent failover or relocate operations

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat OpenShift Data Foundation
Classification:	Red Hat Storage
Component:	odf-dr
Sub Component:
Version:	4.14
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	umanga
QA Contact:	krishnaram Karthick
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2023-12-03 23:26 UTC by Kevin Alon Goldblatt
Modified:	2023-12-05 14:17 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2023-12-05 14:17:14 UTC
Embargoed:

Description Kevin Alon Goldblatt 2023-12-03 23:26:22 UTC

Description of problem (please be as detailed as possible and provide log
snippets):
After a failover operation, we unfence the primary cluster and run the graceful-reboot script as part of the DR steps. On completion, some nodes are reported as NotReady and pods are in CrashLoopBackOff status.

Version of all relevant components (if applicable):
oc version
Client Version: 4.14.0-202309181402.p0.g795bf1a.assembly.stream-795bf1a
Kustomize Version: v5.0.1
Server Version: 4.14.0-0.nightly-2023-10-23-223425
Kubernetes Version: v1.27.6+f67aeb3


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?
Yes. Manually uncordoning the NotReady nodes remedies the problem.
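
A minimal sketch of that workaround, assuming it is acceptable to uncordon every node that does not report a plain "Ready" status. The function names and the OC override variable are hypothetical helpers introduced for illustration; only `oc get nodes` and `oc adm uncordon` are real commands. Whether uncordoning is the right remedy depends on why the node is unhealthy, so treat this as a starting point rather than a fix.

```shell
# Print the names of nodes whose STATUS column is anything other than a
# plain "Ready" (for example NotReady or Ready,SchedulingDisabled).
# Input: the tabular output of `oc get nodes` on stdin; the header is skipped.
nodes_needing_attention() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Uncordon every node name read from stdin, one per line.
# OC is a hypothetical override so the loop can be dry-run with OC=echo.
uncordon_nodes() {
  while read -r node; do
    if [ -n "$node" ]; then
      "${OC:-oc}" adm uncordon "$node" < /dev/null
    fi
  done
}
```

Typical use would be `oc get nodes | nodes_needing_attention | uncordon_nodes`; setting OC=echo on the last stage prints the commands instead of running them.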

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
yes

Can this issue reproduce from the UI?
combination of UI and cli

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
After a failover of my application, I unfenced the primary cluster and ran the graceful reboot script.
I had to manually uncordon the control-plane-1 node again because a node was reported as NotReady and some pods were not running.
I then checked the secondary cluster and found a node NotReady and many pods terminating. This was due to a resource crunch on the DR environments.
When this happens, we need to alert the user that a cluster is not ready, recommend remedial action, and prevent any further Failover or Relocate operations.

oc get nodes
NAME              STATUS     ROLES                  AGE   VERSION
compute-0         Ready      worker                 37d   v1.27.6+f67aeb3
compute-1         Ready      worker                 37d   v1.27.6+f67aeb3
compute-2         Ready      worker                 37d   v1.27.6+f67aeb3
control-plane-0   NotReady   control-plane,master   37d   v1.27.6+f67aeb3
control-plane-1   Ready      control-plane,master   37d   v1.27.6+f67aeb3
control-plane-2   Ready      control-plane,master   37d   v1.27.6+f67aeb3
[kgoldbla@fedora OC-Openshift]$ oc get pods -A |grep -v Running |grep -v Completed
NAMESPACE                                          NAME                                                              READY   STATUS        RESTARTS         AGE
openshift-apiserver                                apiserver-64f44d88fc-gbt59                                        1/2     Terminating   2 (23h ago)      43h
openshift-apiserver                                apiserver-64f44d88fc-vbclc                                        0/2     Pending       0                34m
openshift-authentication                           oauth-openshift-6c4666584c-2vqq8                                  0/1     Pending       0                36m
openshift-authentication                           oauth-openshift-6d9d54994d-bqb7t                                  1/1     Terminating   0                91m
openshift-cloud-controller-manager-operator        cluster-cloud-controller-manager-operator-68b6c7d49-kbkb5         2/2     Terminating   5 (7h32m ago)    43h
openshift-cloud-controller-manager                 vsphere-cloud-controller-manager-778f44fb4f-hfgv9                 1/1     Terminating   0                43h
openshift-cloud-credential-operator                cloud-credential-operator-675b4c7779-phdrp                        2/2     Terminating   0                43h
openshift-cluster-csi-drivers                      vmware-vsphere-csi-driver-controller-86c4ff44b4-mdwrr             13/13   Terminating   23 (93m ago)     43h
openshift-cluster-csi-drivers                      vmware-vsphere-csi-driver-controller-86c4ff44b4-mhnlv             13/13   Terminating   25 (41m ago)     43h
openshift-cluster-csi-drivers                      vmware-vsphere-csi-driver-webhook-bfdc8fccc-9nkrz                 1/1     Terminating   0                43h
openshift-cluster-machine-approver                 machine-approver-754894d4c-6948l                                  2/2     Terminating   3 (29h ago)      43h
openshift-cluster-node-tuning-operator             cluster-node-tuning-operator-7cdbfc46cd-k7bwr                     1/1     Terminating   2 (29h ago)      43h
openshift-cluster-samples-operator                 cluster-samples-operator-7b7bdfb78f-kkmtm                         2/2     Terminating   4 (3h55m ago)    43h
openshift-cluster-storage-operator                 cluster-storage-operator-758866db79-jk6kv                         1/1     Terminating   1 (34h ago)      43h
openshift-cluster-storage-operator                 csi-snapshot-controller-799dbd6b57-99b8q                          1/1     Terminating   2 (29h ago)      43h
openshift-cluster-storage-operator                 csi-snapshot-controller-799dbd6b57-msqkq                          1/1     Terminating   0                43h
openshift-cluster-storage-operator                 csi-snapshot-webhook-5cd46fd9f9-hpdwk                             1/1     Terminating   0                43h
openshift-cluster-storage-operator                 csi-snapshot-webhook-5cd46fd9f9-mcn7p                             1/1     Terminating   0                43h
openshift-cluster-version                          cluster-version-operator-64bc7b4f46-tsnz5                         1/1     Terminating   1 (34h ago)      43h
openshift-cnv                                      kubemacpool-cert-manager- 1/1     Terminating   0                43h
openshift-config-operator                          openshift-config-operator-85865499b6-9pn28                        1/1     Terminating   11 (18h ago)     43h
openshift-console-operator                         console-operator-64655d8457-g96x5                                 2/2     Terminating   7 (18h ago)      43h
openshift-console                                  console-79644b56d7-jrsf5                                          1/1     Terminating   3 (29h ago)      43h
openshift-controller-manager-operator              openshift-controller-manager-operator-7cb69858c7-q2hw6            1/1     Terminating   2 (29h ago)      43h
openshift-controller-manager                       controller-manager-68bdcd4b4f-pt62f                               0/1     Pending       0                34m
openshift-controller-manager                       controller-manager-68bdcd4b4f-z5czk                               1/1     Terminating   3 (29h ago)      43h
openshift-dns-operator                             dns-operator-594dc99798-h4gqt                                     2/2     Terminating   0                43h
openshift-etcd-operator                            etcd-operator-77fcb89d6-xczzt                                     1/1     Terminating   7 (7h33m ago)    43h
openshift-image-registry                           cluster-image-registry-operator-666fdd4cf9-8xsqp                  1/1     Terminating   4 (26h ago)      43h
openshift-ingress-operator                         ingress-operator-6cbb759465-6bpwm                                 2/2     Terminating   0                43h
openshift-insights                                 insights-operator-6c68997698-h5r8t                                1/1     Terminating   0                43h
openshift-kube-controller-manager-operator         kube-controller-manager-operator-bfd4bdb9b-6c9xx                  1/1     Terminating   4 (29h ago)      43h
openshift-kube-scheduler-operator                  openshift-kube-scheduler-operator-7c8468ffdc-4b97l                1/1     Terminating   4 (7h32m ago)    43h
openshift-kube-storage-version-migrator-operator   kube-storage-version-migrator-operator-57b6bf9cbc-2xdwd           1/1     Terminating   3 (29h ago)      43h
openshift-machine-api                              cluster-autoscaler-operator-747b555bd8-gmgtj                      2/2     Terminating   3 (29h ago)      43h
openshift-machine-api                              cluster-baremetal-operator-7cbc79b5bf-d4f5q                       2/2     Terminating   1 (43h ago)      43h
openshift-machine-api                              control-plane-machine-set-operator-6969d8fd95-5fnpb               1/1     Terminating   2 (29h ago)      43h
openshift-machine-api                              machine-api-operator-59bb5844bb-k9ttf                             2/2     Terminating   3 (29h ago)      43h
openshift-machine-config-operator                  machine-config-controller-5649bc7bd8-r47qd                        2/2     Terminating   2 (34h ago)      43h
openshift-machine-config-operator                  machine-config-operator-d8b7b866-whzns                            2/2     Terminating   1 (34h ago)      43h
openshift-marketplace                              marketplace-operator-75bcf6f64b-qv9c4                             1/1     Terminating   9 (7h32m ago)    43h
openshift-monitoring                               cluster-monitoring-operator-6cd87666fd-qwmv5                      1/1     Terminating   0                43h
openshift-monitoring                               prometheus-operator-59c99bbf7c-rzk6z                              2/2     Terminating   0                43h
openshift-multus                                   multus-admission-controller-7bdf5df49d-7wrdh                      2/2     Terminating   0                43h
openshift-multus                                   multus-admission-controller-7bdf5df49d-c9zkt                      2/2     Terminating   0                43h
openshift-network-operator                         network-operator-77f69dddcc-nt779                                 1/1     Terminating   2 (29h ago)      43h
openshift-oauth-apiserver                          apiserver-9c578dc4f-5rv5g                                         0/1     Pending       0                34m
openshift-oauth-apiserver                          apiserver-9c578dc4f-fnhr6                                         1/1     Terminating   0                43h
openshift-operator-lifecycle-manager               package-server-manager-5797c4b465-kcwng                           1/1     Terminating   4 (26h ago)      43h
openshift-operator-lifecycle-manager               packageserver-8749b7667-lxmt4                                     1/1     Terminating   3 (18h ago)      43h
openshift-route-controller-manager                 route-controller-manager-85c64f8f8d-8p2jg                         0/1     Pending       0                34m
openshift-route-controller-manager                 route-controller-manager-85c64f8f8d-npktf                         1/1     Terminating   7 (18h ago)      43h
openshift-service-ca-operator                      service-ca-operator-7c5d678d5b-gmvfh                              1/1     Terminating   4 (29h ago)      43h
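
The kind of check requested above could be sketched on the user side as follows. This is an illustration only, not part of ODF or the DR operators: both function names are hypothetical, and the sketch simply parses the tabular output of the two `oc` commands shown above.

```shell
# Count pods per STATUS from `oc get pods -A` output on stdin.
# The header row is skipped; STATUS is the fourth column.
pod_status_counts() {
  awk 'NR > 1 && NF >= 4 { count[$4]++ }
       END { for (s in count) print s, count[s] }' | sort
}

# Return 0 only if every node reports a plain "Ready" status. Otherwise
# list the offending nodes on stderr and return 1, so a caller can refuse
# to start a failover or relocate.
# Input: `oc get nodes` output on stdin.
nodes_all_ready() {
  bad=$(awk 'NR > 1 && $2 != "Ready" { print $1 " (" $2 ")" }')
  if [ -n "$bad" ]; then
    echo "Cluster not ready for failover or relocate:" >&2
    echo "$bad" >&2
    return 1
  fi
  return 0
}
```

For example, `oc get nodes | nodes_all_ready` could be run against each managed cluster before starting a DR operation, and `oc get pods -A | pod_status_counts` gives a quick view of how many pods are Pending or Terminating.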



Actual results:


Expected results:


Additional info:

Comment 2 umanga 2023-12-05 14:17:14 UTC

Metrics and alerts for cluster and node health should come from OpenShift.
Since DR is a user operation, users should make use of OpenShift monitoring data to make operation decisions.
This is not in our scope.
