Bug 1883622 - [4.6] Restarting more than one node at a time makes the environment very unstable
Summary: [4.6] Restarting more than one node at a time makes the environment very unstable
Keywords:
Status: CLOSED DUPLICATE of bug 1883614
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Beth White
QA Contact: Amit Ugol
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-09-29 17:55 UTC by mlammon
Modified: 2020-10-01 14:11 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-01 14:11:25 UTC
Target Upstream Version:
Embargoed:


Attachments
Restart UI shot (112.42 KB, image/png)
2020-09-29 17:55 UTC, mlammon

Description mlammon 2020-09-29 17:55:48 UTC
Created attachment 1717597
Restart UI shot

Description of problem:
The UI console should not allow a user to "Restart" more than one node at a time. Doing so results in:

{"metadata":{},"status":"Failure","message":"Timeout: request did not complete within 1m0s","reason":"Timeout","details":{},"code":504}

and the CLI no longer works.

Version-Release number of selected component (if applicable):
Cluster version is 4.6.0-fc.8
Cluster has 3 masters and 2 workers

How reproducible:
100%

Steps to Reproduce:
1. Install OCP 4.6
2. Compute -> Bare Metal Hosts -> 3-dot kebab menu for <master-0-0> -> Restart
3. A popover appears to confirm the restart
4. Repeat steps 2-3 for <master-0-1>

By the time I get to this step, the option to restart is greyed out (see attachment).
5. Repeat steps 2-3 for <master-0-2> (a rough CLI equivalent of this power action is sketched below)
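For reference, roughly the same power operation can be driven from the CLI by toggling the host's power state through the metal3 BareMetalHost API (a sketch under that assumption; the console's Restart action may use a different mechanism internally):

# oc patch bmh master-0-0 -n openshift-machine-api --type merge -p '{"spec":{"online":false}}'    (power off)
# oc patch bmh master-0-0 -n openshift-machine-api --type merge -p '{"spec":{"online":true}}'     (power back on)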

Actual results:
UI stops responding and returns 504 error
{"metadata":{},"status":"Failure","message":"Timeout: request did not complete within 1m0s","reason":"Timeout","details":{},"code":504}
CLI stops responding
Master nodes 0-0 and 0-1 never powered back on and remained shut off

Expected results:
The UI should never allow a user to get into this situation.

Additional info:
These were the console pods at the time of this negative test.
NAME                         READY   STATUS    RESTARTS   AGE     IP            NODE         NOMINATED NODE   READINESS GATES
console-676f59b7fb-67ggv     1/1     Running   0          3h48m   10.128.0.16   master-0-2   <none>           <none>
console-676f59b7fb-rqm57     1/1     Running   0          144m    10.129.0.34   master-0-1   <none>           <none>
downloads-6ddcb844f4-f7r9x   1/1     Running   0          3h48m   10.128.0.20   master-0-2   <none>           <none>
downloads-6ddcb844f4-whkdg   1/1     Running   0          4h2m    10.128.2.5    worker-0-0   <none>           <none>
[root@sealusa6 ~]# oc get clusterversions.config.openshift.io


The user can recover by manually powering the master nodes back up.
I saw both boot, and then master-0-1 powered back off again.
The UI/CLI did become reachable.
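Since two of the three masters were powered off, etcd presumably lost quorum, which would explain the API timeouts until the masters came back. A minimal post-recovery health check (assuming the kubeconfig still works) could be:

# oc get nodes
# oc get clusteroperators
# oc get pods -n openshift-console -o wide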

Comment 1 mlammon 2020-09-29 18:09:22 UTC
Now both nodes have powered off again and the UI is still not accessible.

# oc get pods -o wide -n openshift-console
The connection to the server api.ocp-edge-cluster-rdu1-0.qe.lab.redhat.com:6443 was refused - did you specify the right host or port?

I powered both back on again but couldn't access the UI and got an error; I finally logged back into the UI and now see master-0-0 power off again.
The system is now really unstable.

Comment 2 Tomas Jelinek 2020-10-01 07:09:54 UTC
I don't think the UI should be the one guarding against such a situation; the backend should make sure it does not allow the environment to get into such an unstable state. Moving to the backend component.

Comment 3 Steven Hardy 2020-10-01 09:47:45 UTC
I'm not sure how we should handle this - IMO the user shouldn't routinely be forcibly restarting nodes like this - the MCO is supposed to handle configuration changes and reboots in the majority of cases.

However, we provide a UX that enables power management, mostly for cases such as hardware maintenance or forcibly restarting a node due to an unexpected hardware or software issue.

Also, this UX is an advanced, admin-only interface; as an admin of a cluster I could also SSH to all the nodes and reboot them. Is this case any different, or should we just document "don't do that"?
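For comparison, the graceful single-node reboot flow is roughly the following (a sketch only; exact drain flags vary by release):

# oc adm cordon <node>
# oc adm drain <node> --ignore-daemonsets --delete-local-data
# oc debug node/<node> -- chroot /host systemctl reboot
# oc adm uncordon <node>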

Comment 4 Tomas Jelinek 2020-10-01 09:55:37 UTC
The scariest part for me is that the environment does not seem to recover from this at all.

Comment 5 Steven Hardy 2020-10-01 14:10:37 UTC
This is essentially the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1883614 - restart of a BMH is a potentially disruptive action, and we can track making that clearer in the UI via https://bugzilla.redhat.com/show_bug.cgi?id=1883614

As such, I'll close this as a duplicate; when the UI changes are completed we can follow up with a docs BZ if needed.

Comment 6 Steven Hardy 2020-10-01 14:11:25 UTC

*** This bug has been marked as a duplicate of bug 1883614 ***

