Created attachment 1717597 [details]
Restart UI shot

Description of problem:
The UI console should not allow a user to "Restart" more than one node at a time. Doing so results in

{"metadata":{},"status":"Failure","message":"Timeout: request did not complete within 1m0s","reason":"Timeout","details":{},"code":504}

and the CLI no longer works.

Version-Release number of selected component (if applicable):
Cluster version is 4.6.0-fc.8
Cluster is 3 masters, 2 workers

How reproducible:
100%

Steps to Reproduce:
1. Install OCP 4.6
2. Compute -> Bare Metal Hosts -> 3-dot kebab for <master-0-0> -> Restart
3. A popover appears to confirm the restart
4. Repeat steps 2-3 for <master-0-1>. By the time I get to this step, the Restart option is greyed out (see attachment).
5. Repeat steps 2-3 for <master-0-2>.

Actual results:
The UI stops responding and returns a 504 error:

{"metadata":{},"status":"Failure","message":"Timeout: request did not complete within 1m0s","reason":"Timeout","details":{},"code":504}

The CLI stops responding as well. Master nodes master-0-0 and master-0-1 never power back on and remain shut off.

Expected results:
The UI should never allow a user to get into this situation.

Additional info:
These were the console pods at the time of this negative test:

NAME                         READY   STATUS    RESTARTS   AGE     IP            NODE         NOMINATED NODE   READINESS GATES
console-676f59b7fb-67ggv     1/1     Running   0          3h48m   10.128.0.16   master-0-2   <none>           <none>
console-676f59b7fb-rqm57     1/1     Running   0          144m    10.129.0.34   master-0-1   <none>           <none>
downloads-6ddcb844f4-f7r9x   1/1     Running   0          3h48m   10.128.0.20   master-0-2   <none>           <none>
downloads-6ddcb844f4-whkdg   1/1     Running   0          4h2m    10.128.2.5    worker-0-0   <none>           <none>

[root@sealusa6 ~]# oc get clusterversions.config.openshift.io

The user can recover by manually powering the master nodes back up. I saw both boot, and then master-0-1 powered back off. The UI/CLI did become reachable again.
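For reference, once the API server is reachable again, the host power state can also be inspected and corrected through the BareMetalHost objects rather than the console. The commands below are only a sketch and assume the hosts live in the usual openshift-machine-api namespace, with the host names used in this report:

# check the power state reported for each host (status.poweredOn)
oc get bmh -n openshift-machine-api

# ask the baremetal-operator to power a host back on
oc patch bmh master-0-1 -n openshift-machine-api --type merge -p '{"spec":{"online":true}}'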
Now both nodes power off again and the UI is still not accessible.

# oc get pods -o wide -n openshift-console
The connection to the server api.ocp-edge-cluster-rdu1-0.qe.lab.redhat.com:6443 was refused - did you specify the right host or port?

I powered both back on again and still can't access the UI; I get an error, finally log back into the UI, and now see master-0-0 power off again. This system is now really unstable.
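Once the masters are powered back on, recovery should be verifiable with something like the following (a sketch; nothing is assumed beyond a working kubeconfig):

# all nodes should rejoin and report Ready
oc get nodes

# etcd and the other cluster operators should settle back to Available=True
oc get clusteroperators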
I don't think the UI should be the one guarding against such a situation. The backend should make sure it does not allow the environment to get into such an unstable state. Moving to the backend component.
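To illustrate the kind of guard I mean: before honouring a restart, the backend (or even a pre-flight check) could refuse if a control-plane BareMetalHost is already powered off. The snippet below is only a sketch of the idea - the name-based master filter and the namespace are assumptions, not an existing API:

# hypothetical pre-flight check: refuse a restart while any control-plane
# BareMetalHost is already powered off
OFF_MASTERS=$(oc get bmh -n openshift-machine-api \
  -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.poweredOn}{"\n"}{end}' \
  | grep '^master' | grep -c 'false$')
if [ "$OFF_MASTERS" -gt 0 ]; then
  echo "refusing restart: $OFF_MASTERS control-plane host(s) already powered off"
  exit 1
fi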
I'm not sure how we should handle this - IMO the user shouldn't routinely be forcibly restarting nodes like this; the MCO is supposed to handle configuration changes and reboots in the majority of cases. However, we provide a UX that enables power management, mostly for cases such as hardware maintenance, or forcibly restarting a host due to an unexpected hardware or software issue. Also, this UX is an advanced, admin-only interface; as an admin of a cluster I could equally SSH to all the nodes and reboot them. Is this case any different, or should we just document "don't do that"?
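For comparison, the per-node flow an admin would normally follow (and roughly the pattern the MCO uses for config-driven reboots) is drain, reboot, then uncordon, one node at a time. A sketch, using master-0-0 as the example:

# cordon and drain the node before touching its power
oc adm cordon master-0-0
oc adm drain master-0-0 --ignore-daemonsets

# restart the host (console action, BMC, or a reboot from the node itself),
# then wait for it to report Ready again before moving to the next node
oc get node master-0-0 -w

# make the node schedulable again
oc adm uncordon master-0-0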
The scariest part for me is that the environment does not seem to recover from this at all.
This is essentially the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1883614 - restarting a BMH is a potentially disruptive action, and we can track making that clearer in the UI via https://bugzilla.redhat.com/show_bug.cgi?id=1883614. As such I'll close this as a duplicate - when the UI changes are completed we can follow up with a docs BZ if it's felt necessary.
*** This bug has been marked as a duplicate of bug 1883614 ***