Description of problem:
After a master is stopped in AWS, the kubeapiserver rollout cannot complete. This was found while testing bug 1713219.

Version-Release number of selected component (if applicable):
4.1.0-0.nightly-2019-05-22-190823

How reproducible:
Always

Steps to Reproduce:
1. Create a 4.1 IPI env.
2. Stop one master ip-10-0-129-92 in AWS. A new master will be automatically created by machine-api and will be running.
3. Roll out kubeapiserver by:
$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "forced test 1" } ]'
4. Check:
$ oc get po -n openshift-kube-apiserver

Actual results:
4. Rollout cannot complete due to a Pending pod:
NAME                                                           READY   STATUS      RESTARTS   AGE
installer-2-ip-10-0-129-92.us-east-2.compute.internal          0/1     Completed   0          8h
installer-2-ip-10-0-148-23.us-east-2.compute.internal          0/1     Completed   0          8h
installer-2-ip-10-0-168-151.us-east-2.compute.internal         0/1     Completed   0          8h
installer-3-ip-10-0-129-92.us-east-2.compute.internal          0/1     Completed   0          8h
installer-3-ip-10-0-148-23.us-east-2.compute.internal          0/1     Completed   0          8h
installer-4-ip-10-0-148-23.us-east-2.compute.internal          0/1     Completed   0          8h
installer-6-ip-10-0-129-92.us-east-2.compute.internal          0/1     Completed   0          8h
installer-6-ip-10-0-148-23.us-east-2.compute.internal          0/1     Completed   0          8h
installer-6-ip-10-0-168-151.us-east-2.compute.internal         0/1     Completed   0          8h
installer-7-ip-10-0-129-92.us-east-2.compute.internal          0/1     Pending     0          80m
kube-apiserver-ip-10-0-129-92.us-east-2.compute.internal       2/2     Running     0          8h
kube-apiserver-ip-10-0-148-23.us-east-2.compute.internal       2/2     Running     0          8h
kube-apiserver-ip-10-0-168-151.us-east-2.compute.internal      2/2     Running     0          8h
revision-pruner-2-ip-10-0-129-92.us-east-2.compute.internal    0/1     Completed   0          8h
revision-pruner-2-ip-10-0-148-23.us-east-2.compute.internal    0/1     Completed   0          8h
revision-pruner-2-ip-10-0-168-151.us-east-2.compute.internal   0/1     Completed   0          8h
revision-pruner-3-ip-10-0-129-92.us-east-2.compute.internal    0/1     Completed   0          8h
revision-pruner-3-ip-10-0-148-23.us-east-2.compute.internal    0/1     Completed   0          8h
revision-pruner-4-ip-10-0-148-23.us-east-2.compute.internal    0/1     Completed   0          8h
revision-pruner-6-ip-10-0-129-92.us-east-2.compute.internal    0/1     Completed   0          8h
revision-pruner-6-ip-10-0-148-23.us-east-2.compute.internal    0/1     Completed   0          8h
revision-pruner-6-ip-10-0-168-151.us-east-2.compute.internal   0/1     Completed   0          8h

Expected results:
4. Rollout should complete.

Additional info:
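A minimal sketch of how the stopped master and its replacement can be checked at step 2 (all standard oc commands; resource names are from this environment and openshift-machine-api is the default machine-api namespace):

# Confirm machine-api created a replacement machine for the stopped master
$ oc get machines -n openshift-machine-api
# The stopped master's node goes NotReady; the replacement will not join the cluster
# on its own because its CSRs remain Pending unless approved
$ oc get nodes
$ oc get csr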
Reproduction attempt:
1. create AWS env
2. stop one of the masters in AWS
3. Manually approved CSR for new node.
4. kube-apiserver starts up fine, but not all operators available/not degraded:

$ oc get clusteroperators.config.openshift.io
NAME             VERSION                         AVAILABLE   PROGRESSING   DEGRADED   SINCE
dns              4.2.0-0.okd-2019-05-23-161649   True        True          True       99m
machine-config   4.2.0-0.okd-2019-05-23-161649   False       False         True       4m9s
monitoring       4.2.0-0.okd-2019-05-23-161649   False       True          True       6m1s
network          4.2.0-0.okd-2019-05-23-161649   True        True          False      99m

(I see now I installed a 4.2 version, will retry....)
Created attachment 1573005: recreation attempt
I got to the same point in the issue using 4.1.0-rc6. The Pending install pod is trying to be scheduled on the 'stopped/deleted' node. Where are the instructions being followed for this scenario? They might be missing a step on how to remove the 'stopped/deleted' node.
I deleted the 'stopped/deleted' node (`oc delete node`) in my environment and the controllers applied the desired state.
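A minimal sketch of that cleanup, using the node name from comment 0 for illustration:

# Identify the NotReady node left behind by the stopped instance
$ oc get nodes
# Remove the stale node object so the installer pod can be scheduled elsewhere
$ oc delete node ip-10-0-129-92.us-east-2.compute.internal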
(In reply to Luis Sanchez from comment #5)
> I got to the same point in the issue using 4.1.0-rc6. The Pending install
> pod is trying to be scheduled on the 'stopped/deleted' node.
>
> Where are the instructions being followed for this scenario? They might be
> missing a step on how to remove the 'stopped/deleted' node.

The instructions followed for comment 0 are exactly the steps written in comment 0: I did not approve the Pending CSRs and did not delete the stopped node.

The instructions for comment 2 were: starting from the comment 0 result, the next day (after the env had been up 24+ hours) I checked `oc get node` and all nodes were NotReady; `oc get csr` showed many Pending CSRs. I approved the Pending CSRs, new CSRs were created and went Pending again, I approved them again, and more Pending CSRs appeared; the loop seemed to have no end. After that, all nodes were still NotReady and I performed no further operations.
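A minimal sketch of bulk-approving the Pending CSRs during that loop (standard oc commands; the awk filter is just one way to pick out Pending requests):

# Approve every CSR currently in Pending state
$ oc get csr --no-headers | awk '/Pending/ {print $1}' | xargs oc adm certificate approve
# New CSRs may appear right after approval; re-check and repeat if needed
$ oc get csr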
Today I tried a fresh env with the added "approve CSR" and "oc delete stopped master-0" steps: stop master-0, wait for the new master-0, approve the CSRs, `oc delete` the stopped master-0's node, roll out kubeapiserver, then check the pods and `oc get co`. With these added steps the rollout succeeded and `oc get co` shows all clusteroperators in a good state.
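A condensed sketch of that workflow (the node name and forceRedeploymentReason value are illustrative, taken from comment 0's environment):

# 1. Stop master-0 in the EC2 console and wait for machine-api to create a replacement
# 2. Approve the new node's Pending CSRs
$ oc get csr --no-headers | awk '/Pending/ {print $1}' | xargs oc adm certificate approve
# 3. Delete the node object left behind by the stopped master
$ oc delete node ip-10-0-129-92.us-east-2.compute.internal
# 4. Force a kubeapiserver rollout and verify
$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "forced test 2" } ]'
$ oc get po -n openshift-kube-apiserver
$ oc get co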
Why is the node list not updated? If comment 6 is true, the apiserver controller is acting correctly. We have to get an updated node list. That list is maintained by the node controller / cloud controller.
@Xingxing: to clarify step 2 in the description: was the node stopped or terminated (i.e. it disappeared) ?
Stopped, not terminated
From slack discussion:

Alberto: as soon as the node goes unhealthy ->
the machine API watches that "node status update" and updates the machine object ->
a machine API reconciliation loop is triggered ->
the machine API sees the instance is not running, so a new one is created.

This does not look like the right behaviour. A stopped node is not removed, and it should not be. There are good reasons to stop a node (e.g. mounting a volume). The machine API should only start a new master if the AWS machine is terminated.
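A minimal way to see the instance state the machine API is reacting to (the machine name below is a placeholder):

# The STATE column reflects the cloud instance state (running / stopped / terminated)
$ oc get machines -n openshift-machine-api
# Full provider status for a single master machine
$ oc get machine <master-machine-name> -n openshift-machine-api -o yaml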
Re-sizing a master to a new instance type is another good reason to stop it - we want to document that scenario.
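For reference, a minimal sketch of that resize flow with the AWS CLI (instance ID and target type are placeholders):

# Stop the master's instance, change its type, then start it again
$ aws ec2 stop-instances --instance-ids i-0123456789abcdef0
$ aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0
$ aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --instance-type "{\"Value\": \"m5.xlarge\"}"
$ aws ec2 start-instances --instance-ids i-0123456789abcdef0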
I have another question: from my testing, it seems the nodes become NotReady only around the point when the env has been up for about **24** hours. Does this relate to the certificate rotation time in https://bugzilla.redhat.com/show_bug.cgi?id=1713999#c3 ? If it does, does that mean: if an env already runs well past 24 hours and before the next certificate renewal (i.e. the 30-day period), and the user stops a master without manually approving the CSRs for the new master, then the env would not hit this bug until the certificate renewal time comes? Anyone who knows, please help answer, thanks!
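In case it helps with checking the timing, a hedged sketch for inspecting the kubelet client certificate dates on a master (assuming the standard kubelet client certificate path on RHCOS; the node name is from comment 0's environment):

# Check certificate validity dates on a running master, plus any pending renewals
$ oc debug node/ip-10-0-148-23.us-east-2.compute.internal -- chroot /host \
    openssl x509 -noout -dates -in /var/lib/kubelet/pki/kubelet-client-current.pem
$ oc get csr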
Per the slack #forum-cloud discussion with Alberto about comment 17, comment 17's step "Restart old instance, old node come back healthy" needs to be done **before** this bug actually occurs (if we are only talking about "before", then comment 8 is better; this bug's concern is what happens if the customer forgets, or does not know, to follow that workflow). This info plus comment 21 does not seem to show how to rescue the env once the bug has **already** occurred. Comment 18 says the remediation will be documented. Could you show me where that document would be put? Does it cover the situation **after** this bug actually occurs? Thanks.
There is not much the master team can do here other than documenting this behavior for disaster recovery. We also have a Jira card (Stefan linked it in one of the comments) for making the control plane operator(s) go Degraded when one or more master nodes are not ready. I'm moving this to the documentation team so they can coordinate what we should document and where.
Verified in 4.2.0-0.nightly-2019-07-30-155738

When a worker instance is stopped from the AWS EC2 console, the node changes to NotReady. The machine shows the instance is in 'stopped' state.

$ oc get node ip-10-0-137-181.ap-northeast-1.compute.internal
NAME                                              STATUS     ROLES    AGE     VERSION
ip-10-0-137-181.ap-northeast-1.compute.internal   NotReady   worker   4h54m   v1.14.0+2e9d4a117

$ oc get machine jhou1-jqqs2-worker-ap-northeast-1a-b5bdl
NAME                                       INSTANCE              STATE     TYPE       REGION           ZONE              AGE
jhou1-jqqs2-worker-ap-northeast-1a-b5bdl   i-0932614851631353e   stopped   m4.large   ap-northeast-1   ap-northeast-1a   4h59m

Once the instance is started, the node and machine status recover.

$ oc get node ip-10-0-137-181.ap-northeast-1.compute.internal
NAME                                              STATUS   ROLES    AGE     VERSION
ip-10-0-137-181.ap-northeast-1.compute.internal   Ready    worker   4h57m   v1.14.0+2e9d4a117

$ oc get machine jhou1-jqqs2-worker-ap-northeast-1a-b5bdl
NAME                                       INSTANCE              STATE     TYPE       REGION           ZONE              AGE
jhou1-jqqs2-worker-ap-northeast-1a-b5bdl   i-0932614851631353e   running   m4.large   ap-northeast-1   ap-northeast-1a   5h2m
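For completeness, the stop/start can also be driven from the AWS CLI instead of the EC2 console (instance ID is the one shown above):

$ aws ec2 stop-instances --instance-ids i-0932614851631353e
# ...node goes NotReady and the machine STATE shows 'stopped'...
$ aws ec2 start-instances --instance-ids i-0932614851631353e
# Watch the node return to Ready
$ oc get node ip-10-0-137-181.ap-northeast-1.compute.internal -w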
Verified in 4.1.0-0.nightly-2019-07-31-005945
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2010