1713292 – Change the behavior of stopped existing instances in case forgot approving CSR AND oc delete stopped node which makes nodes finally NotReady

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.1.z
Assignee:	Jan Chaloupka
QA Contact:	Jianwei Hou
Docs Contact:
URL:
Whiteboard:	4.1.9
Depends On:
Blocks:

Reported:	2019-05-23 10:52 UTC by Xingxing Xia
Modified:	2019-08-07 15:06 UTC (History)
CC List:	15 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-08-07 15:06:02 UTC
Target Upstream Version:
Embargoed:

Attachments
recreation attempt (18.46 KB, text/plain) 2019-05-24 19:02 UTC, Luis Sanchez

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2019:2010	0	None	None	None	2019-08-07 15:06:06 UTC

Description Xingxing Xia 2019-05-23 10:52:52 UTC

Description of problem:
After a master is stopped in AWS, a kubeapiserver rollout cannot complete.
This was found while testing bug 1713219.

Version-Release number of selected component (if applicable):
4.1.0-0.nightly-2019-05-22-190823

How reproducible:
Always

Steps to Reproduce:
1. Create a 4.1 IPI env.
2. Stop one master (ip-10-0-129-92) in AWS. machine-api automatically creates a new master, which comes up running.
3. Roll out kubeapiserver with:
$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "forced test 1" } ]'
4. Check:
$ oc get po -n openshift-kube-apiserver

Actual results:
4. The rollout cannot complete because of a Pending pod:
NAME                                                           READY   STATUS      RESTARTS   AGE                                                                    
installer-2-ip-10-0-129-92.us-east-2.compute.internal          0/1     Completed   0          8h                                                                     
installer-2-ip-10-0-148-23.us-east-2.compute.internal          0/1     Completed   0          8h                                                                     
installer-2-ip-10-0-168-151.us-east-2.compute.internal         0/1     Completed   0          8h                                                                     
installer-3-ip-10-0-129-92.us-east-2.compute.internal          0/1     Completed   0          8h                                                                     
installer-3-ip-10-0-148-23.us-east-2.compute.internal          0/1     Completed   0          8h                                                                     
installer-4-ip-10-0-148-23.us-east-2.compute.internal          0/1     Completed   0          8h                                                                     
installer-6-ip-10-0-129-92.us-east-2.compute.internal          0/1     Completed   0          8h                                                                     
installer-6-ip-10-0-148-23.us-east-2.compute.internal          0/1     Completed   0          8h                                                                     
installer-6-ip-10-0-168-151.us-east-2.compute.internal         0/1     Completed   0          8h                                                                     
installer-7-ip-10-0-129-92.us-east-2.compute.internal          0/1     Pending     0          80m                                                                    
kube-apiserver-ip-10-0-129-92.us-east-2.compute.internal       2/2     Running     0          8h                                                                     
kube-apiserver-ip-10-0-148-23.us-east-2.compute.internal       2/2     Running     0          8h                                                                     
kube-apiserver-ip-10-0-168-151.us-east-2.compute.internal      2/2     Running     0          8h                                                                     
revision-pruner-2-ip-10-0-129-92.us-east-2.compute.internal    0/1     Completed   0          8h                                                                     
revision-pruner-2-ip-10-0-148-23.us-east-2.compute.internal    0/1     Completed   0          8h                                                                     
revision-pruner-2-ip-10-0-168-151.us-east-2.compute.internal   0/1     Completed   0          8h                                                                     
revision-pruner-3-ip-10-0-129-92.us-east-2.compute.internal    0/1     Completed   0          8h                                                                     
revision-pruner-3-ip-10-0-148-23.us-east-2.compute.internal    0/1     Completed   0          8h                                                                     
revision-pruner-4-ip-10-0-148-23.us-east-2.compute.internal    0/1     Completed   0          8h                                                                     
revision-pruner-6-ip-10-0-129-92.us-east-2.compute.internal    0/1     Completed   0          8h                                                                     
revision-pruner-6-ip-10-0-148-23.us-east-2.compute.internal    0/1     Completed   0          8h                                                                     
revision-pruner-6-ip-10-0-168-151.us-east-2.compute.internal   0/1     Completed   0          8h

Expected results:
4. Rollout should complete.
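
The stuck installer pod in the listing under "Actual results" can be picked out by filtering on the STATUS column. A minimal sketch, assuming the default `oc get po` table layout where STATUS is the third column; `pending_pods` is a hypothetical helper name, not part of `oc`:

```shell
# Hypothetical helper: print the names of pods whose STATUS is Pending.
# Reads `oc get po` table output on stdin and skips the header row.
# Assumption: STATUS is the third whitespace-separated column.
pending_pods() {
  awk 'NR > 1 && $3 == "Pending" { print $1 }'
}

# Usage (needs a logged-in oc client):
#   oc get po -n openshift-kube-apiserver | pending_pods
```

An empty result would suggest that no installer pod is waiting to be scheduled.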

Additional info:

Comment 1 Luis Sanchez 2019-05-23 18:32:42 UTC

Reproduction attempt:

1. Create an AWS env.
2. Stop one of the masters in AWS.
3. Manually approve the CSR for the new node.
4. kube-apiserver starts up fine, but not all operators are available and non-degraded:

$ oc get clusteroperators.config.openshift.io 
NAME                                 VERSION                         AVAILABLE   PROGRESSING   DEGRADED   SINCE
dns                                  4.2.0-0.okd-2019-05-23-161649   True        True          True       99m
machine-config                       4.2.0-0.okd-2019-05-23-161649   False       False         True       4m9s
monitoring                           4.2.0-0.okd-2019-05-23-161649   False       True          True       6m1s
network                              4.2.0-0.okd-2019-05-23-161649   True        True          False      99m

(I see now I installed a 4.2 version, will retry....)

Comment 4 Luis Sanchez 2019-05-24 19:02:26 UTC

Created attachment 1573005 [details]
recreation attempt

Comment 5 Luis Sanchez 2019-05-24 19:04:34 UTC

I got to the same point in the issue using 4.1.0-rc6. The Pending install pod is trying to be scheduled on the 'stopped/deleted' node.

Where are the instructions being followed for this scenario? They might be missing a step on how to remove the 'stopped/deleted' node.

Comment 6 Luis Sanchez 2019-05-24 23:05:29 UTC

I deleted the 'stopped/deleted' node (`oc delete node`) in my environment and the controllers applied the desired state.

Comment 7 Xingxing Xia 2019-05-27 06:55:39 UTC

(In reply to Luis Sanchez from comment #5)
> I got to the same point in the issue using 4.1.0-rc6. The Pending install
> pod is trying to be scheduled on the 'stopped/deleted' node.
> 
> Where are the instructions being followed for this scenario? They might be
> missing a step on how to remove the 'stopped/deleted' node.

The instructions followed for comment 0 are already written in comment 0's steps, where I didn't approve the Pending CSR and didn't delete the stopped node.
The instructions for comment 2 are: based on the comment 0 result, the next day (when the env had been up for 24+ hours) I checked `oc get node` and all nodes were NotReady; I checked `oc get csr` and many CSRs were Pending; I approved these Pending CSRs, new CSRs were created and were Pending again, I approved them again, and new Pending CSRs appeared again. This loop seemed to have no end. After this, all nodes were still NotReady, and I performed no other operations.

Comment 8 Xingxing Xia 2019-05-27 07:23:39 UTC

Today I tried a fresh env with the added "approve CSR" and "oc delete stopped master-0" steps: stop master-0, wait for the new master-0, approve the CSR, oc delete the stopped master-0, roll out kubeapiserver, then check pods and oc get co. With these added steps, the rollout succeeded and `oc get co` shows all clusteroperators in a good state.
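
The two added steps from this comment could be scripted roughly as follows. This is a sketch, not a documented recovery procedure: `approve_pending_csrs` and `delete_stopped_node` are hypothetical helper names, and the CSR filter assumes that CONDITION is the last column of `oc get csr` output.

```shell
# Hypothetical helper: approve every CSR whose CONDITION column is Pending.
# Assumption: CONDITION is the last column of `oc get csr` output.
approve_pending_csrs() {
  oc get csr --no-headers |
    awk '$NF == "Pending" { print $1 }' |
    while read -r csr; do
      # Redirect stdin so the command cannot swallow the remaining names.
      oc adm certificate approve "$csr" < /dev/null
    done
}

# Hypothetical helper: delete the Node object of the stopped master.
# $1 is the node name as shown by `oc get node`.
delete_stopped_node() {
  oc delete node "$1"
}
```

Because comment 7 reports new Pending CSRs appearing after an approval, `approve_pending_csrs` may need to be run more than once before the new node reports Ready.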

Comment 9 Stefan Schimanski 2019-05-27 08:15:12 UTC

Why is the node list not updated? If comment 6 is true, the apiserver controller is acting correctly. We have to get an updated nodes list. That list is maintained by the node controller / cloud controller.

Comment 11 Stefan Schimanski 2019-05-28 08:07:04 UTC

@Xingxing: to clarify step 2 in the description: was the node stopped or terminated (i.e. it disappeared) ?

Comment 12 Xingxing Xia 2019-05-28 08:45:35 UTC

Stopped, not terminated

Comment 13 Stefan Schimanski 2019-05-28 13:12:20 UTC

From slack discussion:

Alberto: as soon as the node goes unhealthy -> the machine API watches that "node status update" and updates the machine object -> a machine API reconciliation loop is triggered -> the machine API sees the instance is not running, so a new one is created

This does not look like the right behaviour. A stopped node is not removed, and should not be. There are good reasons to stop a node (e.g. mounting a volume). The machine API should only start a new master if the AWS machine is terminated.
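
The rule proposed in this comment (replace an instance only when it is terminated, never when it is merely stopped) can be illustrated with a small decision function. This only sketches the suggested rule and is not the machine API's actual implementation; `replacement_action` is a hypothetical name, and the argument is assumed to be an EC2 instance state name such as `running`, `stopped`, or `terminated`.

```shell
# Hypothetical sketch of the proposed rule, not the machine API's code.
# $1 is an EC2 instance state name; prints "replace" or "keep".
replacement_action() {
  case "$1" in
    terminated) echo replace ;;  # the instance is gone, so a new one is needed
    *)          echo keep ;;     # stopped or any other state: leave it alone
  esac
}

# Usage:
#   replacement_action stopped
#   replacement_action terminated
```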

Comment 14 Mike Fiedler 2019-05-28 16:09:22 UTC

Re-sizing a master to a new instance type is another good reason to stop it - we want to document that scenario.

Comment 20 Xingxing Xia 2019-05-30 03:33:12 UTC

I have another question. From my testing, it seems this bug's nodes become NotReady only once the env has been up for about **24** hours. Is this related to the cert timing described in https://bugzilla.redhat.com/show_bug.cgi?id=1713999#c3 ? If it is, does this mean that if an env has already run well past 24 hours and is before the next cert renewal (i.e. the 30-day period), and the user stops a master and does not manually approve the CSR for the new master, the env would not hit this bug until the cert renewal time comes? If anyone knows, please help answer. Thanks!

Comment 22 Xingxing Xia 2019-05-31 03:03:11 UTC

After asking Alberto in slack #forum-cloud about comment 17: comment 17's step "Restart old instance, old node come back healthy" needs to be done **before** this bug actually occurs. (If we are only talking about "before", then comment 8 is better. This bug's concern is what happens if a customer forgets, or does not know, to follow the workflow.) This info plus comment 21 does not seem to show how to rescue the env **after** this bug has actually occurred. Comment 18 says the remediation will be documented. Could you show me where the document will be put? Is it meant for the situation **after** this bug has actually occurred? Thanks.

Comment 25 Michal Fojtik 2019-06-03 09:21:05 UTC

There is not much the master team can do here other than documenting this behavior for disaster recovery.
We also have a Jira card (Stefan linked it in one of the comments) under which the control plane operator(s) should become Degraded when one or more master nodes are not ready.

I'm moving this to the documentation team so they can coordinate what we should document and where.

Comment 34 Jianwei Hou 2019-07-31 03:21:58 UTC

Verified in 4.2.0-0.nightly-2019-07-30-155738

When a worker instance is stopped from the AWS EC2 console, the node changes to NotReady. The machine shows the instance is in 'stopped' state.

$ oc get node ip-10-0-137-181.ap-northeast-1.compute.internal
NAME                                              STATUS     ROLES    AGE     VERSION
ip-10-0-137-181.ap-northeast-1.compute.internal   NotReady   worker   4h54m   v1.14.0+2e9d4a117

$ oc get machine jhou1-jqqs2-worker-ap-northeast-1a-b5bdl
NAME                                       INSTANCE              STATE     TYPE       REGION           ZONE              AGE
jhou1-jqqs2-worker-ap-northeast-1a-b5bdl   i-0932614851631353e   stopped   m4.large   ap-northeast-1   ap-northeast-1a   4h59m

Once the instance is started, the node and machine status is recovered.

$ oc get node ip-10-0-137-181.ap-northeast-1.compute.internal
NAME                                              STATUS   ROLES    AGE     VERSION
ip-10-0-137-181.ap-northeast-1.compute.internal   Ready    worker   4h57m   v1.14.0+2e9d4a117

$ oc get machine jhou1-jqqs2-worker-ap-northeast-1a-b5bdl
NAME                                       INSTANCE              STATE     TYPE       REGION           ZONE              AGE
jhou1-jqqs2-worker-ap-northeast-1a-b5bdl   i-0932614851631353e   running   m4.large   ap-northeast-1   ap-northeast-1a   5h2m
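
The check in this verification can be reduced to reading one column. A minimal sketch, assuming the `oc get machine` column order shown in this comment (NAME, INSTANCE, STATE, ...) and that both NAME and INSTANCE are populated; `machine_state` is a hypothetical helper name:

```shell
# Hypothetical helper: print the STATE column for one machine.
# Reads `oc get machine` table output on stdin; $1 is the machine name.
# Assumption: STATE is the third column, as in the listing in this comment.
machine_state() {
  awk -v name="$1" '$1 == name { print $3 }'
}

# Usage (assuming the machines live in the openshift-machine-api namespace):
#   oc get machine -n openshift-machine-api | machine_state jhou1-jqqs2-worker-ap-northeast-1a-b5bdl
```

For the two listings in this comment, the helper would print `stopped` before the instance is started and `running` afterwards.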

Comment 35 Xingxing Xia 2019-07-31 06:06:04 UTC

Verified in 4.1.0-0.nightly-2019-07-31-005945

Comment 37 errata-xmlrpc 2019-08-07 15:06:02 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2010
