Bug 1713219

Summary: [DOCS] Document how to add a new host into Ready nodes after a node is stopped in AWS
Product: OpenShift Container Platform
Reporter: Xingxing Xia <xxia>
Component: Documentation
Assignee: Andrea Hoffer <ahoffer>
Status: CLOSED CURRENTRELEASE
QA Contact: Xingxing Xia <xxia>
Severity: medium
Docs Contact: Vikram Goyal <vigoyal>
Priority: medium
Version: 4.1.0
CC: agarcial, amcdermo, aos-bugs, bleanhar, calfonso, jokerman, mifiedle, mmccomas
Target Milestone: ---
Keywords: Regression
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-05-31 01:49:11 UTC
Type: Bug

Description Xingxing Xia 2019-05-23 07:52:35 UTC
Description of problem:
After a master is stopped in AWS, the new host auto-created by machine-api cannot become Ready and shows "E0523 ... 860  kubelet.go:2274] node ... not found".
This blocks testing https://bugzilla.redhat.com/show_bug.cgi?id=1709802 and the doc http://file.rdu.redhat.com/~ahoffer/2019/disaster-recovery/disaster_recovery/scenario-1-infra-recovery.html, both of which require the new host to be shown in `oc get node`.

Version-Release number of selected component (if applicable):
4.1.0-0.nightly-2019-05-22-050858
4.1.0-0.nightly-2019-05-22-190823

How reproducible:
Always

Steps to Reproduce:
1. Create a 4.1 IPI env
2. In the AWS console, stop one master. A new master instance will be auto-created by machine-api and start running.
3. Check `oc get node`
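(For reference, a minimal way to check steps 2-3 from a terminal, assuming the cluster kubeconfig is exported; these are standard `oc` queries:)

# After stopping the master in the AWS console, watch machine-api create a replacement machine
$ oc get machines -n openshift-machine-api -o wide
# Check whether the replacement host ever registers as a node
$ oc get nodes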

Actual results:
3. It never shows the new host (10.0.132.190) auto-created by machine-api (but previously, e.g. last week when bug 1709802 was reported, it did).
SSH to the new host and check the kubelet journal as below; the following errors are found:
[core@ip-10-0-132-190 ~]$ journalctl -u kubelet.service -n all
...
May 23 05:40:32 ip-10-0-132-190 hyperkube[860]: E0523 05:40:32.705775     860 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:453: Failed to list *v1.Node: nodes "ip-10-0-132-190.us-east-2.compute.internal" is forbidden: User "system:anonymous" cannot list resource "nodes" in API group "" at the cluster scope
May 23 05:40:32 ip-10-0-132-190 hyperkube[860]: E0523 05:40:32.805909     860 kubelet.go:2274] node "ip-10-0-132-190.us-east-2.compute.internal" not found
May 23 05:40:32 ip-10-0-132-190 hyperkube[860]: E0523 05:40:32.906176     860 kubelet.go:2274] node "ip-10-0-132-190.us-east-2.compute.internal" not found
May 23 05:40:33 ip-10-0-132-190 hyperkube[860]: E0523 05:40:33.006431     860 kubelet.go:2274] node "ip-10-0-132-190.us-east-2.compute.internal" not found
May 23 05:40:33 ip-10-0-132-190 hyperkube[860]: E0523 05:40:33.106686     860 kubelet.go:2274] node "ip-10-0-132-190.us-east-2.compute.internal" not found
May 23 05:40:33 ip-10-0-132-190 hyperkube[860]: E0523 05:40:33.206957     860 kubelet.go:2274] node "ip-10-0-132-190.us-east-2.compute.internal" not found
May 23 05:40:33 ip-10-0-132-190 hyperkube[860]: E0523 05:40:33.307202     860 kubelet.go:2274] node "ip-10-0-132-190.us-east-2.compute.internal" not found
May 23 05:40:33 ip-10-0-132-190 hyperkube[860]: E0523 05:40:33.407430     860 kubelet.go:2274] node "ip-10-0-132-190.us-east-2.compute.internal" not found
May 23 05:40:33 ip-10-0-132-190 hyperkube[860]: E0523 05:40:33.507677     860 kubelet.go:2274] node "ip-10-0-132-190.us-east-2.compute.internal" not found
May 23 05:40:33 ip-10-0-132-190 hyperkube[860]: E0523 05:40:33.607924     860 kubelet.go:2274] node "ip-10-0-132-190.us-east-2.compute.internal" not found
May 23 05:40:33 ip-10-0-132-190 hyperkube[860]: I0523 05:40:33.703978     860 reflector.go:160] Listing and watching *v1.Service from k8s.io/kubernetes/pkg/kubelet/kubelet.go:444
May 23 05:40:33 ip-10-0-132-190 hyperkube[860]: I0523 05:40:33.705117     860 reflector.go:160] Listing and watching *v1.Pod from k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47
May 23 05:40:33 ip-10-0-132-190 hyperkube[860]: E0523 05:40:33.706065     860 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:444: Failed to list *v1.Service: services is forbidden: User "system:anonymous" cannot list resource "services" in API group "" at the cluster scope
May 23 05:40:33 ip-10-0-132-190 hyperkube[860]: I0523 05:40:33.706168     860 reflector.go:160] Listing and watching *v1.Node from k8s.io/kubernetes/pkg/kubelet/kubelet.go:453
May 23 05:40:33 ip-10-0-132-190 hyperkube[860]: E0523 05:40:33.707014     860 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: pods is forbidden: User "system:anonymous" cannot list resource "pods" in API group "" at the cluster scope
May 23 05:40:33 ip-10-0-132-190 hyperkube[860]: E0523 05:40:33.707907     860 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:453: Failed to list *v1.Node: nodes "ip-10-0-132-190.us-east-2.compute.internal" is forbidden: User "system:anonymous" cannot list resource "nodes" in API group "" at the cluster scope
May 23 05:40:33 ip-10-0-132-190 hyperkube[860]: E0523 05:40:33.708126     860 kubelet.go:2274] node "ip-10-0-132-190.us-east-2.compute.internal" not found
May 23 05:40:33 ip-10-0-132-190 hyperkube[860]: E0523 05:40:33.808371     860 kubelet.go:2274] node "ip-10-0-132-190.us-east-2.compute.internal" not found

Expected results:
3. `oc get node` should show the new host, like before.

Additional info:

Comment 1 Jan Chaloupka 2019-05-23 16:50:41 UTC
The Machine API is not responsible for what actually runs inside the AMI. Given the logs, it seems the kubeconfig is not properly generated/provided. Switching to the installer team to re-evaluate. As far as the Machine API is involved in the process, the AWS actuator sends a request to the AWS API to provision an instance given the AMI and other AWS provider-specific information. The Machine API has no visibility past the point where the AWS instance starts running.
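(A quick way to confirm the actuator side did its part, i.e. that a backing instance was provisioned and recorded for the replacement machine; the machine name below is a placeholder:)

# List machines and their backing instance/state columns
$ oc get machines -n openshift-machine-api -o wide
# Inspect the provider status recorded for the replacement machine
$ oc describe machine <replacement-machine> -n openshift-machine-api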

Comment 2 Andrew McDermott 2019-05-23 17:37:52 UTC
I experimented with:

- standup a cluster
- all nodes running
- stop one node via the AWS console (running -> stopped)

The node goes NotReady.

I see the machine-controller create a new backing instance, because the actuator's 'does instance exist' check requires the current instance to be in the 'running' state.

I then resumed the instance in the AWS console.

The existing node transitions back to 'Ready' and the AWS actuator actually deletes the instance that got created but never turned into a node - so we don't appear to leak instances.
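(Roughly the same experiment from the CLI, in case it helps reproduce; the instance ID is a placeholder and the stop/start can equally be done in the AWS console:)

# Stop the instance backing one node (running -> stopped)
$ aws ec2 stop-instances --instance-ids i-0123456789abcdef0
# Watch the node go NotReady and a replacement machine appear
$ oc get nodes -w
$ oc get machines -n openshift-machine-api -w
# Start the instance again; the original node returns to Ready and the extra instance is removed
$ aws ec2 start-instances --instance-ids i-0123456789abcdef0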

Comment 3 Brenton Leanhardt 2019-05-23 17:49:22 UTC
In Xingxing's case I believe DR is being tested.  Since we no longer auto approve client CSRs the way we used to (https://bugzilla.redhat.com/show_bug.cgi?id=1707162) the master recovery docs likely need to be updated with details on how to approve them.  Tomas, could you take a look?
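(For context, the manual approval flow the docs would need to describe is roughly the following; the CSR name is a placeholder:)

# List CSRs; the new host's kubelet requests show up as Pending
$ oc get csr
# Approve a specific pending CSR
$ oc adm certificate approve <csr_name>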

Comment 4 Andrew McDermott 2019-05-23 18:03:00 UTC
(In reply to Brenton Leanhardt from comment #3)
> In Xingxing's case I believe DR is being tested.  Since we no longer auto
> approve client CSRs the way we used to
> (https://bugzilla.redhat.com/show_bug.cgi?id=1707162) the master recovery
> docs likely need to be updated with details on how to approve them.  Tomas,
> could you take a look?

And if you manually approve the CSRs then the node comes up just fine. Thanks.

Comment 5 Xingxing Xia 2019-05-24 07:48:11 UTC
Right, I tried approving the CSRs with `oc get csr -ojson | jq -r '.items[] | select(.status == {} ) | .metadata.name' | xargs oc adm certificate approve`, and then the new host shows as Ready in `oc get node`.
Converting to a Doc bug per comment 3, for the doc mentioned in comment 0.
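(One detail that may be worth reflecting in the doc, based on my understanding of the kubelet bootstrap flow, so please verify: after the client CSR is approved the kubelet typically submits a second, serving CSR, so the approval command may need to be repeated until no pending CSRs remain:)

# Repeat until no pending CSRs remain, then confirm the node is Ready
$ oc get csr -ojson | jq -r '.items[] | select(.status == {} ) | .metadata.name' | xargs oc adm certificate approve
$ oc get nodes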

Comment 6 Andrea Hoffer 2019-05-30 00:58:55 UTC
I've updated the disaster recovery docs PR [1] to add manual steps for approving the CSRs. @Xingxing Xia, could you please try it out? You can preview it here [2]; the updates were made to step 2.

[1] https://github.com/openshift/openshift-docs/pull/14859
[2] http://file.rdu.redhat.com/~ahoffer/2019/disaster-recovery/disaster_recovery/scenario-1-infra-recovery.html

Comment 7 Xingxing Xia 2019-05-30 06:46:27 UTC
LGTM, thanks