Description of problem:
After a master is stopped in AWS, the new host auto-created by machine-api never becomes Ready and shows "E0523 ... 860 kubelet.go:2274] node ... not found". This blocks testing https://bugzilla.redhat.com/show_bug.cgi?id=1709802 and the doc http://file.rdu.redhat.com/~ahoffer/2019/disaster-recovery/disaster_recovery/scenario-1-infra-recovery.html, both of which require the new host to appear in `oc get node`.

Version-Release number of selected component (if applicable):
4.1.0-0.nightly-2019-05-22-050858
4.1.0-0.nightly-2019-05-22-190823

How reproducible:
Always

Steps to Reproduce:
1. Create a 4.1 IPI env.
2. In the AWS console, stop one master. A new master will be auto-created by machine-api and start running.
3. Check `oc get node`.

Actual results:
3. It never shows the new host (10.0.132.190) auto-created by machine-api (previously, e.g. last week when reporting bug 1709802, it did). ssh to the new host and check; the error below is found:

[core@ip-10-0-132-190 ~]$ journalctl -u kubelet.service -n all
...
May 23 05:40:32 ip-10-0-132-190 hyperkube[860]: E0523 05:40:32.705775 860 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:453: Failed to list *v1.Node: nodes "ip-10-0-132-190.us-east-2.compute.internal" is forbidden: User "system:anonymous" cannot list resource "nodes" in API group "" at the cluster scope
May 23 05:40:32 ip-10-0-132-190 hyperkube[860]: E0523 05:40:32.805909 860 kubelet.go:2274] node "ip-10-0-132-190.us-east-2.compute.internal" not found
May 23 05:40:32 ip-10-0-132-190 hyperkube[860]: E0523 05:40:32.906176 860 kubelet.go:2274] node "ip-10-0-132-190.us-east-2.compute.internal" not found
May 23 05:40:33 ip-10-0-132-190 hyperkube[860]: E0523 05:40:33.006431 860 kubelet.go:2274] node "ip-10-0-132-190.us-east-2.compute.internal" not found
May 23 05:40:33 ip-10-0-132-190 hyperkube[860]: E0523 05:40:33.106686 860 kubelet.go:2274] node "ip-10-0-132-190.us-east-2.compute.internal" not found
May 23 05:40:33 ip-10-0-132-190 hyperkube[860]: E0523 05:40:33.206957 860 kubelet.go:2274] node "ip-10-0-132-190.us-east-2.compute.internal" not found
May 23 05:40:33 ip-10-0-132-190 hyperkube[860]: E0523 05:40:33.307202 860 kubelet.go:2274] node "ip-10-0-132-190.us-east-2.compute.internal" not found
May 23 05:40:33 ip-10-0-132-190 hyperkube[860]: E0523 05:40:33.407430 860 kubelet.go:2274] node "ip-10-0-132-190.us-east-2.compute.internal" not found
May 23 05:40:33 ip-10-0-132-190 hyperkube[860]: E0523 05:40:33.507677 860 kubelet.go:2274] node "ip-10-0-132-190.us-east-2.compute.internal" not found
May 23 05:40:33 ip-10-0-132-190 hyperkube[860]: E0523 05:40:33.607924 860 kubelet.go:2274] node "ip-10-0-132-190.us-east-2.compute.internal" not found
May 23 05:40:33 ip-10-0-132-190 hyperkube[860]: I0523 05:40:33.703978 860 reflector.go:160] Listing and watching *v1.Service from k8s.io/kubernetes/pkg/kubelet/kubelet.go:444
May 23 05:40:33 ip-10-0-132-190 hyperkube[860]: I0523 05:40:33.705117 860 reflector.go:160] Listing and watching *v1.Pod from k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47
May 23 05:40:33 ip-10-0-132-190 hyperkube[860]: E0523 05:40:33.706065 860 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:444: Failed to list *v1.Service: services is forbidden: User "system:anonymous" cannot list resource "services" in API group "" at the cluster scope
May 23 05:40:33 ip-10-0-132-190 hyperkube[860]: I0523 05:40:33.706168 860 reflector.go:160] Listing and watching *v1.Node from k8s.io/kubernetes/pkg/kubelet/kubelet.go:453
May 23 05:40:33 ip-10-0-132-190 hyperkube[860]: E0523 05:40:33.707014 860 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: pods is forbidden: User "system:anonymous" cannot list resource "pods" in API group "" at the cluster scope
May 23 05:40:33 ip-10-0-132-190 hyperkube[860]: E0523 05:40:33.707907 860 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:453: Failed to list *v1.Node: nodes "ip-10-0-132-190.us-east-2.compute.internal" is forbidden: User "system:anonymous" cannot list resource "nodes" in API group "" at the cluster scope
May 23 05:40:33 ip-10-0-132-190 hyperkube[860]: E0523 05:40:33.708126 860 kubelet.go:2274] node "ip-10-0-132-190.us-east-2.compute.internal" not found
May 23 05:40:33 ip-10-0-132-190 hyperkube[860]: E0523 05:40:33.808371 860 kubelet.go:2274] node "ip-10-0-132-190.us-east-2.compute.internal" not found

Expected results:
3. `oc get node` should show the new host, like before.

Additional info:
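The flood of errors above reduces to two signatures: RBAC denials for "system:anonymous" (the kubelet is still on bootstrap credentials) and "node ... not found" (no node object exists yet in the API). A small sketch to count each signature from a saved log; the `kubelet.log` file name and the two-line sample standing in for the real journal output are assumptions, not from the report:

```shell
# Hypothetical: dump the journal first with
#   journalctl -u kubelet.service > kubelet.log
# Here a two-line sample stands in for the real log.
cat > kubelet.log <<'EOF'
E0523 05:40:32.705775 860 reflector.go:125] Failed to list *v1.Node: nodes "ip-10-0-132-190.us-east-2.compute.internal" is forbidden: User "system:anonymous" cannot list resource "nodes" in API group "" at the cluster scope
E0523 05:40:32.805909 860 kubelet.go:2274] node "ip-10-0-132-190.us-east-2.compute.internal" not found
EOF

# RBAC denials: kubelet is still authenticating anonymously
grep -c 'system:anonymous' kubelet.log
# Missing node object: the kubelet's node has not been registered
grep -c 'not found' kubelet.log
```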
The machine API is not responsible for what actually runs inside the AMI. Given the logs, it seems the kubeconfig is not properly generated/provided. Switching to the installer team to re-evaluate. As far as the machine API is involved in the process, the AWS actuator sends a request to the AWS API to provision an instance from the given AMI and other AWS provider-specific information. The machine API has no visibility past the point where the AWS instance starts running.
I experimented with:
- standing up a cluster, all nodes running
- stopping one node via the AWS console (running -> stopped)

The node goes NotReady. I see the machine-controller create a new backing instance, because the actuator's 'does instance exist' check requires the instance to be in the 'running' state. I then resumed the instance in the AWS console. The existing node transitions back to 'Ready', and the AWS actuator actually deletes the instance that got created but never turned into a node, so we don't appear to leak instances.
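The behavior inferred above can be modeled as a toy sketch (not the actual actuator, which is Go inside the machine-api controllers): an instance counts as existing only while its EC2 state is "running", so a stopped master triggers a replacement.

```shell
# Toy model of the actuator's existence check as described above.
# Any state other than "running" (e.g. "stopped") is treated as
# non-existent, which is why stopping a master spawns a new instance.
instance_exists() {
  [ "$1" = "running" ]
}

reconcile() {
  if instance_exists "$1"; then
    echo "keep existing instance"
  else
    echo "provision replacement"
  fi
}

reconcile running   # → keep existing instance
reconcile stopped   # → provision replacement
```

This also explains the cleanup observed in the experiment: once the original instance returns to "running", the replacement that never became a node is deleted.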
In Xingxing's case I believe DR is being tested. Since we no longer auto-approve client CSRs the way we used to (https://bugzilla.redhat.com/show_bug.cgi?id=1707162), the master recovery docs likely need to be updated with details on how to approve them. Tomas, could you take a look?
(In reply to Brenton Leanhardt from comment #3) > In Xingxing's case I believe DR is being tested. Since we no longer auto > approve client CSRs the way we used to > (https://bugzilla.redhat.com/show_bug.cgi?id=1707162) the master recovery > docs likely need to be updated with details on how to approve them. Tomas, > could you take a look? And if you manually approve the CSRs then the node comes up just fine. Thanks.
Right, I tried approving the CSRs with `oc get csr -ojson | jq -r '.items[] | select(.status == {}) | .metadata.name' | xargs oc adm certificate approve`, and the new host then shows up as Ready in `oc get node`. Converting to a doc bug per comment 3, for the doc mentioned in comment 0.
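The one-liner above can be wrapped into a small loop, since a single pass is often not enough: a new node first submits a client CSR, and only after that is approved does the kubelet submit a serving CSR. This is a sketch under the same assumptions as the command above (cluster-admin `oc` access, `jq` installed); the retry count and sleep interval are arbitrary:

```shell
approve_pending_csrs() {
  # Repeat because approving the client CSR triggers a follow-up
  # serving CSR from the kubelet; stop once nothing is Pending.
  for attempt in 1 2 3; do
    pending=$(oc get csr -o json |
      jq -r '.items[] | select(.status == {}) | .metadata.name')
    [ -z "$pending" ] && break
    echo "$pending" | xargs oc adm certificate approve
    sleep 30
  done
}

# Guard so the sketch is a no-op without a cluster context:
command -v oc >/dev/null 2>&1 && approve_pending_csrs
```

The `select(.status == {})` filter keeps only CSRs whose status is still empty, i.e. neither approved nor denied.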
I've updated the disaster recovery docs PR [1] to add manual steps for approving the CSRs. @Xingxing Xia, could you please try it out? You can preview here [2], the updates were made to step 2. [1] https://github.com/openshift/openshift-docs/pull/14859 [2] http://file.rdu.redhat.com/~ahoffer/2019/disaster-recovery/disaster_recovery/scenario-1-infra-recovery.html
LGTM, thanks
Updates are live: https://docs.openshift.com/container-platform/4.1/disaster_recovery/scenario-1-infra-recovery.html