Bug 2036361

Summary: Kubelet couldn't start up after certificate rotation on AWS
Product: OpenShift Container Platform
Reporter: liqcui
Component: Cloud Compute
Assignee: Joel Speed <jspeed>
Cloud Compute sub component: Other Providers
QA Contact: sunzhaohua <zhsun>
Status: CLOSED NOTABUG
Docs Contact:
Severity: high
Priority: medium
CC: aos-bugs, harpatil
Version: 4.10
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-02-23 11:10:25 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description liqcui 2021-12-31 12:35:31 UTC
Description of problem:

The kubelet couldn't start up after certificate rotation.

Running `systemctl restart kubelet` threw the error below:

Oct 31 12:29:32 ip-10-0-134-182 hyperkube[10938]: W1031 12:29:32.491224   10938 feature_gate.go:223] unrecognized feature gate: LegacyNodeRoleBehavior
Oct 31 12:29:32 ip-10-0-134-182 hyperkube[10938]: W1031 12:29:32.491229   10938 feature_gate.go:223] unrecognized feature gate: NodeDisruptionExclusion
Oct 31 12:29:32 ip-10-0-134-182 hyperkube[10938]: I1031 12:29:32.491234   10938 feature_gate.go:246] feature gates: &{map[APIPriorityAndFairness:true DownwardAPIHugePages:true PodSecurity:true RotateKubeletServerCertificate:true SupportPodPidsLimit:true]}
Oct 31 12:29:32 ip-10-0-134-182 hyperkube[10938]: W1031 12:29:32.491365   10938 plugins.go:132] WARNING: aws built-in cloud provider is now deprecated. The AWS provider is deprecated and will be removed in a future release. Please use https://github.com/kubernetes/cloud-provider-aws
Oct 31 12:29:32 ip-10-0-134-182 hyperkube[10938]: I1031 12:29:32.491643   10938 aws.go:1270] Building AWS cloudprovider
Oct 31 12:29:32 ip-10-0-134-182 hyperkube[10938]: I1031 12:29:32.491707   10938 aws.go:1230] Zone not specified in configuration file; querying AWS metadata service
Oct 31 12:29:32 ip-10-0-134-182 systemd[1]: run-r8684c30c702f435386db156069eafc49.scope: Succeeded.
-- Subject: Unit succeeded
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
-- 
-- The unit run-r8684c30c702f435386db156069eafc49.scope has successfully entered the 'dead' state.
Oct 31 12:29:32 ip-10-0-134-182 systemd[1]: run-r8684c30c702f435386db156069eafc49.scope: Consumed 614us CPU time
-- Subject: Resources consumed by unit runtime
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
-- 
-- The unit run-r8684c30c702f435386db156069eafc49.scope completed and consumed the indicated resources.
Oct 31 12:29:32 ip-10-0-134-182 hyperkube[10938]: E1031 12:29:32.510000   10938 server.go:294] "Failed to run kubelet" err="failed to run Kubelet: could not init cloud provider \"aws\": error finding instance i-018e00e6eba4d5fba: \"error listing AWS instances: \\\"AuthFailure: AWS was not able to validate the provided access credentials\\\\n\\\\tstatus code: 401, request id: a9b4f9ce-65ed-4db3-8442-60e753d11713\\\"\""
Oct 31 12:29:32 ip-10-0-134-182 systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
Oct 31 12:29:32 ip-10-0-134-182 systemd[1]: kubelet.service: Failed with result 'exit-code'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
-- 
-- The unit kubelet.service has entered the 'failed' state with result 'exit-code'.
Oct 31 12:29:32 ip-10-0-134-182 systemd[1]: Failed to start Kubernetes Kubelet.
-- Subject: Unit kubelet.service has failed
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
-- 
-- Unit kubelet.service has failed.
-- 
-- The result is failed.
Oct 31 12:29:32 ip-10-0-134-182 systemd[1]: kubelet.service: Consumed 89ms CPU time
-- Subject: Resources consumed by unit runtime
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
-- 
-- The unit kubelet.service completed and consumed the indicated resources.
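
For reference, the failure can be pulled back out of the journal on the affected node; the filtering below is illustrative and only uses standard journalctl/grep options.
```
# Show kubelet failures from the current boot on the affected master
journalctl -u kubelet.service -b --no-pager | grep -E 'Failed to run kubelet|AuthFailure'
```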

How reproducible:


Steps to Reproduce:
1. Spin up a fresh cluster with OCP 4.10.

2. Set up an NTP server with local stratum 1 on a dedicated node, using the chrony.conf below, and run `systemctl restart chronyd`:
```
driftfile /var/lib/chrony/drift
makestep 1.0 3
allow 10.0.0.0/12
local stratum 1
logdir /var/log/chrony
manual
```
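
To confirm the dedicated node is actually serving time at the configured local stratum before moving on, standard chrony tooling can be used (this check is not part of the original report):
```
# On the dedicated NTP node
sudo systemctl status chronyd --no-pager   # chronyd should be active
sudo chronyc tracking                      # reported stratum / reference source
sudo chronyc clients                       # OCP nodes appear here once step 3 is applied
```
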
3. Apply the MachineConfigs below to point the cluster nodes at the NTP server:
master:
```
# Generated by Butane; do not edit
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 99-master-chrony
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:,pool%20infra-0.anowak4rdu.lab.upshift.rdu2.redhat.com%20iburst%20%0Adriftfile%20%2Fvar%2Flib%2Fchrony%2Fdrift%0Amakestep%201.0%203%0Artcsync%0Alogdir%20%2Fvar%2Flog%2Fchrony%0A
        mode: 420
        overwrite: true
        path: /etc/chrony.conf
```
worker:
```
# Generated by Butane; do not edit
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-chrony
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:,pool%20infra-0.anowak4rdu.lab.upshift.rdu2.redhat.com%20iburst%20%0Adriftfile%20%2Fvar%2Flib%2Fchrony%2Fdrift%0Amakestep%201.0%203%0Artcsync%0Alogdir%20%2Fvar%2Flog%2Fchrony%0A
        mode: 420
        overwrite: true
        path: /etc/chrony.conf
```
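
The MachineConfigs can be applied and rolled out in the usual way; the file names below are examples, and the URL-encoded contents decode to a chrony.conf pointing the nodes at the infra-0 NTP host:
```
# Apply both MachineConfigs (example file names)
oc apply -f 99-master-chrony.yaml
oc apply -f 99-worker-chrony.yaml

# Wait for the machine-config-operator to roll the new /etc/chrony.conf onto all nodes
oc wait mcp/master --for=condition=Updated --timeout=30m
oc wait mcp/worker --for=condition=Updated --timeout=30m
```
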
4. Set the date on the NTP server to a date in the future (the script below jumps the clock 10 months ahead) and restart chronyd on the NTP server and all OCP nodes:
```
newdate=$(date "+%Y-%m-%d %H:%M:%S" -d '10 months')
sudo timedatectl set-time "$newdate"
sleep 2
sudo systemctl restart chronyd
sleep 10
for i in {0..2}
do
   for j in master worker
   do
      ssh -i /home/quicklab/.ssh/quicklab.key -o 'UserKnownHostsFile /dev/null' -o 'StrictHostKeyChecking no' -l quicklab $j-$i.anowakrdu2a.lab.upshift.rdu2.redhat.com sudo systemctl restart chronyd 
   done
done
```
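
A quick sanity check that the jump propagated, reusing the hostnames and key path from the script above (the check itself is not part of the original report):
```
# Each node should report the future date and show chronyd tracking the NTP server
for i in {0..2}; do
   for j in master worker; do
      ssh -i /home/quicklab/.ssh/quicklab.key -o 'UserKnownHostsFile /dev/null' -o 'StrictHostKeyChecking no' \
         -l quicklab $j-$i.anowakrdu2a.lab.upshift.rdu2.redhat.com 'date; sudo chronyc tracking | head -n 4'
   done
done
```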

5. Wait a minute, then restart the kubelet on all master nodes, for example:
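
A loop mirroring the ssh pattern from step 4 (illustrative, not part of the original report):
```
# Restart the kubelet on each master
for i in {0..2}; do
   ssh -i /home/quicklab/.ssh/quicklab.key -o 'UserKnownHostsFile /dev/null' -o 'StrictHostKeyChecking no' \
      -l quicklab master-$i.anowakrdu2a.lab.upshift.rdu2.redhat.com sudo systemctl restart kubelet
done
```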

6. Approve all pending certificates to get the master nodes Ready:
`oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve`
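
Because kubelet client and serving CSRs are issued in stages, the approval usually needs to be repeated until nothing is left pending; a loop such as the following (not from the original report) covers that:
```
# Keep approving until no CSRs remain without a status
while oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | grep -q .; do
   oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve
   sleep 10
done
oc get nodes   # masters should report Ready once the certificates are in place
```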

7. Reboot one of the master nodes, for example:
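
Again reusing the ssh pattern from step 4 (master-0 chosen arbitrarily):
```
ssh -i /home/quicklab/.ssh/quicklab.key -o 'UserKnownHostsFile /dev/null' -o 'StrictHostKeyChecking no' \
   -l quicklab master-0.anowakrdu2a.lab.upshift.rdu2.redhat.com sudo systemctl reboot
```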


Actual results:
The kubelet fails to start up and the OpenShift console cannot be logged in to.

Expected results:
The kubelet should start up successfully and the OpenShift console should be reachable for login.


Additional info:

Comment 4 Sai Ramesh Vanka 2022-01-07 11:27:36 UTC
Hi Liquan,

As @harpatil mentioned, this issue is not related to the kubelet.

The error seems to come from the initialization of the cloud provider, where listing AWS instances failed because AWS could not validate the provided access credentials:

"could not init cloud provider \"aws\": error finding instance i-018e00e6eba4d5fba: \"error listing AWS instances: \\\"AuthFailure: AWS was not able to validate the provided access credentials\\\\n\\\\tstatus code: 401, request id: a9b4f9ce-65ed-4db3-8442-60e753d11713\\\"\""


Thanks,
Ramesh