Bug 1940844
| Summary: | Authentication operator is degraded during 4.5 to 4.6 | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | pmali |
| Component: | Networking | Assignee: | Jacob Tanenbaum <jtanenba> |
| Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | aconstan, anbhat, aos-bugs, astoycos, jtanenba, kewang, mfojtik, slaznick, trozet, vpickard |
| Version: | 4.6 | Keywords: | Reopened, Upgrades |
| Target Milestone: | --- | | |
| Target Release: | 4.8.0 | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-11-11 14:17:04 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
|
Description
pmali
2021-03-19 10:39:18 UTC
Does the authentication operator stay in the degraded state? Yes, it does.

Interesting: it looks like only one of the three oauth-apiserver pods is failing to connect to the kube-apiserver:

2021-03-16T11:11:23.06087827Z Error: unable to load configmap based request-header-client-ca-file: Get "https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication?timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Then again, each pod should appear on a different node. Moving to Networking to have a look. Please set blocker status.

I somehow failed to notice the needinfo notification on me.
> I am guessing there is a retry mechanism in place for the oauth-apiserver container to try loading the config map again?
No, there is none in this specific case. But the pod restarted 12 times, so whatever needed to be created for it should have been created by now; if it hasn't, that would be a bug.
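For context, the kind of retry mechanism being asked about can be sketched generically. The snippet below is an illustration only, not the oauth-apiserver's actual code (which, per the comment above, has no such retry for this path); the names `load_with_retry` and `flaky_get` are invented for the example.

```python
import time

def load_with_retry(fetch, attempts=5, base_delay=0.5):
    """Retry a flaky fetch with exponential backoff.

    `fetch` stands in for the configmap GET that times out in this bug;
    this is a generic sketch, not oauth-apiserver code.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except TimeoutError:
            if attempt == attempts - 1:
                raise
            # Back off 0.5s, 1s, 2s, ... before the next try.
            time.sleep(base_delay * 2 ** attempt)

# Simulate a request that times out twice, then succeeds.
calls = {"n": 0}
def flaky_get():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("Client.Timeout exceeded while awaiting headers")
    return {"requestheader-client-ca-file": "<PEM data>"}

result = load_with_retry(flaky_get, base_delay=0.01)
print(calls["n"], sorted(result))
```

In the actual bug, though, the container crash-loops instead of retrying in place, so the kubelet's CrashLoopBackOff effectively supplies the backoff.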
Reproduced the bug with the upgrade profile below, so the bug has been reopened.
OCP IPI install on AWS.
Upgrade path: 4.1.41-x86_64 -> 4.2.36-x86_64 -> 4.3.40-x86_64 -> 4.4.33-x86_64 -> 4.5.41-x86_64 -> 4.6.49-x86_64
The upgrade was stuck at the phase where the cluster operator authentication is degraded:
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.5.41 True True 3h1m Unable to apply 4.6.49: the cluster operator authentication is degraded
$ oc get node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-10-0-132-253.us-east-2.compute.internal Ready master 9h v1.18.3+d8ef5ad 10.0.132.253 <none> Red Hat Enterprise Linux CoreOS 45.82.202106211530-0 (Ootpa) 4.18.0-193.56.1.el8_2.x86_64 cri-o://1.18.4-11.rhaos4.5.gitfa57051.el8
ip-10-0-139-165.us-east-2.compute.internal Ready worker 9h v1.18.3+d8ef5ad 10.0.139.165 <none> Red Hat Enterprise Linux CoreOS 45.82.202106211530-0 (Ootpa) 4.18.0-193.56.1.el8_2.x86_64 cri-o://1.18.4-11.rhaos4.5.gitfa57051.el8
ip-10-0-148-214.us-east-2.compute.internal Ready master 9h v1.18.3+d8ef5ad 10.0.148.214 <none> Red Hat Enterprise Linux CoreOS 45.82.202106211530-0 (Ootpa) 4.18.0-193.56.1.el8_2.x86_64 cri-o://1.18.4-11.rhaos4.5.gitfa57051.el8
ip-10-0-154-164.us-east-2.compute.internal Ready worker 9h v1.18.3+d8ef5ad 10.0.154.164 <none> Red Hat Enterprise Linux CoreOS 45.82.202106211530-0 (Ootpa) 4.18.0-193.56.1.el8_2.x86_64 cri-o://1.18.4-11.rhaos4.5.gitfa57051.el8
ip-10-0-161-60.us-east-2.compute.internal Ready worker 9h v1.18.3+d8ef5ad 10.0.161.60 <none> Red Hat Enterprise Linux CoreOS 45.82.202106211530-0 (Ootpa) 4.18.0-193.56.1.el8_2.x86_64 cri-o://1.18.4-11.rhaos4.5.gitfa57051.el8
ip-10-0-171-220.us-east-2.compute.internal Ready master 9h v1.18.3+d8ef5ad 10.0.171.220 <none> Red Hat Enterprise Linux CoreOS 45.82.202106211530-0 (Ootpa) 4.18.0-193.56.1.el8_2.x86_64 cri-o://1.18.4-11.rhaos4.5.gitfa57051.el8
$ oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication 4.6.49 True False True 145m
cloud-credential 4.6.49 True False False 9h
cluster-autoscaler 4.6.49 True False False 9h
config-operator 4.6.49 True False False 3h55m
console 4.6.49 True False False 144m
csi-snapshot-controller 4.6.49 True False False 146m
dns 4.5.41 True False False 9h
etcd 4.6.49 True False False 5h12m
image-registry 4.6.49 True False False 3h17m
ingress 4.6.49 True False False 157m
insights 4.6.49 True False False 6h49m
kube-apiserver 4.6.49 True False False 9h
kube-controller-manager 4.6.49 True False False 5h9m
kube-scheduler 4.6.49 True False False 9h
kube-storage-version-migrator 4.6.49 True False False 3h17m
machine-api 4.6.49 True False False 9h
machine-approver 4.6.49 True False False 3h40m
machine-config 4.5.41 True False False 4h24m
marketplace 4.6.49 True False False 147m
monitoring 4.6.49 True False False 144m
network 4.6.49 True False False 9h
node-tuning 4.6.49 True False False 157m
openshift-apiserver 4.6.49 True False False 4h38m
openshift-controller-manager 4.6.49 True False False 155m
openshift-samples 4.6.49 True False False 156m
operator-lifecycle-manager 4.6.49 True False False 9h
operator-lifecycle-manager-catalog 4.6.49 True False False 9h
operator-lifecycle-manager-packageserver 4.6.49 True False False 146m
service-ca 4.6.49 True False False 9h
service-catalog-apiserver 4.4.33 True False False 3h15m
service-catalog-controller-manager 4.4.33 True False False 5h12m
storage 4.6.49 True False False 147m
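Scanning a table like the one above by hand is error-prone; a minimal sketch of filtering it for degraded operators, assuming the default `oc get co` tabular output (a trimmed sample is inlined here in place of a live cluster):

```python
# Trimmed sample of `oc get co` output from the report above.
table = """\
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication 4.6.49 True False True 145m
cloud-credential 4.6.49 True False False 9h
dns 4.5.41 True False False 9h
machine-config 4.5.41 True False False 4h24m
"""

def degraded_operators(text):
    """Return names of cluster operators whose DEGRADED column is True."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    idx = header.index("DEGRADED")
    return [row.split()[0] for row in lines[1:] if row.split()[idx] == "True"]

print(degraded_operators(table))
```

On a live cluster the same result comes from piping `oc get co` into such a filter; here only `authentication` matches, which is consistent with the clusterversion message above.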
$ oc describe co/authentication
Name: authentication
Namespace:
Labels: <none>
Annotations: <none>
API Version: config.openshift.io/v1
Kind: ClusterOperator
Metadata:
Creation Timestamp: 2021-11-05T18:01:09Z
Generation: 1
Resource Version: 396460
Self Link: /apis/config.openshift.io/v1/clusteroperators/authentication
UID: 5b0a6128-3e62-11ec-b967-025dae34a298
Spec:
Status:
Conditions:
Last Transition Time: 2021-11-06T01:23:03Z
Message: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver (crashlooping container is waiting in apiserver-754bcf9c6d-8cvms pod)
Reason: APIServerDeployment_UnavailablePod
Status: True
Type: Degraded
Last Transition Time: 2021-11-06T01:21:06Z
Reason: AsExpected
Status: False
Type: Progressing
Last Transition Time: 2021-11-06T01:21:32Z
Message: OAuthServerDeploymentAvailable: availableReplicas==2
Reason: AsExpected
Status: True
Type: Available
Last Transition Time: 2021-11-05T18:01:09Z
Reason: AsExpected
Status: True
Type: Upgradeable
Extension: <nil>
...
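Conditions like the ones above can also be pulled out programmatically from `oc get co/authentication -o json`; below is a minimal sketch run against a trimmed-down copy of the status shown above (the helper name `condition` is invented for the example):

```python
import json

# Trimmed-down ClusterOperator status from the describe output above.
status_json = """
{
  "status": {
    "conditions": [
      {"type": "Degraded", "status": "True",
       "reason": "APIServerDeployment_UnavailablePod",
       "message": "APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable"},
      {"type": "Progressing", "status": "False", "reason": "AsExpected"},
      {"type": "Available", "status": "True", "reason": "AsExpected"},
      {"type": "Upgradeable", "status": "True", "reason": "AsExpected"}
    ]
  }
}
"""

def condition(co, cond_type):
    """Return the condition dict with the given type, or None."""
    for c in co["status"]["conditions"]:
        if c["type"] == cond_type:
            return c
    return None

co = json.loads(status_json)
degraded = condition(co, "Degraded")
print(degraded["reason"])
```

Note that the operator is Available=True and Degraded=True at the same time: two of the three apiserver replicas are serving while one crash-loops.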
Checked the apiserver-754bcf9c6d-8cvms pod logs from the must-gather:
2021-11-06T03:44:42.740022522Z Error: unable to load configmap based request-header-client-ca-file: Get "https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication?timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Let's see what happened at the same time in the kube-apiserver and kubelet service logs:
$ grep -nr '03:44:42' | grep -E 'kube-apiserver|kubelet_service' | grep -E 'E[0-9]{4}'
namespaces/openshift-kube-apiserver/pods/kube-apiserver-ip-10-0-148-214.us-east-2.compute.internal/kube-apiserver-check-endpoints/kube-apiserver-check-endpoints/logs/current.log:681:2021-11-06T03:44:42.353051845Z E1106 03:44:42.353002 1 reflector.go:127] k8s.io/client-go.0/tools/cache/reflector.go:156: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps "extension-apiserver-authentication" is forbidden: User "system:serviceaccount:openshift-kube-apiserver:check-endpoints" cannot list resource "configmaps" in API group "" in the namespace "kube-system"
host_service_logs/masters/kubelet_service.log:1075267:Nov 06 03:44:42.932216 ip-10-0-148-214 hyperkube[1390]: E1106 03:44:42.931594 1390 pod_workers.go:191] Error syncing pod 80b56c0b-60dd-45aa-a10c-cd5df2bc0544 ("apiserver-754bcf9c6d-8cvms_openshift-oauth-apiserver(80b56c0b-60dd-45aa-a10c-cd5df2bc0544)"), skipping: failed to "StartContainer" for "oauth-apiserver" with CrashLoopBackOff: "back-off 5m0s restarting failed container=oauth-apiserver pod=apiserver-754bcf9c6d-8cvms_openshift-oauth-apiserver(80b56c0b-60dd-45aa-a10c-cd5df2bc0544)"
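Correlating entries like the two above by timestamp is easier once the klog header is decoded; a small sketch that parses the `E1106 03:44:42.353002 ...` style prefix (severity letter, MMDD, time, PID) from a log line:

```python
import re

# klog header: severity letter, MMDD, HH:MM:SS.micros, then the PID.
KLOG_RE = re.compile(r"([IWEF])(\d{2})(\d{2}) (\d{2}:\d{2}:\d{2}\.\d+)\s+\d+")

def parse_klog(line):
    """Extract severity, month, day, and time from a klog-formatted line."""
    m = KLOG_RE.search(line)
    if not m:
        return None
    sev, month, day, clock = m.groups()
    return {"severity": sev, "month": int(month), "day": int(day), "time": clock}

# Shortened copy of the check-endpoints error from the grep output above.
line = ('E1106 03:44:42.353002       1 reflector.go:127] '
        'Failed to watch *v1.ConfigMap: ...')
print(parse_klog(line))
```

Here both the check-endpoints RBAC error and the kubelet's CrashLoopBackOff land at 03:44:42 on Nov 06, matching the oauth-apiserver container's configmap timeout.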