Bug 1684049
| Summary: | Worker node client certificate rotation fails after it expires | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | weiwei jiang <wjiang> |
| Component: | Master | Assignee: | David Eads <deads> |
| Status: | CLOSED ERRATA | QA Contact: | Xingxing Xia <xxia> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.1.0 | CC: | akrzos, aos-bugs, deads, erich, florin-alexandru.peter, hongkliu, jeder, jiazha, jokerman, juzhao, mifiedle, mmccomas, nelluri, sponnaga, wsun, xtian |
| Target Milestone: | --- | Keywords: | Regression, TestBlocker |
| Target Release: | 4.1.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | aos-scalability-40 | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-06-04 10:44:51 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1671928 | | |
| Attachments: | | | |
Description weiwei jiang 2019-02-28 10:21:43 UTC
To debug, `openshift-must-gather inspect clusteroperators` is very helpful (see https://github.com/openshift/must-gather). It gathers the KAS and KCM configurations and logs and the current operator states. For kubelet cert issues, `openshift-must-gather inspect csr`, the actual certificates being used by the kubelet, and the kubelet logs are necessary. Adding Eric in case he wants to start a KCS article. Adding Seth in case he has other information he would want (this could just as easily arrive as a kubelet issue).

Created attachment 1539548 [details]
cluster info for Comment 4

Created attachment 1539549 [details]
kubeconfig file

Attached kubeconfig file if you need to debug.
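For the kubelet cert issues mentioned above, a minimal way to eyeball the CSR flow and the kubelet's on-disk client certificate might look like the sketch below. This is not taken from the report; the `<csr_name>` and `<node_name>` placeholders and the certificate path are assumptions about a typical 4.x node, and it assumes `oc debug node/...` is available in this release.

```
# List certificate signing requests; working client-cert rotation should keep
# producing CSRs that end up Approved,Issued
oc get csr

# Manually approve a pending kubelet CSR if the automatic approver is stuck
oc adm certificate approve <csr_name>

# On a node, check the validity window of the current kubelet client certificate
# (path assumed to be the usual kubelet location)
oc debug node/<node_name> -- chroot /host \
  openssl x509 -noout -dates -in /var/lib/kubelet/pki/kubelet-client-current.pem
```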
At least one problem in your cluster was solved by https://github.com/openshift/origin/pull/22178:

```
F0228 18:18:29.430158       1 hooks.go:188] PostStartHook "crd-discovery-available" failed: unable to retrieve the complete list of server APIs: packages.apps.redhat.com/v1alpha1: Unauthorized
```

This is preventing future rollouts.

Created attachment 1539676 [details]
cluster operator information

Created attachment 1539707 [details]
cluster operator information when worker nodes are in NotReady status
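Since the fatal error points at the aggregated `packages.apps.redhat.com/v1alpha1` group, one rough way to narrow such failures down (not part of this report) is to look at the aggregated APIService objects and the pods backing them. The APIService name below is inferred from the group/version in the log, not confirmed elsewhere here:

```
# List aggregated API services and their availability
oc get apiservices

# Inspect the package-server APIService the log reports as Unauthorized
# (name assumed from the group/version in the error message)
oc describe apiservice v1alpha1.packages.apps.redhat.com

# Check the pods backing that service in the OLM namespace
oc get pods -n openshift-operator-lifecycle-manager
```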
Adding the Regression keyword; the worker nodes could be in Running status with our previous payload, such as 4.0.0-0.nightly-2019-02-26-125216.

(In reply to Junqi Zhao from comment #14)
> add Regression keyword, the worker nodes could be in Running status with our
> previous payload, such as 4.0.0-0.nightly-2019-02-26-125216

I mean worker nodes could be in Running status for more than a day or even longer.

Just to clarify: comment 9 indicated that https://github.com/openshift/origin/pull/22178 is necessary for the cluster to survive having aggregated apiservers like metrics and package server going down. I don't know why those two are problematic and not the kube-apiserver, but that pull adds toleration for downed elements. Without that, revisions will fail and certs will expire. Your cluster gives you about 8 hours warning before death right now. Alerts should be firing because the clusteroperator is failing. If you can reproduce after 22178, we may have something new to look at.

The latest payload 4.0.0-0.nightly-2019-03-04-033148 doesn't merge the above PR in, and thus the env set up with it can reproduce the issue, so wait for a build that includes the fix PR.

```
$ oc get no -w   # watch nodes
...
ip-10-0-170-155.us-east-2.compute.internal   Ready      worker   5h6m   v1.12.4+4dd65df23d
ip-10-0-170-155.us-east-2.compute.internal   NotReady   worker   5h7m   v1.12.4+4dd65df23d
...
$ oc get po -n openshift-kube-apiserver
kube-apiserver-ip-10-0-143-237.us-east-2.compute.internal   1/1   Running            0    96m
kube-apiserver-ip-10-0-151-159.us-east-2.compute.internal   0/1   CrashLoopBackOff   19   77m
kube-apiserver-ip-10-0-162-22.us-east-2.compute.internal    1/1   Running            0    97m
```

(In reply to Xingxing Xia from comment #19)
> The latest payload 4.0.0-0.nightly-2019-03-04-033148 doesn't merge above PR
> in, and thus the env set up with it can reproduce it, so wait for build that
> includes the fix PR
> $ oc get po -n openshift-kube-apiserver
> kube-apiserver-ip-10-0-143-237.us-east-2.compute.internal   1/1   Running            0    96m
> kube-apiserver-ip-10-0-151-159.us-east-2.compute.internal   0/1   CrashLoopBackOff   19   77m
> kube-apiserver-ip-10-0-162-22.us-east-2.compute.internal    1/1   Running            0    97m

```
$ oc -n openshift-kube-apiserver describe pod kube-apiserver-ip-10-0-151-159.us-east-2.compute.internal
E0304 14:11:32.058871       1 reflector.go:134] github.com/openshift/client-go/user/informers/externalversions/factory.go:101: Failed to list *v1.Group: the server could not find the requested resource (get groups.user.openshift.io)
I0304 14:11:32.068426       1 cache.go:39] Caches are synced for AvailableConditionController controller
I0304 14:11:32.069364       1 cache.go:39] Caches are synced for autoregister controller
I0304 14:11:32.152218       1 cache.go:39] Caches are synced for APIServiceRegistrationController controller
I0304 14:11:32.258916       1 controller_utils.go:1034] Caches are synced for crd-autoregister controller
E0304 14:11:32.391394       1 autoregister_controller.go:190] v1beta1.cloudcredential.openshift.io failed with : apiservices.apiregistration.k8s.io "v1beta1.cloudcredential.openshift.io" already exists
E0304 14:11:32.391653       1 autoregister_controller.go:190] v1beta1.machine.openshift.io failed with : apiservices.apiregistration.k8s.io "v1beta1.machine.openshift.io" already exists
E0304 14:11:32.463740       1 autoregister_controller.go:190] v1alpha1.marketplace.redhat.com failed with : apiservices.apiregistration.k8s.io "v1alpha1.marketplace.redhat.com" already exists
W0304 14:11:32.470201       1 lease.go:226] Resetting endpoints for master service "kubernetes" to [10.0.137.255 10.0.144.117 10.0.169.231]
E0304 14:11:32.534603       1 memcache.go:140] couldn't get resource list for packages.apps.redhat.com/v1alpha1: Unauthorized
E0304 14:11:32.657012       1 memcache.go:140] couldn't get resource list for packages.apps.redhat.com/v1alpha1: Unauthorized
I0304 14:11:32.860646       1 storage_scheduling.go:100] all system priority classes are created successfully or already exist.
F0304 14:11:32.929751       1 hooks.go:188] PostStartHook "crd-discovery-available" failed: unable to retrieve the complete list of server APIs: packages.apps.redhat.com/v1alpha1: Unauthorized
```

The problem does not seem to be occurring on 4.0.0-0.nightly-2019-03-04-095138:

- kube-apiserver pods are stable
- no crashloop
- describe on the kube-apiserver pods shows no error messages
- after 3 hours, nodes are still Ready
- oc get clusteroperator does show that kube-scheduler, kube-controller-manager and openshift-apiserver had an issue; investigating the logs:

```
# oc get clusteroperators
NAME                                   VERSION   AVAILABLE   PROGRESSING   FAILING   SINCE
cluster-autoscaler                               True        False         False     152m
cluster-storage-operator                         True        False         False     150m
console                                          True        False         False     144m
dns                                              True        False         False     165m
image-registry                                   True        False         False     149m
ingress                                          True        False         False     145m
kube-apiserver                                   True        False         False     162m
kube-controller-manager                          True        False         False     35m
kube-scheduler                                   True        False         False     34m
machine-api                                      True        False         False     153m
machine-config                                   True        False         False     35m
marketplace-operator                             True        False         False     150m
monitoring                                       True        False         False     143m
network                                          True        False         False     145m
node-tuning                                      True        False         False     145m
openshift-apiserver                              True        False         False     34m
openshift-authentication                         True        False         False     154m
openshift-cloud-credential-operator              True        False         False     153m
openshift-controller-manager                     True        False         False     150m
openshift-samples                                True        False         False     149m
operator-lifecycle-manager                       True        False         False     152m
```

We will continue soaking. This is on 4.0.0-0.nightly-2019-03-04-095138.
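A soak check like the one just described could be scripted roughly as follows; the 10-minute interval and the `apiserver` label selector (which appears later in this report) are choices of this sketch, not prescribed anywhere:

```
# Periodically record node readiness, kube-apiserver pod health and operator status
while true; do
  date
  oc get nodes
  oc get pods -n openshift-kube-apiserver -l apiserver
  oc get clusteroperators
  sleep 600   # interval is arbitrary
done
```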
Comment 21 is incorrect; after about 5 hours, openshift-kube-apiserver started crashlooping:

```
kube-apiserver-ip-10-0-129-56.us-east-2.compute.internal    1/1   Running            0    88m
kube-apiserver-ip-10-0-155-119.us-east-2.compute.internal   0/1   CrashLoopBackOff   17   68m
kube-apiserver-ip-10-0-171-110.us-east-2.compute.internal   1/1   Running            0    89m
```

```
Last State:  Terminated
  Reason:    Error
  Message:   34] github.com/openshift/client-go/security/informers/externalversions/factory.go:101: Failed to list *v1.SecurityContextConstraints: the server could not find the requested resource (get securitycontextconstraints.security.openshift.io)
E0304 17:18:04.525810       1 handler_proxy.go:131] error resolving openshift-apiserver/api: service "api" not found
E0304 17:18:04.526366       1 handler_proxy.go:131] error resolving openshift-apiserver/api: service "api" not found
E0304 17:18:04.526795       1 handler_proxy.go:131] error resolving openshift-apiserver/api: service "api" not found
E0304 17:18:04.527313       1 handler_proxy.go:131] error resolving openshift-apiserver/api: service "api" not found
E0304 17:18:04.527745       1 handler_proxy.go:131] error resolving openshift-apiserver/api: service "api" not found
E0304 17:18:04.528458       1 handler_proxy.go:131] error resolving openshift-apiserver/api: service "api" not found
E0304 17:18:04.529128       1 handler_proxy.go:131] error resolving openshift-apiserver/api: service "api" not found
E0304 17:18:04.529612       1 handler_proxy.go:131] error resolving openshift-apiserver/api: service "api" not found
E0304 17:18:04.530116       1 handler_proxy.go:131] error resolving openshift-apiserver/api: service "api" not found
E0304 17:18:04.530649       1 handler_proxy.go:131] error resolving openshift-operator-lifecycle-manager/v1alpha1-packages-apps-redhat-com: service "v1alpha1-packages-apps-redhat-com" not found
I0304 17:18:04.627336       1 cache.go:39] Caches are synced for AvailableConditionController controller
W0304 17:18:04.738029       1 lease.go:226] Resetting endpoints for master service "kubernetes" to [10.0.129.56 10.0.155.119 10.0.171.110]
E0304 17:18:04.880644       1 memcache.go:140] couldn't get resource list for packages.apps.redhat.com/v1alpha1: Unauthorized
F0304 17:18:05.004825       1 hooks.go:188] PostStartHook "crd-discovery-available" failed: unable to retrieve the complete list of server APIs: packages.apps.redhat.com/v1alpha1: Unauthorized
```

Verified in 4.0.0-0.nightly-2019-03-04-234414 envs; the envs didn't hit the bug again.

```
$ oc get no -w   # didn't see a NotReady one
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-129-108.us-east-2.compute.internal   Ready    master   7h10m   v1.12.4+5dc94f3fda
ip-10-0-131-97.us-east-2.compute.internal    Ready    worker   6h48m   v1.12.4+5dc94f3fda
ip-10-0-144-198.us-east-2.compute.internal   Ready    worker   6h49m   v1.12.4+5dc94f3fda
ip-10-0-156-137.us-east-2.compute.internal   Ready    master   7h10m   v1.12.4+5dc94f3fda
ip-10-0-160-55.us-east-2.compute.internal    Ready    master   7h10m   v1.12.4+5dc94f3fda
ip-10-0-172-102.us-east-2.compute.internal   Ready    worker   6h49m   v1.12.4+5dc94f3fda

$ oc get po -n openshift-kube-apiserver -w -l apiserver   # didn't see a crashing one
NAME                                                        READY   STATUS    RESTARTS   AGE
kube-apiserver-ip-10-0-129-108.us-east-2.compute.internal   1/1     Running   2          59m
kube-apiserver-ip-10-0-156-137.us-east-2.compute.internal   1/1     Running   2          60m
kube-apiserver-ip-10-0-160-55.us-east-2.compute.internal    1/1     Running   2          61m
```

We have 3 clusters up for 13h, 15h and 22h and all are healthy:

- All nodes Ready
- no clusteroperators failing
- no crashlooping kube-apiserver pods

Marking this verified.
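As an extra spot check during this kind of verification (again, not part of the original report), one could also confirm that the kube-apiserver clusteroperator reports no failing condition and that new kubelet client CSRs keep being issued, which is what working certificate rotation should produce over time:

```
# The kube-apiserver clusteroperator should not report a failing condition
oc get clusteroperator kube-apiserver -o yaml

# New kubelet client CSRs should keep appearing (and be Approved,Issued)
# over the cluster's lifetime if rotation is still being exercised
oc get csr --sort-by=.metadata.creationTimestamp
```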
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758