Bug 1940950
| Summary: | vsphere: client/bootstrap CSR double create | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Michael Gugino <mgugino> |
| Component: | Node | Assignee: | Harshal Patil <harpatil> |
| Node sub component: | Kubelet | QA Contact: | MinLi <minmli> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | unspecified | CC: | aos-bugs, minmli, tsweeney, zhsun |
| Version: | 4.8 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1940899 | | |
| Clones: | 1943145 (view as bug list) | Environment: | |
| Last Closed: | 2021-07-27 22:54:33 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1943145 | | |
Description
Michael Gugino
2021-03-19 15:43:47 UTC
TL/DR: each node is issuing 2 client certs within seconds of each other; this is unrelated to the CSR approval (which is also buggy on vsphere).

Looking at the journal from an affected worker node, ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-vsphere/1372795001506893824/artifacts/e2e-vsphere/gather-extra/artifacts/nodes/ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d/journal

It appears that the kubelet started before crio, then died, then crio started, and then everything worked fine. I think it's expected for the kubelet to create a CSR whenever it's in bootstrap mode and it starts, so that's fine. The bug is (probably?): why isn't the kubelet starting after CRI-O?

--snip--
Mar 19 06:39:13.993551 ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d hyperkube[1739]: I0319 06:39:13.993524 1739 kubelet.go:453] Kubelet client is not nil
Mar 19 06:39:13.993922 ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d hyperkube[1739]: I0319 06:39:13.993901 1739 reflector.go:219] Starting reflector *v1.Service (0s) from k8s.io/client-go/informers/factory.go:134
Mar 19 06:39:13.994790 ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d hyperkube[1739]: I0319 06:39:13.994750 1739 reflector.go:219] Starting reflector *v1.Node (0s) from k8s.io/client-go/informers/factory.go:134
Mar 19 06:39:13.998086 ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d hyperkube[1739]: E0319 06:39:13.998006 1739 remote_runtime.go:86] Version from runtime service failed: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: no such file or directory"
Mar 19 06:39:13.998227 ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d hyperkube[1739]: E0319 06:39:13.998193 1739 kuberuntime_manager.go:205] Get runtime version failed: get remote runtime typed version failed: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: no such file or directory"
Mar 19 06:39:13.998227 ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d hyperkube[1739]: F0319 06:39:13.998221 1739 server.go:269] failed to run Kubelet: failed to create kubelet: get remote runtime typed version failed: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: no such file or directory"
--snip--
Mar 19 06:39:24.100515 ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d systemd[1]: Stopped Kubernetes Kubelet.
Mar 19 06:39:24.100665 ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d systemd[1]: kubelet.service: Consumed 0 CPU time
Mar 19 06:39:24.102432 ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d systemd[1]: Starting CRI-O Auto Update Script...
Mar 19 06:39:24.191706 ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d crio[1754]: time="2021-03-19T06:39:24Z" level=info msg="Starting CRI-O, version: 1.21.0-29.rhaos4.8.git4fff699.el8-dev, git: ()"
Mar 19 06:39:24.197448 ci-op-5dk64ln5-36a8b-9zsjg-worker-mcl4d crio[1754]: time="2021-03-19 06:39:24.197388576Z" level=info msg="File /var/lib/crio/clean.shutdown not found. Wiping storage directory /var/lib/containers/storage because of suspected dirty shutdown"
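The ordering question above is normally answered with systemd unit ordering. As a minimal sketch only, not the actual OpenShift fix (the real kubelet.service and crio.service units on RHCOS are managed by the machine-config-operator, and the drop-in path here is assumed for illustration), a drop-in like the following would keep the kubelet from racing CRI-O:

```sh
# Hypothetical drop-in; unit and file names are assumptions, not the shipped fix.
mkdir -p /etc/systemd/system/kubelet.service.d
cat <<'EOF' > /etc/systemd/system/kubelet.service.d/10-after-crio.conf
[Unit]
# Wants= pulls crio.service into the same start transaction;
# After= delays kubelet startup until crio.service has started.
Wants=crio.service
After=crio.service
EOF
systemctl daemon-reload
```

Without such ordering the kubelet dials /var/run/crio/crio.sock before the socket exists, dies (as in the first excerpt), and issues a second bootstrap CSR when systemd restarts it.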
Comment

I don't see a similar error in recent 4.8 jobs:

https://search.ci.openshift.org/?search=crio-wipe.service%3A+Main+process+exited&maxAge=336h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

I also can't find errors with the following keywords:

- Failed to wipe storage cleanly
- Failed to shutdown storage before wiping
- crio-wipe.service: Failed with result
- crio.service: Job crio.service/start failed with result 'dependency'

Comment

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
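For anyone checking a cluster for the double-create symptom described above, a minimal sketch using standard oc output options (kubelet client CSRs use the kubernetes.io/kube-apiserver-client-kubelet signer; the column names are chosen here for illustration):

```sh
# List CSRs oldest-first; an affected node shows two client CSRs
# created within seconds of each other.
oc get csr --sort-by=.metadata.creationTimestamp \
  -o custom-columns=NAME:.metadata.name,SIGNER:.spec.signerName,REQUESTOR:.spec.username,CREATED:.metadata.creationTimestamp
```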