Bug 2082271
| Summary: | Degraded DaemonSets Behavior on ovn with RHCOS based on RHEL 8.6 | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Devan Goodwin <dgoodwin> |
| Component: | Networking | Assignee: | Tim Rozet <trozet> |
| Networking sub component: | ovn-kubernetes | QA Contact: | Anurag saxena <anusaxen> |
| Status: | CLOSED CURRENTRELEASE | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | akaris, dcbw, dornelas, dwalsh, ffernand, miabbott, mitr, rphillips, tsweeney |
| Version: | 4.11 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-06-22 18:33:24 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
**Description** (Devan Goodwin, 2022-05-05 17:17:53 UTC)
As an update: we are working with the RHCOS team to reproduce this out of tree so that we can work together to get to the root cause without breaking OpenShift CI. This is under active investigation in #wg-4_11-triage on Slack.

Looking at these logs, the delay in bringing up the DaemonSet pod is caused by ovnkube-node itself being delayed in starting. We can see that ovnkube-node binds the port right after it starts, with no creation error:
ovnkube-node starts at: 2022-04-29T20:55:45.049903075Z
2022-04-29T20:56:04.212732223Z I0429 20:56:04.212673 13216 cni.go:227] [e2e-k8s-sig-apps-daemonset-upgrade-347/ds1-7gbwt 304ef75a5bf15c520c93eb04670fb08969e2b6a693bcfabec38f6fc88d76d615] ADD starting CNI request [e2e-k8s-sig-apps-daemonset-upgrade-347/ds1-7gbwt 304ef75a5bf15c520c93eb04670fb08969e2b6a693bcfabec38f6fc88d76d615]
2022-04-29T20:56:04.277373950Z I0429 20:56:04.277326 13216 helper_linux.go:257] ConfigureOVS: namespace: e2e-k8s-sig-apps-daemonset-upgrade-347, podName: ds1-7gbwt
2022-04-29T20:56:05.788551703Z I0429 20:56:05.788465 13216 cni.go:248] [e2e-k8s-sig-apps-daemonset-upgrade-347/ds1-7gbwt 304ef75a5bf15c520c93eb04670fb08969e2b6a693bcfabec38f6fc88d76d615] ADD finished CNI request [e2e-k8s-sig-apps-daemonset-upgrade-347/ds1-7gbwt 304ef75a5bf15c520c93eb04670fb08969e2b6a693bcfabec38f6fc88d76d615], result "{\"interfaces\":[{\"name\":\"304ef75a5bf15c5\",\"mac\":\"26:4e:ee:65:a2:45\"},{\"name\":\"eth0\",\"mac\":\"0a:58:0a:81:02:0c\",\"sandbox\":\"/var/run/netns/1c3ba734-90dc-468b-a15a-a9a21abc4bd6\"}],\"ips\":[{\"version\":\"4\",\"interface\":1,\"address\":\"10.129.2.12/23\",\"gateway\":\"10.129.2.1\"}],\"dns\":{}}", err <nil>
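
As a side note, the gap between the ovnkube-node start time and the CNI ADD for the daemonset pod can be measured straight from the gathered logs. A minimal sketch, assuming GNU date; the log path below is a hypothetical must-gather layout, not one taken from this report:

```bash
#!/usr/bin/env bash
# Sketch: measure the gap between ovnkube-node starting and the first CNI ADD
# request for the daemonset pod. LOG is a hypothetical must-gather path; adjust
# it to wherever the ovnkube-node container log actually lives in your gather.
LOG="must-gather/namespaces/openshift-ovn-kubernetes/pods/ovnkube-node-qlg9k/ovnkube-node/logs/current.log"

start="2022-04-29T20:55:45.049903075Z"   # ovnkube-node start time, from above
add=$(grep -F 'ADD starting CNI request [e2e-k8s-sig-apps-daemonset-upgrade-347/ds1-7gbwt' "$LOG" \
        | head -1 | awk '{print $1}')    # timestamp of the first matching CNI ADD

# GNU date parses RFC 3339 timestamps, so the delay in seconds is just:
echo "delay: $(( $(date -d "$add" +%s) - $(date -d "$start" +%s) ))s"
```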
The overall delay in the pod starting comes from ovnkube-node starting late: kubelet attempts to pull the ovnkube-node image for about 16 minutes and then fails. Looking at the events:
| Time | Namespace | Source | Object | Reason | Message |
|---|---|---|---|---|---|
| 20:36:13 | openshift-ovn-kubernetes | node-controller | ovnkube-node-qlg9k | NodeNotReady | Node is not ready |
| 20:39:27 | openshift-ovn-kubernetes | kubelet | ovnkube-node-qlg9k | Pulling | Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:de7b14f5ba26c7d807bc140dd61fc9dd9549045b4e225ac67e1b4ea44ed72f3e" |
| 20:55:44 | openshift-ovn-kubernetes | kubelet | ovnkube-node-qlg9k | Failed | Error: ErrImagePull |
| 20:55:44 | openshift-ovn-kubernetes | kubelet | ovnkube-node-qlg9k | Pulled | Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:de7b14f5ba26c7d807bc140dd61fc9dd9549045b4e225ac67e1b4ea44ed72f3e" already present on machine |
| 20:55:44 | openshift-ovn-kubernetes | kubelet | ovnkube-node-qlg9k | Failed | Failed to pull image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:de7b14f5ba26c7d807bc140dd61fc9dd9549045b4e225ac67e1b4ea44ed72f3e": rpc error: code = Unknown desc = can't talk to a V1 container registry |
| 20:55:45 | openshift-ovn-kubernetes | kubelet | ovnkube-node-qlg9k | Created | Created container ovnkube-node |
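
For anyone re-checking this on a live cluster, the same per-pod event stream can be listed with standard oc/kubectl field selectors (pod name taken from above); a minimal sketch:

```bash
# List the events recorded against the ovnkube-node pod, oldest first, so the
# NodeNotReady -> Pulling -> Failed -> Created sequence reads top to bottom.
oc get events -n openshift-ovn-kubernetes \
  --field-selector involvedObject.name=ovnkube-node-qlg9k \
  --sort-by=.lastTimestamp
```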
Going to look through the journal to see if there is any obvious reason why the image pull was a problem.
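
One quick way to make that pass over the node journal is to filter the kubelet and CRI-O units for pull-related lines in the window above. A rough sketch, assuming shell access to the node or to a captured journal file; the time bounds are taken from the events and may need adjusting for the journal's timezone:

```bash
# On the node (or point journalctl at a captured journal with `--file <path>`):
journalctl -u kubelet -u crio \
  --since "2022-04-29 20:36:00" --until "2022-04-29 20:56:10" \
  | grep -E 'PullImage|Pulling image|Pulled image|ErrImagePull|TLS handshake'
```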
Looks like another image pull around the same time, for another pod, also failed:

Apr 29 20:39:37.746713 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: E0429 20:39:37.746670 3171 remote_image.go:236] "PullImage from image service failed" err="rpc error: code = Unknown desc = Get \"https://quay.io/v2/openshift-release-dev/ocp-v4.0-art-dev/manifests/sha256:7c24a076f0601077f87181c478d6347a507a45735667c926040a6ee03030c789\": net/http: TLS handshake timeout" image="quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7c24a076f0601077f87181c478d6347a507a45735667c926040a6ee03030c789"
Apr 29 20:39:37.747049 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: E0429 20:39:37.746936 3171 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"oauth-proxy\" with ErrImagePull: \"rpc error: code = Unknown desc = Get \\\"https://quay.io/v2/openshift-release-dev/ocp-v4.0-art-dev/manifests/sha256:7c24a076f0601077f87181c478d6347a507a45735667c926040a6ee03030c789\\\": net/http: TLS handshake timeout\"" pod="openshift-machine-config-operator/machine-config-daemon-nszh2" podUID=923a7d23-b17a-42fa-aa1e-b808ae44d21b

There is no sign of a network outage on the node; no interfaces went down during this time. It seems somewhat odd that kubelet tries to pull the image for 16 minutes straight before returning an error, and then the image pulls and the container starts right after. Also, the journal says the image was pulled much earlier for ovnkube-node:

Apr 29 20:39:27.683748 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b crio[3131]: time="2022-04-29 20:39:27.683700050Z" level=info msg="Pulling image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:de7b14f5ba26c7d807bc140dd61fc9dd9549045b4e225ac67e1b4ea44ed72f3e" id=f5529a71-817d-44c7-9582-5f1eabbd8fcf name=/runtime.v1.ImageService/PullImage
Apr 29 20:39:27.686297 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b crio[3131]: time="2022-04-29 20:39:27.686256151Z" level=info msg="Trying to access \"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:de7b14f5ba26c7d807bc140dd61fc9dd9549045b4e225ac67e1b4ea44ed72f3e\""
Apr 29 20:39:29.274857 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b crio[3131]: time="2022-04-29 20:39:29.274811493Z" level=info msg="Pulled image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:de7b14f5ba26c7d807bc140dd61fc9dd9549045b4e225ac67e1b4ea44ed72f3e" id=3b533e9e-b269-40c1-a5f3-28f997af9e8a name=/runtime.v1.ImageService/PullImage

^^^ pulled???
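
The "already present on machine" event and the crio "Pulled image" line can also be cross-checked against CRI-O's local storage from a debug shell on the node. A small sketch using standard crictl commands, with the digest taken from the logs above:

```bash
# From a debug shell on the affected worker (oc debug node/<node>, then chroot /host):
# confirm whether the ovnkube-node image digest is already in CRI-O local storage,
# even though kubelet reported ErrImagePull for it at 20:55:44.
crictl images --digests | grep de7b14f5ba26c7d807bc140dd61fc9dd9549045b4e225ac67e1b4ea44ed72f3e
crictl inspecti quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:de7b14f5ba26c7d807bc140dd61fc9dd9549045b4e225ac67e1b4ea44ed72f3e
```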
Apr 29 20:39:29.276034 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b crio[3131]: time="2022-04-29 20:39:29.275971545Z" level=info msg="Checking image status: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:de7b14f5ba26c7d807bc140dd61fc9dd9549045b4e225ac67e1b4ea44ed72f3e" id=5d4cd1ed-b190-4af4-8b11-8b44e4471316 name=/runtime.v1.ImageService/ImageStatus
Apr 29 20:39:29.277140 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b crio[3131]: time="2022-04-29 20:39:29.277094844Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:9a006ad67f771cf2017b360d096273c158ba23791a0f1590649b70d60e5acae3,RepoTags:[],RepoDigests:[quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:de7b14f5ba26c7d807bc140dd61fc9dd9549045b4e225ac67e1b4ea44ed72f3e],Size_:307724197,Uid:&Int64Value{Value:65534,},Username:,Spec:nil,},Info:map[string]string{},}" id=5d4cd1ed-b190-4af4-8b11-8b44e4471316 name=/runtime.v1.ImageService/ImageStatus
Apr 29 20:55:44.736044 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: E0429 20:55:44.735974 3171 remote_image.go:236] "PullImage from image service failed" err="rpc error: code = Unknown desc = can't talk to a V1 container registry" image="quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:de7b14f5ba26c7d807bc140dd61fc9dd9549045b4e225ac67e1b4ea44ed72f3e"
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: E0429 20:55:44.736178 3171 kuberuntime_manager.go:919] container &Container{Name:kube-rbac-proxy,Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:de7b14f5ba26c7d807bc140dd61fc9dd9549045b4e225ac67e1b4ea44ed72f3e,Command:[/bin/bash -c #!/bin/bash
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: set -euo pipefail
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: TLS_PK=/etc/pki/tls/metrics-cert/tls.key
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: TLS_CERT=/etc/pki/tls/metrics-cert/tls.crt
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: # As the secret mount is optional we must wait for the files to be present.
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: # The service is created in monitor.yaml and this is created in sdn.yaml.
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: # If it isn't created there is probably an issue so we want to crashloop.
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: retries=0
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: TS=$(date +%s)
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: WARN_TS=$(( ${TS} + $(( 20 * 60)) ))
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: HAS_LOGGED_INFO=0
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: log_missing_certs(){
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: CUR_TS=$(date +%s)
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: if [[ "${CUR_TS}" -gt "WARN_TS" ]]; then
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: echo $(date -Iseconds) WARN: ovn-node-metrics-cert not mounted after 20 minutes.
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: elif [[ "${HAS_LOGGED_INFO}" -eq 0 ]] ; then
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: echo $(date -Iseconds) INFO: ovn-node-metrics-cert not mounted. Waiting one hour.
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: HAS_LOGGED_INFO=1
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: fi
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: }
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: while [[ ! -f "${TLS_PK}" || ! -f "${TLS_CERT}" ]] ; do
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: log_missing_certs
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: sleep 5
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: done
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: echo $(date -Iseconds) INFO: ovn-node-metrics-certs mounted, starting kube-rbac-proxy
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: exec /usr/bin/kube-rbac-proxy \
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: --logtostderr \
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: --secure-listen-address=:9103 \
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256 \
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: --upstream=http://127.0.0.1:29103/ \
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: --tls-private-key-file=${TLS_PK} \
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: --tls-cert-file=${TLS_CERT}
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: ],Args:[],WorkingDir:,Ports:[]ContainerPort{ContainerPort{Name:https,HostPort:9103,ContainerPort:9103,Protocol:TCP,HostIP:,},},Env:[]EnvVar{},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{cpu: {{10 -3} {<nil>} 10m DecimalSI},memory: {{20971520 0} {<nil>} 20Mi BinarySI},},},VolumeMounts:[]VolumeMount{VolumeMount{Name:ovn-node-metrics-cert,ReadOnly:true,MountPath:/etc/pki/tls/metrics-cert,SubPath:,MountPropagation:nil,SubPathExpr:,},VolumeMount{Name:kube-api-access-wphxf,ReadOnly:true,MountPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:nil,Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:FallbackToLogsOnError,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,} start failed in pod ovnkube-node-qlg9k_openshift-ovn-kubernetes(e2476fab-71e9-43fb-9437-4ef9557d12e0): ErrImagePull: rpc error: code = Unknown desc = can't talk to a V1 container registry

Moving this to the node team for now so they can examine the kubelet logs/journal and provide their input.
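
For whoever picks this up: the two underlying errors are a TLS handshake timeout against quay.io and the generic "can't talk to a V1 container registry" fallback, so a quick probe of the registry's v2 endpoint from the node would help separate a transient network/registry problem from a CRI-O one. A rough sketch, not something that was run as part of this report:

```bash
# From the affected node: check that quay.io's v2 API endpoint is reachable and
# that the TLS handshake completes. quay.io normally answers /v2/ with a 401,
# which is fine here; what matters is that the handshake and HTTP exchange succeed.
curl -sS -o /dev/null -w 'http_code=%{http_code} time_total=%{time_total}s\n' \
  --connect-timeout 10 https://quay.io/v2/
```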
Devan, is this still happening after the fixes went in for 8.6? I checked the links in the original report and it looks resolved to me.