Bug 2082271 - Degraded DaemonSets Behavior on ovn with RHCOS based on RHEL 8.6
Summary: Degraded DaemonSets Behavior on ovn with RHCOS based on RHEL 8.6
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Tim Rozet
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-05-05 17:17 UTC by Devan Goodwin
Modified: 2022-06-22 18:33 UTC (History)
9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-06-22 18:33:24 UTC
Target Upstream Version:
Embargoed:


Attachments
custom timeline of events thought to be related to the daemonset-upgrade test (474.57 KB, text/plain)
2022-05-05 17:17 UTC, Devan Goodwin

Description Devan Goodwin 2022-05-05 17:17:53 UTC
Created attachment 1877390 [details]
custom timeline of events thought to be related to the daemonset-upgrade test

This is a difficult bug that I have not been able to fully pin down yet, but we believe it appeared during the few days when RHCOS had picked up RHEL 8.6. The problem surfaces in one or both of two tests:

Test 1: [sig-cluster-lifecycle] ClusterOperators are available and not degraded after upgrade 

If you dig into the specific failures, it is always a ClusterOperator complaining about a DaemonSet not running on one node (seemingly only ever just one missing). The affected ClusterOperator varies: network, dns, and in one example storage.

https://sippy.dptools.openshift.org/sippy-ng/tests/4.11/analysis?test=Cluster%20upgrade.[sig-cluster-lifecycle]%20ClusterOperators%20are%20available%20and%20not%20degraded%20after%20upgrade

Sippy shows the dip between April 29 and May 3, when we picked up RHEL 8.6, and an immediate recovery on the date we reverted to 8.5. The ovn pass rate drops from 100% to about 80% during this window.
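
For a run like this, the quickest way to see which operator tripped and why is to read the Degraded condition off the ClusterOperators. A minimal sketch, assuming a kubeconfig for the cluster under test (for a finished CI run the same information is in the gathered clusteroperators JSON rather than a live API):

oc get clusteroperators -o json | jq -r '
  .items[]
  | .metadata.name as $op
  | .status.conditions[]?
  | select(.type == "Degraded" and .status == "True")
  | "\($op): \(.message)"'

In the failing runs the Degraded message is what names the DaemonSet that is missing a pod on one node.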



Test 2: disruption_tests: [sig-apps] daemonset-upgrade

This is an upstream kube test which creates a DaemonSet and verifies it is running on all nodes. Similarly, it seems to always fail with exactly one pod missing.

This test shows the exact same symptoms for ovn as the one above, with the same start/recovery dates.

https://sippy.dptools.openshift.org/sippy-ng/tests/4.11/analysis?test=Cluster%20upgrade.[sig-apps]%20daemonset-upgrade
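
For reference, what the test asserts boils down to comparing the DaemonSet's desired and ready pod counts and finding the node without a copy. A rough bash equivalent (a sketch only, not the upstream test code; the namespace is the one from the run discussed below, and the node diff ignores taints/selectors so it is only approximate):

NS=e2e-k8s-sig-apps-daemonset-upgrade-347
oc -n "$NS" get ds ds1 -o jsonpath='{.status.desiredNumberScheduled} {.status.numberReady}{"\n"}'
# Approximate check for which node is missing its copy:
comm -23 \
  <(oc get nodes -o name | sed 's|node/||' | sort) \
  <(oc -n "$NS" get pods -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort)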


Because 8.6 was reverted, at present we can only work with the data from the job runs that hit the problem. For most of this bug I will focus on one of the jobs that hit both failures.

If more results are needed, they can be found hanging off these two aggregated jobs:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-gcp-ovn-upgrade-4.11-minor-release-openshift-release-analysis-aggregator/1520113039418855424
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-gcp-ovn-upgrade-4.11-minor-release-openshift-release-analysis-aggregator/1520170908717682688


The job I will be focusing on has a DaemonSet issue for clusteroperator/network and clusteroperator/dns, as well as the daemonset-upgrade test failure.

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-gcp-ovn-upgrade/1520113036898078720

Focusing on the daemonset-upgrade test, I will attach a CSV with search results across all the kube events and e2e events/intervals we recorded (see daemonset-test.csv). These searches used an unreleased TRT tool based on Elasticsearch, which can be shared if needed, or I can iterate on whatever data is needed. For posterity, the query I was using was:

prowjob:"1520113036898078720" and ("ci-op-chnsihgf-de73b-gskks-worker-c-jb49b" and ("ds1" or "node is not ready" or "ovnkube-node"))
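
The TRT tool itself is unreleased, but the query is plain Lucene syntax, so an equivalent search can be run directly against the backing Elasticsearch index. A sketch only, with the boolean operators uppercased for Lucene; the host and index name here are placeholders, not the real ones:

curl -s "https://<es-host>/<events-index>/_search" -H 'Content-Type: application/json' -d '{
  "query": {
    "query_string": {
      "query": "prowjob:\"1520113036898078720\" AND (\"ci-op-chnsihgf-de73b-gskks-worker-c-jb49b\" AND (\"ds1\" OR \"node is not ready\" OR \"ovnkube-node\"))"
    }
  },
  "size": 500
}'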


The test failed at 20:53:08 because the ds1 DaemonSet was missing a pod on the node in my query above.

There is a flurry of pod sandbox errors from roughly 20:40:34 to 20:54:10 (during which time our test fails).

Some ovnkube-node ContainerStart events fire at 20:56:24, after extreme 2000-3000s delays.

And it looks like our pod/ds1-7gbwt gets created at 20:56:24 (the same time as ovnkube).
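
The same timeline can be rebuilt without the TRT tool from the gathered Kubernetes events. A sketch, assuming events.json is the v1 EventList collected in the job's gather-extra artifacts (the artifact layout is an assumption):

jq -r '.items[]
  | select((.message // "") | test("ds1-7gbwt|ovnkube-node|sandbox"; "i"))
  | "\(.lastTimestamp // .eventTime) \(.involvedObject.namespace)/\(.involvedObject.name) \(.reason): \(.message)"' \
  events.json | sort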



This is one example, and I'm sure more can be examined in those test results. We do strongly believe this is caused by something in RHEL 8.6, and it might be related to the other regression we know of in https://bugzilla.redhat.com/show_bug.cgi?id=2081069

Comment 1 Ben Bennett 2022-05-09 14:37:22 UTC
As an update... we are working with the RHCOS team to reproduce this out of tree so that we can work together to get to the root cause without breaking OpenShift CI.

This is under active investigation on #wg-4_11-triage in Slack.

Comment 2 Tim Rozet 2022-05-10 21:23:55 UTC
Looking at these logs, the delay in bringing up the DaemonSet pod is because ovnkube-node itself is delayed in starting. We can see that ovnkube-node binds the port right after it starts, with no creation error:
ovnkube-node starts at: 2022-04-29T20:55:45.049903075Z

2022-04-29T20:56:04.212732223Z I0429 20:56:04.212673   13216 cni.go:227] [e2e-k8s-sig-apps-daemonset-upgrade-347/ds1-7gbwt 304ef75a5bf15c520c93eb04670fb08969e2b6a693bcfabec38f6fc88d76d615] ADD starting CNI request [e2e-k8s-sig-apps-daemonset-upgrade-347/ds1-7gbwt 304ef75a5bf15c520c93eb04670fb08969e2b6a693bcfabec38f6fc88d76d615]
2022-04-29T20:56:04.277373950Z I0429 20:56:04.277326   13216 helper_linux.go:257] ConfigureOVS: namespace: e2e-k8s-sig-apps-daemonset-upgrade-347, podName: ds1-7gbwt
2022-04-29T20:56:05.788551703Z I0429 20:56:05.788465   13216 cni.go:248] [e2e-k8s-sig-apps-daemonset-upgrade-347/ds1-7gbwt 304ef75a5bf15c520c93eb04670fb08969e2b6a693bcfabec38f6fc88d76d615] ADD finished CNI request [e2e-k8s-sig-apps-daemonset-upgrade-347/ds1-7gbwt 304ef75a5bf15c520c93eb04670fb08969e2b6a693bcfabec38f6fc88d76d615], result "{\"interfaces\":[{\"name\":\"304ef75a5bf15c5\",\"mac\":\"26:4e:ee:65:a2:45\"},{\"name\":\"eth0\",\"mac\":\"0a:58:0a:81:02:0c\",\"sandbox\":\"/var/run/netns/1c3ba734-90dc-468b-a15a-a9a21abc4bd6\"}],\"ips\":[{\"version\":\"4\",\"interface\":1,\"address\":\"10.129.2.12/23\",\"gateway\":\"10.129.2.1\"}],\"dns\":{}}", err <nil>
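
A quick way to confirm the CNI handling itself was fast is to pull the matching ADD start/finish lines out of the ovnkube-node container log in the artifacts (the log file name below is an assumption about how the artifacts are laid out):

grep -E 'ds1-7gbwt.*ADD (starting|finished) CNI request' ovnkube-node-qlg9k_ovnkube-node.log

The two timestamps are roughly 1.5 seconds apart, so the CNI ADD itself is not where the time went.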

The reason for the overall delay in the pod starting is the delay in ovnkube-node starting, which in turn is because the image pull is attempted for 16 minutes and then fails. Looking at the events:

20:36:13  openshift-ovn-kubernetes/ovnkube-node-qlg9k  (node-controller)  NodeNotReady: Node is not ready

20:39:27  openshift-ovn-kubernetes/ovnkube-node-qlg9k  (kubelet)  Pulling: Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:de7b14f5ba26c7d807bc140dd61fc9dd9549045b4e225ac67e1b4ea44ed72f3e"

20:55:44  openshift-ovn-kubernetes/ovnkube-node-qlg9k  (kubelet)  Failed: Error: ErrImagePull

20:55:44  openshift-ovn-kubernetes/ovnkube-node-qlg9k  (kubelet)  Pulled: Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:de7b14f5ba26c7d807bc140dd61fc9dd9549045b4e225ac67e1b4ea44ed72f3e" already present on machine

20:55:44  openshift-ovn-kubernetes/ovnkube-node-qlg9k  (kubelet)  Failed: Failed to pull image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:de7b14f5ba26c7d807bc140dd61fc9dd9549045b4e225ac67e1b4ea44ed72f3e": rpc error: code = Unknown desc = can't talk to a V1 container registry

20:55:45  openshift-ovn-kubernetes/ovnkube-node-qlg9k  (kubelet)  Created: Created container ovnkube-node


Going to look through the journal to see if there is any obvious reason why the image pull was a problem.
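
A sketch of the kind of journal search that covers this window; the file name is whatever the gathered node journal is called in the artifacts (on a live node this would be journalctl -u crio -u kubelet instead):

grep -E 'de7b14f5|ErrImagePull|TLS handshake|V1 container registry' journal \
  | grep -E ' 20:(3[6-9]|4[0-9]|5[0-6]):'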

Comment 3 Tim Rozet 2022-05-10 22:08:09 UTC
Looks like another image pull around the same time for another pod also failed:

Apr 29 20:39:37.746713 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: E0429 20:39:37.746670    3171 remote_image.go:236] "PullImage from image service failed" err="rpc error: code = Unknown desc = Get \"https://quay.io/v2/openshift-release-dev/ocp-v4.0-art-dev/manifests/sha256:7c24a076f0601077f87181c478d6347a507a45735667c926040a6ee03030c789\": net/http: TLS handshake timeout" image="quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7c24a076f0601077f87181c478d6347a507a45735667c926040a6ee03030c789"

Apr 29 20:39:37.747049 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: E0429 20:39:37.746936    3171 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"oauth-proxy\" with ErrImagePull: \"rpc error: code = Unknown desc = Get \\\"https://quay.io/v2/openshift-release-dev/ocp-v4.0-art-dev/manifests/sha256:7c24a076f0601077f87181c478d6347a507a45735667c926040a6ee03030c789\\\": net/http: TLS handshake timeout\"" pod="openshift-machine-config-operator/machine-config-daemon-nszh2" podUID=923a7d23-b17a-42fa-aa1e-b808ae44d21b


There is no sign of a network outage on the node during this time, at least in terms of interfaces going down.

It seems somewhat odd that kubelet tries to pull the image for 16 minutes straight before returning an error, and then the image is pulled and the container starts right after. Also, the journal says the image was pulled much earlier for ovnkube-node:

Apr 29 20:39:27.683748 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b crio[3131]: time="2022-04-29 20:39:27.683700050Z" level=info msg="Pulling image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:de7b14f5ba26c7d807bc140dd61fc9dd9549045b4e225ac67e1b4ea44ed72f3e" id=f5529a71-817d-44c7-9582-5f1eabbd8fcf name=/runtime.v1.ImageService/PullImage
Apr 29 20:39:27.686297 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b crio[3131]: time="2022-04-29 20:39:27.686256151Z" level=info msg="Trying to access \"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:de7b14f5ba26c7d807bc140dd61fc9dd9549045b4e225ac67e1b4ea44ed72f3e\""

Apr 29 20:39:29.274857 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b crio[3131]: time="2022-04-29 20:39:29.274811493Z" level=info msg="Pulled image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:de7b14f5ba26c7d807bc140dd61fc9dd9549045b4e225ac67e1b4ea44ed72f3e" id=3b533e9e-b269-40c1-a5f3-28f997af9e8a name=/runtime.v1.ImageService/PullImage

^^^ pulled???

Apr 29 20:39:29.276034 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b crio[3131]: time="2022-04-29 20:39:29.275971545Z" level=info msg="Checking image status: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:de7b14f5ba26c7d807bc140dd61fc9dd9549045b4e225ac67e1b4ea44ed72f3e" id=5d4cd1ed-b190-4af4-8b11-8b44e4471316 name=/runtime.v1.ImageService/ImageStatus
Apr 29 20:39:29.277140 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b crio[3131]: time="2022-04-29 20:39:29.277094844Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:9a006ad67f771cf2017b360d096273c158ba23791a0f1590649b70d60e5acae3,RepoTags:[],RepoDigests:[quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:de7b14f5ba26c7d807bc140dd61fc9dd9549045b4e225ac67e1b4ea44ed72f3e],Size_:307724197,Uid:&Int64Value{Value:65534,},Username:,Spec:nil,},Info:map[string]string{},}" id=5d4cd1ed-b190-4af4-8b11-8b44e4471316 name=/runtime.v1.ImageService/ImageStatus
Apr 29 20:55:44.736044 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: E0429 20:55:44.735974    3171 remote_image.go:236] "PullImage from image service failed" err="rpc error: code = Unknown desc = can't talk to a V1 container registry" image="quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:de7b14f5ba26c7d807bc140dd61fc9dd9549045b4e225ac67e1b4ea44ed72f3e"
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: E0429 20:55:44.736178    3171 kuberuntime_manager.go:919] container &Container{Name:kube-rbac-proxy,Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:de7b14f5ba26c7d807bc140dd61fc9dd9549045b4e225ac67e1b4ea44ed72f3e,Command:[/bin/bash -c #!/bin/bash
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: set -euo pipefail
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: TLS_PK=/etc/pki/tls/metrics-cert/tls.key
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: TLS_CERT=/etc/pki/tls/metrics-cert/tls.crt
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: # As the secret mount is optional we must wait for the files to be present.
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: # The service is created in monitor.yaml and this is created in sdn.yaml.
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: # If it isn't created there is probably an issue so we want to crashloop.
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: retries=0
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: TS=$(date +%s)
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: WARN_TS=$(( ${TS} + $(( 20 * 60)) ))
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: HAS_LOGGED_INFO=0
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: log_missing_certs(){
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]:     CUR_TS=$(date +%s)
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]:     if [[ "${CUR_TS}" -gt "WARN_TS"  ]]; then
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]:       echo $(date -Iseconds) WARN: ovn-node-metrics-cert not mounted after 20 minutes.
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]:     elif [[ "${HAS_LOGGED_INFO}" -eq 0 ]] ; then
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]:       echo $(date -Iseconds) INFO: ovn-node-metrics-cert not mounted. Waiting one hour.
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]:       HAS_LOGGED_INFO=1
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]:     fi
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: }
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: while [[ ! -f "${TLS_PK}" ||  ! -f "${TLS_CERT}" ]] ; do
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]:   log_missing_certs
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]:   sleep 5
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: done
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: echo $(date -Iseconds) INFO: ovn-node-metrics-certs mounted, starting kube-rbac-proxy
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: exec /usr/bin/kube-rbac-proxy \
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]:   --logtostderr \
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]:   --secure-listen-address=:9103 \
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]:   --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256 \
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]:   --upstream=http://127.0.0.1:29103/ \
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]:   --tls-private-key-file=${TLS_PK} \
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]:   --tls-cert-file=${TLS_CERT}
Apr 29 20:55:44.736264 ci-op-chnsihgf-de73b-gskks-worker-c-jb49b hyperkube[3171]: ],Args:[],WorkingDir:,Ports:[]ContainerPort{ContainerPort{Name:https,HostPort:9103,ContainerPort:9103,Protocol:TCP,HostIP:,},},Env:[]EnvVar{},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{cpu: {{10 -3} {<nil>} 10m DecimalSI},memory: {{20971520 0} {<nil>} 20Mi BinarySI},},},VolumeMounts:[]VolumeMount{VolumeMount{Name:ovn-node-metrics-cert,ReadOnly:true,MountPath:/etc/pki/tls/metrics-cert,SubPath:,MountPropagation:nil,SubPathExpr:,},VolumeMount{Name:kube-api-access-wphxf,ReadOnly:true,MountPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:nil,Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:FallbackToLogsOnError,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,} start failed in pod ovnkube-node-qlg9k_openshift-ovn-kubernetes(e2476fab-71e9-43fb-9437-4ef9557d12e0): ErrImagePull: rpc error: code = Unknown desc = can't talk to a V1 container registry
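
If this reproduces on a live cluster, the image-presence discrepancy above can be checked directly from the node. A sketch, assuming node access via oc debug (the node name is a placeholder):

oc debug node/<node> -- chroot /host crictl images --digests | grep de7b14f5

Seeing the digest listed while kubelet still reports ErrImagePull would confirm the failure is in the pull/registry path rather than missing image data.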

Comment 4 Tim Rozet 2022-05-10 22:10:08 UTC
Moving this to the node team for now so they can examine the kubelet logs/journal and provide their input.

Comment 21 Tim Rozet 2022-06-22 18:25:19 UTC
Devan, is this still happening after the fixes went in for 8.6?

Comment 22 Devan Goodwin 2022-06-22 18:33:24 UTC
I checked the links in the original report, and this looks resolved to me.

