Bug 1982448

Summary:	[OVN Kubernetes] ovnkube-node container withing ovnkube-node pods report readiness probe times out
Product:	OpenShift Container Platform	Reporter:	Kedar Kulkarni <kkulkarn>
Component:	Networking	Assignee:	Mohamed Mahmoud <mmahmoud>
Networking sub component:	ovn-kubernetes	QA Contact:	Kedar Kulkarni <kkulkarn>
Status:	CLOSED DUPLICATE	Docs Contact:
Severity:	medium
Priority:	unspecified	CC:	anbhat, astoycos, dblack, jlema, pehunt, smalleni
Version:	4.9
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:	perfscale-ovn
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-07-23 13:23:36 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Kedar Kulkarni 2021-07-14 20:48:01 UTC

Description of problem:
While executing Cluster Density tests on 250 workers cluster, 2000 iterations, I observed that multiple ovnkube-node pods reported frequently that readiness-probe timed out for ovnkube-node containers.

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-07-12-203753

How reproducible:


Steps to Reproduce:
1.Have a OVN OpenShift Cluster, you could try with smaller cluster, but this issue was observed on a large scale cluster
2.Run Cluster Density from https://github.com/cloud-bulldozer/e2e-benchmarking/blob/master/workloads/kube-burner/run_clusterdensity_test_fromgit.sh
3.Observe the ovnkube-node pods

Actual results:
ovnkube-node container reports errors as:

   Warning  Unhealthy  33s (x92 over 3h42m)  kubelet  Readiness probe failed: command timed out


Expected results:
no readiness probe time outs

Additional info:

Comment 3 Peter Hunt 2021-07-23 13:23:36 UTC

I would say this is a dup of https://bugzilla.redhat.com/show_bug.cgi?id=1978268

*** This bug has been marked as a duplicate of bug 1978268 ***

Comment 4 Jose Castillo Lema 2021-07-23 15:54:08 UTC

I am commenting here because I did not see any OVN related info in bug 1978268, and we are only observing readiness probe timeout in the ovn and openshift-kni-projects.

We are getting the same errors during 120 node baremetal perf testing, both for ovnkube-master pods and ovnkube-node ones:
$ oc get events -n openshift-ovn-kubernetes
LAST SEEN   TYPE      REASON      OBJECT                     MESSAGE
21s         Warning   Unhealthy   pod/ovnkube-master-6qhlm   Readiness probe failed: ++ /usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=3 cluster/status OVN_Southbound
++ grep 'Leader: unknown'
++ true
+ leader_status=
4m51s       Warning   Unhealthy   pod/ovnkube-master-6qhlm   Readiness probe failed: ++ /usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=3 cluster/status OVN_Northbound
++ grep 'Leader: unknown'
++ true
+ leader_status=
137m        Warning   Unhealthy   pod/ovnkube-master-qcwqc   Readiness probe failed: ++ /usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=3 cluster/status OVN_Northbound
++ grep 'Leader: unknown'
++ true
+ leader_status=
137m        Warning   Unhealthy   pod/ovnkube-master-qcwqc   Readiness probe failed: ++ /usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=3 cluster/status OVN_Southbound
++ grep 'Leader: unknown'
++ true
+ leader_status=
132m        Warning   Unhealthy   pod/ovnkube-master-zhfr7   Readiness probe failed: ++ /usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=3 cluster/status OVN_Northbound
++ grep 'Leader: unknown'
++ true
+ leader_status=
3m39s       Warning   Unhealthy   pod/ovnkube-master-zhfr7   Readiness probe failed: ++ /usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=3 cluster/status OVN_Southbound
++ grep 'Leader: unknown'
++ true
+ leader_status=
109s        Warning   Unhealthy   pod/ovnkube-node-4fz8h     Readiness probe failed: command timed out
4s          Warning   Unhealthy   pod/ovnkube-node-8wjjr     Readiness probe failed: command timed out
32m         Warning   Unhealthy   pod/ovnkube-node-jmr47     Readiness probe failed: command timed out
105s        Warning   Unhealthy   pod/ovnkube-node-pn4v8     Readiness probe failed: command timed out
2m18s       Warning   Unhealthy   pod/ovnkube-node-tf576     Readiness probe failed: command timed out

Comment 5 Kedar Kulkarni 2021-07-29 17:57:30 UTC

Hi,

I ran the tests for 120 worker cluster_density with 1000 iterations, and I couldn't reproduce the problem with the latest build of 4.9 nightly. I didn't see any "   Warning  Unhealthy  33s (x92 over 3h42m)  kubelet  Readiness probe failed: command timed out". It is ok for this bz to remain closed duplicate.

Build version:4.9.0-0.nightly-2021-07-29-004741

@jlema Please open a new bug to track the errors mentioned in comment 4.

Thanks,
KK.

Comment 6 Sai Sindhur Malleni 2021-07-29 18:46:39 UTC

(In reply to Jose Castillo Lema from comment #4)
> I am commenting here because I did not see any OVN related info in bug
> 1978268, and we are only observing readiness probe timeout in the ovn and
> openshift-kni-projects.
> 
> We are getting the same errors during 120 node baremetal perf testing, both
> for ovnkube-master pods and ovnkube-node ones:
> $ oc get events -n openshift-ovn-kubernetes
> LAST SEEN   TYPE      REASON      OBJECT                     MESSAGE
> 21s         Warning   Unhealthy   pod/ovnkube-master-6qhlm   Readiness probe
> failed: ++ /usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=3
> cluster/status OVN_Southbound
> ++ grep 'Leader: unknown'
> ++ true
> + leader_status=
> 4m51s       Warning   Unhealthy   pod/ovnkube-master-6qhlm   Readiness probe
> failed: ++ /usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=3
> cluster/status OVN_Northbound
> ++ grep 'Leader: unknown'
> ++ true
> + leader_status=
> 137m        Warning   Unhealthy   pod/ovnkube-master-qcwqc   Readiness probe
> failed: ++ /usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=3
> cluster/status OVN_Northbound
> ++ grep 'Leader: unknown'
> ++ true
> + leader_status=
> 137m        Warning   Unhealthy   pod/ovnkube-master-qcwqc   Readiness probe
> failed: ++ /usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=3
> cluster/status OVN_Southbound
> ++ grep 'Leader: unknown'
> ++ true
> + leader_status=
> 132m        Warning   Unhealthy   pod/ovnkube-master-zhfr7   Readiness probe
> failed: ++ /usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=3
> cluster/status OVN_Northbound
> ++ grep 'Leader: unknown'
> ++ true
> + leader_status=
> 3m39s       Warning   Unhealthy   pod/ovnkube-master-zhfr7   Readiness probe
> failed: ++ /usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=3
> cluster/status OVN_Southbound
> ++ grep 'Leader: unknown'
> ++ true
> + leader_status=
> 109s        Warning   Unhealthy   pod/ovnkube-node-4fz8h     Readiness probe
> failed: command timed out
> 4s          Warning   Unhealthy   pod/ovnkube-node-8wjjr     Readiness probe
> failed: command timed out
> 32m         Warning   Unhealthy   pod/ovnkube-node-jmr47     Readiness probe
> failed: command timed out
> 105s        Warning   Unhealthy   pod/ovnkube-node-pn4v8     Readiness probe
> failed: command timed out
> 2m18s       Warning   Unhealthy   pod/ovnkube-node-tf576     Readiness probe
> failed: command timed out

What version are you seeing this on? I think this BZ tracks 4.9

Comment 7 Jose Castillo Lema 2021-07-29 19:25:24 UTC

Apologies, I haven't realized this is a BZ tracking 4.9.
I have seen this events in a OCP 4.8.0-fc.9 (local gateway / ovn2.13-20.12.0-25.el8fdp.x86_64).