Description of problem: While executing Cluster Density tests on 250 workers cluster, 2000 iterations, I observed that multiple ovnkube-node pods reported frequently that readiness-probe timed out for ovnkube-node containers. Version-Release number of selected component (if applicable): 4.9.0-0.nightly-2021-07-12-203753 How reproducible: Steps to Reproduce: 1.Have a OVN OpenShift Cluster, you could try with smaller cluster, but this issue was observed on a large scale cluster 2.Run Cluster Density from https://github.com/cloud-bulldozer/e2e-benchmarking/blob/master/workloads/kube-burner/run_clusterdensity_test_fromgit.sh 3.Observe the ovnkube-node pods Actual results: ovnkube-node container reports errors as: Warning Unhealthy 33s (x92 over 3h42m) kubelet Readiness probe failed: command timed out Expected results: no readiness probe time outs Additional info:
I would say this is a dup of https://bugzilla.redhat.com/show_bug.cgi?id=1978268 *** This bug has been marked as a duplicate of bug 1978268 ***
I am commenting here because I did not see any OVN related info in bug 1978268, and we are only observing readiness probe timeout in the ovn and openshift-kni-projects. We are getting the same errors during 120 node baremetal perf testing, both for ovnkube-master pods and ovnkube-node ones: $ oc get events -n openshift-ovn-kubernetes LAST SEEN TYPE REASON OBJECT MESSAGE 21s Warning Unhealthy pod/ovnkube-master-6qhlm Readiness probe failed: ++ /usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=3 cluster/status OVN_Southbound ++ grep 'Leader: unknown' ++ true + leader_status= 4m51s Warning Unhealthy pod/ovnkube-master-6qhlm Readiness probe failed: ++ /usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=3 cluster/status OVN_Northbound ++ grep 'Leader: unknown' ++ true + leader_status= 137m Warning Unhealthy pod/ovnkube-master-qcwqc Readiness probe failed: ++ /usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=3 cluster/status OVN_Northbound ++ grep 'Leader: unknown' ++ true + leader_status= 137m Warning Unhealthy pod/ovnkube-master-qcwqc Readiness probe failed: ++ /usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=3 cluster/status OVN_Southbound ++ grep 'Leader: unknown' ++ true + leader_status= 132m Warning Unhealthy pod/ovnkube-master-zhfr7 Readiness probe failed: ++ /usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=3 cluster/status OVN_Northbound ++ grep 'Leader: unknown' ++ true + leader_status= 3m39s Warning Unhealthy pod/ovnkube-master-zhfr7 Readiness probe failed: ++ /usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=3 cluster/status OVN_Southbound ++ grep 'Leader: unknown' ++ true + leader_status= 109s Warning Unhealthy pod/ovnkube-node-4fz8h Readiness probe failed: command timed out 4s Warning Unhealthy pod/ovnkube-node-8wjjr Readiness probe failed: command timed out 32m Warning Unhealthy pod/ovnkube-node-jmr47 Readiness probe failed: command timed out 105s Warning Unhealthy pod/ovnkube-node-pn4v8 Readiness probe failed: command timed out 2m18s Warning Unhealthy pod/ovnkube-node-tf576 Readiness probe failed: command timed out
Hi, I ran the tests for 120 worker cluster_density with 1000 iterations, and I couldn't reproduce the problem with the latest build of 4.9 nightly. I didn't see any " Warning Unhealthy 33s (x92 over 3h42m) kubelet Readiness probe failed: command timed out". It is ok for this bz to remain closed duplicate. Build version:4.9.0-0.nightly-2021-07-29-004741 @jlema Please open a new bug to track the errors mentioned in comment 4. Thanks, KK.
(In reply to Jose Castillo Lema from comment #4) > I am commenting here because I did not see any OVN related info in bug > 1978268, and we are only observing readiness probe timeout in the ovn and > openshift-kni-projects. > > We are getting the same errors during 120 node baremetal perf testing, both > for ovnkube-master pods and ovnkube-node ones: > $ oc get events -n openshift-ovn-kubernetes > LAST SEEN TYPE REASON OBJECT MESSAGE > 21s Warning Unhealthy pod/ovnkube-master-6qhlm Readiness probe > failed: ++ /usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=3 > cluster/status OVN_Southbound > ++ grep 'Leader: unknown' > ++ true > + leader_status= > 4m51s Warning Unhealthy pod/ovnkube-master-6qhlm Readiness probe > failed: ++ /usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=3 > cluster/status OVN_Northbound > ++ grep 'Leader: unknown' > ++ true > + leader_status= > 137m Warning Unhealthy pod/ovnkube-master-qcwqc Readiness probe > failed: ++ /usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=3 > cluster/status OVN_Northbound > ++ grep 'Leader: unknown' > ++ true > + leader_status= > 137m Warning Unhealthy pod/ovnkube-master-qcwqc Readiness probe > failed: ++ /usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=3 > cluster/status OVN_Southbound > ++ grep 'Leader: unknown' > ++ true > + leader_status= > 132m Warning Unhealthy pod/ovnkube-master-zhfr7 Readiness probe > failed: ++ /usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=3 > cluster/status OVN_Northbound > ++ grep 'Leader: unknown' > ++ true > + leader_status= > 3m39s Warning Unhealthy pod/ovnkube-master-zhfr7 Readiness probe > failed: ++ /usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=3 > cluster/status OVN_Southbound > ++ grep 'Leader: unknown' > ++ true > + leader_status= > 109s Warning Unhealthy pod/ovnkube-node-4fz8h Readiness probe > failed: command timed out > 4s Warning Unhealthy pod/ovnkube-node-8wjjr Readiness probe > failed: command timed out > 32m Warning Unhealthy pod/ovnkube-node-jmr47 Readiness probe > failed: command timed out > 105s Warning Unhealthy pod/ovnkube-node-pn4v8 Readiness probe > failed: command timed out > 2m18s Warning Unhealthy pod/ovnkube-node-tf576 Readiness probe > failed: command timed out What version are you seeing this on? I think this BZ tracks 4.9
Apologies, I haven't realized this is a BZ tracking 4.9. I have seen this events in a OCP 4.8.0-fc.9 (local gateway / ovn2.13-20.12.0-25.el8fdp.x86_64).