Description of problem:
On the release-openshift-ocp-installer-e2e-ovirt-4.5 job, the "[k8s.io] [sig-node] Events should be sent by kubelets and the scheduler about pods scheduling and running [Conformance]" test has less than a 50% pass rate. The failure seems to be a timeout.

Example run:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.5/1271128314311872512

Query:
https://search.apps.build01.ci.devcluster.openshift.com/?maxAge=168h&context=1&type=junit&maxMatches=5&maxBytes=20971520&groupBy=job&name=release-openshift-ocp-installer-e2e-ovirt-4.5&search=%5C%5Bk8s%5C.io%5C%5D+%5C%5Bsig-node%5C%5D+Events+should+be+sent+by+kubelets+and+the+scheduler+about+pods+scheduling+and+running++%5C%5BConformance%5C%5D

Did not find a BZ related to this; would like to see if this is an actual issue.

Version-Release number of selected component (if applicable): 4.5

How reproducible: 50%

Steps to Reproduce: CI

Actual results: Test fails

Expected results: Test passes

Additional info:
Hi all, I'm seeing this bug now; I commented on [1] (should we mark it as a duplicate of this bug?). On oVirt CI we are seeing this a lot, on both 4.6 and 4.5:

4.6 - https://search.apps.build01.ci.devcluster.openshift.com/?search=Events+should+be+sent+by+kubelets+and+the+scheduler+about+pods+scheduling+and+running&maxAge=168h&context=1&type=junit&name=release-openshift-ocp-installer-e2e-ovirt-4.6&maxMatches=5&maxBytes=20971520&groupBy=job
4.5 - https://search.apps.build01.ci.devcluster.openshift.com/?search=Events+should+be+sent+by+kubelets+and+the+scheduler+about+pods+scheduling+and+running&maxAge=168h&context=1&type=junit&name=release-openshift-ocp-installer-e2e-ovirt-4.5&maxMatches=5&maxBytes=20971520&groupBy=job

I saw this was marked as MODIFIED and moved to QE, so I think we need to understand whether the fix is applied in CI and, if so, whether it is enough. Also, can you please explain what is causing the test to fail?

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1848081
Sorry for the silence, this kept dropping off my radar... If I'm being honest, I do not understand the connection Ryan made between this and the runc fix. I assumed he found evidence that we had a failed `runc state` call (a container mysteriously getting a 255 exit code), but didn't do any verification. At this point the fix should be in 4.5.z CI, and if the issue is not fixed, then this should not be in MODIFIED state.
*** Bug 1848081 has been marked as a duplicate of this bug. ***
*** Bug 1852924 has been marked as a duplicate of this bug. ***
https://search.ci.openshift.org/?search=Events+should+be+sent+by+kubelets+and+the+scheduler+about+pods+scheduling+and+running&maxAge=48h&context=2&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Looking at a recent failure
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.5/1288135021726535680

STEP: checking for scheduler event about the pod
Jul 28 18:43:48.403: INFO: Saw scheduler event for our pod.
STEP: checking for kubelet event about the pod
STEP: deleting the pod
[AfterEach] [k8s.io] [sig-node] Events
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:179
STEP: Collecting events from namespace "e2e-events-6610".
STEP: Found 2 events.
Jul 28 18:44:48.561: INFO: At 2020-07-28 18:43:42 +0000 UTC - event for send-events-93c2da02-0fb4-47fd-b96b-e4846958f031: {default-scheduler } Scheduled: Successfully assigned e2e-events-6610/send-events-93c2da02-0fb4-47fd-b96b-e4846958f031 to ovirt16-x7nmv-worker-0-hdtkn
Jul 28 18:44:48.561: INFO: At 2020-07-28 18:43:44 +0000 UTC - event for send-events-93c2da02-0fb4-47fd-b96b-e4846958f031: {multus } AddedInterface: Add eth0 [10.128.3.86/23]
Jul 28 18:44:48.600: INFO: POD NODE PHASE GRACE CONDITIONS
Jul 28 18:44:48.600: INFO: send-events-93c2da02-0fb4-47fd-b96b-e4846958f031 ovirt16-x7nmv-worker-0-hdtkn Running 30s [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-07-28 18:43:42 +0000 UTC } {Ready True 0001-01-01 00:00:00 +0000 UTC 2020-07-28 18:43:45 +0000 UTC } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2020-07-28 18:43:45 +0000 UTC } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-07-28 18:43:42 +0000 UTC }]
Jul 28 18:44:48.600: INFO:
Jul 28 18:44:48.642: INFO: send-events-93c2da02-0fb4-47fd-b96b-e4846958f031[e2e-events-6610].container[p].log
2020/07/28 18:43:44 Serving on port 9376.
2020/07/28 18:44:48 Shutting down after receiving signal: terminated.
2020/07/28 18:44:48 Awaiting pod deletion.
Seems like the events from the kubelet are not observed:

18:43:42 - scheduler sees pod
18:43:42.186509 - kubelet sees pod
18:43:45.102351 - PLEG detects container has started
18:44:48.513224 - DELETE from test timeout

In the worker logs

Jul 28 18:43:44.343218 ovirt16-x7nmv-worker-0-hdtkn crio[1495]: I0728 18:43:44.343171 431456 event.go:221] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-events-6610", Name:"send-events-93c2da02-0fb4-47fd-b96b-e4846958f031", UID:"aef1e984-8249-4228-a134-50bb6495ab5f", APIVersion:"v1", ResourceVersion:"146560", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add eth0 [10.128.3.86/23]
Jul 28 18:43:44.554372 ovirt16-x7nmv-worker-0-hdtkn hyperkube[1576]: I0728 18:43:44.554338 1576 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-events-6610", Name:"send-events-93c2da02-0fb4-47fd-b96b-e4846958f031", UID:"aef1e984-8249-4228-a134-50bb6495ab5f", APIVersion:"v1", ResourceVersion:"146558", FieldPath:"spec.containers{p}"}): type: 'Normal' reason: 'Pulled' Container image "us.gcr.io/k8s-artifacts-prod/e2e-test-images/agnhost:2.12" already present on machine
Jul 28 18:43:44.888712 ovirt16-x7nmv-worker-0-hdtkn hyperkube[1576]: I0728 18:43:44.888661 1576 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-events-6610", Name:"send-events-93c2da02-0fb4-47fd-b96b-e4846958f031", UID:"aef1e984-8249-4228-a134-50bb6495ab5f", APIVersion:"v1", ResourceVersion:"146558", FieldPath:"spec.containers{p}"}): type: 'Normal' reason: 'Created' Created container p
Jul 28 18:43:44.934049 ovirt16-x7nmv-worker-0-hdtkn hyperkube[1576]: I0728 18:43:44.933917 1576 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-events-6610", Name:"send-events-93c2da02-0fb4-47fd-b96b-e4846958f031", UID:"aef1e984-8249-4228-a134-50bb6495ab5f", APIVersion:"v1", ResourceVersion:"146558", FieldPath:"spec.containers{p}"}): type: 'Normal' reason: 'Started' Started container p
Jul 28 18:43:45.102369 ovirt16-x7nmv-worker-0-hdtkn hyperkube[1576]: I0728 18:43:45.102351 1576 kubelet.go:1970] SyncLoop (PLEG): "send-events-93c2da02-0fb4-47fd-b96b-e4846958f031_e2e-events-6610(aef1e984-8249-4228-a134-50bb6495ab5f)", event: &pleg.PodLifecycleEvent{ID:"aef1e984-8249-4228-a134-50bb6495ab5f", Type:"ContainerStarted", Data:"23e0305517cb98859852679d9da242396ed393b8f726531b8280b473ef7f0c64"}
...
64s delay (timeout on the event watch in the test is 60s) Jul 28 18:44:48.513616 ovirt16-x7nmv-worker-0-hdtkn hyperkube[1576]: I0728 18:44:48.513576 1576 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-events-6610", Name:"send-events-93c2da02-0fb4-47fd-b96b-e4846958f031", UID:"aef1e984-8249-4228-a134-50bb6495ab5f", APIVersion:"v1", ResourceVersion:"146558", FieldPath:"spec.containers{p}"}): type: 'Normal' reason: 'Killing' Stopping container p Jul 28 18:44:53.487254 ovirt16-x7nmv-worker-0-hdtkn hyperkube[1576]: E0728 18:44:53.487154 1576 event.go:260] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"send-events-93c2da02-0fb4-47fd-b96b-e4846958f031.1625fed2dba3b442", GenerateName:"", Namespace:"e2e-events-6610", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"e2e-events-6610", Name:"send-events-93c2da02-0fb4-47fd-b96b-e4846958f031", UID:"aef1e984-8249-4228-a134-50bb6495ab5f", APIVersion:"v1", ResourceVersion:"146558", FieldPath:"spec.containers{p}"}, Reason:"Pulled", Message:"Container image \"us.gcr.io/k8s-artifacts-prod/e2e-test-images/agnhost:2.12\" already present on machine", Source:v1.EventSource{Component:"kubelet", Host:"ovirt16-x7nmv-worker-0-hdtkn"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbfc03a582108f442, ext:2691956584342, loc:(*time.Location)(0x747ef40)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbfc03a582108f442, ext:2691956584342, loc:(*time.Location)(0x747ef40)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events "send-events-93c2da02-0fb4-47fd-b96b-e4846958f031.1625fed2dba3b442" is forbidden: unable to create new content in namespace e2e-events-6610 because it is being terminated' (will not retry!) 
Jul 28 18:44:54.087164 ovirt16-x7nmv-worker-0-hdtkn hyperkube[1576]: E0728 18:44:54.087084 1576 event.go:260] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"send-events-93c2da02-0fb4-47fd-b96b-e4846958f031.1625fed2ef91506d", GenerateName:"", Namespace:"e2e-events-6610", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"e2e-events-6610", Name:"send-events-93c2da02-0fb4-47fd-b96b-e4846958f031", UID:"aef1e984-8249-4228-a134-50bb6495ab5f", APIVersion:"v1", ResourceVersion:"146558", FieldPath:"spec.containers{p}"}, Reason:"Created", Message:"Created container p", Source:v1.EventSource{Component:"kubelet", Host:"ovirt16-x7nmv-worker-0-hdtkn"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbfc03a5834f6906d, ext:2692290923450, loc:(*time.Location)(0x747ef40)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbfc03a5834f6906d, ext:2692290923450, loc:(*time.Location)(0x747ef40)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events "send-events-93c2da02-0fb4-47fd-b96b-e4846958f031.1625fed2ef91506d" is forbidden: unable to create new content in namespace e2e-events-6610 because it is being terminated' (will not retry!) 
Jul 28 18:44:54.487065 ovirt16-x7nmv-worker-0-hdtkn hyperkube[1576]: E0728 18:44:54.486984 1576 event.go:260] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"send-events-93c2da02-0fb4-47fd-b96b-e4846958f031.1625fed2f24479b3", GenerateName:"", Namespace:"e2e-events-6610", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"e2e-events-6610", Name:"send-events-93c2da02-0fb4-47fd-b96b-e4846958f031", UID:"aef1e984-8249-4228-a134-50bb6495ab5f", APIVersion:"v1", ResourceVersion:"146558", FieldPath:"spec.containers{p}"}, Reason:"Started", Message:"Started container p", Source:v1.EventSource{Component:"kubelet", Host:"ovirt16-x7nmv-worker-0-hdtkn"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbfc03a5837a9b9b3, ext:2692336219403, loc:(*time.Location)(0x747ef40)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbfc03a5837a9b9b3, ext:2692336219403, loc:(*time.Location)(0x747ef40)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events "send-events-93c2da02-0fb4-47fd-b96b-e4846958f031.1625fed2f24479b3" is forbidden: unable to create new content in namespace e2e-events-6610 because it is being terminated' (will not retry!) So not sure why the events aren't showing up for the test. Jul 28 18:43:42.685764 ovirt16-x7nmv-worker-0-hdtkn hyperkube[1576]: I0728 18:43:42.685714 1576 request.go:557] Throttling request took 196.799479ms, request: POST:https://api-int.ovirt16.gcp.devcluster.openshift.com:6443/api/v1/namespaces/e2e-network-policy-b-9491/events Some event throttling up to 200ms.
I still don't have an explanation for why the events are getting lost. The pod does start on the kubelet, so the events should be emitted. But it does seem to flake only on ovirt, which is strange.
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1294812708624928768

I do see DNS failures

Aug 16 02:45:17.697096 ovirt11-wz8kt-worker-0-5686p hyperkube[1602]: I0816 02:45:17.693816 1602 event.go:291] "Event occurred" object="e2e-events-7293/send-events-50103669-ddb2-4417-bbca-63841a5fd198" kind="Pod" apiVersion="v1" type="Warning" reason="FailedCreatePodSandBox" message="Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_send-events-50103669-ddb2-4417-bbca-63841a5fd198_e2e-events-7293_4b7fef7c-f4ce-4ce8-a859-b7a81344d13c_0(9eeab53055bedd8135451eb19cbb7c5eabe854c132937172b6a3e977f5dfdddd): Multus: [e2e-events-7293/send-events-50103669-ddb2-4417-bbca-63841a5fd198]: error getting pod: Get \"https://[api-int.ovirt11.gcp.devcluster.openshift.com]:6443/api/v1/namespaces/e2e-events-7293/pods/send-events-50103669-ddb2-4417-bbca-63841a5fd198?timeout=1m0s\": dial tcp: lookup api-int.ovirt11.gcp.devcluster.openshift.com on 192.168.211.1:53: no such host"

Seems like the cluster DNS is generally unstable

$ cat workers-journal | grep "api-int.*53: no such host" | wc -l
759

Just for reference, the expected event activity starting a pod with a single container is

$ oc get events
LAST SEEN   TYPE     REASON           OBJECT        MESSAGE
8s          Normal   Scheduled        pod/busybox   Successfully assigned demo/busybox to ip-10-0-172-31.us-west-1.compute.internal
6s          Normal   AddedInterface   pod/busybox   Add eth0 [10.128.2.87/23]
6s          Normal   Pulling          pod/busybox   Pulling image "busybox:latest"
3s          Normal   Pulled           pod/busybox   Successfully pulled image "busybox:latest" in 3.111409964s
3s          Normal   Created          pod/busybox   Created container busybox
3s          Normal   Started          pod/busybox   Started container busybox

The test, when it fails, only shows the first two events (Scheduled and AddedInterface) in the apiserver, while the kubelet log shows the events being sent.
This is a flaky test only on the ovirt platform, which is relatively new to CI and doesn't have great signal yet:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/?job=release*ovirt*

Not a 4.6 blocker. Setting to low for now.
Still happening
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-ovirt-4.6&include-filter-by-regex=Events%20should%20be%20sent%20by%20kubelets%20and%20the%20scheduler%20about%20pods%20scheduling%20and%20running

The ovirt release-informing tests started passing overall yesterday (8/27), so the signal is getting stronger, but this flake still persists. Very strange that it only happens on ovirt and no other tests are having issues.
Just renewing investigation, still seeing same failure mode where kubelet sends events and >1m later, kubelet shows the events were rejected by apiserver. From kubelet logs Sep 01 08:27:13.968819 ovirt11-5x8d7-worker-0-7kq8z crio[1537]: I0901 08:27:13.956817 356441 event.go:221] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-events-7913", Name:"send-events-e1d19803-008c-41b0-ae57-6bee509c8d8d", UID:"e0b4d36a-5458-4653-b79b-b892eda4d7a4", APIVersion:"v1", ResourceVersion:"142873", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add eth0 [10.128.3.11/23] we see this event ^ Sep 01 08:27:14.873198 ovirt11-5x8d7-worker-0-7kq8z hyperkube[1574]: I0901 08:27:14.637756 1574 event.go:291] "Event occurred" object="e2e-events-7913/send-events-e1d19803-008c-41b0-ae57-6bee509c8d8d" kind="Pod" apiVersion="v1" type="Normal" reason="Pulled" message="Container image \"us.gcr.io/k8s-artifacts-prod/e2e-test-images/agnhost:2.20\" already present on machine" Sep 01 08:27:15.461935 ovirt11-5x8d7-worker-0-7kq8z hyperkube[1574]: I0901 08:27:15.460795 1574 event.go:291] "Event occurred" object="e2e-events-7913/send-events-e1d19803-008c-41b0-ae57-6bee509c8d8d" kind="Pod" apiVersion="v1" type="Normal" reason="Created" message="Created container p" Sep 01 08:27:15.529900 ovirt11-5x8d7-worker-0-7kq8z hyperkube[1574]: I0901 08:27:15.529775 1574 event.go:291] "Event occurred" object="e2e-events-7913/send-events-e1d19803-008c-41b0-ae57-6bee509c8d8d" kind="Pod" apiVersion="v1" type="Normal" reason="Started" message="Started container p" Sep 01 08:28:19.795973 ovirt11-5x8d7-worker-0-7kq8z hyperkube[1574]: I0901 08:28:19.795914 1574 event.go:291] "Event occurred" object="e2e-events-7913/send-events-e1d19803-008c-41b0-ae57-6bee509c8d8d" kind="Pod" apiVersion="v1" type="Normal" reason="Killing" message="Stopping container p" we don't see these ^ +1m23s Sep 01 08:29:42.322552 ovirt11-5x8d7-worker-0-7kq8z hyperkube[1574]: E0901 08:29:42.322437 1574 event.go:264] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"send-events-e1d19803-008c-41b0-ae57-6bee509c8d8d.16309b7e5699e7a5", GenerateName:"", Namespace:"e2e-events-7913", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"e2e-events-7913", Name:"send-events-e1d19803-008c-41b0-ae57-6bee509c8d8d", UID:"e0b4d36a-5458-4653-b79b-b892eda4d7a4", APIVersion:"v1", ResourceVersion:"142869", FieldPath:"spec.containers{p}"}, Reason:"Pulled", Message:"Container image \"us.gcr.io/k8s-artifacts-prod/e2e-test-images/agnhost:2.20\" already present on machine", Source:v1.EventSource{Component:"kubelet", Host:"ovirt11-5x8d7-worker-0-7kq8z"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbfcb9f58a60193a5, ext:2194327082791, loc:(*time.Location)(0x75222a0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbfcb9f58a60193a5, ext:2194327082791, loc:(*time.Location)(0x75222a0)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'namespaces 
"e2e-events-7913" not found' (will not retry!) Sep 01 08:29:42.714506 ovirt11-5x8d7-worker-0-7kq8z hyperkube[1574]: E0901 08:29:42.714384 1574 event.go:264] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"send-events-e1d19803-008c-41b0-ae57-6bee509c8d8d.16309b7e87a8dd9f", GenerateName:"", Namespace:"e2e-events-7913", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"e2e-events-7913", Name:"send-events-e1d19803-008c-41b0-ae57-6bee509c8d8d", UID:"e0b4d36a-5458-4653-b79b-b892eda4d7a4", APIVersion:"v1", ResourceVersion:"142869", FieldPath:"spec.containers{p}"}, Reason:"Created", Message:"Created container p", Source:v1.EventSource{Component:"kubelet", Host:"ovirt11-5x8d7-worker-0-7kq8z"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbfcb9f58db75bf9f, ext:2195150146832, loc:(*time.Location)(0x75222a0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbfcb9f58db75bf9f, ext:2195150146832, loc:(*time.Location)(0x75222a0)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'namespaces "e2e-events-7913" not found' (will not retry!) Sep 01 08:29:43.552564 ovirt11-5x8d7-worker-0-7kq8z hyperkube[1574]: E0901 08:29:43.550609 1574 event.go:264] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"send-events-e1d19803-008c-41b0-ae57-6bee509c8d8d.16309b7e8bc49981", GenerateName:"", Namespace:"e2e-events-7913", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"e2e-events-7913", Name:"send-events-e1d19803-008c-41b0-ae57-6bee509c8d8d", UID:"e0b4d36a-5458-4653-b79b-b892eda4d7a4", APIVersion:"v1", ResourceVersion:"142869", FieldPath:"spec.containers{p}"}, Reason:"Started", Message:"Started container p", Source:v1.EventSource{Component:"kubelet", Host:"ovirt11-5x8d7-worker-0-7kq8z"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbfcb9f58df917b81, ext:2195219073266, loc:(*time.Location)(0x75222a0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbfcb9f58df917b81, ext:2195219073266, loc:(*time.Location)(0x75222a0)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'namespaces "e2e-events-7913" not found' (will not retry!) 
Sep 01 08:30:57.602940 ovirt11-5x8d7-worker-0-7kq8z hyperkube[1574]: E0901 08:30:57.514574 1574 event.go:264] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"send-events-e1d19803-008c-41b0-ae57-6bee509c8d8d.16309b8d82502fc8", GenerateName:"", Namespace:"e2e-events-7913", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"e2e-events-7913", Name:"send-events-e1d19803-008c-41b0-ae57-6bee509c8d8d", UID:"e0b4d36a-5458-4653-b79b-b892eda4d7a4", APIVersion:"v1", ResourceVersion:"142869", FieldPath:"spec.containers{p}"}, Reason:"Killing", Message:"Stopping container p", Source:v1.EventSource{Component:"kubelet", Host:"ovirt11-5x8d7-worker-0-7kq8z"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbfcb9f68ef6a91c8, ext:2259484958523, loc:(*time.Location)(0x75222a0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbfcb9f68ef6a91c8, ext:2259484958523, loc:(*time.Location)(0x75222a0)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'namespaces "e2e-events-7913" not found' (will not retry!) server rejects events 1m23s after the events are sent From test output STEP: checking for scheduler event about the pod Sep 1 08:27:19.683: INFO: Saw scheduler event for our pod. STEP: checking for kubelet event about the pod STEP: deleting the pod [AfterEach] [k8s.io] [sig-node] Events k8s.io/kubernetes.0-rc.2/test/e2e/framework/framework.go:175 STEP: Collecting events from namespace "e2e-events-7913". STEP: Found 2 events. Sep 1 08:28:19.838: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for send-events-e1d19803-008c-41b0-ae57-6bee509c8d8d: { } Scheduled: Successfully assigned e2e-events-7913/send-events-e1d19803-008c-41b0-ae57-6bee509c8d8d to ovirt11-5x8d7-worker-0-7kq8z Sep 1 08:28:19.838: INFO: At 2020-09-01 08:27:13 +0000 UTC - event for send-events-e1d19803-008c-41b0-ae57-6bee509c8d8d: {multus } AddedInterface: Add eth0 [10.128.3.11/23] <expected container related events here> Sep 1 08:28:19.876: INFO: POD NODE PHASE GRACE CONDITIONS Sep 1 08:28:19.876: INFO: send-events-e1d19803-008c-41b0-ae57-6bee509c8d8d ovirt11-5x8d7-worker-0-7kq8z Running 30s [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-09-01 08:27:11 +0000 UTC } {Ready True 0001-01-01 00:00:00 +0000 UTC 2020-09-01 08:27:16 +0000 UTC } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2020-09-01 08:27:16 +0000 UTC } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-09-01 08:27:11 +0000 UTC }]
Code from the test; the timeout on waiting for the kubelet events is 60s:

	// Check for kubelet event about the pod.
	ginkgo.By("checking for kubelet event about the pod")
	framework.ExpectNoError(wait.Poll(time.Second*2, time.Second*60, func() (bool, error) {
		selector := fields.Set{
			"involvedObject.uid":       string(podWithUID.UID),
			"involvedObject.kind":      "Pod",
			"involvedObject.namespace": f.Namespace.Name,
			"source":                   "kubelet",
		}.AsSelector().String()
		options := metav1.ListOptions{FieldSelector: selector}
		events, err = f.ClientSet.CoreV1().Events(f.Namespace.Name).List(context.TODO(), options)
		if err != nil {
			return false, err
		}
		if len(events.Items) > 0 {
			framework.Logf("Saw kubelet event for our pod.")
			return true, nil
		}
		return false, nil
	}))
Hmm, so this test is explicitly testing "Events", but events are not guaranteed to be delivered: we drop events if there is spam, or if the apiserver rejects the send more than 12 times.
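For context, here is a minimal sketch (not the client-go source itself) of the bounded retry the kubelet's event recorder performs when posting an event; the real logic lives in k8s.io/client-go/tools/record/event.go, where the budget is the maxTriesPerEvent constant (12) and retries are spaced roughly 10s apart. The names and the short sleep below are simplified for illustration.

package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// postEvent stands in for the single POST the event recorder makes to the
// apiserver; it fails, for example, when the DNS lookup for api-int hangs.
type postEvent func() error

// recordToSink mirrors the shape of client-go's bounded retry: keep trying up
// to maxTries times with a sleep between attempts, then drop the event.
func recordToSink(post postEvent, maxTries int, sleep time.Duration) bool {
	tries := 0
	for {
		if post() == nil {
			return true // event delivered
		}
		tries++
		if tries >= maxTries {
			return false // retry limit exceeded; the event is dropped, not re-queued
		}
		// The first retry is delayed by a random fraction of the interval so
		// clients don't all retry in lockstep after an apiserver outage.
		if tries == 1 {
			time.Sleep(time.Duration(float64(sleep) * rand.Float64()))
		} else {
			time.Sleep(sleep)
		}
	}
}

func main() {
	failing := func() error { return errors.New("dial tcp: lookup api-int: no such host") }
	// With the upstream defaults (12 tries, ~10s apart) an event keeps retrying
	// for roughly two minutes before it is lost for good. A short interval is
	// used here only so the demo finishes quickly.
	fmt.Println("event delivered:", recordToSink(failing, 12, 10*time.Millisecond))
}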
This will help us see in the logs why earlier attempts to send the event to the server were rejected (not just the last reason):
https://github.com/kubernetes/kubernetes/pull/94483
This looks like a DNS issue. The problem is that if DNS blips for even a small period of time, any lookups that were started during the blip hang for 60-90s before failing. That hang exceeds the timeout for the test. Hence the flakiness.

$ grep "no such host" journal | grep pod_workers | wc -l
60

Failures are in the time window 08:15:40 to 08:33:52, which includes this test.

This also explains why we don't see the error that would be logged here if attempts to write events to a connected apiserver were failing:
https://github.com/kubernetes/kubernetes/blob/f736be5721e370db8fa4445a8a3264bb3770fb6e/staging/src/k8s.io/client-go/tools/record/event.go#L273

The multus event is not reported by multus directly, not the kubelet. Thus it is subject to a different DNS lookup, which explains why we see it but not the kubelet events. I would imagine we see the multus event failing to show up sometimes too, but the test doesn't check for it.

Sending to Networking for DNS unreliability on ovirt.
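To make the timing problem concrete, here is a small, self-contained sketch (not taken from the kubelet) showing how a caller-side deadline bounds a name lookup; without one, a lookup issued during a resolv.conf blip can take on the order of 60-90s to fail, which is longer than the 60s the e2e test polls for kubelet events. The host name and the 5s timeout are illustrative only.

package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	// Bound the lookup to 5s; whatever the system resolver does (retrying
	// across nameservers, waiting on a dead one, etc.), the caller gets an
	// answer or an error within the deadline instead of hanging.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	start := time.Now()
	_, err := net.DefaultResolver.LookupHost(ctx, "api-int.example.invalid")
	fmt.Printf("lookup finished after %v: %v\n", time.Since(start), err)
}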
Should have been "The multus event _is_ reported by multus directly".
(In reply to Seth Jennings from comment #18)
> https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-
> ocp-installer-e2e-ovirt-4.6/1294812708624928768
>
> I do see DNS failures
>
> Aug 16 02:45:17.697096 ovirt11-wz8kt-worker-0-5686p hyperkube[1602]: I0816
> 02:45:17.693816 1602 event.go:291] "Event occurred"
> object="e2e-events-7293/send-events-50103669-ddb2-4417-bbca-63841a5fd198"
> kind="Pod" apiVersion="v1" type="Warning" reason="FailedCreatePodSandBox"
> message="Failed to create pod sandbox: rpc error: code = Unknown desc =
> failed to create pod network sandbox
> k8s_send-events-50103669-ddb2-4417-bbca-63841a5fd198_e2e-events-
> 7293_4b7fef7c-f4ce-4ce8-a859-
> b7a81344d13c_0(9eeab53055bedd8135451eb19cbb7c5eabe854c132937172b6a3e977f5dfdd
> dd): Multus:
> [e2e-events-7293/send-events-50103669-ddb2-4417-bbca-63841a5fd198]: error
> getting pod: Get
> \"https://[api-int.ovirt11.gcp.devcluster.openshift.com]:6443/api/v1/
> namespaces/e2e-events-7293/pods/send-events-50103669-ddb2-4417-bbca-
> 63841a5fd198?timeout=1m0s\": dial tcp: lookup
> api-int.ovirt11.gcp.devcluster.openshift.com on 192.168.211.1:53: no such
> host"
>
> Seems like the cluster DNS is generally unstable
>
> $ cat workers-journal | grep "api-int.*53: no such host" | wc -l
> 759
>
> Just for reference, the expected event activity starting a pod with a single
> container is
>
> $ oc get events
> LAST SEEN TYPE REASON OBJECT MESSAGE
> 8s Normal Scheduled pod/busybox Successfully assigned
> demo/busybox to ip-10-0-172-31.us-west-1.compute.internal
> 6s Normal AddedInterface pod/busybox Add eth0 [10.128.2.87/23]
> 6s Normal Pulling pod/busybox Pulling image
> "busybox:latest"
> 3s Normal Pulled pod/busybox Successfully pulled
> image "busybox:latest" in 3.111409964s
> 3s Normal Created pod/busybox Created container busybox
> 3s Normal Started pod/busybox Started container busybox
>
> The test, when it fails, only shows the first two (Scheduled and
> AddedInterface) events in the apiserver, when the kubelet log shows the
> events being sent

192.168.211.1 is the oVirt infra upstream DNS server; should api-int.ovirt11.gcp.devcluster.openshift.com be resolvable by the upstream DNS?
Looking at the ovirt11-wz8kt-worker-0-5686p workers journal [0], during this lookup failure we can see:

# cat workers-journal | grep nm-dispatcher | grep 'worker-0-5686p' | grep resolv-prepender
Aug 16 02:45:15.570577 ovirt11-wz8kt-worker-0-5686p nm-dispatcher[302731]: <13>Aug 16 02:45:15 root: NM resolv-prepender triggered by ens3 dhcp4-change.
Aug 16 02:45:17.726649 ovirt11-wz8kt-worker-0-5686p nm-dispatcher[302731]: <13>Aug 16 02:45:17 root: NM resolv-prepender: Prepending 'nameserver 192.168.211.118' to /etc/resolv.conf (other nameservers from /var/run/NetworkManager/resolv.conf)

It takes the prepender 20ms to complete [1], and during this time window there is no internal DNS record in /etc/resolv.conf (so only the upstream resolver can be reached).

[0] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1294812708624928768/artifacts/e2e-ovirt/nodes/workers-journal
[1] https://github.com/openshift/machine-config-operator/blob/f8ff1d10d55425cd33c367c0911efcdd371125c6/templates/common/ovirt/files/NetworkManager-resolv-prepender.yaml#L40
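To make the race window concrete, here is a rough, hypothetical Go sketch of the prepend step described above (the real implementation is the shell dispatcher script in the MCO template linked in [1]; the file path and nameserver address below come from the logs, everything else is illustrative). The key point is that between NetworkManager rewriting /etc/resolv.conf on dhcp4-change and the prepender putting the internal nameserver back, lookups only see the upstream resolver.

package main

import (
	"fmt"
	"os"
	"strings"
)

// prependNameserver rewrites resolvConf with internalNS as the first nameserver.
// The write goes through a temp file + rename so readers never see a partial
// file, but nothing closes the window between NetworkManager's rewrite (which
// drops the internal entry) and this running a couple of seconds later.
func prependNameserver(resolvConf, internalNS string) error {
	current, err := os.ReadFile(resolvConf)
	if err != nil {
		return err
	}
	if strings.Contains(string(current), "nameserver "+internalNS) {
		return nil // already prepended, nothing to do
	}
	updated := "nameserver " + internalNS + "\n" + string(current)

	tmp := resolvConf + ".tmp"
	if err := os.WriteFile(tmp, []byte(updated), 0644); err != nil {
		return err
	}
	return os.Rename(tmp, resolvConf) // atomic swap on the same filesystem
}

func main() {
	// Illustrative only; on a real node this would be /etc/resolv.conf and the
	// node's internal resolver, e.g. 192.168.211.118 from the journal above.
	if err := prependNameserver("./resolv.conf", "192.168.211.118"); err != nil {
		fmt.Println("prepend failed:", err)
	}
}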
(In reply to Evgeny Slutsky from comment #28)
> by looking at ovirt11-wz8kt-worker-0-5686p workers journal [0]:
>
> during this lookup failure we can see:
>
> # cat workers-journal | grep nm-dispatcher | grep 'worker-0-5686p' | grep
> resolv-prepender
> Aug 16 02:45:15.570577 ovirt11-wz8kt-worker-0-5686p nm-dispatcher[302731]:
> <13>Aug 16 02:45:15 root: NM resolv-prepender triggered by ens3 dhcp4-change.
> Aug 16 02:45:17.726649 ovirt11-wz8kt-worker-0-5686p nm-dispatcher[302731]:
> <13>Aug 16 02:45:17 root: NM resolv-prepender: Prepending 'nameserver
> 192.168.211.118' to /etc/resolv.conf (other nameservers from
> /var/run/NetworkManager/resolv.conf)
>
> it takes the prepender 20ms to complete [1], during this time window
> there is no internal dns record in /etc/resolv.conf (so only upstream can be
> reached).
>
> [0]
> https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-
> installer-e2e-ovirt-4.6/1294812708624928768/artifacts/e2e-ovirt/nodes/
> workers-journal
> [1]
> https://github.com/openshift/machine-config-operator/blob/
> f8ff1d10d55425cd33c367c0911efcdd371125c6/templates/common/ovirt/files/
> NetworkManager-resolv-prepender.yaml#L40

Correction: it takes 2 seconds, not 20ms.
> 192.168.211.1 is the oVirt infra upstream DNS server;
> should api-int.ovirt11.gcp.devcluster.openshift.com be resolvable by the upstream DNS?

I'm not sure how it works for oVirt. Every platform is a little different. On OpenStack, where there is no software-defined load balancer, they run keepalived to assign a VIP to the master leader and the DNS resolves to that VIP. Long story short, though, I don't think the upstream resolver (outside the cluster) would be able to resolve the api-int name for a cluster.
This sounds very similar to the bug being fixed in https://github.com/openshift/machine-config-operator/pull/2062.
(In reply to Ben Nemec from comment #31)
> This sounds very similar to the bug being fixed in
> https://github.com/openshift/machine-config-operator/pull/2062.

Yep, this BZ is one of the motivators for that PR.
PR 2062 has merged, so I believe this should be fixed now. I'm going to dupe it to bug 1876216, which has already been verified.

*** This bug has been marked as a duplicate of bug 1876216 ***