Bug 1846529 - [buildcop] ovirt CI failure: [sig-node] Events should be sent by kubelets and the scheduler about pods scheduling and running [Conformance]
Summary: [buildcop] ovirt CI failure: [sig-node] Events should be sent by kubelets and...
Keywords:
Status: CLOSED DUPLICATE of bug 1876216
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.7.0
Assignee: Ben Nemec
QA Contact: Victor Voronkov
URL:
Whiteboard:
Duplicates: 1848081, 1852924
Depends On:
Blocks:
 
Reported: 2020-06-11 19:35 UTC by Yu Qi Zhang
Modified: 2020-10-13 02:52 UTC
CC: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-28 20:39:41 UTC
Target Upstream Version:
Embargoed:



Description Yu Qi Zhang 2020-06-11 19:35:51 UTC
Description of problem:

On the release-openshift-ocp-installer-e2e-ovirt-4.5 job, the [k8s.io] [sig-node] Events should be sent by kubelets and the scheduler about pods scheduling and running [Conformance] test has less than 50% pass rate. The failure seems to be a timeout.

Example run:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.5/1271128314311872512

Query:
https://search.apps.build01.ci.devcluster.openshift.com/?maxAge=168h&context=1&type=junit&maxMatches=5&maxBytes=20971520&groupBy=job&name=release-openshift-ocp-installer-e2e-ovirt-4.5&search=%5C%5Bk8s%5C.io%5C%5D+%5C%5Bsig-node%5C%5D+Events+should+be+sent+by+kubelets+and+the+scheduler+about+pods+scheduling+and+running++%5C%5BConformance%5C%5D

I did not find an existing BZ related to this; I'd like to see whether this is an actual issue.

Version-Release number of selected component (if applicable):
4.5

How reproducible:
50%

Steps to Reproduce:
CI

Actual results:
Test fails

Expected results:
Test passes

Additional info:

Comment 8 Gal Zaidman 2020-07-01 10:58:29 UTC
Hi all,
I'm seeing this bug now; I commented on [1] (should we mark it as a duplicate of this bug?).

On ovirt CI we are seeing this a lot, on 4.6 and 4.5:

4.6 - https://search.apps.build01.ci.devcluster.openshift.com/?search=Events+should+be+sent+by+kubelets+and+the+scheduler+about+pods+scheduling+and+running&maxAge=168h&context=1&type=junit&name=release-openshift-ocp-installer-e2e-ovirt-4.6&maxMatches=5&maxBytes=20971520&groupBy=job

4.5 - https://search.apps.build01.ci.devcluster.openshift.com/?search=Events+should+be+sent+by+kubelets+and+the+scheduler+about+pods+scheduling+and+running&maxAge=168h&context=1&type=junit&name=release-openshift-ocp-installer-e2e-ovirt-4.5&maxMatches=5&maxBytes=20971520&groupBy=job

I saw this was marked as MODIFIED and moved to QE, so I think we need to understand whether the fix is applied in CI and, if so, whether it is enough.
Also, can you please explain what is causing the test to fail?

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1848081

Comment 9 Peter Hunt 2020-07-14 17:21:11 UTC
sorry for the silence, this kept dropping off my radar...

If I'm being honest, I do not understand the connection Ryan made between this and the runc fix. I assumed he found evidence that we had a failed `runc state` call (a container mysteriously getting a 255 exit code) but didn't do any verification. At this point, the fix should be in 4.5.z ci, and if the issue is not fixed, then this should not be in modified state.

Comment 14 Gal Zaidman 2020-07-22 10:44:04 UTC
*** Bug 1848081 has been marked as a duplicate of this bug. ***

Comment 15 Gal Zaidman 2020-07-22 10:45:05 UTC
*** Bug 1852924 has been marked as a duplicate of this bug. ***

Comment 16 Seth Jennings 2020-07-28 21:47:04 UTC
https://search.ci.openshift.org/?search=Events+should+be+sent+by+kubelets+and+the+scheduler+about+pods+scheduling+and+running&maxAge=48h&context=2&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Looking at a recent failure

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.5/1288135021726535680

STEP: checking for scheduler event about the pod
Jul 28 18:43:48.403: INFO: Saw scheduler event for our pod.
STEP: checking for kubelet event about the pod
STEP: deleting the pod
[AfterEach] [k8s.io] [sig-node] Events
  /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:179
STEP: Collecting events from namespace "e2e-events-6610".
STEP: Found 2 events.
Jul 28 18:44:48.561: INFO: At 2020-07-28 18:43:42 +0000 UTC - event for send-events-93c2da02-0fb4-47fd-b96b-e4846958f031: {default-scheduler } Scheduled: Successfully assigned e2e-events-6610/send-events-93c2da02-0fb4-47fd-b96b-e4846958f031 to ovirt16-x7nmv-worker-0-hdtkn
Jul 28 18:44:48.561: INFO: At 2020-07-28 18:43:44 +0000 UTC - event for send-events-93c2da02-0fb4-47fd-b96b-e4846958f031: {multus } AddedInterface: Add eth0 [10.128.3.86/23]
Jul 28 18:44:48.600: INFO: POD                                               NODE                          PHASE    GRACE  CONDITIONS
Jul 28 18:44:48.600: INFO: send-events-93c2da02-0fb4-47fd-b96b-e4846958f031  ovirt16-x7nmv-worker-0-hdtkn  Running  30s    [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-07-28 18:43:42 +0000 UTC  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2020-07-28 18:43:45 +0000 UTC  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2020-07-28 18:43:45 +0000 UTC  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-07-28 18:43:42 +0000 UTC  }]
Jul 28 18:44:48.600: INFO: 
Jul 28 18:44:48.642: INFO: send-events-93c2da02-0fb4-47fd-b96b-e4846958f031[e2e-events-6610].container[p].log
2020/07/28 18:43:44 Serving on port 9376.
2020/07/28 18:44:48 Shutting down after receiving signal: terminated.
2020/07/28 18:44:48 Awaiting pod deletion.

Seems like the events from the kubelet are not observed


18:43:42 - scheduler sees pod
18:43:42.186509 - kubelet sees pod
18:43:45.102351 - PLEG detects container has started
18:44:48.513224 - DELETE from test timeout

In the worker logs

Jul 28 18:43:44.343218 ovirt16-x7nmv-worker-0-hdtkn crio[1495]: I0728 18:43:44.343171  431456 event.go:221] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-events-6610", Name:"send-events-93c2da02-0fb4-47fd-b96b-e4846958f031", UID:"aef1e984-8249-4228-a134-50bb6495ab5f", APIVersion:"v1", ResourceVersion:"146560", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add eth0 [10.128.3.86/23]
Jul 28 18:43:44.554372 ovirt16-x7nmv-worker-0-hdtkn hyperkube[1576]: I0728 18:43:44.554338    1576 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-events-6610", Name:"send-events-93c2da02-0fb4-47fd-b96b-e4846958f031", UID:"aef1e984-8249-4228-a134-50bb6495ab5f", APIVersion:"v1", ResourceVersion:"146558", FieldPath:"spec.containers{p}"}): type: 'Normal' reason: 'Pulled' Container image "us.gcr.io/k8s-artifacts-prod/e2e-test-images/agnhost:2.12" already present on machine
Jul 28 18:43:44.888712 ovirt16-x7nmv-worker-0-hdtkn hyperkube[1576]: I0728 18:43:44.888661    1576 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-events-6610", Name:"send-events-93c2da02-0fb4-47fd-b96b-e4846958f031", UID:"aef1e984-8249-4228-a134-50bb6495ab5f", APIVersion:"v1", ResourceVersion:"146558", FieldPath:"spec.containers{p}"}): type: 'Normal' reason: 'Created' Created container p
Jul 28 18:43:44.934049 ovirt16-x7nmv-worker-0-hdtkn hyperkube[1576]: I0728 18:43:44.933917    1576 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-events-6610", Name:"send-events-93c2da02-0fb4-47fd-b96b-e4846958f031", UID:"aef1e984-8249-4228-a134-50bb6495ab5f", APIVersion:"v1", ResourceVersion:"146558", FieldPath:"spec.containers{p}"}): type: 'Normal' reason: 'Started' Started container p

Jul 28 18:43:45.102369 ovirt16-x7nmv-worker-0-hdtkn hyperkube[1576]: I0728 18:43:45.102351    1576 kubelet.go:1970] SyncLoop (PLEG): "send-events-93c2da02-0fb4-47fd-b96b-e4846958f031_e2e-events-6610(aef1e984-8249-4228-a134-50bb6495ab5f)", event: &pleg.PodLifecycleEvent{ID:"aef1e984-8249-4228-a134-50bb6495ab5f", Type:"ContainerStarted", Data:"23e0305517cb98859852679d9da242396ed393b8f726531b8280b473ef7f0c64"}

... 64s delay (timeout on the event watch in the test is 60s)

Jul 28 18:44:48.513616 ovirt16-x7nmv-worker-0-hdtkn hyperkube[1576]: I0728 18:44:48.513576    1576 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-events-6610", Name:"send-events-93c2da02-0fb4-47fd-b96b-e4846958f031", UID:"aef1e984-8249-4228-a134-50bb6495ab5f", APIVersion:"v1", ResourceVersion:"146558", FieldPath:"spec.containers{p}"}): type: 'Normal' reason: 'Killing' Stopping container p

Jul 28 18:44:53.487254 ovirt16-x7nmv-worker-0-hdtkn hyperkube[1576]: E0728 18:44:53.487154    1576 event.go:260] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"send-events-93c2da02-0fb4-47fd-b96b-e4846958f031.1625fed2dba3b442", GenerateName:"", Namespace:"e2e-events-6610", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"e2e-events-6610", Name:"send-events-93c2da02-0fb4-47fd-b96b-e4846958f031", UID:"aef1e984-8249-4228-a134-50bb6495ab5f", APIVersion:"v1", ResourceVersion:"146558", FieldPath:"spec.containers{p}"}, Reason:"Pulled", Message:"Container image \"us.gcr.io/k8s-artifacts-prod/e2e-test-images/agnhost:2.12\" already present on machine", Source:v1.EventSource{Component:"kubelet", Host:"ovirt16-x7nmv-worker-0-hdtkn"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbfc03a582108f442, ext:2691956584342, loc:(*time.Location)(0x747ef40)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbfc03a582108f442, ext:2691956584342, loc:(*time.Location)(0x747ef40)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events "send-events-93c2da02-0fb4-47fd-b96b-e4846958f031.1625fed2dba3b442" is forbidden: unable to create new content in namespace e2e-events-6610 because it is being terminated' (will not retry!)

Jul 28 18:44:54.087164 ovirt16-x7nmv-worker-0-hdtkn hyperkube[1576]: E0728 18:44:54.087084    1576 event.go:260] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"send-events-93c2da02-0fb4-47fd-b96b-e4846958f031.1625fed2ef91506d", GenerateName:"", Namespace:"e2e-events-6610", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"e2e-events-6610", Name:"send-events-93c2da02-0fb4-47fd-b96b-e4846958f031", UID:"aef1e984-8249-4228-a134-50bb6495ab5f", APIVersion:"v1", ResourceVersion:"146558", FieldPath:"spec.containers{p}"}, Reason:"Created", Message:"Created container p", Source:v1.EventSource{Component:"kubelet", Host:"ovirt16-x7nmv-worker-0-hdtkn"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbfc03a5834f6906d, ext:2692290923450, loc:(*time.Location)(0x747ef40)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbfc03a5834f6906d, ext:2692290923450, loc:(*time.Location)(0x747ef40)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events "send-events-93c2da02-0fb4-47fd-b96b-e4846958f031.1625fed2ef91506d" is forbidden: unable to create new content in namespace e2e-events-6610 because it is being terminated' (will not retry!)

Jul 28 18:44:54.487065 ovirt16-x7nmv-worker-0-hdtkn hyperkube[1576]: E0728 18:44:54.486984    1576 event.go:260] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"send-events-93c2da02-0fb4-47fd-b96b-e4846958f031.1625fed2f24479b3", GenerateName:"", Namespace:"e2e-events-6610", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"e2e-events-6610", Name:"send-events-93c2da02-0fb4-47fd-b96b-e4846958f031", UID:"aef1e984-8249-4228-a134-50bb6495ab5f", APIVersion:"v1", ResourceVersion:"146558", FieldPath:"spec.containers{p}"}, Reason:"Started", Message:"Started container p", Source:v1.EventSource{Component:"kubelet", Host:"ovirt16-x7nmv-worker-0-hdtkn"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbfc03a5837a9b9b3, ext:2692336219403, loc:(*time.Location)(0x747ef40)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbfc03a5837a9b9b3, ext:2692336219403, loc:(*time.Location)(0x747ef40)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events "send-events-93c2da02-0fb4-47fd-b96b-e4846958f031.1625fed2f24479b3" is forbidden: unable to create new content in namespace e2e-events-6610 because it is being terminated' (will not retry!)

So not sure why the events aren't showing up for the test.

Jul 28 18:43:42.685764 ovirt16-x7nmv-worker-0-hdtkn hyperkube[1576]: I0728 18:43:42.685714    1576 request.go:557] Throttling request took 196.799479ms, request: POST:https://api-int.ovirt16.gcp.devcluster.openshift.com:6443/api/v1/namespaces/e2e-network-policy-b-9491/events

There is some client-side event throttling, up to ~200ms.
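
For context, here is a minimal sketch (not taken from the kubelet; the kubeconfig path, QPS/Burst values, and namespace are illustrative) of how client-go's per-client QPS/Burst token bucket produces these "Throttling request took ..." messages when a request has to wait on the limiter:

// Sketch only: exaggerate client-side throttling to see client-go's
// "Throttling request took ..." log lines (emitted when the wait is long enough).
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default location (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	// Deliberately low values so back-to-back requests have to wait.
	config.QPS = 1
	config.Burst = 1

	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Each List call goes through the client-side rate limiter before it is sent.
	for i := 0; i < 5; i++ {
		events, err := client.CoreV1().Events("default").List(context.TODO(), metav1.ListOptions{})
		if err != nil {
			panic(err)
		}
		fmt.Printf("request %d: %d events\n", i, len(events.Items))
	}
}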

Comment 17 Seth Jennings 2020-08-17 19:43:15 UTC
I still don't have an explanation for why the events are getting lost. The pod does start on the kubelet, so the events should be emitted. But it does seem to flake only on ovirt, which is strange.

Comment 18 Seth Jennings 2020-08-17 20:08:46 UTC
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1294812708624928768

I do see DNS failures:

Aug 16 02:45:17.697096 ovirt11-wz8kt-worker-0-5686p hyperkube[1602]: I0816 02:45:17.693816    1602 event.go:291] "Event occurred" object="e2e-events-7293/send-events-50103669-ddb2-4417-bbca-63841a5fd198" kind="Pod" apiVersion="v1" type="Warning" reason="FailedCreatePodSandBox" message="Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_send-events-50103669-ddb2-4417-bbca-63841a5fd198_e2e-events-7293_4b7fef7c-f4ce-4ce8-a859-b7a81344d13c_0(9eeab53055bedd8135451eb19cbb7c5eabe854c132937172b6a3e977f5dfdddd): Multus: [e2e-events-7293/send-events-50103669-ddb2-4417-bbca-63841a5fd198]: error getting pod: Get \"https://[api-int.ovirt11.gcp.devcluster.openshift.com]:6443/api/v1/namespaces/e2e-events-7293/pods/send-events-50103669-ddb2-4417-bbca-63841a5fd198?timeout=1m0s\": dial tcp: lookup api-int.ovirt11.gcp.devcluster.openshift.com on 192.168.211.1:53: no such host"

Seems like the cluster DNS is generally unstable

$ cat workers-journal | grep "api-int.*53: no such host" | wc -l
759

Just for reference, the expected event activity starting a pod with a single container is

$ oc get events
LAST SEEN   TYPE     REASON           OBJECT        MESSAGE
8s          Normal   Scheduled        pod/busybox   Successfully assigned demo/busybox to ip-10-0-172-31.us-west-1.compute.internal
6s          Normal   AddedInterface   pod/busybox   Add eth0 [10.128.2.87/23]
6s          Normal   Pulling          pod/busybox   Pulling image "busybox:latest"
3s          Normal   Pulled           pod/busybox   Successfully pulled image "busybox:latest" in 3.111409964s
3s          Normal   Created          pod/busybox   Created container busybox
3s          Normal   Started          pod/busybox   Started container busybox

The test, when it fails, only shows the first two events (Scheduled and AddedInterface) in the apiserver, even though the kubelet log shows the events being sent.

Comment 19 Seth Jennings 2020-08-26 14:35:38 UTC
This test flakes only on the ovirt platform, which is relatively new to CI and doesn't have great signal yet:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/?job=release*ovirt*

Not a 4.6 blocker.  Setting low for now.

Comment 20 Seth Jennings 2020-08-28 21:18:54 UTC
Still happening
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-ovirt-4.6&include-filter-by-regex=Events%20should%20be%20sent%20by%20kubelets%20and%20the%20scheduler%20about%20pods%20scheduling%20and%20running

The ovirt release-informing tests started passing overall yesterday (8/27), so the signal is getting stronger, but this flake still persists. It is very strange that it only happens on ovirt and that no other tests are having issues.

Comment 21 Seth Jennings 2020-09-03 17:47:12 UTC
Just renewing the investigation; still seeing the same failure mode where the kubelet sends events and, more than a minute later, the kubelet shows the events were rejected by the apiserver.

From kubelet logs

Sep 01 08:27:13.968819 ovirt11-5x8d7-worker-0-7kq8z crio[1537]: I0901 08:27:13.956817  356441 event.go:221] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-events-7913", Name:"send-events-e1d19803-008c-41b0-ae57-6bee509c8d8d", UID:"e0b4d36a-5458-4653-b79b-b892eda4d7a4", APIVersion:"v1", ResourceVersion:"142873", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add eth0 [10.128.3.11/23]

we see this event ^

Sep 01 08:27:14.873198 ovirt11-5x8d7-worker-0-7kq8z hyperkube[1574]: I0901 08:27:14.637756    1574 event.go:291] "Event occurred" object="e2e-events-7913/send-events-e1d19803-008c-41b0-ae57-6bee509c8d8d" kind="Pod" apiVersion="v1" type="Normal" reason="Pulled" message="Container image \"us.gcr.io/k8s-artifacts-prod/e2e-test-images/agnhost:2.20\" already present on machine"
Sep 01 08:27:15.461935 ovirt11-5x8d7-worker-0-7kq8z hyperkube[1574]: I0901 08:27:15.460795    1574 event.go:291] "Event occurred" object="e2e-events-7913/send-events-e1d19803-008c-41b0-ae57-6bee509c8d8d" kind="Pod" apiVersion="v1" type="Normal" reason="Created" message="Created container p"
Sep 01 08:27:15.529900 ovirt11-5x8d7-worker-0-7kq8z hyperkube[1574]: I0901 08:27:15.529775    1574 event.go:291] "Event occurred" object="e2e-events-7913/send-events-e1d19803-008c-41b0-ae57-6bee509c8d8d" kind="Pod" apiVersion="v1" type="Normal" reason="Started" message="Started container p"
Sep 01 08:28:19.795973 ovirt11-5x8d7-worker-0-7kq8z hyperkube[1574]: I0901 08:28:19.795914    1574 event.go:291] "Event occurred" object="e2e-events-7913/send-events-e1d19803-008c-41b0-ae57-6bee509c8d8d" kind="Pod" apiVersion="v1" type="Normal" reason="Killing" message="Stopping container p"

we don't see these ^

+1m23s

Sep 01 08:29:42.322552 ovirt11-5x8d7-worker-0-7kq8z hyperkube[1574]: E0901 08:29:42.322437    1574 event.go:264] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"send-events-e1d19803-008c-41b0-ae57-6bee509c8d8d.16309b7e5699e7a5", GenerateName:"", Namespace:"e2e-events-7913", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"e2e-events-7913", Name:"send-events-e1d19803-008c-41b0-ae57-6bee509c8d8d", UID:"e0b4d36a-5458-4653-b79b-b892eda4d7a4", APIVersion:"v1", ResourceVersion:"142869", FieldPath:"spec.containers{p}"}, Reason:"Pulled", Message:"Container image \"us.gcr.io/k8s-artifacts-prod/e2e-test-images/agnhost:2.20\" already present on machine", Source:v1.EventSource{Component:"kubelet", Host:"ovirt11-5x8d7-worker-0-7kq8z"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbfcb9f58a60193a5, ext:2194327082791, loc:(*time.Location)(0x75222a0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbfcb9f58a60193a5, ext:2194327082791, loc:(*time.Location)(0x75222a0)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'namespaces "e2e-events-7913" not found' (will not retry!)
Sep 01 08:29:42.714506 ovirt11-5x8d7-worker-0-7kq8z hyperkube[1574]: E0901 08:29:42.714384    1574 event.go:264] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"send-events-e1d19803-008c-41b0-ae57-6bee509c8d8d.16309b7e87a8dd9f", GenerateName:"", Namespace:"e2e-events-7913", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"e2e-events-7913", Name:"send-events-e1d19803-008c-41b0-ae57-6bee509c8d8d", UID:"e0b4d36a-5458-4653-b79b-b892eda4d7a4", APIVersion:"v1", ResourceVersion:"142869", FieldPath:"spec.containers{p}"}, Reason:"Created", Message:"Created container p", Source:v1.EventSource{Component:"kubelet", Host:"ovirt11-5x8d7-worker-0-7kq8z"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbfcb9f58db75bf9f, ext:2195150146832, loc:(*time.Location)(0x75222a0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbfcb9f58db75bf9f, ext:2195150146832, loc:(*time.Location)(0x75222a0)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'namespaces "e2e-events-7913" not found' (will not retry!)
Sep 01 08:29:43.552564 ovirt11-5x8d7-worker-0-7kq8z hyperkube[1574]: E0901 08:29:43.550609    1574 event.go:264] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"send-events-e1d19803-008c-41b0-ae57-6bee509c8d8d.16309b7e8bc49981", GenerateName:"", Namespace:"e2e-events-7913", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"e2e-events-7913", Name:"send-events-e1d19803-008c-41b0-ae57-6bee509c8d8d", UID:"e0b4d36a-5458-4653-b79b-b892eda4d7a4", APIVersion:"v1", ResourceVersion:"142869", FieldPath:"spec.containers{p}"}, Reason:"Started", Message:"Started container p", Source:v1.EventSource{Component:"kubelet", Host:"ovirt11-5x8d7-worker-0-7kq8z"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbfcb9f58df917b81, ext:2195219073266, loc:(*time.Location)(0x75222a0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbfcb9f58df917b81, ext:2195219073266, loc:(*time.Location)(0x75222a0)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'namespaces "e2e-events-7913" not found' (will not retry!)
Sep 01 08:30:57.602940 ovirt11-5x8d7-worker-0-7kq8z hyperkube[1574]: E0901 08:30:57.514574    1574 event.go:264] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"send-events-e1d19803-008c-41b0-ae57-6bee509c8d8d.16309b8d82502fc8", GenerateName:"", Namespace:"e2e-events-7913", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"e2e-events-7913", Name:"send-events-e1d19803-008c-41b0-ae57-6bee509c8d8d", UID:"e0b4d36a-5458-4653-b79b-b892eda4d7a4", APIVersion:"v1", ResourceVersion:"142869", FieldPath:"spec.containers{p}"}, Reason:"Killing", Message:"Stopping container p", Source:v1.EventSource{Component:"kubelet", Host:"ovirt11-5x8d7-worker-0-7kq8z"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbfcb9f68ef6a91c8, ext:2259484958523, loc:(*time.Location)(0x75222a0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbfcb9f68ef6a91c8, ext:2259484958523, loc:(*time.Location)(0x75222a0)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'namespaces "e2e-events-7913" not found' (will not retry!)

server rejects events 1m23s after the events are sent

From test output

STEP: checking for scheduler event about the pod
Sep  1 08:27:19.683: INFO: Saw scheduler event for our pod.
STEP: checking for kubelet event about the pod
STEP: deleting the pod
[AfterEach] [k8s.io] [sig-node] Events
  k8s.io/kubernetes.0-rc.2/test/e2e/framework/framework.go:175
STEP: Collecting events from namespace "e2e-events-7913".
STEP: Found 2 events.
Sep  1 08:28:19.838: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for send-events-e1d19803-008c-41b0-ae57-6bee509c8d8d: { } Scheduled: Successfully assigned e2e-events-7913/send-events-e1d19803-008c-41b0-ae57-6bee509c8d8d to ovirt11-5x8d7-worker-0-7kq8z
Sep  1 08:28:19.838: INFO: At 2020-09-01 08:27:13 +0000 UTC - event for send-events-e1d19803-008c-41b0-ae57-6bee509c8d8d: {multus } AddedInterface: Add eth0 [10.128.3.11/23]

<expected container related events here>

Sep  1 08:28:19.876: INFO: POD                                               NODE                          PHASE    GRACE  CONDITIONS
Sep  1 08:28:19.876: INFO: send-events-e1d19803-008c-41b0-ae57-6bee509c8d8d  ovirt11-5x8d7-worker-0-7kq8z  Running  30s    [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-09-01 08:27:11 +0000 UTC  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2020-09-01 08:27:16 +0000 UTC  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2020-09-01 08:27:16 +0000 UTC  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-09-01 08:27:11 +0000 UTC  }]

Comment 22 Seth Jennings 2020-09-03 17:52:42 UTC
Code from the test; the timeout waiting for the kubelet events is 60s:

// Check for kubelet event about the pod.
ginkgo.By("checking for kubelet event about the pod")
framework.ExpectNoError(wait.Poll(time.Second*2, time.Second*60, func() (bool, error) {
	selector := fields.Set{
		"involvedObject.uid":       string(podWithUID.UID),
		"involvedObject.kind":      "Pod",
		"involvedObject.namespace": f.Namespace.Name,
		"source":                   "kubelet",
	}.AsSelector().String()
	options := metav1.ListOptions{FieldSelector: selector}
	events, err = f.ClientSet.CoreV1().Events(f.Namespace.Name).List(context.TODO(), options)
	if err != nil {
		return false, err
	}
	if len(events.Items) > 0 {
		framework.Logf("Saw kubelet event for our pod.")
		return true, nil
	}
	return false, nil
}))

Comment 23 Derek Carr 2020-09-03 18:49:10 UTC
Hmm, so this test is explicitly testing "Events", but event delivery is not guaranteed: we drop events if there is spam, or if the apiserver rejects the traffic more than 12 times.
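
For illustration, a minimal sketch (assuming a reachable kubeconfig; the component name and pod are made up) of the client-go record.EventBroadcaster pattern that components like the kubelet use: StartRecordingToSink delivers events to the apiserver on a best-effort basis, retrying failed sends a limited number of times before dropping them, which is why a test asserting on Events can legitimately flake.

// Sketch only: events recorded through client-go are delivered asynchronously
// and best-effort; if the sink cannot be reached, the event is eventually dropped.
package main

import (
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	typedcorev1 "k8s.io/client-go/kubernetes/typed/core/v1"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/record"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// The broadcaster fans recorded events out to sinks; this sink writes them
	// to the apiserver with internal retries before giving up.
	broadcaster := record.NewBroadcaster()
	defer broadcaster.Shutdown()
	broadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{
		Interface: client.CoreV1().Events(""),
	})
	recorder := broadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "example"})

	// Record an event against a (hypothetical) pod; delivery happens in the background.
	pod := &v1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "busybox", Namespace: "default"}}
	recorder.Event(pod, v1.EventTypeNormal, "Started", "Started container busybox")

	// Give the asynchronous sink a moment before exiting (illustrative only).
	time.Sleep(2 * time.Second)
}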

Comment 24 Derek Carr 2020-09-03 19:05:43 UTC
This will help us see in the logs why earlier attempts to send the event to the server were rejected (not just the last reason):
https://github.com/kubernetes/kubernetes/pull/94483

Comment 25 Seth Jennings 2020-09-03 20:30:16 UTC
This looks like a DNS issue.

The problem is that if DNS blips for even a small period of time, any lookups started during the blip hang for 60-90s before failing, which exceeds the test's 60s timeout. Hence the flakiness.
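
A minimal sketch (hostname copied from the logs above; the 60s bound mirrors the test timeout) of bounding such a lookup with a context to see where the time goes:

// Sketch only: a lookup started during a DNS blip can hang for 60-90s unless bounded.
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	host := "api-int.ovirt11.gcp.devcluster.openshift.com" // taken from the logs above

	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()

	start := time.Now()
	addrs, err := net.DefaultResolver.LookupHost(ctx, host)
	fmt.Printf("lookup took %s, addrs=%v, err=%v\n", time.Since(start), addrs, err)
}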

$ grep "no such host" journal  | grep pod_workers | wc -l
60

Failures are in the time window 08:15:40 to 08:33:52 which includes this test.

It also explains why we don't see this error, which would appear if attempts to write events to a connected apiserver were failing:
https://github.com/kubernetes/kubernetes/blob/f736be5721e370db8fa4445a8a3264bb3770fb6e/staging/src/k8s.io/client-go/tools/record/event.go#L273

The multus event is not reported by multus directly, not the kubelet. Thus it is subject to a different DNS lookup, which explains why we see it but not the kubelet events.  I would imagine we see the multus event failing to show up sometimes too but the test doesn't check for it.

Sending to Networking for DNS unreliability on ovirt.

Comment 26 Seth Jennings 2020-09-03 20:31:40 UTC
That should have been "The multus event _is_ reported by multus directly".

Comment 27 Evgeny Slutsky 2020-09-06 07:19:25 UTC
(In reply to Seth Jennings from comment #18)
> https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-
> ocp-installer-e2e-ovirt-4.6/1294812708624928768
> 
> I do see a DNS failures
> 
> Aug 16 02:45:17.697096 ovirt11-wz8kt-worker-0-5686p hyperkube[1602]: I0816
> 02:45:17.693816    1602 event.go:291] "Event occurred"
> object="e2e-events-7293/send-events-50103669-ddb2-4417-bbca-63841a5fd198"
> kind="Pod" apiVersion="v1" type="Warning" reason="FailedCreatePodSandBox"
> message="Failed to create pod sandbox: rpc error: code = Unknown desc =
> failed to create pod network sandbox
> k8s_send-events-50103669-ddb2-4417-bbca-63841a5fd198_e2e-events-
> 7293_4b7fef7c-f4ce-4ce8-a859-
> b7a81344d13c_0(9eeab53055bedd8135451eb19cbb7c5eabe854c132937172b6a3e977f5dfdd
> dd): Multus:
> [e2e-events-7293/send-events-50103669-ddb2-4417-bbca-63841a5fd198]: error
> getting pod: Get
> \"https://[api-int.ovirt11.gcp.devcluster.openshift.com]:6443/api/v1/
> namespaces/e2e-events-7293/pods/send-events-50103669-ddb2-4417-bbca-
> 63841a5fd198?timeout=1m0s\": dial tcp: lookup
> api-int.ovirt11.gcp.devcluster.openshift.com on 192.168.211.1:53: no such
> host"
> 
> Seems like the cluster DNS is generally unstable
> 
> $ cat workers-journal | grep "api-int.*53: no such host" | wc -l
> 759
> 
> Just for reference, the expected event activity starting a pod with a single
> container is
> 
> $ oc get events
> LAST SEEN   TYPE     REASON           OBJECT        MESSAGE
> 8s          Normal   Scheduled        pod/busybox   Successfully assigned
> demo/busybox to ip-10-0-172-31.us-west-1.compute.internal
> 6s          Normal   AddedInterface   pod/busybox   Add eth0 [10.128.2.87/23]
> 6s          Normal   Pulling          pod/busybox   Pulling image
> "busybox:latest"
> 3s          Normal   Pulled           pod/busybox   Successfully pulled
> image "busybox:latest" in 3.111409964s
> 3s          Normal   Created          pod/busybox   Created container busybox
> 3s          Normal   Started          pod/busybox   Started container busybox
> 
> The test, when it fails, only shows to the first two (Scheduled and
> AddedInterface) events in the apiserver, when the kubelet log shows the
> events being sent

192.168.211.1 is the oVirt infra upstream DNS server; should api-int.ovirt11.gcp.devcluster.openshift.com be resolvable by the upstream DNS?

Comment 28 Evgeny Slutsky 2020-09-06 08:01:46 UTC
Looking at the ovirt11-wz8kt-worker-0-5686p worker journal [0]:

During this lookup failure we can see:

# cat workers-journal | grep nm-dispatcher | grep 'worker-0-5686p' | grep resolv-prepender
Aug 16 02:45:15.570577 ovirt11-wz8kt-worker-0-5686p nm-dispatcher[302731]: <13>Aug 16 02:45:15 root: NM resolv-prepender triggered by ens3 dhcp4-change.
Aug 16 02:45:17.726649 ovirt11-wz8kt-worker-0-5686p nm-dispatcher[302731]: <13>Aug 16 02:45:17 root: NM resolv-prepender: Prepending 'nameserver 192.168.211.118' to /etc/resolv.conf (other nameservers from /var/run/NetworkManager/resolv.conf)

It takes the prepender pod 20ms to complete [1]; during this time window there is no internal DNS record in /etc/resolv.conf (so only the upstream can be reached).


[0] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1294812708624928768/artifacts/e2e-ovirt/nodes/workers-journal
[1] https://github.com/openshift/machine-config-operator/blob/f8ff1d10d55425cd33c367c0911efcdd371125c6/templates/common/ovirt/files/NetworkManager-resolv-prepender.yaml#L40
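
A hypothetical diagnostic sketch (not part of the MCO; the nameserver value is taken from the journal excerpt above): poll /etc/resolv.conf on a worker and log how long the internal nameserver line stays missing, to quantify the window the resolv-prepender leaves open while it runs:

// Sketch only: measure windows during which the cluster-internal nameserver
// is absent from /etc/resolv.conf (i.e. only the upstream resolver is reachable).
package main

import (
	"fmt"
	"os"
	"strings"
	"time"
)

func main() {
	const internalNameserver = "nameserver 192.168.211.118" // from the journal excerpt
	var missingSince time.Time

	for {
		data, err := os.ReadFile("/etc/resolv.conf")
		present := err == nil && strings.Contains(string(data), internalNameserver)

		switch {
		case !present && missingSince.IsZero():
			missingSince = time.Now()
		case present && !missingSince.IsZero():
			fmt.Printf("internal nameserver was missing for %s\n", time.Since(missingSince))
			missingSince = time.Time{}
		}
		time.Sleep(100 * time.Millisecond)
	}
}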

Comment 29 Evgeny Slutsky 2020-09-06 08:23:54 UTC
(In reply to Evgeny Slutsky from comment #28)
> by looking at ovirt11-wz8kt-worker-0-5686p workers journal [0]:
> 
> during this lookup failure we can see :
> 
> # cat workers-journal | grep nm-dispatcher | grep 'worker-0-5686p' | grep
> resolv-prepender
> Aug 16 02:45:15.570577 ovirt11-wz8kt-worker-0-5686p nm-dispatcher[302731]:
> <13>Aug 16 02:45:15 root: NM resolv-prepender triggered by ens3 dhcp4-change.
> Aug 16 02:45:17.726649 ovirt11-wz8kt-worker-0-5686p nm-dispatcher[302731]:
> <13>Aug 16 02:45:17 root: NM resolv-prepender: Prepending 'nameserver
> 192.168.211.118' to /etc/resolv.conf (other nameservers from
> /var/run/NetworkManager/resolv.conf)
> 
> it takes the prepender pod 20ms to complete [1] , during this time window
> there is no internal dns record in /etc/resolv.conf (so only upstream can be
> reached) .
> 
> 
> [0]
> https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-
> installer-e2e-ovirt-4.6/1294812708624928768/artifacts/e2e-ovirt/nodes/
> workers-journal
> [1]
> https://github.com/openshift/machine-config-operator/blob/
> f8ff1d10d55425cd33c367c0911efcdd371125c6/templates/common/ovirt/files/
> NetworkManager-resolv-prepender.yaml#L40
Correction: it takes 2 seconds.

Comment 30 Seth Jennings 2020-09-08 14:24:20 UTC
> 192.168.211.1 is oVirt infra upstream DNS server,
> does api-int.ovirt11.gcp.devcluster.openshift.com  should be able resolved by the upstream DNS ?

I'm not sure how it works for ovirt.  Every platform is a little different.

On OpenStack, where there is no software-defined load balancer, they run keepalived to assign a VIP to the master leader, and the DNS resolves to that VIP.

Long story short though, I don't think the upstream resolver (outside the cluster) would be able to resolve the api-int name for a cluster.

Comment 31 Ben Nemec 2020-09-08 21:58:54 UTC
This sounds very similar to the bug being fixed in https://github.com/openshift/machine-config-operator/pull/2062.

Comment 32 Evgeny Slutsky 2020-09-09 09:11:14 UTC
(In reply to Ben Nemec from comment #31)
> This sounds very similar to the bug being fixed in
> https://github.com/openshift/machine-config-operator/pull/2062.

Yeah, this BZ is one of the motivators for that PR.

Comment 33 Ben Nemec 2020-09-28 20:39:41 UTC
PR 2062 has merged, so I believe this should be fixed now. I'm going to dupe it to bug 1876216, which has already been verified.

*** This bug has been marked as a duplicate of bug 1876216 ***

