Bug 1881678 - [vsphere] Error while adding pod to CNI network \"multus-cni-network\": netplugin failed with no error message: signal: killed"
Summary: [vsphere] Error while adding pod to CNI network \"multus-cni-network\": netpl...
Keywords:
Status: CLOSED DUPLICATE of bug 1785399
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: zenghui.shi
QA Contact: Weibin Liang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-09-22 21:07 UTC by Joseph Callen
Modified: 2020-10-13 19:49 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-13 19:49:04 UTC
Target Upstream Version:
Embargoed:



Description Joseph Callen 2020-09-22 21:07:03 UTC
Description of problem:

Using this CI job as an example:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/12106/rehearse-12106-pull-ci-openshift-machine-api-operator-master-e2e-vsphere-serial/1308465647310606336/

Over the last few days I have noticed a significant slowdown in installation time, with some pods taking an extended amount of time to start. Within `oc get events`
there is an error:

Failed to create pod sandbox: rpc error: code = Unknown desc = Kubelet may be retrying requests that are timing out in CRI-O due to system load
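(For anyone checking a live cluster for the same symptom, a minimal sketch; the reason value `FailedCreatePodSandBox` is taken from the events shown later in this bug, and access via `oc` is assumed.)

```
# List FailedCreatePodSandBox events across all namespaces, oldest first
# (illustrative; assumes a live cluster reachable via oc)
oc get events -A --field-selector reason=FailedCreatePodSandBox --sort-by=.lastTimestamp
```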

https://coreos.slack.com/archives/CK1AE4ZCK/p1600802508017300
Chatted with the node team; Peter reviewed the crio logs and found:


Sep 22 18:47:25.749752 ci-op-9mdkn32w-b51ab-8n2lh-master-0 crio[1826]: time="2020-09-22 18:47:25.749399387Z" level=warning msg="Error reserving pod name k8s_installer-4-ci-op-9mdkn32w-b51ab-8n2lh-master-0_openshift-kube-apiserver_d4675b9f-c3d7-4184-92c6-4952281c91dc_0 for id 0b6188c4131fa58a47b0030fdc755ccb519deaa2d4d455ad66429abdf0b860b0: name is reserved"
Sep 22 18:47:26.211453 ci-op-9mdkn32w-b51ab-8n2lh-master-0 crio[1826]: time="2020-09-22 18:47:26.211408249Z" level=error msg="Error adding network: netplugin failed with no error message: signal: killed"
Sep 22 18:47:26.211453 ci-op-9mdkn32w-b51ab-8n2lh-master-0 crio[1826]: time="2020-09-22 18:47:26.211446291Z" level=error msg="Error while adding pod to CNI network \"multus-cni-network\": netplugin failed with no error message: signal: killed"
Sep 22 18:47:26.211626 ci-op-9mdkn32w-b51ab-8n2lh-master-0 crio[1826]: time="2020-09-22 18:47:26.211477534Z" level=info msg="NetworkStart: stopping network for sandbox 5a9a3de48f8da97015dd3494686d1ae8b2cb9e5ae1e17e01ae38041d1fa391e6" id=8ee5e5a3-7617-472f-be56-858e6727bbc6 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Sep 22 18:47:26.212238 ci-op-9mdkn32w-b51ab-8n2lh-master-0 crio[1826]: time="2020-09-22 18:47:26.212203948Z" level=info msg="Got pod network &{Name:installer-3-ci-op-9mdkn32w-b51ab-8n2lh-master-0 Namespace:openshift-etcd ID:5a9a3de48f8da97015dd3494686d1ae8b2cb9e5ae1e17e01ae38041d1fa391e6 NetNS:/var/run/netns/8dace82c-c738-4b7e-ac01-da06efeaf097 Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}]}"
Sep 22 18:47:26.212332 ci-op-9mdkn32w-b51ab-8n2lh-master-0 crio[1826]: time="2020-09-22 18:47:26.212259044Z" level=info msg="About to del CNI network multus-cni-network (type=multus)"
Sep 22 18:47:26.212391 ci-op-9mdkn32w-b51ab-8n2lh-master-0 crio[1826]: time="2020-09-22 18:47:26.212372848Z" level=error msg="Error deleting network: netplugin failed with no error message: context deadline exceeded"
Sep 22 18:47:26.212391 ci-op-9mdkn32w-b51ab-8n2lh-master-0 crio[1826]: time="2020-09-22 18:47:26.212386882Z" level=error msg="Error while removing pod from CNI network \"multus-cni-network\": netplugin failed with no error message: context deadline exceeded"
Sep 22 18:47:26.212474 ci-op-9mdkn32w-b51ab-8n2lh-master-0 crio[1826]: time="2020-09-22 18:47:26.212431278Z" level=error msg="Error stopping network on cleanup: failed to destroy network for pod sandbox k8s_installer-3-ci-op-9mdkn32w-b51ab-8n2lh-master-0_openshift-etcd_3060d920-b82d-4793-b061-d1ecaac3f9c8_0(5a9a3de48f8da97015dd3494686d1ae8b2cb9e5ae1e17e01ae38041d1fa391e6): netplugin failed with no error message: context deadline exceeded" id=8ee5e5a3-7617-472f-be56-858e6727bbc6 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Douglas Smith 2020-09-23 17:55:24 UTC
What this looks like is that crio's context for the CNI executable is ending and it kills the process, so Multus doesn't have time to return an error. Without debug logging turned on for Multus, it is difficult to diagnose exactly what can cause this. (I will return to my team with a note about improving some of the visibility of the logs in must-gather.)
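For reference, a minimal sketch (assuming the must-gather layout referenced later in this bug) of counting how often crio reports the CNI plugin being killed or timing out:

```
# Count the two netplugin failure modes in the must-gather crio logs
# (paths per ./host_service_logs/masters/ as used elsewhere in this bug)
grep -c 'netplugin failed with no error message: signal: killed' \
  ./host_service_logs/masters/crio_service.log
grep -c 'netplugin failed with no error message: context deadline exceeded' \
  ./host_service_logs/masters/crio_service.log
```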

However, there are two likely scenarios in which Multus would wait long enough to get its context killed:

1. The default network provider (openshift-sdn or ovn-k) isn't ready.
2. The API server isn't responding.

In this case, I believe it's #2. I've taken a look at the must-gather, and I can see that there is a time period where the default network provider isn't ready; however, this occurs at approximately 18:23 and lasts for 41 seconds according to `./host_service_logs/masters/kubelet_service.log`.

Taking a look at the API server logs, we can see there was a throttling event at 18:47, which aligns with the time period where crio is context-killing Multus per the above logs.


```
cat ./namespaces/openshift-kube-apiserver-operator/pods/kube-apiserver-operator-5f9df6bb56-62g7f/kube-apiserver-operator/kube-apiserver-operator/logs/current.log | grep "18:47"
2020-09-22T18:47:00.090030431Z I0922 18:47:00.089963       1 request.go:645] Throttling request took 1.391262181s, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-apiserver/pods?labelSelector=apiserver%3Dtrue
2020-09-22T18:47:01.289957476Z I0922 18:47:01.289802       1 request.go:645] Throttling request took 1.389825161s, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-apiserver/configmaps/config
2020-09-22T18:47:02.489992922Z I0922 18:47:02.489885       1 request.go:645] Throttling request took 1.392346474s, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-apiserver/pods?labelSelector=apiserver%3Dtrue
2020-09-22T18:47:03.489978342Z I0922 18:47:03.489894       1 request.go:645] Throttling request took 1.591991131s, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-apiserver/pods?labelSelector=apiserver%3Dtrue
2020-09-22T18:47:04.689902998Z I0922 18:47:04.689856       1 request.go:645] Throttling request took 1.191412227s, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-apiserver/pods?labelSelector=apiserver%3Dtrue
2020-09-22T18:47:05.889962580Z I0922 18:47:05.889790       1 request.go:645] Throttling request took 1.214219132s, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-apiserver
```

I would advise that we have the folks from the kube-apiserver component take a look.

Thank you!

Comment 2 Abu Kashem 2020-09-24 22:17:57 UTC
> cat ./namespaces/openshift-kube-apiserver-operator/pods/kube-apiserver-operator-5f9df6bb56-62g7f/kube-apiserver-operator/kube-apiserver-
> operator/logs/current.log | grep "18:47"
> 020-09-22T18:47:00.090030431Z I0922 18:47:00.089963       1 request.go:645] Throttling request took 1.391262181s, request: 
> GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-apiserver/pods?labelSelector=apiserver%3Dtrue

This is client-side throttling. If you check the operator's log from the CI job, you will see that this message is prevalent throughout the log.
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/12106/rehearse-12106-pull-ci-openshift-machine-api-operator-master-e2e-vsphere-serial/1308465647310606336/artifacts/e2e-vsphere-serial/gather-extra/pods/openshift-kube-apiserver-operator_kube-apiserver-operator-5f9df6bb56-62g7f_kube-apiserver-operator.log
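(A rough way to confirm how prevalent it is, assuming the operator log above has been downloaded locally under its artifact name:)

```
# Count client-side throttling messages in the downloaded operator log
grep -c 'Throttling request took' \
  openshift-kube-apiserver-operator_kube-apiserver-operator-5f9df6bb56-62g7f_kube-apiserver-operator.log
```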


I have downloaded the prometheus dump from the CI job:
A: I noticed that the pods "CONNECT" call was taking longer (on the order of 50s to 60s) for a brief window of 5 minutes (from 15:55 to 16:00, as shown in the metrics).
B: Two requests failed with 429: pods (subresource=eviction, verb=POST). This happened around the same window.

The API server was not under load and was not rejecting any requests. Other Pod requests were normal. Do you think either A or B can be attributed to the issue you are observing? It's hard to conclude so; A lasted for about 5 minutes, and B was just two requests.
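(For reference, a sketch of the kind of query behind A, assuming the Prometheus dump is loaded into a local Prometheus listening on :9090; the metric is the standard apiserver_request_duration_seconds histogram, not anything specific to this job.)

```
# p99 latency of pods CONNECT requests, queried via the Prometheus HTTP API
# (hypothetical local endpoint; narrow the range to the 15:55-16:00 window as needed)
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{verb="CONNECT",resource="pods"}[5m])) by (le))'
```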



> I have noticed the last few days a significant slow down in installation time where some pods take an extended amount of time to start. 
> Within `oc get events` there is an error:

> Failed to create pod sandbox: rpc error: code = Unknown desc = Kubelet may be retrying requests that are timing out in CRI-O due to system load

Based on my investigation so far, this does not look like an issue with the apiserver. But if you can pinpoint the timeline and the pod resources, I would be happy to look at it again. For now, I am going to send it in your direction.

Comment 3 Stefan Schimanski 2020-09-25 09:22:40 UTC
Moving back to Multus to reevaluate with Abu's analysis.

Comment 4 Stefan Schimanski 2020-09-25 09:23:19 UTC
Note that slow CONNECT calls look like networking issues, don't they, Abu?

Comment 5 Douglas Smith 2020-09-25 13:00:13 UTC
Thanks for the investigation Abu, and information about the throttling error messages.

Grepping through the must-gather for failures in this time period with `grep -Prin "18:47:" *`, I'm seeing a lot of errors that look like:

```
mustgather/host_service_logs/masters/kubelet_service.log:1237839:Sep 22 20:08:50.798654 ci-op-9mdkn32w-b51ab-8n2lh-master-2 hyperkube[1864]: 2020-09-22T18:47:28Z auth: error contacting auth provider (retrying in 10s): discovery through endpoint https://kubernetes.default.svc/.well-known/oauth-authorization-server failed: 404 Not Found
```

My investigation shows that there are 410 of these error messages within this minute, as counted by:

```
grep -Prin "18:47:" * | grep -i error | grep -i "error contacting auth provider" | wc -l
410
```

Whereas, there are only a few references to Multus errors within this time period:

```
grep -Prin "18:47:" * | grep -i error | grep -i "multus" | wc -l
15
```

Additionally, I believe that the CRI-O error related to Multus is a symptom, indicative of a deeper systemic problem with a slowdown somewhere. Looking at all the Multus errors on the system, there are three events that occur:

1. There is a time period where the default network provider isn't ready; however, this occurs at approximately 18:23 and lasts for 41 seconds according to `./host_service_logs/masters/kubelet_service.log`. This appears nominal to me, as it can take some time for the default network provider (openshift-sdn in this case) to become ready.
2. Between 18:47:26.2 and 18:47:27.2, a 1-second period, there are 14 events related to the CRI-O context deadline in `./host_service_logs/masters/crio_service.log`.
3. Between 19:13:06.3 and 19:13:06.9, a 0.6-second period, there are 16 events related to the CRI-O context deadline in `./host_service_logs/masters/crio_service.log`.

Typically, if there were errors related to Multus, they would appear in the kubelet logs, as CNI errors go to STDERR, which is captured by the kubelet. The only error that Multus is reporting is the default network not being ready @ 18:23, as mentioned in event #1 above.
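(A sketch of the scan I mean, assuming the same must-gather paths; adjust the timestamp for the window of interest.)

```
# Look for Multus-related messages captured by the kubelet during the 18:47 window
grep -i 'multus' ./host_service_logs/masters/kubelet_service.log | grep '18:47:'
```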

Comment 6 Douglas Smith 2020-09-25 13:50:39 UTC
In consulting with my team, we made one observation regarding this error:

```
pods/sdn-844k6/sdn/sdn/logs/current.log:407:2020-09-22T18:47:26.204841994Z W0922 18:47:26.204767    4408 pod.go:276] CNI_ADD openshift-etcd/installer-3-ci-op-9mdkn32w-b51ab-8n2lh-master-0 failed: Get "https://api-int.ci-op-9mdkn32w-b51ab.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/openshift-etcd/pods/installer-3-ci-op-9mdkn32w-b51ab-8n2lh-master-0": read tcp 172.31.252.213:36158->172.31.252.213:6443: read: connection timed out
```

This is from the SDN, which is over host networking. This could be related to infrastructure problems, the apiserver not being up on this node, or a load balancer problem.
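(If someone can get onto an affected node while this is happening, a hypothetical check of apiserver reachability through the internal load balancer; the endpoint is taken from the SDN log above.)

```
# Report HTTP status and total time for a request through the internal LB
# (hypothetical check; run from the affected node)
curl -k -s -o /dev/null -w '%{http_code} %{time_total}s\n' \
  https://api-int.ci-op-9mdkn32w-b51ab.origin-ci-int-aws.dev.rhcloud.com:6443/healthz
```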

Comment 7 Abu Kashem 2020-09-25 16:41:24 UTC
> This is from the SDN, which is over host networking. This could be related to infrastructure problems, or the apiserver not being up on this node, 
> or it could be that it's a load balancer problem.

- Unfortunately, the prometheus metric data does not go back as far as "18:47:26.204841994Z".
- The apiserver logs do not go back that far either due to this issue https://bugzilla.redhat.com/show_bug.cgi?id=1882446. 

I checked the events and saw this:

> "2020-09-22T18:26:07Z StaticPodInstallerCompleted Successfully installed revision 2"
> "2020-09-22T18:26:10Z Started Started container kube-apiserver-check-endpoints"
> "2020-09-22T18:26:44Z Started Started container pruner"
> "2020-09-22T18:26:58Z Started Started container kube-apiserver"
> "2020-09-22T18:28:14Z StaticPodInstallerCompleted Successfully installed revision 2"

> "2020-09-22T18:29:12Z StaticPodInstallerCompleted Successfully installed revision 3"
> "2020-09-22T18:38:48Z FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
> "2020-09-22T18:43:55Z FailedCreatePodSandBox (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = Kubelet may be 
> retrying requests that are timing out in CRI-O due to system load: error reserving pod name k8s_revision-pruner-3-ci-op-9mdkn32w-b51ab-8n2lh-master-
> 0_openshift-kube-apiserver_c094de13-e3c8-4f4e-b49e-5076bfe25504_0 for id 966299ad7d5e5982922ebb380c673693847ec7576c031b26fd7cdd3e95e21898: name is reserved"


"FailedCreatePodSandBox" error applies to the following Pods (not limited to the apiserver)

 $ cat events.json | jq '.items | sort_by(.lastTimestamp) | .[] | select(.reason=="FailedCreatePodSandBox") | .lastTimestamp + " " + .reason + " " + .involvedObject.namespace + "/" + .involvedObject.name'
"2020-09-22T18:33:15Z FailedCreatePodSandBox openshift-etcd/installer-3-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:33:27Z FailedCreatePodSandBox openshift-etcd/installer-3-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:33:38Z FailedCreatePodSandBox openshift-kube-controller-manager/installer-5-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:33:38Z FailedCreatePodSandBox openshift-kube-scheduler/installer-5-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:33:40Z FailedCreatePodSandBox openshift-etcd/installer-3-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:33:49Z FailedCreatePodSandBox openshift-kube-scheduler/installer-5-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:33:53Z FailedCreatePodSandBox openshift-kube-controller-manager/installer-5-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:33:54Z FailedCreatePodSandBox openshift-etcd/installer-3-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:34:01Z FailedCreatePodSandBox openshift-kube-scheduler/installer-5-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:34:04Z FailedCreatePodSandBox openshift-kube-controller-manager/installer-5-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:34:07Z FailedCreatePodSandBox openshift-etcd/installer-3-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:34:12Z FailedCreatePodSandBox openshift-kube-scheduler/installer-5-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:34:18Z FailedCreatePodSandBox openshift-kube-controller-manager/installer-5-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:34:21Z FailedCreatePodSandBox openshift-etcd/installer-3-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:34:24Z FailedCreatePodSandBox openshift-kube-scheduler/installer-5-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:34:29Z FailedCreatePodSandBox openshift-kube-controller-manager/installer-5-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:34:35Z FailedCreatePodSandBox openshift-etcd/installer-3-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:34:35Z FailedCreatePodSandBox openshift-kube-scheduler/installer-5-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:34:40Z FailedCreatePodSandBox openshift-kube-controller-manager/installer-5-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:34:47Z FailedCreatePodSandBox openshift-etcd/installer-3-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:34:47Z FailedCreatePodSandBox openshift-kube-scheduler/installer-5-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:34:54Z FailedCreatePodSandBox openshift-kube-controller-manager/installer-5-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:34:58Z FailedCreatePodSandBox openshift-kube-scheduler/installer-5-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:35:01Z FailedCreatePodSandBox openshift-etcd/installer-3-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:35:07Z FailedCreatePodSandBox openshift-kube-controller-manager/installer-5-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:35:13Z FailedCreatePodSandBox openshift-kube-scheduler/installer-5-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:35:20Z FailedCreatePodSandBox openshift-kube-controller-manager/installer-5-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:37:47Z FailedCreatePodSandBox openshift-console/console-84bd7b6b74-5cx59"
"2020-09-22T18:37:56Z FailedCreatePodSandBox openshift-authentication/oauth-openshift-5c4dd97b94-hq9ss"
"2020-09-22T18:38:00Z FailedCreatePodSandBox openshift-console/console-84bd7b6b74-5cx59"
"2020-09-22T18:38:11Z FailedCreatePodSandBox openshift-console/console-84bd7b6b74-5cx59"
"2020-09-22T18:38:23Z FailedCreatePodSandBox openshift-console/console-84bd7b6b74-5cx59"
"2020-09-22T18:38:34Z FailedCreatePodSandBox openshift-console/console-84bd7b6b74-5cx59"
"2020-09-22T18:38:40Z FailedCreatePodSandBox openshift-kube-apiserver/revision-pruner-3-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:38:48Z FailedCreatePodSandBox openshift-kube-apiserver/installer-4-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:38:49Z FailedCreatePodSandBox openshift-console/console-84bd7b6b74-5cx59"
"2020-09-22T18:38:51Z FailedCreatePodSandBox openshift-kube-apiserver/revision-pruner-3-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:39:01Z FailedCreatePodSandBox openshift-console/console-84bd7b6b74-5cx59"
"2020-09-22T18:39:01Z FailedCreatePodSandBox openshift-kube-apiserver/installer-4-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:39:03Z FailedCreatePodSandBox openshift-kube-apiserver/revision-pruner-3-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:39:14Z FailedCreatePodSandBox openshift-kube-apiserver/revision-pruner-3-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:39:15Z FailedCreatePodSandBox openshift-console/console-84bd7b6b74-5cx59"
"2020-09-22T18:39:16Z FailedCreatePodSandBox openshift-kube-apiserver/installer-4-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:39:26Z FailedCreatePodSandBox openshift-console/console-84bd7b6b74-5cx59"
"2020-09-22T18:39:29Z FailedCreatePodSandBox openshift-kube-apiserver/installer-4-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:39:29Z FailedCreatePodSandBox openshift-kube-apiserver/revision-pruner-3-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:39:40Z FailedCreatePodSandBox openshift-kube-apiserver/installer-4-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:39:43Z FailedCreatePodSandBox openshift-kube-apiserver/revision-pruner-3-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:39:54Z FailedCreatePodSandBox openshift-kube-apiserver/revision-pruner-3-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:39:55Z FailedCreatePodSandBox openshift-kube-apiserver/installer-4-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:40:09Z FailedCreatePodSandBox openshift-kube-apiserver/revision-pruner-3-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:40:11Z FailedCreatePodSandBox openshift-kube-apiserver/installer-4-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:40:23Z FailedCreatePodSandBox openshift-kube-apiserver/installer-4-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:40:24Z FailedCreatePodSandBox openshift-kube-apiserver/revision-pruner-3-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:40:35Z FailedCreatePodSandBox openshift-kube-apiserver/installer-4-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:43:08Z FailedCreatePodSandBox openshift-console/console-84bd7b6b74-5cx59"
"2020-09-22T18:43:22Z FailedCreatePodSandBox openshift-etcd/installer-3-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:43:40Z FailedCreatePodSandBox openshift-kube-controller-manager/installer-5-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:43:45Z FailedCreatePodSandBox openshift-kube-scheduler/installer-5-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:43:55Z FailedCreatePodSandBox openshift-kube-apiserver/revision-pruner-3-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T18:44:13Z FailedCreatePodSandBox openshift-kube-apiserver/installer-4-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T19:01:08Z FailedCreatePodSandBox openshift-apiserver/apiserver-7bf97f8874-4mxp9"
"2020-09-22T19:01:21Z FailedCreatePodSandBox openshift-apiserver/apiserver-7bf97f8874-4mxp9"
"2020-09-22T19:01:27Z FailedCreatePodSandBox openshift-kube-apiserver/revision-pruner-6-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T19:01:35Z FailedCreatePodSandBox openshift-apiserver/apiserver-7bf97f8874-4mxp9"
"2020-09-22T19:01:36Z FailedCreatePodSandBox openshift-kube-apiserver/installer-7-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T19:01:40Z FailedCreatePodSandBox openshift-kube-apiserver/revision-pruner-6-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T19:01:46Z FailedCreatePodSandBox openshift-apiserver/apiserver-7bf97f8874-4mxp9"
"2020-09-22T19:01:47Z FailedCreatePodSandBox openshift-kube-apiserver/installer-7-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T19:01:55Z FailedCreatePodSandBox openshift-kube-apiserver/revision-pruner-6-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T19:01:58Z FailedCreatePodSandBox openshift-kube-apiserver/installer-7-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T19:01:59Z FailedCreatePodSandBox openshift-apiserver/apiserver-7bf97f8874-4mxp9"
"2020-09-22T19:02:08Z FailedCreatePodSandBox openshift-kube-apiserver/revision-pruner-6-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T19:02:09Z FailedCreatePodSandBox openshift-kube-apiserver/installer-7-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T19:02:11Z FailedCreatePodSandBox openshift-apiserver/apiserver-7bf97f8874-4mxp9"
"2020-09-22T19:02:22Z FailedCreatePodSandBox openshift-kube-apiserver/installer-7-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T19:02:23Z FailedCreatePodSandBox openshift-kube-apiserver/revision-pruner-6-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T19:02:26Z FailedCreatePodSandBox openshift-apiserver/apiserver-7bf97f8874-4mxp9"
"2020-09-22T19:02:33Z FailedCreatePodSandBox openshift-kube-apiserver/installer-7-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T19:02:38Z FailedCreatePodSandBox openshift-kube-apiserver/revision-pruner-6-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T19:02:41Z FailedCreatePodSandBox openshift-apiserver/apiserver-7bf97f8874-4mxp9"
"2020-09-22T19:02:45Z FailedCreatePodSandBox openshift-kube-apiserver/installer-7-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T19:02:49Z FailedCreatePodSandBox openshift-kube-apiserver/revision-pruner-6-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T19:02:55Z FailedCreatePodSandBox openshift-apiserver/apiserver-7bf97f8874-4mxp9"
"2020-09-22T19:02:57Z FailedCreatePodSandBox openshift-kube-apiserver/installer-7-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T19:03:02Z FailedCreatePodSandBox openshift-kube-apiserver/revision-pruner-6-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T19:03:10Z FailedCreatePodSandBox openshift-kube-apiserver/installer-7-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T19:03:15Z FailedCreatePodSandBox openshift-kube-apiserver/revision-pruner-6-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T19:03:38Z FailedCreatePodSandBox openshift-controller-manager/controller-manager-dp5q4"
"2020-09-22T19:03:53Z FailedCreatePodSandBox openshift-controller-manager/controller-manager-dp5q4"
"2020-09-22T19:04:07Z FailedCreatePodSandBox openshift-controller-manager/controller-manager-dp5q4"
"2020-09-22T19:04:22Z FailedCreatePodSandBox openshift-controller-manager/controller-manager-dp5q4"
"2020-09-22T19:04:34Z FailedCreatePodSandBox openshift-controller-manager/controller-manager-dp5q4"
"2020-09-22T19:04:47Z FailedCreatePodSandBox openshift-controller-manager/controller-manager-dp5q4"
"2020-09-22T19:05:01Z FailedCreatePodSandBox openshift-controller-manager/controller-manager-dp5q4"
"2020-09-22T19:05:14Z FailedCreatePodSandBox openshift-controller-manager/controller-manager-dp5q4"
"2020-09-22T19:05:28Z FailedCreatePodSandBox openshift-controller-manager/controller-manager-dp5q4"
"2020-09-22T19:09:12Z FailedCreatePodSandBox openshift-controller-manager/controller-manager-dp5q4"
"2020-09-22T19:11:12Z FailedCreatePodSandBox openshift-apiserver/apiserver-7bf97f8874-4mxp9"
"2020-09-22T19:11:36Z FailedCreatePodSandBox openshift-kube-apiserver/revision-pruner-6-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T19:11:38Z FailedCreatePodSandBox openshift-kube-apiserver/installer-7-ci-op-9mdkn32w-b51ab-8n2lh-master-0"
"2020-09-22T19:25:29Z FailedCreatePodSandBox openshift-kube-apiserver/revision-pruner-7-ci-op-9mdkn32w-b51ab-8n2lh-master-2"
"2020-09-22T19:25:44Z FailedCreatePodSandBox openshift-kube-apiserver/revision-pruner-7-ci-op-9mdkn32w-b51ab-8n2lh-master-2"
"2020-09-22T19:25:56Z FailedCreatePodSandBox openshift-kube-apiserver/revision-pruner-7-ci-op-9mdkn32w-b51ab-8n2lh-master-2"
"2020-09-22T19:26:07Z FailedCreatePodSandBox openshift-kube-apiserver/revision-pruner-7-ci-op-9mdkn32w-b51ab-8n2lh-master-2"
"2020-09-22T19:26:19Z FailedCreatePodSandBox openshift-kube-apiserver/revision-pruner-7-ci-op-9mdkn32w-b51ab-8n2lh-master-2"
"2020-09-22T19:26:31Z FailedCreatePodSandBox openshift-kube-apiserver/revision-pruner-7-ci-op-9mdkn32w-b51ab-8n2lh-master-2"
"2020-09-22T19:26:44Z FailedCreatePodSandBox openshift-kube-apiserver/revision-pruner-7-ci-op-9mdkn32w-b51ab-8n2lh-master-2"
"2020-09-22T19:26:58Z FailedCreatePodSandBox openshift-kube-apiserver/revision-pruner-7-ci-op-9mdkn32w-b51ab-8n2lh-master-2"
"2020-09-22T19:27:10Z FailedCreatePodSandBox openshift-kube-apiserver/revision-pruner-7-ci-op-9mdkn32w-b51ab-8n2lh-master-2"
"2020-09-22T19:29:08Z FailedCreatePodSandBox openshift-kube-apiserver/revision-pruner-7-ci-op-9mdkn32w-b51ab-8n2lh-master-2"
"2020-09-22T19:39:58Z FailedCreatePodSandBox kube-system/critical-pod"


I also see evidence of network blips in the events, so this could be a network-related issue. My recommendation is to try to reproduce it with https://github.com/openshift/kubernetes/pull/374 included in the build; that way we can look at the apiserver log in depth.

In the meantime, I would also recommend investigating the "FailedCreatePodSandBox" error further: where is it originating from?
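(One way to start on that, assuming the same events.json as above; the fields are the standard Kubernetes Event schema, so the reporting component and a snippet of each message can be grouped like this.)

```
# Group FailedCreatePodSandBox events by reporting component and message prefix
cat events.json \
  | jq -r '.items[] | select(.reason=="FailedCreatePodSandBox")
           | (.source.component // "?") + " | " + (.message // "" | .[0:120])' \
  | sort | uniq -c | sort -rn
```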

Comment 8 Ben Bennett 2020-09-25 16:45:16 UTC
Moving to 4.7 to investigate further. This seems to occur rarely (based on https://search.ci.openshift.org/) and doesn't break the cluster when it happens; it just causes a slowdown in pod creation on the affected node.

It may be a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1785399

Comment 9 Joseph Callen 2020-09-25 19:34:15 UTC
After skimming through 1785399 I think it is a dup. (In reply to Ben Bennett from comment #8)
> Moving to 4.7 to investigate further.  This seems to occur rarely (based on

This isn't quite correct. I am seeing it on every install I have been running manually, including the one I am presently running.

> https://search.ci.openshift.org/) and doesn't break the cluster when it
> happens, it just causes slowdown in pod creation on the affected node.
> 
> It may be a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1785399

After skimming the above BZ, yeah, it could be a dup.

Comment 10 Douglas Smith 2020-10-13 19:49:04 UTC

*** This bug has been marked as a duplicate of bug 1785399 ***

