1789881 – Pod sandbox creation failures

Bug 1789881 - Pod sandbox creation failures

Summary: Pod sandbox creation failures

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.4
Hardware:	x86_64
OS:	All
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Jacob Tanenbaum
QA Contact:	gaoshang
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1810640 (view as bug list)
Depends On:
Blocks:	1811077
TreeView+	depends on / blocked

Reported:	2020-01-10 16:06 UTC by Sebastian Soto
Modified:	2020-07-13 17:13 UTC (History)
CC List:	7 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1811077 (view as bug list)
Environment:
Last Closed:	2020-07-13 17:13:00 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
ovn-master logs (272.30 KB, text/plain) 2020-01-10 16:06 UTC, Sebastian Soto	no flags	Details
View All

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-network-operator pull 500	None	closed	Bug 1789881: Start ovn correctly with hybrid networking	2020-10-20 02:44:23 UTC
Github	ovn-org ovn-kubernetes pull 1048	None	closed	add the ability for ovn-kubernetes to ignore a node by label	2020-10-20 02:44:23 UTC
Red Hat Product Errata	RHBA-2020:2409	None	None	None	2020-07-13 17:13:16 UTC

Description Sebastian Soto 2020-01-10 16:06:41 UTC

Created attachment 1651337 [details]
ovn-master logs

Description of problem:
when I try to run multiple pods through separate jobs objects, I am consistently hitting:

```
Warning  FailedCreatePodSandBox  1s         kubelet, ip-10-0-143-56.ec2.internal  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_linux-curler-i-0acde91e06bcd05c9-job-r48gm_default_700c0522-b31b-4e48-87b6-306f4118f6ce_0(27a4497872143f1bedc044571c06cb183f8d0976dea76164986e74a97d37ae67): Multus: error adding pod to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to get pod annotation: timed out waiting for the condition
```

Version-Release number of selected component (if applicable):
4.4

How reproducible:
Very reproducable

Steps to Reproduce:
1. oc patch network.operator cluster --type=merge -p '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"hybridOverlayConfig":{"hybridClusterNetwork":[{"cidr":"10.132.0.0/14","hostPrefix":23}]}}}}}
2. Create a Windows deployment ( possibly optional)
   This can be done by following https://github.com/openshift/windows-machine-config-operator/tree/master/tools/ansible
3. Attempt to create multiple linux jobs/ deployments at the same time


Actual results:
Pod stuck in ContainerCreating status and describe show:
```
Name:               linux-curler-i-0acde91e06bcd05c9-job-r48gm
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               ip-10-0-143-56.ec2.internal/10.0.143.56
Start Time:         Fri, 10 Jan 2020 10:11:13 -0500
Labels:             controller-uid=4e70868e-c0b6-4ece-95cf-17a8b5a8bcd8
                    job-name=linux-curler-i-0acde91e06bcd05c9-job
Annotations:        k8s.v1.cni.cncf.io/networks-status: 
Status:             Pending
IP:                 
Controlled By:      Job/linux-curler-i-0acde91e06bcd05c9-job
Containers:
  linux-curler-i-0acde91e06bcd05c9:
    Container ID:  
    Image:         registry.access.redhat.com/ubi8/ubi:latest
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      bash
      -c
      yum update; yum install curl -y; curl 10.132.9.2
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-9bc8l (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  default-token-9bc8l:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-9bc8l
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age        From                                  Message
  ----     ------                  ----       ----                                  -------
  Normal   Scheduled               <unknown>  default-scheduler                     Successfully assigned default/linux-curler-i-0acde91e06bcd05c9-job-r48gm to ip-10-0-143-56.ec2.internal
  Warning  FailedCreatePodSandBox  1s         kubelet, ip-10-0-143-56.ec2.internal  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_linux-curler-i-0acde91e06bcd05c9-job-r48gm_default_700c0522-b31b-4e48-87b6-306f4118f6ce_0(27a4497872143f1bedc044571c06cb183f8d0976dea76164986e74a97d37ae67): Multus: error adding pod to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to get pod annotation: timed out waiting for the condition

```

Expected results:
Pod goes into running status with no errors

Additional info:

Comment 1 Sebastian Soto 2020-01-10 16:21:14 UTC

Alternatively, to reproduce you can clone: https://github.com/sebsoto/windows-machine-config-operator/tree/reproduce_bug
and run:
$ make run-wsu-ci-e2e-test
You will need to set the KUBECONFIG env variable to your kubeconfig, AWS_SHARED_CREDENTIALS_FILE to your .aws/credentials filepath, and KUBE_SSH_KEY_PATH to the path of openshift-dev.pem

Comment 2 Sebastian Soto 2020-01-13 16:42:39 UTC

Having tested further, creating a Windows deployment is unnecessary. The bug occurs after patching the network operator.
I was not able to reproduce without the patch.

Comment 3 Sebastian Soto 2020-01-14 19:33:48 UTC

Updated steps after a request for more detail:

Steps to Reproduce:
1. oc patch network.operator cluster --type=merge -p '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"hybridOverlayConfig":{"hybridClusterNetwork":[{"cidr":"10.132.0.0/14","hostPrefix":23}]}}}}}

2. Create a Windows node. This can be done by following https://github.com/openshift/windows-machine-config-operator/tree/master/tools/ansible

3. Create a Windows deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: win-webserver
  name: win-webserver
spec:
  selector:
    matchLabels:
      app: win-webserver
  replicas: 1
  template:
    metadata:
      labels:
        app: win-webserver
      name: win-webserver
    spec:
      tolerations:
      - key: "os"
        value: "Windows"
        Effect: "NoSchedule"
      containers:
      - name: windowswebserver
        image: mcr.microsoft.com/windows/servercore:ltsc2019
        imagePullPolicy: IfNotPresent
        command:
        - powershell.exe
        - -command
        - $listener = New-Object System.Net.HttpListener; $listener.Prefixes.Add('http://*:80/'); $listener.Start();Write-Host('Listening at http://*:80/'); while ($listener.IsListening) { $context = $listener.GetContext(); $response = $context.Response; $content='<html><body><H1>Windows Container Web Server</H1></body></html>'; $buffer = [System.Text.Encoding]::UTF8.GetBytes($content); $response.ContentLength64 = $buffer.Length; $response.OutputStream.Write($buffer, 0, $buffer.Length); $response.Close(); };
      nodeSelector:
        beta.kubernetes.io/os: windows

3. Attempt to create multiple linux jobs/ deployments at the same time as the Windows deployment:

apiVersion: batch/v1
kind: Job
metadata:
  name: linux-curler-test-1
  namespace: default
spec:
  template:
    metadata:
      creationTimestamp: null
    spec:
      containers:
      - command:
        - bash
        - -c
        - "yum update; yum install curl -y; curl -L redhat.com"
        image: registry.access.redhat.com/ubi8/ubi:latest
        imagePullPolicy: IfNotPresent
        name: linux-curler-test-1
      nodeSelector:
        beta.kubernetes.io/os: linux
      restartPolicy: Never
      terminationGracePeriodSeconds: 30
---
apiVersion: batch/v1
kind: Job
metadata:
  name: linux-curler-test-2
  namespace: default
spec:
  template:
    metadata:
      creationTimestamp: null
    spec:
      containers:
      - command:
        - bash
        - -c
        - "yum update; yum install curl -y; curl -L redhat.com"
        image: registry.access.redhat.com/ubi8/ubi:latest
        imagePullPolicy: IfNotPresent
        name: linux-curler-test-2
      nodeSelector:
        beta.kubernetes.io/os: linux
      restartPolicy: Never
      terminationGracePeriodSeconds: 30
---
apiVersion: batch/v1
kind: Job
metadata:
  name: linux-curler-test-3
  namespace: default
spec:
  template:
    metadata:
      creationTimestamp: null
    spec:
      containers:
      - command:
        - bash
        - -c
        - "yum update; yum install curl -y; curl -L redhat.com"
        image: registry.access.redhat.com/ubi8/ubi:latest
        imagePullPolicy: IfNotPresent
        name: linux-curler-test-3
      nodeSelector:
        beta.kubernetes.io/os: linux
      restartPolicy: Never
      terminationGracePeriodSeconds: 30
---
apiVersion: batch/v1
kind: Job
metadata:
  name: linux-curler-test-4
  namespace: default
spec:
  template:
    metadata:
      creationTimestamp: null
    spec:
      containers:
      - command:
        - bash
        - -c
        - "yum update; yum install curl -y; curl -L redhat.com"
        image: registry.access.redhat.com/ubi8/ubi:latest
        imagePullPolicy: IfNotPresent
        name: linux-curler-test-4
      nodeSelector:
        beta.kubernetes.io/os: linux
      restartPolicy: Never
      terminationGracePeriodSeconds: 30

5. Check pod events, 
oc get events |grep Warning

6. If issue is not reproduced,  try recreating the or scaling up the Windows deployment

Comment 4 gaoshang 2020-01-20 09:55:05 UTC

This bug also exist in OCP 4.3, and I thinks it caused by: when creating/destroying windows pod, it takes more than 90 seconds to configure logical switch.

Version:
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-01-18-010201   True        False         6h9m    Cluster version is 4.3.0-0.nightly-2020-01-18-010201
windows-machine-config-operator# git show
commit 1fbcb18cac33848454e59ddd619ea385082400c0

Steps:
1, Install OCP 4.3 with ovnkubenetes and apply patch
# oc patch network.operator cluster --type=merge -p '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"hybridOverlayConfig":{"hybridClusterNetwork":[{"cidr":"10.132.0.0/14","hostPrefix":23}]}}}}}'
2, Create a Windows node. This can be done by following https://github.com/openshift/windows-machine-config-operator/tree/master/tools/ansible

3, Create/destroy windows deployment(Step 3 in Comment 3), check ovnkube-master log

After creating windows deployment, though pod in READY status, check ovnkube-master log, it's still running "addLogicalPort" and get time out error. If running linux job before "addLogicalPort" process finished, then I get this "Pod sandbox creation failures", if wait minutes and running after "addLogicalPort" finished, this issue not exist anymore.


# oc create -f WinDeployment.yaml
# oc get pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE   IP           NODE                                       NOMINATED NODE   READINESS GATES
win-webserver-6f5bdc5b95-hb98b   1/1     Running   0          37s   10.132.0.3   ip-10-0-26-51.us-east-2.compute.internal   <none>           <none>
# oc logs -f pod/ovnkube-master-2kdjf -c ovnkube-master -n openshift-ovn-kubernetes
...
time="2020-01-20T08:56:28Z" level=error msg="timed out waiting for node \"ip-10-0-26-51.us-east-2.compute.internal\" logical switch: timed out waiting for the condition"
time="2020-01-20T08:56:28Z" level=info msg="[default/win-webserver-6f5bdc5b95-dhjjf] addLogicalPort took 30.013745132s"
time="2020-01-20T08:56:28Z" level=error msg="timed out waiting for the condition"
time="2020-01-20T08:56:58Z" level=error msg="timed out waiting for node \"ip-10-0-26-51.us-east-2.compute.internal\" logical switch: timed out waiting for the condition"
time="2020-01-20T08:56:58Z" level=info msg="[default/win-webserver-6f5bdc5b95-dhjjf] addLogicalPort took 30.00992975s"
time="2020-01-20T08:56:58Z" level=error msg="timed out waiting for the condition"
time="2020-01-20T08:57:28Z" level=error msg="timed out waiting for node \"ip-10-0-26-51.us-east-2.compute.internal\" logical switch: timed out waiting for the condition"
time="2020-01-20T08:57:28Z" level=info msg="[default/win-webserver-6f5bdc5b95-dhjjf] addLogicalPort took 30.010600728s"
time="2020-01-20T08:57:28Z" level=error msg="timed out waiting for the condition"


# cat Job.yaml 
apiVersion: batch/v1
kind: Job
metadata:
  name: linux-curler-test-1
  namespace: default
spec:
  template:
    metadata:
      creationTimestamp: null
    spec:
      containers:
      - command:
        - bash
        - -c
        - "yum update; yum install curl -y; curl -L redhat.com"
        image: registry.access.redhat.com/ubi8/ubi:latest
        imagePullPolicy: IfNotPresent
        name: linux-curler-test-1
      nodeSelector:
        beta.kubernetes.io/os: linux
      restartPolicy: Never
      terminationGracePeriodSeconds: 30

Running linux job before "addLogicalPort" process finished:

# oc create -f Job.yaml
# oc describe pod/linux-curler-test-1-zlvnn
...
Events:
  Type     Reason                  Age        From                                                 Message
  ----     ------                  ----       ----                                                 -------
  Normal   Scheduled               <unknown>  default-scheduler                                    Successfully assigned default/linux-curler-test-1-zlvnn to ip-10-0-169-112.us-east-2.compute.internal
  Warning  FailedCreatePodSandBox  108s       kubelet, ip-10-0-169-112.us-east-2.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_linux-curler-test-1-zlvnn_default_97a5634b-5a15-4aaf-8a41-730f2748334c_0(81ab6f6c2928763a1f64d397b5662fe51293cf0e00d728f31797b3af3a68a7f4): Multus: error adding pod to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to get pod annotation: timed out waiting for the condition
'
  Warning  FailedCreatePodSandBox  79s  kubelet, ip-10-0-169-112.us-east-2.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_linux-curler-test-1-zlvnn_default_97a5634b-5a15-4aaf-8a41-730f2748334c_0(d39b9439c01b6c50ed521f43fffd2b33658ee5156c5457a669a4598c09d1cf18): Multus: error adding pod to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to get pod annotation: timed out waiting for the condition
'
  Normal  Pulled   51s  kubelet, ip-10-0-169-112.us-east-2.compute.internal  Container image "registry.access.redhat.com/ubi8/ubi:latest" already present on machine
  Normal  Created  51s  kubelet, ip-10-0-169-112.us-east-2.compute.internal  Created container linux-curler-test-1
  Normal  Started  51s  kubelet, ip-10-0-169-112.us-east-2.compute.internal  Started container linux-curler-test-1


Wait minutes and running linux job after "addLogicalPort" process finished:
# oc create -f Job.yaml
# oc describe pod/linux-curler-test-1-zlvnn
Events:
  Type    Reason     Age        From                                                 Message
  ----    ------     ----       ----                                                 -------
  Normal  Scheduled  <unknown>  default-scheduler                                    Successfully assigned default/linux-curler-test-1-sh9mq to ip-10-0-169-112.us-east-2.compute.internal
  Normal  Pulled     <invalid>  kubelet, ip-10-0-169-112.us-east-2.compute.internal  Container image "registry.access.redhat.com/ubi8/ubi:latest" already present on machine
  Normal  Created    <invalid>  kubelet, ip-10-0-169-112.us-east-2.compute.internal  Created container linux-curler-test-1
  Normal  Started    <invalid>  kubelet, ip-10-0-169-112.us-east-2.compute.internal  Started container linux-curler-test-1


Deleting windows pod have the same time out issue

Comment 6 Scott Dodson 2020-02-25 16:05:59 UTC

PR referenced is against master so moving target release to 4.5, if this needs to be backported to 4.4 please clone this bug.

Comment 9 gaoshang 2020-03-06 07:06:47 UTC

Since the target release moved to 4.5, but install cluster with ovn kubernetes failed currently, please see Bug 1810896, so this bug verification is blocked, will verify it when OCP 4.5 cluster with ovn kubernetes available, thanks.

Comment 10 gaoshang 2020-03-06 07:55:06 UTC

In OCP 4.4.0-0.nightly-2020-03-05-010704, I can not reproduce this bug anymore.

Comment 11 Phil Cameron 2020-03-06 13:41:57 UTC

*** Bug 1810640 has been marked as a duplicate of this bug. ***

Comment 12 gaoshang 2020-03-10 04:17:33 UTC

This bug fix in OCP 4.5.0-0.nightly-2020-03-09-195717, moved status to VERIFIED, thanks.

Steps:

1. Add windows node to cluster
2. Create windows and linux pod together, check both succeed without "FailedCreatePodSandBox" warning.
# oc create -f WinWebServer.yaml; oc create -f Job.yaml
3. Delete linux pod
# oc delete job.batch/linux-curler-test-1 job.batch/linux-curler-test-2 job.batch/linux-curler-test-3 job.batch/linux-curler-test-4
4. Delete windows pod and create linux pod again at the same time, check linux pod creation succeed without "FailedCreatePodSandBox" warning.
# oc delete deployment.apps/win-webserver; oc create -f Job.yaml

Info:
# cat Job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: linux-curler-test-1
  namespace: default
spec:
  template:
    metadata:
      creationTimestamp: null
    spec:
      containers:
      - command:
        - bash
        - -c
        - "yum update; yum install curl -y; curl -L redhat.com"
        image: registry.access.redhat.com/ubi8/ubi:latest
        imagePullPolicy: IfNotPresent
        name: linux-curler-test-1
      nodeSelector:
        beta.kubernetes.io/os: linux
      restartPolicy: Never
      terminationGracePeriodSeconds: 30
---
apiVersion: batch/v1
kind: Job
metadata:
  name: linux-curler-test-2
  namespace: default
spec:
  template:
    metadata:
      creationTimestamp: null
    spec:
      containers:
      - command:
        - bash
        - -c
        - "yum update; yum install curl -y; curl -L redhat.com"
        image: registry.access.redhat.com/ubi8/ubi:latest
        imagePullPolicy: IfNotPresent
        name: linux-curler-test-2
      nodeSelector:
        beta.kubernetes.io/os: linux
      restartPolicy: Never
      terminationGracePeriodSeconds: 30
---
apiVersion: batch/v1
kind: Job
metadata:
  name: linux-curler-test-3
  namespace: default
spec:
  template:
    metadata:
      creationTimestamp: null
    spec:
      containers:
      - command:
        - bash
        - -c
        - "yum update; yum install curl -y; curl -L redhat.com"
        image: registry.access.redhat.com/ubi8/ubi:latest
        imagePullPolicy: IfNotPresent
        name: linux-curler-test-3
      nodeSelector:
        beta.kubernetes.io/os: linux
      restartPolicy: Never
      terminationGracePeriodSeconds: 30
---
apiVersion: batch/v1
kind: Job
metadata:
  name: linux-curler-test-4
  namespace: default
spec:
  template:
    metadata:
      creationTimestamp: null
    spec:
      containers:
      - command:
        - bash
        - -c
        - "yum update; yum install curl -y; curl -L redhat.com"
        image: registry.access.redhat.com/ubi8/ubi:latest
        imagePullPolicy: IfNotPresent
        name: linux-curler-test-4
      nodeSelector:
        beta.kubernetes.io/os: linux
      restartPolicy: Never
      terminationGracePeriodSeconds: 30

# cat WinWebServer.yaml 
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: win-webserver
  name: win-webserver
spec:
  replicas: 1
  selector:
    matchLabels:
      app: win-webserver
  template:
    metadata:
      labels:
        app: win-webserver
      name: win-webserver
    spec:
      tolerations:
      - key: "os"
        value: "Windows"
        Effect: "NoSchedule"
      containers:
      - name: windowswebserver
        image: mcr.microsoft.com/windows/servercore:ltsc2019
        imagePullPolicy: IfNotPresent
        # ---
        # NOTE: Use privileged security context in none default project, need to add-scc-to-user, e.g.
        # oc adm policy add-scc-to-user privileged system:serviceaccount:sgao:default
        # oc get scc privileged -o yaml
        # ---
        # securityContext:
        #   privileged: true
        command:
        - powershell.exe
        - -command
        - "<#code used from https://gist.github.com/wagnerandrade/5424431#> ; $$listener = New-Object System.Net.HttpListener ; $$listener.Prefixes.Add('http://*:80/') ; $$listener.Start() ; $$callerCounts = @{} ; Write-Host('Listening at http://*:80/') ; while ($$listener.IsListening) { ;$$context = $$listener.GetContext() ;$$requestUrl = $$context.Request.Url ;$$clientIP = $$context.Request.RemoteEndPoint.Address ;$$response = $$context.Response ;Write-Host '' ;Write-Host('> {0}' -f $$requestUrl) ;  ;$$count = 1 ;$$k=$$callerCounts.Get_Item($$clientIP) ;if ($$k -ne $$null) { $$count += $$k } ;$$callerCounts.Set_Item($$clientIP, $$count) ;$$ip=(Get-NetAdapter | Get-NetIpAddress); $$header='<html><body><H1>Windows Container Web Server</H1>' ;$$callerCountsString='' ;$$callerCounts.Keys | % { $$callerCountsString+='<p>IP {0} callerCount {1} ' -f $$ip[1].IPAddress,$$callerCounts.Item($$_) } ;$$footer='</body></html>' ;$$content='{0}{1}{2}' -f $$header,$$callerCountsString,$$footer ;Write-Output $$content ;$$buffer = [System.Text.Encoding]::UTF8.GetBytes($$content) ;$$response.ContentLength64 = $$buffer.Length ;$$response.OutputStream.Write($$buffer, 0, $$buffer.Length) ;$$response.Close() ;$$responseStatus = $$response.StatusCode ;Write-Host('< {0}' -f $$responseStatus)  } ; "
      nodeSelector:
        beta.kubernetes.io/os: windows

Comment 14 errata-xmlrpc 2020-07-13 17:13:00 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

Note You need to log in before you can comment on or make changes to this bug.