Created attachment 1651337 [details]
ovn-master logs

Description of problem:
When I try to run multiple pods through separate Job objects, I consistently hit:

```
Warning  FailedCreatePodSandBox  1s  kubelet, ip-10-0-143-56.ec2.internal  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_linux-curler-i-0acde91e06bcd05c9-job-r48gm_default_700c0522-b31b-4e48-87b6-306f4118f6ce_0(27a4497872143f1bedc044571c06cb183f8d0976dea76164986e74a97d37ae67): Multus: error adding pod to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to get pod annotation: timed out waiting for the condition
```

Version-Release number of selected component (if applicable): 4.4

How reproducible: Very reproducible

Steps to Reproduce:
1. oc patch network.operator cluster --type=merge -p '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"hybridOverlayConfig":{"hybridClusterNetwork":[{"cidr":"10.132.0.0/14","hostPrefix":23}]}}}}}'
2. Create a Windows deployment (possibly optional). This can be done by following https://github.com/openshift/windows-machine-config-operator/tree/master/tools/ansible
3. Attempt to create multiple Linux jobs/deployments at the same time

Actual results:
Pod is stuck in ContainerCreating status and describe shows:

```
Name:               linux-curler-i-0acde91e06bcd05c9-job-r48gm
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               ip-10-0-143-56.ec2.internal/10.0.143.56
Start Time:         Fri, 10 Jan 2020 10:11:13 -0500
Labels:             controller-uid=4e70868e-c0b6-4ece-95cf-17a8b5a8bcd8
                    job-name=linux-curler-i-0acde91e06bcd05c9-job
Annotations:        k8s.v1.cni.cncf.io/networks-status:
Status:             Pending
IP:
Controlled By:      Job/linux-curler-i-0acde91e06bcd05c9-job
Containers:
  linux-curler-i-0acde91e06bcd05c9:
    Container ID:
    Image:         registry.access.redhat.com/ubi8/ubi:latest
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      bash
      -c
      yum update; yum install curl -y; curl 10.132.9.2
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-9bc8l (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  default-token-9bc8l:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-9bc8l
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age        From                                  Message
  ----     ------                  ----       ----                                  -------
  Normal   Scheduled               <unknown>  default-scheduler                     Successfully assigned default/linux-curler-i-0acde91e06bcd05c9-job-r48gm to ip-10-0-143-56.ec2.internal
  Warning  FailedCreatePodSandBox  1s         kubelet, ip-10-0-143-56.ec2.internal  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_linux-curler-i-0acde91e06bcd05c9-job-r48gm_default_700c0522-b31b-4e48-87b6-306f4118f6ce_0(27a4497872143f1bedc044571c06cb183f8d0976dea76164986e74a97d37ae67): Multus: error adding pod to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to get pod annotation: timed out waiting for the condition
```

Expected results:
Pod goes into Running status with no errors

Additional info:
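For triage context (illustrative commands, not from the attachment): the CNI plugin is timing out while waiting for ovnkube-master to write the pod's network annotation, so two useful things to capture while the pod is stuck are the pod's annotations and the ovnkube-master log. A minimal sketch, assuming a standard oc client; the exact annotation key written by ovn-kubernetes may vary by release, and <ovnkube-master-pod> is a placeholder:

```
# Does the stuck pod ever receive a network annotation from ovnkube-master?
oc -n default get pod linux-curler-i-0acde91e06bcd05c9-job-r48gm -o jsonpath='{.metadata.annotations}'

# Follow the master log while the pod sits in ContainerCreating
oc -n openshift-ovn-kubernetes logs -f <ovnkube-master-pod> -c ovnkube-master
```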
Alternatively, to reproduce you can clone https://github.com/sebsoto/windows-machine-config-operator/tree/reproduce_bug and run:

$ make run-wsu-ci-e2e-test

You will need to set the KUBECONFIG env variable to your kubeconfig, AWS_SHARED_CREDENTIALS_FILE to your .aws/credentials filepath, and KUBE_SSH_KEY_PATH to the path of openshift-dev.pem.
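As an illustration of that setup, the environment might look like the following before running the make target; the specific file paths are placeholders, not taken from the report:

```
# Hypothetical paths; substitute your own cluster kubeconfig, AWS credentials file, and SSH key
export KUBECONFIG=~/cluster/auth/kubeconfig
export AWS_SHARED_CREDENTIALS_FILE=~/.aws/credentials
export KUBE_SSH_KEY_PATH=~/.ssh/openshift-dev.pem
make run-wsu-ci-e2e-test
```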
Having tested further, creating a Windows deployment is unnecessary: the bug occurs after patching the network operator. I was not able to reproduce it without the patch.
Updated steps after a request for more detail:

Steps to Reproduce:
1. oc patch network.operator cluster --type=merge -p '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"hybridOverlayConfig":{"hybridClusterNetwork":[{"cidr":"10.132.0.0/14","hostPrefix":23}]}}}}}'
2. Create a Windows node. This can be done by following https://github.com/openshift/windows-machine-config-operator/tree/master/tools/ansible
3. Create a Windows deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: win-webserver
  name: win-webserver
spec:
  selector:
    matchLabels:
      app: win-webserver
  replicas: 1
  template:
    metadata:
      labels:
        app: win-webserver
      name: win-webserver
    spec:
      tolerations:
      - key: "os"
        value: "Windows"
        effect: "NoSchedule"
      containers:
      - name: windowswebserver
        image: mcr.microsoft.com/windows/servercore:ltsc2019
        imagePullPolicy: IfNotPresent
        command:
        - powershell.exe
        - -command
        - $listener = New-Object System.Net.HttpListener; $listener.Prefixes.Add('http://*:80/'); $listener.Start();Write-Host('Listening at http://*:80/'); while ($listener.IsListening) { $context = $listener.GetContext(); $response = $context.Response; $content='<html><body><H1>Windows Container Web Server</H1></body></html>'; $buffer = [System.Text.Encoding]::UTF8.GetBytes($content); $response.ContentLength64 = $buffer.Length; $response.OutputStream.Write($buffer, 0, $buffer.Length); $response.Close(); };
      nodeSelector:
        beta.kubernetes.io/os: windows

4. Attempt to create multiple Linux jobs/deployments at the same time as the Windows deployment:

apiVersion: batch/v1
kind: Job
metadata:
  name: linux-curler-test-1
  namespace: default
spec:
  template:
    metadata:
      creationTimestamp: null
    spec:
      containers:
      - command:
        - bash
        - -c
        - "yum update; yum install curl -y; curl -L redhat.com"
        image: registry.access.redhat.com/ubi8/ubi:latest
        imagePullPolicy: IfNotPresent
        name: linux-curler-test-1
      nodeSelector:
        beta.kubernetes.io/os: linux
      restartPolicy: Never
      terminationGracePeriodSeconds: 30
---
apiVersion: batch/v1
kind: Job
metadata:
  name: linux-curler-test-2
  namespace: default
spec:
  template:
    metadata:
      creationTimestamp: null
    spec:
      containers:
      - command:
        - bash
        - -c
        - "yum update; yum install curl -y; curl -L redhat.com"
        image: registry.access.redhat.com/ubi8/ubi:latest
        imagePullPolicy: IfNotPresent
        name: linux-curler-test-2
      nodeSelector:
        beta.kubernetes.io/os: linux
      restartPolicy: Never
      terminationGracePeriodSeconds: 30
---
apiVersion: batch/v1
kind: Job
metadata:
  name: linux-curler-test-3
  namespace: default
spec:
  template:
    metadata:
      creationTimestamp: null
    spec:
      containers:
      - command:
        - bash
        - -c
        - "yum update; yum install curl -y; curl -L redhat.com"
        image: registry.access.redhat.com/ubi8/ubi:latest
        imagePullPolicy: IfNotPresent
        name: linux-curler-test-3
      nodeSelector:
        beta.kubernetes.io/os: linux
      restartPolicy: Never
      terminationGracePeriodSeconds: 30
---
apiVersion: batch/v1
kind: Job
metadata:
  name: linux-curler-test-4
  namespace: default
spec:
  template:
    metadata:
      creationTimestamp: null
    spec:
      containers:
      - command:
        - bash
        - -c
        - "yum update; yum install curl -y; curl -L redhat.com"
        image: registry.access.redhat.com/ubi8/ubi:latest
        imagePullPolicy: IfNotPresent
        name: linux-curler-test-4
      nodeSelector:
        beta.kubernetes.io/os: linux
      restartPolicy: Never
      terminationGracePeriodSeconds: 30

5. Check pod events: oc get events | grep Warning
6. If the issue is not reproduced, try recreating or scaling up the Windows deployment.
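Not part of the original steps, but a quick sanity check before step 3 is to confirm the hybrid overlay patch from step 1 was accepted; a sketch using standard oc jsonpath output:

```
# Should echo back the hybridClusterNetwork CIDR set in step 1
oc get network.operator cluster \
  -o jsonpath='{.spec.defaultNetwork.ovnKubernetesConfig.hybridOverlayConfig}'
```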
This bug also exists in OCP 4.3, and I think it is caused by the following: when creating/destroying a Windows pod, it takes more than 90 seconds to configure the logical switch.

Version:
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-01-18-010201   True        False         6h9m    Cluster version is 4.3.0-0.nightly-2020-01-18-010201

windows-machine-config-operator# git show
commit 1fbcb18cac33848454e59ddd619ea385082400c0

Steps:
1. Install OCP 4.3 with OVN-Kubernetes and apply the patch:
# oc patch network.operator cluster --type=merge -p '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"hybridOverlayConfig":{"hybridClusterNetwork":[{"cidr":"10.132.0.0/14","hostPrefix":23}]}}}}}'
2. Create a Windows node. This can be done by following https://github.com/openshift/windows-machine-config-operator/tree/master/tools/ansible
3. Create/destroy the Windows deployment (step 3 in comment 3) and check the ovnkube-master log.

After creating the Windows deployment, even though the pod shows READY, the ovnkube-master log is still running "addLogicalPort" and getting timeout errors. If I run the Linux job before the "addLogicalPort" process has finished, I hit the pod sandbox creation failures; if I wait a few minutes and run it after "addLogicalPort" has finished, the issue no longer occurs.

# oc create -f WinDeployment.yaml
# oc get pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE   IP           NODE                                       NOMINATED NODE   READINESS GATES
win-webserver-6f5bdc5b95-hb98b   1/1     Running   0          37s   10.132.0.3   ip-10-0-26-51.us-east-2.compute.internal   <none>           <none>

# oc logs -f pod/ovnkube-master-2kdjf -c ovnkube-master -n openshift-ovn-kubernetes
...
time="2020-01-20T08:56:28Z" level=error msg="timed out waiting for node \"ip-10-0-26-51.us-east-2.compute.internal\" logical switch: timed out waiting for the condition"
time="2020-01-20T08:56:28Z" level=info msg="[default/win-webserver-6f5bdc5b95-dhjjf] addLogicalPort took 30.013745132s"
time="2020-01-20T08:56:28Z" level=error msg="timed out waiting for the condition"
time="2020-01-20T08:56:58Z" level=error msg="timed out waiting for node \"ip-10-0-26-51.us-east-2.compute.internal\" logical switch: timed out waiting for the condition"
time="2020-01-20T08:56:58Z" level=info msg="[default/win-webserver-6f5bdc5b95-dhjjf] addLogicalPort took 30.00992975s"
time="2020-01-20T08:56:58Z" level=error msg="timed out waiting for the condition"
time="2020-01-20T08:57:28Z" level=error msg="timed out waiting for node \"ip-10-0-26-51.us-east-2.compute.internal\" logical switch: timed out waiting for the condition"
time="2020-01-20T08:57:28Z" level=info msg="[default/win-webserver-6f5bdc5b95-dhjjf] addLogicalPort took 30.010600728s"
time="2020-01-20T08:57:28Z" level=error msg="timed out waiting for the condition"

# cat Job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: linux-curler-test-1
  namespace: default
spec:
  template:
    metadata:
      creationTimestamp: null
    spec:
      containers:
      - command:
        - bash
        - -c
        - "yum update; yum install curl -y; curl -L redhat.com"
        image: registry.access.redhat.com/ubi8/ubi:latest
        imagePullPolicy: IfNotPresent
        name: linux-curler-test-1
      nodeSelector:
        beta.kubernetes.io/os: linux
      restartPolicy: Never
      terminationGracePeriodSeconds: 30

Running the Linux job before the "addLogicalPort" process finished:
# oc create -f Job.yaml
# oc describe pod/linux-curler-test-1-zlvnn
...
Events:
  Type     Reason                  Age        From                                                 Message
  ----     ------                  ----       ----                                                 -------
  Normal   Scheduled               <unknown>  default-scheduler                                    Successfully assigned default/linux-curler-test-1-zlvnn to ip-10-0-169-112.us-east-2.compute.internal
  Warning  FailedCreatePodSandBox  108s       kubelet, ip-10-0-169-112.us-east-2.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_linux-curler-test-1-zlvnn_default_97a5634b-5a15-4aaf-8a41-730f2748334c_0(81ab6f6c2928763a1f64d397b5662fe51293cf0e00d728f31797b3af3a68a7f4): Multus: error adding pod to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to get pod annotation: timed out waiting for the condition'
  Warning  FailedCreatePodSandBox  79s        kubelet, ip-10-0-169-112.us-east-2.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_linux-curler-test-1-zlvnn_default_97a5634b-5a15-4aaf-8a41-730f2748334c_0(d39b9439c01b6c50ed521f43fffd2b33658ee5156c5457a669a4598c09d1cf18): Multus: error adding pod to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to get pod annotation: timed out waiting for the condition'
  Normal   Pulled                  51s        kubelet, ip-10-0-169-112.us-east-2.compute.internal  Container image "registry.access.redhat.com/ubi8/ubi:latest" already present on machine
  Normal   Created                 51s        kubelet, ip-10-0-169-112.us-east-2.compute.internal  Created container linux-curler-test-1
  Normal   Started                 51s        kubelet, ip-10-0-169-112.us-east-2.compute.internal  Started container linux-curler-test-1

Waiting a few minutes and running the Linux job after the "addLogicalPort" process finished:
# oc create -f Job.yaml
# oc describe pod/linux-curler-test-1-sh9mq
Events:
  Type    Reason     Age        From                                                 Message
  ----    ------     ----       ----                                                 -------
  Normal  Scheduled  <unknown>  default-scheduler                                    Successfully assigned default/linux-curler-test-1-sh9mq to ip-10-0-169-112.us-east-2.compute.internal
  Normal  Pulled     <invalid>  kubelet, ip-10-0-169-112.us-east-2.compute.internal  Container image "registry.access.redhat.com/ubi8/ubi:latest" already present on machine
  Normal  Created    <invalid>  kubelet, ip-10-0-169-112.us-east-2.compute.internal  Created container linux-curler-test-1
  Normal  Started    <invalid>  kubelet, ip-10-0-169-112.us-east-2.compute.internal  Started container linux-curler-test-1

Deleting the Windows pod hits the same timeout issue.
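As an illustration of the timing dependency described above (a workaround sketch, not a fix): hold off creating the Linux jobs until the repeated logical-switch timeouts stop appearing in the ovnkube-master log. The pod name below is taken from the logs above; substitute your own:

```
# Watch the master log; wait for the repeated
# "timed out waiting for node ... logical switch" messages to stop
oc -n openshift-ovn-kubernetes logs -f pod/ovnkube-master-2kdjf -c ovnkube-master \
  | grep -E 'addLogicalPort took|logical switch'

# In another terminal, once the log is quiet, create the jobs
oc create -f Job.yaml
```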
The PR referenced is against master, so moving the target release to 4.5; if this needs to be backported to 4.4, please clone this bug.
The target release has moved to 4.5, but installing a cluster with OVN-Kubernetes currently fails (see Bug 1810896), so verification of this bug is blocked. Will verify once an OCP 4.5 cluster with OVN-Kubernetes is available, thanks.
In OCP 4.4.0-0.nightly-2020-03-05-010704, I can no longer reproduce this bug.
*** Bug 1810640 has been marked as a duplicate of this bug. ***
This bug is fixed in OCP 4.5.0-0.nightly-2020-03-09-195717; moving status to VERIFIED, thanks.

Steps:
1. Add a Windows node to the cluster.
2. Create the Windows and Linux pods together; check that both succeed without a "FailedCreatePodSandBox" warning (see the event-check sketch after the YAML below).
# oc create -f WinWebServer.yaml; oc create -f Job.yaml
3. Delete the Linux pods.
# oc delete job.batch/linux-curler-test-1 job.batch/linux-curler-test-2 job.batch/linux-curler-test-3 job.batch/linux-curler-test-4
4. Delete the Windows pod and create the Linux pods again at the same time; check that Linux pod creation succeeds without a "FailedCreatePodSandBox" warning.
# oc delete deployment.apps/win-webserver; oc create -f Job.yaml

Info:
# cat Job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: linux-curler-test-1
  namespace: default
spec:
  template:
    metadata:
      creationTimestamp: null
    spec:
      containers:
      - command:
        - bash
        - -c
        - "yum update; yum install curl -y; curl -L redhat.com"
        image: registry.access.redhat.com/ubi8/ubi:latest
        imagePullPolicy: IfNotPresent
        name: linux-curler-test-1
      nodeSelector:
        beta.kubernetes.io/os: linux
      restartPolicy: Never
      terminationGracePeriodSeconds: 30
---
apiVersion: batch/v1
kind: Job
metadata:
  name: linux-curler-test-2
  namespace: default
spec:
  template:
    metadata:
      creationTimestamp: null
    spec:
      containers:
      - command:
        - bash
        - -c
        - "yum update; yum install curl -y; curl -L redhat.com"
        image: registry.access.redhat.com/ubi8/ubi:latest
        imagePullPolicy: IfNotPresent
        name: linux-curler-test-2
      nodeSelector:
        beta.kubernetes.io/os: linux
      restartPolicy: Never
      terminationGracePeriodSeconds: 30
---
apiVersion: batch/v1
kind: Job
metadata:
  name: linux-curler-test-3
  namespace: default
spec:
  template:
    metadata:
      creationTimestamp: null
    spec:
      containers:
      - command:
        - bash
        - -c
        - "yum update; yum install curl -y; curl -L redhat.com"
        image: registry.access.redhat.com/ubi8/ubi:latest
        imagePullPolicy: IfNotPresent
        name: linux-curler-test-3
      nodeSelector:
        beta.kubernetes.io/os: linux
      restartPolicy: Never
      terminationGracePeriodSeconds: 30
---
apiVersion: batch/v1
kind: Job
metadata:
  name: linux-curler-test-4
  namespace: default
spec:
  template:
    metadata:
      creationTimestamp: null
    spec:
      containers:
      - command:
        - bash
        - -c
        - "yum update; yum install curl -y; curl -L redhat.com"
        image: registry.access.redhat.com/ubi8/ubi:latest
        imagePullPolicy: IfNotPresent
        name: linux-curler-test-4
      nodeSelector:
        beta.kubernetes.io/os: linux
      restartPolicy: Never
      terminationGracePeriodSeconds: 30

# cat WinWebServer.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: win-webserver
  name: win-webserver
spec:
  replicas: 1
  selector:
    matchLabels:
      app: win-webserver
  template:
    metadata:
      labels:
        app: win-webserver
      name: win-webserver
    spec:
      tolerations:
      - key: "os"
        value: "Windows"
        effect: "NoSchedule"
      containers:
      - name: windowswebserver
        image: mcr.microsoft.com/windows/servercore:ltsc2019
        imagePullPolicy: IfNotPresent
        # ---
        # NOTE: To use a privileged security context in a non-default project, you need to add-scc-to-user, e.g.
        # oc adm policy add-scc-to-user privileged system:serviceaccount:sgao:default
        # oc get scc privileged -o yaml
        # ---
        # securityContext:
        #   privileged: true
        command:
        - powershell.exe
        - -command
        - "<#code used from https://gist.github.com/wagnerandrade/5424431#> ; $$listener = New-Object System.Net.HttpListener ; $$listener.Prefixes.Add('http://*:80/') ; $$listener.Start() ; $$callerCounts = @{} ; Write-Host('Listening at http://*:80/') ; while ($$listener.IsListening) { ;$$context = $$listener.GetContext() ;$$requestUrl = $$context.Request.Url ;$$clientIP = $$context.Request.RemoteEndPoint.Address ;$$response = $$context.Response ;Write-Host '' ;Write-Host('> {0}' -f $$requestUrl) ; ;$$count = 1 ;$$k=$$callerCounts.Get_Item($$clientIP) ;if ($$k -ne $$null) { $$count += $$k } ;$$callerCounts.Set_Item($$clientIP, $$count) ;$$ip=(Get-NetAdapter | Get-NetIpAddress); $$header='<html><body><H1>Windows Container Web Server</H1>' ;$$callerCountsString='' ;$$callerCounts.Keys | % { $$callerCountsString+='<p>IP {0} callerCount {1} ' -f $$ip[1].IPAddress,$$callerCounts.Item($$_) } ;$$footer='</body></html>' ;$$content='{0}{1}{2}' -f $$header,$$callerCountsString,$$footer ;Write-Output $$content ;$$buffer = [System.Text.Encoding]::UTF8.GetBytes($$content) ;$$response.ContentLength64 = $$buffer.Length ;$$response.OutputStream.Write($$buffer, 0, $$buffer.Length) ;$$response.Close() ;$$responseStatus = $$response.StatusCode ;Write-Host('< {0}' -f $$responseStatus) } ; "
      nodeSelector:
        beta.kubernetes.io/os: windows
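Supplementary check for steps 2 and 4 above (not part of the original verification commands): the warning can be filtered directly from events. A sketch, assuming the default namespace and standard oc event field selectors:

```
# Any hit here would indicate the sandbox-creation failure has come back
oc get events -n default --field-selector type=Warning | grep FailedCreatePodSandBox

# Confirm the Windows deployment and the four jobs reached Running/Completed
oc get pods -n default -o wide
```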
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409