Bug 1789881
| Summary: | Pod sandbox creation failures | ||||||
|---|---|---|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Sebastian Soto <ssoto> | ||||
| Component: | Networking | Assignee: | Jacob Tanenbaum <jtanenba> | ||||
| Networking sub component: | ovn-kubernetes | QA Contact: | gaoshang <sgao> | ||||
| Status: | CLOSED ERRATA | Docs Contact: | |||||
| Severity: | high | ||||||
| Priority: | unspecified | CC: | aconstan, aravindh, bbennett, nagrawal, pcameron, sdodson, sgao | ||||
| Version: | 4.4 | ||||||
| Target Milestone: | --- | ||||||
| Target Release: | 4.5.0 | ||||||
| Hardware: | x86_64 | ||||||
| OS: | All | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | |||||||
| : | 1811077 (view as bug list) | Environment: | |||||
| Last Closed: | 2020-07-13 17:13:00 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | |||||||
| Bug Blocks: | 1811077 | ||||||
| Attachments: |
|
||||||
|
Description
Sebastian Soto
2020-01-10 16:06:41 UTC
Alternatively, to reproduce you can clone: https://github.com/sebsoto/windows-machine-config-operator/tree/reproduce_bug and run: $ make run-wsu-ci-e2e-test You will need to set the KUBECONFIG env variable to your kubeconfig, AWS_SHARED_CREDENTIALS_FILE to your .aws/credentials filepath, and KUBE_SSH_KEY_PATH to the path of openshift-dev.pem Having tested further, creating a Windows deployment is unnecessary. The bug occurs after patching the network operator. I was not able to reproduce without the patch. Updated steps after a request for more detail:
Steps to Reproduce:
1. oc patch network.operator cluster --type=merge -p '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"hybridOverlayConfig":{"hybridClusterNetwork":[{"cidr":"10.132.0.0/14","hostPrefix":23}]}}}}}
2. Create a Windows node. This can be done by following https://github.com/openshift/windows-machine-config-operator/tree/master/tools/ansible
3. Create a Windows deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: win-webserver
name: win-webserver
spec:
selector:
matchLabels:
app: win-webserver
replicas: 1
template:
metadata:
labels:
app: win-webserver
name: win-webserver
spec:
tolerations:
- key: "os"
value: "Windows"
Effect: "NoSchedule"
containers:
- name: windowswebserver
image: mcr.microsoft.com/windows/servercore:ltsc2019
imagePullPolicy: IfNotPresent
command:
- powershell.exe
- -command
- $listener = New-Object System.Net.HttpListener; $listener.Prefixes.Add('http://*:80/'); $listener.Start();Write-Host('Listening at http://*:80/'); while ($listener.IsListening) { $context = $listener.GetContext(); $response = $context.Response; $content='<html><body><H1>Windows Container Web Server</H1></body></html>'; $buffer = [System.Text.Encoding]::UTF8.GetBytes($content); $response.ContentLength64 = $buffer.Length; $response.OutputStream.Write($buffer, 0, $buffer.Length); $response.Close(); };
nodeSelector:
beta.kubernetes.io/os: windows
3. Attempt to create multiple linux jobs/ deployments at the same time as the Windows deployment:
apiVersion: batch/v1
kind: Job
metadata:
name: linux-curler-test-1
namespace: default
spec:
template:
metadata:
creationTimestamp: null
spec:
containers:
- command:
- bash
- -c
- "yum update; yum install curl -y; curl -L redhat.com"
image: registry.access.redhat.com/ubi8/ubi:latest
imagePullPolicy: IfNotPresent
name: linux-curler-test-1
nodeSelector:
beta.kubernetes.io/os: linux
restartPolicy: Never
terminationGracePeriodSeconds: 30
---
apiVersion: batch/v1
kind: Job
metadata:
name: linux-curler-test-2
namespace: default
spec:
template:
metadata:
creationTimestamp: null
spec:
containers:
- command:
- bash
- -c
- "yum update; yum install curl -y; curl -L redhat.com"
image: registry.access.redhat.com/ubi8/ubi:latest
imagePullPolicy: IfNotPresent
name: linux-curler-test-2
nodeSelector:
beta.kubernetes.io/os: linux
restartPolicy: Never
terminationGracePeriodSeconds: 30
---
apiVersion: batch/v1
kind: Job
metadata:
name: linux-curler-test-3
namespace: default
spec:
template:
metadata:
creationTimestamp: null
spec:
containers:
- command:
- bash
- -c
- "yum update; yum install curl -y; curl -L redhat.com"
image: registry.access.redhat.com/ubi8/ubi:latest
imagePullPolicy: IfNotPresent
name: linux-curler-test-3
nodeSelector:
beta.kubernetes.io/os: linux
restartPolicy: Never
terminationGracePeriodSeconds: 30
---
apiVersion: batch/v1
kind: Job
metadata:
name: linux-curler-test-4
namespace: default
spec:
template:
metadata:
creationTimestamp: null
spec:
containers:
- command:
- bash
- -c
- "yum update; yum install curl -y; curl -L redhat.com"
image: registry.access.redhat.com/ubi8/ubi:latest
imagePullPolicy: IfNotPresent
name: linux-curler-test-4
nodeSelector:
beta.kubernetes.io/os: linux
restartPolicy: Never
terminationGracePeriodSeconds: 30
5. Check pod events,
oc get events |grep Warning
6. If issue is not reproduced, try recreating the or scaling up the Windows deployment
This bug also exist in OCP 4.3, and I thinks it caused by: when creating/destroying windows pod, it takes more than 90 seconds to configure logical switch.
Version:
# oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.3.0-0.nightly-2020-01-18-010201 True False 6h9m Cluster version is 4.3.0-0.nightly-2020-01-18-010201
windows-machine-config-operator# git show
commit 1fbcb18cac33848454e59ddd619ea385082400c0
Steps:
1, Install OCP 4.3 with ovnkubenetes and apply patch
# oc patch network.operator cluster --type=merge -p '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"hybridOverlayConfig":{"hybridClusterNetwork":[{"cidr":"10.132.0.0/14","hostPrefix":23}]}}}}}'
2, Create a Windows node. This can be done by following https://github.com/openshift/windows-machine-config-operator/tree/master/tools/ansible
3, Create/destroy windows deployment(Step 3 in Comment 3), check ovnkube-master log
After creating windows deployment, though pod in READY status, check ovnkube-master log, it's still running "addLogicalPort" and get time out error. If running linux job before "addLogicalPort" process finished, then I get this "Pod sandbox creation failures", if wait minutes and running after "addLogicalPort" finished, this issue not exist anymore.
# oc create -f WinDeployment.yaml
# oc get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
win-webserver-6f5bdc5b95-hb98b 1/1 Running 0 37s 10.132.0.3 ip-10-0-26-51.us-east-2.compute.internal <none> <none>
# oc logs -f pod/ovnkube-master-2kdjf -c ovnkube-master -n openshift-ovn-kubernetes
...
time="2020-01-20T08:56:28Z" level=error msg="timed out waiting for node \"ip-10-0-26-51.us-east-2.compute.internal\" logical switch: timed out waiting for the condition"
time="2020-01-20T08:56:28Z" level=info msg="[default/win-webserver-6f5bdc5b95-dhjjf] addLogicalPort took 30.013745132s"
time="2020-01-20T08:56:28Z" level=error msg="timed out waiting for the condition"
time="2020-01-20T08:56:58Z" level=error msg="timed out waiting for node \"ip-10-0-26-51.us-east-2.compute.internal\" logical switch: timed out waiting for the condition"
time="2020-01-20T08:56:58Z" level=info msg="[default/win-webserver-6f5bdc5b95-dhjjf] addLogicalPort took 30.00992975s"
time="2020-01-20T08:56:58Z" level=error msg="timed out waiting for the condition"
time="2020-01-20T08:57:28Z" level=error msg="timed out waiting for node \"ip-10-0-26-51.us-east-2.compute.internal\" logical switch: timed out waiting for the condition"
time="2020-01-20T08:57:28Z" level=info msg="[default/win-webserver-6f5bdc5b95-dhjjf] addLogicalPort took 30.010600728s"
time="2020-01-20T08:57:28Z" level=error msg="timed out waiting for the condition"
# cat Job.yaml
apiVersion: batch/v1
kind: Job
metadata:
name: linux-curler-test-1
namespace: default
spec:
template:
metadata:
creationTimestamp: null
spec:
containers:
- command:
- bash
- -c
- "yum update; yum install curl -y; curl -L redhat.com"
image: registry.access.redhat.com/ubi8/ubi:latest
imagePullPolicy: IfNotPresent
name: linux-curler-test-1
nodeSelector:
beta.kubernetes.io/os: linux
restartPolicy: Never
terminationGracePeriodSeconds: 30
Running linux job before "addLogicalPort" process finished:
# oc create -f Job.yaml
# oc describe pod/linux-curler-test-1-zlvnn
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned default/linux-curler-test-1-zlvnn to ip-10-0-169-112.us-east-2.compute.internal
Warning FailedCreatePodSandBox 108s kubelet, ip-10-0-169-112.us-east-2.compute.internal Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_linux-curler-test-1-zlvnn_default_97a5634b-5a15-4aaf-8a41-730f2748334c_0(81ab6f6c2928763a1f64d397b5662fe51293cf0e00d728f31797b3af3a68a7f4): Multus: error adding pod to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to get pod annotation: timed out waiting for the condition
'
Warning FailedCreatePodSandBox 79s kubelet, ip-10-0-169-112.us-east-2.compute.internal Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_linux-curler-test-1-zlvnn_default_97a5634b-5a15-4aaf-8a41-730f2748334c_0(d39b9439c01b6c50ed521f43fffd2b33658ee5156c5457a669a4598c09d1cf18): Multus: error adding pod to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to get pod annotation: timed out waiting for the condition
'
Normal Pulled 51s kubelet, ip-10-0-169-112.us-east-2.compute.internal Container image "registry.access.redhat.com/ubi8/ubi:latest" already present on machine
Normal Created 51s kubelet, ip-10-0-169-112.us-east-2.compute.internal Created container linux-curler-test-1
Normal Started 51s kubelet, ip-10-0-169-112.us-east-2.compute.internal Started container linux-curler-test-1
Wait minutes and running linux job after "addLogicalPort" process finished:
# oc create -f Job.yaml
# oc describe pod/linux-curler-test-1-zlvnn
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned default/linux-curler-test-1-sh9mq to ip-10-0-169-112.us-east-2.compute.internal
Normal Pulled <invalid> kubelet, ip-10-0-169-112.us-east-2.compute.internal Container image "registry.access.redhat.com/ubi8/ubi:latest" already present on machine
Normal Created <invalid> kubelet, ip-10-0-169-112.us-east-2.compute.internal Created container linux-curler-test-1
Normal Started <invalid> kubelet, ip-10-0-169-112.us-east-2.compute.internal Started container linux-curler-test-1
Deleting windows pod have the same time out issue
PR referenced is against master so moving target release to 4.5, if this needs to be backported to 4.4 please clone this bug. Since the target release moved to 4.5, but install cluster with ovn kubernetes failed currently, please see Bug 1810896, so this bug verification is blocked, will verify it when OCP 4.5 cluster with ovn kubernetes available, thanks. In OCP 4.4.0-0.nightly-2020-03-05-010704, I can not reproduce this bug anymore. *** Bug 1810640 has been marked as a duplicate of this bug. *** This bug fix in OCP 4.5.0-0.nightly-2020-03-09-195717, moved status to VERIFIED, thanks.
Steps:
1. Add windows node to cluster
2. Create windows and linux pod together, check both succeed without "FailedCreatePodSandBox" warning.
# oc create -f WinWebServer.yaml; oc create -f Job.yaml
3. Delete linux pod
# oc delete job.batch/linux-curler-test-1 job.batch/linux-curler-test-2 job.batch/linux-curler-test-3 job.batch/linux-curler-test-4
4. Delete windows pod and create linux pod again at the same time, check linux pod creation succeed without "FailedCreatePodSandBox" warning.
# oc delete deployment.apps/win-webserver; oc create -f Job.yaml
Info:
# cat Job.yaml
apiVersion: batch/v1
kind: Job
metadata:
name: linux-curler-test-1
namespace: default
spec:
template:
metadata:
creationTimestamp: null
spec:
containers:
- command:
- bash
- -c
- "yum update; yum install curl -y; curl -L redhat.com"
image: registry.access.redhat.com/ubi8/ubi:latest
imagePullPolicy: IfNotPresent
name: linux-curler-test-1
nodeSelector:
beta.kubernetes.io/os: linux
restartPolicy: Never
terminationGracePeriodSeconds: 30
---
apiVersion: batch/v1
kind: Job
metadata:
name: linux-curler-test-2
namespace: default
spec:
template:
metadata:
creationTimestamp: null
spec:
containers:
- command:
- bash
- -c
- "yum update; yum install curl -y; curl -L redhat.com"
image: registry.access.redhat.com/ubi8/ubi:latest
imagePullPolicy: IfNotPresent
name: linux-curler-test-2
nodeSelector:
beta.kubernetes.io/os: linux
restartPolicy: Never
terminationGracePeriodSeconds: 30
---
apiVersion: batch/v1
kind: Job
metadata:
name: linux-curler-test-3
namespace: default
spec:
template:
metadata:
creationTimestamp: null
spec:
containers:
- command:
- bash
- -c
- "yum update; yum install curl -y; curl -L redhat.com"
image: registry.access.redhat.com/ubi8/ubi:latest
imagePullPolicy: IfNotPresent
name: linux-curler-test-3
nodeSelector:
beta.kubernetes.io/os: linux
restartPolicy: Never
terminationGracePeriodSeconds: 30
---
apiVersion: batch/v1
kind: Job
metadata:
name: linux-curler-test-4
namespace: default
spec:
template:
metadata:
creationTimestamp: null
spec:
containers:
- command:
- bash
- -c
- "yum update; yum install curl -y; curl -L redhat.com"
image: registry.access.redhat.com/ubi8/ubi:latest
imagePullPolicy: IfNotPresent
name: linux-curler-test-4
nodeSelector:
beta.kubernetes.io/os: linux
restartPolicy: Never
terminationGracePeriodSeconds: 30
# cat WinWebServer.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: win-webserver
name: win-webserver
spec:
replicas: 1
selector:
matchLabels:
app: win-webserver
template:
metadata:
labels:
app: win-webserver
name: win-webserver
spec:
tolerations:
- key: "os"
value: "Windows"
Effect: "NoSchedule"
containers:
- name: windowswebserver
image: mcr.microsoft.com/windows/servercore:ltsc2019
imagePullPolicy: IfNotPresent
# ---
# NOTE: Use privileged security context in none default project, need to add-scc-to-user, e.g.
# oc adm policy add-scc-to-user privileged system:serviceaccount:sgao:default
# oc get scc privileged -o yaml
# ---
# securityContext:
# privileged: true
command:
- powershell.exe
- -command
- "<#code used from https://gist.github.com/wagnerandrade/5424431#> ; $$listener = New-Object System.Net.HttpListener ; $$listener.Prefixes.Add('http://*:80/') ; $$listener.Start() ; $$callerCounts = @{} ; Write-Host('Listening at http://*:80/') ; while ($$listener.IsListening) { ;$$context = $$listener.GetContext() ;$$requestUrl = $$context.Request.Url ;$$clientIP = $$context.Request.RemoteEndPoint.Address ;$$response = $$context.Response ;Write-Host '' ;Write-Host('> {0}' -f $$requestUrl) ; ;$$count = 1 ;$$k=$$callerCounts.Get_Item($$clientIP) ;if ($$k -ne $$null) { $$count += $$k } ;$$callerCounts.Set_Item($$clientIP, $$count) ;$$ip=(Get-NetAdapter | Get-NetIpAddress); $$header='<html><body><H1>Windows Container Web Server</H1>' ;$$callerCountsString='' ;$$callerCounts.Keys | % { $$callerCountsString+='<p>IP {0} callerCount {1} ' -f $$ip[1].IPAddress,$$callerCounts.Item($$_) } ;$$footer='</body></html>' ;$$content='{0}{1}{2}' -f $$header,$$callerCountsString,$$footer ;Write-Output $$content ;$$buffer = [System.Text.Encoding]::UTF8.GetBytes($$content) ;$$response.ContentLength64 = $$buffer.Length ;$$response.OutputStream.Write($$buffer, 0, $$buffer.Length) ;$$response.Close() ;$$responseStatus = $$response.StatusCode ;Write-Host('< {0}' -f $$responseStatus) } ; "
nodeSelector:
beta.kubernetes.io/os: windows
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409 |