Bug 1889946

Summary: Pod stuck in ContainerCreating due to error "failed to create pod network sandbox" and "netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input"
Product: OpenShift Container Platform
Reporter: Mohammad <mahmad>
Component: Networking
Assignee: Michał Dulko <mdulko>
Networking sub component: kuryr
QA Contact: GenadiC <gcheresh>
Status: CLOSED NOTABUG
Docs Contact:
Severity: medium
Priority: medium
CC: atragler, bbennett, ealcaniz, itbrown, ltomasbo, mmohan
Version: 3.11.0
Keywords: Reopened
Target Milestone: ---
Target Release: 3.11.z
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-02-16 10:51:12 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1917441
Bug Blocks:

Description Mohammad 2020-10-21 03:30:15 UTC
Description of problem:

Pods are stuck in ContainerCreating with the following errors (from `oc get events`):

1- failed to create pod network sandbox
2- netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
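
To pull just these two messages out of a busy event stream, a small filter along the following lines may help. This is only a sketch: the function name `sandbox_errors` is made up for illustration, and it assumes the event text matches the wording quoted above.

```shell
# Read `oc get events` output on stdin and keep only the lines that
# mention either of the two errors quoted above.
sandbox_errors() {
    grep -E 'failed to create pod network sandbox|netplugin failed but error parsing its diagnostic message'
}

# Usage (assumes a logged-in `oc` session):
#   oc get events --all-namespaces | sandbox_errors
```

The function exits non-zero when nothing matches, so it can also be used as a quick check in a script.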


Version-Release number of selected component (if applicable): 3.11.272 and 3.11.232

How reproducible:

Unknown at this stage. It seems to appear on worker nodes that run many applications and have been up for a long time.

Steps to Reproduce (uncertain):
1. Install OCP 3.11 with Kuryr and CRI-O on OSP 13
2. Load the cluster with applications, then deploy more applications


Actual results:

New applications are stuck in ContainerCreating.

Expected results:

New applications are created and running.

Additional info: The problem clears after draining each node, performing the steps below, and then uncordoning the node:

sudo systemctl disable crio
sudo systemctl disable atomic-openshift-node.service
sudo reboot
sudo rm -fr /var/lib/containers/*
sudo systemctl enable crio
sudo systemctl enable atomic-openshift-node.service
sudo systemctl start atomic-openshift-node.service
sudo systemctl start crio
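
The drain and uncordon commands that bracket the steps above are not spelled out in this report. As a rough sketch, the helper below prints (it does not execute) one plausible pair for a given node. The function name `node_cycle_plan` is hypothetical, and the `--ignore-daemonsets` and `--delete-local-data` flags are an assumption about what a typical node needs, not something the reporter confirmed.

```shell
# Print the commands that would bracket the per-node cleanup.
# Nothing is executed; review the output before running it by hand.
node_cycle_plan() {
    node="$1"
    echo "oc adm drain $node --ignore-daemonsets --delete-local-data"
    echo "# run the cleanup steps above on $node (they include a reboot)"
    echo "oc adm uncordon $node"
}

# Usage:
#   node_cycle_plan <node-name>
```

Printing rather than executing is deliberate: the cleanup includes a reboot, so the sequence cannot run unattended from the node itself.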

We think it might be related to the Kuryr CNI. The kuryr-controller allocates the ports in OpenStack and annotates the pods with the new IPs, but kuryr-cni is unable to attach the network to the pods.
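
One way to check the first half of that theory is to confirm that a stuck pod actually carries the controller's annotation. The sketch below assumes the annotation key is `openstack.org/kuryr-vif`, which is taken from upstream kuryr-kubernetes and was not verified against this cluster. The function name `has_kuryr_vif` is made up for illustration.

```shell
# Succeed if the pod manifest on stdin contains the Kuryr VIF annotation key.
# The key name is an assumption taken from upstream kuryr-kubernetes.
has_kuryr_vif() {
    grep -qF 'openstack.org/kuryr-vif'
}

# Usage (assumes a logged-in `oc` session):
#   oc get pod <pod-name> -o yaml | has_kuryr_vif && echo annotated || echo missing
```

If the annotation is present on a pod that is still in ContainerCreating, the controller side has probably done its part and the fault is more likely on the CNI side of the node.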

Comment 15 Itzik Brown 2021-01-29 00:05:45 UTC
Ran Tempest tests on v3.11.380 and all passed (with Docker, not CRI-O).

Comment 19 errata-xmlrpc 2021-02-03 18:40:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 3.11.380 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0274

Comment 20 Itzik Brown 2021-02-15 17:14:26 UTC
When updating from v3.11.346 to v3.11.386 I got the following:

(shiftstack) [stack@undercloud-0 ~]$ oc get pods
NAME                       READY     STATUS    RESTARTS   AGE
demo-68dbc445d-8dt5m       1/1       Running   0          7h
demo-68dbc445d-cw8p5       1/1       Running   0          7h
demo-68dbc445d-nrfxt       0/1       Error     0          7h
docker-registry-1-cm2wk    1/1       Running   0          8h
registry-console-1-h2lv9   0/1       Error     0          8h
router-1-8mkt2             1/1       Running   0          8h
router-1-9mtbp             1/1       Running   0          8h
router-1-bkcjf             1/1       Running   0          8h

and 
(shiftstack) [stack@undercloud-0 ~]$ oc get pods -n kuryr
NAME                                READY     STATUS             RESTARTS   AGE
kuryr-cni-ds-4g78t                  1/2       CrashLoopBackOff   21         1h
kuryr-cni-ds-565df                  2/2       Running            0          8h
kuryr-cni-ds-7gm75                  1/2       CrashLoopBackOff   19         1h
kuryr-cni-ds-j4nrl                  2/2       Running            0          8h
kuryr-cni-ds-jqt4j                  1/2       CrashLoopBackOff   23         1h
kuryr-cni-ds-l99xw                  2/2       Running            0          8h
kuryr-cni-ds-n5n8h                  2/2       Running            0          8h
kuryr-cni-ds-q9fr7                  2/2       Running            0          8h
kuryr-controller-74c988b946-tldhv   0/1       Running            21         1h
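
For longer listings, the pods that are not fully ready can be picked out mechanically. The following is a minimal sketch, assuming the default column layout shown above (NAME, READY, STATUS as the first three columns). The function name `not_ready_pods` is hypothetical.

```shell
# Read `oc get pods` output on stdin and print NAME, READY and STATUS for
# every pod whose ready count differs from its container count.
# The first line is treated as the column header and skipped.
not_ready_pods() {
    awk 'NR > 1 { split($2, r, "/"); if (r[1] != r[2]) print $1, $2, $3 }'
}

# Usage (assumes a logged-in `oc` session):
#   oc get pods -n kuryr | not_ready_pods
```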

Comment 22 Itzik Brown 2021-02-16 10:51:12 UTC
Opened a bug: https://bugzilla.redhat.com/show_bug.cgi?id=1929170

Comment 23 Red Hat Bugzilla 2023-09-15 00:50:00 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.