Description of problem:
The SDN pods failed during the cluster upgrade, so other pods could not be created successfully and the upgrade failed.

mac:~ jianzhang$ oc logs sdn-gd8zz -n openshift-sdn
Error from server: Get https://ip-172-31-159-215.us-east-2.compute.internal:10250/containerLogs/openshift-sdn/sdn-gd8zz/sdn: x509: certificate has expired or is not yet valid

Version-Release number of selected component (if applicable):
4.0.0-0.nightly-2019-03-25-180911 -> 4.0.0-0.nightly-2019-03-26-072833

How reproducible:
Sometimes. I retested this on another cluster that had been running for 30 hours, but could not reproduce the issue there; everything worked well.

Steps to Reproduce:
1. Create an OCP 4.0 cluster with payload 4.0.0-0.nightly-2019-03-25-180911.
2. Run the cluster for about a day (roughly 17 hours).
3. Execute the upgrade:
   $ oc adm upgrade --to=4.0.0-0.nightly-2019-03-26-072833

Actual results:
The upgrade failed and hung for more than three hours with the errors below:

Error from server: Get https://ip-172-31-159-215.us-east-2.compute.internal:10250/containerLogs/openshift-sdn/sdn-gd8zz/sdn: x509: certificate has expired or is not yet valid

mac:~ jianzhang$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-26-072833   True        True          47m     Unable to apply 4.0.0-0.nightly-2019-03-26-072833: the update could not be applied
mac:~ jianzhang$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-26-072833   True        True          65m     Working towards 4.0.0-0.nightly-2019-03-26-072833: 15% complete
mac:~ jianzhang$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-26-072833   True        True          169m    Working towards 4.0.0-0.nightly-2019-03-26-072833: 15% complete
mac:~ jianzhang$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-26-072833   True        True          171m    Working towards 4.0.0-0.nightly-2019-03-26-072833: 15% complete

Expected results:
The upgrade succeeds.
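For context on the x509 error quoted above: a TLS client rejects a serving certificate whenever the current time falls outside the certificate's notBefore/notAfter validity window, which is exactly what "certificate has expired or is not yet valid" reports. A minimal sketch of that time check (a hypothetical standalone helper for illustration, not OpenShift or kubelet code):

```python
from datetime import datetime, timezone

def cert_time_valid(not_before, not_after, now=None):
    """Return True iff `now` falls inside the certificate's validity
    window -- the check behind the message
    'x509: certificate has expired or is not yet valid'."""
    now = now or datetime.now(timezone.utc)
    return not_before <= now <= not_after
```

A certificate whose window ended an hour ago fails this check just like one whose window has not started yet (e.g. a freshly rotated certificate combined with clock skew), which is why the error message names both cases.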
Additional info:

1) Check the machine config:

mac:~ jianzhang$ oc describe pods machine-config-controller-864b594976-gg85k -n openshift-machine-config-operator
Name:               machine-config-controller-864b594976-gg85k
Namespace:          openshift-machine-config-operator
Priority:           2000000000
PriorityClassName:  system-cluster-critical
Node:               ip-172-31-159-215.us-east-2.compute.internal/172.31.159.215
Start Time:         Wed, 27 Mar 2019 14:59:04 +0800
Labels:             k8s-app=machine-config-controller
                    pod-template-hash=864b594976
Annotations:        k8s.v1.cni.cncf.io/networks-status=
Status:             Pending
IP:
Controlled By:      ReplicaSet/machine-config-controller-864b594976
Containers:
  machine-config-controller:
    Container ID:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eebe318172ab6cd2ff5d5bcd518ce55032ffc6a6a868e57d20bc7a2bd938f8d7
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      start
      --resourcelock-namespace=openshift-machine-config-operator
      --v=2
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      memory:  50Mi
    Requests:
      cpu:     20m
      memory:  50Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from machine-config-controller-token-nhcdx (ro)
Conditions:
  Type             Status
  Initialized      True
  Ready            False
  ContainersReady  False
  PodScheduled     True
Volumes:
  machine-config-controller-token-nhcdx:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  machine-config-controller-token-nhcdx
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  node-role.kubernetes.io/master=
Tolerations:     node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age  From                                                   Message
  ----     ------                  ---  ----                                                   -------
  Normal   Scheduled               2h   default-scheduler                                      Successfully assigned openshift-machine-config-operator/machine-config-controller-864b594976-gg85k to ip-172-31-159-215.us-east-2.compute.internal
  Warning  FailedCreatePodSandBox  2h   kubelet, ip-172-31-159-215.us-east-2.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_machine-config-controller-864b594976-gg85k_openshift-machine-config-operator_ceeb5a39-505d-11e9-b9b3-024f9c8d2a44_0(93d212c81bbff94bd33d7b0b4d7022dc1cb911c81706fe2c10cd3443828a260d): netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
  Warning  FailedCreatePodSandBox  2h   kubelet, ip-172-31-159-215.us-east-2.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_machine-config-controller-864b594976-gg85k_openshift-machine-config-operator_ceeb5a39-505d-11e9-b9b3-024f9c8d2a44_0(018163ad5b434d55e6df6914fea1832a13b52da3c0787d241cd71ab6ada77b85): netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
  Warning  FailedCreatePodSandBox  2h   kubelet, ip-172-31-159-215.us-east-2.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_machine-config-controller-864b594976-gg85k_openshift-machine-config-operator_ceeb5a39-505d-11e9-b9b3-024f9c8d2a44_0(35692ee2c3f04303e27d4c580b10b72de1438746303a9b24e0756ab7b6192a46): netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
  Warning  FailedCreatePodSandBox  2h   kubelet, ip-172-31-159-215.us-east-2.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_machine-config-controller-864b594976-gg85k_openshift-machine-config-operator_ceeb5a39-505d-11e9-b9b3-024f9c8d2a44_0(7007c31a632ae9cc859e5c5d17533477ae3abc782f0bf82fc798ad5a2f1665f4): netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
  Warning  FailedCreatePodSandBox  2h   kubelet, ip-172-31-159-215.us-east-2.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_machine-config-controller-864b594976-gg85k_openshift-machine-config-operator_ceeb5a39-505d-11e9-b9b3-024f9c8d2a44_0(3914b0e7103eff609188ceff009208cf5299d2b99faa35791fe48beb35d223b6): netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
  Warning  FailedCreatePodSandBox  2h   kubelet, ip-172-31-159-215.us-east-2.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_machine-config-controller-864b594976-gg85k_openshift-machine-config-operator_ceeb5a39-505d-11e9-b9b3-024f9c8d2a44_0(6238b671202dc9d342a0b0efd52065bb0a970d4318ceae2c465db023f3ffd7fd): netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
  ...

mac:~ jianzhang$ oc get co
NAME                                  VERSION                             AVAILABLE   PROGRESSING   FAILING   SINCE
authentication                                                            False       True          True      153m
cloud-credential                      4.0.0-0.nightly-2019-03-26-072833   True        False         False     161m
cluster-autoscaler                    4.0.0-0.nightly-2019-03-25-180911   True        False         False     153m
console                               4.0.0-0.nightly-2019-03-25-180911   True        False         False     179m
dns                                   4.0.0-0.nightly-2019-03-26-072833   True        False         False     17h
image-registry                        4.0.0-0.nightly-2019-03-25-180911   True        False         False     17h
ingress                               4.0.0-0.nightly-2019-03-25-180911   True        False         False     17h
kube-apiserver                        4.0.0-0.nightly-2019-03-26-072833   True        False         False     151m
kube-controller-manager               4.0.0-0.nightly-2019-03-26-072833   True        False         False     161m
kube-scheduler                        4.0.0-0.nightly-2019-03-26-072833   True        False         False     166m
machine-api                           4.0.0-0.nightly-2019-03-26-072833   True        False         False     17h
machine-config                        4.0.0-0.nightly-2019-03-26-072833   False       False         True      155m
marketplace                           4.0.0-0.nightly-2019-03-25-180911   False       False         True      152m
monitoring                            4.0.0-0.nightly-2019-03-25-180911   True        False         False     150m
network                               4.0.0-0.nightly-2019-03-26-072833   True        False         False     17h
node-tuning                           4.0.0-0.nightly-2019-03-25-180911   True        False         False     17h
openshift-apiserver                   4.0.0-0.nightly-2019-03-25-180911   True        False         False     150m
openshift-cloud-credential-operator   4.0.0-0.nightly-2019-03-25-180911   True        False         False     17h
openshift-controller-manager          4.0.0-0.nightly-2019-03-25-180911   True        False         False     153m
openshift-samples                     4.0.0-0.nightly-2019-03-25-180911   True        False         False     17h
operator-lifecycle-manager            4.0.0-0.nightly-2019-03-25-180911   True        False         False     17h
service-ca                            4.0.0-0.nightly-2019-03-26-072833   True        False         False     153m
service-catalog-apiserver             4.0.0-0.nightly-2019-03-25-180911   True        False         False     150m
service-catalog-controller-manager    4.0.0-0.nightly-2019-03-25-180911   True        False         False     153m
storage                               4.0.0-0.nightly-2019-03-25-180911   True        False         False     17h

mac:~ jianzhang$ oc get nodes
NAME                                           STATUS   ROLES    AGE   VERSION
ip-172-31-137-131.us-east-2.compute.internal   Ready    master   17h   v1.12.4+30e6a0f55
ip-172-31-150-176.us-east-2.compute.internal   Ready    worker   17h   v1.12.4+30e6a0f55
ip-172-31-159-215.us-east-2.compute.internal   Ready    master   17h   v1.12.4+30e6a0f55
ip-172-31-160-63.us-east-2.compute.internal    Ready    master   17h   v1.12.4+30e6a0f55

mac:~ jianzhang$ oc get pods -n openshift-machine-config-operator
NAME                                         READY   STATUS              RESTARTS   AGE
machine-config-controller-864b594976-gg85k   0/1     ContainerCreating   0          158m
machine-config-daemon-6fhbq                  1/1     Running             1          163m
machine-config-daemon-m42cd                  1/1     Running             1          162m
machine-config-daemon-vj4qh                  1/1     Running             0          163m
machine-config-daemon-wxr2v                  1/1     Running             1          163m
machine-config-operator-7fcc47f75f-cklhh     1/1     Running             0          159m
machine-config-server-76jdw                  1/1     Running             1          163m
machine-config-server-jw477                  1/1     Running             1          163m
machine-config-server-x628x                  1/1     Running             1          163m

2) Check the sdn logs.
mac:~ jianzhang$ oc get pod -n openshift-apiserver
NAME              READY   STATUS       RESTARTS   AGE
apiserver-ftrct   0/1     Init:Error   0          17h
apiserver-xgnqs   1/1     Running      1          17h
apiserver-zb6h5   1/1     Running      1          17h

mac:~ jianzhang$ oc get pod -n openshift-sdn
NAME                   READY   STATUS    RESTARTS   AGE
ovs-h8b45              1/1     Running   1          3h10m
ovs-p658l              1/1     Running   1          3h9m
ovs-tz87f              1/1     Running   1          3h9m
ovs-wmqw8              1/1     Running   0          3h11m
sdn-4m5l8              1/1     Running   0          3h11m
sdn-5hpwj              1/1     Running   2          3h11m
sdn-controller-64v8t   1/1     Running   1          3h11m
sdn-controller-87bv2   1/1     Running   1          3h10m
sdn-controller-fsrf5   1/1     Running   1          3h9m
sdn-gd8zz              1/1     Running   2          3h11m
sdn-lmt5t              1/1     Running   2          3h11m

mac:~ jianzhang$ oc logs sdn-gd8zz -n openshift-sdn
Error from server: Get https://ip-172-31-159-215.us-east-2.compute.internal:10250/containerLogs/openshift-sdn/sdn-gd8zz/sdn: x509: certificate has expired or is not yet valid
Unfortunately, this cluster has been removed. I will launch a new cluster.
Can you provide the logs from the sdn pods (masters, nodes, and ovs)? Thanks. I have a hunch the problem may be due to cert rotation causing trouble, but I could be wrong.
@rvokal identified that it is probably a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1692408
Ben,

> Can you provide the logs from the sdn pods (masters, nodes, and ovs)? Thanks.

I'm sorry; as described in comment 1, the cluster had already been removed, so not all of the nodes' sdn logs were preserved. Only this one:

mac:~ jianzhang$ oc logs sdn-gd8zz -n openshift-sdn
Error from server: Get https://ip-172-31-159-215.us-east-2.compute.internal:10250/containerLogs/openshift-sdn/sdn-gd8zz/sdn: x509: certificate has expired or is not yet valid

I will try to reproduce this, but it does not always happen.
Seth,

> Is there a cluster where this is currently happening that I can look at?

Sorry, as I described in comment 1, the affected cluster had already been removed. Also, this issue does not always happen. I am trying to reproduce it, but I was blocked because no upgrade graph was available; details are in comment 6.
Just to be clear, the reason for the upgrade failure was not the kubelet serving cert becoming invalid; it is this:

FailedCreatePodSandBox 2h kubelet, ip-172-31-159-215.us-east-2.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_machine-config-controller-864b594976-gg85k_openshift-machine-config-operator_ceeb5a39-505d-11e9-b9b3-024f9c8d2a44_0(93d212c81bbff94bd33d7b0b4d7022dc1cb911c81706fe2c10cd3443828a260d): netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input

Sending to the Network team to see if they have encountered this before.
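The 'unexpected end of JSON input' part of that message is secondary: the CNI plugin exited with an error but wrote nothing to stdout, and libcni then failed to decode that empty output as a structured JSON error, masking the plugin's real failure. A rough Python analogue of that decode path (hypothetical function name for illustration; libcni itself is Go, whose json package reports "unexpected end of JSON input" for empty input):

```python
import json

def report_netplugin_failure(plugin_stdout: str) -> str:
    """Mimic how a failed CNI plugin's stdout is handled: try to decode
    it as a structured JSON error object; if the plugin died before
    writing anything, the decode itself fails and the reported message
    carries no diagnostic at all."""
    try:
        err = json.loads(plugin_stdout)
        return f"netplugin failed: {err.get('msg', err)}"
    except json.JSONDecodeError as exc:
        # Empty stdout lands here -- the plugin's real error is lost,
        # which is why the kubelet event above says so little.
        return (f'netplugin failed but error parsing its diagnostic '
                f'message "{plugin_stdout}": {exc}')
```

With a well-formed diagnostic the first branch surfaces the plugin's own message; with empty stdout only the unhelpful decode error survives, matching the events seen in this bug.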
Hm, interesting. Marking this as low priority until it is reproduced. If this happens again, please re-open the bug.
It looks like we found the issue. Furthermore, danw has submitted a libcni change that makes the error message more informative. Closing as a duplicate.

*** This bug has been marked as a duplicate of bug 1700504 ***
Casey, Ok, thanks!