Description of problem:
The SDN pods failed during the cluster upgrade, so other pods could not be created successfully and the upgrade failed.

mac:~ jianzhang$ oc logs sdn-gd8zz -n openshift-sdn
Error from server: Get https://ip-172-31-159-215.us-east-2.compute.internal:10250/containerLogs/openshift-sdn/sdn-gd8zz/sdn: x509: certificate has expired or is not yet valid

Version-Release number of selected component (if applicable):
4.0.0-0.nightly-2019-03-25-180911 -> 4.0.0-0.nightly-2019-03-26-072833

How reproducible:
Sometimes. I retested this on another cluster that had been running for 30 hours, but could not reproduce the issue there; everything worked well.

Steps to Reproduce:
1. Create an OCP 4.0 cluster with payload 4.0.0-0.nightly-2019-03-25-180911.
2. Run the cluster for about a day (roughly 17 hours).
3. Execute the upgrade:
   $ oc adm upgrade --to=4.0.0-0.nightly-2019-03-26-072833

Actual results:
The upgrade failed and hung for more than three hours with the errors below:

Error from server: Get https://ip-172-31-159-215.us-east-2.compute.internal:10250/containerLogs/openshift-sdn/sdn-gd8zz/sdn: x509: certificate has expired or is not yet valid

mac:~ jianzhang$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-26-072833   True        True          47m     Unable to apply 4.0.0-0.nightly-2019-03-26-072833: the update could not be applied
mac:~ jianzhang$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-26-072833   True        True          65m     Working towards 4.0.0-0.nightly-2019-03-26-072833: 15% complete
mac:~ jianzhang$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-26-072833   True        True          169m    Working towards 4.0.0-0.nightly-2019-03-26-072833: 15% complete
mac:~ jianzhang$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-26-072833   True        True          171m    Working towards 4.0.0-0.nightly-2019-03-26-072833: 15% complete

Expected results:
The upgrade succeeds.
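For context on the x509 error quoted above: a TLS client rejects a serving certificate whenever the current time falls outside the certificate's notBefore/notAfter validity window, which is exactly what "certificate has expired or is not yet valid" reports. A minimal sketch of that time check (a hypothetical standalone helper for illustration, not OpenShift or kubelet code):

```python
from datetime import datetime, timezone

def cert_time_valid(not_before, not_after, now=None):
    """Return True iff `now` falls inside the certificate's validity
    window -- the check behind the message
    'x509: certificate has expired or is not yet valid'."""
    now = now or datetime.now(timezone.utc)
    return not_before <= now <= not_after
```

A certificate whose window ended an hour ago fails this check just like one whose window has not started yet (e.g. a freshly rotated certificate combined with clock skew), which is why the error message names both cases.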
Additional info:

1) Check the machine config:

mac:~ jianzhang$ oc describe pods machine-config-controller-864b594976-gg85k -n openshift-machine-config-operator
Name:               machine-config-controller-864b594976-gg85k
Namespace:          openshift-machine-config-operator
Priority:           2000000000
PriorityClassName:  system-cluster-critical
Node:               ip-172-31-159-215.us-east-2.compute.internal/172.31.159.215
Start Time:         Wed, 27 Mar 2019 14:59:04 +0800
Labels:             k8s-app=machine-config-controller
                    pod-template-hash=864b594976
Annotations:        k8s.v1.cni.cncf.io/networks-status=
Status:             Pending
IP:
Controlled By:      ReplicaSet/machine-config-controller-864b594976
Containers:
  machine-config-controller:
    Container ID:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eebe318172ab6cd2ff5d5bcd518ce55032ffc6a6a868e57d20bc7a2bd938f8d7
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      start
      --resourcelock-namespace=openshift-machine-config-operator
      --v=2
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      memory:  50Mi
    Requests:
      cpu:     20m
      memory:  50Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from machine-config-controller-token-nhcdx (ro)
Conditions:
  Type             Status
  Initialized      True
  Ready            False
  ContainersReady  False
  PodScheduled     True
Volumes:
  machine-config-controller-token-nhcdx:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  machine-config-controller-token-nhcdx
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  node-role.kubernetes.io/master=
Tolerations:     node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age  From                                                   Message
  ----     ------                  ---  ----                                                   -------
  Normal   Scheduled               2h   default-scheduler                                      Successfully assigned openshift-machine-config-operator/machine-config-controller-864b594976-gg85k to ip-172-31-159-215.us-east-2.compute.internal
  Warning  FailedCreatePodSandBox  2h   kubelet, ip-172-31-159-215.us-east-2.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_machine-config-controller-864b594976-gg85k_openshift-machine-config-operator_ceeb5a39-505d-11e9-b9b3-024f9c8d2a44_0(93d212c81bbff94bd33d7b0b4d7022dc1cb911c81706fe2c10cd3443828a260d): netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
  Warning  FailedCreatePodSandBox  2h   kubelet, ip-172-31-159-215.us-east-2.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_machine-config-controller-864b594976-gg85k_openshift-machine-config-operator_ceeb5a39-505d-11e9-b9b3-024f9c8d2a44_0(018163ad5b434d55e6df6914fea1832a13b52da3c0787d241cd71ab6ada77b85): netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
  Warning  FailedCreatePodSandBox  2h   kubelet, ip-172-31-159-215.us-east-2.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_machine-config-controller-864b594976-gg85k_openshift-machine-config-operator_ceeb5a39-505d-11e9-b9b3-024f9c8d2a44_0(35692ee2c3f04303e27d4c580b10b72de1438746303a9b24e0756ab7b6192a46): netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
  Warning  FailedCreatePodSandBox  2h   kubelet, ip-172-31-159-215.us-east-2.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_machine-config-controller-864b594976-gg85k_openshift-machine-config-operator_ceeb5a39-505d-11e9-b9b3-024f9c8d2a44_0(7007c31a632ae9cc859e5c5d17533477ae3abc782f0bf82fc798ad5a2f1665f4): netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
  Warning  FailedCreatePodSandBox  2h   kubelet, ip-172-31-159-215.us-east-2.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_machine-config-controller-864b594976-gg85k_openshift-machine-config-operator_ceeb5a39-505d-11e9-b9b3-024f9c8d2a44_0(3914b0e7103eff609188ceff009208cf5299d2b99faa35791fe48beb35d223b6): netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
  Warning  FailedCreatePodSandBox  2h   kubelet, ip-172-31-159-215.us-east-2.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_machine-config-controller-864b594976-gg85k_openshift-machine-config-operator_ceeb5a39-505d-11e9-b9b3-024f9c8d2a44_0(6238b671202dc9d342a0b0efd52065bb0a970d4318ceae2c465db023f3ffd7fd): netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
  ...

mac:~ jianzhang$ oc get co
NAME                                  VERSION                             AVAILABLE   PROGRESSING   FAILING   SINCE
authentication                                                            False       True          True      153m
cloud-credential                      4.0.0-0.nightly-2019-03-26-072833   True        False         False     161m
cluster-autoscaler                    4.0.0-0.nightly-2019-03-25-180911   True        False         False     153m
console                               4.0.0-0.nightly-2019-03-25-180911   True        False         False     179m
dns                                   4.0.0-0.nightly-2019-03-26-072833   True        False         False     17h
image-registry                        4.0.0-0.nightly-2019-03-25-180911   True        False         False     17h
ingress                               4.0.0-0.nightly-2019-03-25-180911   True        False         False     17h
kube-apiserver                        4.0.0-0.nightly-2019-03-26-072833   True        False         False     151m
kube-controller-manager               4.0.0-0.nightly-2019-03-26-072833   True        False         False     161m
kube-scheduler                        4.0.0-0.nightly-2019-03-26-072833   True        False         False     166m
machine-api                           4.0.0-0.nightly-2019-03-26-072833   True        False         False     17h
machine-config                        4.0.0-0.nightly-2019-03-26-072833   False       False         True      155m
marketplace                           4.0.0-0.nightly-2019-03-25-180911   False       False         True      152m
monitoring                            4.0.0-0.nightly-2019-03-25-180911   True        False         False     150m
network                               4.0.0-0.nightly-2019-03-26-072833   True        False         False     17h
node-tuning                           4.0.0-0.nightly-2019-03-25-180911   True        False         False     17h
openshift-apiserver                   4.0.0-0.nightly-2019-03-25-180911   True        False         False     150m
openshift-cloud-credential-operator   4.0.0-0.nightly-2019-03-25-180911   True        False         False     17h
openshift-controller-manager          4.0.0-0.nightly-2019-03-25-180911   True        False         False     153m
openshift-samples                     4.0.0-0.nightly-2019-03-25-180911   True        False         False     17h
operator-lifecycle-manager            4.0.0-0.nightly-2019-03-25-180911   True        False         False     17h
service-ca                            4.0.0-0.nightly-2019-03-26-072833   True        False         False     153m
service-catalog-apiserver             4.0.0-0.nightly-2019-03-25-180911   True        False         False     150m
service-catalog-controller-manager    4.0.0-0.nightly-2019-03-25-180911   True        False         False     153m
storage                               4.0.0-0.nightly-2019-03-25-180911   True        False         False     17h

mac:~ jianzhang$ oc get nodes
NAME                                           STATUS   ROLES    AGE   VERSION
ip-172-31-137-131.us-east-2.compute.internal   Ready    master   17h   v1.12.4+30e6a0f55
ip-172-31-150-176.us-east-2.compute.internal   Ready    worker   17h   v1.12.4+30e6a0f55
ip-172-31-159-215.us-east-2.compute.internal   Ready    master   17h   v1.12.4+30e6a0f55
ip-172-31-160-63.us-east-2.compute.internal    Ready    master   17h   v1.12.4+30e6a0f55

mac:~ jianzhang$ oc get pods -n openshift-machine-config-operator
NAME                                         READY   STATUS              RESTARTS   AGE
machine-config-controller-864b594976-gg85k   0/1     ContainerCreating   0          158m
machine-config-daemon-6fhbq                  1/1     Running             1          163m
machine-config-daemon-m42cd                  1/1     Running             1          162m
machine-config-daemon-vj4qh                  1/1     Running             0          163m
machine-config-daemon-wxr2v                  1/1     Running             1          163m
machine-config-operator-7fcc47f75f-cklhh     1/1     Running             0          159m
machine-config-server-76jdw                  1/1     Running             1          163m
machine-config-server-jw477                  1/1     Running             1          163m
machine-config-server-x628x                  1/1     Running             1          163m

2) Check the sdn logs.
mac:~ jianzhang$ oc get pod -n openshift-apiserver
NAME              READY   STATUS       RESTARTS   AGE
apiserver-ftrct   0/1     Init:Error   0          17h
apiserver-xgnqs   1/1     Running      1          17h
apiserver-zb6h5   1/1     Running      1          17h

mac:~ jianzhang$ oc get pod -n openshift-sdn
NAME                   READY   STATUS    RESTARTS   AGE
ovs-h8b45              1/1     Running   1          3h10m
ovs-p658l              1/1     Running   1          3h9m
ovs-tz87f              1/1     Running   1          3h9m
ovs-wmqw8              1/1     Running   0          3h11m
sdn-4m5l8              1/1     Running   0          3h11m
sdn-5hpwj              1/1     Running   2          3h11m
sdn-controller-64v8t   1/1     Running   1          3h11m
sdn-controller-87bv2   1/1     Running   1          3h10m
sdn-controller-fsrf5   1/1     Running   1          3h9m
sdn-gd8zz              1/1     Running   2          3h11m
sdn-lmt5t              1/1     Running   2          3h11m

mac:~ jianzhang$ oc logs sdn-gd8zz -n openshift-sdn
Error from server: Get https://ip-172-31-159-215.us-east-2.compute.internal:10250/containerLogs/openshift-sdn/sdn-gd8zz/sdn: x509: certificate has expired or is not yet valid
Unfortunately, this cluster has been removed. I will launch a new cluster.
Can you provide the logs from the sdn pods (masters, nodes, and ovs)? Thanks. I have a hunch the problem may be due to cert rotation causing trouble, but I could be wrong.
@rvokal identified that it is probably a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1692408
Ben,

> Can you provide the logs from the sdn pods (masters, nodes, and ovs)? Thanks.

I'm sorry; as described in comment 1, the cluster had already been removed, so not all of the nodes' sdn logs were preserved. Only this one:

mac:~ jianzhang$ oc logs sdn-gd8zz -n openshift-sdn
Error from server: Get https://ip-172-31-159-215.us-east-2.compute.internal:10250/containerLogs/openshift-sdn/sdn-gd8zz/sdn: x509: certificate has expired or is not yet valid

I will try to reproduce this, but it does not always happen.
Seth,

> Is there a cluster where this is currently happening that I can look at?

Sorry, as I described in comment 1, the affected cluster had already been removed. Also, this issue does not always happen. I am trying to reproduce it, but I was blocked because no upgrade graph was available; details are in comment 6.
Just to be clear, the reason for the upgrade failure was not the kubelet serving cert becoming invalid; it is this:

FailedCreatePodSandBox 2h kubelet, ip-172-31-159-215.us-east-2.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_machine-config-controller-864b594976-gg85k_openshift-machine-config-operator_ceeb5a39-505d-11e9-b9b3-024f9c8d2a44_0(93d212c81bbff94bd33d7b0b4d7022dc1cb911c81706fe2c10cd3443828a260d): netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input

Sending to the Network team to see if they have encountered this before.
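The 'unexpected end of JSON input' part of that message is secondary: the CNI plugin exited with an error but wrote nothing to stdout, and libcni then failed to decode that empty output as a structured JSON error, masking the plugin's real failure. A rough Python analogue of that decode path (hypothetical function name for illustration; libcni itself is Go, whose json package reports "unexpected end of JSON input" for empty input):

```python
import json

def report_netplugin_failure(plugin_stdout: str) -> str:
    """Mimic how a failed CNI plugin's stdout is handled: try to decode
    it as a structured JSON error object; if the plugin died before
    writing anything, the decode itself fails and the reported message
    carries no diagnostic at all."""
    try:
        err = json.loads(plugin_stdout)
        return f"netplugin failed: {err.get('msg', err)}"
    except json.JSONDecodeError as exc:
        # Empty stdout lands here -- the plugin's real error is lost,
        # which is why the kubelet event above says so little.
        return (f'netplugin failed but error parsing its diagnostic '
                f'message "{plugin_stdout}": {exc}')
```

With a well-formed diagnostic the first branch surfaces the plugin's own message; with empty stdout only the unhelpful decode error survives, matching the events seen in this bug.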
Hm, interesting. Marking this as low priority until it is reproduced. If this happens again, please re-open the bug.
It looks like we found the issue. Furthermore, danw has submitted a libcni change that makes the error message more informative. Closing as a duplicate.

*** This bug has been marked as a duplicate of bug 1700504 ***
Casey, Ok, thanks!