Description of problem:
Director-deployed OCP 3.11: pods end up in CrashLoopBackOff state after a rolling reboot of the overcloud nodes:

[root@openshift-master-0 heat-admin]# oc get pods --all-namespaces | grep -v Running | grep -v Completed
NAMESPACE               NAME                                           READY   STATUS             RESTARTS   AGE
default                 glusterblock-registry-provisioner-dc-1-dlgd9   0/1     CrashLoopBackOff   8          50m
glusterfs               glusterblock-storage-provisioner-dc-1-8dx75    0/1     Error              9          54m
openshift-console       console-6b4548888-s4rhh                        0/1     CrashLoopBackOff   12         46m
openshift-monitoring    grafana-675bb887cc-k8vm2                       1/2     CrashLoopBackOff   10         46m
openshift-monitoring    kube-state-metrics-7588654c69-kfwqm            2/3     CrashLoopBackOff   11         44m
openshift-monitoring    prometheus-k8s-0                               3/4     CrashLoopBackOff   14         46m
openshift-web-console   webconsole-857446847c-2phh6                    0/1     CrashLoopBackOff   11         46m

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-9.0.1-0.20181013060891.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy the overcloud with 3 masters + 3 infra + 3 worker nodes with CNS enabled
2. Reboot the nodes one by one
3. Check pod status

Actual results:
Some of the pods end up in CrashLoopBackOff state.

Expected results:
All pods should be running.

Additional info:
I managed to reproduce this on a manual openshift-ansible installation (without Director), so I am dropping the blocker flag.
It seems like this is a generic problem that can happen whenever you restart docker; I've reproduced this simply by running openshift-ansible/playbooks/openshift-node/restart.yml.

I experimented by running restart.yml five times; the 2nd and 4th times all the pods came up with no failures. The other times there was exactly one pod in the CrashLoopBackOff state, but a different pod each time:

* openshift-template-service-broker apiserver-t5jlc 0/1 CrashLoopBackOff 236 3d
  Error: Get https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 172.30.0.1:443: connect: network is unreachable

* kube-service-catalog apiserver-blrc8 0/1 CrashLoopBackOff 49 3d
  Error: Get https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 172.30.0.1:443: connect: network is unreachable

* openshift-web-console webconsole-857446847c-fqkx8 0/1 CrashLoopBackOff 83 3d
  F1214 15:37:46.785377 1 console.go:35] Get https://172.30.0.1:443/.well-known/oauth-authorization-server: dial tcp 172.30.0.1:443: connect: network is unreachable

This seems like it may be a generic SDN issue?

Note that each time this happened I was able to go to the system where the affected pod was running, restart docker, and then all the pods were fine. That may be a documentable temporary workaround... ?
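A minimal sketch of that temporary workaround, assuming SSH access as root to the node hosting the affected pod (the pod name, namespace, and node here are placeholders taken from the examples above):

~~~
# Find which node the crash-looping pod landed on.
oc get pod webconsole-857446847c-fqkx8 -n openshift-web-console -o wide

# On that node, restart docker; the kubelet recreates the pod sandboxes.
ssh root@<node> 'systemctl restart docker'

# Back on a master, confirm the pods recover.
oc get pods --all-namespaces | grep -v Running | grep -v Completed
~~~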
Two questions:

1 - Can you describe more exactly how to reproduce this?
2 - Can you post the logs from the SDN pod on the node with the crashing pods? You can get them by listing the sdn pods (oc -n openshift-sdn get pods -o wide, then oc logs -n openshift-sdn <podname>).
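For reference, a hedged sketch of the log collection being asked for here (the node and pod names are placeholders):

~~~
# List the SDN daemonset pods and note which one runs on the affected node.
oc -n openshift-sdn get pods -o wide | grep <node>

# Dump the log of the SDN pod on that node for attaching to the bug.
oc -n openshift-sdn logs sdn-xxxxx > sdn-pod.log
~~~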
Hi, any update on this? Our customer can easily reproduce this issue by restarting a master node.
(In reply to Casey Callendrello from comment #5)
> Two questions:
> 1 - Can you describe more exactly how to reproduce this?

a. SSH to one of the openshift nodes
b. shutdown -r now
c. wait for the node to reboot
d. wait for the atomic-openshift-node service to start
e. re-run steps a-d for the rest of the openshift nodes in the cluster (a hedged script version of these steps is sketched below)

> 2 - Can you post the logs from the SDN pod on the node with the crashing
> pods? You can get them by listing the sdn pods (oc -n openshift-sdn get pods
> -o wide, then oc logs -n openshift-sdn <podname>)

Attaching the log output.
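A minimal, hedged script version of the rolling-reboot reproducer above, assuming passwordless root SSH to the nodes; node names are placeholders supplied by the caller:

~~~
#!/bin/bash
# Rolling reboot reproducer (sketch): reboot each node and wait for the
# atomic-openshift-node service to report active again before moving on.
set -e
for node in "$@"; do
    echo "Rebooting ${node}"
    ssh "root@${node}" 'shutdown -r now' || true   # the connection drops as the node goes down
    sleep 60                                       # give the node time to go down
    until ssh "root@${node}" 'systemctl is-active atomic-openshift-node' 2>/dev/null | grep -q '^active$'; do
        sleep 10
    done
    echo "${node} is back"
done
~~~

Usage (node names hypothetical): ./rolling-reboot.sh openshift-master-0 openshift-infra-0 openshift-worker-0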
Created attachment 1518156 [details] sdn pod log output Attaching sdn pod log output.
Interesting, thanks for the helpful logs.

It looks like we're triggering the same issue noticed in https://github.com/openshift/origin/pull/21654:

W0103 03:46:31.823333    9682 node.go:367] will restart pod 'openshift-monitoring/alertmanager-main-0' due to update failure on restart: could not parse ofport "": strconv.Atoi: parsing "": invalid syntax
W0103 03:46:31.838837    9682 node.go:367] will restart pod 'openshift-monitoring/kube-state-metrics-7588654c69-mzq92' due to update failure on restart: could not parse ofport "": strconv.Atoi: parsing "": invalid syntax
W0103 03:46:31.904371    9682 node.go:367] will restart pod 'openshift-monitoring/prometheus-operator-769776d47-rgljl' due to update failure on restart: could not parse ofport "": strconv.Atoi: parsing "": invalid syntax

Weibin or Meng Bo, can you try and reproduce this? Thank you.
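The `could not parse ofport ""` warning appears to mean OVS returned an empty ofport for a pod's veth when openshift-sdn tried to reattach existing pods after restart. A hedged debugging sketch for checking that on the affected node (the veth name is a placeholder; if ovs-vsctl is not on the host, the same commands can be run with oc exec inside the ovs pod for that node):

~~~
# Ports openshift-sdn has attached to the SDN bridge for each pod.
ovs-vsctl list-ports br0

# An empty value here would match the strconv.Atoi parse error in the sdn log.
ovs-vsctl get Interface vethXXXXXXX ofport

# Cross-check which ofports OVS actually knows about on br0.
ovs-ofctl -O OpenFlow13 dump-ports-desc br0
~~~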
Casey, I reproduced the problem and saw the same errors as comment 11 in my v3.11.66 cluster:

[root@ip-172-18-11-3 ec2-user]# oc get pods --all-namespaces | grep -v Running | grep -v Completed
NAMESPACE                          NAME                                   READY   STATUS             RESTARTS   AGE
default                            router-1-deploy                        0/1     Error              0          37m
openshift-ansible-service-broker   asb-1-gnw44                            0/1     CrashLoopBackOff   7          32m
openshift-monitoring               alertmanager-main-1                    2/3     CrashLoopBackOff   9          34m
openshift-monitoring               prometheus-operator-7566fcccc8-vhgsd   0/1     CrashLoopBackOff   7          36m

[root@ip-172-18-11-3 ec2-user]# oc logs pod/sdn-f5fb -n openshift-sdn
I0104 14:35:42.598927    8848 node.go:348] Starting openshift-sdn pod manager
E0104 14:35:42.608313    8848 cniserver.go:148] failed to remove old pod info socket: remove /var/run/openshift-sdn: device or resource busy
E0104 14:35:42.608403    8848 cniserver.go:151] failed to remove contents of socket directory: remove /var/run/openshift-sdn: device or resource busy
W0104 14:35:42.623752    8848 util_unix.go:75] Using "/var/run/dockershim.sock" as endpoint is deprecated, please consider using full url format "unix:///var/run/dockershim.sock".
W0104 14:35:42.695796    8848 node.go:367] will restart pod 'openshift-ansible-service-broker/asb-1-gnw44' due to update failure on restart: could not parse ofport "": strconv.Atoi: parsing "": invalid syntax
W0104 14:35:42.725736    8848 node.go:367] will restart pod 'openshift-monitoring/alertmanager-main-1' due to update failure on restart: could not parse ofport "": strconv.Atoi: parsing "": invalid syntax
W0104 14:35:42.751036    8848 node.go:367] will restart pod 'openshift-monitoring/prometheus-operator-7566fcccc8-vhgsd' due to update failure on restart: could not parse ofport "": strconv.Atoi: parsing "": invalid syntax
The event log from the failed pod:

  Normal   Started                 24m                kubelet, ip-172-18-8-14.ec2.internal        Started container
  Warning  FailedCreatePodSandBox  3m                 kubelet, ip-172-18-8-14.ec2.internal        Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "8a7b56bdabd9ab1ee24afd007d28340e8a0bdae044666913e2bd6a3cbb80092c" network for pod "prometheus-operator-7566fcccc8-vhgsd": NetworkPlugin cni failed to set up pod "prometheus-operator-7566fcccc8-vhgsd_openshift-monitoring" network: OpenShift SDN network process is not (yet?) available
  Warning  FailedCreatePodSandBox  3m                 kubelet, ip-172-18-8-14.ec2.internal        Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "dfa1270c600e6d1a6b6c4b81b89c3a3eaf434a599a0adf4b9866b5f4d0061d2e" network for pod "prometheus-operator-7566fcccc8-vhgsd": NetworkPlugin cni failed to set up pod "prometheus-operator-7566fcccc8-vhgsd_openshift-monitoring" network: OpenShift SDN network process is not (yet?) available
  Normal   SandboxChanged          3m (x3 over 3m)    kubelet, ip-172-18-8-14.ec2.internal        Pod sandbox changed, it will be killed and re-created.
  Warning  NetworkFailed           3m                 openshift-sdn, ip-172-18-8-14.ec2.internal  The pod's network interface has been lost and the pod will be stopped.
  Normal   Pulled                  1m (x4 over 3m)    kubelet, ip-172-18-8-14.ec2.internal        Container image "registry.reg-aws.openshift.com:443/openshift3/ose-prometheus-operator:v3.11" already present on machine
  Normal   Created                 1m (x4 over 3m)    kubelet, ip-172-18-8-14.ec2.internal        Created container
  Normal   Started                 1m (x4 over 3m)    kubelet, ip-172-18-8-14.ec2.internal        Started container
  Warning  BackOff                 25s (x13 over 3m)  kubelet, ip-172-18-8-14.ec2.internal        Back-off restarting failed container
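Events like these are typically gathered with oc describe; a hedged sketch using the pod and namespace from the output above:

~~~
# Events for the crash-looping pod.
oc describe pod prometheus-operator-7566fcccc8-vhgsd -n openshift-monitoring

# Or list recent events for the whole namespace, sorted by time.
oc get events -n openshift-monitoring --sort-by=.lastTimestamp
~~~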
Assigning to Phil to take a look.
I have seen this when bringing up ovs/ovn on a 3.11 cluster. I don't know what is causing the problem, but the order in which the ovs and sdn daemons come up, and the delays between their starts, seem suspicious. In the ovn case, deleting the node pods and the subsequent restart fixes the problem. In this case, restarting docker fixes the problem. The common component in both cases is ovs.
Mark, could you take a look at this? I think there is something happening between ovs and sdn, but I'm not sure how to figure it out. Thanks
Worked with Phil in my v3.11 cluster, which has only one master, one infra and two worker nodes. After just running "shutdown -r now" on the master, we saw the same issue, and both the ovs pod and the sdn pod restarted once after the master came back. The testing log is attached.
Created attachment 1520616 [details] Testing log
Notes on comment 18:

I think this may be a startup sequencing problem: SDN starts before OVS is ready for it, SDN handles this badly, and it breaks all existing pods on the node. It appears that 3.9 works and this is new in 3.11.

As an aside, I am running into similar behavior when bringing up ovs/ovn networking (1654942). There may be a common cause.

The suggested workaround, restarting docker, effectively restarts the pods, and after a while networking works again.

Looking into how ovs and sdn work together. Also looking at delaying sdn until ovs is ready.
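A hedged sketch for checking whether the ovs and sdn daemonset pods on a node really did come up out of order after a reboot (the node and pod names are placeholders):

~~~
# Restart counts and ages of the ovs and sdn pods on the rebooted node.
oc -n openshift-sdn get pods -o wide | grep <node>

# Container start timestamps for both pods, to compare the ordering.
oc -n openshift-sdn get pod sdn-xxxxx -o jsonpath='{.status.containerStatuses[0].state.running.startedAt}{"\n"}'
oc -n openshift-sdn get pod ovs-yyyyy -o jsonpath='{.status.containerStatuses[0].state.running.startedAt}{"\n"}'
~~~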
Rolling restarts of nodes in our cluster seem to trigger this behavior too. Right now, in kube-service-catalog, the apiserver pod on one master is crash looping with:

"Error: Get https://10.127.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 10.127.0.1:443: connect: network is unreachable"

If I delete the pod, the newly created one works just fine.

No idea if it's related, but our logging-fluentd pods had problems on reboot too, looking very similar to https://bugzilla.redhat.com/show_bug.cgi?id=1538971. I solved that by removing the label to stop logging first, rebooting, then adding the label back. Could that also be caused by startup order?

3.11.59
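A hedged sketch of the "delete the pod and let it be recreated" workaround described above (the pod name is a placeholder; its controller recreates the pod automatically):

~~~
# Find the crash-looping apiserver pod and delete it.
oc -n kube-service-catalog get pods -o wide
oc -n kube-service-catalog delete pod apiserver-xxxxx

# Watch the replacement pod come up Ready.
oc -n kube-service-catalog get pods -w
~~~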
We had a suspicion that cleanup of the ovs db on restart might be the problem; we backported two PRs, but that didn't work. Still looking for the root cause.
Hi,

While the bug is being fixed, could we have a better workaround? Restarting docker is not a good workaround: sometimes the infrastructure can have an unexpected reboot, or you may reboot the servers one by one after an update, and it's expected that everything keeps working.

Regards,
Oscar
I was able to narrow in on the bug.

1. This happens only when the node is first starting up, likely due to pods starting up before the SDN is fully up and running.
2. A pod gets started and all the OVS networking is set up correctly (flows, ports, interfaces).
3. What is missing is the correct routes in this network namespace. The pod (sandbox) container is never recreated; just the app container is restarted over and over again, which is to be expected.
4. The network that is set up for this pod is used for the application container, but the correct routes are never added.

Guessing somewhere around here we hit the issue where the routes are not fully set up:
https://github.com/openshift/origin/blob/release-3.11/pkg/network/node/pod.go#L115

Inside the affected pod's network namespace, only the link-scope route is present:

# ip route
10.130.0.0/23 dev eth0 proto kernel scope link src 10.130.0.66

Due to the missing routes, connections fail:

# curl -vk https://172.30.0.1:443/healthz
curl: (7) Failed to connect to 172.30.0.1: Network is unreachable

We expect to see this:

default via 10.130.0.1 dev eth0
10.128.0.0/14 dev eth0
10.130.0.0/23 dev eth0 proto kernel scope link src 10.130.0.66
224.0.0.0/4 dev eth0

When we add a default route:

# ip route add default via 10.130.0.1 dev eth0

we are able to connect using the kubernetes service IP:

# curl -vk https://172.30.0.1:443/healthz
HTTP/1.1 200 OK

This error is seen around one of the pods we reproduced with:

atomic-openshift-node[20179]: W0215 12:26:53.477756 20179 docker_sandbox.go:372] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "apiserver-cxlh4_kube-service-catalog": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "2a0609362dcc080ff2a553a6aad4616f67e4df84c5dfd956efbb975bd92e8e14"
atomic-openshift-node[20179]: W0215 12:26:59.444697 20179 cni.go:243] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "2a0609362dcc080ff2a553a6aad4616f67e4df84c5dfd956efbb975bd92e8e14"

Could be the wrong sandbox is being referenced for some reason:
https://github.com/openshift/origin/blob/release-3.11/vendor/k8s.io/kubernetes/pkg/kubelet/dockershim/network/cni/cni.go#L209-L213
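A hedged sketch, assuming docker is the runtime, for entering the affected pod's network namespace from the node to run the ip route checks above (the container filter, pod, and gateway are placeholders taken from this comment):

~~~
# On the node: find the pod's pause/sandbox container and its PID.
CID=$(docker ps --filter 'name=k8s_POD_apiserver' --format '{{.ID}}' | head -n1)
PID=$(docker inspect --format '{{.State.Pid}}' "$CID")

# Inspect, and if needed temporarily repair, the routes inside the pod netns.
nsenter -n -t "$PID" ip route
nsenter -n -t "$PID" ip route add default via 10.130.0.1 dev eth0   # gateway from the expected routes above
nsenter -n -t "$PID" curl -vk https://172.30.0.1:443/healthz
~~~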
*** Bug 1663358 has been marked as a duplicate of this bug. ***
*** Bug 1661170 has been marked as a duplicate of this bug. ***
https://github.com/openshift/openshift-ansible/pull/11470 - changes for 3.10 (cherry-pick of PR 11409 into 3.10)
Tested and verified on v3.11.106. No pods end up in CrashLoopBackOff state after a rolling reboot of the nodes.
Workaround: On the nodes, run the following:
~~~
echo -e "r /etc/cni/net.d/80-openshift-network.conf\nr /etc/origin/openvswitch/conf.db" > /usr/lib/tmpfiles.d/cleanup-cni.conf
~~~
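For context: the two r lines tell systemd-tmpfiles to remove the stale CNI config and OVS database at boot, before the node and OVS services come up, so they are recreated cleanly. A hedged sketch for rolling the file out to every node over SSH (node names are placeholders; adjust to your environment):

~~~
# Distribute the tmpfiles.d cleanup rule to each node (sketch).
for node in node1 node2 node3; do   # placeholder node names
    ssh "root@${node}" 'echo -e "r /etc/cni/net.d/80-openshift-network.conf\nr /etc/origin/openvswitch/conf.db" > /usr/lib/tmpfiles.d/cleanup-cni.conf'
done
~~~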
*** Bug 1659864 has been marked as a duplicate of this bug. ***
What is the latest update on this issue? When can we provide the fix to the customer?

Thanks,
Yunyun
There is an errata for the OpenShift installer in the release queue. Note that installing the updated installer packages alone *will not* remediate the cluster; users will need to run the installer to update their clusters to resolve this. That said, any customer who ran the workaround listed in comment #95 has the exact same fix that is incorporated into the installer.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0794
Have requested that a new BZ is created for this related issue. Will track the request there.
Team,

A customer is running OCP 3.11.141 with crio and is noticing this issue. Is the fix specific to docker, or will it work for CRIO as well?
(In reply to Dan Geoffroy from comment #114) > Have requested that a new BZ is created for this related issue. Will track > the request there. Can you share the link for the new BZ?
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days