Description of problem:
Director-deployed OCP 3.11: pods end up in CrashLoopBackOff state after a rolling reboot of the overcloud nodes:

[root@openshift-master-0 heat-admin]# oc get pods --all-namespaces | grep -v Running | grep -v Completed
NAMESPACE               NAME                                           READY   STATUS             RESTARTS   AGE
default                 glusterblock-registry-provisioner-dc-1-dlgd9   0/1     CrashLoopBackOff   8          50m
glusterfs               glusterblock-storage-provisioner-dc-1-8dx75    0/1     Error              9          54m
openshift-console       console-6b4548888-s4rhh                        0/1     CrashLoopBackOff   12         46m
openshift-monitoring    grafana-675bb887cc-k8vm2                       1/2     CrashLoopBackOff   10         46m
openshift-monitoring    kube-state-metrics-7588654c69-kfwqm            2/3     CrashLoopBackOff   11         44m
openshift-monitoring    prometheus-k8s-0                               3/4     CrashLoopBackOff   14         46m
openshift-web-console   webconsole-857446847c-2phh6                    0/1     CrashLoopBackOff   11         46m

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-9.0.1-0.20181013060891.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy the overcloud with 3 masters + 3 infra + 3 worker nodes with CNS enabled
2. Reboot the nodes one by one
3. Check pod status

Actual results:
Some of the pods end up in CrashLoopBackOff state.

Expected results:
All pods should be running.

Additional info:
I managed to reproduce this on a manual openshift-ansible installation (without Director), so I am dropping the blocker flag.
It seems like this is a generic problem that can happen whenever you restart docker; I've reproduced this simply by running openshift-ansible/playbooks/openshift-node/restart.yml.

I experimented by running restart.yml five times; the 2nd and 4th times all the pods came up with no failures. The other times there was exactly one pod in the CrashLoopBackOff state, but a different pod each time:

* openshift-template-service-broker apiserver-t5jlc 0/1 CrashLoopBackOff 236 3d
  Error: Get https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 172.30.0.1:443: connect: network is unreachable

* kube-service-catalog apiserver-blrc8 0/1 CrashLoopBackOff 49 3d
  Error: Get https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 172.30.0.1:443: connect: network is unreachable

* openshift-web-console webconsole-857446847c-fqkx8 0/1 CrashLoopBackOff 83 3d
  F1214 15:37:46.785377 1 console.go:35] Get https://172.30.0.1:443/.well-known/oauth-authorization-server: dial tcp 172.30.0.1:443: connect: network is unreachable

This seems like it may be a generic SDN issue?

Note that each time this happened I was able to go to the system where the affected pod was running, restart docker, and then all the pods were fine. That may be a documentable temporary workaround... ?
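A minimal sketch of that temporary workaround, assuming SSH access as root to the node hosting the affected pod (the pod name, namespace, and node here are placeholders taken from the examples above):

~~~
# Find which node the crash-looping pod landed on.
oc get pod webconsole-857446847c-fqkx8 -n openshift-web-console -o wide

# On that node, restart docker; the kubelet recreates the pod sandboxes.
ssh root@<node> 'systemctl restart docker'

# Back on a master, confirm the pods recover.
oc get pods --all-namespaces | grep -v Running | grep -v Completed
~~~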
Two questions:

1 - Can you describe more exactly how to reproduce this?
2 - Can you post the logs from the SDN pod on the node with the crashing pods? You can get them by listing the sdn pods (oc -n openshift-sdn get pods -o wide, then oc logs -n openshift-sdn <podname>).
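For reference, a hedged sketch of the log collection being asked for here (the node and pod names are placeholders):

~~~
# List the SDN daemonset pods and note which one runs on the affected node.
oc -n openshift-sdn get pods -o wide | grep <node>

# Dump the log of the SDN pod on that node for attaching to the bug.
oc -n openshift-sdn logs sdn-xxxxx > sdn-pod.log
~~~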
Hi, any update on this? Our customer can easily reproduce this issue by restarting a master node.
(In reply to Casey Callendrello from comment #5)
> Two questions:
> 1 - Can you describe more exactly how to reproduce this?

a. SSH to one of the openshift nodes
b. shutdown -r now
c. wait for the node to reboot
d. wait for the atomic-openshift-node service to start
e. re-run steps a-d for the rest of the openshift nodes in the cluster (a hedged script version of these steps is sketched below)

> 2 - Can you post the logs from the SDN pod on the node with the crashing
> pods? You can get them by listing the sdn pods (oc -n openshift-sdn get pods
> -o wide, then oc logs -n openshift-sdn <podname>)

Attaching the log output.
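A minimal, hedged script version of the rolling-reboot reproducer above, assuming passwordless root SSH to the nodes; node names are placeholders supplied by the caller:

~~~
#!/bin/bash
# Rolling reboot reproducer (sketch): reboot each node and wait for the
# atomic-openshift-node service to report active again before moving on.
set -e
for node in "$@"; do
    echo "Rebooting ${node}"
    ssh "root@${node}" 'shutdown -r now' || true   # the connection drops as the node goes down
    sleep 60                                       # give the node time to go down
    until ssh "root@${node}" 'systemctl is-active atomic-openshift-node' 2>/dev/null | grep -q '^active$'; do
        sleep 10
    done
    echo "${node} is back"
done
~~~

Usage (node names hypothetical): ./rolling-reboot.sh openshift-master-0 openshift-infra-0 openshift-worker-0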
Created attachment 1518156 [details] sdn pod log output Attaching sdn pod log output.
Interesting, thanks for the helpful logs.

It looks like we're triggering the same issue noticed in https://github.com/openshift/origin/pull/21654:

W0103 03:46:31.823333    9682 node.go:367] will restart pod 'openshift-monitoring/alertmanager-main-0' due to update failure on restart: could not parse ofport "": strconv.Atoi: parsing "": invalid syntax
W0103 03:46:31.838837    9682 node.go:367] will restart pod 'openshift-monitoring/kube-state-metrics-7588654c69-mzq92' due to update failure on restart: could not parse ofport "": strconv.Atoi: parsing "": invalid syntax
W0103 03:46:31.904371    9682 node.go:367] will restart pod 'openshift-monitoring/prometheus-operator-769776d47-rgljl' due to update failure on restart: could not parse ofport "": strconv.Atoi: parsing "": invalid syntax

Weibin or Meng Bo, can you try and reproduce this? Thank you.
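The `could not parse ofport ""` warning appears to mean OVS returned an empty ofport for a pod's veth when openshift-sdn tried to reattach existing pods after restart. A hedged debugging sketch for checking that on the affected node (the veth name is a placeholder; if ovs-vsctl is not on the host, the same commands can be run with oc exec inside the ovs pod for that node):

~~~
# Ports openshift-sdn has attached to the SDN bridge for each pod.
ovs-vsctl list-ports br0

# An empty value here would match the strconv.Atoi parse error in the sdn log.
ovs-vsctl get Interface vethXXXXXXX ofport

# Cross-check which ofports OVS actually knows about on br0.
ovs-ofctl -O OpenFlow13 dump-ports-desc br0
~~~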
Casey, I reproduced the problem and saw the same errors as comment 11 in my v3.11.66 cluster:

[root@ip-172-18-11-3 ec2-user]# oc get pods --all-namespaces | grep -v Running | grep -v Completed
NAMESPACE                          NAME                                   READY   STATUS             RESTARTS   AGE
default                            router-1-deploy                        0/1     Error              0          37m
openshift-ansible-service-broker   asb-1-gnw44                            0/1     CrashLoopBackOff   7          32m
openshift-monitoring               alertmanager-main-1                    2/3     CrashLoopBackOff   9          34m
openshift-monitoring               prometheus-operator-7566fcccc8-vhgsd   0/1     CrashLoopBackOff   7          36m

[root@ip-172-18-11-3 ec2-user]# oc logs pod/sdn-f5fb -n openshift-sdn
I0104 14:35:42.598927    8848 node.go:348] Starting openshift-sdn pod manager
E0104 14:35:42.608313    8848 cniserver.go:148] failed to remove old pod info socket: remove /var/run/openshift-sdn: device or resource busy
E0104 14:35:42.608403    8848 cniserver.go:151] failed to remove contents of socket directory: remove /var/run/openshift-sdn: device or resource busy
W0104 14:35:42.623752    8848 util_unix.go:75] Using "/var/run/dockershim.sock" as endpoint is deprecated, please consider using full url format "unix:///var/run/dockershim.sock".
W0104 14:35:42.695796    8848 node.go:367] will restart pod 'openshift-ansible-service-broker/asb-1-gnw44' due to update failure on restart: could not parse ofport "": strconv.Atoi: parsing "": invalid syntax
W0104 14:35:42.725736    8848 node.go:367] will restart pod 'openshift-monitoring/alertmanager-main-1' due to update failure on restart: could not parse ofport "": strconv.Atoi: parsing "": invalid syntax
W0104 14:35:42.751036    8848 node.go:367] will restart pod 'openshift-monitoring/prometheus-operator-7566fcccc8-vhgsd' due to update failure on restart: could not parse ofport "": strconv.Atoi: parsing "": invalid syntax
The event log from the failed pod:

  Normal   Started                 24m                kubelet, ip-172-18-8-14.ec2.internal        Started container
  Warning  FailedCreatePodSandBox  3m                 kubelet, ip-172-18-8-14.ec2.internal        Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "8a7b56bdabd9ab1ee24afd007d28340e8a0bdae044666913e2bd6a3cbb80092c" network for pod "prometheus-operator-7566fcccc8-vhgsd": NetworkPlugin cni failed to set up pod "prometheus-operator-7566fcccc8-vhgsd_openshift-monitoring" network: OpenShift SDN network process is not (yet?) available
  Warning  FailedCreatePodSandBox  3m                 kubelet, ip-172-18-8-14.ec2.internal        Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "dfa1270c600e6d1a6b6c4b81b89c3a3eaf434a599a0adf4b9866b5f4d0061d2e" network for pod "prometheus-operator-7566fcccc8-vhgsd": NetworkPlugin cni failed to set up pod "prometheus-operator-7566fcccc8-vhgsd_openshift-monitoring" network: OpenShift SDN network process is not (yet?) available
  Normal   SandboxChanged          3m (x3 over 3m)    kubelet, ip-172-18-8-14.ec2.internal        Pod sandbox changed, it will be killed and re-created.
  Warning  NetworkFailed           3m                 openshift-sdn, ip-172-18-8-14.ec2.internal  The pod's network interface has been lost and the pod will be stopped.
  Normal   Pulled                  1m (x4 over 3m)    kubelet, ip-172-18-8-14.ec2.internal        Container image "registry.reg-aws.openshift.com:443/openshift3/ose-prometheus-operator:v3.11" already present on machine
  Normal   Created                 1m (x4 over 3m)    kubelet, ip-172-18-8-14.ec2.internal        Created container
  Normal   Started                 1m (x4 over 3m)    kubelet, ip-172-18-8-14.ec2.internal        Started container
  Warning  BackOff                 25s (x13 over 3m)  kubelet, ip-172-18-8-14.ec2.internal        Back-off restarting failed container
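Events like these are typically gathered with oc describe; a hedged sketch using the pod and namespace from the output above:

~~~
# Events for the crash-looping pod.
oc describe pod prometheus-operator-7566fcccc8-vhgsd -n openshift-monitoring

# Or list recent events for the whole namespace, sorted by time.
oc get events -n openshift-monitoring --sort-by=.lastTimestamp
~~~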
Assigning to Phil to take a look.
I have seen this when bringing up ovs/ovn on a 3.11 cluster. I don't know what is causing the problem, but the order in which the ovs and sdn daemons come up, and the delays between their starts, seem suspicious. In the ovn case, deleting the node pods and the subsequent restart fixes the problem. In this case, restarting docker fixes the problem. The common component in both cases is ovs.
Mark, could you take a look at this? I think there is something happening between ovs and sdn, but I'm not sure how to figure it out. Thanks
Worked with Phil in my v3.11 cluster, which has only one master, one infra and two worker nodes. After just running "shutdown -r now" on the master, we saw the same issue, and both the ovs pod and the sdn pod restarted once after the master came back. The testing log is attached.
Created attachment 1520616 [details] Testing log
Notes on comment 18:

I think this may be a startup sequencing problem: SDN starts before OVS is ready for it, SDN handles this badly, and it breaks all existing pods on the node. It appears that 3.9 works and this is new in 3.11.

As an aside, I am running into similar behavior when bringing up ovs/ovn networking (1654942). There may be a common cause.

The suggested workaround, restarting docker, effectively restarts the pods, and after a while networking works again.

Looking into how ovs and sdn work together. Also looking at delaying sdn until ovs is ready.
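A hedged sketch for checking whether the ovs and sdn daemonset pods on a node really did come up out of order after a reboot (the node and pod names are placeholders):

~~~
# Restart counts and ages of the ovs and sdn pods on the rebooted node.
oc -n openshift-sdn get pods -o wide | grep <node>

# Container start timestamps for both pods, to compare the ordering.
oc -n openshift-sdn get pod sdn-xxxxx -o jsonpath='{.status.containerStatuses[0].state.running.startedAt}{"\n"}'
oc -n openshift-sdn get pod ovs-yyyyy -o jsonpath='{.status.containerStatuses[0].state.running.startedAt}{"\n"}'
~~~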
Rolling restarts of nodes in our cluster seem to trigger this behavior too. Right now, in kube-service-catalog, the apiserver pod on one master is crash looping with:

"Error: Get https://10.127.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 10.127.0.1:443: connect: network is unreachable"

If I delete the pod, the newly created one works just fine.

No idea if it's related, but our logging-fluentd pods had problems on reboot too, looking very similar to https://bugzilla.redhat.com/show_bug.cgi?id=1538971. I solved that by removing the label to stop logging first, rebooting, then adding the label back. Could that also be caused by startup order?

3.11.59
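A hedged sketch of the "delete the pod and let it be recreated" workaround described above (the pod name is a placeholder; its controller recreates the pod automatically):

~~~
# Find the crash-looping apiserver pod and delete it.
oc -n kube-service-catalog get pods -o wide
oc -n kube-service-catalog delete pod apiserver-xxxxx

# Watch the replacement pod come up Ready.
oc -n kube-service-catalog get pods -w
~~~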
We had a suspicion that cleanup of the ovs db on restart might be the problem; we backported two PRs, but that didn't work. Still looking for the root cause.
Hi,

While the bug is being fixed, could we have a better workaround? Restarting docker is not a good workaround: sometimes the infrastructure can have an unexpected reboot, or you may reboot the servers one by one after an update, and it's expected that everything keeps working.

Regards,
Oscar
I was able to narrow in on the bug.

1. This happens only when the node is first starting up, likely due to pods starting up before the SDN is fully up and running.
2. A pod gets started and all the OVS networking is set up correctly (flows, ports, interfaces).
3. What is missing is the correct routes in this network namespace. The pod (sandbox) container is never recreated; just the app container is restarted over and over again, which is to be expected.
4. The network that is set up for this pod is used for the application container, but the correct routes are never added.

Guessing somewhere around here we hit the issue where the routes are not fully set up:
https://github.com/openshift/origin/blob/release-3.11/pkg/network/node/pod.go#L115

Inside the affected pod's network namespace, only the link-scope route is present:

# ip route
10.130.0.0/23 dev eth0 proto kernel scope link src 10.130.0.66

Due to the missing routes, connections fail:

# curl -vk https://172.30.0.1:443/healthz
curl: (7) Failed to connect to 172.30.0.1: Network is unreachable

We expect to see this:

default via 10.130.0.1 dev eth0
10.128.0.0/14 dev eth0
10.130.0.0/23 dev eth0 proto kernel scope link src 10.130.0.66
224.0.0.0/4 dev eth0

When we add a default route:

# ip route add default via 10.130.0.1 dev eth0

we are able to connect using the kubernetes service IP:

# curl -vk https://172.30.0.1:443/healthz
HTTP/1.1 200 OK

This error is seen around one of the pods we reproduced with:

atomic-openshift-node[20179]: W0215 12:26:53.477756 20179 docker_sandbox.go:372] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "apiserver-cxlh4_kube-service-catalog": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "2a0609362dcc080ff2a553a6aad4616f67e4df84c5dfd956efbb975bd92e8e14"
atomic-openshift-node[20179]: W0215 12:26:59.444697 20179 cni.go:243] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "2a0609362dcc080ff2a553a6aad4616f67e4df84c5dfd956efbb975bd92e8e14"

Could be the wrong sandbox is being referenced for some reason:
https://github.com/openshift/origin/blob/release-3.11/vendor/k8s.io/kubernetes/pkg/kubelet/dockershim/network/cni/cni.go#L209-L213
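A hedged sketch, assuming docker is the runtime, for entering the affected pod's network namespace from the node to run the ip route checks above (the container filter, pod, and gateway are placeholders taken from this comment):

~~~
# On the node: find the pod's pause/sandbox container and its PID.
CID=$(docker ps --filter 'name=k8s_POD_apiserver' --format '{{.ID}}' | head -n1)
PID=$(docker inspect --format '{{.State.Pid}}' "$CID")

# Inspect, and if needed temporarily repair, the routes inside the pod netns.
nsenter -n -t "$PID" ip route
nsenter -n -t "$PID" ip route add default via 10.130.0.1 dev eth0   # gateway from the expected routes above
nsenter -n -t "$PID" curl -vk https://172.30.0.1:443/healthz
~~~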
*** Bug 1663358 has been marked as a duplicate of this bug. ***
*** Bug 1661170 has been marked as a duplicate of this bug. ***
https://github.com/openshift/openshift-ansible/pull/11470 - changes for 3.10 (cherry-pick of PR 11409 into 3.10)
Tested and verified on v3.11.106. No pods end up in CrashLoopBackOff state after a rolling reboot of the nodes.
Workaround: On the nodes, run the following:
~~~
echo -e "r /etc/cni/net.d/80-openshift-network.conf\nr /etc/origin/openvswitch/conf.db" > /usr/lib/tmpfiles.d/cleanup-cni.conf
~~~
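For context: the two r lines tell systemd-tmpfiles to remove the stale CNI config and OVS database at boot, before the node and OVS services come up, so they are recreated cleanly. A hedged sketch for rolling the file out to every node over SSH (node names are placeholders; adjust to your environment):

~~~
# Distribute the tmpfiles.d cleanup rule to each node (sketch).
for node in node1 node2 node3; do   # placeholder node names
    ssh "root@${node}" 'echo -e "r /etc/cni/net.d/80-openshift-network.conf\nr /etc/origin/openvswitch/conf.db" > /usr/lib/tmpfiles.d/cleanup-cni.conf'
done
~~~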
*** Bug 1659864 has been marked as a duplicate of this bug. ***
What is the latest update on this issue? When can we provide the fix to the customer?

Thanks,
Yunyun
There is an errata for the OpenShift installer in the release queue. Note that installing the updated installer packages alone *will not* remediate the cluster; users will need to run the installer to update their clusters to resolve this. That said, any customer who ran the workaround listed in comment #95 has the exact same fix that is incorporated into the installer.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0794
Have requested that a new BZ is created for this related issue. Will track the request there.
Team,

A customer is running OCP 3.11.141 with crio and is noticing this issue. Is the fix specific to docker, or will it work for CRIO as well?
(In reply to Dan Geoffroy from comment #114) > Have requested that a new BZ is created for this related issue. Will track > the request there. Can you share the link for the new BZ?
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days