Bug 1421035
| Field | Value |
|---|---|
| Summary | Router pod keep restarting then become CrashLoopBackOff in container network mode |
| Product | OpenShift Container Platform |
| Component | Networking |
| Networking sub component | router |
| Reporter | Yan Du <yadu> |
| Assignee | Weibin Liang <weliang> |
| QA Contact | zhaozhanqi <zzhao> |
| Status | CLOSED NOTABUG |
| Severity | medium |
| Priority | medium |
| CC | aos-bugs, bbennett, ichavero, weliang, yadu |
| Version | 3.5.0 |
| Keywords | Reopened |
| Hardware | Unspecified |
| OS | Unspecified |
| Last Closed | 2017-03-01 17:01:06 UTC |
| Type | Bug |

Description
Yan Du
2017-02-10 07:51:19 UTC
Jake can't reproduce this with Origin latest. Weibin, can you try with OSE please?

The same v3.5.0.18 failed on env.

[root@dhcp-41-87 byo]# oc version
oc v3.5.0.18+9a5d1aa
kubernetes v1.5.2+43a9be4
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://dhcp-41-87.bos.redhat.com:8443
openshift v3.5.0.18+9a5d1aa
kubernetes v1.5.2+43a9be4
[root@dhcp-41-87 byo]#
[root@dhcp-41-87 byo]# oc new-project http
Already on project "http" on server "https://dhcp-41-87.bos.redhat.com:8443".

You can add applications to this project with the 'new-app' command. For example, try:

    oc new-app centos/ruby-22-centos7~https://github.com/openshift/ruby-ex.git

to build a new example application in Ruby.
[root@dhcp-41-87 byo]# oadm policy add-scc-to-user privileged -z http-user
[root@dhcp-41-87 byo]# oadm router router-http --replicas=1 --service-account=http-user -n http --host-network=false
info: password for stats user admin has been set to IS4WwULmfW
--> Creating router router-http ...
    serviceaccount "http-user" created
    warning: clusterrolebinding "router-router-http-role" already exists
    deploymentconfig "router-http" created
    service "router-http" created
--> Success
[root@dhcp-41-87 byo]# oc get pods
NAME                   READY     STATUS              RESTARTS   AGE
router-http-1-deploy   0/1       ContainerCreating   0          4s
[root@dhcp-41-87 byo]# oc get pods
NAME                   READY     STATUS         RESTARTS   AGE
router-http-1-4b3g8    0/1       ErrImagePull   0          12s
router-http-1-deploy   1/1       Running        0          19s
[root@dhcp-41-87 byo]# oc logs router-http-1-4b3g8
Error from server (BadRequest): container "router" in pod "router-http-1-4b3g8" is waiting to start: trying and failing to pull image
[root@dhcp-41-87 byo]# oc get events
LASTSEEN FIRSTSEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE
1m 1m 1 router-http-1-4b3g8 Pod Normal Scheduled {default-scheduler } Successfully assigned router-http-1-4b3g8 to dhcp-41-106.bos.redhat.com
3s 1m 4 router-http-1-4b3g8 Pod spec.containers{router} Normal Pulling {kubelet dhcp-41-106.bos.redhat.com} pulling image "openshift/origin-haproxy-router:v3.5.0.18"
43s 1m 3 router-http-1-4b3g8 Pod spec.containers{router} Warning Failed {kubelet dhcp-41-106.bos.redhat.com} Failed to pull image "openshift/origin-haproxy-router:v3.5.0.18": manifest unknown: manifest unknown
43s 1m 3 router-http-1-4b3g8 Pod Warning FailedSync {kubelet dhcp-41-106.bos.redhat.com} Error syncing pod, skipping: failed to "StartContainer" for "router" with ErrImagePull: "manifest unknown: manifest unknown"
1m 1m 1 router-http-1-4b3g8 Pod Warning FailedSync {kubelet dhcp-41-106.bos.redhat.com} Error syncing pod, skipping: failed to "SetupNetwork" for "router-http-1-4b3g8_http" with SetupNetworkError: "Failed to setup network for pod \"router-http-1-4b3g8_http(29c09530-f2c3-11e6-996e-525400bef5b7)\" using network plugins \"cni\": CNI request failed with status 400: 'Cannot open hostport 80 for pod router-http-1-4b3g8_http: listen tcp :80: bind: address already in use\n'; Skipping pod"
17s 1m 6 router-http-1-4b3g8 Pod spec.containers{router} Normal BackOff {kubelet dhcp-41-106.bos.redhat.com} Back-off pulling image "openshift/origin-haproxy-router:v3.5.0.18"
17s 1m 6 router-http-1-4b3g8 Pod Warning FailedSync {kubelet dhcp-41-106.bos.redhat.com} Error syncing pod, skipping: failed to "StartContainer" for "router" with ImagePullBackOff: "Back-off pulling image \"openshift/origin-haproxy-router:v3.5.0.18\""
1m 1m 1 router-http-1-deploy Pod Normal Scheduled {default-scheduler } Successfully assigned router-http-1-deploy to dhcp-41-55.bos.redhat.com
1m 1m 1 router-http-1-deploy Pod spec.containers{deployment} Normal Pulled {kubelet dhcp-41-55.bos.redhat.com} Container image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/ose-deployer:v3.5.0.18" already present on machine
1m 1m 1 router-http-1-deploy Pod spec.containers{deployment} Normal Created {kubelet dhcp-41-55.bos.redhat.com} Created container with docker id cb0bffdbec33; Security:[seccomp=unconfined]
1m 1m 1 router-http-1-deploy Pod spec.containers{deployment} Normal Started {kubelet dhcp-41-55.bos.redhat.com} Started container with docker id cb0bffdbec33
1m 1m 1 router-http-1 ReplicationController Normal SuccessfulCreate {replication-controller } Created pod: router-http-1-4b3g8
1m 1m 1 router-http DeploymentConfig Normal DeploymentCreated {deploymentconfig-controller } Created new replication controller "router-http-1" for version 1
[root@dhcp-41-87 byo]#

The router pod can be deployed after running "oadm router router-container --replicas=1 --service-account=https-user -n https --images='openshift3/ose-${component}:${version}' --host-network=false".

[root@dhcp-41-87 origin]# oc version
oc v3.5.0.18+9a5d1aa
kubernetes v1.5.2+43a9be4
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://dhcp-41-87.bos.redhat.com:8443
openshift v3.5.0.18+9a5d1aa
kubernetes v1.5.2+43a9be4
[root@dhcp-41-87 origin]# oadm router router-container --replicas=1 --service-account=https-user -n https --images='openshift3/ose-${component}:${version}' --host-network=false
[root@dhcp-41-87 origin]# oc get pods -o wide -w
NAME                       READY     STATUS    RESTARTS   AGE       IP            NODE
router-container-1-nffrd   1/1       Running   0          8m        10.130.0.17   dhcp-41-106.bos.redhat.com
router1-1-rx2h6            1/1       Running   0          9m        10.18.41.55   dhcp-41-55.bos.redhat.com

I also tried --host-network=false in AWS with v3.5.0.20; it works fine there too.

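For reference, the --images value used in the commands above is a template in which ${component} and ${version} are placeholders. The sketch below only illustrates that substitution; it is not how oadm implements it. It uses Python's string.Template, which happens to share the ${name} syntax. The default template shown is an assumption inferred from the image name in the failed pull ("openshift/origin-haproxy-router:v3.5.0.18"), and expand_image is a hypothetical helper.

```python
from string import Template

# Template passed explicitly via --images in the comments above.
OSE_IMAGES = "openshift3/ose-${component}:${version}"

# ASSUMPTION: default template, inferred from the image name in the
# failed pull event; it is not stated anywhere in the report itself.
ORIGIN_IMAGES = "openshift/origin-${component}:${version}"


def expand_image(template, component, version):
    """Fill the ${component} and ${version} placeholders of an image template."""
    return Template(template).substitute(component=component, version=version)


if __name__ == "__main__":
    # Image the kubelet tried (and failed) to pull without --images.
    print(expand_image(ORIGIN_IMAGES, "haproxy-router", "v3.5.0.18"))
    # Image name produced when the OSE template is passed instead.
    print(expand_image(OSE_IMAGES, "haproxy-router", "v3.5.0.18"))
```

Read this way, the ErrImagePull above is separate from the CrashLoopBackOff: without --images the pull targets an origin image that the registry reports as "manifest unknown", while the OSE template names an image that can be pulled.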
I can still reproduce the issue with the latest OCP env:

openshift v3.5.0.20+87266c6
kubernetes v1.5.2+43a9be4
etcd 3.1.0

After changing hostNetwork to false, monitor the pod:

# oc get pod -w
router-2-7jps5    0/1       Pending             0         1s
router-2-7jps5    0/1       ContainerCreating   0         1s
router-2-deploy   0/1       Completed           0         7s
router-2-deploy   0/1       Terminating         0         7s
router-2-deploy   0/1       Terminating         0         7s
router-2-7jps5    0/1       Running             0         4s
router-2-7jps5    0/1       Running             1         44s
router-2-7jps5    0/1       Running             2         1m
router-2-7jps5    0/1       Running             3         1m
router-2-7jps5    0/1       Running             4         1m
router-2-7jps5    0/1       CrashLoopBackOff    4         2m

Part of the log is attached:

r} Normal Started Started container with docker id 7d8478ed2dd6
3m 3m 1 {kubelet host-8-174-95.host.centralci.eng.rdu2.redhat.com} spec.containers{router} Normal Killing Killing container with docker id 93465e7c66dd: pod "router-2-7jps5_default(dfff0717-f349-11e6-8d27-fa163ebf4833)" container "router" is unhealthy, it will be killed and re-created.
3m 3m 1 {kubelet host-8-174-95.host.centralci.eng.rdu2.redhat.com} spec.containers{router} Normal Created Created container with docker id 7d8478ed2dd6; Security:[seccomp=unconfined]
3m 3m 1 {kubelet host-8-174-95.host.centralci.eng.rdu2.redhat.com} spec.containers{router} Normal Killing Killing container with docker id 7d8478ed2dd6: pod "router-2-7jps5_default(dfff0717-f349-11e6-8d27-fa163ebf4833)" container "router" is unhealthy, it will be killed and re-created.
3m 3m 1 {kubelet host-8-174-95.host.centralci.eng.rdu2.redhat.com} spec.containers{router} Normal Started Started container with docker id 93c43f56795a
3m 3m 1 {kubelet host-8-174-95.host.centralci.eng.rdu2.redhat.com} spec.containers{router} Normal Created Created container with docker id 93c43f56795a; Security:[seccomp=unconfined]
2m 2m 1 {kubelet host-8-174-95.host.centralci.eng.rdu2.redhat.com} spec.containers{router} Normal Killing Killing container with docker id 93c43f56795a: pod "router-2-7jps5_default(dfff0717-f349-11e6-8d27-fa163ebf4833)" container "router" is unhealthy, it will be killed and re-created.
2m 2m 1 {kubelet host-8-174-95.host.centralci.eng.rdu2.redhat.com} spec.containers{router} Normal Created Created container with docker id 01a67a2462cc; Security:[seccomp=unconfined]
2m 2m 1 {kubelet host-8-174-95.host.centralci.eng.rdu2.redhat.com} spec.containers{router} Normal Started Started container with docker id 01a67a2462cc
2m 2m 1 {kubelet host-8-174-95.host.centralci.eng.rdu2.redhat.com} spec.containers{router} Normal Killing Killing container with docker id 01a67a2462cc: pod "router-2-7jps5_default(dfff0717-f349-11e6-8d27-fa163ebf4833)" container "router" is unhealthy, it will be killed and re-created.
2m 2m 4 {kubelet host-8-174-95.host.centralci.eng.rdu2.redhat.com} Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "router" with CrashLoopBackOff: "Back-off 40s restarting failed container=router pod=router-2-7jps5_default(dfff0717-f349-11e6-8d27-fa163ebf4833)"

Yandu, I logged in to your master. After deleting your existing router and recreating it with "oadm router router --images='openshift3/ose-${component}:${version}' --host-network=false", the router has been stable for 15m. When you create a router without the --images option, you will see an ErrImagePull error; DE is working on that issue now.

[root@host-8-175-216 ~]# oc get all
NAME                  DOCKER REPO                                   TAGS      UPDATED
is/registry-console   172.30.222.93:5000/default/registry-console   3.5

NAME                  REVISION   DESIRED   CURRENT   TRIGGERED BY
dc/docker-registry    2          1         1         config
dc/registry-console   1          1         1         config
dc/router             5          1         1         config

NAME                    DESIRED   CURRENT   READY     AGE
rc/docker-registry-1    0         0         0         1d
rc/docker-registry-2    1         1         1         1d
rc/registry-console-1   1         1         1         1d
rc/router-3             0         0         0         29m
rc/router-4             0         0         0         25m
rc/router-5             1         1         0         22m

NAME                      HOST/PORT                                          PATH      SERVICES           PORT               TERMINATION   WILDCARD
routes/docker-registry    docker-registry-default.0214-bfs.qe.rhcloud.com              docker-registry    5000-tcp           passthrough   None
routes/registry-console   registry-console-default.0214-bfs.qe.rhcloud.com             registry-console   registry-console   passthrough   None

NAME                   CLUSTER-IP       EXTERNAL-IP   PORT(S)                   AGE
svc/docker-registry    172.30.222.93    <none>        5000/TCP                  1d
svc/kubernetes         172.30.0.1       <none>        443/TCP,53/UDP,53/TCP     1d
svc/registry-console   172.30.190.18    <none>        9000/TCP                  1d
svc/router             172.30.211.236   <none>        80/TCP,443/TCP,1936/TCP   1d

NAME                          READY     STATUS             RESTARTS   AGE
po/docker-registry-2-z2ltg    1/1       Running            2          1d
po/registry-console-1-7gqhf   1/1       Running            2          1d
po/router-5-deploy            1/1       Running            0          8m
po/router-5-h1tkw             0/1       CrashLoopBackOff   7          8m
[root@host-8-175-216 ~]# oc delete dc/router
deploymentconfig "router" deleted
[root@host-8-175-216 ~]# oc delete svc/router
service "router" deleted
[root@host-8-175-216 ~]# oc get pods
NAME                       READY     STATUS      RESTARTS   AGE
docker-registry-2-z2ltg    1/1       Running     2          1d
registry-console-1-7gqhf   1/1       Running     2          1d
router-5-deploy            0/1       Completed   0          8m
[root@host-8-175-216 ~]# oc delete po/router-5-deploy
pod "router-5-deploy" deleted
[root@host-8-175-216 ~]# oc delete svc/router
Error from server (NotFound): services "router" not found
[root@host-8-175-216 ~]# oc get pods
NAME                       READY     STATUS    RESTARTS   AGE
docker-registry-2-z2ltg    1/1       Running   2          1d
registry-console-1-7gqhf   1/1       Running   2          1d
[root@host-8-175-216 ~]# oc get all
NAME                  DOCKER REPO                                   TAGS      UPDATED
is/registry-console   172.30.222.93:5000/default/registry-console   3.5

NAME                  REVISION   DESIRED   CURRENT   TRIGGERED BY
dc/docker-registry    2          1         1         config
dc/registry-console   1          1         1         config

NAME                    DESIRED   CURRENT   READY     AGE
rc/docker-registry-1    0         0         0         1d
rc/docker-registry-2    1         1         1         1d
rc/registry-console-1   1         1         1         1d

NAME                      HOST/PORT                                          PATH      SERVICES           PORT               TERMINATION   WILDCARD
routes/docker-registry    docker-registry-default.0214-bfs.qe.rhcloud.com              docker-registry    5000-tcp           passthrough   None
routes/registry-console   registry-console-default.0214-bfs.qe.rhcloud.com             registry-console   registry-console   passthrough   None

NAME                   CLUSTER-IP      EXTERNAL-IP   PORT(S)                 AGE
svc/docker-registry    172.30.222.93   <none>        5000/TCP                1d
svc/kubernetes         172.30.0.1      <none>        443/TCP,53/UDP,53/TCP   1d
svc/registry-console   172.30.190.18   <none>        9000/TCP                1d

NAME                          READY     STATUS    RESTARTS   AGE
po/docker-registry-2-z2ltg    1/1       Running   2          1d
po/registry-console-1-7gqhf   1/1       Running   2          1d
[root@host-8-175-216 ~]# oadm router router --images='openshift3/ose-${component}:${version}' --host-network=false
info: password for stats user admin has been set to MjkCCV8Wyy
--> Creating router router ...
    warning: serviceaccounts "router" already exists
    warning: clusterrolebinding "router-router-role" already exists
    deploymentconfig "router" created
    service "router" created
--> Success
[root@host-8-175-216 ~]# oc get all
NAME                  DOCKER REPO                                   TAGS      UPDATED
is/registry-console   172.30.222.93:5000/default/registry-console   3.5

NAME                  REVISION   DESIRED   CURRENT   TRIGGERED BY
dc/docker-registry    2          1         1         config
dc/registry-console   1          1         1         config
dc/router             1          1         1         config

NAME                    DESIRED   CURRENT   READY     AGE
rc/docker-registry-1    0         0         0         1d
rc/docker-registry-2    1         1         1         1d
rc/registry-console-1   1         1         1         1d
rc/router-1             1         1         0         9s

NAME                      HOST/PORT                                          PATH      SERVICES           PORT               TERMINATION   WILDCARD
routes/docker-registry    docker-registry-default.0214-bfs.qe.rhcloud.com              docker-registry    5000-tcp           passthrough   None
routes/registry-console   registry-console-default.0214-bfs.qe.rhcloud.com             registry-console   registry-console   passthrough   None

NAME                   CLUSTER-IP      EXTERNAL-IP   PORT(S)                   AGE
svc/docker-registry    172.30.222.93   <none>        5000/TCP                  1d
svc/kubernetes         172.30.0.1      <none>        443/TCP,53/UDP,53/TCP     1d
svc/registry-console   172.30.190.18   <none>        9000/TCP                  1d
svc/router             172.30.13.116   <none>        80/TCP,443/TCP,1936/TCP   9s

NAME                          READY     STATUS    RESTARTS   AGE
po/docker-registry-2-z2ltg    1/1       Running   2          1d
po/registry-console-1-7gqhf   1/1       Running   2          1d
po/router-1-21xtk             0/1       Running   0          7s
po/router-1-deploy            1/1       Running   0          9s
[root@host-8-175-216 ~]# oc get pods -w
NAME                       READY     STATUS    RESTARTS   AGE
docker-registry-2-z2ltg    1/1       Running   2          1d
registry-console-1-7gqhf   1/1       Running   2          1d
router-1-21xtk             0/1       Running   0          17s
router-1-deploy            1/1       Running   0          19s
NAME              READY     STATUS        RESTARTS   AGE
router-1-21xtk    1/1       Running       0          21s
router-1-deploy   0/1       Completed     0          23s
router-1-deploy   0/1       Terminating   0          23s
router-1-deploy   0/1       Terminating   0          23s
^C[root@host-8-175-216 ~]# oc get all
NAME                  DOCKER REPO                                   TAGS      UPDATED
is/registry-console   172.30.222.93:5000/default/registry-console   3.5

NAME                  REVISION   DESIRED   CURRENT   TRIGGERED BY
dc/docker-registry    2          1         1         config
dc/registry-console   1          1         1         config
dc/router             1          1         1         config

NAME                    DESIRED   CURRENT   READY     AGE
rc/docker-registry-1    0         0         0         1d
rc/docker-registry-2    1         1         1         1d
rc/registry-console-1   1         1         1         1d
rc/router-1             1         1         1         4m

NAME                      HOST/PORT                                          PATH      SERVICES           PORT               TERMINATION   WILDCARD
routes/docker-registry    docker-registry-default.0214-bfs.qe.rhcloud.com              docker-registry    5000-tcp           passthrough   None
routes/registry-console   registry-console-default.0214-bfs.qe.rhcloud.com             registry-console   registry-console   passthrough   None

NAME                   CLUSTER-IP      EXTERNAL-IP   PORT(S)                   AGE
svc/docker-registry    172.30.222.93   <none>        5000/TCP                  1d
svc/kubernetes         172.30.0.1      <none>        443/TCP,53/UDP,53/TCP     1d
svc/registry-console   172.30.190.18   <none>        9000/TCP                  1d
svc/router             172.30.13.116   <none>        80/TCP,443/TCP,1936/TCP   4m

NAME                          READY     STATUS    RESTARTS   AGE
po/docker-registry-2-z2ltg    1/1       Running   2          1d
po/registry-console-1-7gqhf   1/1       Running   2          1d
po/router-1-21xtk             1/1       Running   0          4m
[root@host-8-175-216 ~]# oc get pods -o wide
NAME                       READY     STATUS    RESTARTS   AGE       IP            NODE
docker-registry-2-z2ltg    1/1       Running   2          1d        10.129.0.19   host-8-174-95.host.centralci.eng.rdu2.redhat.com
registry-console-1-7gqhf   1/1       Running   2          1d        10.129.0.20   host-8-174-95.host.centralci.eng.rdu2.redhat.com
router-1-21xtk             1/1       Running   0          10m       10.129.0.39   host-8-174-95.host.centralci.eng.rdu2.redhat.com
[root@host-8-175-216 ~]#
[root@host-8-175-216 ~]# oc get pods -o wide
NAME                       READY     STATUS    RESTARTS   AGE       IP            NODE
docker-registry-2-z2ltg    1/1       Running   2          1d        10.129.0.19   host-8-174-95.host.centralci.eng.rdu2.redhat.com
registry-console-1-7gqhf   1/1       Running   2          1d        10.129.0.20   host-8-174-95.host.centralci.eng.rdu2.redhat.com
router-1-21xtk             1/1       Running   0          13m       10.129.0.39   host-8-174-95.host.centralci.eng.rdu2.redhat.com
[root@host-8-175-216 ~]#
[root@host-8-175-216 ~]# oc get pods -o wide
NAME                       READY     STATUS    RESTARTS   AGE       IP            NODE
docker-registry-2-z2ltg    1/1       Running   2          1d        10.129.0.19   host-8-174-95.host.centralci.eng.rdu2.redhat.com
registry-console-1-7gqhf   1/1       Running   2          1d        10.129.0.20   host-8-174-95.host.centralci.eng.rdu2.redhat.com
router-1-21xtk             1/1       Running   0          15m       10.129.0.39   host-8-174-95.host.centralci.eng.rdu2.redhat.com
[root@host-8-175-216 ~]#

How did you deploy? I went to your system and ran:

oadm router --host-network=false --images='openshift3/ose-${component}:${version}'

And that image is happily running. Somehow before it was contacting localhost directly in the liveness checks, but localhost was resolving to the IPv6 address and it was failing (look in /etc/hosts).

I found the steps to reproduce the bug:

1. Create the router.

# oadm router router --images='openshift3/ose-${component}:${version}'
info: password for stats user admin has been set to YzFMuOsfDv
--> Creating router router ...
    warning: serviceaccounts "router" already exists
    warning: clusterrolebinding "router-router-role" already exists
    deploymentconfig "router" created
    service "router" created
--> Success
# oc get pod
NAME             READY     STATUS    RESTARTS   AGE
router-1-pb4d6   1/1       Running   0          42s

2. Edit the router dc to set hostNetwork: false (it was hostNetwork: true). The router is then re-deployed, and the pod keeps restarting and goes into CrashLoopBackOff.

# oc get pod -w
NAME             READY     STATUS             RESTARTS   AGE
router-2-5s803   0/1       Running            1          43s
router-2-5s803   0/1       Running            2          1m
router-2-5s803   0/1       Running            3          1m
router-2-5s803   0/1       Running            4          1m
router-2-5s803   0/1       CrashLoopBackOff   4          2m
router-2-5s803   0/1       Running            5          2m
router-2-5s803   0/1       CrashLoopBackOff   5          3m

But if I create the router with --host-network=false, the router runs successfully.

# oadm router router --images='openshift3/ose-${component}:${version}' --host-network=false
info: password for stats user admin has been set to 16ir20Yf1h
--> Creating router router ...
    warning: serviceaccounts "router" already exists
    warning: clusterrolebinding "router-router-role" already exists
    deploymentconfig "router" created
    service "router" created
--> Success
# oc get pod
NAME             READY     STATUS    RESTARTS   AGE
router-1-gghdn   1/1       Running   0          1m

So I'm not sure whether there is any difference between [1] creating the router with --host-network=false and [2] updating the router dc to hostNetwork: false and re-deploying it, but both ways worked well before.

Ben, following the same steps mentioned in comment 8, after editing dc/router to change hostNetwork to false, I saw the same router pod CrashLoopBackOff issue in my local env.

Created attachment 1255994 (kube describe router)

According to kubectl describe, the router container has problems executing. Checking other logs.

Could not reproduce this problem from latest source: v3.5.0.32-1+4f84c83-1100. Weibin, can you confirm this?

Retested on OCP 3.5.0.33:

ose-haproxy-router v3.5.0.33 25f705e32e9b

The issue can still be reproduced.

[root@host-8-174-87 ~]# oc get pod -w
NAME             READY     STATUS             RESTARTS   AGE
router-2-fc97k   0/1       Running            1          44s
router-2-fc97k   0/1       Running            2          1m
router-2-fc97k   0/1       Running            3          1m
router-2-fc97k   0/1       Running            4          1m
router-2-fc97k   0/1       CrashLoopBackOff   4

I was testing under an all-in-one setup; I'm checking with a multi-node setup to see if this is part of the problem.

Yan Du: Somehow before it was contacting localhost directly in the liveness checks, but localhost was resolving to the IPv6 address and it was failing (look in /etc/hosts). If you remove the IPv6 address for localhost then there is no problem.
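
To check a node for the condition described above, one could list which addresses /etc/hosts maps to localhost. This is a minimal sketch, assuming the usual hosts-file layout (an address followed by names, with # starting a comment). SAMPLE_HOSTS is a hypothetical example and not the file from the affected node, and the helper names are illustrative only.

```python
import ipaddress
from typing import List

# HYPOTHETICAL sample content; the real /etc/hosts from the affected
# node is not included in this report.
SAMPLE_HOSTS = """\
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
"""


def localhost_addresses(hosts_text: str) -> List[str]:
    """Return every address that the hosts text maps to the name 'localhost'."""
    addresses = []
    for line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        address, *names = line.split()
        if "localhost" in names:
            addresses.append(address)
    return addresses


def ipv6_localhost_entries(hosts_text: str) -> List[str]:
    """Return only the IPv6 addresses mapped to 'localhost'."""
    return [
        address
        for address in localhost_addresses(hosts_text)
        if ipaddress.ip_address(address).version == 6
    ]


if __name__ == "__main__":
    # On a real node one would read the file instead, for example:
    #   hosts_text = open("/etc/hosts").read()
    for address in ipv6_localhost_entries(SAMPLE_HOSTS):
        print("localhost also maps to IPv6 address:", address)
```

If the second helper returns a non-empty list on the node, a liveness check that targets the name localhost may be sent to the IPv6 address, which matches the failure described in the last comment.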