Description of problem:
openshift-template-service-broker apiserver pods crash after upgrade to OCP 3.9.19

Version-Release number of the following components:
openshift-ansible-3.9.19-1.git.0.34f4090.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Upgrade OCP 3.7.42 to 3.9.19 on Atomic Host.

Actual results:
The upgrade playbook succeeds, but 2/5 openshift-template-service-broker apiserver pods are in a crash state: the pods on the masters are running, while the pods on the compute nodes are in CrashLoopBackOff.

# oc version
oc v3.9.19
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://host-8-240-104.host.centralci.eng.rdu2.redhat.com
openshift v3.9.19
kubernetes v1.9.1+a0ce1bc657

[root@host-172-16-120-4 ~]# oc get nodes
NAME            STATUS    ROLES     AGE       VERSION
172.16.120.10   Ready     master    14h       v1.9.1+a0ce1bc657
172.16.120.15   Ready     compute   14h       v1.9.1+a0ce1bc657
172.16.120.24   Ready     master    14h       v1.9.1+a0ce1bc657
172.16.120.4    Ready     master    14h       v1.9.1+a0ce1bc657
172.16.120.9    Ready     compute   14h       v1.9.1+a0ce1bc657

# oc get pods -o wide -n openshift-template-service-broker
NAME              READY     STATUS             RESTARTS   AGE       IP          NODE
apiserver-2h544   1/1       Running            0          10h       10.2.0.5    172.16.120.4
apiserver-7kzt4   1/1       Running            0          10h       10.2.2.4    172.16.120.10
apiserver-d5zpc   0/1       CrashLoopBackOff   127        10h       10.2.6.18   172.16.120.15
apiserver-lk7c4   1/1       Running            0          10h       10.2.4.4    172.16.120.24
apiserver-p6kjx   0/1       CrashLoopBackOff   127        10h       10.2.8.18   172.16.120.9

# oc logs apiserver-d5zpc
W0411 01:20:23.579940       1 authentication.go:229] Unable to get configmap/extension-apiserver-authentication in kube-system. Usually fixed by 'kubectl create rolebinding -n kube-system ROLE_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA'
Error: Get https://172.31.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 172.31.0.1:443: getsockopt: no route to host
Usage:
  template-service-broker [flags]
Flags:
      --alsologtostderr      log to standard error as well as files
      .............
      --vmodule moduleSpec   comma-separated list of pattern=N settings for file-filtered logging
F0411 01:20:23.583159       1 tsb.go:41] Get https://172.31.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 172.31.0.1:443: getsockopt: no route to host

Image: registry.reg-aws.openshift.com:443/openshift3/ose-template-service-broker:v3.9.19

The other crashed pod shows the same log.

Expected results:
All pods running after the upgrade.
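Additional info (a sketch of checks, assuming 172.31.0.1 is the kubernetes service ClusterIP that the pod is failing to reach):

# confirm the kubernetes service ClusterIP and its backing master endpoints
oc get svc kubernetes -n default
oc get endpoints kubernetes -n default

# from one of the affected compute nodes (e.g. 172.16.120.15), check whether the service IP is reachable at all
curl -k https://172.31.0.1:443/healthz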
I'm afraid I need to transfer this to the template service broker team - I'm on Service Catalog. @John, please let me know if I've mis-assigned this.
Hi John, could you take a look at this? Thanks.
Fabian, please take a look and help find the issue.
Those instances have been preserved for a week. Do they still need to be preserved?
Yeah, just for 1 more day. Sorry if this has caused any inconvenience.
It looks like the issue might be that the project node selector is set to "", which overrides the apiserver daemonset's node selector (region=infra) and causes the apiserver pods to be incorrectly scheduled onto compute nodes.
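If that's the case, it should be visible on the cluster; a couple of quick checks (assuming the daemonset is named apiserver, as the pod names suggest):

# project-level node selector ("" overrides the cluster default and allows scheduling on any node)
oc get namespace openshift-template-service-broker -o yaml | grep node-selector

# node selector on the apiserver daemonset itself
oc get daemonset apiserver -n openshift-template-service-broker -o jsonpath='{.spec.template.spec.nodeSelector}'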
Not yet sure if this fixes it, but posted a WIP PR: https://github.com/openshift/openshift-ansible/pull/8010
I do not think a node selector should or would cause pods to go into CrashLoopBackOff.
It does seem like node selector configuration can cause CrashLoopBackOff, judging by https://github.com/kubernetes/kubernetes/issues/16967#issuecomment-298608454. It's odd because setting the project node selector to "" is actually supposed to avoid this problem, so I'm a little lost as to the actual cause at the moment.

This issue could be related to https://bugzilla.redhat.com/show_bug.cgi?id=1501514#c9. There was a user who reported errors even with the project node selector set to "" (https://github.com/kubernetes/kubernetes/issues/51788#issuecomment-364039589). Judging by the issue linked from that one, it could be caused by an admission controller plugin (https://github.com/kubernetes/kubernetes/issues/61886).

Do you know which admission controller plugins are running in this cluster? Is this error reproducible for you? If it happens again, it would be useful to get another look at the environment, as there is definitely some weirdness going on. I'll see if I can get someone with more familiarity to weigh in on this.
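One way to answer the admission plugin question, assuming a default OCP 3.9 layout with the master config under /etc/origin/master (adjust the path if this cluster differs):

# admission plugin configuration on the masters
grep -A 20 'admissionConfig' /etc/origin/master/master-config.yaml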
openshift-ansible-3.9.24-1.git.0.d0289ea.el7.noarch

Operating System: Red Hat Enterprise Linux Atomic Host 7.5.0
CPE OS Name: cpe:/o:redhat:enterprise_linux:7.5:GA:atomic-host
Kernel: Linux 3.10.0-862.el7.x86_64
Architecture: x86-64

Before upgrade:

# oc version
oc v3.7.44
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://host-8-248-214.host.centralci.eng.rdu2.redhat.com
openshift v3.7.44
kubernetes v1.7.6+a08f5eeb62

# oc get all -n openshift-template-service-broker
NAME                 READY     STATUS    RESTARTS   AGE
po/apiserver-4fc9c   1/1       Running   0          31m
po/apiserver-56wqt   1/1       Running   0          31m
po/apiserver-kmhw5   1/1       Running   0          31m
po/apiserver-l6dt7   1/1       Running   0          31m
po/apiserver-pghzm   1/1       Running   0          31m

After upgrade:

# oc version
oc v3.9.25
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://host-8-248-214.host.centralci.eng.rdu2.redhat.com
openshift v3.9.25
kubernetes v1.9.1+a0ce1bc657

# oc get pods -o wide -n openshift-template-service-broker
NAME              READY     STATUS             RESTARTS   AGE       IP          NODE
apiserver-9dnf5   0/1       CrashLoopBackOff   6          6m        10.2.6.19   172.16.120.80
apiserver-gxbfd   0/1       CrashLoopBackOff   5          5m        10.2.8.16   172.16.120.73
apiserver-q4d4k   1/1       Running            0          5m        10.2.2.6    172.16.120.16
apiserver-rlxts   1/1       Running            0          6m        10.2.0.4    172.16.120.49
apiserver-tn7mm   1/1       Running            0          6m        10.2.4.6    172.16.120.9

# oc get pods -o yaml -n openshift-template-service-broker | grep image:
    image: registry.reg-aws.openshift.com:443/openshift3/ose-template-service-broker:v3.9.25

# oc logs apiserver-9dnf5 -n openshift-template-service-broker
W0420 03:14:39.653899       1 authentication.go:229] Unable to get configmap/extension-apiserver-authentication in kube-system. Usually fixed by 'kubectl create rolebinding -n kube-system ROLE_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA'
Error: Get https://172.31.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 172.31.0.1:443: getsockopt: no route to host
Usage:
  template-service-broker [flags]
Flags:
......
F0420 03:14:39.654756       1 tsb.go:41] Get https://172.31.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 172.31.0.1:443: getsockopt: no route to host
I totally missed it on my first few passes, but it looks like template_service_broker_selector={"role": "node"} is set in the inventory, which at least explains why the pods are being scheduled on the compute nodes. As for the CrashLoopBackOff, the pod is trying to reach 172.31.0.1:443, which I think is the master API. As far as I can tell that IP is only reachable from pods running on the masters, so in the current state the TSB will only work when running on masters. Reassigning to John Kim to determine whether this is a bug or just missing documentation.
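If this turns out to be a documentation gap, the likely guidance is to point the selector at nodes that can reach the master API; a sketch of the inventory change (the value {"region": "infra"} is the usual default and may differ on this cluster):

# openshift-ansible inventory, [OSEv3:vars] section - schedule the TSB apiservers on infra/master-reachable nodes
template_service_broker_selector={"region": "infra"}

# to confirm that 172.31.0.1 is indeed the master API service address
oc get svc kubernetes -n default -o jsonpath='{.spec.clusterIP}'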