Bug 1565896
Summary: | openshift-template-service-broker apiserver pod crash after upgrade to OCP 3.9.19 | | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Weihua Meng <wmeng> |
Component: | Service Broker | Assignee: | jkim |
Status: | CLOSED WONTFIX | QA Contact: | Weihua Meng <wmeng> |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 3.9.0 | CC: | aos-bugs, chezhang, fabian, jesusr, jmalde, jmatthew, jokerman, mmccomas, wmeng, wzheng |
Target Milestone: | --- | ||
Target Release: | 3.9.z | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2019-09-25 14:08:26 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Weihua Meng
2018-04-11 02:23:49 UTC
I'm afraid I need to transfer this to the template service broker team, as I'm on Service Catalog. @John please let me know if I've misassigned this.

Hi, John. Could you take a look at this? Thanks.

Fabian, please take a look to help find the issue.

Those instances have been preserved for a week. Do they still need to be preserved?

Yeah, just for one more day. Sorry if this has caused any inconvenience.

It looks like the issue might be that the project node selector is set to "", which is overriding the apiserver daemonset's node selector (region=infra) and causing the apiserver pods to be incorrectly scheduled on compute nodes. Not yet sure if this fixes it, but I posted a WIP PR: https://github.com/openshift/openshift-ansible/pull/8010

I do not think a node selector should, or would, cause pods to go into CrashLoopBackOff.

It does seem like node selector configuration can cause CrashLoopBackOff, judging by https://github.com/kubernetes/kubernetes/issues/16967#issuecomment-298608454. It's odd, because setting the project node selector to "" is actually supposed to avoid this problem, so I'm a little lost as to the actual cause at the moment.

It seems this issue could be related to https://bugzilla.redhat.com/show_bug.cgi?id=1501514#c9

There was a user who reported errors even with the project node selector set to "" (https://github.com/kubernetes/kubernetes/issues/51788#issuecomment-364039589). Judging by the issue linked from that one, it could be caused by an admission controller plugin (https://github.com/kubernetes/kubernetes/issues/61886). Do you know which admission controller plugins are running in this cluster? Is this error reproducible for you? If it happens again, I think it would be useful to get another look at the environment, as there is definitely some weirdness going on. I'll see if I can get someone with more familiarity to weigh in on this.
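The scheduling interaction described above involves two settings: the project-level node selector annotation on the namespace, and the apiserver daemonset's own nodeSelector. A minimal sketch of the two, assuming the values quoted in the comments (the actual manifests from this cluster were not captured, so field values here are illustrative only):

```yaml
# Project node selector: an empty annotation value means "no project-level
# selector", which is supposed to let the daemonset's own nodeSelector apply.
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-template-service-broker
  annotations:
    openshift.io/node-selector: ""
---
# Fragment of the apiserver daemonset pod template carrying the
# region=infra selector mentioned above (all other fields omitted).
spec:
  template:
    spec:
      nodeSelector:
        region: infra
```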
openshift-ansible-3.9.24-1.git.0.d0289ea.el7.noarch
Operating System: Red Hat Enterprise Linux Atomic Host 7.5.0
CPE OS Name: cpe:/o:redhat:enterprise_linux:7.5:GA:atomic-host
Kernel: Linux 3.10.0-862.el7.x86_64
Architecture: x86-64

Before upgrade:

```
# oc version
oc v3.7.44
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://host-8-248-214.host.centralci.eng.rdu2.redhat.com
openshift v3.7.44
kubernetes v1.7.6+a08f5eeb62

# oc get all -n openshift-template-service-broker
NAME                 READY     STATUS    RESTARTS   AGE
po/apiserver-4fc9c   1/1       Running   0          31m
po/apiserver-56wqt   1/1       Running   0          31m
po/apiserver-kmhw5   1/1       Running   0          31m
po/apiserver-l6dt7   1/1       Running   0          31m
po/apiserver-pghzm   1/1       Running   0          31m
```

After upgrade:

```
# oc version
oc v3.9.25
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://host-8-248-214.host.centralci.eng.rdu2.redhat.com
openshift v3.9.25
kubernetes v1.9.1+a0ce1bc657

# oc get pods -o wide -n openshift-template-service-broker
NAME              READY     STATUS             RESTARTS   AGE       IP          NODE
apiserver-9dnf5   0/1       CrashLoopBackOff   6          6m        10.2.6.19   172.16.120.80
apiserver-gxbfd   0/1       CrashLoopBackOff   5          5m        10.2.8.16   172.16.120.73
apiserver-q4d4k   1/1       Running            0          5m        10.2.2.6    172.16.120.16
apiserver-rlxts   1/1       Running            0          6m        10.2.0.4    172.16.120.49
apiserver-tn7mm   1/1       Running            0          6m        10.2.4.6    172.16.120.9

# oc get pods -o yaml -n openshift-template-service-broker | grep image:
    image: registry.reg-aws.openshift.com:443/openshift3/ose-template-service-broker:v3.9.25
```

Log from a crashing pod:

```
# oc logs apiserver-9dnf5 -n openshift-template-service-broker
W0420 03:14:39.653899       1 authentication.go:229] Unable to get configmap/extension-apiserver-authentication in kube-system.  Usually fixed by 'kubectl create rolebinding -n kube-system ROLE_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA'
Error: Get https://172.31.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 172.31.0.1:443: getsockopt: no route to host
Usage:
  template-service-broker [flags]
Flags:
......
F0420 03:14:39.654756       1 tsb.go:41] Get https://172.31.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 172.31.0.1:443: getsockopt: no route to host
```

I totally missed it on my first few passes, but it looks like template_service_broker_selector={"role": "node"} is set in the inventory, which explains why the pods are being scheduled on those nodes, at least. As for the CrashLoopBackOff, the pod is trying to hit 172.31.0.1:443, which I think is the master API. As far as I can tell that IP is only reachable by pods running on the masters, so it looks like in the current state the TSB will only work when running on masters. Reassigning to John Kim to determine whether this is a bug or just missing documentation.
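Given the observation that 172.31.0.1:443 is only reachable from pods on the masters, one workaround would be pointing template_service_broker_selector at a label carried by the master nodes. A sketch of the inventory change; the label role=master is an assumption here and must match whatever labels are actually present on the cluster's masters:

```ini
# openshift-ansible inventory fragment (sketch, not verified on this cluster).
# The cluster in this report used {"role": "node"}, which placed TSB pods on
# compute nodes; the value below assumes masters are labeled role=master.
[OSEv3:vars]
template_service_broker_selector={"role": "master"}
```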