Bug 1565896

Summary: openshift-template-service-broker apiserver pod crash after upgrade to OCP 3.9.19
Product: OpenShift Container Platform
Reporter: Weihua Meng <wmeng>
Component: Service Broker
Assignee: jkim
Status: CLOSED WONTFIX
QA Contact: Weihua Meng <wmeng>
Severity: high
Priority: medium
Version: 3.9.0
CC: aos-bugs, chezhang, fabian, jesusr, jmalde, jmatthew, jokerman, mmccomas, wmeng, wzheng
Target Milestone: ---
Target Release: 3.9.z
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-09-25 14:08:26 UTC
Type: Bug

Description Weihua Meng 2018-04-11 02:23:49 UTC
Description of problem:
openshift-template-service-broker apiserver pod crash after upgrade to OCP 3.9.19

Version-Release number of the following components:
openshift-ansible-3.9.19-1.git.0.34f4090.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Upgrade OCP 3.7.42 to 3.9.19 on Atomic Host (upgrade playbook invocation sketched below)
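
For reference, the control-plane upgrade is normally run from the openshift-ansible host; the exact invocation depends on the inventory, but with the RPM install it is roughly (playbook path is assumed, adjust to the local install):

# ansible-playbook -i <inventory> /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_9/upgrade.yml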

Actual results:
The upgrade playbook succeeds.
2 of the 5 openshift-template-service-broker apiserver pods are in a crash state:
the apiserver pods on the master nodes are running, while the apiserver pods on the compute nodes are in CrashLoopBackOff.

# oc version
oc v3.9.19
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://host-8-240-104.host.centralci.eng.rdu2.redhat.com
openshift v3.9.19
kubernetes v1.9.1+a0ce1bc657
[root@host-172-16-120-4 ~]# oc get nodes
NAME            STATUS    ROLES     AGE       VERSION
172.16.120.10   Ready     master    14h       v1.9.1+a0ce1bc657
172.16.120.15   Ready     compute   14h       v1.9.1+a0ce1bc657
172.16.120.24   Ready     master    14h       v1.9.1+a0ce1bc657
172.16.120.4    Ready     master    14h       v1.9.1+a0ce1bc657
172.16.120.9    Ready     compute   14h       v1.9.1+a0ce1bc657

# oc get pods -o wide -n openshift-template-service-broker
NAME              READY     STATUS             RESTARTS   AGE       IP          NODE
apiserver-2h544   1/1       Running            0          10h       10.2.0.5    172.16.120.4
apiserver-7kzt4   1/1       Running            0          10h       10.2.2.4    172.16.120.10
apiserver-d5zpc   0/1       CrashLoopBackOff   127        10h       10.2.6.18   172.16.120.15
apiserver-lk7c4   1/1       Running            0          10h       10.2.4.4    172.16.120.24
apiserver-p6kjx   0/1       CrashLoopBackOff   127        10h       10.2.8.18   172.16.120.9

# oc logs apiserver-d5zpc
W0411 01:20:23.579940       1 authentication.go:229] Unable to get configmap/extension-apiserver-authentication in kube-system.  Usually fixed by 'kubectl create rolebinding -n kube-system ROLE_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA'
Error: Get https://172.31.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 172.31.0.1:443: getsockopt: no route to host
Usage:
  template-service-broker [flags]
Flags:
      --alsologtostderr                                         log to standard error as well as files
.............
      --vmodule moduleSpec                                      comma-separated list of pattern=N settings for file-filtered logging
F0411 01:20:23.583159       1 tsb.go:41] Get https://172.31.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 172.31.0.1:443: getsockopt: no route to host

    Image:         registry.reg-aws.openshift.com:443/openshift3/ose-template-service-broker:v3.9.19

The other crashed pod shows the same log.
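
For reference, the rolebinding suggested by that warning would be created roughly as below (the service account name "apiserver" in the openshift-template-service-broker namespace is an assumption). Note that the fatal error here is "no route to host", i.e. the pod cannot reach the master API at all, so a missing rolebinding is unlikely to be the actual cause in this case.

# oc create rolebinding extension-apiserver-authentication-reader -n kube-system --role=extension-apiserver-authentication-reader --serviceaccount=openshift-template-service-broker:apiserver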


Expected results:
all pods running after upgrade

Comment 5 Jay Boyd 2018-04-12 02:06:16 UTC
I'm afraid I need to transfer this to the template service broker team - I'm on Service Catalog.  @John please let me know if I've mis-assigned this.

Comment 6 Weihua Meng 2018-04-14 02:55:24 UTC
Hi, John
Could you take a look at this?

Thanks.

Comment 7 John Matthews 2018-04-14 13:23:45 UTC
Fabian, please take a look to help find the issue.

Comment 8 Weihua Meng 2018-04-17 02:53:01 UTC
Those instances have been preserved for a week.
Do they still need to be preserved?

Comment 9 Fabian von Feilitzsch 2018-04-17 04:01:01 UTC
Yeah, just for 1 more day. Sorry if this has caused an inconvenience.

Comment 11 Fabian von Feilitzsch 2018-04-17 17:31:05 UTC
It looks like the issue might be that the project node selector is set to "", which overrides the apiserver daemonset's node selector (region=infra) and causes the apiserver pods to be incorrectly scheduled on compute nodes.
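
A quick way to compare the two selectors on this cluster (the daemonset name "apiserver" is inferred from the pod names; commands are illustrative):

# oc get namespace openshift-template-service-broker -o yaml | grep node-selector
# oc get daemonset apiserver -n openshift-template-service-broker -o yaml | grep -A 2 nodeSelector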

Comment 12 Fabian von Feilitzsch 2018-04-17 21:27:44 UTC
Not yet sure if this fixes it, but posted a WIP PR: https://github.com/openshift/openshift-ansible/pull/8010

Comment 13 Weihua Meng 2018-04-18 01:32:39 UTC
I do not think a node selector should or would cause pods to go into CrashLoopBackOff.

Comment 14 Fabian von Feilitzsch 2018-04-19 15:42:47 UTC
It does seem like node selector configuration can cause CrashLoopBackOff, judging by: https://github.com/kubernetes/kubernetes/issues/16967#issuecomment-298608454

It's odd because setting the project node selector to "" is actually supposed to avoid this problem, so I'm a little lost as to the actual cause at the moment. This issue could be related to https://bugzilla.redhat.com/show_bug.cgi?id=1501514#c9 

There was a user that reported errors even with the project node selector as "" (https://github.com/kubernetes/kubernetes/issues/51788#issuecomment-364039589)

Judging by the issue linked from there, it could be caused by an admission controller plugin (https://github.com/kubernetes/kubernetes/issues/61886). Do you know which admission controller plugins are running in this cluster?
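
On a 3.9 master the configured admission plugins can usually be read from the master config, e.g. (default file path assumed):

# grep -A 20 'admissionConfig' /etc/origin/master/master-config.yaml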

Is this error reproducible for you? If it happens again I think it would be useful to get another look at the environment, as there is definitely some weirdness going on. I'll see if I can get someone with more familiarity to weigh in on this.

Comment 15 Weihua Meng 2018-04-20 03:20:21 UTC
openshift-ansible-3.9.24-1.git.0.d0289ea.el7.noarch

  Operating System: Red Hat Enterprise Linux Atomic Host 7.5.0
       CPE OS Name: cpe:/o:redhat:enterprise_linux:7.5:GA:atomic-host
            Kernel: Linux 3.10.0-862.el7.x86_64
      Architecture: x86-64

before upgrade
# oc version
oc v3.7.44
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://host-8-248-214.host.centralci.eng.rdu2.redhat.com
openshift v3.7.44
kubernetes v1.7.6+a08f5eeb62

# oc get all -n openshift-template-service-broker
NAME                 READY     STATUS    RESTARTS   AGE
po/apiserver-4fc9c   1/1       Running   0          31m
po/apiserver-56wqt   1/1       Running   0          31m
po/apiserver-kmhw5   1/1       Running   0          31m
po/apiserver-l6dt7   1/1       Running   0          31m
po/apiserver-pghzm   1/1       Running   0          31m


after upgrade
# oc version
oc v3.9.25
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://host-8-248-214.host.centralci.eng.rdu2.redhat.com
openshift v3.9.25
kubernetes v1.9.1+a0ce1bc657
# oc get pods -o wide  -n openshift-template-service-broker
NAME              READY     STATUS             RESTARTS   AGE       IP          NODE
apiserver-9dnf5   0/1       CrashLoopBackOff   6          6m        10.2.6.19   172.16.120.80
apiserver-gxbfd   0/1       CrashLoopBackOff   5          5m        10.2.8.16   172.16.120.73
apiserver-q4d4k   1/1       Running            0          5m        10.2.2.6    172.16.120.16
apiserver-rlxts   1/1       Running            0          6m        10.2.0.4    172.16.120.49
apiserver-tn7mm   1/1       Running            0          6m        10.2.4.6    172.16.120.9

# oc get pods -o yaml -n openshift-template-service-broker | grep image:
      image: registry.reg-aws.openshift.com:443/openshift3/ose-template-service-broker:v3.9.25

# oc logs apiserver-9dnf5 -n openshift-template-service-broker
W0420 03:14:39.653899       1 authentication.go:229] Unable to get configmap/extension-apiserver-authentication in kube-system.  Usually fixed by 'kubectl create rolebinding -n kube-system ROLE_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA'
Error: Get https://172.31.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 172.31.0.1:443: getsockopt: no route to host
Usage:
  template-service-broker [flags]
Flags:
......
F0420 03:14:39.654756       1 tsb.go:41] Get https://172.31.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 172.31.0.1:443: getsockopt: no route to host

Comment 21 Fabian von Feilitzsch 2018-04-23 17:37:25 UTC
I totally missed it on my first few passes, but it looks like 

  template_service_broker_selector={"role": "node"}

is set in the inventory, which at least explains why the pods are being scheduled on those nodes.
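
If the intent is to keep the TSB pods on nodes that can reach the master API, the inventory entry would look roughly like the following instead (region=infra is an assumption; the exact label depends on how this cluster's master/infra nodes are labeled):

# in the [OSEv3:vars] section of the inventory
template_service_broker_selector={"region": "infra"}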

As for the CrashLoopBackOff, the pod is trying to reach 172.31.0.1:443, which I think is the master API. As far as I can tell that IP is only reachable from pods running on the masters, so in the current state the TSB will only work when running on masters. 
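
To confirm the target and the reachability problem (commands are illustrative):

# oc get svc kubernetes -n default        <- cluster IP should be 172.31.0.1 here
# curl -k https://172.31.0.1:443/version  <- run from a compute node; expect "no route to host" on the affected nodes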

Reassigning to John Kim to determine whether this is a bug or just missing documentation.