Bug 1547803

Summary: Upgrade failed at TASK [openshift_service_catalog : wait for api server to be ready] while upgrading to 3.9.0 with service catalog enabled.
Product: OpenShift Container Platform Reporter: Weihua Meng <wmeng>
Component: Service Broker Assignee: Jeff Peeler <jpeeler>
Status: CLOSED CURRENTRELEASE QA Contact: Weihua Meng <wmeng>
Severity: high Docs Contact:
Priority: high    
Version: 3.9.0 CC: aos-bugs, chezhang, jiajliu, jialiu, jmatthew, jokerman, mgugino, mmccomas, smunilla, vrutkovs, wmeng, wsun, wzheng, xtian
Target Milestone: ---   
Target Release: 3.9.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openshift-ansible-3.9.3-1.git.0.e166207.el7 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-06-18 17:48:56 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1540840, 1541247, 1546365, 1555394    

Description Weihua Meng 2018-02-22 00:45:13 UTC
Description of problem:
Upgrade failed at TASK [openshift_service_catalog : wait for api server to be ready]
Upgrade succeeded on a cluster with the service catalog disabled.

Version-Release number of the following components:
openshift-ansible-3.9.0-0.47.0.git.0.f8847bb.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Upgrade OCP 3.7.31 with service_catalog enabled (default) to OCP 3.9
# ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_9/upgrade.yml

Actual results:
TASK [openshift_service_catalog : wait for api server to be ready] ************************************************************************************************************************************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_service_catalog/tasks/start_api_server.yml:11
Using module file /usr/lib/python2.7/site-packages/ansible/modules/net_tools/basics/uri.py
***
<host-xxxx.redhat.com> (0, '\r\n{"status": 500, "content_length": "180", "exception": "  File \\"/tmp/ansible_blHu7b/ansible_module_uri.py\\", line 468, in main\\n    uresp[\'location\'] = absolute_location(url, uresp[\'location\'])\\n", "url": "https://apiserver.kube-service-catalog.svc/healthz", "x_content_type_options": "nosniff", "content": "[+]ping ok\\n[+]poststarthook/generic-apiserver-start-informers ok\\n[+]poststarthook/start-service-catalog-apiserver-informers ok\\n[-]etcd failed: reason withheld\\nhealthz check failed\\n", "failed": true, "connection": "close", "content_type": "text/plain; charset=utf-8", "invocation": {"module_args": {"directory_mode": null, "force": false, "remote_src": null, "status_code": [200], "body_format": "raw", "owner": null, "follow": false, "client_key": null, "group": null, "use_proxy": true, "unsafe_writes": null, "setype": null, "content": null, "serole": null, "follow_redirects": "safe", "return_content": true, "method": "GET", "body": null, "timeout": 30, "src": null, "dest": null, "selevel": null, "force_basic_auth": false, "removes": null, "http_agent": "ansible-httpget", "regexp": null, "url_password": null, "url": "https://apiserver.kube-service-catalog.svc/healthz", "validate_certs": false, "seuser": null, "client_cert": null, "creates": null, "headers": {}, "delimiter": null, "mode": null, "url_username": null, "attributes": null, "backup": null}}, "date": "Thu, 22 Feb 2018 00:01:49 GMT", "redirected": false, "msg": "Status code was not [200]: HTTP Error 500: Internal Server Error"}\r\n', 'Shared connection to host-xxxx.redhat.com closed.\r\n')
The full traceback is:
  File "/tmp/ansible_blHu7b/ansible_module_uri.py", line 468, in main
    uresp['location'] = absolute_location(url, uresp['location'])
fatal: [host-xxxx.redhat.com]: FAILED! => {
    "attempts": 8, 
    "changed": false, 
    "connection": "close", 
    "content": "[+]ping ok\n[+]poststarthook/generic-apiserver-start-informers ok\n[+]poststarthook/start-service-catalog-apiserver-informers ok\n[-]etcd failed: reason withheld\nhealthz check failed\n", 
    "content_length": "180", 
    "content_type": "text/plain; charset=utf-8", 
    "date": "Thu, 22 Feb 2018 00:01:49 GMT", 
    "invocation": {
        "module_args": {
            "attributes": null, 
            "backup": null, 
            "body": null, 
            "body_format": "raw", 
            "client_cert": null, 
            "client_key": null, 
            "content": null, 
            "creates": null, 
            "delimiter": null, 
            "dest": null, 
            "directory_mode": null, 
            "follow": false, 
            "follow_redirects": "safe", 
            "force": false, 
            "force_basic_auth": false, 
            "group": null, 
            "headers": {}, 
            "http_agent": "ansible-httpget", 
            "method": "GET", 
            "mode": null, 
            "owner": null, 
            "regexp": null, 
            "remote_src": null, 
            "removes": null, 
            "return_content": true, 
            "selevel": null, 
            "serole": null, 
            "setype": null, 
            "seuser": null, 
            "src": null, 
            "status_code": [
                200
            ], 
            "timeout": 30, 
            "unsafe_writes": null, 
            "url": "https://apiserver.kube-service-catalog.svc/healthz", 
            "url_password": null, 
            "url_username": null, 
            "use_proxy": true, 
            "validate_certs": false
        }
    }, 
    "msg": "Status code was not [200]: HTTP Error 500: Internal Server Error", 
    "redirected": false, 
    "status": 500, 
    "url": "https://apiserver.kube-service-catalog.svc/healthz", 
    "x_content_type_options": "nosniff"
}
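
For reference, the failing health check can be reproduced manually from a master node. A minimal sketch, assuming cluster DNS resolves the .svc name from the host; -k mirrors validate_certs: false in the module args above:

$ curl -k https://apiserver.kube-service-catalog.svc/healthz
[+]ping ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-service-catalog-apiserver-informers ok
[-]etcd failed: reason withheld
healthz check failed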


Expected results:
Upgrade succeeds

Comment 2 Weihua Meng 2018-02-22 03:43:50 UTC
Upgrading from OCP 3.7 with the service catalog disabled to OCP 3.9 with the service catalog enabled also failed at the same task.

Failure summary:


  1. Hosts:    host-xxxx.redhat.com
     Play:     Upgrade Service Catalog
     Task:     wait for api server to be ready
     Message:  Status code was not [200]: HTTP Error 500: Internal Server Error

Comment 3 Weihua Meng 2018-02-22 06:51:50 UTC
This blocks the service catalog / OpenShift Service Broker / Template Service Broker upgrade tests.
TestBlocker keyword added.

Comment 4 Weihua Meng 2018-02-23 23:37:50 UTC
Any progress on this bug? It blocks QE testing. Thanks.

Comment 5 Vadim Rutkovsky 2018-02-26 16:06:47 UTC
This happens on a clean 3.7 install here as well: the apiserver pod keeps running, but the logs contain a lot of errors:

logging error output: "{\"metadata\":{},\"status\":\"Failure\",\"message\":\"Timeout: request did not complete within 1m0s\",\"reason\":\"Timeout\",\"details\":{},\"code\":504}\n"
 [[service-catalog/v3.7.31 (linux/amd64) kubernetes/09cef4e] 127.0.0.1:47824]
I0226 16:05:38.951677       1 round_trippers.go:436] GET https://127.0.0.1:6443/apis/servicecatalog.k8s.io/v1beta1/serviceinstances?resourceVersion=0 504 Gateway Timeout in 60000 milliseconds
I0226 16:05:38.951694       1 round_trippers.go:442] Response Headers:
I0226 16:05:38.951699       1 round_trippers.go:445]     Date: Mon, 26 Feb 2018 16:05:38 GMT
I0226 16:05:38.951703       1 round_trippers.go:445]     Content-Type: text/plain; charset=utf-8
I0226 16:05:38.951707       1 round_trippers.go:445]     Content-Length: 136
I0226 16:05:38.951729       1 request.go:836] Response Body: {"metadata":{},"status":"Failure","message":"Timeout: request did not complete within 1m0s","reason":"Timeout","details":{},"code":504}
E0226 16:05:38.951767       1 reflector.go:205] github.com/kubernetes-incubator/service-catalog/pkg/client/informers_generated/internalversion/factory.go:61: Failed to list *servicecatalog.ServiceInstance: the server was unable to return a response in the time allotted, but may still be processing the request (get serviceinstances.servicecatalog.k8s.io)

Comment 7 Jeff Peeler 2018-02-27 19:35:37 UTC
To get an overall picture of the state of things, please provide the output of "oc get po --all-namespaces" and the etcd logs. It looks like something is wrong with etcd.
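
A rough sketch of the commands to collect that information (the journalctl unit name is an assumption and may differ, e.g. on containerized etcd installs):

$ oc get po --all-namespaces
$ journalctl -u etcd --no-pager | tail -n 200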

Comment 8 Weihua Meng 2018-02-28 02:14:14 UTC
As more than a week has passed, the environment is already gone.

I am trying to reproduce it again.

Comment 9 Weihua Meng 2018-02-28 05:54:44 UTC
I tried again, this time with the latest v3.7 and v3.9: when I reported the bug v3.7.31 was used, this time v3.7.35, and openshift-ansible-3.9.1-1.git.0.9862628.el7.noarch was used for the upgrade.


before the upgrade
[root@host-172-xxx ~]# oc get pods --all-namespaces
NAMESPACE                           NAME                             READY     STATUS      RESTARTS   AGE
default                             docker-registry-1-9glz8          1/1       Running     0          1h
default                             docker-registry-1-f4s79          1/1       Running     0          1h
default                             registry-console-1-dbb4h         1/1       Running     0          1h
default                             router-1-8lvcv                   1/1       Running     0          1h
default                             router-1-d4xkm                   1/1       Running     0          1h
kube-service-catalog                apiserver-fgq8b                  1/1       Running     0          1h
kube-service-catalog                controller-manager-vcl98         1/1       Running     0          1h
openshift-ansible-service-broker    asb-1-9x5r5                      1/1       Running     2          1h
openshift-ansible-service-broker    asb-etcd-1-zl2kk                 1/1       Running     0          1h
openshift-template-service-broker   apiserver-2c7sf                  1/1       Running     0          1h
openshift-template-service-broker   apiserver-bpz75                  1/1       Running     0          1h
openshift-template-service-broker   apiserver-qf8n4                  1/1       Running     0          1h
wmeng3735                           mongodb-1-l7v9w                  1/1       Running     0          1h
wmeng3735                           nodejs-mongodb-example-1-build   0/1       Completed   0          1h
wmeng3735                           nodejs-mongodb-example-1-wkd9f   1/1       Running     0          1h


after the upgrade failed
[root@host-172-xxx ~]# oc get pods --all-namespaces
NAMESPACE                           NAME                             READY     STATUS             RESTARTS   AGE
default                             docker-registry-1-9glz8          1/1       Running            0          1h
default                             docker-registry-1-f4s79          1/1       Running            0          1h
default                             registry-console-1-dbb4h         1/1       Running            0          1h
default                             router-1-8lvcv                   1/1       Running            0          1h
default                             router-1-d4xkm                   1/1       Running            0          1h
kube-service-catalog                apiserver-fgq8b                  0/1       CrashLoopBackOff   10         1h
kube-service-catalog                controller-manager-vcl98         1/1       Running            2          1h
openshift-ansible-service-broker    asb-1-9x5r5                      1/1       Running            2          1h
openshift-ansible-service-broker    asb-etcd-1-zl2kk                 1/1       Running            0          1h
openshift-template-service-broker   apiserver-2c7sf                  1/1       Running            0          1h
openshift-template-service-broker   apiserver-bpz75                  0/1       Running            2          1h
openshift-template-service-broker   apiserver-qf8n4                  1/1       Running            0          1h
wmeng3735                           mongodb-1-l7v9w                  1/1       Running            0          1h
wmeng3735                           nodejs-mongodb-example-1-build   0/1       Completed          0          1h
wmeng3735                           nodejs-mongodb-example-1-wkd9f   1/1       Running            0          1h
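
A possible way to inspect the crash-looping catalog apiserver (pod name taken from the listing above; --previous shows the logs of the last crashed container):

$ oc describe pod apiserver-fgq8b -n kube-service-catalog
$ oc logs apiserver-fgq8b -n kube-service-catalog --previous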

Comment 14 Jeff Peeler 2018-02-28 18:15:39 UTC
$ oc adm policy who-can get configmap -n kube-system extension-apiserver-authentication

Namespace: kube-system
Verb:      get
Resource:  configmaps

Users:  system:admin
        system:serviceaccount:default:pvinstaller
        system:serviceaccount:kube-service-catalog:service-catalog-apiserver
        system:serviceaccount:kube-system:cloud-provider
        system:serviceaccount:kube-system:clusterrole-aggregation-controller
        system:serviceaccount:kube-system:generic-garbage-collector
        system:serviceaccount:kube-system:namespace-controller
        system:serviceaccount:openshift-infra:build-controller
        system:serviceaccount:openshift-infra:cluster-quota-reconciliation-controller
        system:serviceaccount:openshift-infra:template-instance-controller
        system:serviceaccount:openshift-infra:template-service-broker
        system:serviceaccount:openshift-template-service-broker:apiserver

Groups: system:cluster-admins
        system:cluster-readers
        system:masters

The above tells us that the catalog API server has permission to get the configmap data as needed. However, the logs below show that the pod must have been created before that permission was configured.

$ kubectl logs -n kube-service-catalog -lapp=apiserver -c apiserver
I0228 16:22:15.894361       1 feature_gate.go:156] feature gates: map[OriginatingIdentity:true]
I0228 16:22:15.904169       1 run_server.go:59] Preparing to run API server
I0228 16:22:16.345393       1 round_trippers.go:417] curl -k -v -XGET  -H "User-Agent: service-catalog/v3.7.35 (linux/amd64) kubernetes/e81cd1e" -H "Authorization: Bearer eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlLXNlcnZpY2UtY2F0YWxvZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJzZXJ2aWNlLWNhdGFsb2ctYXBpc2VydmVyLXRva2VuLWhnaHdwIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQubmFtZSI6InNlcnZpY2UtY2F0YWxvZy1hcGlzZXJ2ZXIiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC51aWQiOiJiYjg2OTUzYi0xYzM1LTExZTgtOGMxMC1mYTE2M2UzOTAzMzAiLCJzdWIiOiJzeXN0ZW06c2VydmljZWFjY291bnQ6a3ViZS1zZXJ2aWNlLWNhdGFsb2c6c2VydmljZS1jYXRhbG9nLWFwaXNlcnZlciJ9.SimuWWBtaVm117fw1h1wz5qRrqmg10ggqzJEOoZb_rEpHWyygIVMpcyklyajeT2Uk0uA3_9Mf3bXJLg2_eF8JtkfXjnhc7rMmNL3p1b2BuUxld1N1xZss9repCMBXSb9dFcKkSRV9evqu1C3BEImoO1j6vEirQslsFi1TOKdG9t1cYLwAtaN4gv7jioMyQIfg2C2ItbhwFg5fiyTUZnsPBpUKVQBL4c08NxV9TTIVi05YAb6ZwbOAgo5xmECGUcHds9RlUW8D0LQU4AGHfUvhb7r8R1QYMsinQ5whqHDPXzCoPIPLzfJtxjVhj-pPWvvA3nlKGCSNgFqiARSnG-T6Q" -H "Accept: application/json, */*" https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication
I0228 16:22:19.350847       1 round_trippers.go:436] GET https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication  in 3005 milliseconds
I0228 16:22:19.350871       1 round_trippers.go:442] Response Headers:
W0228 16:22:19.350909       1 authentication.go:231] Unable to get configmap/extension-apiserver-authentication in kube-system.  Usually fixed by 'kubectl create rolebinding -n kube-system ROLE_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA'
Error: Get https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 172.30.0.1:443: getsockopt: no route to host

As a workaround, you can delete the apiserver pod and let it automatically redeploy (a sketch follows below). I will work on fixing this in the installer.
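
A minimal sketch of that workaround, assuming the pods carry the app=apiserver label used in the logs command above:

$ oc delete pod -n kube-service-catalog -l app=apiserver
$ oc get pods -n kube-service-catalog -w    # wait for the pod to be recreated and go Running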

Comment 15 Weihua Meng 2018-02-28 20:42:37 UTC
Thanks for the helpful info.

The issue now is that the automated upgrade failed.

The workaround is feasible for a manual upgrade.

For an automated upgrade with the Ansible playbook, is there a feasible workaround?

Comment 16 Weihua Meng 2018-02-28 20:47:09 UTC
Since we now know the root cause and how to deal with it, we can fix it properly rather than rely on a workaround.

Comment 17 Jeff Peeler 2018-03-01 00:30:24 UTC
I'm not so sure my initial comment is correct. I noticed that etcd was configured on the wrong port before I lost access to the test system. I'm hoping that this PR will fix all the problems: https://github.com/openshift/openshift-ansible/pull/7334
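
One way to check which etcd endpoint the catalog apiserver is actually pointed at (a sketch; assumes the endpoint appears in the pod's command-line flags, e.g. --etcd-servers):

$ oc get pod apiserver-fgq8b -n kube-service-catalog -o yaml | grep -i etcd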

Comment 18 Jeff Peeler 2018-03-05 14:41:12 UTC
Both of these fixes are required:

https://github.com/openshift/openshift-ansible/pull/7382 - etcd port fix in release-3.9 branch.

https://github.com/openshift/openshift-ansible/pull/7362 - NoVolumeNodeConflict scheduler predicate removal
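
To confirm the scheduler change landed on an upgraded cluster, a rough check (the config path is the usual OCP 3.x default and is an assumption here; the grep should return nothing once the predicate has been removed):

$ grep NoVolumeNodeConflict /etc/origin/master/scheduler.json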

Comment 19 Jeff Peeler 2018-03-05 14:54:28 UTC
MODIFIED is the correct state for this bug now.

Comment 20 Johnny Liu 2018-03-07 03:25:15 UTC
The PR is included in the 3.9.3 RPM, so moving it to ON_QA.

Comment 21 Weihua Meng 2018-03-07 09:25:54 UTC
Fixed.
openshift-ansible-3.9.3-1.git.0.e166207.el7.noarch