Bug 1547803
Summary: | Upgrade failed at TASK [openshift_service_catalog : wait for api server to be ready] while upgrading to 3.9.0 with service catalog enabled. | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Weihua Meng <wmeng> |
Component: | Service Broker | Assignee: | Jeff Peeler <jpeeler> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Weihua Meng <wmeng> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 3.9.0 | CC: | aos-bugs, chezhang, jiajliu, jialiu, jmatthew, jokerman, mgugino, mmccomas, smunilla, vrutkovs, wmeng, wsun, wzheng, xtian |
Target Milestone: | --- | ||
Target Release: | 3.9.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | openshift-ansible-3.9.3-1.git.0.e166207.el7 | Doc Type: | No Doc Update |
Doc Text: |
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2018-06-18 17:48:56 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1540840, 1541247, 1546365, 1555394 |
Description
Weihua Meng
2018-02-22 00:45:13 UTC
Upgrading from OCP 3.7 with the service catalog disabled to OCP 3.9 with the service catalog enabled also fails at the same task. Failure summary:

```
1. Hosts:    host-xxxx.redhat.com
   Play:     Upgrade Service Catalog
   Task:     wait for api server to be ready
   Message:  Status code was not [200]: HTTP Error 500: Internal Server Error
```

This blocks the service catalog / OpenShift Service Broker / Template Service Broker upgrade tests, so the TestBlocker keyword was added.

Any progress on this bug? It blocks QE testing. Thanks.

This happens here on a clean 3.7 install as well. The apiserver pod keeps running, but the logs contain a lot of errors:

```
logging error output: "{\"metadata\":{},\"status\":\"Failure\",\"message\":\"Timeout: request did not complete within 1m0s\",\"reason\":\"Timeout\",\"details\":{},\"code\":504}\n" [[service-catalog/v3.7.31 (linux/amd64) kubernetes/09cef4e] 127.0.0.1:47824]
I0226 16:05:38.951677       1 round_trippers.go:436] GET https://127.0.0.1:6443/apis/servicecatalog.k8s.io/v1beta1/serviceinstances?resourceVersion=0 504 Gateway Timeout in 60000 milliseconds
I0226 16:05:38.951694       1 round_trippers.go:442] Response Headers:
I0226 16:05:38.951699       1 round_trippers.go:445]     Date: Mon, 26 Feb 2018 16:05:38 GMT
I0226 16:05:38.951703       1 round_trippers.go:445]     Content-Type: text/plain; charset=utf-8
I0226 16:05:38.951707       1 round_trippers.go:445]     Content-Length: 136
I0226 16:05:38.951729       1 request.go:836] Response Body: {"metadata":{},"status":"Failure","message":"Timeout: request did not complete within 1m0s","reason":"Timeout","details":{},"code":504}
E0226 16:05:38.951767       1 reflector.go:205] github.com/kubernetes-incubator/service-catalog/pkg/client/informers_generated/internalversion/factory.go:61: Failed to list *servicecatalog.ServiceInstance: the server was unable to return a response in the time allotted, but may still be processing the request (get serviceinstances.servicecatalog.k8s.io)
```

To get an overall picture of the state of things, please provide `oc get po --all-namespaces` and the etcd logs.
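For context, the failing Ansible task is essentially an HTTP readiness poll against the catalog apiserver. A minimal shell sketch of that kind of poll follows; the function names and the `HTTP_PROBE` override are my own illustration, not code from openshift-ansible:

```shell
# Probe an HTTPS endpoint once and print only the HTTP status code.
# -k is used because the apiserver typically serves a self-signed certificate.
default_probe() { curl -ks -o /dev/null -w '%{http_code}' "$1"; }

# Poll until the endpoint answers 200 or the retries are exhausted.
# HTTP_PROBE can be overridden (e.g. in tests) with any command that
# takes a URL and prints a status code.
wait_for_api() {
  local url=$1 retries=${2:-30} delay=${3:-5} code= i
  for ((i = 1; i <= retries; i++)); do
    code=$("${HTTP_PROBE:-default_probe}" "$url") || true
    [ "$code" = "200" ] && return 0
    sleep "$delay"
  done
  echo "API server not ready after $retries attempts (last status: ${code:-none})" >&2
  return 1
}
```

A poll like `wait_for_api https://127.0.0.1:6443/healthz 120 1` would report failure in the same way the task above did when the apiserver kept answering 500.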
It looks like something is wrong with etcd.

More than a week has passed, so that environment is already gone; I am trying to reproduce. I tried again, this time with the latest v3.7 and v3.9 (when I reported the bug v3.7.31 was used; this time v3.7.35) and with openshift-ansible-3.9.1-1.git.0.9862628.el7.noarch.

Before the upgrade:

```
[root@host-172-xxx ~]# oc get pods --all-namespaces
NAMESPACE                           NAME                             READY   STATUS      RESTARTS   AGE
default                             docker-registry-1-9glz8          1/1     Running     0          1h
default                             docker-registry-1-f4s79          1/1     Running     0          1h
default                             registry-console-1-dbb4h         1/1     Running     0          1h
default                             router-1-8lvcv                   1/1     Running     0          1h
default                             router-1-d4xkm                   1/1     Running     0          1h
kube-service-catalog                apiserver-fgq8b                  1/1     Running     0          1h
kube-service-catalog                controller-manager-vcl98         1/1     Running     0          1h
openshift-ansible-service-broker    asb-1-9x5r5                      1/1     Running     2          1h
openshift-ansible-service-broker    asb-etcd-1-zl2kk                 1/1     Running     0          1h
openshift-template-service-broker   apiserver-2c7sf                  1/1     Running     0          1h
openshift-template-service-broker   apiserver-bpz75                  1/1     Running     0          1h
openshift-template-service-broker   apiserver-qf8n4                  1/1     Running     0          1h
wmeng3735                           mongodb-1-l7v9w                  1/1     Running     0          1h
wmeng3735                           nodejs-mongodb-example-1-build   0/1     Completed   0          1h
wmeng3735                           nodejs-mongodb-example-1-wkd9f   1/1     Running     0          1h
```

After the upgrade failed:

```
[root@host-172-xxx ~]# oc get pods --all-namespaces
NAMESPACE                           NAME                             READY   STATUS             RESTARTS   AGE
default                             docker-registry-1-9glz8          1/1     Running            0          1h
default                             docker-registry-1-f4s79          1/1     Running            0          1h
default                             registry-console-1-dbb4h         1/1     Running            0          1h
default                             router-1-8lvcv                   1/1     Running            0          1h
default                             router-1-d4xkm                   1/1     Running            0          1h
kube-service-catalog                apiserver-fgq8b                  0/1     CrashLoopBackOff   10         1h
kube-service-catalog                controller-manager-vcl98         1/1     Running            2          1h
openshift-ansible-service-broker    asb-1-9x5r5                      1/1     Running            2          1h
openshift-ansible-service-broker    asb-etcd-1-zl2kk                 1/1     Running            0          1h
openshift-template-service-broker   apiserver-2c7sf                  1/1     Running            0          1h
openshift-template-service-broker   apiserver-bpz75                  0/1     Running            2          1h
openshift-template-service-broker   apiserver-qf8n4                  1/1     Running            0          1h
wmeng3735                           mongodb-1-l7v9w                  1/1     Running            0          1h
wmeng3735                           nodejs-mongodb-example-1-build   0/1     Completed          0          1h
wmeng3735                           nodejs-mongodb-example-1-wkd9f   1/1     Running            0          1h
```

```
$ oc adm policy who-can get configmap -n kube-system extension-apiserver-authentication
Namespace: kube-system
Verb:      get
Resource:  configmaps

Users:  system:admin
        system:serviceaccount:default:pvinstaller
        system:serviceaccount:kube-service-catalog:service-catalog-apiserver
        system:serviceaccount:kube-system:cloud-provider
        system:serviceaccount:kube-system:clusterrole-aggregation-controller
        system:serviceaccount:kube-system:generic-garbage-collector
        system:serviceaccount:kube-system:namespace-controller
        system:serviceaccount:openshift-infra:build-controller
        system:serviceaccount:openshift-infra:cluster-quota-reconciliation-controller
        system:serviceaccount:openshift-infra:template-instance-controller
        system:serviceaccount:openshift-infra:template-service-broker
        system:serviceaccount:openshift-template-service-broker:apiserver
Groups: system:cluster-admins
        system:cluster-readers
        system:masters
```

The above tells us that the catalog api server has permission to get the configmap data as needed. However, the logs below show that the pod must have been created before the above was configured.
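When diffing pod listings like the two above, a small filter helps spot the regressed pods. This is a sketch of my own (the column positions assume the standard `oc get pods --all-namespaces` layout shown above: READY in column 3 as `ready/total`, STATUS in column 4):

```shell
# Read "oc get pods --all-namespaces" output on stdin and print the name and
# status of every pod that is not fully ready, ignoring Completed job pods.
unhealthy_pods() {
  awk 'NR > 1 { split($3, r, "/"); if (r[1] != r[2] && $4 != "Completed") print $2, $4 }'
}
```

Against the "after the upgrade failed" listing, a filter like this would surface `apiserver-fgq8b CrashLoopBackOff` and `apiserver-bpz75 Running` (0/1 ready) while skipping the healthy pods and the completed build pod.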
```
$ kubectl logs -n kube-service-catalog -lapp=apiserver -c apiserver
I0228 16:22:15.894361       1 feature_gate.go:156] feature gates: map[OriginatingIdentity:true]
I0228 16:22:15.904169       1 run_server.go:59] Preparing to run API server
I0228 16:22:16.345393       1 round_trippers.go:417] curl -k -v -XGET -H "User-Agent: service-catalog/v3.7.35 (linux/amd64) kubernetes/e81cd1e" -H "Authorization: Bearer eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlLXNlcnZpY2UtY2F0YWxvZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJzZXJ2aWNlLWNhdGFsb2ctYXBpc2VydmVyLXRva2VuLWhnaHdwIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQubmFtZSI6InNlcnZpY2UtY2F0YWxvZy1hcGlzZXJ2ZXIiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC51aWQiOiJiYjg2OTUzYi0xYzM1LTExZTgtOGMxMC1mYTE2M2UzOTAzMzAiLCJzdWIiOiJzeXN0ZW06c2VydmljZWFjY291bnQ6a3ViZS1zZXJ2aWNlLWNhdGFsb2c6c2VydmljZS1jYXRhbG9nLWFwaXNlcnZlciJ9.SimuWWBtaVm117fw1h1wz5qRrqmg10ggqzJEOoZb_rEpHWyygIVMpcyklyajeT2Uk0uA3_9Mf3bXJLg2_eF8JtkfXjnhc7rMmNL3p1b2BuUxld1N1xZss9repCMBXSb9dFcKkSRV9evqu1C3BEImoO1j6vEirQslsFi1TOKdG9t1cYLwAtaN4gv7jioMyQIfg2C2ItbhwFg5fiyTUZnsPBpUKVQBL4c08NxV9TTIVi05YAb6ZwbOAgo5xmECGUcHds9RlUW8D0LQU4AGHfUvhb7r8R1QYMsinQ5whqHDPXzCoPIPLzfJtxjVhj-pPWvvA3nlKGCSNgFqiARSnG-T6Q" -H "Accept: application/json, */*" https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication
I0228 16:22:19.350847       1 round_trippers.go:436] GET https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication in 3005 milliseconds
I0228 16:22:19.350871       1 round_trippers.go:442] Response Headers:
W0228 16:22:19.350909       1 authentication.go:231] Unable to get configmap/extension-apiserver-authentication in kube-system.  Usually fixed by 'kubectl create rolebinding -n kube-system ROLE_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA'
Error: Get https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 172.30.0.1:443: getsockopt: no route to host
```

As a workaround you can delete the apiserver pod and have it automatically redeploy. I will work on fixing this in the installer.

Thanks for the helpful info. The issue now is that the automated upgrade failed. The workaround is feasible for a manual upgrade, but is there a feasible workaround for an automated upgrade with the ansible playbook?

Since we now know the cause and how to address it, we can fix it properly rather than work around it.

I'm not so sure my initial comment is correct. I noticed that etcd was configured on the wrong port before I lost access to the test system. I'm hoping that this PR will fix all the problems: https://github.com/openshift/openshift-ansible/pull/7334

Both of these fixes are required:

https://github.com/openshift/openshift-ansible/pull/7382 - etcd port fix in the release-3.9 branch.
https://github.com/openshift/openshift-ansible/pull/7362 - NoVolumeNodeConflict scheduler predicate removal.

MODIFIED is the correct state to be in now.

The fix is available in the 3.9.3 rpm, so moving to ON_QA.

Fixed in openshift-ansible-3.9.3-1.git.0.e166207.el7.noarch.
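The manual workaround described above (delete the catalog apiserver pod and let its controller recreate it) can be scripted roughly as follows. The `app=apiserver` label comes from the `kubectl logs -lapp=apiserver` command earlier in this bug; the `KUBECTL` override is my own addition so the sketch can be dry-run without a cluster:

```shell
# Delete the stuck service-catalog apiserver pod; its controller will
# recreate it with the (now corrected) RBAC and etcd configuration in place.
# KUBECTL can be overridden, e.g. with `oc`, or with `echo` for a dry run.
redeploy_catalog_apiserver() {
  local kubectl=${KUBECTL:-kubectl}
  $kubectl delete pod -n kube-service-catalog -l app=apiserver
}
```

After running it, something like `oc get pods -n kube-service-catalog -w` can be used to watch the replacement pod come back to `1/1 Running`.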