Description of problem:
Upgrading OCP from v3.9 to v3.10 fails at:

TASK [openshift_control_plane : Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered] *******************************************************************************
FAILED - RETRYING: Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered (30 retries left).
...
FAILED - RETRYING: Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered (1 retries left).
fatal: [ec2-3-81-14-6.compute-1.amazonaws.com]: FAILED! => {"attempts": 30, "changed": true, "cmd": ["oc", "--config=/etc/origin/master/admin.kubeconfig", "get", "--raw", "/apis/servicecatalog.k8s.io/v1beta1"], "delta": "0:00:00.327392", "end": "2019-03-26 05:23:34.081277", "msg": "non-zero return code", "rc": 1, "start": "2019-03-26 05:23:33.753885", "stderr": "Error from server (ServiceUnavailable): the server is currently unable to handle the request", "stderr_lines": ["Error from server (ServiceUnavailable): the server is currently unable to handle the request"], "stdout": "", "stdout_lines": []}
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade.retry

============================================

# oc get po -n kube-system
NAME                                      READY     STATUS    RESTARTS   AGE
master-etcd-ip-172-18-9-63.ec2.internal   1/1       Running   0          44m

# oc logs pod/apiserver-fh9dc -n kube-service-catalog
I0326 10:00:44.976313       1 feature_gate.go:184] feature gates: map[OriginatingIdentity:true]
I0326 10:00:44.976630       1 hyperkube.go:188] Service Catalog version v3.9.74 (built 2019-03-20T00:45:15Z)
W0326 10:00:48.759824       1 authentication.go:229] Unable to get configmap/extension-apiserver-authentication in kube-system. Usually fixed by 'kubectl create rolebinding -n kube-system ROLE_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA'
Error: Get https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 172.30.0.1:443: getsockopt: no route to host

# cat master-api.log | grep servicecatalog | grep fail
Mar 26 04:57:27 ip-172-18-9-63.ec2.internal atomic-openshift-master-api[12272]: E0326 04:57:27.234062 12272 controller.go:111] loading OpenAPI spec for "v1beta1.servicecatalog.k8s.io" failed with: OpenAPI spec does not exists
...
Mar 26 05:30:53 ip-172-18-9-63.ec2.internal atomic-openshift-master-api[47931]: E0326 05:30:53.901635 47931 controller.go:111] loading OpenAPI spec for "v1beta1.servicecatalog.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable

Version-Release number of the following components:
openshift-ansible-3.10.127-1.git.0.131da09.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Install OCP v3.9 with the service catalog enabled
2. Enable the v3.10.127 repo on all hosts
3. Run the upgrade against the above cluster

Actual results:
The upgrade failed.

Expected results:
The upgrade succeeds.

Additional info:
Disabling the service catalog at fresh-install time avoids hitting this issue during the upgrade.
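A few triage commands that may help here (a hedged suggestion, not from the original report; these are standard oc invocations, with the namespace and resource names taken from the output above):

Check whether the aggregated APIService is registered and what availability condition it reports:
# oc --config=/etc/origin/master/admin.kubeconfig get apiservice v1beta1.servicecatalog.k8s.io -o yaml

Check the catalog apiserver pods and recent events in their namespace:
# oc get pods -n kube-service-catalog
# oc get events -n kube-service-catalog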
Adding TestBlocker, since we cannot run the upgrade with the service catalog enabled right now.
Based on the inability to reach the kubernetes service IP, this is either master or networking.
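A hedged sketch of how that reachability can be checked (the address 172.30.0.1:443 comes from the pod log above; standard commands):

From the node hosting the catalog apiserver pod, probe the kubernetes service IP directly:
# curl -k https://172.30.0.1:443/healthz

Confirm the service still resolves to live master endpoints:
# oc get svc kubernetes -n default
# oc get endpoints kubernetes -n default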
I upgraded successfully with the service catalog enabled, from v3.9.74 to v3.10.127: 3 masters + 3 nodes + 1 lb, Docker container install on RHEL 7.6 before the upgrade, VMs on OpenStack.
I am seeing this as well, upgrading from 3.9 to 3.10 on bare metal, with identical error messages. Is there any information I can provide that would help?
Based on comment #3 and comment #13, as well as the symptoms presented, I believe this is a networking bug. We had a similar issue in 3.11, which was fixed by the following patch: https://github.com/openshift/openshift-ansible/pull/11708; that fix went into 3.10 only 5 days ago.
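If the root cause matches that patch, the symptom to look for would be an MTU mismatch between the SDN configuration and the physical interface. A hedged way to compare the two (the node-config path is the OpenShift 3.x default; VXLAN needs roughly 50 bytes of headroom below the interface MTU):

SDN MTU from the node configuration:
# grep -B1 -A1 mtu /etc/origin/node/node-config.yaml

MTUs of the actual interfaces on the host:
# ip -o link show | awk '{print $2, $5}'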
See comment #19, I forgot to clear the needinfo flag.
This bug is related to the 3.11 upgrade bug: https://bugzilla.redhat.com/show_bug.cgi?id=1717764
Jesus, I agree that this is some kind of networking issue, but I don't see a reason to believe it is the MTU, as it was in the GitHub pull request you presented. The errors in the comments you are citing are too generic.
https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_sdn/tasks/main.yml lines 29-35 seem to set the SCC. Not sure why this is not working.
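For reference, the manual equivalent of what those lines are expected to achieve would be roughly the following (a hedged sketch using a standard oc command, not the task itself):

Grant the SDN service account the privileged SCC:
# oc adm policy add-scc-to-user privileged system:serviceaccount:openshift-sdn:sdn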
I have confirmed the user is added to the privileged SCC correctly:

"users": [
    "system:admin",
    "system:serviceaccount:openshift-infra:build-controller",
    "system:serviceaccount:openshift-node:sync",
    "system:serviceaccount:openshift-sdn:sdn",
    "system:serviceaccount:management-infra:management-admin",
    "system:serviceaccount:management-infra:inspector-admin"
],

Please provide the version of openshift-ansible used and the verbose Ansible logs where the SCC was not properly updated.
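The users list above can be dumped directly with a standard oc invocation, for anyone who wants to repeat the check:

# oc get scc privileged -o jsonpath='{.users}'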
PR to increase aggregated API availability check: https://github.com/openshift/openshift-ansible/pull/11821
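In shell terms, the wait that the playbook performs amounts to roughly the following loop (a hedged approximation; the real implementation is an Ansible task with retries, and the count and delay here are illustrative):

for i in $(seq 1 60); do
  oc --config=/etc/origin/master/admin.kubeconfig \
     get --raw /apis/servicecatalog.k8s.io/v1beta1 && break
  sleep 10
done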
QE hit this issue again when upgrading a cluster from v3.9.94 to v3.10.161 using the openshift-ansible-3.10.162 installer, but it is not 100% reproducible.
Version: openshift-ansible-3.10.169-1.git.0.a62e7aa.el7.noarch

Checked that pr11828 is merged:

# grep -r "retries" roles/openshift_control_plane/tasks/check_master_api_is_ready.yml
retries: 60
retries: 60
retries: 60
retries: 60
retries: 60

Steps:
1. Install OCP v3.9
2. Enable the v3.10.169 repo on all hosts
3. Run the upgrade against the above cluster

The upgrade succeeded; verifying the bug.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2688