Bug 1692735 - upgrade failed due to [openshift_control_plane : Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered]
Summary: upgrade failed due to [openshift_control_plane : Wait for /apis/servicecatalo...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.10.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.10.z
Assignee: Russell Teague
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-26 10:29 UTC by liujia
Modified: 2019-09-10 23:59 UTC
CC List: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: ServiceCatalog API endpoint not responding before timeout
Consequence: Control plane upgrade fails
Fix: Increase timeout for checking API endpoints
Result: Successful control plane upgrade
Clone Of:
Clones: 1742002
Environment:
Last Closed: 2019-09-10 23:59:06 UTC
Target Upstream Version:
Embargoed:


Attachments:


Links
Github, openshift/openshift-ansible pull 11821, closed: Bug 1742002: Increase wait for aggregated API availability (last updated 2021-02-04 11:53:35 UTC)
Github, openshift/openshift-ansible pull 11828, closed: Bug 1692735: Increase wait for aggregated API availability (last updated 2021-02-04 11:53:35 UTC)
Red Hat Product Errata RHBA-2019:2688 (last updated 2019-09-10 23:59:10 UTC)

Description liujia 2019-03-26 10:29:32 UTC
Description of problem:
Upgrading OCP from v3.9 to v3.10 fails at the following task:

TASK [openshift_control_plane : Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered] *******************************************************************************
FAILED - RETRYING: Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered (30 retries left).
...
FAILED - RETRYING: Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered (1 retries left).
fatal: [ec2-3-81-14-6.compute-1.amazonaws.com]: FAILED! => {"attempts": 30, "changed": true, "cmd": ["oc", "--config=/etc/origin/master/admin.kubeconfig", "get", "--raw", "/apis/servicecatalog.k8s.io/v1beta1"], "delta": "0:00:00.327392", "end": "2019-03-26 05:23:34.081277", "msg": "non-zero return code", "rc": 1, "start": "2019-03-26 05:23:33.753885", "stderr": "Error from server (ServiceUnavailable): the server is currently unable to handle the request", "stderr_lines": ["Error from server (ServiceUnavailable): the server is currently unable to handle the request"], "stdout": "", "stdout_lines": []}
        to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade.retry
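The task is simply polling the aggregated API with the same oc command shown in the fatal error above. A roughly equivalent manual check (a sketch, assuming the default admin kubeconfig path used by the installer) is:

# poll until the master apiserver can proxy to the service catalog aggregated API
until oc --config=/etc/origin/master/admin.kubeconfig get --raw /apis/servicecatalog.k8s.io/v1beta1 >/dev/null; do
    echo "servicecatalog.k8s.io/v1beta1 not ready yet, retrying in 10s"
    sleep 10
done

In this run the endpoint never became ready within the 30 attempts because the service catalog apiserver pod could not reach the kubernetes service IP (see the pod log below).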

============================================
# oc get po -n kube-system
NAME                                      READY     STATUS    RESTARTS   AGE
master-etcd-ip-172-18-9-63.ec2.internal   1/1       Running   0          44m

# oc logs pod/apiserver-fh9dc -n kube-service-catalog
I0326 10:00:44.976313       1 feature_gate.go:184] feature gates: map[OriginatingIdentity:true]
I0326 10:00:44.976630       1 hyperkube.go:188] Service Catalog version v3.9.74 (built 2019-03-20T00:45:15Z)
W0326 10:00:48.759824       1 authentication.go:229] Unable to get configmap/extension-apiserver-authentication in kube-system.  Usually fixed by 'kubectl create rolebinding -n kube-system ROLE_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA'
Error: Get https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 172.30.0.1:443: getsockopt: no route to host

# cat master-api.log |grep servicecatalog|grep fail
Mar 26 04:57:27 ip-172-18-9-63.ec2.internal atomic-openshift-master-api[12272]: E0326 04:57:27.234062   12272 controller.go:111] loading OpenAPI spec for "v1beta1.servicecatalog.k8s.io" failed with: OpenAPI spec does not exists
...
Mar 26 05:30:53 ip-172-18-9-63.ec2.internal atomic-openshift-master-api[47931]: E0326 05:30:53.901635   47931 controller.go:111] loading OpenAPI spec for "v1beta1.servicecatalog.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable


Version-Release number of the following components:
openshift-ansible-3.10.127-1.git.0.131da09.el7.noarch

How reproducible:
always

Steps to Reproduce:
1. Install OCP v3.9 with the service catalog enabled.
2. Enable the v3.10.127 repo on all hosts.
3. Run the upgrade against the above cluster (see the sketch below).
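A sketch of how step 3 is typically run, assuming the default openshift-ansible package layout (the playbook path matches the .retry file shown in the failure output):

# run the 3.10 upgrade playbook against the existing v3.9 cluster
ansible-playbook -i /path/to/inventory \
    /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade.yml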

Actual results:
The upgrade fails.

Expected results:
The upgrade succeeds.

Additional info:
Disabling the service catalog during the fresh install avoids this issue during the upgrade.
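For reference, a sketch of that workaround, assuming the service catalog is toggled by the usual openshift_enable_service_catalog inventory variable:

# in the [OSEv3:vars] section of the inventory, before the fresh v3.9 install
openshift_enable_service_catalog=false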

Comment 2 liujia 2019-03-26 10:41:17 UTC
Adding TestBlocker because we currently cannot upgrade with the service catalog enabled.

Comment 3 Scott Dodson 2019-03-26 12:23:01 UTC
Based on the inability to reach the Kubernetes service IP, this is either a master or a networking issue.

Comment 4 Weihua Meng 2019-03-29 10:23:53 UTC
I upgraded successfully with the service catalog enabled, from v3.9.74 to v3.10.127: 3 masters + 3 nodes + 1 lb, containerized (docker) install on RHEL 7.6 before the upgrade.
VMs on OpenStack.

Comment 5 Matt Traylor 2019-04-02 14:18:12 UTC
I am seeing this as well, upgrading from 3.9 to 3.10 on bare metal, with identical error messages. Is there any information I can provide that would help?

Comment 19 Jesus M. Rodriguez 2019-06-25 18:16:03 UTC
Based on comment #3 and comment #13, as well as the symptoms presented, I believe this is a networking bug. We had a similar issue in 3.11 that was fixed by the following patch: https://github.com/openshift/openshift-ansible/pull/11708, which went into 3.10 only 5 days ago.

Comment 20 Jesus M. Rodriguez 2019-06-25 18:18:08 UTC
See comment #19; I forgot to clear the needinfo flag.

Comment 21 Jesus M. Rodriguez 2019-06-25 18:33:11 UTC
This bug is related to the 3.11 upgrade bug: https://bugzilla.redhat.com/show_bug.cgi?id=1717764

Comment 22 Pablo Alonso Rodriguez 2019-06-26 07:35:16 UTC
Jesus, I agree that this is some kind of networking issue, but I don't see a reason to believe it is the MTU, as it was in the GitHub pull request you referenced. The errors in the comments you cite are too generic.

Comment 42 Phil Cameron 2019-08-08 17:50:40 UTC
https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_sdn/tasks/main.yml lines 29-35
seem to set the SCC. Not sure why this is not working.

Comment 43 Russell Teague 2019-08-09 20:57:03 UTC
I have confirmed the user is added to the privileged SCC correctly.

    "users": [
        "system:admin",
        "system:serviceaccount:openshift-infra:build-controller",
        "system:serviceaccount:openshift-node:sync",
        "system:serviceaccount:openshift-sdn:sdn",
        "system:serviceaccount:management-infra:management-admin",
        "system:serviceaccount:management-infra:inspector-admin"
    ],

Please provide the version of openshift-ansible used and the verbose Ansible logs from a run where the SCC was not properly updated.
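For anyone who wants to re-check this on their own cluster, a hedged example using standard oc commands (the SDN service account name follows the list above):

# list the users currently granted the privileged SCC
oc get scc privileged -o jsonpath='{.users}'
# add the SDN service account if it is missing
oc adm policy add-scc-to-user privileged system:serviceaccount:openshift-sdn:sdn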

Comment 48 Russell Teague 2019-08-12 18:05:37 UTC
PR to increase the wait for aggregated API availability:
https://github.com/openshift/openshift-ansible/pull/11821

Comment 49 Johnny Liu 2019-08-13 04:09:11 UTC
QE hit this issue again when upgrading a cluster from v3.9.94 to v3.10.161 using the openshift-ansible-3.10.162 installer, but it is not 100% reproducible.

Comment 54 liujia 2019-09-04 09:52:02 UTC
Version: openshift-ansible-3.10.169-1.git.0.a62e7aa.el7.noarch

Confirmed that PR 11828 is merged:

# grep -r "retries" roles/openshift_control_plane/tasks/check_master_api_is_ready.yml
  retries: 60
  retries: 60
  retries: 60
  retries: 60
  retries: 60

Steps:
1. Install OCP v3.9.
2. Enable the v3.10.169 repo on all hosts.
3. Run the upgrade against the above cluster.

The upgrade succeeded; verifying the bug.

Comment 56 errata-xmlrpc 2019-09-10 23:59:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2688

