Bug 1692735 - upgrade failed due to [openshift_control_plane : Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered]
Summary: upgrade failed due to [openshift_control_plane : Wait for /apis/servicecatalo...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.10.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.10.z
Assignee: Russell Teague
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-26 10:29 UTC by liujia
Modified: 2019-09-10 23:59 UTC
CC List: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: ServiceCatalog API endpoint not responding before timeout
Consequence: Control plane upgrade fails
Fix: Increase timeout for checking API endpoints
Result: Successful control plane upgrade
Clone Of:
Clones: 1742002
Environment:
Last Closed: 2019-09-10 23:59:06 UTC
Target Upstream Version:
Embargoed:


Attachments:


Links
Github, openshift/openshift-ansible pull 11821, closed: Bug 1742002: Increase wait for aggregated API availability (last updated 2021-02-04 11:53:35 UTC)
Github, openshift/openshift-ansible pull 11828, closed: Bug 1692735: Increase wait for aggregated API availability (last updated 2021-02-04 11:53:35 UTC)
Red Hat Product Errata RHBA-2019:2688 (last updated 2019-09-10 23:59:10 UTC)

Description liujia 2019-03-26 10:29:32 UTC
Description of problem:
Upgrading OCP from v3.9 to v3.10 fails at the following task:

TASK [openshift_control_plane : Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered] *******************************************************************************
FAILED - RETRYING: Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered (30 retries left).
...
FAILED - RETRYING: Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered (1 retries left).
fatal: [ec2-3-81-14-6.compute-1.amazonaws.com]: FAILED! => {"attempts": 30, "changed": true, "cmd": ["oc", "--config=/etc/origin/master/admin.kubeconfig", "get", "--raw", "/apis/servicecatalog.k8s.io/v1beta1"], "delta": "0:00:00.327392", "end": "2019-03-26 05:23:34.081277", "msg": "non-zero return code", "rc": 1, "start": "2019-03-26 05:23:33.753885", "stderr": "Error from server (ServiceUnavailable): the server is currently unable to handle the request", "stderr_lines": ["Error from server (ServiceUnavailable): the server is currently unable to handle the request"], "stdout": "", "stdout_lines": []}
        to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade.retry
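The task is simply polling the aggregated API with the same oc command shown in the fatal error above. A roughly equivalent manual check (a sketch, assuming the default admin kubeconfig path used by the installer) is:

# poll until the master apiserver can proxy to the service catalog aggregated API
until oc --config=/etc/origin/master/admin.kubeconfig get --raw /apis/servicecatalog.k8s.io/v1beta1 >/dev/null; do
    echo "servicecatalog.k8s.io/v1beta1 not ready yet, retrying in 10s"
    sleep 10
done

In this run the endpoint never became ready within the 30 attempts because the service catalog apiserver pod could not reach the kubernetes service IP (see the pod log below).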

============================================
# oc get po -n kube-system
NAME                                      READY     STATUS    RESTARTS   AGE
master-etcd-ip-172-18-9-63.ec2.internal   1/1       Running   0          44m

# oc logs pod/apiserver-fh9dc -n kube-service-catalog
I0326 10:00:44.976313       1 feature_gate.go:184] feature gates: map[OriginatingIdentity:true]
I0326 10:00:44.976630       1 hyperkube.go:188] Service Catalog version v3.9.74 (built 2019-03-20T00:45:15Z)
W0326 10:00:48.759824       1 authentication.go:229] Unable to get configmap/extension-apiserver-authentication in kube-system.  Usually fixed by 'kubectl create rolebinding -n kube-system ROLE_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA'
Error: Get https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 172.30.0.1:443: getsockopt: no route to host

# cat master-api.log |grep servicecatalog|grep fail
Mar 26 04:57:27 ip-172-18-9-63.ec2.internal atomic-openshift-master-api[12272]: E0326 04:57:27.234062   12272 controller.go:111] loading OpenAPI spec for "v1beta1.servicecatalog.k8s.io" failed with: OpenAPI spec does not exists
...
Mar 26 05:30:53 ip-172-18-9-63.ec2.internal atomic-openshift-master-api[47931]: E0326 05:30:53.901635   47931 controller.go:111] loading OpenAPI spec for "v1beta1.servicecatalog.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable


Version-Release number of the following components:
openshift-ansible-3.10.127-1.git.0.131da09.el7.noarch

How reproducible:
always

Steps to Reproduce:
1. Install OCP v3.9 with the service catalog enabled.
2. Enable the v3.10.127 repo on all hosts.
3. Run the upgrade against the above cluster (see the sketch below).
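A sketch of how step 3 is typically run, assuming the default openshift-ansible package layout (the playbook path matches the .retry file shown in the failure output):

# run the 3.10 upgrade playbook against the existing v3.9 cluster
ansible-playbook -i /path/to/inventory \
    /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade.yml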

Actual results:
The upgrade fails.

Expected results:
The upgrade succeeds.

Additional info:
Disabling the service catalog during the fresh install avoids this issue during the upgrade.
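For reference, a sketch of that workaround, assuming the service catalog is toggled by the usual openshift_enable_service_catalog inventory variable:

# in the [OSEv3:vars] section of the inventory, before the fresh v3.9 install
openshift_enable_service_catalog=false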

Comment 2 liujia 2019-03-26 10:41:17 UTC
Adding TestBlocker because we currently cannot upgrade with the service catalog enabled.

Comment 3 Scott Dodson 2019-03-26 12:23:01 UTC
Based on the inability to reach the Kubernetes service IP, this is either a master or a networking issue.

Comment 4 Weihua Meng 2019-03-29 10:23:53 UTC
I upgraded successfully with the service catalog enabled, from v3.9.74 to v3.10.127: 3 masters + 3 nodes + 1 lb, containerized (docker) install on RHEL 7.6 before the upgrade.
VMs on OpenStack.

Comment 5 Matt Traylor 2019-04-02 14:18:12 UTC
I am seeing this as well, upgrading from 3.9 to 3.10 on bare metal, with identical error messages. Is there any information I can provide that would help?

Comment 19 Jesus M. Rodriguez 2019-06-25 18:16:03 UTC
Based on comment #3 and comment #13, as well as the symptoms presented, I believe this is a networking bug. We had a similar issue in 3.11 that was fixed by the following patch: https://github.com/openshift/openshift-ansible/pull/11708, which went into 3.10 only 5 days ago.

Comment 20 Jesus M. Rodriguez 2019-06-25 18:18:08 UTC
See comment #19; I forgot to clear the needinfo flag.

Comment 21 Jesus M. Rodriguez 2019-06-25 18:33:11 UTC
This bug is related to the 3.11 upgrade bug: https://bugzilla.redhat.com/show_bug.cgi?id=1717764

Comment 22 Pablo Alonso Rodriguez 2019-06-26 07:35:16 UTC
Jesus, I agree that this is some kind of networking issue, but I don't see a reason to believe it is the MTU, as it was in the GitHub pull request you referenced. The errors in the comments you cite are too generic.

Comment 42 Phil Cameron 2019-08-08 17:50:40 UTC
https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_sdn/tasks/main.yml lines 29-35
seem to set the SCC. Not sure why this is not working.

Comment 43 Russell Teague 2019-08-09 20:57:03 UTC
I have confirmed the user is added to the privileged SCC correctly.

    "users": [
        "system:admin",
        "system:serviceaccount:openshift-infra:build-controller",
        "system:serviceaccount:openshift-node:sync",
        "system:serviceaccount:openshift-sdn:sdn",
        "system:serviceaccount:management-infra:management-admin",
        "system:serviceaccount:management-infra:inspector-admin"
    ],

Please provide the version of openshift-ansible used and the verbose Ansible logs from a run where the SCC was not properly updated.
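For anyone who wants to re-check this on their own cluster, a hedged example using standard oc commands (the SDN service account name follows the list above):

# list the users currently granted the privileged SCC
oc get scc privileged -o jsonpath='{.users}'
# add the SDN service account if it is missing
oc adm policy add-scc-to-user privileged system:serviceaccount:openshift-sdn:sdn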

Comment 48 Russell Teague 2019-08-12 18:05:37 UTC
PR to increase the wait for aggregated API availability:
https://github.com/openshift/openshift-ansible/pull/11821

Comment 49 Johnny Liu 2019-08-13 04:09:11 UTC
QE hit this issue again when upgrading a cluster from v3.9.94 to v3.10.161 using the openshift-ansible-3.10.162 installer, but it is not 100% reproducible.

Comment 54 liujia 2019-09-04 09:52:02 UTC
Version: openshift-ansible-3.10.169-1.git.0.a62e7aa.el7.noarch

Confirmed that PR 11828 is merged:

# grep -r "retries" roles/openshift_control_plane/tasks/check_master_api_is_ready.yml
  retries: 60
  retries: 60
  retries: 60
  retries: 60
  retries: 60

Steps:
1. Install OCP v3.9.
2. Enable the v3.10.169 repo on all hosts.
3. Run the upgrade against the above cluster.

The upgrade succeeded; verifying the bug.

Comment 56 errata-xmlrpc 2019-09-10 23:59:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2688

