Bug 1473523

Summary: Got "500 Internal Server Error" when watching bindings and instances of API group servicecatalog.k8s.io
Product: OpenShift Container Platform
Reporter: Jordan Liggitt <jliggitt>
Component: Master
Assignee: Jordan Liggitt <jliggitt>
Status: CLOSED ERRATA
QA Contact: Weihua Meng <wmeng>
Severity: medium
Docs Contact:
Priority: medium
Version: 3.7.0
CC: aos-bugs, chezhang, chuyu, deads, dmace, dma, eparis, ewolinet, jforrest, jliggitt, jokerman, mmccomas, wjiang
Target Milestone: ---
Target Release: 3.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1472148
Environment:
Last Closed: 2017-11-28 22:05:05 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1472148
Bug Blocks:

Description Jordan Liggitt 2017-07-21 05:17:38 UTC
+++ This bug was initially created as a clone of Bug #1472148 +++

Description of problem:
After enabling the service-catalog console, go to the project overview page. Devtools then shows "500 Internal Server Error" for the watch requests on the bindings and instances APIs, and the page shows "Server connection interrupted".

Normal API resources such as BC and IS don't have the problem.


Version-Release number of selected component (if applicable):
# openshift version 
openshift v3.6.151
kubernetes v1.6.1+5115d708d7
etcd 3.2.1


How reproducible:
always

Steps to Reproduce:
1. Install OpenShift with service-catalog and enable the service-catalog console
2. Create a project
3. Go to the project overview page

Actual results:
Got "500 Internal Server Error" in devtools, and the page shows "Server connection interrupted".

Expected results:
Should not get this error.

Additional info:

--- Additional comment from weiwei jiang on 2017-07-18 03:36 EDT ---



--- Additional comment from Jessica Forrester on 2017-07-18 08:30:11 EDT ---

Reassigning since all the other websockets are working in the console and this is specific to the svc catalog websocket connections.

@weiwei can you confirm this server was installed using the ansible installer? Are those websocket connections going to the same hostname as the working websockets?

We will need master logs from during this time to debug and possibly also logs from the svc catalog containers. Will need this to figure out if this is an aggregator problem or a svc catalog problem.

--- Additional comment from Paul Morie on 2017-07-18 14:14:04 EDT ---

Was this cluster created using `oc cluster up` or the installer?

--- Additional comment from weiwei jiang on 2017-07-18 23:17:49 EDT ---

(In reply to Paul Morie from comment #3)
> Was this cluster created using `oc cluster up` or the installer?

The cluster was created by the installer.

I also found some useful logs in the service-catalog apiserver pod after the page got the 500 error:
# oc logs -f apiserver-gtn7l -n kube-service-catalog |grep   E0719
E0719 02:14:54.219751       1 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
E0719 02:14:57.241773       1 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
E0719 02:22:58.341435       1 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
E0719 02:22:59.346418       1 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
E0719 02:31:19.450048       1 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
E0719 02:32:15.484653       1 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
E0719 02:44:33.575782       1 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
E0719 02:47:38.592010       1 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
E0719 02:51:01.724747       1 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
E0719 02:55:26.687264       1 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
E0719 03:07:50.860714       1 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted

--- Additional comment from DeShuai Ma on 2017-07-19 03:23:15 EDT ---

After deprovisioning, the "Provisioned Services" entry stays on the Overview page until the page is manually refreshed, even though the instance has already been removed.

--- Additional comment from Jessica Forrester on 2017-07-19 08:12:21 EDT ---

If the websocket watches are failing then the issue in comment 5 is expected.

--- Additional comment from  on 2017-07-19 17:30:13 EDT ---

I'm able to recreate this locally on the console, however if I ssh into the node and run:

$ oc policy can-i watch bindings --as=eric -n testproject
yes

$ oc policy can-i watch instances --as=eric -n testproject
yes



Where testproject is a newly created project and eric is a user that is an admin for testproject.

--- Additional comment from  on 2017-07-20 09:24:16 EDT ---

When I updated a failed 500 request in devtools to add the "Authorization: Bearer" header with a valid token, I saw the request come back as a 200.

--- Additional comment from  on 2017-07-20 09:26:23 EDT ---

Is this something the installer can configure to happen within the console? If so, what needs to be added where?

--- Additional comment from Jessica Forrester on 2017-07-20 09:41:22 EDT ---

Websockets from the browser cannot use an Authorization bearer header. This should not be needed: the token is being passed via the Sec-WebSocket-Protocol header, and it is working fine against all other endpoints. If this is now failing against aggregated APIs then we have a problem, but we do not see this issue in the oc cluster up environment.
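
For readers unfamiliar with passing a token through Sec-WebSocket-Protocol, here is a minimal sketch of one way it can be done. It assumes the upstream Kubernetes convention of a subprotocol named "base64url.bearer.authorization.k8s.io." followed by the unpadded base64url-encoded token; this report does not state the exact subprotocol the console sent. The helper name bearerSubprotocol, the placeholder token, and the accompanying "base64.binary.k8s.io" entry are illustrative assumptions.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// bearerSubprotocol wraps a bearer token in a WebSocket subprotocol name.
// Assumption: the server understands the upstream Kubernetes prefix below
// and expects unpadded base64url encoding of the token.
func bearerSubprotocol(token string) string {
	const prefix = "base64url.bearer.authorization.k8s.io."
	return prefix + base64.RawURLEncoding.EncodeToString([]byte(token))
}

func main() {
	// "example-token" is a placeholder, not a real credential.
	// The second entry is an illustrative data subprotocol; a client would
	// offer both values in the Sec-WebSocket-Protocol request header.
	protocols := []string{
		bearerSubprotocol("example-token"),
		"base64.binary.k8s.io",
	}
	for _, p := range protocols {
		fmt.Println(p)
	}
}
```

Because the token travels as a subprotocol value, a browser can supply it through the second argument of the WebSocket constructor, which is why no Authorization header is involved.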

--- Additional comment from Eric Paris on 2017-07-20 10:17:59 EDT ---

What needs to be changed in the installer? Tell Scott EXACTLY what flag is set differently, what file needs to contain what, etc. I'm not seeing the root cause here.

Unless I'm mistaken I believe that Eric needs to stand up a cluster with oc cluster up and a cluster with the installer, find the difference between them, and explain what exactly needs to be changed.

--- Additional comment from Jordan Liggitt on 2017-07-20 12:14:36 EDT ---

ansible installs with a caBundle on the service catalog API service, cluster up installs with insecureSkipTLSVerify: true

no other differences leapt out at me

--- Additional comment from Jordan Liggitt on 2017-07-20 15:28:58 EDT ---

changing the APIService config to "insecureSkipTLSVerify: true" resolved the 500

looks like the upgrade path with TLS verification is not handled correctly

--- Additional comment from Jordan Liggitt on 2017-07-20 16:08:13 EDT ---

server is returning this error:

error dialing backend: x509: cannot validate certificate for 172.30.1.2 because it doesn't contain any IP SANs
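
The error text above can be reproduced outside a cluster. The following is a minimal sketch, not the aggregator's code: it creates a throwaway self-signed certificate that carries only a DNS SAN, then checks it against an IP address and against the DNS name. The helper newServingCert and the service name "apiserver.kube-service-catalog.svc" are assumptions for illustration; only the IP 172.30.1.2 comes from the error above.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// newServingCert returns a self-signed certificate whose SANs are DNS names
// only (no IP SANs), similar in shape to a typical service serving cert.
func newServingCert(dnsNames []string) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "example"},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(time.Hour),
		DNSNames:     dnsNames,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

func main() {
	// Hypothetical service DNS name; the real one is not given in this report.
	const svcName = "apiserver.kube-service-catalog.svc"
	cert, err := newServingCert([]string{svcName})
	if err != nil {
		panic(err)
	}
	// Checking against the IP fails because the cert has no IP SANs.
	fmt.Println("by IP:  ", cert.VerifyHostname("172.30.1.2"))
	// Checking against the DNS name succeeds (prints <nil>).
	fmt.Println("by name:", cert.VerifyHostname(svcName))
}
```

This suggests the failing code path was validating the backend certificate against the service IP it dialed rather than against a DNS name present in the certificate, which would explain why the problem appears only when a caBundle is configured and verification is actually performed.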

--- Additional comment from Jordan Liggitt on 2017-07-21 01:15:57 EDT ---

To recreate, ensure the APIService configured for the service catalog contains a caBundle, not insecureSkipTLSVerify: true

kube issue:     https://github.com/kubernetes/kubernetes/issues/49354
kube fix:       https://github.com/kubernetes/kubernetes/pull/49353
origin 3.6 fix: https://github.com/openshift/origin/pull/15388
origin 3.7 fix: https://github.com/openshift/origin/pull/15390

Comment 1 weiwei jiang 2017-09-04 07:59:19 UTC
Will verify this issue after https://bugzilla.redhat.com/show_bug.cgi?id=1486623 is fixed.

Comment 2 Weihua Meng 2017-09-28 02:21:44 UTC
Verified on openshift v3.7.0-0.131.0
Fixed.
Everything is normal on the overview page; no errors show up.

Comment 5 errata-xmlrpc 2017-11-28 22:05:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188