Bug 1784963

Summary: oc adm upgrade not working with error: "Unable to retrieve available updates: unexpected HTTP status: 500 Internal Server Error"
Product: OpenShift Container Platform
Reporter: Hugo Cisneiros (Eitch) <hcisneir>
Component: OpenShift Update Service
Assignee: Lalatendu Mohanty <lmohanty>
OpenShift Update Service sub component: operand
QA Contact:
Status: CLOSED CURRENTRELEASE
Docs Contact:
Severity: low
Priority: high
CC: aos-bugs, asadawar, jokerman, lmohanty, mjahangi, paul, scuppett, wking
Version: 4.6   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-04-17 09:24:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Hugo Cisneiros (Eitch) 2019-12-18 20:08:10 UTC
Description of problem:

"oc adm upgrade" is not working:

$ oc4 adm upgrade 
Cluster version is 4.1.27

warning: Cannot display available updates:
  Reason: RemoteFailed
  Message: Unable to retrieve available updates: unexpected HTTP status: 504 Gateway Time-out

This is happening with 4.2 too.

There are no relevant logs in the cluster-version-operator.

A quick test:

$ curl -sH 'Accept: application/json'  "https://api.openshift.com/api/upgrades_info/v1/graph?channel=stable-4.2&arch=amd64"
{"kind":"Error","id":"500","href":"/api/gateway/v1/errors/500","code":"GATEWAY-500","reason":"Request couldn't be processed due to an internal error"}

Comment 2 Stephen Cuppett 2019-12-18 20:38:00 UTC
Setting Target Release to the active development branch (4.4). Clones will be created for any fixes that need to be backported.

Comment 3 W. Trevor King 2019-12-18 21:40:10 UTC
Ok, the underlying issue here was a buggy ingress router/gateway in the cluster where Cincinnati runs.  It has been fixed by cycling two of the three gateway pods.  We definitely need to grow an alert along the lines of "Cincinnati requests served per minute dropped below $THIS_EXPECTED_FLOOR" to notify us when the deployed community can't reach Cincinnati.  But I'm dropping severity now that this particular instance is over.

Comment 4 Abhinav Dahiya 2020-02-05 00:09:29 UTC
Moving to Lala, since this is something that the updates team is looking at.

Comment 8 Lalatendu Mohanty 2020-02-12 12:30:00 UTC
Increasing the priority as more customers are seeing this intermittently.

Comment 10 W. Trevor King 2020-02-13 03:39:36 UTC
Oops.  Resetting priority (my browser had "helpfully" saved the earlier form values).  Leaving severity low, because the current ~5-minute outages every hour do not have a significant impact on clusters, which will retain their cached view of the available updates until the next time or two they poll Cincinnati (which will succeed, since our outages are short, and the CVO polls every two to five minutes or so).  The graph changes slowly, so having the local cache in ClusterVersion's status go even ~15 minutes stale instead of the usual ~2-to-5 minutes stale doesn't have much impact.
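
For reference, the cached view lives under .status.availableUpdates on the ClusterVersion object; a minimal sketch for listing it, assuming the standard object name "version":

$ oc get clusterversion version -o jsonpath='{range .status.availableUpdates[*]}{.version}{"\n"}{end}'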

Comment 11 W. Trevor King 2020-02-13 03:41:47 UTC
The current situation's short outages are different from the original router issues mentioned in comment 3.  That original issue resulted in a sustained outage, which had a much greater impact (denying clusters the ability to populate their available-update cache in the first place, or blocking them from updating an hours-stale cache).

Comment 12 Lalatendu Mohanty 2020-02-14 09:51:53 UTC
The policy-engine container is restarting because of an OOM kill. We will fix that first and then verify whether it resolves this issue.
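
One way to confirm the OOM kills on the serving side is to check the last terminated state of the Cincinnati containers; a minimal sketch, assuming access to the cluster hosting Cincinnati and a placeholder namespace <cincinnati-namespace>:

$ oc get pods -n <cincinnati-namespace> \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'

Containers that were OOM killed show "OOMKilled" as the terminated reason.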

Comment 13 Lalatendu Mohanty 2020-02-24 09:21:37 UTC
Production Cincinnati is fixed now and the issue should no longer be reproducible.

Comment 14 liujia 2020-02-25 06:13:27 UTC
QE did not hit the issue during our testing. It looks like a stability/performance issue on production Cincinnati that takes many requests to the server to reproduce, so I curled Cincinnati every 1 second over a 60-minute monitoring window:

curl -sH 'Accept: application/json' "https://api.openshift.com/api/upgrades_info/v1/graph?channel=stable-4.2&arch=amd64" -o /dev/null -w "status: %{http_code}"

According to my log, there was no downtime during the 60 minutes above.

# cat mo.log |grep "status: 200"|wc -l
3600
# cat mo.log |grep "status"|wc -l
3600
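
A minimal sketch of the monitoring loop that produced mo.log (the log name and the 3600 samples come from the grep output above; the trailing "\n" is added so each status lands on its own line):

$ for i in $(seq 1 3600); do
    curl -sH 'Accept: application/json' \
      "https://api.openshift.com/api/upgrades_info/v1/graph?channel=stable-4.2&arch=amd64" \
      -o /dev/null -w "status: %{http_code}\n" >> mo.log
    sleep 1
  done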

Comment 16 Paul Cuciureanu 2020-03-14 21:05:22 UTC
I'm still experiencing this (same?) issue, though the error is slightly different.

$ oc adm upgrade
Error while reconciling 4.2.0: an unknown error has occurred

warning: Cannot display available updates:
  Reason: RemoteFailed
  Message: Unable to retrieve available updates: Get https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.2&id=171257a7-a56b-4d65-bd4d-85ca0cf6c38e&version=4.2.0: dial tcp 18.207.44.243:443: connect: connection timed out



The curl test returns 200, however.

$ curl -sH 'Accept: application/json' "https://api.openshift.com/api/upgrades_info/v1/graph?channel=stable-4.2&arch=amd64" -o /dev/null -w "status: %{http_code}"
status: 200


Also tried:
$ oc adm upgrade --allow-upgrade-with-warnings=true

... which gives the same warning


Other thoughts?

Comment 17 W. Trevor King 2020-03-17 18:04:44 UTC
> Message: Unable to retrieve available updates: Get https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.2&id=171257a7-a56b-4d65-bd4d-85ca0cf6c38e&version=4.2.0: dial tcp 18.207.44.243:443: connect: connection timed out

Probably spin this out into a new bug.  My wild guess is that you have a restricted network and a firewall or some such is silently dropping your cluster's connection attempts.  But I don't see anything in Telemetry or Insights about your cluster, so it's hard to say without more details.  Can you link a must-gather for your cluster when creating your new bug [1]?  Also in this space, [2] is in-flight with notes on RetrievedUpdates issues; reasons weren't as granular in 4.2, but your case would fall under RemoteFailed.

[1]: https://docs.openshift.com/container-platform/4.2/support/gathering-cluster-data.html
[2]: https://github.com/openshift/cluster-version-operator/pull/335
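
If this is egress filtering, one way to narrow it down is to run the same request from a node in the cluster rather than from a workstation; a minimal sketch, assuming a placeholder node name <node>:

$ oc debug node/<node> -- chroot /host \
    curl -sH 'Accept: application/json' \
    "https://api.openshift.com/api/upgrades_info/v1/graph?channel=stable-4.2&arch=amd64" \
    -o /dev/null -w "status: %{http_code}\n"

A connection timeout here, paired with a 200 from outside the cluster, would point at a firewall between the nodes and api.openshift.com.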

Comment 18 Lalatendu Mohanty 2020-04-17 09:24:54 UTC
This bug does not affect a publicly released version of the product, so I am marking it "CLOSED / CURRENTRELEASE".

Comment 19 W. Trevor King 2020-11-20 17:13:46 UTC
Moving under OCP, now that we are back to being a sub-component instead of a parallel (in Bugzilla) project.
