Bug 1784963 - oc adm upgrade not working with error: "Unable to retrieve available updates: unexpected HTTP status: 500 Internal Server Error"
Summary: oc adm upgrade not working with error: "Unable to retrieve available updates:...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OpenShift Update Service
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Lalatendu Mohanty
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-12-18 20:08 UTC by Hugo Cisneiros (Eitch)
Modified: 2023-09-07 21:18 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-17 09:24:54 UTC
Target Upstream Version:
Embargoed:




Links:
Github openshift/cincinnati/commit/c670f369a0ee7b15ddb69c8079789ce56bd26c4b (last updated 2020-09-14 10:06:14 UTC)

Description Hugo Cisneiros (Eitch) 2019-12-18 20:08:10 UTC
Description of problem:

"oc adm upgrade" is not working:

$ oc4 adm upgrade 
Cluster version is 4.1.27

warning: Cannot display available updates:
  Reason: RemoteFailed
  Message: Unable to retrieve available updates: unexpected HTTP status: 504 Gateway Time-out

This is happening with 4.2 too.

There are no relevant logs in the cluster-version-operator.

A quick test:

$ curl -sH 'Accept: application/json'  "https://api.openshift.com/api/upgrades_info/v1/graph?channel=stable-4.2&arch=amd64"
{"kind":"Error","id":"500","href":"/api/gateway/v1/errors/500","code":"GATEWAY-500","reason":"Request couldn't be processed due to an internal error"}

Comment 2 Stephen Cuppett 2019-12-18 20:38:00 UTC
Setting Target Release to the active development branch (4.4). Clones will be created for any fixes that need to be backported.

Comment 3 W. Trevor King 2019-12-18 21:40:10 UTC
Ok, the underlying issue here was a buggy ingress router/gateway in the cluster where Cincinnati runs.  It's been fixed by cycling two of the three gateway pods.  We definitely need to grow an alert along the lines of "Cincinnati requests served per minute dropped below $THIS_EXPECTED_FLOOR" to notify us when the deployed community can't reach Cincinnati.  But I'm dropping the severity now that this particular instance is over.
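
(For illustration only: a rough client-side approximation of that floor check could look like the probe below. The floor value is a placeholder, and a real alert would presumably be driven by the service's own request metrics rather than by an external curl loop.)

# Sketch of an external availability probe; EXPECTED_FLOOR is a placeholder value.
EXPECTED_FLOOR=55
while true; do
  ok=0
  for i in $(seq 60); do
    code=$(curl -s -o /dev/null -w '%{http_code}' \
      "https://api.openshift.com/api/upgrades_info/v1/graph?channel=stable-4.2&arch=amd64")
    [ "$code" = 200 ] && ok=$((ok + 1))
    sleep 1
  done
  if [ "$ok" -lt "$EXPECTED_FLOOR" ]; then
    echo "ALERT: only $ok/60 successful Cincinnati responses in the last minute"
  fi
done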

Comment 4 Abhinav Dahiya 2020-02-05 00:09:29 UTC
Moving to Lala, since this is something that the updates team is looking at.

Comment 8 Lalatendu Mohanty 2020-02-12 12:30:00 UTC
Increasing the priority as more customers are seeing this intermittently.

Comment 10 W. Trevor King 2020-02-13 03:39:36 UTC
Oops.  Resetting priority (my browser had "helpfully" saved the earlier form values).  Leaving severity low, because the current ~5m outages every hour do not have a significant impact on clusters, which will retain their cached view of the available updates until the next time or two they poll Cincinnati (which will succeed, since our outages are short, and the CVO polls every two to five minutes or so).  The graph changes slowly, so having the local cache in ClusterVersion's status go even ~15 minutes stale instead of the usual ~2-to-5 minutes stale doesn't have much impact.
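
(Side note: the cached view lives in the ClusterVersion object itself, so something like the commands below will show what a cluster currently has cached and whether the last retrieval from Cincinnati succeeded. This is a generic sketch, not output from any affected cluster.)

# Cached list of available updates, as last retrieved by the CVO:
$ oc get clusterversion version -o jsonpath='{.status.availableUpdates}'
# Result of the most recent retrieval attempt (the RetrievedUpdates condition):
$ oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="RetrievedUpdates")]}'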

Comment 11 W. Trevor King 2020-02-13 03:41:47 UTC
The current situation's short outages are different from the original router issues mentioned in comment 3.  That original issue resulted in a sustained outage, which had a much greater impact (denying clusters the ability to populate their available-update cache in the first place, or blocking them from updating an hours-stale cache).

Comment 12 Lalatendu Mohanty 2020-02-14 09:51:53 UTC
The policy-engine container is restarting because of an OOM kill. We will fix that first and then verify whether it resolves this issue.
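
(For anyone checking a deployment for this: an OOM-killed container shows up as lastState.terminated.reason=OOMKilled together with a climbing restart count. The namespace below is a placeholder, since the production namespace isn't named in this bug.)

# <cincinnati-namespace> is a placeholder for wherever the service is deployed.
$ oc -n <cincinnati-namespace> get pods
$ oc -n <cincinnati-namespace> get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].restartCount}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'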

Comment 13 Lalatendu Mohanty 2020-02-24 09:21:37 UTC
The production Cincinnati is fixed now and the issue should no longer be reproducible.

Comment 14 liujia 2020-02-25 06:13:27 UTC
QE did not hit the issue during our testing. It looks like a stability/performance issue in the production Cincinnati that takes many requests to the server to reproduce, so I curled Cincinnati every second for 60 minutes of monitoring.

curl -sH 'Accept: application/json' "https://api.openshift.com/api/upgrades_info/v1/graph?channel=stable-4.2&arch=amd64" -o /dev/null -w "status: %{http_code}"

According to my log, there was no downtime during the 60 minutes above.

# cat mo.log |grep "status: 200"|wc -l
3600
# cat mo.log |grep "status"|wc -l
3600
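
(The exact loop isn't shown above; a minimal sketch that would produce a log like mo.log, assuming one "status: NNN" line is appended per request, is:)

# One request per second for 60 minutes, one status line per request.
for i in $(seq 3600); do
  curl -sH 'Accept: application/json' \
    "https://api.openshift.com/api/upgrades_info/v1/graph?channel=stable-4.2&arch=amd64" \
    -o /dev/null -w "status: %{http_code}\n" >> mo.log
  sleep 1
done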

Comment 16 Paul Cuciureanu 2020-03-14 21:05:22 UTC
I'm still experiencing this (same?) issue, though the error is slightly different.

$ oc adm upgrade
Error while reconciling 4.2.0: an unknown error has occurred

warning: Cannot display available updates:
  Reason: RemoteFailed
  Message: Unable to retrieve available updates: Get https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.2&id=171257a7-a56b-4d65-bd4d-85ca0cf6c38e&version=4.2.0: dial tcp 18.207.44.243:443: connect: connection timed out



The curl test returns 200, however.

$ curl -sH 'Accept: application/json' "https://api.openshift.com/api/upgrades_info/v1/graph?channel=stable-4.2&arch=amd64" -o /dev/null -w "status: %{http_code}"
status: 200


Also tried:
$ oc adm upgrade --allow-upgrade-with-warnings=true

... which gives the same warning


Other thoughts?

Comment 17 W. Trevor King 2020-03-17 18:04:44 UTC
> Message: Unable to retrieve available updates: Get https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.2&id=171257a7-a56b-4d65-bd4d-85ca0cf6c38e&version=4.2.0: dial tcp 18.207.44.243:443: connect: connection timed out

Probably spin this out into a new bug.  My wild guess is that you have a restricted network and a firewall or some such is silently dropping your cluster's connection attempts.  But I don't see anything in Telemetry or Insights about your cluster, so it's hard to say without more details.  Can you link a must-gather for your cluster when creating your new bug [1]?  Also in this space, [2] is in-flight with notes on RetrievedUpdates issues; reasons weren't as granular in 4.2, but your case would fall under RemoteFailed.

[1]: https://docs.openshift.com/container-platform/4.2/support/gathering-cluster-data.html
[2]: https://github.com/openshift/cluster-version-operator/pull/335
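
(If it helps while gathering data for the new bug: a quick way to test the same endpoint from a cluster node rather than from a workstation is something like the command below. The node name is a placeholder; with a silently-dropping firewall you would typically see a long hang ending in a connection timeout rather than an HTTP error.)

$ oc debug node/<node-name> -- chroot /host \
    curl -sv --connect-timeout 10 -o /dev/null \
    "https://api.openshift.com/api/upgrades_info/v1/graph?channel=stable-4.2&arch=amd64"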

Comment 18 Lalatendu Mohanty 2020-04-17 09:24:54 UTC
This bug does not affect a publicly released version of the product, so I am marking it as "CLOSED / CURRENTRELEASE".

Comment 19 W. Trevor King 2020-11-20 17:13:46 UTC
Moving under OCP, now that we are back to being a sub-component instead of a parallel (in Bugzilla) project.


