Description of problem:

"oc adm upgrade" is not working:

$ oc4 adm upgrade
Cluster version is 4.1.27
warning: Cannot display available updates:
  Reason: RemoteFailed
  Message: Unable to retrieve available updates: unexpected HTTP status: 504 Gateway Time-out

This is happening with 4.2 too. There are no relevant logs in the cluster-version-operator.

A quick test:

$ curl -sH 'Accept: application/json' "https://api.openshift.com/api/upgrades_info/v1/graph?channel=stable-4.2&arch=amd64"
{"kind":"Error","id":"500","href":"/api/gateway/v1/errors/500","code":"GATEWAY-500","reason":"Request couldn't be processed due to an internal error"}
Setting Target Release to the active development branch (4.4). Clones will be created for any fixes that need to be backported.
Ok, the underlying issue here was a buggy ingress router/gateway in the cluster where Cincinnati runs. It's been fixed by cycling two of the three gateway pods. We definitely need to grow an alert along the lines of "Cincinnati requests served per minute dropped below $THIS_EXPECTED_FLOOR" to notify us when the deployed community can't reach Cincinnati. But dropping severity now that this particular instance is over.
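For context, a minimal sketch of the kind of external probe such an alert could be driven by (the one-minute interval and the echo-to-stdout "alert" are placeholders; the real alert would presumably key off the service's own served-request metrics, which aren't shown here):

$ while sleep 60; do
    code=$(curl -so /dev/null -w '%{http_code}' -H 'Accept: application/json' \
      "https://api.openshift.com/api/upgrades_info/v1/graph?channel=stable-4.2&arch=amd64")
    [ "$code" = 200 ] || echo "$(date -u) Cincinnati probe failed: HTTP $code"
  done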
Moving to Lala, since this is something that the updates team is looking at.
Increasing the priority, as more customers are seeing this intermittently.
Oops. Resetting priority (my browser had "helpfully" saved the earlier form values). Leaving severity low, because the current ~5m outages every hour do not have a significant impact on clusters, which will retain their cached view of the available updates until the next time or two they poll Cincinnati (which will succeed, since our outages are short, and the CVO polls every two to five minutes or so). The graph changes slowly, so having the local cache in ClusterVersion's status go even ~15 minutes stale instead of the usual ~2-to-5 minutes stale doesn't have much impact.
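If anyone wants to look at that cached view directly, something like the following works (a sketch; it assumes jq on the workstation and the default ClusterVersion object name "version"):

$ oc get clusterversion version -o json \
    | jq '{retrievedUpdates: [.status.conditions[] | select(.type == "RetrievedUpdates")], availableUpdates: .status.availableUpdates}'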
The current situation's short outages are different from the original router issue mentioned in comment 3. That original issue resulted in a sustained outage, which had a much greater impact (denying clusters the ability to populate their available-update cache in the first place, or blocking them from refreshing an hours-stale cache).
The policy-engine container is restarting because of OOM kills. We will fix that first and verify whether it resolves the issue.
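For the record, a generic way to spot the OOM-killed pods on the service side (a sketch; <cincinnati-namespace> is a placeholder for wherever the deployment lives, and jq is assumed):

$ oc get pods -n <cincinnati-namespace> -o json \
    | jq -r '.items[] | select(.status.containerStatuses[]?.lastState.terminated.reason == "OOMKilled") | .metadata.name'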
The production Cincinnati is fixed now, and the issue should no longer be reproducible.
QE did not hit the issue during our testing. It looks like a stability/performance issue in production Cincinnati that only shows up under many repeated requests to the server, so I curled Cincinnati every second for 60 minutes of monitoring:

$ curl -sH 'Accept: application/json' "https://api.openshift.com/api/upgrades_info/v1/graph?channel=stable-4.2&arch=amd64" -o /dev/null -w "status: %{http_code}"

According to my log, there was no downtime during those 60 minutes:

# cat mo.log | grep "status: 200" | wc -l
3600
# cat mo.log | grep "status" | wc -l
3600
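A sketch of the loop that produces a log like the one above (the exact script wasn't attached, so this is a reconstruction):

$ for i in $(seq 1 3600); do
    curl -sH 'Accept: application/json' \
      "https://api.openshift.com/api/upgrades_info/v1/graph?channel=stable-4.2&arch=amd64" \
      -o /dev/null -w "status: %{http_code}\n" >> mo.log
    sleep 1
  done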
I'm still experiencing this (same?) issue, though the error is slightly different.

$ oc adm upgrade
Error while reconciling 4.2.0: an unknown error has occurred
warning: Cannot display available updates:
  Reason: RemoteFailed
  Message: Unable to retrieve available updates: Get https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.2&id=171257a7-a56b-4d65-bd4d-85ca0cf6c38e&version=4.2.0: dial tcp 18.207.44.243:443: connect: connection timed out

The curl test returns 200, however.

$ curl -sH 'Accept: application/json' "https://api.openshift.com/api/upgrades_info/v1/graph?channel=stable-4.2&arch=amd64" -o /dev/null -w "status: %{http_code}"
status: 200

I also tried:

$ oc adm upgrade --allow-upgrade-with-warnings=true

...which gives the same warning. Other thoughts?
> Message: Unable to retrieve available updates: Get https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.2&id=171257a7-a56b-4d65-bd4d-85ca0cf6c38e&version=4.2.0: dial tcp 18.207.44.243:443: connect: connection timed out

Probably spin this out into a new bug. My wild guess is that you have a restricted network and a firewall or some such is silently dropping your cluster's connection attempts. But I don't see anything in Telemetry or Insights about your cluster, so it's hard to say without more details. Can you link a must-gather for your cluster when creating your new bug [1]?

Also in this space, [2] is in flight with notes on RetrievedUpdates issues; the reasons weren't as granular in 4.2, but your case would fall under RemoteFailed.

[1]: https://docs.openshift.com/container-platform/4.2/support/gathering-cluster-data.html
[2]: https://github.com/openshift/cluster-version-operator/pull/335
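For the connection-timed-out symptom specifically, two things that would help narrow it down (hedged examples; <any-node> is a placeholder for one of your node names, and curl is assumed to be available on the node):

$ oc adm must-gather
$ oc debug node/<any-node> -- chroot /host \
    curl -s -o /dev/null -w '%{http_code}\n' -H 'Accept: application/json' \
    "https://api.openshift.com/api/upgrades_info/v1/graph?channel=stable-4.2&arch=amd64"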
This bug does not affect a publicly released version of the product, so I am marking it CLOSED / CURRENTRELEASE.
Moving under OCP, now that we are back to being a sub-component instead of a parallel project (in Bugzilla).