Bug 1784963
| Summary: | oc adm upgrade not working with error: "Unable to retrieve available updates: unexpected HTTP status: 500 Internal Server Error" | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Hugo Cisneiros (Eitch) <hcisneir> |
| Component: | OpenShift Update Service | Assignee: | Lalatendu Mohanty <lmohanty> |
| OpenShift Update Service sub component: | operand | QA Contact: | |
| Status: | CLOSED CURRENTRELEASE | Docs Contact: | |
| Severity: | low | | |
| Priority: | high | CC: | aos-bugs, asadawar, jokerman, lmohanty, mjahangi, paul, scuppett, wking |
| Version: | 4.6 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-04-17 09:24:54 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Hugo Cisneiros (Eitch) 2019-12-18 20:08:10 UTC
Setting Target Release to active development branch (4.4). Clones will be created for fixes, if any, which need to be backported.

Ok, the underlying issue here was a buggy ingress router/gateway in the cluster where Cincinnati runs. It's been fixed by cycling two of the three gateway pods. We definitely need to grow an alert along the lines of "Cincinnati requests served per minute dropped below $THIS_EXPECTED_FLOOR" to notify us when the deployed community can't reach Cincinnati. But dropping severity now that this particular instance is over.

Moving to Lala, since this is something that the updates team is looking at.

Increasing the priority as more customers are seeing this intermittently.

Oops. Resetting priority (my browser had "helpfully" saved the earlier form values). Leaving severity low, because the current ~5m outages every hour do not have a significant impact on clusters, which will retain their cached view of the available updates until the next time or two they poll Cincinnati (which will succeed, since our outages are short, and the CVO polls every two to five minutes or so). The graph changes slowly, so having the local cache in ClusterVersion's status go even ~15 minutes stale instead of the usual ~2-to-5 minutes stale doesn't have much impact. The current situation's short outages are different from the original router issues mentioned in comment 3. That original issue resulted in a sustained outage, which had a much greater impact (denying clusters the ability to populate their available-update cache in the first place, or blocking them from updating an hours-stale cache).

The policy engine container is restarting because of an OOM kill. We will fix that first and verify whether it resolves the issue.

The production Cincinnati is fixed now and the issue should no longer reproduce.

QE did not hit the issue during our test. It looks like a stability/performance issue in the production Cincinnati, which takes many requests to the server to reproduce. So I curled Cincinnati every second for a 60-minute monitoring run:

```
curl -sH 'Accept: application/json' "https://api.openshift.com/api/upgrades_info/v1/graph?channel=stable-4.2&arch=amd64" -o /dev/null -w "status: %{http_code}"
```

According to my log, there was no downtime during those 60 minutes:

```
# cat mo.log | grep "status: 200" | wc -l
3600
# cat mo.log | grep "status" | wc -l
3600
```
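The transcript above shows only the single probe; a minimal sketch of the loop it implies might look like the following (assuming a shell with GNU coreutils; the `mo.log` name and one-second interval come from the comment above):

```
# Probe the graph endpoint once per second for ~60 minutes,
# appending one "status: NNN" line per request to mo.log.
for i in $(seq 1 3600); do
  curl -sH 'Accept: application/json' \
    "https://api.openshift.com/api/upgrades_info/v1/graph?channel=stable-4.2&arch=amd64" \
    -o /dev/null -w "status: %{http_code}\n" >> mo.log
  sleep 1
done
```

Each iteration appends exactly one status line, so the two grep counts above both coming out to 3600 means every response in the window was a 200.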
I'm still experiencing this (same?) issue, though the error is slightly different:

```
$ oc adm upgrade
Error while reconciling 4.2.0: an unknown error has occurred
warning: Cannot display available updates:
  Reason: RemoteFailed
  Message: Unable to retrieve available updates: Get https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.2&id=171257a7-a56b-4d65-bd4d-85ca0cf6c38e&version=4.2.0: dial tcp 18.207.44.243:443: connect: connection timed out
```

The curl test returns 200, however:

```
$ curl -sH 'Accept: application/json' "https://api.openshift.com/api/upgrades_info/v1/graph?channel=stable-4.2&arch=amd64" -o /dev/null -w "status: %{http_code}"
status: 200
```

Also tried:

```
$ oc adm upgrade --allow-upgrade-with-warnings=true
```

... which gives the same warning. Other thoughts?

> Message: Unable to retrieve available updates: Get https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.2&id=171257a7-a56b-4d65-bd4d-85ca0cf6c38e&version=4.2.0: dial tcp 18.207.44.243:443: connect: connection timed out

Probably spin this out into a new bug. My wild guess is that you have a restricted network and a firewall or some such is silently dropping your cluster's connection attempts. But I don't see anything in Telemetry or Insights about your cluster, so it's hard to say without more details. Can you link a must-gather for your cluster when creating your new bug [1]? Also in this space, [2] is in-flight with notes on RetrievedUpdates issues; reasons weren't as granular in 4.2, but your case would fall under RemoteFailed.

[1]: https://docs.openshift.com/container-platform/4.2/support/gathering-cluster-data.html

[2]: https://github.com/openshift/cluster-version-operator/pull/335

This bug does not affect a publicly released version of the product, so marking it as "CLOSED / CURRENTRELEASE".

Moving under OCP, now that we are back to being a sub-component instead of a parallel (in Bugzilla) project.
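For readers who land here with a similar RemoteFailed condition, the triage suggestions in the comments above (attach a must-gather, check the RetrievedUpdates reason) amount to roughly the following; a minimal sketch, assuming oc 4.2 or later, with the destination directory chosen for illustration:

```
# Collect cluster diagnostics to attach to a new bug
# (the --dest-dir path is arbitrary)
oc adm must-gather --dest-dir=./must-gather

# Show the RetrievedUpdates condition that `oc adm upgrade` summarizes,
# including its Reason (e.g. RemoteFailed) and Message
oc get clusterversion version \
  -o jsonpath='{.status.conditions[?(@.type=="RetrievedUpdates")]}'
```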