Bug 1744297 - [Disruptive] Cluster upgrade should maintain a functioning cluster, "ClusterOperatorDegraded: Cluster operator insights is reporting a failure: Unable to report: gateway server reported unexpected error code: 415"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Insights Operator
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.2.0
Assignee: Ivan Necas
QA Contact: Dmitry Misharov
Radek Vokál
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-08-21 18:51 UTC by Miciah Dashiel Butler Masters
Modified: 2019-10-16 06:37 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:37:03 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift insights-operator pull 12 | 0 | 'None' | closed | Bug 1744297: Revert "pkg/controller/status/status: Never go degraded (hack)" | 2021-01-29 11:41:37 UTC
Github | openshift insights-operator pull 7 | 0 | 'None' | closed | Bug 1744297: pkg/controller/status/status: Never go degraded (hack) | 2021-01-29 11:41:37 UTC
Red Hat Product Errata | RHBA-2019:2922 | 0 | None | None | None | 2019-10-16 06:37:11 UTC

Description Miciah Dashiel Butler Masters 2019-08-21 18:51:51 UTC
"Cluster upgrade should maintain a functioning cluster" failed because the Insights operator reported degraded status: "Unable to report: gateway server reported unexpected error code: 415 (request=4dbe8c44218f43c7ad32be69309fd976): ".

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/5976

Comment 1 Miciah Dashiel Butler Masters 2019-08-21 19:04:53 UTC
From the Insights operator pod logs:

    I0821 17:59:56.637562       1 insightsclient.go:111] Uploading application/vnd.redhat.openshift.periodic to https://cloud.redhat.com/api/ingress/v1/upload
    I0821 18:00:01.266355       1 insightsuploader.go:132] Unable to upload report after 4.62s: gateway server reported unexpected error code: 415 (request=0abffba15e904f23babad5f1b2725ee0):

Comment 2 W. Trevor King 2019-08-21 19:05:30 UTC
Looks like the current code to set Media-Type [1] may be insufficient?  Or https://cloud.redhat.com/api/ingress/v1/upload is being too picky about what it accepts?

[1]: https://github.com/openshift/insights-operator/blob/915a77d65a9862fa2411fac208e5b477e0f57924/pkg/insights/insightsclient/insightsclient.go#L90

Comment 3 W. Trevor King 2019-08-21 19:05:49 UTC
s/Media-Type/Content-Type/
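
For reference, a minimal sketch of where Content-Type shows up in an upload like this, assuming the endpoint expects multipart/form-data (as the server log in Comment 7 suggests). The media type is the one from the operator log in Comment 1; the field name, filename, and helper name are hypothetical placeholders, not taken from the operator source.

```python
import uuid

# Media type taken from the operator log line in Comment 1.
PERIODIC_TYPE = "application/vnd.redhat.openshift.periodic"


def build_upload(payload: bytes, field: str = "file", filename: str = "payload.tar.gz"):
    """Return (headers, body) for a single-part multipart/form-data upload.

    The field name and filename defaults are hypothetical placeholders.
    """
    boundary = uuid.uuid4().hex
    # The part carries its own Content-Type, separate from the request's.
    part_head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {PERIODIC_TYPE}\r\n"
        "\r\n"
    ).encode("ascii")
    body = part_head + payload + f"\r\n--{boundary}--\r\n".encode("ascii")
    headers = {
        # The request-level Content-Type names the boundary used in the body.
        "Content-Type": f"multipart/form-data; boundary={boundary}",
        "Content-Length": str(len(body)),
    }
    return headers, body


headers, body = build_upload(b"example archive bytes")
```

There are two Content-Type values in play here: the request header (multipart/form-data plus boundary) and the per-part header carrying the vendor media type. A server can reject with 415 if either is missing or unparseable.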

Comment 5 W. Trevor King 2019-08-21 19:11:49 UTC
"Unknown" is a better holding component than "Installer"

Comment 7 W. Trevor King 2019-08-21 20:47:54 UTC
Filling in here, the upstream server is complaining with logs like:

  {"level":"error","ts":1566410401.2626693,"caller":"upload/upload.go:76","msg":"Unable to find file or upload parts","error":"multipart: NextPart: EOF","request_id":"0abffba15e904f23babad5f1b2725ee0"}

The timing of the outage roughly corresponds to [1], although we don't understand how that could be leading to the 415s yet.  We're trying to work out the disconnect between the receiving code and the apparently fast uploads from the client:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/5976/artifacts/e2e-aws-upgrade/must-gather/namespaces/openshift-insights/pods/insights-operator-b9466f584-tj2bb/operator/operator/logs/current.log | grep 'Uploading ' | tail -n2
  2019-08-21T18:21:56.106817612Z I0821 18:21:56.106759       1 insightsuploader.go:126] Uploading latest report since 0001-01-01T00:00:00Z
  2019-08-21T18:21:56.106899903Z I0821 18:21:56.106858       1 insightsclient.go:111] Uploading application/vnd.redhat.openshift.periodic to https://cloud.redhat.com/api/ingress/v1/upload

[1]: https://github.com/RedHatInsights/uhc-auth-proxy/commit/400b13527667056e403e96fbb8a97fc825598d9e
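
The request ID in the server log above is the same one the operator logged in Comment 1, which is how the two sides were matched up. A rough sketch of that correlation, using only the two log lines quoted in this bug; the helper names and the regular expression are illustrative assumptions, not part of either codebase.

```python
import json
import re

# Client-side line from Comment 1 and server-side line from Comment 7.
CLIENT_LINE = (
    "I0821 18:00:01.266355       1 insightsuploader.go:132] Unable to upload report "
    "after 4.62s: gateway server reported unexpected error code: 415 "
    "(request=0abffba15e904f23babad5f1b2725ee0):"
)
SERVER_LINE = (
    '{"level":"error","ts":1566410401.2626693,"caller":"upload/upload.go:76",'
    '"msg":"Unable to find file or upload parts","error":"multipart: NextPart: EOF",'
    '"request_id":"0abffba15e904f23babad5f1b2725ee0"}'
)

# Hypothetical pattern matching the operator's failure message format.
FAILURE_RE = re.compile(r"unexpected error code: (\d+) \(request=([0-9a-f]+)\)")


def client_failure(line):
    """Return (status_code, request_id) from an operator log line, or None."""
    match = FAILURE_RE.search(line)
    return (int(match.group(1)), match.group(2)) if match else None


def server_errors_for(request_id, server_lines):
    """Return the 'error' field of every JSON server log entry with this request ID."""
    errors = []
    for line in server_lines:
        entry = json.loads(line)
        if entry.get("request_id") == request_id:
            errors.append(entry["error"])
    return errors


status, request_id = client_failure(CLIENT_LINE)
print(status, request_id, server_errors_for(request_id, [SERVER_LINE]))
```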

Comment 9 W. Trevor King 2019-08-21 21:15:28 UTC
The PR that landed just unblocks CI; it does not fix the underlying problem.

Comment 10 W. Trevor King 2019-08-23 18:53:13 UTC
The underlying problem was a change in Akamai handling that led to payload removal from upload requests smaller than ~8 KiB (for example, see this test [1]).  Jesse Jaggars is continuing to work on resolving the Akamai issue.  Getting that issue resolved so we can revert #7 is still a 4.2 release blocker.

[1]: https://github.com/openshift/insights-operator/pull/9#issuecomment-524419565
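
To illustrate why a stripped payload surfaces on the server as "Unable to find file or upload parts": if the headers still announce multipart/form-data but the body arrives empty, a multipart parser finds no parts at all. The sketch below uses Python's standard email parser as a stand-in for the server's multipart reader, which is an assumption for illustration only; the boundary, field name, and filename are hypothetical.

```python
import email

BOUNDARY = "example-boundary"  # hypothetical boundary string

# Request headers survive intact in both cases.
HEADER = f"Content-Type: multipart/form-data; boundary={BOUNDARY}\r\n\r\n".encode("ascii")

# What the client sent: one part carrying the report archive.
INTACT_BODY = (
    f"--{BOUNDARY}\r\n"
    'Content-Disposition: form-data; name="file"; filename="payload.tar.gz"\r\n'
    "Content-Type: application/vnd.redhat.openshift.periodic\r\n"
    "\r\n"
    "example archive bytes\r\n"
    f"--{BOUNDARY}--\r\n"
).encode("ascii")


def count_parts(body: bytes) -> int:
    """Count the multipart parts a parser can find in the given body."""
    message = email.message_from_bytes(HEADER + body)
    # With no start boundary in the body, the parser does not treat the
    # message as multipart, so there are zero parts to read.
    return len(message.get_payload()) if message.is_multipart() else 0


print(count_parts(INTACT_BODY))  # body as sent by the client
print(count_parts(b""))          # body after the payload was removed in transit
```

This also fits the "apparently fast uploads" seen from the client: the client sends a complete request, and the body is lost before the receiving code sees it.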

Comment 13 W. Trevor King 2019-08-27 16:32:04 UTC
The Akamai config has been fixed, and by 2019-08-27T15:04Z the UploadFailed degradations had all gone away [1].  I've filed [2] to revert the earlier workaround.

[1]: count(cluster_operator_conditions{name="insights",condition="Degraded",reason="UploadFailed"})
[2]: https://github.com/openshift/insights-operator/pull/12
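
For anyone wanting to re-run the query in [1], a small sketch of building an instant-query URL for the Prometheus HTTP API (/api/v1/query). The base URL is a placeholder; this bug does not say which Prometheus or telemetry endpoint was queried, and authentication is left out.

```python
from urllib.parse import urlencode

# PromQL expression from [1] above.
QUERY = (
    'count(cluster_operator_conditions'
    '{name="insights",condition="Degraded",reason="UploadFailed"})'
)


def instant_query_url(base_url: str, query: str) -> str:
    """Build a Prometheus instant-query URL with the expression URL-encoded."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


# prometheus.example.com is a hypothetical host.
url = instant_query_url("https://prometheus.example.com", QUERY)
print(url)
```

An empty result for this expression means no cluster is reporting the insights operator as Degraded with reason UploadFailed at evaluation time.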

Comment 15 Dmitry Misharov 2019-09-19 08:11:34 UTC
Verified on 4.2.0-0.ci-2019-09-19-043318.
Reports are uploaded correctly.

Comment 16 errata-xmlrpc 2019-10-16 06:37:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

