1744297 – [Disruptive] Cluster upgrade should maintain a functioning cluster, "ClusterOperatorDegraded: Cluster operator insights is reporting a failure: Unable to report: gateway server reported unexpected error code: 415"

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Insights Operator
Sub Component:
Version:	4.2.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.2.0
Assignee:	Ivan Necas
QA Contact:	Dmitry Misharov
Docs Contact:	Radek Vokál
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-08-21 18:51 UTC by Miciah Dashiel Butler Masters
Modified:	2019-10-16 06:37 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-10-16 06:37:03 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift insights-operator pull 12	'None'	closed	Bug 1744297: Revert "pkg/controller/status/status: Never go degraded (hack)"	2021-01-29 11:41:37 UTC
Github	openshift insights-operator pull 7	'None'	closed	Bug 1744297: pkg/controller/status/status: Never go degraded (hack)	2021-01-29 11:41:37 UTC
Red Hat Product Errata	RHBA-2019:2922	None	None	None	2019-10-16 06:37:11 UTC

Description Miciah Dashiel Butler Masters 2019-08-21 18:51:51 UTC

"Cluster upgrade should maintain a functioning cluster" failed because the Insights operator reported degraded status: "Unable to report: gateway server reported unexpected error code: 415 (request=4dbe8c44218f43c7ad32be69309fd976): ".

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/5976

Comment 1 Miciah Dashiel Butler Masters 2019-08-21 19:04:53 UTC

From the Insights operator pod logs:

    I0821 17:59:56.637562       1 insightsclient.go:111] Uploading application/vnd.redhat.openshift.periodic to https://cloud.redhat.com/api/ingress/v1/upload
    I0821 18:00:01.266355       1 insightsuploader.go:132] Unable to upload report after 4.62s: gateway server reported unexpected error code: 415 (request=0abffba15e904f23babad5f1b2725ee0):

Comment 2 W. Trevor King 2019-08-21 19:05:30 UTC

Looks like the current code to set Media-Type [1] may be insufficient?  Or https://cloud.redhat.com/api/ingress/v1/upload is being too picky about what it accepts?

[1]: https://github.com/openshift/insights-operator/blob/915a77d65a9862fa2411fac208e5b477e0f57924/pkg/insights/insightsclient/insightsclient.go#L90

Comment 3 W. Trevor King 2019-08-21 19:05:49 UTC

s/Media-Type/Content-Type/

Comment 5 W. Trevor King 2019-08-21 19:11:49 UTC

"Unknown" is a better holding component than "Installer"

Comment 6 W. Trevor King 2019-08-21 19:54:40 UTC

Bumped to Urgent because this is shutting down 4.2 upgrade CI:

https://ci-search-ci-search-next.svc.ci.openshift.org/chart?name=release-.*-upgrade$&search=gateway%20server%20reported%20unexpected%20error%20code:%20415

Comment 7 W. Trevor King 2019-08-21 20:47:54 UTC

Filling in here, the upstream server is complaining with logs like:

  {"level":"error","ts":1566410401.2626693,"caller":"upload/upload.go:76","msg":"Unable to find file or upload parts","error":"multipart: NextPart: EOF","request_id":"0abffba15e904f23babad5f1b2725ee0"}

The timing of the outage roughly corresponds to [1], although we don't understand how that could be leading to the 415s yet.  We're trying to work out the disconnect between the receiving code and the apparently fast uploads from the client:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/5976/artifacts/e2e-aws-upgrade/must-gather/namespaces/openshift-insights/pods/insights-operator-b9466f584-tj2bb/operator/operator/logs/current.log | grep 'Uploading ' | tail -n2
  2019-08-21T18:21:56.106817612Z I0821 18:21:56.106759       1 insightsuploader.go:126] Uploading latest report since 0001-01-01T00:00:00Z
  2019-08-21T18:21:56.106899903Z I0821 18:21:56.106858       1 insightsclient.go:111] Uploading application/vnd.redhat.openshift.periodic to https://cloud.redhat.com/api/ingress/v1/upload

[1]: https://github.com/RedHatInsights/uhc-auth-proxy/commit/400b13527667056e403e96fbb8a97fc825598d9e

Comment 8 W. Trevor King 2019-08-21 21:13:51 UTC

We expect the client to be setting the "file" part at [1], and the server code that chokes on the request and returns the 415 is at [2].

[1]: https://github.com/openshift/insights-operator/blob/915a77d65a9862fa2411fac208e5b477e0f57924/pkg/insights/insightsclient/insightsclient.go#L93
[2]: https://github.com/RedHatInsights/insights-ingress-go/blob/06e05176c610f2b8fe0acb039ad11d0f1765274d/upload/upload.go#L82-L83

Comment 9 W. Trevor King 2019-08-21 21:15:28 UTC

The PR that landed just unblocks CI; it does not fix the underlying problem.

Comment 10 W. Trevor King 2019-08-23 18:53:13 UTC

The underlying problem was a change in Akamai handling that led to payload removal from upload requests smaller than ~8 KiB (for example, see this test [1]).  Jesse Jaggars is continuing to work on resolving the Akamai issue.  Getting that issue resolved so we can revert #7 is still a 4.2 release blocker.

[1]: https://github.com/openshift/insights-operator/pull/9#issuecomment-524419565

Comment 13 W. Trevor King 2019-08-27 16:32:04 UTC

The Akamai config has been fixed, and by 2019-08-27T15:04Z the UploadFailed degradations had all gone away [1].  I've filed [2] to revert the earlier workaround.

[1]: count(cluster_operator_conditions{name="insights",condition="Degraded",reason="UploadFailed"})
[2]: https://github.com/openshift/insights-operator/pull/12

Comment 15 Dmitry Misharov 2019-09-19 08:11:34 UTC

Verified on 4.2.0-0.ci-2019-09-19-043318.
Reports are uploaded correctly.

Comment 16 errata-xmlrpc 2019-10-16 06:37:03 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922
