Bug 1744297

Summary:	[Disruptive] Cluster upgrade should maintain a functioning cluster, "ClusterOperatorDegraded: Cluster operator insights is reporting a failure: Unable to report: gateway server reported unexpected error code: 415"
Product:	OpenShift Container Platform	Reporter:	Miciah Dashiel Butler Masters <mmasters>
Component:	Insights Operator	Assignee:	Ivan Necas <inecas>
Status:	CLOSED ERRATA	QA Contact:	Dmitry Misharov <dmisharo>
Severity:	urgent	Docs Contact:	Radek Vokál <rvokal>
Priority:	urgent
Version:	4.2.0	CC:	aos-bugs, dmisharo, eparis, jjaggars, jokerman, mfojtik, rvokal, wking
Target Milestone:	---
Target Release:	4.2.0
Hardware:	Unspecified
OS:	Unspecified
Last Closed:	2019-10-16 06:37:03 UTC	Type:	Bug

Description Miciah Dashiel Butler Masters 2019-08-21 18:51:51 UTC

"Cluster upgrade should maintain a functioning cluster" failed because the Insights operator reported degraded status: "Unable to report: gateway server reported unexpected error code: 415 (request=4dbe8c44218f43c7ad32be69309fd976): ".

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/5976

Comment 1 Miciah Dashiel Butler Masters 2019-08-21 19:04:53 UTC

From the Insights operator pod logs:

    I0821 17:59:56.637562       1 insightsclient.go:111] Uploading application/vnd.redhat.openshift.periodic to https://cloud.redhat.com/api/ingress/v1/upload
    I0821 18:00:01.266355       1 insightsuploader.go:132] Unable to upload report after 4.62s: gateway server reported unexpected error code: 415 (request=0abffba15e904f23babad5f1b2725ee0):

Comment 2 W. Trevor King 2019-08-21 19:05:30 UTC

Looks like the current code to set Media-Type [1] may be insufficient?  Or https://cloud.redhat.com/api/ingress/v1/upload is being too picky about what it accepts?

[1]: https://github.com/openshift/insights-operator/blob/915a77d65a9862fa2411fac208e5b477e0f57924/pkg/insights/insightsclient/insightsclient.go#L90

Comment 3 W. Trevor King 2019-08-21 19:05:49 UTC

s/Media-Type/Content-Type/
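
For anyone following the Content-Type question: a multipart upload carries a Content-Type in two places, the request header (multipart/form-data plus a boundary) and the header of each part. Below is a minimal sketch of that shape using only the Go standard library. It is not the operator's actual code; the buildUpload helper and the payload.tar.gz filename are hypothetical, and the "file" field name is an assumption about what the server expects.

```go
package main

import (
	"bytes"
	"fmt"
	"mime/multipart"
	"net/http"
	"net/textproto"
)

// buildUpload (hypothetical helper) wraps payload in a multipart/form-data
// body with a single part named "file" whose own Content-Type is partType.
func buildUpload(url, partType string, payload []byte) (*http.Request, error) {
	var body bytes.Buffer
	mw := multipart.NewWriter(&body)

	// Part-level headers: the form field name and the media type of the payload.
	h := make(textproto.MIMEHeader)
	h.Set("Content-Disposition", `form-data; name="file"; filename="payload.tar.gz"`)
	h.Set("Content-Type", partType)

	part, err := mw.CreatePart(h)
	if err != nil {
		return nil, err
	}
	if _, err := part.Write(payload); err != nil {
		return nil, err
	}
	// Close writes the terminating boundary; without it the body is truncated.
	if err := mw.Close(); err != nil {
		return nil, err
	}

	req, err := http.NewRequest(http.MethodPost, url, &body)
	if err != nil {
		return nil, err
	}
	// Request-level header: multipart/form-data with the generated boundary.
	req.Header.Set("Content-Type", mw.FormDataContentType())
	return req, nil
}

func main() {
	req, err := buildUpload(
		"https://cloud.redhat.com/api/ingress/v1/upload",
		"application/vnd.redhat.openshift.periodic",
		[]byte("example payload"),
	)
	if err != nil {
		panic(err)
	}
	// The request is only built here, never sent.
	fmt.Println(req.Header.Get("Content-Type"), req.ContentLength)
}
```

If either header is missing, or the body never reaches the server, a receiver that parses multipart forms has nothing to match against the boundary.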

Comment 5 W. Trevor King 2019-08-21 19:11:49 UTC

"Unknown" is a better holding component than "Installer"

Comment 6 W. Trevor King 2019-08-21 19:54:40 UTC

Bumped to Urgent because this is shutting down 4.2 upgrade CI:

https://ci-search-ci-search-next.svc.ci.openshift.org/chart?name=release-.*-upgrade$&search=gateway%20server%20reported%20unexpected%20error%20code:%20415

Comment 7 W. Trevor King 2019-08-21 20:47:54 UTC

Filling in here, the upstream server is complaining with logs like:

  {"level":"error","ts":1566410401.2626693,"caller":"upload/upload.go:76","msg":"Unable to find file or upload parts","error":"multipart: NextPart: EOF","request_id":"0abffba15e904f23babad5f1b2725ee0"}

The timing of the outage roughly corresponds to [1], although we don't understand how that could be leading to the 415s yet.  We're trying to work out the disconnect between the receiving code and the apparently fast uploads from the client:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/5976/artifacts/e2e-aws-upgrade/must-gather/namespaces/openshift-insights/pods/insights-operator-b9466f584-tj2bb/operator/operator/logs/current.log | grep 'Uploading ' | tail -n2
  2019-08-21T18:21:56.106817612Z I0821 18:21:56.106759       1 insightsuploader.go:126] Uploading latest report since 0001-01-01T00:00:00Z
  2019-08-21T18:21:56.106899903Z I0821 18:21:56.106858       1 insightsclient.go:111] Uploading application/vnd.redhat.openshift.periodic to https://cloud.redhat.com/api/ingress/v1/upload

[1]: https://github.com/RedHatInsights/uhc-auth-proxy/commit/400b13527667056e403e96fbb8a97fc825598d9e
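
The "multipart: NextPart: EOF" text matches what Go's mime/multipart reader returns when a request advertises a multipart boundary but the body ends before any part is found. A minimal sketch that reproduces the message, assuming the server parses uploads with the standard library (the firstPartError helper and the boundary strings are made up for illustration):

```go
package main

import (
	"fmt"
	"mime/multipart"
	"strings"
)

// firstPartError (hypothetical helper) tries to read the first part of a
// multipart body and returns whatever error the reader reports.
func firstPartError(body, boundary string) error {
	_, err := multipart.NewReader(strings.NewReader(body), boundary).NextPart()
	return err
}

func main() {
	// An empty body with a valid boundary: the headers promised parts,
	// but the reader hits end-of-input before the first delimiter.
	fmt.Println(firstPartError("", "example-boundary"))
	// prints "multipart: NextPart: EOF"
}
```

So the server-side error is consistent with the request headers arriving intact while the payload itself is empty or cut short.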

Comment 8 W. Trevor King 2019-08-21 21:13:51 UTC

We expect the client to be setting the "file" form field at [1]; the server code that chokes on the request and returns the 415 is [2].

[1]: https://github.com/openshift/insights-operator/blob/915a77d65a9862fa2411fac208e5b477e0f57924/pkg/insights/insightsclient/insightsclient.go#L93
[2]: https://github.com/RedHatInsights/insights-ingress-go/blob/06e05176c610f2b8fe0acb039ad11d0f1765274d/upload/upload.go#L82-L83

Comment 9 W. Trevor King 2019-08-21 21:15:28 UTC

The PR that landed just unblocks CI; it does not fix the underlying problem.

Comment 10 W. Trevor King 2019-08-23 18:53:13 UTC

The underlying problem was a change in Akamai handling that led to payload removal from upload requests smaller than ~8 KiB (for example, see this test [1]).  Jesse Jaggars is continuing to work on resolving the Akamai issue.  Getting that issue resolved so we can revert #7 is still a 4.2 release blocker.

[1]: https://github.com/openshift/insights-operator/pull/9#issuecomment-524419565

Comment 13 W. Trevor King 2019-08-27 16:32:04 UTC

The Akamai config has been fixed, and by 2019-08-27T15:04Z the UploadFailed degradations had all gone away [1].  I've filed [2] to revert the earlier workaround.

[1]: count(cluster_operator_conditions{name="insights",condition="Degraded",reason="UploadFailed"})
[2]: https://github.com/openshift/insights-operator/pull/12
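
For anyone who wants to re-run the check in [1], the expression can be sent to a Prometheus-compatible instant-query endpoint (/api/v1/query). A minimal sketch of building that request URL follows; the base URL and the instantQueryURL helper are hypothetical, and authentication is left out.

```go
package main

import (
	"fmt"
	"net/url"
)

// degradedQuery is the expression from [1], copied verbatim.
const degradedQuery = `count(cluster_operator_conditions{name="insights",condition="Degraded",reason="UploadFailed"})`

// instantQueryURL (hypothetical helper) returns the instant-query URL for
// the given Prometheus base URL and PromQL expression.
func instantQueryURL(base, query string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/api/v1/query"
	// url.Values.Encode percent-encodes braces, quotes, and commas.
	u.RawQuery = url.Values{"query": {query}}.Encode()
	return u.String(), nil
}

func main() {
	// prometheus.example is a placeholder host, not a real endpoint.
	s, err := instantQueryURL("http://prometheus.example", degradedQuery)
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```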

Comment 15 Dmitry Misharov 2019-09-19 08:11:34 UTC

Verified on 4.2.0-0.ci-2019-09-19-043318.
Reports are uploaded correctly.

Comment 16 errata-xmlrpc 2019-10-16 06:37:03 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922