Summary: Useless log messages from AWS API calls
Product: OpenShift Container Platform
Reporter: Eric Paris <eparis>
Component: Node
Assignee: Jeff Peeler <jpeeler>
Status: CLOSED ERRATA
QA Contact: DeShuai Ma <dma>
Version: 3.5.1
CC: aos-bugs, deads, decarr, hekumar, jliggitt, jokerman, jsafrane, mmccomas, trankin, xtian
Fixed In Version:
Doc Type: Bug Fix
Doc Text: Previously, messages originating from the AWS SDK caused partial log entries because of newlines embedded in the message itself. Error messages are now properly quoted so that all messages are reported in full.
Last Closed: 2017-11-28 21:58:09 UTC
Type: Bug
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Description Eric Paris 2017-06-17 18:20:10 UTC
oc v126.96.36.199
/usr/bin/openshift start master controllers --config=/etc/origin/master/master-config.yaml --loglevel=2 --listen=https://0.0.0.0:8444

I see semi-regular spam in the logs like:

Jun 16 22:12:05 ip-172-31-54-162.ec2.internal atomic-openshift-master-controllers: status code: 400, request id:
Jun 16 22:12:05 ip-172-31-54-162.ec2.internal atomic-openshift-master-controllers: status code: 400, request id:
Jun 16 22:12:05 ip-172-31-54-162.ec2.internal atomic-openshift-master-controllers: status code: 400, request id:

This message tells me NOTHING of value and is just spam. We need to do two things:

1. Get rid of the useless spam (either by not logging it, or by logging something useful).
2. Determine whether it represents a problem and, if so, fix the problem.

It seems each piece of useless spam is always preceded by a log message which looks like one of the following:

Jun 16 22:13:40 ip-172-31-54-162.ec2.internal atomic-openshift-master-controllers: I0616 22:13:40.675705 48569 aws.go:1248] Assigned mount device bj -> volume vol-09e9fa5a07720c47e
Jun 16 22:13:40 ip-172-31-54-162.ec2.internal atomic-openshift-master-controllers: I0616 22:13:40.912467 48569 aws.go:1271] Releasing in-process attachment entry: ba -> volume vol-009ac4600758f6896
Comment 2 Robert Rati 2017-06-20 12:45:47 UTC
From the logs it looks like the attach detach controller is trying to attach volumes that are already attached to an instance to another instance (possibly the same instance?). That doesn't seem like the appropriate behavior, but I'm not an expert in that controller. The log message itself is coming from the aws sdk. There has been some reworking of error handling in the newer sdk, but it isn't clear that would solve this issue. Still tracking down where that message actually gets printed to see what control, if any, we have over it. How is this issue reproduced?
Comment 3 Robert Rati 2017-06-20 12:46:43 UTC
Correction, the reconciler in the attach detach controller.
Comment 4 Eric Paris 2017-06-20 13:01:54 UTC
Unmarked your comments private. They contained no private information. The attach/detach controller should definitely be better about not trying to mount volumes on a node while they are still mounted on another node. No question. But that is not what this BZ is about. We get a very good, reasonable log message when that happens. We also get a completely useless, spammy log message which provides nothing of value. This BZ is to address the completely useless, spammy log message. You need to either make it provide useful, actionable information not already being logged, or make it go away entirely. I don't care which, but useless log spam must go.
Comment 5 Robert Rati 2017-06-20 13:11:22 UTC
Sure, I'm just highlighting that the error situation that brings this up might want to be investigated by someone more knowledgeable about the controller. I'm looking at getting rid of the debug message.
Comment 6 Hemant Kumar 2017-06-20 13:17:21 UTC
The log line is coming from AWS SDK code. This is where it originates: https://github.com/kubernetes/kubernetes/blob/master/vendor/github.com/aws/aws-sdk-go/aws/awserr/types.go#L142 And this is how AWS combines the error string: https://github.com/kubernetes/kubernetes/blob/master/vendor/github.com/aws/aws-sdk-go/aws/awserr/types.go#L9 They are joining those different error strings with "\n\t", so when we print a straight-up error returned by the AWS SDK, that is what we get. :( It is a bit crazy that the AWS SDK joins the error strings with "\n\t", thereby making them span several lines.
Comment 7 Eric Paris 2017-06-20 13:34:06 UTC
Log messages containing \n are similarly a problem in a BZ I assigned to deads (his is about the trace subsystem). I guess we need to make a concerted effort to drive all \n out of log messages?
Comment 8 Eric Paris 2017-06-20 13:38:05 UTC
Although, looking through the (raw) logs I don't see any following log lines that seem to include more related information. It looks like the spam is all I'm getting...
Comment 9 Robert Rati 2017-06-20 13:43:45 UTC
The information that looks like spam is part of the error message from the AWS SDK. I'm on the fence as to whether that information will ever be useful if the request id is never filled in by the SDK. I'm going to log an issue upstream where the Kubernetes community can discuss it.
Comment 10 Hemant Kumar 2017-06-20 13:44:35 UTC
I think the log lines are getting interleaved because of the multi-threaded nature of the controller-manager and the fact that the error string from AWS spans multiple lines.
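The interleaving can be sketched deterministically. journald treats each newline-terminated write as a separate record, so a multi-line AWS error from one goroutine can have another goroutine's line land between its first line and its continuation. A sketch under that assumption (`journalRecords` is a hypothetical helper modeling journald's line splitting, and the log content is taken from the description above):

```go
package main

import (
	"fmt"
	"strings"
)

// journalRecords models journald's behavior: each newline in a logged
// message starts a new, independent journal record.
func journalRecords(msg string) []string {
	return strings.Split(msg, "\n")
}

func main() {
	// Multi-line AWS error logged by goroutine A.
	errA := "E0616 aws.go:1248] attach failed: VolumeInUse" +
		"\n\tstatus code: 400, request id: "
	// Single-line message logged concurrently by goroutine B.
	lineB := "I0616 aws.go:1271] Releasing in-process attachment entry: ba -> volume vol-009ac4600758f6896"

	recs := journalRecords(errA)
	fmt.Println(recs[0]) // A's first line, with context
	fmt.Println(lineB)   // B's line can land here
	fmt.Println(recs[1]) // A's orphaned continuation: the "spam"
}
```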
Comment 11 Hemant Kumar 2017-06-20 13:48:26 UTC
It also renders the effort we put into improving that error kind of moot: https://trello.com/c/vy1B8QyL/467-1-improving-logging-of-volumeinuse-errors Because if the AWS error lands in the middle of a log line, it will automatically split the device error too. :(
Comment 12 Robert Rati 2017-06-20 13:50:12 UTC
That is my guess as to what is happening. However, even if we were to inspect the error message from AWS and remove the \n from the log message so that it attaches to the previous error message, it isn't clear to me that having the status code and a blank request id adds anything to the error message.
Comment 13 Eric Paris 2017-06-27 23:25:34 UTC
Comment 14 Jeff Peeler 2017-06-28 20:27:59 UTC
Origin cherry pick: https://github.com/openshift/origin/pull/14948
Comment 21 errata-xmlrpc 2017-11-28 21:58:09 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188