oc v3.5.5.19

/usr/bin/openshift start master controllers --config=/etc/origin/master/master-config.yaml --loglevel=2 --listen=https://0.0.0.0:8444

I see semi-regular spam in the logs like:

Jun 16 22:12:05 ip-172-31-54-162.ec2.internal atomic-openshift-master-controllers[48569]: status code: 400, request id:
Jun 16 22:12:05 ip-172-31-54-162.ec2.internal atomic-openshift-master-controllers[48569]: status code: 400, request id:
Jun 16 22:12:05 ip-172-31-54-162.ec2.internal atomic-openshift-master-controllers[48569]: status code: 400, request id:

This message tells me NOTHING of value and is just spam. We need to do two things:

1. Get rid of the useless spam (either by not logging it or by logging something useful).
2. Determine whether it represents a problem and, if so, fix the problem.

It seems each piece of useless spam is always preceded by a log message that looks like one of the following:

Jun 16 22:13:40 ip-172-31-54-162.ec2.internal atomic-openshift-master-controllers[48569]: I0616 22:13:40.675705 48569 aws.go:1248] Assigned mount device bj -> volume vol-09e9fa5a07720c47e
Jun 16 22:13:40 ip-172-31-54-162.ec2.internal atomic-openshift-master-controllers[48569]: I0616 22:13:40.912467 48569 aws.go:1271] Releasing in-process attachment entry: ba -> volume vol-009ac4600758f6896
From the logs it looks like the attach/detach controller is trying to attach volumes that are already attached to an instance to another instance (possibly the same instance?). That doesn't seem like appropriate behavior, but I'm not an expert in that controller. The log message itself comes from the AWS SDK. There has been some reworking of error handling in the newer SDK, but it isn't clear that it would solve this issue. I'm still tracking down where that message actually gets printed to see what control, if any, we have over it.

How is this issue reproduced?
Correction: the reconciler in the attach/detach controller.
Unmarked your comments private; they contained no private information.

The attach/detach controller should definitely be better about not trying to mount a volume on a node while it is still mounted on another node. No question. But that is not what this BZ is about. We get a very good, reasonable log message when that happens. We also get a completely useless, spammy log message that provides nothing of value.

This BZ is to address the completely useless, spammy log message. You need to either make it provide useful, actionable information that is not already being logged, or make it go away entirely. I don't care which, but useless log spam must go.
Sure. I'm just highlighting that the error situation that triggers this might warrant investigation by someone more knowledgeable about the controller. I'm looking at getting rid of the debug message.
The log line is coming from AWS SDK code, specifically:

https://github.com/kubernetes/kubernetes/blob/master/vendor/github.com/aws/aws-sdk-go/aws/awserr/types.go#L142

And this is how the SDK combines the error string:

https://github.com/kubernetes/kubernetes/blob/master/vendor/github.com/aws/aws-sdk-go/aws/awserr/types.go#L9

The SDK joins the different error strings with "\n\t", so when we print an error returned by the AWS SDK verbatim, this is what we get. :( It is a bit crazy that the AWS SDK joins the error strings with "\n\t", thereby making them span several lines.
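For reference, this is roughly what the code at those two links does (a simplified paraphrase of the vendored SDK, with an illustrative VolumeInUse error; the real SprintError also appends any wrapped "caused by" error):

package main

import "fmt"

// Simplified paraphrase of awserr.SprintError (types.go#L9): any "extra"
// detail is appended after "\n\t", which puts it on its own physical line.
func sprintError(code, message, extra string) string {
	msg := fmt.Sprintf("%s: %s", code, message)
	if extra != "" {
		msg = fmt.Sprintf("%s\n\t%s", msg, extra)
	}
	return msg
}

func main() {
	// Simplified paraphrase of requestError.Error() (types.go#L142): the
	// status code and the (often empty) request id become the "extra" string.
	extra := fmt.Sprintf("status code: %d, request id: %s", 400, "")

	err := sprintError("VolumeInUse",
		"vol-09e9fa5a07720c47e is already attached to an instance", extra)

	// Printing the error verbatim emits two physical lines; the journal
	// records each line separately, and the second one is the "spam":
	//
	//   VolumeInUse: vol-09e9fa5a07720c47e is already attached to an instance
	//           status code: 400, request id:
	fmt.Println(err)
}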
The handling of log messages containing \n is similarly a problem in a BZ I assigned to deads (his is about the trace subsystem). I guess we need to make a concerted effort to drive all \n out of log messages?
Although, looking through the (raw) logs, I don't see any following log lines that seem to include more related information. It looks like the spam is all I'm getting...
The information that looks like spam is part of the error message from the AWS SDK. I'm on the fence as to whether that information will ever be useful if the request id is never filled in by the SDK. I'm going to log an issue upstream where the Kubernetes community can discuss it.
I think the log lines are getting interleaved because of the multi-threaded nature of the controller-manager, combined with the fact that the error string from AWS spans multiple lines.
It also renders the effort we put into improving that error kind of moot: https://trello.com/c/vy1B8QyL/467-1-improving-logging-of-volumeinuse-errors. If the AWS error is in the middle of a log line, it will automatically split the device error too. :(
That is my guess as to what is happening. However, even if we were to introspect the error message from AWS and strip the \n characters so that the detail stays attached to the preceding error message, it isn't clear to me that having the status code and a blank request id adds anything to the error message.
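For what it's worth, if we did decide to flatten the error before logging it, a minimal sketch would look something like this (flattenAWSError is a hypothetical helper, not an existing function in the volume plugin):

package main

import (
	"fmt"
	"strings"
)

// flattenAWSError is a hypothetical helper: it collapses the multi-line
// error string the AWS SDK produces into a single line, so the detail
// stays attached to its header in the journal.
func flattenAWSError(err error) string {
	s := strings.Replace(err.Error(), "\n\t", " -- ", -1)
	return strings.Replace(s, "\n", " -- ", -1)
}

func main() {
	err := fmt.Errorf("VolumeInUse: vol-09e9fa5a07720c47e is already attached to an instance\n\tstatus code: 400, request id: ")

	// One physical line, so the journal keeps it together:
	// VolumeInUse: vol-09e9fa5a07720c47e is already attached to an instance -- status code: 400, request id:
	fmt.Println(flattenAWSError(err))
}

But as noted above, even flattened, "status code: 400, request id: " with an empty request id doesn't add much.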
Upstream PR: https://github.com/kubernetes/kubernetes/pull/47919
Origin cherry pick: https://github.com/openshift/origin/pull/14948
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188