Bug 1462446

Summary: Useless log messages from apiserver/pkg/endpoints/handlers/rest.go
Product: OpenShift Container Platform
Reporter: Eric Paris <eparis>
Component: Node
Assignee: David Eads <deads>
Status: CLOSED ERRATA
QA Contact: Zhang Cheng <chezhang>
Severity: medium
Priority: unspecified
Version: 3.5.1
CC: aos-bugs, jliggitt, jokerman, mmccomas, smunilla
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Last Closed: 2017-08-10 05:28:09 UTC
Type: Bug

Description Eric Paris 2017-06-17 18:34:44 UTC
oc v3.5.5.19

/usr/bin/openshift start master api --config=/etc/origin/master/master-config.yaml --loglevel=2 --listen=https://0.0.0.0:443 --master=https://ip-172-31-54-162.ec2.internal

I see semi-regular spam (about 3 msg/sec) in the logs like:

Jun 16 22:15:36 ip-172-31-60-65.ec2.internal atomic-openshift-master-api[49708]: [100.91µs] [55.468µs] Conversion done
Jun 16 22:15:36 ip-172-31-60-65.ec2.internal atomic-openshift-master-api[49708]: [1.017752141s] [1.01764173s] Object stored in database
Jun 16 22:15:36 ip-172-31-60-65.ec2.internal atomic-openshift-master-api[49708]: [1.01775391s] [1.769µs] Self-link added
Jun 16 22:15:36 ip-172-31-60-65.ec2.internal atomic-openshift-master-api[49708]: [105.006µs] [62.684µs] Conversion done
Jun 16 22:15:36 ip-172-31-60-65.ec2.internal atomic-openshift-master-api[49708]: [108.761µs] [55.671µs] About to store object in database
Jun 16 22:15:36 ip-172-31-60-65.ec2.internal atomic-openshift-master-api[49708]: [110.411µs] [39.82µs] About to store object in database
Jun 16 22:15:36 ip-172-31-60-65.ec2.internal atomic-openshift-master-api[49708]: [110.713µs] [79.207µs] Conversion done
Jun 16 22:15:36 ip-172-31-60-65.ec2.internal atomic-openshift-master-api[49708]: [114.178µs] [33.322µs] About to store object in database
Jun 16 22:15:36 ip-172-31-60-65.ec2.internal atomic-openshift-master-api[49708]: [1.145085549s] [1.144939724s] Object stored in database
Jun 16 22:15:36 ip-172-31-60-65.ec2.internal atomic-openshift-master-api[49708]: [1.145086921s] [1.372µs] Self-link added
Jun 16 22:15:36 ip-172-31-60-65.ec2.internal atomic-openshift-master-api[49708]: [115.915µs] [5.202µs] About to store object in database
Jun 16 22:15:36 ip-172-31-60-65.ec2.internal atomic-openshift-master-api[49708]: [1.249214715s] [1.249067734s] Object stored in database
Jun 16 22:15:36 ip-172-31-60-65.ec2.internal atomic-openshift-master-api[49708]: [1.249215901s] [1.186µs] Self-link added
Jun 16 22:15:36 ip-172-31-60-65.ec2.internal atomic-openshift-master-api[49708]: [126.589µs] [39.447µs] About to store object in database
Jun 16 22:15:36 ip-172-31-60-65.ec2.internal atomic-openshift-master-api[49708]: [126.89µs] [37.705µs] About to store object in database
Jun 16 22:15:36 ip-172-31-60-65.ec2.internal atomic-openshift-master-api[49708]: [145.825µs] [44.915µs] About to store object in database
Jun 16 22:15:36 ip-172-31-60-65.ec2.internal atomic-openshift-master-api[49708]: [146.981µs] [47.796µs] About to store object in database

These messages provide no value to an admin and just hide useful messages. If they need to be at something like V(4), okay. But even then they seem a bit excessive as individual messages. Finding a way to write one message instead of interleaving pieces and parts of different calls would both reduce the number of these and potentially increase their value.

Comment 2 Eric Paris 2017-06-17 18:45:01 UTC
It's the whole trace system. We also have painful logs coming from syncReplicationController().

Comment 3 David Eads 2017-06-19 11:31:32 UTC
This trace is only supposed to come out when requests take more than 500 milliseconds.  In this case it's telling you that an individual etcd write took over a second (about two orders of magnitude too long) and provides the timing on the call so you can localize the trouble.  I would expect an admin to care that his system is running 100 times slower than it should (he probably already knows from Prometheus or something), but this highlights exactly which step is broken.

I think the interleaving is due to weird writer behavior.  It's dumped as a single glog call: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/util/trace/trace.go#L61

Comment 4 Jordan Liggitt 2017-06-20 15:11:02 UTC
moving to low severity, not a regression, would need to be addressed upstream

Comment 5 Eric Paris 2017-06-20 16:16:00 UTC
Log spam is a priority for 3.6. Starter/free is spamming the logs so badly that we lose many (most) meaningful log messages to the rate limiter.  If we have to carry a 3.6 patch to move these to a lower priority or something, so be it.

I am returning this to medium, and this situation must be improved in 3.6, even if we have to carry a patch while we work upstream.

One 'solution' would be to replace all \n's with something like "--" so we end up with one log message instead of eight.

But Online cannot handle the huge number of useless log messages.
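The newline-replacement mitigation suggested above is essentially a one-liner; here is a hedged sketch (the function name and sample input are illustrative, not from the actual patch):

```go
package main

import (
	"fmt"
	"strings"
)

// collapseTrace folds a multi-line trace dump into a single log
// record by replacing the embedded newlines with " -- ", so the
// journald rate limiter counts one message instead of eight.
func collapseTrace(multiline string) string {
	return strings.ReplaceAll(strings.TrimSpace(multiline), "\n", " -- ")
}

func main() {
	// A shortened stand-in for the trace dump shown in the description.
	dump := "Trace: create pod\n" +
		"[110.411µs] [39.82µs] About to store object in database\n" +
		"[1.145085549s] [1.144939724s] Object stored in database\n" +
		"[1.145086921s] [1.372µs] Self-link added"
	fmt.Println(collapseTrace(dump))
}
```

This keeps all the timing data but makes each trace a single journald entry, which also fixes the interleaving between concurrent requests.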

Comment 6 David Eads 2017-06-21 12:36:40 UTC
How is a 100x slower than expected response from etcd not the most important message you have?  Why isn't the fix to address the etcd latency that is going to affect the responses from the API server to all clients?

It's pretty trivial to reduce the verbosity of the message, but all I'd be doing is hiding the reason why everything that relies on the API server (which is pretty much everything) is slow.  Are you really, really, really sure that you want to suppress these?  How slow a response do you want to consider exceptional?  1000x, 10000x, more?

Comment 7 Eric Paris 2017-06-21 18:36:40 UTC
This bug is about log spam. It is a requirement to reduce/eliminate log spam in 3.6. I have a lot of log messages that say nothing of value. An example would be:

[105.006µs] [62.684µs] Conversion done

I'm not saying there isn't a serious bug we need to address. And rest assured, we're going to harass you on why some things are taking so long. But the useless log spam must go, either by making all of this a single message or by hiding the useless portions. Even the one part of these that seems useful:

[1.01775391s] [1.769µs] Self-link added

doesn't actually tell us where to go next. Maybe when correlated with OTHER messages you get a pointer on what to do. If that is the case, make it a single message. Make them correlated/correlatable.

I'm not saying we don't need to fix the underlying problem/issue that is causing the spam, but we must reduce/eliminate the useless spam in 3.6.


Please ALSO let us know what information needs to be collected to get to the bottom of the underlying issue.

Comment 8 David Eads 2017-06-27 12:48:46 UTC
https://github.com/openshift/origin/pull/14911

Comment 10 Zhang Cheng 2017-07-05 06:11:18 UTC
Verified and passed on openshift v3.6.133

Comment 12 errata-xmlrpc 2017-08-10 05:28:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716