Bug 1462446
Summary: | Useless log messages from apiserver/pkg/endpoints/handlers/rest.go | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Eric Paris <eparis> |
Component: | Node | Assignee: | David Eads <deads> |
Status: | CLOSED ERRATA | QA Contact: | Zhang Cheng <chezhang> |
Severity: | medium | Docs Contact: | |
Priority: | unspecified | ||
Version: | 3.5.1 | CC: | aos-bugs, jliggitt, jokerman, mmccomas, smunilla |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | No Doc Update | |
Doc Text: |
undefined
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2017-08-10 05:28:09 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Eric Paris
2017-06-17 18:34:44 UTC
Its the whole trace system. We also have painful logs coming from syncReplicationController() This trace is only supposed to come out when requests take more than 500 milliseconds. In this case its telling you than an individual etcd write took over a second (about two orders of magnitude too long) and provides the timing on the call so you can localize the trouble. I would expect an admin to care that his system is running 100 times slower than it should (he probably already knows from prometheus or something), but this highlights exactly which step is broken. I think the interleaving is due weird writer behavior. It's dumped as a single glog: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/util/trace/trace.go#L61 moving to low severity, not a regression, would need to be addressed upstream Log spam is a priority for 3.6. Starter/free is spamming the logs so badly we lose many (most) meaningful log messages to the rate limiter. If we have to carry a 3.6 path to move these to a lower priorty or something, so be it. I am returning this to medium and this situation must be improved in 3.6, even if we have to carry while we work upstream. One 'solution' would be to replace all \n's with something like "--" so we end up with 1 log message instead of 8. But online can not handle the huge number of useless log messages. How is a 100x slower than expected response from etcd not the most important message you have? Why isn't the fix to address the etcd latency that is going to affect the responses from the API server to all clients? It's pretty trivial to reduce the verbosity of the message, but all I'd be doing is hiding the reason why everything that relies on the API server (which is pretty much everything) is slow. Are you really, really, really sure that you want to suppress these? How slow a response do you want to consider exceptional? 1000x, 10000x, more? This bug is about log spam. It is a requirement to reduce/eliminate log spam in 3.6. I have a lot of log message that says nothing of value. An example would be: [105.006µs] [62.684µs] Conversion done I'm not saying there isn't a serious bug we need to address. And rest assured, we're going to harass you on why some things are taking so long. But the useless log spam must go. Either by making all of this a single message or by hiding the useless portions. Even the one part of these that seem useful: [1.01775391s] [1.769µs] Self-link added Don't actually tell us where to go next. Maybe when corrolated with OTHER message you get a pointer on what to do. If that is the case, make that a single message. Make them correlated/correlatable. I'm not saying we don't need to fix the underlying problem/issue that is causing the spam, but we must reduce/eliminate the useless spam in 3.6. Please ALSO let us know what information needs to be collected to get to the bottom of the underlying issue. Verified and passed on openshift v3.6.133 Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:1716 |