Bug 1555057
| Summary: | Logging output from the atomic-openshift-*.service(s) makes up the majority of operational logs on a cluster, overwhelming local journald configurations | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Peter Portante <pportant> |
| Component: | Node | Assignee: | Ryan Phillips <rphillips> |
| Status: | CLOSED WONTFIX | QA Contact: | Jianwei Hou <jhou> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 3.9.0 | CC: | anli, aos-bugs, dblack, jcantril, jeder, jokerman, mmccomas, nagrawal, pportant, rmeggins, tkatarki, wsun |
| Target Milestone: | --- | Keywords: | OpsBlocker |
| Target Release: | 3.11.z | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-06-05 15:07:29 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Peter Portante
2018-03-13 21:09:00 UTC
What is the log level? Not sure where it is configured: /etc/origin/master/master-config.yaml? Also the node config? Other configs? Can we see what the most offending log entries are? We can bump the log level for them.

Yes, we can review the free-int and free-stg clusters for logs there. Still looking for the configured log levels on those clusters and any standout repeated messages to curtail.

First, Kibana would let us see all this across all the nodes with the operational logs in aggregate. Jeff Cantrill is working on getting that unblocked. As for logging levels:

```
[root@starter-us-west-1-master-2514b ~]# ps auxww | grep -E [l]oglevel
root 10056 79.3 4.9 10304848 6202532 ? Ssl May03 500:40 /usr/bin/openshift start master api --config=/etc/origin/master/master-config.yaml --loglevel=2 --listen=https://0.0.0.0:443 --master=https://ip-172-31-28-100.us-west-1.compute.internal
root 10258 14.1 3.0 6576944 3793204 ? Ssl May03 89:05 /usr/bin/openshift start master controllers --config=/etc/origin/master/master-config.yaml --loglevel=2 --listen=https://0.0.0.0:8444
root 62506 6.4 0.3 2190292 386352 ? Ssl May02 109:22 /usr/bin/openshift start node --config=/etc/origin/node/node-config.yaml --loglevel=2

[root@starter-us-west-1-node-infra-9436d ~]# ps auxww | grep -E [l]oglevel
1000000+ 4053 46.2 0.3 5243036 560760 ? Ssl Apr03 20321:35 /usr/bin/openshift-router --loglevel=0
root 32414 19.4 3.2 10626860 5387580 ? Ssl Apr18 4483:27 /usr/bin/openshift start node --config=/etc/origin/node/node-config.yaml --loglevel=2

[root@starter-us-west-1-node-compute-06c59 ~]# ps auxww | grep -E [l]oglevel
root 23053 33.9 22.6 8619296 7420980 ? Ssl Apr18 7854:59 /usr/bin/openshift start node --config=/etc/origin/node/node-config.yaml --loglevel=2
```

I am not sure where the actual configuration is stored for the logging level.
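On OCP 3.x hosts the `--loglevel` flag seen above is typically injected through each service's sysconfig file (this location is my assumption, not confirmed anywhere in this bug). A minimal sketch for finding it, with the directory parameterized so it can be exercised against a copy; the `find_loglevel` helper and the synthetic sample are hypothetical:

```shell
# Sketch, assuming the OCP 3.x sysconfig convention (not confirmed in the bug):
# the --loglevel flag is usually set via OPTIONS= in
# /etc/sysconfig/atomic-openshift-{node,master-api,master-controllers}.
find_loglevel() {
    # $1: directory holding the sysconfig files (/etc/sysconfig on a real host)
    grep -H -- --loglevel "$1"/atomic-openshift-* 2>/dev/null || true
}

# Demo against a synthetic sysconfig directory:
mkdir -p /tmp/sysconfig
printf 'OPTIONS=--loglevel=2\n' > /tmp/sysconfig/atomic-openshift-node
find_loglevel /tmp/sysconfig
```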
As for messages to curtail, here is a sample from a compute node; the same process can be followed for the various services:

```
[root@starter-us-west-1-node-compute-06c59 ~]# date
Fri May 4 01:47:52 UTC 2018
[root@starter-us-west-1-node-compute-06c59 ~]# journalctl --output=cat --identifier=atomic-openshift-node --since=01:18 > ./atomic-openshift-node.log
[root@starter-us-west-1-node-compute-06c59 ~]# wc -l atomic-openshift-node.log
3024 atomic-openshift-node.log
```

3,024 messages were captured in the last 30 minutes, with 931 informational, 3 warnings, 2,090 errors, 0 others:

```
[root@starter-us-west-1-node-compute-06c59 ~]# grep -E "^I" atomic-openshift-node.log | wc -l
931
[root@starter-us-west-1-node-compute-06c59 ~]# grep -E "^W" atomic-openshift-node.log | wc -l
3
[root@starter-us-west-1-node-compute-06c59 ~]# grep -E "^E" atomic-openshift-node.log | wc -l
2090
[root@starter-us-west-1-node-compute-06c59 ~]# grep -vE "^(E|I|W)" atomic-openshift-node.log | wc -l
0
```

You can see the logging call sites via:

```
[root@starter-us-west-1-node-compute-06c59 ~]# awk '{print $4}' atomic-openshift-node.log | sort | uniq -c
      2 cadvisor_stats_provider.go:355]
      3 container.go:507]
      7 container_manager_linux.go:425]
      1 fsHandler.go:135]
     18 helpers.go:135]
     13 kube_docker_client.go:348]
     25 kubelet.go:1882]
    293 kuberuntime_manager.go:514]
    293 kuberuntime_manager.go:758]
    280 kuberuntime_manager.go:768]
      1 multitenant.go:154]
    280 pod_workers.go:186]
      7 provider.go:119]
      5 proxier.go:318]
     70 proxier.go:354]
      1 proxier.go:391]
      1 proxier.go:438]
   1715 proxy.go:152]
      5 vnids.go:148]
      4 vnids.go:162]
```

And the following command will give you a list of the repeated messages:

```
[root@starter-us-west-1-node-compute-06c59 ~]# sort -k 5 atomic-openshift-node.log | less -S
```

Seth, for the nodes, the high-frequency locations are these:
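The per-severity greps and the awk call-site count above can be folded into a single pass. A minimal sketch; the `summarize_glog` helper and the synthetic sample are mine, not from the bug, and it assumes glog's format of a leading severity letter with the call site in field 4:

```shell
# Sketch: one-pass awk summary of a glog-format capture, counting severities
# (leading I/W/E) and call sites (field 4, e.g. "proxy.go:152]") together.
summarize_glog() {
    awk '
        /^I/ { info++ } /^W/ { warn++ } /^E/ { err++ } !/^[IWE]/ { other++ }
        { sites[$4]++ }
        END {
            printf "info=%d warn=%d error=%d other=%d\n", info, warn, err, other
            for (s in sites) print sites[s], s | "sort -rn"
        }
    ' "$1"
}

# Demo on a synthetic sample standing in for atomic-openshift-node.log:
cat > /tmp/sample.log <<'EOF'
E0504 01:18:00.000000 23053 proxy.go:152] Error while getting service ...
E0504 01:18:01.000000 23053 proxy.go:152] Error while getting service ...
I0504 01:18:02.000000 23053 kuberuntime_manager.go:758] checking backoff ...
EOF
summarize_glog /tmp/sample.log
```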
kuberuntime_manager.go:514

```go
if kubecontainer.ShouldContainerBeRestarted(&container, pod, podStatus) {
	message := fmt.Sprintf("Container %+v is dead, but RestartPolicy says that we should restart it.", container)
	glog.Info(message)
	changes.ContainersToStart = append(changes.ContainersToStart, idx)
}
```

kuberuntime_manager.go:758

```go
glog.Infof("checking backoff for container %q in pod %q", container.Name, format.Pod(pod))
```

kuberuntime_manager.go:768

```go
err := fmt.Errorf("Back-off %s restarting failed container=%s pod=%s", backOff.Get(key), container.Name, format.Pod(pod))
glog.Infof("%s", err.Error())
```

pod_workers.go:186

```go
if err != nil {
	// IMPORTANT: we do not log errors here, the syncPodFn is responsible for logging errors
	glog.Errorf("Error syncing pod %s (%q), skipping: %v", update.Pod.UID, format.Pod(update.Pod), err)
}
```
Solly, are service lookup errors a known issue with the unidling proxy?
proxy.go:152

```go
func (p *HybridProxier) switchService(name types.NamespacedName) {
	svc, err := p.serviceLister.Services(name.Namespace).Get(name.Name)
	if err != nil {
		utilruntime.HandleError(fmt.Errorf("Error while getting service %s from cache: %v", name.String(), err))
		return
	}
```
For those watching at home, using:

```
awk '{print $4}' atomic-openshift-node.log | sort | uniq -c | sort -k 1 -n -r
```

will give you the list sorted by most frequent to least.
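Since errors dominate the node sample, ranking only the error-level call sites narrows the hunt further. A small variation on the same pipeline; the `rank_error_sites` wrapper and the sample data are mine, not from the bug:

```shell
# Rank only error-level ("E"-prefixed) glog lines by call site (field 4).
rank_error_sites() {
    grep -E '^E' "$1" | awk '{print $4}' | sort | uniq -c | sort -k 1 -n -r
}

# Demo on a synthetic sample standing in for atomic-openshift-node.log:
printf '%s\n' \
    'E0504 01:18:00.000000 23053 proxy.go:152] lookup failed' \
    'E0504 01:18:01.000000 23053 proxy.go:152] lookup failed' \
    'I0504 01:18:02.000000 23053 kubelet.go:1882] ok' \
    > /tmp/err_sample.log
rank_error_sites /tmp/err_sample.log
```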
On the master side, from a 30-minute sample:

```
      6 authentication.go:64]
      3 generator.go:325]
    480 generator.go:672]
   2153 get.go:238]
    242 logs.go:41]
    382 rbac.go:116]
     40 reflector.go:341]
    291 trace.go:76]
    128 watcher.go:208]
   5910 watcher.go:235]
   4710 watcher.go:341]
```
get.go:238

```go
glog.V(2).Infof("Starting watch for %s, rv=%s labels=%s fields=%s timeout=%s", req.URL.Path, opts.ResourceVersion, opts.LabelSelector, opts.FieldSelector, timeout)
```

watcher.go:235

```go
if len(wc.resultChan) == outgoingBufSize {
	glog.Warningf("Fast watcher, slow processing. Number of buffered events: %d."+
		"Probably caused by slow dispatching events to watchers", outgoingBufSize)
}
```

watcher.go:341

```go
if len(wc.incomingEventChan) == incomingBufSize {
	glog.Warningf("Fast watcher, slow processing. Number of buffered events: %d."+
		"Probably caused by slow decoding, user not receiving fast, or other processing logic",
		incomingBufSize)
}
```
controllers:

```
      2 buildconfig_controller.go:139]
    480 buildconfig_controller.go:149]
      2 buildconfig_controller.go:243]
     30 buildconfig_controller.go:254]
      1 build_controller.go:1122]
      2 build_controller.go:1136]
      2 build_controller.go:1284]
     21 deploymentconfig_controller.go:496]
      1 end
   3455 event.go:218]
   2100 factory.go:1158]
   2100 factory.go:1251]
   2640 horizontal.go:189]
     82 reflector.go:341]
   2100 scheduler.go:191]
      2 vnids.go:123]
      2 vnids.go:145]
```
Avesh, it appears the scheduler logs three times for every scheduling error:

factory.go:1158

```go
} else {
	glog.Errorf("Error scheduling %v %v: %v; retrying", pod.Namespace, pod.Name, err)
}
```

factory.go:1251

```go
func (p *podConditionUpdater) Update(pod *v1.Pod, condition *v1.PodCondition) error {
	glog.V(2).Infof("Updating pod condition for %s/%s to (%s==%s)", pod.Namespace, pod.Name, condition.Type, condition.Status)
```

scheduler.go:191
Solly, it looks like HPA is failing on extensions deploymentconfig lookups:

horizontal.go:189

```
unable to determine resource for scale target reference: no matches for extensions/, Kind=DeploymentConfig
```
Opened https://github.com/openshift/origin/pull/19970 for the apiserver outliers.

Avesh, can you look at deduplicating and/or raising the verbosity level on the scheduler log in https://bugzilla.redhat.com/show_bug.cgi?id=1555057#c9

Seth, can you look at deduplicating and/or raising the verbosity level on the node log output in https://bugzilla.redhat.com/show_bug.cgi?id=1555057#c6

Will take a look tomorrow, thanks.

Kube PR for https://bugzilla.redhat.com/show_bug.cgi?id=1555057#c6: https://github.com/kubernetes/kubernetes/pull/65065

https://github.com/openshift/origin/pull/19970 is merged; the pod and scheduler teams own the remaining outliers.

PR to kubernetes: https://github.com/kubernetes/kubernetes/pull/65188

Origin PR for the kubelet/scheduler changes: https://github.com/openshift/origin/pull/20661

The Origin PR expired due to inactivity. Retargeting for 3.11.

This is fixed in all versions of 4.x. New Origin PR: https://github.com/openshift/origin/pull/23182
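Until the offending call sites are fixed, one stopgap on affected hosts (my suggestion; it is not proposed anywhere in this bug) is to make journald's rate limiting deliberate rather than accidental via a drop-in. The file name and values below are purely illustrative:

```shell
# Hypothetical mitigation (not from the bug): a journald drop-in that widens
# the per-service rate limit so bursts from atomic-openshift-node are not
# silently dropped; tighten the values instead if shedding load is preferred.
mkdir -p /etc/systemd/journald.conf.d
cat > /etc/systemd/journald.conf.d/99-openshift-rate-limit.conf <<'EOF'
[Journal]
# RHEL 7 systemd spells these RateLimitInterval/RateLimitBurst
# (newer systemd releases use RateLimitIntervalSec).
RateLimitInterval=30s
RateLimitBurst=20000
EOF
systemctl restart systemd-journald
```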