The atomic-openshift-*.service units emit the majority of operational logs on a cluster, creating a situation that can overwhelm local journald configurations. On one of the starter clusters, journald operational logs were sampled on a master, an infra node, and a compute node as follows (see the raw data below; note this is with journald rate-limiting in place at 10,000/sec):

master: 153,333 total in an hour, 98.3% from atomic-openshift-* processes
infra: 18,332 total in an hour, 87.6% from atomic-openshift-* processes
compute: 29,607 total in an hour, 84.9% from atomic-openshift-* processes

This represents a significant increase in logging rates on a local RHEL node, where the default configuration for rsyslog and journald may be ill-suited to handle it. Is such a logging rate really necessary?

[root@starter-us-west-1-master-2514b ~]# journalctl --since="03:16:00" --output=verbose | grep SYSLOG_IDENTIFIER > who-is-logging-in-last-hour.lis
[root@starter-us-west-1-master-2514b ~]# wc -l who-is-logging-in-last-hour.lis
153333 who-is-logging-in-last-hour.lis
[root@starter-us-west-1-master-2514b ~]# grep -v "SYSLOG_IDENTIFIER=atomic-openshift" who-is-logging-in-last-hour.lis | wc -l
2598

[root@starter-us-west-1-node-infra-9436d ~]# journalctl --since="03:16:00" --output=verbose | grep SYSLOG_IDENTIFIER > who-is-logging-in-last-hour.lis
[root@starter-us-west-1-node-infra-9436d ~]# wc -l who-is-logging-in-last-hour.lis
18332 who-is-logging-in-last-hour.lis
[root@starter-us-west-1-node-infra-9436d ~]# grep -v "SYSLOG_IDENTIFIER=atomic-openshift" who-is-logging-in-last-hour.lis | wc -l
2268

[root@starter-us-west-1-node-compute-06c59 ~]# journalctl --since="03:16:00" --output=verbose | grep SYSLOG_IDENTIFIER > who-is-logging-in-last-hour.lis
[root@starter-us-west-1-node-compute-06c59 ~]# wc -l who-is-logging-in-last-hour.lis
29607 who-is-logging-in-last-hour.lis
[root@starter-us-west-1-node-compute-06c59 ~]# grep -v "SYSLOG_IDENTIFIER=atomic-openshift" who-is-logging-in-last-hour.lis | wc -l
4466
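For reference, a journald rate limit like the 10,000/sec mentioned above would typically come from settings along these lines in /etc/systemd/journald.conf (the exact values in use on the starter nodes are an assumption here, not something I verified), applied with a restart of systemd-journald; note journald rate limiting is applied per service:

    [Journal]
    # allow up to 10,000 messages from a given service within a 1-second interval before dropping
    RateLimitInterval=1s
    RateLimitBurst=10000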
What is the log level? I'm not sure where it is configured: /etc/origin/master/master-config.yaml? The node config? Other configs?
Can we see which log entries are the worst offenders? We can bump the log level on those.
Yes, we can review the free-int and free-stg clusters for their logs.
Still looking for the configured log levels on those clusters and any standout repeated messages to curtail.
First, Kibana would let us see all this across all the nodes, with the operational logs in aggregate. Jeff Cantrill is working on getting that unblocked.

As for logging levels:

[root@starter-us-west-1-master-2514b ~]# ps auxww | grep -E [l]oglevel
root 10056 79.3 4.9 10304848 6202532 ? Ssl May03 500:40 /usr/bin/openshift start master api --config=/etc/origin/master/master-config.yaml --loglevel=2 --listen=https://0.0.0.0:443 --master=https://ip-172-31-28-100.us-west-1.compute.internal
root 10258 14.1 3.0 6576944 3793204 ? Ssl May03 89:05 /usr/bin/openshift start master controllers --config=/etc/origin/master/master-config.yaml --loglevel=2 --listen=https://0.0.0.0:8444
root 62506 6.4 0.3 2190292 386352 ? Ssl May02 109:22 /usr/bin/openshift start node --config=/etc/origin/node/node-config.yaml --loglevel=2

[root@starter-us-west-1-node-infra-9436d ~]# ps auxww | grep -E [l]oglevel
1000000+ 4053 46.2 0.3 5243036 560760 ? Ssl Apr03 20321:35 /usr/bin/openshift-router --loglevel=0
root 32414 19.4 3.2 10626860 5387580 ? Ssl Apr18 4483:27 /usr/bin/openshift start node --config=/etc/origin/node/node-config.yaml --loglevel=2

[root@starter-us-west-1-node-compute-06c59 ~]# ps auxww | grep -E [l]oglevel
root 23053 33.9 22.6 8619296 7420980 ? Ssl Apr18 7854:59 /usr/bin/openshift start node --config=/etc/origin/node/node-config.yaml --loglevel=2

I am not sure where the actual configuration for the logging level is stored.

As for messages to curtail, here is a sample from a compute node; the same process can be followed for the various services:

[root@starter-us-west-1-node-compute-06c59 ~]# date
Fri May 4 01:47:52 UTC 2018
[root@starter-us-west-1-node-compute-06c59 ~]# journalctl --output=cat --identifier=atomic-openshift-node --since=01:18 > ./atomic-openshift-node.log
[root@starter-us-west-1-node-compute-06c59 ~]# wc -l atomic-openshift-node.log
3024 atomic-openshift-node.log

3,024 messages captured in the last 30 minutes, with 931 informational, 3 warnings, 2,090 errors, 0 others:

[root@starter-us-west-1-node-compute-06c59 ~]# grep -E "^I" atomic-openshift-node.log | wc -l
931
[root@starter-us-west-1-node-compute-06c59 ~]# grep -E "^W" atomic-openshift-node.log | wc -l
3
[root@starter-us-west-1-node-compute-06c59 ~]# grep -E "^E" atomic-openshift-node.log | wc -l
2090
[root@starter-us-west-1-node-compute-06c59 ~]# grep -vE "^(E|I|W)" atomic-openshift-node.log | wc -l
0

The fourth field of each glog line is the source file and line that produced it, so you can see the per-call-site counts via:

[root@starter-us-west-1-node-compute-06c59 ~]# awk '{print $4}' atomic-openshift-node.log | sort | uniq -c
      2 cadvisor_stats_provider.go:355]
      3 container.go:507]
      7 container_manager_linux.go:425]
      1 fsHandler.go:135]
     18 helpers.go:135]
     13 kube_docker_client.go:348]
     25 kubelet.go:1882]
    293 kuberuntime_manager.go:514]
    293 kuberuntime_manager.go:758]
    280 kuberuntime_manager.go:768]
      1 multitenant.go:154]
    280 pod_workers.go:186]
      7 provider.go:119]
      5 proxier.go:318]
     70 proxier.go:354]
      1 proxier.go:391]
      1 proxier.go:438]
   1715 proxy.go:152]
      5 vnids.go:148]
      4 vnids.go:162]

And the following command will give you a list of the repeated messages:

[root@starter-us-west-1-node-compute-06c59 ~]# sort -k 5 atomic-openshift-node.log | less -S
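On where the level is configured: if I recall correctly, the --loglevel flag on those command lines comes from the OPTIONS variable in the sysconfig files rather than from master-config.yaml or node-config.yaml, along these lines (treat this as an assumption to verify on these clusters):

    # /etc/sysconfig/atomic-openshift-node (and similarly
    # /etc/sysconfig/atomic-openshift-master-api and /etc/sysconfig/atomic-openshift-master-controllers)
    OPTIONS=--loglevel=2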
Seth, for the nodes, the high-frequency call sites are these:

kuberuntime_manager.go:514

    if kubecontainer.ShouldContainerBeRestarted(&container, pod, podStatus) {
        message := fmt.Sprintf("Container %+v is dead, but RestartPolicy says that we should restart it.", container)
        glog.Info(message)
        changes.ContainersToStart = append(changes.ContainersToStart, idx)
    }

kuberuntime_manager.go:758

    glog.Infof("checking backoff for container %q in pod %q", container.Name, format.Pod(pod))

kuberuntime_manager.go:768

    err := fmt.Errorf("Back-off %s restarting failed container=%s pod=%s", backOff.Get(key), container.Name, format.Pod(pod))
    glog.Infof("%s", err.Error())

pod_workers.go:186

    if err != nil {
        // IMPORTANT: we do not log errors here, the syncPodFn is responsible for logging errors
        glog.Errorf("Error syncing pod %s (%q), skipping: %v", update.Pod.UID, format.Pod(update.Pod), err)
    }

Solly, are the service lookup errors a known issue with the unidling proxy?

proxy.go:152

    func (p *HybridProxier) switchService(name types.NamespacedName) {
        svc, err := p.serviceLister.Services(name.Namespace).Get(name.Name)
        if err != nil {
            utilruntime.HandleError(fmt.Errorf("Error while getting service %s from cache: %v", name.String(), err))
            return
        }
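For context, the usual glog fix for hot-path messages like these is to gate them behind a higher verbosity level so they only show up when the process runs with --loglevel (glog -v) at or above that level. A minimal, self-contained sketch, not the actual kubelet code (the function and the container/pod names are made up for illustration):

    package main

    import (
        "flag"

        "github.com/golang/glog"
    )

    // checkBackoff stands in for a hot code path like kuberuntime_manager.go:758.
    func checkBackoff(container, pod string) {
        // Unconditional Infof is emitted at every verbosity, which is what floods journald:
        //   glog.Infof("checking backoff for container %q in pod %q", container, pod)

        // Gated variant: only emitted when verbosity is 4 or higher.
        glog.V(4).Infof("checking backoff for container %q in pod %q", container, pod)
    }

    func main() {
        flag.Parse() // glog registers its flags (-v, -logtostderr, ...) on the default flag set
        checkBackoff("registry", "default/docker-registry-1-deploy")
        glog.Flush()
    }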
For those watching at home, using:

    awk '{print $4}' atomic-openshift-node.log | sort | uniq -c | sort -k 1 -n -r

will give you the list sorted from most frequent to least.
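If you also want counts of the exact messages that repeat verbatim (rather than per call site), something along these lines should work, since glog puts the message text after the first four fields (a hedged variation on the above, not from the original capture, and it only collapses messages that are byte-for-byte identical, so messages that embed pod names still show up separately):

    sort -k 5 atomic-openshift-node.log | uniq -c -f 4 | sort -k 1 -n -r | head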
On the master side, from a 30-minute sample:

      6 authentication.go:64]
      3 generator.go:325]
    480 generator.go:672]
   2153 get.go:238]
    242 logs.go:41]
    382 rbac.go:116]
     40 reflector.go:341]
    291 trace.go:76]
    128 watcher.go:208]
   5910 watcher.go:235]
   4710 watcher.go:341]

get.go:238

    glog.V(2).Infof("Starting watch for %s, rv=%s labels=%s fields=%s timeout=%s", req.URL.Path, opts.ResourceVersion, opts.LabelSelector, opts.FieldSelector, timeout)

watcher.go:235

    if len(wc.resultChan) == outgoingBufSize {
        glog.Warningf("Fast watcher, slow processing. Number of buffered events: %d."+
            "Probably caused by slow dispatching events to watchers", outgoingBufSize)
    }

watcher.go:341

    if len(wc.incomingEventChan) == incomingBufSize {
        glog.Warningf("Fast watcher, slow processing. Number of buffered events: %d."+
            "Probably caused by slow decoding, user not receiving fast, or other processing logic", incomingBufSize)
    }
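For anyone wanting to reproduce that per-call-site breakdown on a master, the same capture-and-count steps used on the node apply; assuming the API server logs under the atomic-openshift-master-api syslog identifier (worth double-checking on these clusters), something like:

    journalctl --output=cat --identifier=atomic-openshift-master-api --since="30 min ago" > ./atomic-openshift-master-api.log
    awk '{print $4}' atomic-openshift-master-api.log | sort | uniq -c | sort -k 1 -n -r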
controllers:

      2 buildconfig_controller.go:139]
    480 buildconfig_controller.go:149]
      2 buildconfig_controller.go:243]
     30 buildconfig_controller.go:254]
      1 build_controller.go:1122]
      2 build_controller.go:1136]
      2 build_controller.go:1284]
     21 deploymentconfig_controller.go:496]
      1 end
   3455 event.go:218]
   2100 factory.go:1158]
   2100 factory.go:1251]
   2640 horizontal.go:189]
     82 reflector.go:341]
   2100 scheduler.go:191]
      2 vnids.go:123]
      2 vnids.go:145]

Avesh, it appears the scheduler logs three times for every scheduling error:

factory.go:1158

    } else {
        glog.Errorf("Error scheduling %v %v: %v; retrying", pod.Namespace, pod.Name, err)
    }

factory.go:1251

    func (p *podConditionUpdater) Update(pod *v1.Pod, condition *v1.PodCondition) error {
        glog.V(2).Infof("Updating pod condition for %s/%s to (%s==%s)", pod.Namespace, pod.Name, condition.Type, condition.Status)

scheduler.go:191

Solly, it looks like the HPA is failing on extensions DeploymentConfig lookups:

horizontal.go:189] unable to determine resource for scale target reference: no matches for extensions/, Kind=DeploymentConfig
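To eyeball the triple logging per scheduling failure, a capture of the controllers journal (here assumed to be saved as atomic-openshift-master-controllers.log, made the same way as the node capture above) can be filtered down to just those three call sites and read in the order the lines were emitted:

    grep -E "factory\.go:(1158|1251)|scheduler\.go:191" atomic-openshift-master-controllers.log | less -S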
See also: https://bugzilla.redhat.com/show_bug.cgi?id=1569137
Opened https://github.com/openshift/origin/pull/19970 for the apiserver outliers.
Avesh, can you look at deduplicating and/or raising the verbosity level on the scheduler log output in https://bugzilla.redhat.com/show_bug.cgi?id=1555057#c9

Seth, can you look at deduplicating and/or raising the verbosity level on the node log output in https://bugzilla.redhat.com/show_bug.cgi?id=1555057#c6
Will take a look tomorrow. Thanks.
Kube PR for https://bugzilla.redhat.com/show_bug.cgi?id=1555057#c6: https://github.com/kubernetes/kubernetes/pull/65065
https://github.com/openshift/origin/pull/19970 is merged; the pod and scheduler teams own the remaining outliers.
PR to kubernetes: https://github.com/kubernetes/kubernetes/pull/65188
Origin PR for kubelet/scheduler changes: https://github.com/openshift/origin/pull/20661
The Origin PR expired due to inactivity; retargeting for 3.11. This is fixed in all versions of 4.x. New Origin PR: https://github.com/openshift/origin/pull/23182