Description of problem: While running an OpenShift logging test, one of the pod log indices showed missing data. Investigating on the node where the pod was running showed that the subdirectory for the pod logs in /var/log/pods existed and contained the correct logs. However the expected symlink in /var/log/containers referring to the current pod log in /var/log/pods was missing. Apparently, when the log wrapped the symlink did not get recreated. In this test there were 20 pods logging at a rate of 67 1KB messages/second. This is a rate that has been handled easily in the past. I will link the location of a tar of /var/log/pods and /var/log/containers + the journal from the node. Let me know what else is needed. The /var/log/pods directory with a missing entry is logtest8_centos-logtest-hzjbb_51b32576-1ef4-4c41-a8ea-44ceab560f1c I verified the pod is still running at the time the symlink was missing. Marking the severity high since following /var/log/containers is how OCP logging picks up messages to index. Version-Release number of selected component (if applicable): 4.4.0-0.nightly-2020-03-17-071839 How reproducible: Frequently when logging at this rate. I've reproduced it three times. Steps to Reproduce: 1. Run at least 20 pods on a node which are logging at a rate of 67+ 1KB messages/second Hopefully that does it. My actual environment is much more complex with a 3 node ES cluster and 30 nodes logging messages at this rate. But, on a single node I would hope that this reproduces it. Actual results: 1 or more current pod logs is not linked from /var/log/containers.
Peter could you take a look at this please?
There was an internal thread 'OCP 4.3 Kublet not creating log symlinks' I checked containerLogManager#rotateLog which only logs when there is an error. kuberuntime_gc.go only logs when removing symlink encounters error. I was thinking of the following addition: diff --git a/pkg/kubelet/kuberuntime/kuberuntime_gc.go b/pkg/kubelet/kuberuntime/kuberuntime_gc.go index b139dba9272..47c061d4331 100644 --- a/pkg/kubelet/kuberuntime/kuberuntime_gc.go +++ b/pkg/kubelet/kuberuntime/kuberuntime_gc.go @@ -359,6 +359,8 @@ func (cgc *containerGC) evictPodLogsDirectories(allSourcesReady bool) error { err := osInterface.Remove(logSymlink) if err != nil { klog.Errorf("Failed to remove container log dead symlink %q: %v", logSymlink, err) + } else { + klog.V(4).Infof("removed symlink of container log %q", logSymlink) } } }
There was only one entry in node-journal.log w.r.t. the missing symlink: Mar 18 19:45:17 ip-10-0-165-221 hyperkube[1265]: I0318 19:45:17.884849 1265 logs.go:309] Finish parsing log file "/var/log/pods/logtest8_centos-logtest-hzjbb_51b32576-1ef4-4c41-a8ea-44ceab560f1c/centos-logtest/0.log" The following command showed logtest8_centos-logtest-hzjbb_51b32576-1ef4-4c41-a8ea-44ceab560f1c having some files : tar ztvf var_log_pods_var_log_containers.tgz | grep '_51b32576-1ef4-4c41-a8ea-44ceab560f1c' drwxr-xr-x 0 root root 0 Mar 18 12:26 var/log/pods/logtest8_centos-logtest-hzjbb_51b32576-1ef4-4c41-a8ea-44ceab560f1c/centos-logtest/ -rw-r--r-- 0 root root 700824 Mar 18 12:26 var/log/pods/logtest8_centos-logtest-hzjbb_51b32576-1ef4-4c41-a8ea-44ceab560f1c/centos-logtest/0.log.20200318-191433.gz Likely the removal happened after 19:45:17
The component of this bug should be Node, I'm having trouble changing it
Changing component to Node per https://bugzilla.redhat.com/show_bug.cgi?id=1814859#c5
I searched node-journal.log from the bundle, looking for entry from runtime_oci.go (related to ReopenContainerLog). There was no entry from runtime_oci.go
PR is under review from random-liu and Dawn.
Waiting for random-liu to respond to my analysis: https://github.com/kubernetes/kubernetes/pull/89160#issuecomment-606817782
I want to see if it is feasible to add the following info log to openshift: diff --git a/pkg/kubelet/kuberuntime/kuberuntime_gc.go b/pkg/kubelet/kuberuntime/kuberuntime_gc.go index b139dba9272..1f89651a2c5 100644 --- a/pkg/kubelet/kuberuntime/kuberuntime_gc.go +++ b/pkg/kubelet/kuberuntime/kuberuntime_gc.go @@ -357,6 +357,7 @@ func (cgc *containerGC) evictPodLogsDirectories(allSourcesReady bool) error { for _, logSymlink := range logSymlinks { if _, err := osInterface.Stat(logSymlink); os.IsNotExist(err) { err := osInterface.Remove(logSymlink) + klog.Info("removed symlink %s", logSymlink) if err != nil { klog.Errorf("Failed to remove container log dead symlink %q: %v", logSymlink, err) } So that we can search for relevant information the next time missing symlink is discovered.
By shortening GC period and making parsedMaxSize small (1K), I was able to see the following in kubelet-gc-race.log (two occurrences): Apr 07 22:45:37 kind-worker kubelet[717]: I0407 22:45:37.611200 717 kuberuntime_gc.go:362] container still exists, not removing symlink: /var/log/containers/busybox-e8f88ccb-0a3f-4e40-af4a-cb504655c17d_container-probe-5751_busybox-3fa2103324039b46844069221408013eee39f4801c9032072dc187bc57700449.log ... Apr 07 22:46:02 kind-worker kubelet[717]: I0407 22:46:02.672998 717 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: 3fa2103324039b46844069221408013eee39f4801c9032072dc187bc57700449 We can see that the actual removal of the container was 25 seconds after the check of the symlink. Without the help of cgc.client.ContainerStats(), the symlink would have been removed at 22:45:37
PR has been updated to address latest round of reviews from Random-Liu.
*** This bug has been marked as a duplicate of bug 1823406 ***