Bug 1814859

Summary: symlink lost from /var/log/containers log subdirectory to corresponding /var/log/pods pod log
Product: OpenShift Container Platform
Reporter: Mike Fiedler <mifiedle>
Component: Node
Assignee: Ted Yu <zyu>
Status: CLOSED DUPLICATE
QA Contact: Sunil Choudhary <schoudha>
Severity: high
Priority: unspecified
Version: 4.4
CC: aos-bugs, dwalsh, jokerman, rphillips, tsweeney, zyu
Keywords: Reopened
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2020-05-26 14:26:58 UTC
Type: Bug

Description Mike Fiedler 2020-03-18 20:26:54 UTC
Description of problem:

While running an OpenShift logging test, one of the pod log indices showed missing data.  Investigating on the node where the pod was running showed that the subdirectory for the pod logs in /var/log/pods existed and contained the correct logs.   However the expected symlink in /var/log/containers referring to the current pod log in /var/log/pods was missing.

Apparently, when the log rotated, the symlink did not get recreated.

In this test there were 20 pods logging at a rate of 67 1KB messages/second.   This is a rate that has been handled easily in the past.

I will link the location of a tar of /var/log/pods and /var/log/containers + the journal from the node.  Let me know what else is needed.

The /var/log/pods directory with a missing entry is logtest8_centos-logtest-hzjbb_51b32576-1ef4-4c41-a8ea-44ceab560f1c

I verified the pod was still running at the time the symlink was missing.

Marking the severity high since following /var/log/containers is how OCP logging picks up messages to index.

Version-Release number of selected component (if applicable): 4.4.0-0.nightly-2020-03-17-071839


How reproducible: Frequently when logging at this rate.  I've reproduced it three times.  


Steps to Reproduce:
1. Run at least 20 pods on a node which are logging at a rate of 67+ 1KB messages/second

Hopefully that does it.  My actual environment is much more complex, with a 3-node ES cluster and 30 nodes logging messages at this rate.  But I would hope that a single node reproduces it.

Actual results:

One or more current pod logs are not linked from /var/log/containers.

Comment 2 Tom Sweeney 2020-03-19 19:39:16 UTC
Peter could you take a look at this please?

Comment 3 Ted Yu 2020-03-20 17:33:19 UTC
There was an internal thread, 'OCP 4.3 Kublet not creating log symlinks'.

I checked containerLogManager#rotateLog, which only logs when there is an error.

kuberuntime_gc.go only logs when removing a symlink encounters an error. I was thinking of the following addition:

diff --git a/pkg/kubelet/kuberuntime/kuberuntime_gc.go b/pkg/kubelet/kuberuntime/kuberuntime_gc.go
index b139dba9272..47c061d4331 100644
--- a/pkg/kubelet/kuberuntime/kuberuntime_gc.go
+++ b/pkg/kubelet/kuberuntime/kuberuntime_gc.go
@@ -359,6 +359,8 @@ func (cgc *containerGC) evictPodLogsDirectories(allSourcesReady bool) error {
                        err := osInterface.Remove(logSymlink)
                        if err != nil {
                                klog.Errorf("Failed to remove container log dead symlink %q: %v", logSymlink, err)
+                       } else {
+                               klog.V(4).Infof("removed symlink of container log %q", logSymlink)
                        }
                }
        }

Comment 4 Ted Yu 2020-03-23 15:04:46 UTC
There was only one entry in node-journal.log w.r.t. the missing symlink:

Mar 18 19:45:17 ip-10-0-165-221 hyperkube[1265]: I0318 19:45:17.884849    1265 logs.go:309] Finish parsing log file "/var/log/pods/logtest8_centos-logtest-hzjbb_51b32576-1ef4-4c41-a8ea-44ceab560f1c/centos-logtest/0.log"

The following command showed logtest8_centos-logtest-hzjbb_51b32576-1ef4-4c41-a8ea-44ceab560f1c having some files:

tar ztvf var_log_pods_var_log_containers.tgz | grep '_51b32576-1ef4-4c41-a8ea-44ceab560f1c'

drwxr-xr-x  0 root   root        0 Mar 18 12:26 var/log/pods/logtest8_centos-logtest-hzjbb_51b32576-1ef4-4c41-a8ea-44ceab560f1c/centos-logtest/
-rw-r--r--  0 root   root   700824 Mar 18 12:26 var/log/pods/logtest8_centos-logtest-hzjbb_51b32576-1ef4-4c41-a8ea-44ceab560f1c/centos-logtest/0.log.20200318-191433.gz

The removal likely happened after 19:45:17.

Comment 5 Peter Hunt 2020-03-23 15:19:07 UTC
The component of this bug should be Node; I'm having trouble changing it.

Comment 6 Tom Sweeney 2020-03-23 21:22:43 UTC
Changing component to Node per https://bugzilla.redhat.com/show_bug.cgi?id=1814859#c5

Comment 9 Ted Yu 2020-03-25 19:38:26 UTC
I searched node-journal.log from the bundle, looking for entries from runtime_oci.go (related to ReopenContainerLog).
There were no entries from runtime_oci.go.

Comment 10 Ted Yu 2020-03-31 19:53:18 UTC
The PR is under review by random-liu and Dawn.

Comment 11 Ted Yu 2020-04-06 23:01:45 UTC
Waiting for random-liu to respond to my analysis:

https://github.com/kubernetes/kubernetes/pull/89160#issuecomment-606817782

Comment 12 Ted Yu 2020-04-07 21:15:33 UTC
I want to see if it is feasible to add the following info log to OpenShift:

diff --git a/pkg/kubelet/kuberuntime/kuberuntime_gc.go b/pkg/kubelet/kuberuntime/kuberuntime_gc.go
index b139dba9272..1f89651a2c5 100644
--- a/pkg/kubelet/kuberuntime/kuberuntime_gc.go
+++ b/pkg/kubelet/kuberuntime/kuberuntime_gc.go
@@ -357,6 +357,7 @@ func (cgc *containerGC) evictPodLogsDirectories(allSourcesReady bool) error {
        for _, logSymlink := range logSymlinks {
                if _, err := osInterface.Stat(logSymlink); os.IsNotExist(err) {
                        err := osInterface.Remove(logSymlink)
+                       klog.Infof("removed symlink %s", logSymlink)
                        if err != nil {
                                klog.Errorf("Failed to remove container log dead symlink %q: %v", logSymlink, err)
                        }

That way we can search for relevant information the next time a missing symlink is discovered.

Comment 13 Ted Yu 2020-04-08 14:44:48 UTC
By shortening the GC period and making parsedMaxSize small (1K), I was able to see the following in kubelet-gc-race.log (two occurrences):

Apr 07 22:45:37 kind-worker kubelet[717]: I0407 22:45:37.611200     717 kuberuntime_gc.go:362] container still exists, not removing symlink: /var/log/containers/busybox-e8f88ccb-0a3f-4e40-af4a-cb504655c17d_container-probe-5751_busybox-3fa2103324039b46844069221408013eee39f4801c9032072dc187bc57700449.log
...
Apr 07 22:46:02 kind-worker kubelet[717]: I0407 22:46:02.672998     717 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: 3fa2103324039b46844069221408013eee39f4801c9032072dc187bc57700449

We can see that the actual removal of the container happened 25 seconds after the symlink check.

Without the cgc.client.ContainerStats() check, the symlink would have been removed at 22:45:37.
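The guard behind the "container still exists, not removing symlink" message above can be sketched as follows. All names here are hypothetical (the real change lives in kuberuntime_gc.go and consults the runtime through cgc.client); the idea is that kubelet symlink names end in "-<containerID>.log", so the GC can recover the ID and only delete the link when that container is truly gone.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// shouldRemoveSymlink is a sketch of the race guard: recover the container
// ID from the symlink name and ask the runtime (abstracted here as a
// callback) whether the container still exists before removing the link.
func shouldRemoveSymlink(logSymlink string, containerExists func(id string) bool) bool {
	base := strings.TrimSuffix(filepath.Base(logSymlink), ".log")
	i := strings.LastIndex(base, "-")
	if i < 0 || i+1 >= len(base) {
		return true // unparsable name: treat as a dead link and remove it
	}
	return !containerExists(base[i+1:])
}

func main() {
	// Toy runtime state: one live container ID.
	live := map[string]bool{"3fa2103324039b": true}
	exists := func(id string) bool { return live[id] }
	for _, link := range []string{
		"/var/log/containers/busybox_probe_c-3fa2103324039b.log", // container alive
		"/var/log/containers/busybox_probe_c-deadbeef.log",       // container gone
	} {
		fmt.Printf("%s -> remove=%v\n", link, shouldRemoveSymlink(link, exists))
	}
}
```

With this check in place, the symlink in the Comment 13 trace survives the 25-second window between the Stat check and the container's actual removal.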

Comment 14 Ted Yu 2020-04-14 23:10:50 UTC
The PR has been updated to address the latest round of reviews from Random-Liu.

Comment 17 Ryan Phillips 2020-05-26 14:25:43 UTC

*** This bug has been marked as a duplicate of bug 1823406 ***

Comment 18 Ryan Phillips 2020-05-26 14:26:58 UTC

*** This bug has been marked as a duplicate of bug 1823406 ***