Bug 1814859
Summary: | symlink lost from /var/log/containers log subdirectory to corresponding /var/log/pods pod log | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Mike Fiedler <mifiedle> |
Component: | Node | Assignee: | Ted Yu <zyu> |
Status: | CLOSED DUPLICATE | QA Contact: | Sunil Choudhary <schoudha> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | 4.4 | CC: | aos-bugs, dwalsh, jokerman, rphillips, tsweeney, zyu |
Target Milestone: | --- | Keywords: | Reopened |
Target Release: | 4.5.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2020-05-26 14:26:58 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Mike Fiedler
2020-03-18 20:26:54 UTC
Peter could you take a look at this please? There was an internal thread 'OCP 4.3 Kublet not creating log symlinks' I checked containerLogManager#rotateLog which only logs when there is an error. kuberuntime_gc.go only logs when removing symlink encounters error. I was thinking of the following addition: diff --git a/pkg/kubelet/kuberuntime/kuberuntime_gc.go b/pkg/kubelet/kuberuntime/kuberuntime_gc.go index b139dba9272..47c061d4331 100644 --- a/pkg/kubelet/kuberuntime/kuberuntime_gc.go +++ b/pkg/kubelet/kuberuntime/kuberuntime_gc.go @@ -359,6 +359,8 @@ func (cgc *containerGC) evictPodLogsDirectories(allSourcesReady bool) error { err := osInterface.Remove(logSymlink) if err != nil { klog.Errorf("Failed to remove container log dead symlink %q: %v", logSymlink, err) + } else { + klog.V(4).Infof("removed symlink of container log %q", logSymlink) } } } There was only one entry in node-journal.log w.r.t. the missing symlink: Mar 18 19:45:17 ip-10-0-165-221 hyperkube[1265]: I0318 19:45:17.884849 1265 logs.go:309] Finish parsing log file "/var/log/pods/logtest8_centos-logtest-hzjbb_51b32576-1ef4-4c41-a8ea-44ceab560f1c/centos-logtest/0.log" The following command showed logtest8_centos-logtest-hzjbb_51b32576-1ef4-4c41-a8ea-44ceab560f1c having some files : tar ztvf var_log_pods_var_log_containers.tgz | grep '_51b32576-1ef4-4c41-a8ea-44ceab560f1c' drwxr-xr-x 0 root root 0 Mar 18 12:26 var/log/pods/logtest8_centos-logtest-hzjbb_51b32576-1ef4-4c41-a8ea-44ceab560f1c/centos-logtest/ -rw-r--r-- 0 root root 700824 Mar 18 12:26 var/log/pods/logtest8_centos-logtest-hzjbb_51b32576-1ef4-4c41-a8ea-44ceab560f1c/centos-logtest/0.log.20200318-191433.gz Likely the removal happened after 19:45:17 The component of this bug should be Node, I'm having trouble changing it Changing component to Node per https://bugzilla.redhat.com/show_bug.cgi?id=1814859#c5 I searched node-journal.log from the bundle, looking for entry from runtime_oci.go (related to ReopenContainerLog). There was no entry from runtime_oci.go PR is under review from random-liu and Dawn. Waiting for random-liu to respond to my analysis: https://github.com/kubernetes/kubernetes/pull/89160#issuecomment-606817782 I want to see if it is feasible to add the following info log to openshift: diff --git a/pkg/kubelet/kuberuntime/kuberuntime_gc.go b/pkg/kubelet/kuberuntime/kuberuntime_gc.go index b139dba9272..1f89651a2c5 100644 --- a/pkg/kubelet/kuberuntime/kuberuntime_gc.go +++ b/pkg/kubelet/kuberuntime/kuberuntime_gc.go @@ -357,6 +357,7 @@ func (cgc *containerGC) evictPodLogsDirectories(allSourcesReady bool) error { for _, logSymlink := range logSymlinks { if _, err := osInterface.Stat(logSymlink); os.IsNotExist(err) { err := osInterface.Remove(logSymlink) + klog.Info("removed symlink %s", logSymlink) if err != nil { klog.Errorf("Failed to remove container log dead symlink %q: %v", logSymlink, err) } So that we can search for relevant information the next time missing symlink is discovered. By shortening GC period and making parsedMaxSize small (1K), I was able to see the following in kubelet-gc-race.log (two occurrences): Apr 07 22:45:37 kind-worker kubelet[717]: I0407 22:45:37.611200 717 kuberuntime_gc.go:362] container still exists, not removing symlink: /var/log/containers/busybox-e8f88ccb-0a3f-4e40-af4a-cb504655c17d_container-probe-5751_busybox-3fa2103324039b46844069221408013eee39f4801c9032072dc187bc57700449.log ... Apr 07 22:46:02 kind-worker kubelet[717]: I0407 22:46:02.672998 717 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: 3fa2103324039b46844069221408013eee39f4801c9032072dc187bc57700449 We can see that the actual removal of the container was 25 seconds after the check of the symlink. Without the help of cgc.client.ContainerStats(), the symlink would have been removed at 22:45:37 PR has been updated to address latest round of reviews from Random-Liu. *** This bug has been marked as a duplicate of bug 1823406 *** *** This bug has been marked as a duplicate of bug 1823406 *** |