Bug 1659858 - symbolic links go missing from /var/log/pods - Error "Unexpected fsnotify event"
Summary: symbolic links go missing from /var/log/pods - Error "Unexpected fsnotify ev...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.10.z
Assignee: Rich Megginson
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-12-17 03:53 UTC by Venkata Tadimarri
Modified: 2023-10-06 18:02 UTC
10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-31 16:50:06 UTC
Target Upstream Version:
Embargoed:



Description Venkata Tadimarri 2018-12-17 03:53:14 UTC
Description of problem:

The customer is facing a problem where fluentd is unable to read the logs because the symlinks are missing, which possibly results in the following errors:

Nov 13 14:31:09 server atomic-openshift-node: E1113 14:31:09.000898 28872 logs.go:368] Unexpected fsnotify event: "/var/log/pods/dc8b8bb6-e13f-11e8-aa0f-001a4a409202/mule_0.log": RENAME, retrying...
Nov 13 14:31:09 server atomic-openshift-node: E1113 14:31:09.000898 28872 logs.go:368] Unexpected fsnotify event: "/var/log/pods/dc8b8bb6-e13f-11e8-aa0f-001a4a409202/mule_0.log": RENAME, retrying...
Nov 13 14:31:13 server atomic-openshift-node: E1113 14:31:13.819755 28872 logs.go:368] Unexpected fsnotify event: "/var/log/pods/dc8b8bb6-e13f-11e8-aa0f-001a4a409202/mule_0.log": RENAME, retrying...
Nov 13 14:31:13 server atomic-openshift-node: E1113 14:31:13.819755 28872 logs.go:368] Unexpected fsnotify event: "/var/log/pods/dc8b8bb6-e13f-11e8-aa0f-001a4a409202/mule_0.log": RENAME, retrying...
Nov 13 14:31:17 server atomic-openshift-node: E1113 14:31:17.239105 28872 logs.go:368] Unexpected fsnotify event: "/var/log/pods/dc8b8bb6-e13f-11e8-aa0f-001a4a409202/mule_0.log": CHMOD, retrying...
Nov 13 14:31:17 server atomic-openshift-node: E1113 14:31:17.239105 28872 logs.go:368] Unexpected fsnotify event: "/var/log/pods/dc8b8bb6-e13f-11e8-aa0f-001a4a409202/mule_0.log": CHMOD, retrying...


The issue seems to be closely related to the one described in the following upstream discussion, and it is causing the symbolic links to disappear from the OpenShift nodes on a regular basis:
https://github.com/kubernetes/kubernetes/issues/52172


Version-Release number of selected component (if applicable):

atomic-openshift-3.10.72-1.git.0.3cb2fdc.el7.x86_64

Comment 7 Rich Megginson 2019-01-03 18:48:18 UTC
Fluentd reads from /var/log/containers/*.log. It does not know and should not care that /var/log/containers/*.log are symlinks to /var/log/pods/*/*.log, which in turn are symlinks to /var/lib/docker/containers/*/*.log. It has worked fine since day 1. Not sure why this is a problem now.
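For illustration only (the pod UID, pod name, namespace, and container ID below are made up), that symlink chain can be inspected on a node like this:

    # readlink /var/log/containers/mule-1-xyz_myproject_mule-<container-id>.log
    /var/log/pods/<pod-uid>/mule_0.log
    # readlink /var/log/pods/<pod-uid>/mule_0.log
    /var/lib/docker/containers/<container-id>/<container-id>-json.log

If either hop in that chain is missing, fluentd cannot follow /var/log/containers/*.log through to the real log file, which matches the symptom reported above.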

There are no pertinent errors in the fluentd logs, but of course there are many errors in the atomic-openshift-node logs.  Are you saying that messages such as this are benign and do not indicate that fluentd should have a problem reading the logs?

Dec 11 17:02:09 prlonr34.nz.thenational.com atomic-openshift-node[114311]: E1211 17:02:09.445383  114311 logs.go:387] Unexpected fsnotify event: "/var/log/pods/bb05d500-fcf8-11e8-88cc-001a4a408f4f/sti-build/0.log": RENAME, retrying..

> Sending to Logging as this flow is bypassing the kubelet i.e. fluent is
> reading the raw container logs on the host from the container runtime.

Can I infer from this statement that fluentd should be doing something other than reading the raw container logs on the host from the container runtime?  If so, what should it be doing?

>It
> seems that fluent is not handling the log rotation done by the container
> runtime.

Fluentd does have log rotation handling code.
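For reference, the container logs are picked up with fluentd's in_tail plugin, which tracks its read position (path, offset, inode) in a pos_file and thereby follows files across rotation. A minimal sketch, not the exact shipped configuration:

    <source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/es-containers.log.pos
      tag kubernetes.*
      format json
      read_from_head true
      rotate_wait 5
    </source>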

>Or something is messed up with the customers splunk connector.

We can try to debug what is happening with fluentd, to at least see if fluentd is reading the records.

1) oc project logging # or openshift-logging if logging is not present
2) oc edit cm logging-fluentd

Comment out this line:

    @include configs.d/openshift/system.conf

Add these lines just below:

    <system>
      @log_level trace
    </system>

Find these lines:

    <label @INGRESS>
    ## filters

Add this just below:

      <filter kubernetes.**>
        @type stdout
      </filter>

3) oc set env ds/logging-fluentd VERBOSE=true DEBUG=true

This will cause all of the fluentd pods in the cluster to restart and write their log files to /var/log/fluentd.log (or /var/log/fluentd/fluentd.log, depending on what version you are running).

Then we can see if/when fluentd stops reading and hopefully why.
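To make steps 1-3 concrete, after the edits the relevant part of the logging-fluentd configmap's fluent.conf should look roughly like this (a sketch; the surrounding lines in your configmap will differ):

    # @include configs.d/openshift/system.conf
    <system>
      @log_level trace
    </system>
    ...
    <label @INGRESS>
    ## filters
      <filter kubernetes.**>
        @type stdout
      </filter>
      ...
    </label>

Once the logging-fluentd pods have restarted (watch with oc get pods -w in the logging project), the extra output should show up in the fluentd log file mentioned above, making it possible to see whether records are actually being read.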

