Bug 1333663 - OpenShift node process keeps too many files open
Summary: OpenShift node process keeps too many files open
Keywords:
Status: CLOSED DUPLICATE of bug 1331235
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Seth Jennings
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-05-06 06:35 UTC by Jiri Pechanec
Modified: 2016-06-16 16:46 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-06-16 16:46:30 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
lsof (181.20 KB, application/x-gzip)
2016-05-06 06:35 UTC, Jiri Pechanec

Description Jiri Pechanec 2016-05-06 06:35:25 UTC
Created attachment 1154491 [details]
lsof

One of our long-running OSE nodes started to exhibit very strange behaviour when we tried to run docker logs or oc logs.

When we tried to execute systemctl restart docker we got the error message "Too many open files".

lsof showed us:
Total files - 41187
OpenShift files - 37761
 - named /var/log/messages* - 16740
 - inotify - 15660

At the time of the investigation there were 3 pods running.

After an OpenShift restart we got these numbers:
Total files - 5102
OpenShift files - 451

lsof result is attached
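
For reference, a minimal Go sketch (an assumption, not part of the original report) of how these counts can be reproduced from /proc, given the PID of the atomic-openshift-node process and sufficient privileges:

    // fdcount.go: tally a process's open FDs by target, similar to the
    // lsof summary above. Pass the PID of atomic-openshift-node as argv[1].
    package main

    import (
        "fmt"
        "os"
        "path/filepath"
        "strings"
    )

    func main() {
        fdDir := filepath.Join("/proc", os.Args[1], "fd")
        entries, err := os.ReadDir(fdDir)
        if err != nil {
            panic(err)
        }
        counts := map[string]int{}
        for _, e := range entries {
            target, err := os.Readlink(filepath.Join(fdDir, e.Name()))
            if err != nil {
                continue
            }
            switch {
            case strings.HasPrefix(target, "/var/log/messages"):
                counts["/var/log/messages*"]++
            case strings.Contains(target, "inotify"):
                // anon_inode:inotify entries
                counts["inotify"]++
            default:
                counts["other"]++
            }
        }
        fmt.Println("total FDs:", len(entries))
        for k, v := range counts {
            fmt.Printf("%-22s %d\n", k, v)
        }
    }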

Comment 1 Jan Chaloupka 2016-05-06 08:36:39 UTC
Hi Jiri,

Does "OpenShift restart" mean restarting the nodes or the services? For how long had the node been running?

Comment 2 Jiri Pechanec 2016-05-06 10:37:43 UTC
I've just restarted the atomic-openshift-node service.
Unfortunately I cannot tell you for how long the node had been running; as part of cluster healing we updated and restarted all nodes in the cluster, but my guess is between 2 and 4 weeks.

Comment 3 Andy Goldstein 2016-05-06 13:30:41 UTC
This is either a partial or full duplicate of bug 1331235.

Comment 4 Tomas Schlosser 2016-05-06 14:38:46 UTC
(In reply to Andy Goldstein from comment #3)
> This is either a partial or full duplicate of bug 1331235.

This is partially related to the bug you referenced (you can see a lot of references to rotated /var/log/messages files), but there are also a lot of open inotify handles.

The problem we're facing is more subtle: disk space is free and there is no trace of an error, except that when you call 'docker logs' or 'oc logs' you get only part of the log. Deeper investigation was needed to uncover the depleted file descriptors.

Comment 5 Seth Jennings 2016-06-09 16:23:10 UTC
Ok, I'm picking this one up again.

The only inotify user I can find in the OpenShift node is cadvisor. It seems to set watches on the container subdirectories and the cgroup subdirectories. There could be a leak there, or there might just be a ton of stuff that it is watching (probably more than it needs to).
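
To separate "many inotify instances" from "many watches on a few instances", one rough sketch (an assumption, not taken from the node code) is to parse /proc/<pid>/fdinfo, where each watch shows up as an "inotify wd:" line:

    // inotifywatches.go: count inotify instances and total watches for a PID
    // by scanning /proc/<pid>/fdinfo/<fd>; each "inotify wd:" line is a watch.
    package main

    import (
        "bufio"
        "fmt"
        "os"
        "path/filepath"
        "strings"
    )

    func main() {
        fdinfo := filepath.Join("/proc", os.Args[1], "fdinfo")
        entries, err := os.ReadDir(fdinfo)
        if err != nil {
            panic(err)
        }
        instances, watches := 0, 0
        for _, e := range entries {
            f, err := os.Open(filepath.Join(fdinfo, e.Name()))
            if err != nil {
                continue
            }
            isInotify := false
            sc := bufio.NewScanner(f)
            for sc.Scan() {
                if strings.HasPrefix(sc.Text(), "inotify wd:") {
                    isInotify = true
                    watches++
                }
            }
            f.Close()
            if isInotify {
                instances++
            }
        }
        fmt.Printf("inotify instances: %d, watches: %d\n", instances, watches)
    }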

It looks like this code may have been reworked upstream.  Still looking.

Comment 6 Seth Jennings 2016-06-14 02:57:56 UTC
After taking a crash course in lsof output: there are not 15660 unique inotify watches happening; there are 116 watches shared by 135 threads managed by the Go runtime. That seems reasonable to me.

Overall, the OpenShift threads share 279 open FDs, plus 51 thread-local FDs, totaling 330 unique FDs.
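
A quick way to sanity-check that, since lsof prints one row per task: compare the process-wide FD table under /proc/<pid>/fd with the thread list under /proc/<pid>/task. A small sketch, assuming the PID is passed as an argument:

    // uniquefds.go: threads in a process share one FD table, so lsof's
    // per-task rows overcount; count unique FDs and threads directly.
    package main

    import (
        "fmt"
        "os"
        "path/filepath"
    )

    func main() {
        pid := os.Args[1]
        fds, err := os.ReadDir(filepath.Join("/proc", pid, "fd"))
        if err != nil {
            panic(err)
        }
        tasks, err := os.ReadDir(filepath.Join("/proc", pid, "task"))
        if err != nil {
            panic(err)
        }
        fmt.Printf("unique FDs: %d, threads: %d (lsof rows are roughly FDs x threads)\n",
            len(fds), len(tasks))
    }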

The number of shared FDs to deleted log files:

     13 /var/log/messages-20160426
     18 /var/log/messages-20160427
     15 /var/log/messages-20160428
     15 /var/log/messages-20160429
      6 /var/log/messages-20160430
      7 /var/log/messages-20160501
      7 /var/log/messages-20160502
      8 /var/log/messages-20160503
     11 /var/log/messages-20160504
      7 /var/log/messages-20160505

Of the 279 shared FDs, 107 of them are to 10 deleted log files at ~512MB each, monopolizing ~5GB of disk space.
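
A short sketch of how one could confirm that from /proc (assumption: an FD whose /proc/<pid>/fd link ends in "(deleted)" still pins the inode, and a stat through that link reports the file's size):

    // deletedfds.go: list FDs still open on deleted files and sum the disk
    // space they keep allocated (deduplicated per file).
    package main

    import (
        "fmt"
        "os"
        "path/filepath"
        "strings"
    )

    func main() {
        fdDir := filepath.Join("/proc", os.Args[1], "fd")
        entries, err := os.ReadDir(fdDir)
        if err != nil {
            panic(err)
        }
        var total int64
        seen := map[string]bool{}
        for _, e := range entries {
            link := filepath.Join(fdDir, e.Name())
            target, err := os.Readlink(link)
            if err != nil || !strings.HasSuffix(target, "(deleted)") {
                continue
            }
            fmt.Printf("fd %s -> %s\n", e.Name(), target)
            if seen[target] {
                continue
            }
            seen[target] = true
            // Stat through the FD link: the inode and its blocks survive
            // until the last descriptor is closed.
            if fi, err := os.Stat(link); err == nil {
                total += fi.Size()
            }
        }
        fmt.Printf("space held by deleted files: %.1f MB\n", float64(total)/(1<<20))
    }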

There are 17 shared FDs open for the non-deleted log file.

I'm not sure why there are between 7 and 18 FDs for each log file.

In summary, 107 + 17 = 124 of the 279 shared FDs are log files.

124 log files (107 deleted)
116 inotify
39  other (libraries, text, cwd, sockets, etc)

My guess is that the escalating number of Go runtime threads is due to goroutines blocking on deleted log files. If there is a goroutine for each open FD on a deleted log file (107) and a thread per blocked goroutine, then 107 of the 135 threads are trying to read from deleted log files.
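
A toy sketch of that mechanism (an assumption for illustration, not taken from the node code): goroutines parked in a raw blocking syscall each keep an OS thread occupied, so the thread count grows with the number of blocked readers.

    // threadgrowth.go: goroutines blocked in a syscall each occupy an OS
    // thread; the thread count is read back from /proc/self/status.
    package main

    import (
        "fmt"
        "os"
        "strings"
        "syscall"
        "time"
    )

    func threads() string {
        data, _ := os.ReadFile("/proc/self/status")
        for _, line := range strings.Split(string(data), "\n") {
            if strings.HasPrefix(line, "Threads:") {
                return strings.TrimSpace(strings.TrimPrefix(line, "Threads:"))
            }
        }
        return "?"
    }

    func main() {
        fmt.Println("threads before:", threads())
        for i := 0; i < 100; i++ {
            go func() {
                // Block inside the kernel; the runtime hands the P to another
                // thread so remaining goroutines keep running.
                ts := syscall.Timespec{Sec: 60}
                syscall.Nanosleep(&ts, nil)
            }()
        }
        time.Sleep(2 * time.Second)
        fmt.Println("threads after:", threads())
    }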

All that said, I don't think the number of inotify watches is the issue here, and this is a dup of bug 1331235.

Is this sufficient explanation?

Comment 7 Seth Jennings 2016-06-16 16:46:30 UTC
Dup'ing. The OSE 3.3 rebase just happened, so we should be able to test soon whether the fix for bug 1331235 also fixes this.

*** This bug has been marked as a duplicate of bug 1331235 ***

