Created attachment 1154491 [details]
lsof

One of our long-running OSE nodes started to exhibit very strange behaviour when we tried to run 'docker logs' or 'oc logs'. When we tried to execute 'systemctl restart docker' we got the error message "Too many open files".

lsof showed us:
Total files - 41187
OpenShift files - 37761
  - named /var/log/messages* - 16740
  - inotify - 15660

At the time of the investigation there were 3 pods running.

After an OpenShift restart the numbers were:
Total files - 5102
OpenShift files - 451

The lsof output is attached.
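For reference, a rough Go sketch (not part of the original report; the PID is a placeholder) of how these lsof totals can be cross-checked by reading /proc/<pid>/fd directly and classifying the file descriptors:

// fdcount.go - rough sketch, assumes it runs as root or as the owner of the
// target process; point the placeholder PID at atomic-openshift-node.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	pid := "1234" // hypothetical PID of the OpenShift node process
	fdDir := filepath.Join("/proc", pid, "fd")

	entries, err := os.ReadDir(fdDir)
	if err != nil {
		fmt.Fprintln(os.Stderr, "reading fd dir:", err)
		os.Exit(1)
	}

	var total, inotify, deletedLogs int
	for _, e := range entries {
		target, err := os.Readlink(filepath.Join(fdDir, e.Name()))
		if err != nil {
			continue // the fd may have been closed while we iterated
		}
		total++
		switch {
		case strings.HasPrefix(target, "anon_inode:inotify"):
			inotify++
		case strings.Contains(target, "/var/log/messages") &&
			strings.HasSuffix(target, "(deleted)"):
			deletedLogs++
		}
	}
	fmt.Printf("total=%d inotify=%d deleted /var/log/messages*=%d\n",
		total, inotify, deletedLogs)
}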
Hi Jiri, does "OpenShift restart" mean restarting the nodes or the services? And how long had the node been running?
I've just restarted the atomic-openshift-node service. Unfortunately I cannot tell you how long the node had been running; as part of cluster healing we updated and restarted all nodes in the cluster. My guess is between 2 and 4 weeks.
This is either a partial or full duplicate of bug 1331235.
(In reply to Andy Goldstein from comment #3)
> This is either a partial or full duplicate of bug 1331235

This is partially related to the bug you referenced (you can see a lot of references to logrotated /var/log/messages), but there are also a lot of open inotify handles. The problem we're facing is more subtle: disk space is free and there is no trace of an error, except that when you call 'docker logs' or 'oc logs' you get only part of the log. Deeper investigation was necessary to uncover the depleted file descriptors.
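As a minimal sketch of catching that depletion earlier (assuming it runs inside or as the process of interest; for another PID you would parse /proc/<pid>/limits instead of calling getrlimit), one can compare the open-FD count against the RLIMIT_NOFILE soft limit:

// fdpressure.go - minimal sketch; inspects the current process only.
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	// Count this process's open FDs via /proc/self/fd.
	entries, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Soft limit for open files; hitting it produces "Too many open files".
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	used := len(entries)
	fmt.Printf("open FDs: %d / soft limit %d (%.1f%% used)\n",
		used, lim.Cur, float64(used)/float64(lim.Cur)*100)
}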
Ok, I'm picking this one up again. The only inotify user I can find in the OpenShift node is cadvisor. It appears to set watches on the container subdirectories and the cgroup subdirectories. There could be a leak there, or there might just be a ton of stuff that it is watching (probably more than it needs to). It looks like this code may have been reworked upstream. Still looking.
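For illustration only (this is not cadvisor's actual code, and /sys/fs/cgroup is just an assumed root), the pattern described here looks roughly like this: a single inotify FD with one watch per subdirectory, so the watch count tracks the number of container/cgroup directories rather than leaking outright:

// watchtree.go - illustrative sketch of recursive inotify watches.
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

func main() {
	root := "/sys/fs/cgroup" // assumed root; cadvisor watches container/cgroup dirs

	ifd, err := unix.InotifyInit1(0)
	if err != nil {
		fmt.Fprintln(os.Stderr, "inotify_init:", err)
		os.Exit(1)
	}
	defer unix.Close(ifd)

	watches := 0
	err = filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil || !d.IsDir() {
			return nil // skip unreadable entries and plain files
		}
		// One watch descriptor per directory; all of them share the single
		// inotify FD, matching the lsof picture discussed in this bug.
		if _, werr := unix.InotifyAddWatch(ifd, path,
			unix.IN_CREATE|unix.IN_DELETE); werr == nil {
			watches++
		}
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
	fmt.Printf("added %d watches on one inotify fd under %s\n", watches, root)
}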
After taking a crash course in lsof output: there are not 15660 unique inotify watches happening; there are 116 watches shared by 135 threads managed by the Go runtime. That seems reasonable to me. Overall the OpenShift threads share 279 open FDs, with 51 thread-local FDs, totaling 330 unique FDs.

The number of shared FDs to deleted log files:
13 /var/log/messages-20160426
18 /var/log/messages-20160427
15 /var/log/messages-20160428
15 /var/log/messages-20160429
6 /var/log/messages-20160430
7 /var/log/messages-20160501
7 /var/log/messages-20160502
8 /var/log/messages-20160503
11 /var/log/messages-20160504
7 /var/log/messages-20160505

Of the 279 shared FDs, 107 point to 10 deleted log files of ~512MB each, monopolizing ~5GB of disk space. There are 17 shared FDs open for the non-deleted log file. I'm not sure why there are between 7 and 18 FDs for each log file.

In summary, 107 + 17 = 124 of the 279 shared FDs are log files:
124 log file FDs (107 to deleted files)
116 inotify
39 other (libraries, text, cwd, sockets, etc.)

My guess is that the escalating number of Go runtime threads is due to goroutines blocking on deleted log files. If there is a goroutine for each open FD on a deleted log file (107) and a thread per blocked goroutine, then 107 of the 135 threads are trying to read from deleted log files.

All that said, I don't think the number of inotify watches is the issue here, and this is a dup of bug 1331235.

Is this sufficient explanation?
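To make the ~5GB accounting concrete, here is a rough sketch (the PID is a placeholder, and it counts every deleted-file FD, not just the rotated /var/log/messages* ones) that finds FDs still pointing at unlinked files and sums the disk space they pin until the process exits:

// deletedfds.go - rough sketch of the deleted-log-file accounting above.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"syscall"
)

func main() {
	pid := "1234" // hypothetical node-process PID
	fdDir := filepath.Join("/proc", pid, "fd")

	entries, err := os.ReadDir(fdDir)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	var count int
	var pinned int64
	for _, e := range entries {
		link := filepath.Join(fdDir, e.Name())
		target, err := os.Readlink(link)
		if err != nil || !strings.HasSuffix(target, "(deleted)") {
			continue
		}
		// Stat the /proc fd link itself: it follows through to the open
		// inode, so we get the size of the unlinked file the FD keeps alive.
		var st syscall.Stat_t
		if err := syscall.Stat(link, &st); err == nil {
			count++
			pinned += st.Size
		}
	}
	fmt.Printf("%d FDs on deleted files pinning ~%d MB\n", count, pinned/(1<<20))
}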
Dup'ing. The OSE 3.3 rebase just happened, so we should be able to test soon whether the fix for 1331235 also fixes this.

*** This bug has been marked as a duplicate of bug 1331235 ***