Bug 1537564 - Iterating over _SC_OPEN_MAX descriptors in doScriptExec() is causing serious performance regression when _SC_OPEN_MAX is high (e.g. Docker)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: rpm
Version: 28
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Packaging Maintenance Team
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-01-23 14:03 UTC by Alex Kanavin
Modified: 2018-09-07 16:14 UTC (History)
11 users

Fixed In Version: rpm-4.14.2-1.fc28
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-07 16:14:42 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
fcntl bomb via strace (342.75 KB, text/plain)
2018-05-08 07:54 UTC, Alex Kanavin

Description Alex Kanavin 2018-01-23 14:03:09 UTC
This commit:
https://github.com/rpm-software-management/rpm/commit/7a7c31f551ff167f8718aea6d5048f6288d60205

is causing a serious performance regression when ulimit -n is set significantly higher than the typical 1024 on the desktop. For instance, Docker configurations can set it to 1048576, and that causes a 3x time degradation or more when hundreds of packages are installed.

Comment 1 Alex Kanavin 2018-01-23 14:16:50 UTC
The bug that produced the fix in question is this:
https://bugzilla.redhat.com/show_bug.cgi?id=919801

Comment 2 Fedora End Of Life 2018-02-20 15:22:05 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 28 development cycle.
Changing version to '28'.

Comment 3 Paulo Neves 2018-05-04 13:30:57 UTC
This error just bit me and I spent a full day scratching my head over why my jenkins/docker deployment was so slow.

I thought I was being bitten by some strange problem with my hard drive or something similar, because strace was showing me an installation of 1 file per second. The weirdest thing was that my builds outside docker did not have this problem.

At the end of the day I finally found a post on github [1] saying it is the combination of the rpm package manager and docker that causes the massive slowdown. The solution they offer is just to use any other package manager than rpm.

In the meantime OpenEmbedded has submitted a revert patch. Please consider it while this is not fixed.
[1] https://github.com/moby/moby/issues/23137

Comment 4 Mirza Krak 2018-05-07 09:00:23 UTC
(In reply to Paulo Neves from comment #3)
> This error just bit me and I spent a full day scratching my head over why
> my jenkins/docker deployment was so slow.
> 
> I thought I was being bitten by some strange problem with my hard drive or
> something similar, because strace was showing me an installation of 1 file
> per second. The weirdest thing was that my builds outside docker did not
> have this problem.
> 
> At the end of the day I finally found a post on github [1] saying it is
> the combination of the rpm package manager and docker that causes the
> massive slowdown. The solution they offer is just to use any other
> package manager than rpm.
> 
> In the meantime OpenEmbedded has submitted a revert patch. Please consider
> it while this is not fixed.
> [1] https://github.com/moby/moby/issues/23137

Same issues, same workaround as Paulo.

Comment 5 Panu Matilainen 2018-05-07 11:32:03 UTC
Folks, I can see how looping fcntl() over a huge number of fds would be slower than not doing so... but -p <lua> scriptlets calling posix.exec() is something that's expected to occur maybe a few times in a transaction, certainly not once per each installed file. Smells fishy.

Paulo/Mirza, please attach a full strace of a "one file per second" install (timings of course differ under strace).

Comment 6 Alex Kanavin 2018-05-08 07:06:18 UTC
TBH, seems fishy to me too. I am simply not seeing the fcntl bomb here, on a standard setup, building standard image. Strace does not show that fcntl is executed 'ulimit -n' (or so) times in a loop. What I said in the bug description was second hand info that I'm not able to reproduce (no docker here).

It's entirely possible folks seeing the problem are misusing Lua, or we have misdiagnosed it otherwise.

Strace, please.

Comment 7 Alex Kanavin 2018-05-08 07:28:00 UTC
Nevermind, I found it. This has nothing to do with Lua, the problematic code is here:

https://github.com/rpm-software-management/rpm/blob/master/lib/rpmscript.c#L181

Works okay when it counts up to 1024, but I can imagine how counting up to 1024*1024 is indeed 1024 times slower. And that's once per package scriptlet, I suppose.

Comment 8 Panu Matilainen 2018-05-08 07:40:52 UTC
Well but you just said you're not seeing an fcntl() bomb. Whether Lua or not, you'd be seeing those fcntl()'s in a strace (although you need to follow forks to see that). A strace please.

This all rings another bell, whether related or not: just recently I noticed the rpm build-time test-suite running *dreadfully* slow under mock-builds. I haven't had time to properly investigate, but it had to do with mock recently switching its default to systemd-nspawn instead of chroot. And even with systemd-nspawn, things got back to normal when enabling network access. Pray tell why, it's not as if rpm does any networking.

Comment 9 Alex Kanavin 2018-05-08 07:45:05 UTC
I was doing a wrong grep. :) With the right grep I am indeed seeing it, although it counts only up to 1024, because I am not using Docker. I'll attach a sample strace from postinst execution in a second (I run strace -ff to better separate which process is doing what).

Comment 10 Alex Kanavin 2018-05-08 07:54:14 UTC
Created attachment 1433043 [details]
fcntl bomb via strace

Comment 11 Panu Matilainen 2018-05-08 08:03:20 UTC
Well obviously those fcntl()'s are going to be there, and going through a multitude more of them is going to take more time; that's not the question. There's still zero connection between those multiple fcntl()'s and seeing a "file per second" install rate.

A strace from an install that actually exhibits the problem, please. Preferably a full strace at that, and with -tt so the timing can be seen.

Comment 12 Alex Kanavin 2018-05-08 09:47:47 UTC
The full strace of making an image is over a gigabyte, I don't want to attach that, and you probably don't want to look at it. Would you be okay with excerpts?

1. ulimit -n 1024 scenario

Making an image under strace:

10723 12:08:13.423430 fcntl(3, F_GETFD) = 0x1 (flags FD_CLOEXEC)
...
10723 12:08:13.499483 fcntl(1023, F_GETFD) = -1 EBADF (Bad file descriptor)

So 0.07 seconds or so.

Making an image without strace takes about 22 seconds


2. ulimit -n 1000000 scenario 
 
17249 12:17:20.625723 fcntl(3, F_GETFD) = 0x1 (flags FD_CLOEXEC)
...
17249 12:18:29.382538 fcntl(999999, F_GETFD) = -1 EBADF (Bad file descriptor)

This jumps to 70 seconds. Which is actually exactly 1000*0.07.

Making an image without strace takes 43 seconds.

Does this convince you? If not, I can patch various hardcoded values for maxfd into rpm's doScriptExec(), and tell you how that affects running time for us. No need to run strace for it.

Comment 13 Alex Kanavin 2018-05-08 09:52:42 UTC
When using 'ulimit -n 1000000' the strace size is already pushing 10 gigabytes here, and it's still not finished. Not surprising, considering each fcntl bomb adds hundreds of megabytes of output to it. You can't seriously expect us to provide that file :)

Comment 14 Panu Matilainen 2018-05-08 09:57:50 UTC
Right, I don't need proof for those fcntl()'s taking place, of course they do. And no, I don't want to look at 10gb's worth of fcntl()'s  :)

What I *do* want is the "one file per second" timed strace (if full not reasonable then at least a couple of package's worth of it), because that sounds potentially like a different issue.

Comment 15 Alex Kanavin 2018-05-08 10:03:09 UTC
I suspect 'one file per second' means 'one package containing a postinst script per second' which is almost certainly the same issue.

Comment 16 Alex Kanavin 2018-05-08 10:06:49 UTC
For my tests, the created image has about 30 packages that have postinsts in them, and the jump in image creation time (22sec->43sec) is consistent with that. Some people may have 10x the number of such packages in their images, and that's where it starts to hurt seriously.

Comment 17 Panu Matilainen 2018-05-08 10:13:28 UTC
This data doesn't quite add up now. You're speaking of 'ulimit -n 1000000' causing one script to take 70s, yet a 30-package image with about as many scripts takes 43s. I have to assume those are with different file limit settings, but you know what they say about assuming.

Comment 18 Alex Kanavin 2018-05-08 10:15:27 UTC
That's the overhead of strace. It has to process and write things out to disk, and of course that adds massively to the timing, considering each fcntl bomb adds 300 megabytes to the log.

Comment 19 Panu Matilainen 2018-05-08 10:18:47 UTC
Oh, of course. Said so myself in an earlier comment. Stupid me :)

Comment 20 Alex Kanavin 2018-05-08 10:24:30 UTC
Anyway, what should be done about this bit in rpmscript.c?

    /* XXX Force FD_CLOEXEC on all inherited fdno's. */
    open_max = sysconf(_SC_OPEN_MAX);
    if (open_max == -1) {
        open_max = 1024;
    }
    for (fdno = 3; fdno < open_max; fdno++) {
        flag = fcntl(fdno, F_GETFD);
        if (flag == -1 || (flag & FD_CLOEXEC))
            continue;
        xx = fcntl(fdno, F_SETFD, FD_CLOEXEC);
    }
We can patch it out in Yocto, and carry the patch forward, but if you can find a better way out, that'd be a lot better.

Comment 21 Peter Kjellerstedt 2018-05-14 16:08:57 UTC
Since we have been affected by this problem in our builds in Docker containers, where we have seen the time it takes to generate a rootfs using dnf increase from about five minutes to 30-45 minutes, I have invested some time in examining it. Here is a small table showing the times I measured when building core-image-minimal from Poky (it is a lot smaller image than our regular ones):

Max files  Time 
---------  -----
1024       30 seconds
1048576    50 seconds
10000000   3 minutes 40 seconds
100000000  31 minutes

As can be seen, the time it takes to generate the rootfs is quite severely affected when the maximum number of file descriptors is increased.

It is possible to reproduce the problem without Docker by increasing 
the maximum number of open files. This can be done by editing 
/etc/security/limits.conf and adding:

* - nofile  100000000

where 100000000 is the maximum number of open files. It is also 
very likely that you need to increase the global limit by running:

sudo sysctl -w fs.nr_open=100000000

After that, ssh to localhost to get a new session where the new 
settings are active. Then you can use `ulimit -n <number>` to set 
the max you want to try.
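The steps above, condensed into commands (a sketch; 100000000 is the example value from this comment, and these commands change system-wide limits, so run with care):

```shell
# Raise the per-user and the global fd limits (illustrative values)
echo '* - nofile 100000000' | sudo tee -a /etc/security/limits.conf
sudo sysctl -w fs.nr_open=100000000

# Start a fresh session so the new limits apply, then pick a value to test
ssh localhost
ulimit -n 1000000
```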

My suggestion for how to fix this problem is to make rpm explicitly set the maximum number of open file descriptors to a sane value when it starts using setrlimit(). That way the application is not adversely affected by unusual configurations for no reason. I do not know the rpm code well enough to suggest an implementation, or how a user can configure the actual limit used if that is deemed necessary.

Comment 22 Kirill Kolyshkin 2018-05-17 17:10:07 UTC
One other approach would be to do readdir(/proc/self/fd) and iterate over fds found in there. This can also be done conditionally, say if open_max > 1024.

Comment 23 Kirill Kolyshkin 2018-05-17 18:57:52 UTC
I went ahead to implement my suggestion, here's the patch:
https://github.com/rpm-software-management/rpm/pull/444

Comment 24 Peter Kjellerstedt 2018-05-17 21:02:51 UTC
(In reply to Kir Kolyshkin from comment #22)
> One other approach would be to do readdir(/proc/self/fd) and iterate over
> fds found in there. This can also be done conditionally, say if open_max >
> 1024.

Ooh, I like that idea. Much better than mine. 
I gave some more comments in the GitHub review.

Comment 25 Panu Matilainen 2018-06-29 13:22:38 UTC
This should be fixed in rawhide now (rpm >= 4.14.2-rc1)

Comment 26 Fedora Update System 2018-08-22 13:10:20 UTC
rpm-4.14.2-1.fc28 has been submitted as an update to Fedora 28. https://bodhi.fedoraproject.org/updates/FEDORA-2018-16c78b3d92

Comment 27 Fedora Update System 2018-08-23 11:02:20 UTC
rpm-4.14.2-1.fc28 has been pushed to the Fedora 28 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-16c78b3d92

Comment 28 Fedora Update System 2018-09-07 16:14:42 UTC
rpm-4.14.2-1.fc28 has been pushed to the Fedora 28 stable repository. If problems still persist, please make note of it in this bug report.

