Bug 1691822 - Parallelize /usr/lib/rpm/brp-strip-static-archive
Summary: Parallelize /usr/lib/rpm/brp-strip-static-archive
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: rpm
Version: rawhide
Hardware: All
OS: Linux
Priority: medium
Severity: low
Target Milestone: ---
Assignee: Florian Festi
QA Contact: Fedora Extras Quality Assurance
Reported: 2019-03-22 15:21 UTC by Denys Vlasenko
Modified: 2019-06-13 11:30 UTC
CC List: 7 users
Last Closed: 2019-06-13 11:30:43 UTC
Description Denys Vlasenko 2019-03-22 15:21:58 UTC
The kernel's RPM build process executes this command at a certain stage:

/usr/lib/rpm/brp-strip-static-archive /usr/bin/strip

which is a shell script.

At this point, the kernel build tree has ~60 thousand files. Even though our build machines have at least 6-10 CPUs, the script is written so that it executes "file FILE" once for every file, in sequence:

# Strip static libraries.
for f in `find "$RPM_BUILD_ROOT" -type f -a -exec file {} \; | \
        grep -v "^${RPM_BUILD_ROOT}/\?usr/lib/debug"  | \
        grep 'current ar archive' | \
        sed -n -e 's/^\(.*\):[  ]*current ar archive/\1/p'`; do
        $STRIP -g "$f"
done

This means one fork per file, and it is entirely sequential. For the kernel, this takes 2 minutes on a fast x86 machine with 88 CPUs.
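As an aside (my illustration, not from the report), even find itself can batch filenames by using the "+" terminator instead of forking once per file with "\;":

```shell
# Demo (hypothetical, not from the report): "-exec ... +" batches many
# filenames into one "file" invocation, while "-exec ... \;" forks a
# new "file" process for every single file.
d=$(mktemp -d)
for i in $(seq 1 50); do : > "$d/f$i"; done

# One fork per file (slow on large trees):
find "$d" -type f -exec file {} \; > /dev/null

# One fork per batch of files (much cheaper):
find "$d" -type f -exec file {} + > /dev/null
rm -rf "$d"
```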

Proposed replacement:

# Strip static libraries.
#
# Sometimes this is run on trees with 100 thousand files. Be efficient:
# xargs -r -P$NPROC -n16 file | sed 's/:  */: /'
# is used to
# * xargs: avoid fork overhead per each file
# * -P$NPROC: parallelize
# * -n16: avoid running just a few "file LIST" cmds with huge LISTs
# * sed 's/:  */: /': defeat "file F1 FILE2" columnar formatting:
#   | F1:    type1 <--- extra spaces after colon
#   | FILE2: type2
NPROC=`nproc`
test "$NPROC" || NPROC=1
for f in `find "$RPM_BUILD_ROOT" -type f | \
        grep -v "^${RPM_BUILD_ROOT}/\?usr/lib/debug" | \
        xargs -r -P$NPROC -n16 file | sed 's/:  */: /' | \
        grep 'current ar archive' | \
        sed -n -e 's/^\(.*\):[  ]*current ar archive/\1/p'`; do
        $STRIP -g "$f"
done

For the kernel build, this reduces the time to run this fragment from 2 minutes to 3 seconds.
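A side note (my own observation, not part of the proposal): the backtick for-loop word-splits filenames that contain spaces; piping into a while-read loop avoids that:

```shell
# Demo (my sketch, not part of the proposal): the backtick for-loop
# word-splits a filename containing spaces into several words, while
# a while-read loop handles it correctly.
d=$(mktemp -d)
: > "$d/name with spaces"

# Broken: the single filename is split into three words.
for f in `find "$d" -type f`; do echo "for: $f"; done

# Safe: one line read per filename.
find "$d" -type f | while IFS= read -r f; do echo "read: $f"; done
rm -rf "$d"
```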

Comment 1 Panu Matilainen 2019-03-29 08:54:19 UTC
Most of the brp-* scripts originate from the turn of the millennium, when most people could only dream about SMP systems, and none of them execute anything in parallel. Ditto with -exec vs xargs: they execute "file" a thousand times too many. Etc.

No argument about -exec vs xargs, that is a no-brainer optimization pretty much all of those brp scripts could use. 

I didn't even know xargs has an option to enable parallel processing, every day you learn something new. I don't see any reason we couldn't introduce parallel jobs here, but it's not as straightforward as just adding a call to nproc. In rpm, parallel jobs are controlled with RPM_BUILD_NCPUS environment variable, together with %_smp_build_ncpus (upstream only for now) and %_smp_ncpus_max macros. The brp-* scripts would need to use the same mechanism.
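A sketch of what that could look like (my assumption about the shape, not rpm's actual code): prefer RPM_BUILD_NCPUS when set, otherwise fall back to nproc, then to 1:

```shell
# Hypothetical sketch, not rpm's actual implementation: honor the
# RPM_BUILD_NCPUS environment variable if set, fall back to nproc,
# and default to 1 if neither is available.
NCPUS=${RPM_BUILD_NCPUS:-$(nproc 2>/dev/null)}
NCPUS=${NCPUS:-1}
echo "using $NCPUS parallel jobs"
```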

The other worry here is that I'd expect this to be a more IO- than CPU-bound stage, so at least on systems with spinning rust (as our friends at FB call 'em) throwing in more jobs might actually slow things down. I haven't tried it, just pointing out that there's more to it than the number of CPUs. Another related concern is the amount of memory (see bug 1118734), although that's probably not relevant for this particular case.

Comment 2 Denys Vlasenko 2019-04-05 10:06:03 UTC
(In reply to Panu Matilainen from comment #1)
> In rpm, parallel jobs are controlled with RPM_BUILD_NCPUS environment
> variable, together with %_smp_build_ncpus (upstream only for now) and
> %_smp_ncpus_max macros. The brp-* scripts would need to use the same
> mechanism.

Sure, feel free to check and use those variables' values, if set.

> The other worry here is that I'd expect this to be more IO- rather than
> CPU-bound stage, so at least with systems with spinning rust (as our friends
> at FB call 'em) throwing in more jobs might actually slow things down.

This script is run not on a pristine, cold filesystem, but somewhere
in the middle of the build process. Since Linux caches filesystem data in RAM,
most or all of the filesystem data we need will already be in cache
by the time the script is run.

However, let's test what happens on cold data. Here's an old Phenom II x4
machine with a mechanical hard drive:

# sync; echo 3 >/proc/sys/vm/drop_caches; sync
# time sh -c 'find /usr -type f | xargs -r -P1 -n16 file | sed "s/:  */: /" | wc -l'
93355
real	6m52.899s
user	3m4.985s
sys	0m22.265s

# sync; echo 3 >/proc/sys/vm/drop_caches; sync
# time sh -c 'find /usr -type f | xargs -r -P4 -n16 file | sed "s/:  */: /" | wc -l'
93355
real	4m31.889s
user	3m2.259s
sys	0m21.266s

Still speeds things up.

Comment 3 Florian Festi 2019-04-15 14:22:39 UTC
Upstream PR at https://github.com/rpm-software-management/rpm/pull/663

Comment 4 Denys Vlasenko 2019-05-02 13:44:31 UTC
(In reply to Denys Vlasenko from comment #0)
>         xargs -r -P$NPROC -n16 file | sed 's/:  */: /'

Unfortunately, this is not reliable. I managed to reproduce a case where, with a large $NPROC (larger than 100), there are so many writes that the pipe becomes full. I used this for the test:

find /usr | xargs -r -P199 -n16 file | sed 's/:  */: /' | sort >/tmp/11

I see this on a machine with 88 CPUs. A smaller machine with 8 CPUs did not generate output fast enough to trigger it.

For some reason, this only happens on files in /usr/share/locale/.../*.mo - if I use "find /usr/lib*", all is fine. I assume "file" processes .mo files especially fast.

Anyway, this is a real problem.

I propose a solution where we save output in a tempfile (so the "pipe is full" problem is avoided)
and coalesce each "file" process's output into one block (so that even if "file" writes its output in small chunks, it still works correctly).
This test script so far did not fail even once:

i=1
while sleep 1; do
NPROC=199    #`nproc`
T1=/tmp/T1   #`mktemp`
T2=/tmp/T2   #`mktemp`
time find /usr -print0 | xargs -r -0 -P$NPROC -n64 sh -c 'file "$@" | dd bs=1M iflag=fullblock 2>/dev/null' ARG0 >$T1
time find /usr -print0 | xargs -r -0 -P$NPROC -n64 sh -c 'file "$@" | dd bs=1M iflag=fullblock 2>/dev/null' ARG0 >$T2
sed 's/:  */: /' <$T1 | sort >$T1.sorted
sed 's/:  */: /' <$T2 | sort >$T2.sorted
diff -u $T1.sorted $T2.sorted >/tmp/Z.diff || break
cat /tmp/Z.diff
echo "DONE:$((i++)) ^C to stop"
done

Notes:
sh -c 'file "$@" | dd' ARG0
  - this construct runs the small shell script, supplying filenames as $1, $2, $3... (ARG0 is necessary; otherwise the first filename would become $0, not $1)
  - within this small shell script, nothing is parallelized - one "file" process writes to one "dd" process.
"dd bs=1M iflag=fullblock" is a rather crude method of ensuring all input is read, grouped into one block, and written in one write() call.
  Without this, "file" processes sometimes seem to do short writes. That works fine for serially running processes, but when 199 "file"s write to the same fd, short writes give the outputs an opportunity to interleave.
  I tried using "sed" instead of "dd", but apparently "sed" can also do short writes.
-print0 / -0 is needed because there _are_ file names with spaces in my /usr, and xargs misbehaves on them otherwise.
-n64 was chosen because it gives a large (~4x) reduction in run time compared to -n16.

TODO: "dd bs=1M iflag=fullblock 2>/dev/null" is ugly. We need a better tool - for one, there is no need for a fixed-size buffer; it should grow as needed.
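One possible direction (a sketch of mine, not the fix that was merged): have each parallel batch write to its own temp file and concatenate them serially afterwards, so parallel writers never share a pipe at all:

```shell
# Hypothetical alternative to the dd trick (not the merged fix): each
# parallel "file" batch writes its output to a private temp file; the
# files are concatenated serially afterwards, so no interleaving can
# occur and no fixed-size buffer is needed.
OUTDIR=$(mktemp -d); export OUTDIR
find /usr/bin -print0 | \
        xargs -r -0 -P4 -n64 sh -c 'file "$@" > "$(mktemp -p "$OUTDIR")"' ARG0
cat "$OUTDIR"/* | sed 's/:  */: /'
rm -rf "$OUTDIR"
```

The cost is one temp file per batch and a second pass over the data, but the concatenation step is strictly serial, so correctness no longer depends on write() atomicity.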

Comment 5 Panu Matilainen 2019-06-13 11:30:43 UTC
Fixed in rawhide as of rpm >= 4.14.90. Thanks for the suggestion and sample code!

