Bug 559647 - quadratic work: genisoimage per .rpm on split .iso
Summary: quadratic work: genisoimage per .rpm on split .iso
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Fedora
Classification: Fedora
Component: pungi
Version: rawhide
Hardware: All
OS: Linux
low
medium
Target Milestone: ---
Assignee: David Cantrell
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2010-01-28 16:50 UTC by John Reiser
Modified: 2013-01-10 05:41 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-01-28 18:35:41 UTC
Type: ---
Embargoed:


Attachments (Terms of Use)
splittree in a couple seconds (average 3 genisoimage per CD) (6.60 KB, patch)
2010-09-16 02:26 UTC, John Reiser
no flags Details | Diff
revised patch for splittree.py (9.00 KB, patch)
2010-09-29 14:20 UTC, John Reiser
no flags Details | Diff

Description John Reiser 2010-01-28 16:50:31 UTC
Description of problem: Building split .isos (images for CD-ROM) takes at least several minutes when it should be much faster (only a little more than the time for one copy of all the data.)  The split .isos are built by creating a directory of the fixed content (everything but the .rpm package files), then for each .rpm: hardlink the next .rpm into the directory, and call genisoimage.  The process of {add a hardlink for the next .rpm, call genisoimage} is repeated until the .iso reaches the size limit.  The total work required is quadratic in the number of .rpm packages per CD.  However, the size of any potential output from genisoimage can be estimated reliably in advance as the sum of the sizes of the files, plus the overhead for directories (sum of lengths of filenames plus 150 bytes per file), plus the overhead for Joliet, RockRidge, and TRANS.TBL info (which can be estimated based on past experience.)  Using such an estimate would reduce the work to about 3 calls to genisoimage per CD: an initial guess plus a couple of corrections.  In the worst case a binary search would be logarithmic (at most 12 or 13 calls per CD, based on 3000 total .rpm).


Version-Release number of selected component (if applicable):
anaconda-13.23-1
pungi-2.0.20-1.fc12.noarch


How reproducible: every time

Steps to Reproduce:
1. strace -f -o "| gzip -c >/tmp/strace-pungi.$$.gz" \
      -e trace=file,write,pwrite,fork,vfork,clone -v \
   pungi -c /usr/share/pungi/rawhide-fedora.ks --destdir=$DESTDIR \
       --name Fedora --ver $VERSION --nosource
Expect about 3.2GB uncompressed output from strace, about 167MB compressed, over about 3 hours.
2.  < /tmp/strace-pungi.$$.gz  gzip -d  | \
    sed -n '/execve(/s/[^"]*"\([^"]*\).*/\1/p'  | \
    sort  |  uniq -c  |  sort -rn  |  sed 15q
3. Analyze full strace output for execve of genisoimage.
  
Actual results:
  34795 /usr/bin/dirname
  17668 /bin/cp
  15474 /bin/egrep
  12078 /usr/bin/file
   6248 /bin/sed
   5772 /bin/basename
   4891 /usr/bin/readlink
   3934 /bin/rm
   3513 /usr/sbin/chroot
   3306 /bin/grep
   2968 /usr/bin/nm
   2840 /usr/bin/genisoimage     *** one per .rpm
-----
5130  clone(...) = 28752
28752 execve("/usr/bin/genisoimage", ...
   ...
28752 write(1, "106018\n", 7)           = 7     *** size of generated .iso
5130  --- SIGCHLD (Child exited) @ 0 (0) ---
5130  stat(".../libgcc-4.4.3-2.fc13.x86_64.rpm", ...
5130  link(".../libgcc-4.4.3-2.fc13.x86_64.rpm", ".../os-disc1/Packages/...")
5130  clone(...) = 28753
28753 execve("/usr/bin/genisoimage", ...
   ...
28753 write(1, "106060\n", 7)           = 7     *** size of generated .iso
5130  --- SIGCHLD (Child exited) @ 0 (0) ---
5130  stat(".../libgcc-4.4.3-2.fc13.i686.rpm", ...
5130  link(".../libgcc-4.4.3-2.fc13.i686.rpm", ".../os-disc1/Packages/...")
5130  clone(...) = 28754
28754 execve("/usr/bin/genisoimage",
   ...
-----
where 5130 is the PID of the original pungi process.

Expected results:  Significantly faster (by 10% of entire pungi run of two hours) by avoiding quadratic work.


Additional info: The histogram of filenames accessed during the entire pungi run begins:
-----
5039579 /etc/localtime
 382749 /usr/bin/dirname
 170218 /bin/egrep
 128312 /usr/bin/file
 127580 /lib64/libc.so.6
 126536 /etc/ld.so.cache
 126012 /etc/ld.so.preload
  94511 .
  68637 /bin/sed
  66709 ..
  63499 /bin/basename
-----
where genisoimage checks /etc/localtime to print the localtime in the periodic message " nn.nn% done, estimate finish <date>".  Can't this be turned off?

The script for filenames is:
   sed -n 's/[^"]*"\([^"]*\)".*/\1/p'  | \
   sort  |  uniq -c  |  sort -rn

Comment 1 Chris Lumens 2010-01-28 18:11:23 UTC
I believe it's pungi that calls genisoimage, not anaconda.

Comment 2 Jesse Keating 2010-01-28 18:35:41 UTC
10% savings for something that makes us a lot more likely to overrun sizes and a bunch of added complexity isn't something I'm interested in.  Certainly not something I'm going to spend time coding.  If you have patches, I might entertain them.  On the systems where we compose, the entire compose only takes about 40 minutes, so shaving 4 minutes off of 40 just doesn't really matter to me.  I'm closing this deferred.

Comment 3 John Reiser 2010-01-29 19:58:23 UTC
pkgorder+splittree would rise to 23% of elapsed time for a compose, assuming infinite download speed.  (18% of the 40 minutes mentioned in comment #2.)

I saved 5 minutes during buildinstall by using tmpfs in RAM:
   mkdir -p      /dev/shm/pungi-$$
   export TMPDIR=/dev/shm/pungi-$$
The peak usage was 1.58GB, so a 3GB machine is not quite enough for one simultaneous compose because the default for /dev/shm is 50% of RAM.  Previously I used TMPDIR=$DESTDIR/work/$ARCH/tmp on ext4 (SATA-II.)

My observed timeline this morning was:
  08:30  begin compose
  08:38  end package downloads (134 packages, total 285MB)
  08:40  buildinstall
  08:50  end buildinstall downloads (another 284MB)
  09:05  mkisofs DVD
  09:07  pkgorder
  09:10  splittree
  09:14  repo data
  09:16  mkisofs disc1
  09:18  end compose
=====
     48m  total elapsed time
      8m  package downloading
     10m  buildinstall downloading
-------
     30m  elapsed non-download
      3m  pkgorder
      4m  splittree

Comment 4 Jesse Keating 2010-01-29 20:31:08 UTC
Still not really enough for me to care, particularly without code in hand.

Comment 5 John Reiser 2010-09-16 02:26:40 UTC
Created attachment 447604 [details]
splittree in a couple seconds (average 3 genisoimage per CD)

This patch to pungi-2.1.2/src/pypungi/splittree.py implements the suggested "pack using .st_size as estimate, check using genisoimage, correct if necessary."

Comment 6 John Reiser 2010-09-29 14:20:14 UTC
Created attachment 450487 [details]
revised patch for splittree.py

Better detection and processing of space overflow.

Please consider for Fedora 15 (after Fedora 14.)

Comment 7 Jesse Keating 2010-11-16 00:48:53 UTC
splittree is dead in F15, so there is no need to patch it.


Note You need to log in before you can comment on or make changes to this bug.