Description of problem: Building split .isos (images for CD-ROM) takes at least several minutes when it should be much faster (only a little more than the time for one copy of all the data.) The split .isos are built by creating a directory of the fixed content (everything but the .rpm package files), then for each .rpm: hardlink the next .rpm into the directory, and call genisoimage. The process of {add a hardlink for the next .rpm, call genisoimage} is repeated until the .iso reaches the size limit. The total work required is quadratic in the number of .rpm packages per CD. However, the size of any potential output from genisoimage can be estimated reliably in advance as the sum of the sizes of the files, plus the overhead for directories (sum of lengths of filenames plus 150 bytes per file), plus the overhead for Joliet, RockRidge, and TRANS.TBL info (which can be estimated based on past experience.) Using such an estimate would reduce the work to about 3 calls to genisoimage per CD: an initial guess plus a couple of corrections. In the worst case a binary search would be logarithmic (at most 12 or 13 calls per CD, based on 3000 total .rpm). Version-Release number of selected component (if applicable): anaconda-13.23-1 pungi-2.0.20-1.fc12.noarch How reproducible: every time Steps to Reproduce: 1. strace -f -o "| gzip -c >/tmp/strace-pungi.$$.gz" \ -e trace=file,write,pwrite,fork,vfork,clone -v \ pungi -c /usr/share/pungi/rawhide-fedora.ks --destdir=$DESTDIR \ --name Fedora --ver $VERSION --nosource Expect about 3.2GB uncompressed output from strace, about 167MB compressed, over about 3 hours. 2. < /tmp/strace-pungi.$$.gz gzip -d | \ sed -n '/execve(/s/[^"]*"\([^"]*\).*/\1/p' | \ sort | uniq -c | sort -rn | sed 15q 3. Analyze full strace output for execve of genisoimage. Actual results: 34795 /usr/bin/dirname 17668 /bin/cp 15474 /bin/egrep 12078 /usr/bin/file 6248 /bin/sed 5772 /bin/basename 4891 /usr/bin/readlink 3934 /bin/rm 3513 /usr/sbin/chroot 3306 /bin/grep 2968 /usr/bin/nm 2840 /usr/bin/genisoimage *** one per .rpm ----- 5130 clone(...) = 28752 28752 execve("/usr/bin/genisoimage", ... ... 28752 write(1, "106018\n", 7) = 7 *** size of generated .iso 5130 --- SIGCHLD (Child exited) @ 0 (0) --- 5130 stat(".../libgcc-4.4.3-2.fc13.x86_64.rpm", ... 5130 link(".../libgcc-4.4.3-2.fc13.x86_64.rpm", ".../os-disc1/Packages/...") 5130 clone(...) = 28753 28753 execve("/usr/bin/genisoimage", ... ... 28753 write(1, "106060\n", 7) = 7 *** size of generated .iso 5130 --- SIGCHLD (Child exited) @ 0 (0) --- 5130 stat(".../libgcc-4.4.3-2.fc13.i686.rpm", ... 5130 link(".../libgcc-4.4.3-2.fc13.i686.rpm", ".../os-disc1/Packages/...") 5130 clone(...) = 28754 28754 execve("/usr/bin/genisoimage", ... ----- where 5130 is the PID of the original pungi process. Expected results: Significantly faster (by 10% of entire pungi run of two hours) by avoiding quadratic work. Additional info: The histogram of filenames accessed during the entire pungi run begins: ----- 5039579 /etc/localtime 382749 /usr/bin/dirname 170218 /bin/egrep 128312 /usr/bin/file 127580 /lib64/libc.so.6 126536 /etc/ld.so.cache 126012 /etc/ld.so.preload 94511 . 68637 /bin/sed 66709 .. 63499 /bin/basename ----- where genisoimage checks /etc/localtime to print the localtime in the periodic message " nn.nn% done, estimate finish <date>". Can't this be turned off? The script for filenames is: sed -n 's/[^"]*"\([^"]*\)".*/\1/p' | \ sort | uniq -c | sort -rn
I believe it's pungi that calls genisoimage, not anaconda.
10% savings for something that makes us a lot more likely to overrun sizes and a bunch of added complexity isn't something I'm interested in. Certainly not something I'm going to spend time coding. If you have patches, I might entertain them. On the systems where we compose, the entire compose only takes about 40 minutes, so shaving 4 minutes off of 40 just doesn't really matter to me. I'm closing this deferred.
pkgorder+splittree would rise to 23% of elapsed time for a compose, assuming infinite download speed. (18% of the 40 minutes mentioned in comment #2.) I saved 5 minutes during buildinstall by using tmpfs in RAM: mkdir -p /dev/shm/pungi-$$ export TMPDIR=/dev/shm/pungi-$$ The peak usage was 1.58GB, so a 3GB machine is not quite enough for one simultaneous compose because the default for /dev/shm is 50% of RAM. Previously I used TMPDIR=$DESTDIR/work/$ARCH/tmp on ext4 (SATA-II.) My observed timeline this morning was: 08:30 begin compose 08:38 end package downloads (134 packages, total 285MB) 08:40 buildinstall 08:50 end buildinstall downloads (another 284MB) 09:05 mkisofs DVD 09:07 pkgorder 09:10 splittree 09:14 repo data 09:16 mkisofs disc1 09:18 end compose ===== 48m total elapsed time 8m package downloading 10m buildinstall downloading ------- 30m elapsed non-download 3m pkgorder 4m splittree
Still not really enough for me to care, particularly without code in hand.
Created attachment 447604 [details] splittree in a couple seconds (average 3 genisoimage per CD) This patch to pungi-2.1.2/src/pypungi/splittree.py implements the suggested "pack using .st_size as estimate, check using genisoimage, correct if necessary."
Created attachment 450487 [details] revised patch for splittree.py Better detection and processing of space overflow. Please consider for Fedora 15 (after Fedora 14.)
splittree is dead in F15, so there is no need to patch it.