Bug 489368 - Duplicate dirnames in rpm headers and unsorted dirnames in general.
Summary: Duplicate dirnames in rpm headers and unsorted dirnames in general.
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: rpm
Version: 11
Hardware: All
OS: Linux
low
medium
Target Milestone: ---
Assignee: Florian Festi
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2009-03-09 18:10 UTC by Phil Knirsch
Modified: 2015-03-05 01:20 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-06-28 11:25:51 UTC
Type: ---
Embargoed:


Attachments (Terms of Use)

Description Phil Knirsch 2009-03-09 18:10:05 UTC
Description of problem:

This bug has been around since the beginning of time. Actually not really, but since very early versions of rpm where the split of dirnames and basenames was introduced.

The way rpm generates dirnames for a build is simply put like this (in pseudocode):

Generate filelist with dirs
Sort filelist
for each file in filelist:
  Split dname and bname from file
  Search dirindex for dname in dirnames using glibc bsearch
  if found, use dirindex for that bname
  otherwise append dname to dirnames and use new dirindex for that bname
  append bname to basenames

Now on first glance this sounds correct. But if you look closely at how the filelist is generated you'll suddenly see that dirnames afterwards isn't actually a sorted list. Here a real life example of glibc-common.i386 from Fedora-9 release (glibc-common-2.8-3.i386.rpm)

Output of rpm -qpl glibc-common-2.8-3.i386.rpm:

/etc/default
/etc/default/nss
/usr/bin/catchsegv
/usr/bin/gencat
/usr/bin/getconf
/usr/bin/getent
/usr/bin/iconv
/usr/bin/ldd
/usr/bin/lddlibc4
/usr/bin/locale
/usr/bin/localedef
/usr/bin/rpcgen
/usr/bin/sprof
/usr/bin/tzselect
/usr/lib/locale
/usr/lib/locale/locale-archive
/usr/lib/locale/locale-archive.tmpl
/usr/libexec/pt_chown
/usr/sbin/build-locale-archive
/usr/sbin/tzdata-update
/usr/sbin/zdump
/usr/sbin/zic
/usr/share/doc/glibc-common-2.8
/usr/share/doc/glibc-common-2.8/ChangeLog.15.bz2
/usr/share/doc/glibc-common-2.8/ChangeLog.16.bz2
/usr/share/doc/glibc-common-2.8/ChangeLog.bz2
/usr/share/doc/glibc-common-2.8/README.timezone
/usr/share/doc/glibc-common-2.8/README.ufc-crypt
/usr/share/doc/glibc-common-2.8/gai.conf
/usr/share/i18n
/usr/share/i18n/charmaps
/usr/share/i18n/charmaps/ANSI_X3.110-1983.gz
/usr/share/i18n/charmaps/ANSI_X3.4-1968.gz
...
/usr/share/i18n/charmaps/UTF-8.gz
/usr/share/i18n/charmaps/VIDEOTEX-SUPPL.gz
/usr/share/i18n/charmaps/VISCII.gz
/usr/share/i18n/charmaps/WINDOWS-31J.gz
/usr/share/i18n/locales
/usr/share/i18n/locales/POSIX
/usr/share/i18n/locales/aa_DJ
...
/usr/share/i18n/locales/zh_SG
/usr/share/i18n/locales/zh_TW
/usr/share/i18n/locales/zu_ZA
/usr/share/locale
/usr/share/locale/be
/usr/share/locale/be/LC_MESSAGES
/usr/share/locale/be/LC_MESSAGES/libc.mo
/usr/share/locale/bg
/usr/share/locale/bg/LC_MESSAGES
/usr/share/locale/bg/LC_MESSAGES/libc.mo
/usr/share/locale/ca


Side by side comparison of dirnames and LANG=C sorted dirnames:

/etc/ /etc/
/etc/default/ /etc/default/
/usr/bin/ /usr/bin/
/usr/lib/ /usr/lib/
/usr/lib/locale/ /usr/lib/locale/
/usr/libexec/ /usr/libexec/
/usr/sbin/ /usr/sbin/
/usr/share/doc/ /usr/share/
/usr/share/doc/glibc-common-2.8/ /usr/share/
/usr/share/ /usr/share/doc/
/usr/share/i18n/ /usr/share/doc/glibc-common-2.8/
/usr/share/i18n/charmaps/ /usr/share/i18n/
/usr/share/i18n/locales/ /usr/share/i18n/charmaps/
/usr/share/ /usr/share/i18n/locales/
/usr/share/locale/ /usr/share/locale/


As you can see, /usr/share/ is listed twice in the dirnames list. This comes from the simple fact that after the splitting of dirnames and basenames for the files the resulting dirnames aren't sorted anymore:

...
/usr/share/doc/glibc-common-2.8 -> /usr/share/doc/ : glibc-common-2.8
...
/usr/share/i18n -> /usr/share/ : i18n
...
/usr/share/i18n/charmaps -> /usr/share/i18n/ : charmaps
...
/usr/share/i18n/charmaps/ANSI_X3.110-1983.gz -> /usr/share/i18n/charmaps/ : ANSI_X3.110-1983.gz
...
/usr/share/i18n/locales -> /usr/share/i18n/ : locales
...
/usr/share/i18n/locales/POSIX -> /usr/share/i18n/locales/ : POSIX
...
/usr/share/locale -> /usr/share/ : locale
...
/usr/share/locale/be -> /usr/share/locale/ : be

And as that list isn't sorted anymore it comes as no surprise that bsearch() will eventually not find an already existing dirname in dirnames, e.g. /usr/share/ in this example.

This has several obvious problems:

 - Duplicate dirname entries in dirnames, wasting space
 - If the bsearch() implementation of glibc should ever change you will get different dirnames headers for idential filelists
 - Any tool working with rpm headers and dirnames that assumes that dirnames is sorted and/or unique will fail miserably


Version-Release number of selected component (if applicable):

rpm since the change from "filenames" tag to "dirnames" and "basenames".

How reproducible:

Happens with lots of rpms with larger filelists.


A correct implementation would probably have to do use 2 pass algorithm to first collect a list of all possible dirnames and then uniquely sort it and in the 2nd pass extract the basenames and use the now correctly working bsearch() for any implementation to find the corresponding dirname from dirnames.

Thanks & regards, Phil

Comment 1 R P Herrold 2009-03-10 02:17:30 UTC
Phil

Why not just one pass, hashing each dir, and inserting into a b-tree, spotting [and avoiding] collisions (and thus dupes)?

-- Russ herrold

Comment 2 Phil Knirsch 2009-03-10 11:27:12 UTC
Hm, how would you manage the dirindexes for the basenames then? I suppose instead of using direct integers you could make it a pointer to a dirindex entry in a struct of the dirname and dirindex with which you populate the b-tree.

Sample:

struct dnames {
    int dirindex;
    char *dirname;
    struct dnames *left;
    struct dnames *right;
};

and have dirindexes:

int *dirindexes;
dirindexes = malloc(sizeof(int *) * len_of_filelist);

This way you can at least build the b-tree and the basenames list and the dirindexes pointer list in 1 pass and just need a quick 2nd pass to generate the final dirnames list together with the correct dirindexes in the b-tree and then save the values of the pointers in dirindexes. And as the dirnames should be a lot smaller than the filelist the time overhead should be minimal.

Comment 3 Jeff Johnson 2009-03-11 00:49:01 UTC
Let's try a reality check, shall we?

The savings in removing duplicates in RPMTAG_DIRNAMES is infintessimal. Do the math ...
E.g. the savings is (at least) an order of magnitude smaller than the added overhead
of attaching file capabilities to *every* file so that perhaps 100+ files can be installed
with capabilities.

The data in RPMTAG_DIRNAMES has *never* been guaranteed to be sorted. Examples
of unsorted RPMTAG_DIRNAMES are everywhere.

Pursuing a sort order for RPMTAG_DIRNAMES hardly changes anything. Unsorted
RPMTAG_DIRNAMES will remain in *.rpm packages forevermore. So rpm wil
continue to have to sort RPMTAG_DIRNAMES if bsearch is desired. Luckily, there's
no access on any critical path in rpm that would benefit from pre-sorted
RPMTAG_DIRNAMES.

There is no benefit -- to rpm, to developers, to Red Hat, to LSB, etc -- from
sorting RPMTAG_DIRNAMES. Nor are there any forseeable benefits from
a Newer! Better! Bestest! implementation that I can conceive of even after smoking
a lot of crack ...

Using a B-tree, or a data structure, to achieve a 1-pass generation of simultaneously
sorted RPMTAG_DIRNAMES and RPMTAG_FILENAMES with a RPMTAG_DIRINDEXES lookup
is just bleeping overkill. At least do the sort by embedding OCAML or using an Oracle
database optimized for parallel retrieval or something manly please ...

Comment 4 Bug Zapper 2009-06-09 12:02:59 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 11 development cycle.
Changing version to '11'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 5 Bug Zapper 2010-04-27 13:08:35 UTC
This message is a reminder that Fedora 11 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 11.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '11'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 11's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 11 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 6 Bug Zapper 2010-06-28 11:25:51 UTC
Fedora 11 changed to end-of-life (EOL) status on 2010-06-25. Fedora 11 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.


Note You need to log in before you can comment on or make changes to this bug.