Red Hat Bugzilla – Bug 837296
inverted index: reduce size of postings file
Last modified: 2012-07-03 17:41:29 EDT
Description of problem:
The postings file of the inverted index (e.g. cscope.po.out) can get quite big, especially on 64-bit machines.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. git ls-files "*.[ch]" | cscope -bqkv -i-
$ ls -1s --block-size=1 cscope.*
(This output is actually fake).
0) This test was done with a very recent git repository of the Linux kernel. It contains 31384 .c and .h files.
1) The postings file consists of 18452992 entries of 24 bytes (on x86_64). I'll attach a script that simply prints these entries. It turns out there are a lot of duplicate entries.
perl postings.pl cscope.po.out | sort -u | wc -l
So if we (somehow) managed to filter these duplicate entries at build time we save ((18452812 - 10904488) * 24 =) 181159776 bytes (ie, almost 180 MB).
2) A posting entry has this format (on x86_64):
offset  name  size  description
     0  line     8  offset into cscope.out db of (ascii) line number
     8  func     8  offset into cscope.out db of function name (or zero)
    16  path     3  index into cscope.out db list of filepaths (zero based)
    19  mark     1  cscope specific mark (at least 0x20, ' ')
    20  -        4  padding
For cscope.out db's smaller than 4 GB the line and func values fit in 4 bytes. This means that half of a posting entry is either empty or padding. I assume most cscope.out db's will be smaller than 4 GB (the db's for over 10 kernel trees fit in that size). Switching to 12 byte posting entries would save in my example (18452812 * 12 =) about 220 MB. (No surprise there.)
Both ideas combined save over 300 MB here.
3) ENOPATCH. Would it be worthwhile to further explore one or both of these ideas?
Created attachment 595944 [details]
simple parser of inverted index postings file
Usage: perl postings.pl cscope.po.out
Do you know why there are duplicate entries in the cscope db to begin with? It seems if there truly are duplicate entries it would be worth getting rid of them.
As for the file format, I'm really hesitant to change that. Even if we can fit the offsets into fewer bytes, we can't safely do that, as we would have to then introduce the concept of a variable format, dependent on file size, and that just sounds like a mess that isn't really worth dealing with.
The duplicate entries we should probably pursue though.
(In reply to comment #2)
> Do you know why there are duplicate entries in the cscope db to begin with?
> It seems if there truly are duplicate entries it would be worth getting rid
> of them.
0) I should have finished a parser for the index file (i.e., cscope.in.out) before opening this issue. I'll be using this comment as a public notebook to make that clear.
1) The index file is basically a list of all terms in the cscope database (the number of terms in that list equals the ASCII digits following the "-q" in the cscope db header line). Each term in the index file is matched to an offset into the postings file and a count of posts for that term. I.e., each term is linked to a run of consecutive posts, so the posts in the postings file are grouped by term.
2) Now, as I elaborated in an earlier comment, each post links to one or two offsets and one sourcefile in the cscope db: the line number offset, optionally a function name offset, and the sourcefile name.
3) Given this structure (a term in the index file pointing to one or more posts in the postings file, each post pointing to one or two offsets and one filename in the cscope db), duplicate entries in the postings file are to be expected. I haven't bothered to check anymore, but if one takes a line like
foo.c:24: bar = baz * 8;
that should lead to two posting entries, one for bar and one for baz, containing identical offsets, filename, and "mark" (they differ only in which term's run they belong to).
4) Summary: identical entries in the posting file should be expected.
> As for the file format, I'm really hesitant to change that. Even if we can
> fit the offsets into fewer bytes, we can't safely do that, as we would have
> to then introduce the concept of a variable format, dependent on file size,
> and that just sounds like a mess that isn't really worth dealing with.
5) Your call. I guess the code for handling a cscope db that somehow crosses the 4 GB size boundary would be rather tricky. Unless I can come up with a patch I won't bring up this issue again.
6) Thanks for your time and especially for asking the right question. I'll CLOSE this as NOTABUG.