Bug 837296 - inverted index: reduce size of postings file
inverted index: reduce size of postings file
Status: CLOSED NOTABUG
Product: Fedora
Classification: Fedora
Component: cscope (Show other bugs)
16
Unspecified Unspecified
unspecified Severity low
: ---
: ---
Assigned To: Neil Horman
Fedora Extras Quality Assurance
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2012-07-03 07:50 EDT by Paul Bolle
Modified: 2012-07-03 17:41 EDT (History)
1 user (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-07-03 17:41:29 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
simple parser of inverted index postings file (587 bytes, text/plain)
2012-07-03 07:54 EDT, Paul Bolle
no flags Details

  None (edit)
Description Paul Bolle 2012-07-03 07:50:27 EDT
Description of problem:
The postings file of the reverted index (eg cscope.po.out) can get quite big, especially on 64 bits machine. 

Version-Release number of selected component (if applicable):
15.8-1.fc16

How reproducible:
Always

Steps to Reproduce:
1. git ls-files "*.[ch]" | cscope -bqkv -i-
2.
3.
  
Actual results:
$ ls -1s --block-size=1 cscope.*
 66510848 cscope.in.out
314810368 cscope.out
442871808 cscope.po.out

Expected results:
$ ls -1s --block-size=1 cscope.*
 66510848 cscope.in.out
314810368 cscope.out
130853856 cscope.po.out

(This output is actually fake).

Additional info:
0) This test was done with a very recent git repository of the Linux kernel. It contains 31384 .c and .h files.

1) The postings file consists of 18452992 entries of 24 bytes (on x86_64). I'll attach a script that simply prints these entries. It turns out there are a lot of duplicate entries.

perl postings.pl cscope.po.out | sort -u | wc -l
10904488

So if we (somehow) managed to filter these duplicate entries at build time we save ((18452812 - 10904488) * 24 =) 181159776 bytes (ie, almost 180 MB).

2) A posting entry has this format (on x86_64):

offset name size description
0      line 8    offset into cscope.out db of (ascii) line number
8      func 8    offset into cscope.out db of function name (or zero)
16     path 3    index into cscope.out db list of filepaths (zero based)
19     mark 1    cscope specific mark (at least 0x20, ' ')
20          4    padding

For cscope.out db's smaller than 4 GB the line and func values fit in 4 bytes. This means that half of a posting entry is either empty or padding. I assume most cscope.out db's will be smaller than 4 GB (the db's for over 10 kernel trees fit in that size). Switching to 12 byte posting entries would save in my example (18452812 * 12 =) about 220 MB. (No surprise there.)

Both ideas combined save over 300 MB here.

3) ENOPATCH. Would it worthwhile to further explore one or both of these ideas?
Comment 1 Paul Bolle 2012-07-03 07:54:58 EDT
Created attachment 595944 [details]
simple parser of inverted index postings file

Usage: perl postings.pl cscope.po.out
Comment 2 Neil Horman 2012-07-03 10:28:02 EDT
Do you know why there are duplicate entries in the cscope db to begin with?  It seems if there truly are duplicate entries it would be worth getting rid of them.

As for the file format, I'm really hesitant to change that.  Even if we can fit the offsets into fewer bytes, we can't safely do that, as we would have to then introduce the concept of a variable format, dependent on file size, and that just sounds like a mess that isn't really worth dealing with.

The duplicate entries we should probably pursue though.
Comment 3 Paul Bolle 2012-07-03 17:41:29 EDT
(In reply to comment #2)
> Do you know why there are duplicate entries in the cscope db to begin with? 
> It seems if there truly are duplicate entries it would be worth getting rid
> of them.

0) I should have finished a parser for the index file (ie, cscope.in.out) before opening an issue. I'll be using this comment as a public notebook to make that clear.

1) The index file is basically a list of all terms in the cscope database (the number of terms in that list equals the ascii digit following the "-q" in the cscope db header line). Each of these terms in the index file is matched to an offset into the postings file and a count of posts for that term. Ie, the term is apparently linked to a number of consecutive posts and so the posts in the posting file are grouped by term.

2) Now, as I elaborated in an earlier comment, each post links to one or two offsets and one sourcefile in the cscope db: the line number offset, optionally a function name offset, and the sourcefile name.

3) Given this structure (term in index file, pointing to one or more posts in posting file, each pointing to one or two offsets and one filename in the cscope db) duplicate entries in the posting file are to be expected. I haven't bothered to check anymore, but if one takes a line like
    foo.c:24:    bar = baz * 8;

that should lead to two posting entries, one for bar and one for baz, containing identical offsets, filename, and "mark".

4) Summary: identical entries in the posting file should be expected.

> As for the file format, I'm really hesitant to change that.  Even if we can
> fit the offsets into fewer bytes, we can't safely do that, as we would have
> to then introduce the concept of a variable format, dependent on file size,
> and that just sounds like a mess that isn't really worth dealing with.

5) Your call. I guess the code for handling a cscope db that somehow crosses the 4 GB size boundary would be rather tricky. Unless I can come up with a patch I won't bring up this issue again.

6) Thanks for your time and especially for asking the right question. I'll CLOSE this as NOTABUG.

Note You need to log in before you can comment on or make changes to this bug.