Bug 1017957 - lbzip2 thread crashes with signal SIGSEGV
Summary: lbzip2 thread crashes with signal SIGSEGV
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: lbzip2
Version: 19
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Mikolaj Izdebski
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2013-10-10 20:43 UTC by Will Bending
Modified: 2013-11-10 06:35 UTC (History)

Fixed In Version: 2.2-4
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-11-06 09:42:58 UTC
Type: Bug


Attachments
backtrace (2.34 KB, text/plain), 2013-10-10 20:43 UTC, Will Bending
Backtrace with more information (13.01 KB, text/plain), 2013-10-21 00:48 UTC, Will Bending
Dump of assembler code for function retrieve (58.26 KB, text/plain), 2013-10-21 01:00 UTC, Will Bending
Dump of assembler code for function do_retrieve (10.66 KB, text/plain), 2013-10-21 01:05 UTC, Will Bending
GDB session better look at variables (4.55 KB, text/plain), 2013-10-22 05:46 UTC, Will Bending
C to calculate value s (578 bytes, text/x-csrc), 2013-10-22 05:50 UTC, Will Bending
Process memory map (3.42 KB, text/plain), 2013-10-22 05:57 UTC, Will Bending
non optimized srpm (612.04 KB, application/x-rpm), 2013-10-23 19:52 UTC, Will Bending
non optimized binary rpm (86.38 KB, application/x-rpm), 2013-10-23 19:54 UTC, Will Bending
non optimized debug symbols (155.70 KB, application/x-rpm), 2013-10-23 19:54 UTC, Will Bending
gdb session non optimized binary (13.77 KB, text/plain), 2013-10-23 19:57 UTC, Will Bending
assembler for non optimized binary (77.51 KB, text/plain), 2013-10-23 19:59 UTC, Will Bending
process memory map non optimized binary (3.42 KB, text/plain), 2013-10-23 20:00 UTC, Will Bending

Description Will Bending 2013-10-10 20:43:05 UTC
Created attachment 810741 [details]
backtrace

Description of problem:
An lbzip2 worker thread crashes with signal 11 (SIGSEGV).  In my case this happens during an attempted file restore using Amanda's amrecover command.  lbzip2 is the compression program in my Amanda configuration, and amrecover invokes it through a Bash wrapper script with -n 5 to use 5 CPU cores.

Version-Release number of selected component (if applicable):
2.2-2.fc19

How reproducible:
100% in my Amanda configuration.  Unsure how reproducible elsewhere.

Steps to Reproduce:
1. configure lbzip2 as Amanda's compression program with -n 5
2. perform backup testing
3. restore multiple files/large directories of jpg images with amrecover

Actual results:
lbzip2 segfaults, aborting the restore

Expected results:
restore succeeds

Additional info:
backtrace attached

Comment 1 Lukas Zapletal 2013-10-15 10:11:32 UTC
Hello,

would you mind trying out with 2.3 version? I built the RPM package for you.

http://koji.fedoraproject.org/koji/taskinfo?taskID=6061595

What platform are you trying this on? Is that Intel or ARM? Thanks.

Comment 2 Mikolaj Izdebski 2013-10-15 19:41:13 UTC
I am upstream maintainer of lbzip2.

To be able to analyze the case I need a reproducer (compressed file which causes segfault during decompression). If the file is too big (larger than several MB) then you can use lbzrecover (from lbzip2-utils) to cut it to smaller pieces and attach one of the small files.

If for some reason (confidential data etc.) you cannot include reproducer (even send privately to me) then recompile lbzip2 with debugging information and with no optimisation and make sure that segfault can be reproduced consistently (happens on the same instruction). Then attach lbzip2 binary you used with detailed backtrace and register dump (the more information the better). I will try debugging the problem, but without reproducer it may be impossible.

Comment 4 Will Bending 2013-10-19 13:50:54 UTC
(In reply to Lukas Zapletal from comment #1)
> Hello,
> 
> would you mind trying out with 2.3 version? I built the RPM package for you.
> 
> http://koji.fedoraproject.org/koji/taskinfo?taskID=6061595
> 
> What platform are you trying this on? Is that Intel or ARM? Thanks.

It looks like 2.3 exhibits the same segfault.

Comment 5 Will Bending 2013-10-19 14:01:41 UTC
(In reply to Mikolaj Izdebski from comment #2)
> I am upstream maintainer of lbzip2.
> 
> To be able to analyze the case I need a reproducer (compressed file which
> causes segfault during decompression). If the file is too big (larger than
> several MB) then you can use lbzrecover (from lbzip2-utils) to cut it to
> smaller pieces and attach one of the small files.
> 
> If for some reason (confidential data etc.) you cannot include reproducer
> (even send privately to me) then recompile lbzip2 with debugging information
> and with no optimisation and make sure that segfault can be reproduced
> consistently (happens on the same instruction). Then attach lbzip2 binary
> you used with detailed backtrace and register dump (the more information the
> better). I will try debugging the problem, but without reproducer it may be
> impossible.

Yes, unfortunately the data is confidential.  The process runs for about 1 to 1.5 hours before the crash happens, and the input is a tar image that is ~230G in size.

I'll try to create a reproducer tar file with random data.

Comment 6 Will Bending 2013-10-19 16:57:49 UTC
(In reply to Lukas Zapletal from comment #1)

> What platform are you trying this on? Is that Intel or ARM? Thanks.

I forgot to say: Intel x86_64.

Comment 7 Will Bending 2013-10-20 23:19:22 UTC
(In reply to Mikolaj Izdebski from comment #2)
> I am upstream maintainer of lbzip2.
> 
> To be able to analyze the case I need a reproducer (compressed file which
> causes segfault during decompression). If the file is too big (larger than
> several MB) then you can use lbzrecover (from lbzip2-utils) to cut it to
> smaller pieces and attach one of the small files.
> 
> If for some reason (confidential data etc.) you cannot include reproducer
> (even send privately to me) then recompile lbzip2 with debugging information
> and with no optimisation and make sure that segfault can be reproduced
> consistently (happens on the same instruction). Then attach lbzip2 binary
> you used with detailed backtrace and register dump (the more information the
> better). I will try debugging the problem, but without reproducer it may be
> impossible.

So far I have been unable to reproduce this with random data files as input.  I am trying to reproduce it again in gdb against the real data input.

Are there any specific gdb command outputs I can provide?

Comment 8 Will Bending 2013-10-21 00:48:02 UTC
Created attachment 814353 [details]
Backtrace with more information

Reproduced again.  More debugging information is attached.

Comment 9 Will Bending 2013-10-21 01:00:06 UTC
Created attachment 814358 [details]
Dump of assembler code for function retrieve

Dump of assembler code for function retrieve.

Comment 10 Will Bending 2013-10-21 01:05:33 UTC
Created attachment 814367 [details]
Dump of assembler code for function do_retrieve

Dump of assembler code for function do_retrieve

Comment 11 Mikolaj Izdebski 2013-10-21 09:03:52 UTC
Thank you for more detailed information.
I will analyze the data you provided and try to reproduce the crash.
Just to confirm, these dumps are for lbzip2-2.2-2.fc19.x86_64?

Comment 12 Will Bending 2013-10-21 12:52:56 UTC
(In reply to Mikolaj Izdebski from comment #11)
> Thank you for more detailed information.
> I will analyze the data you provided and try to reproduce the crash.
> Just to confirm, these dumps are for lbzip2-2.2-2.fc19.x86_64?

Yes lbzip2-2.2-2.fc19.x86_64 and lbzip2-debuginfo-2.2-2.fc19.x86_64 are what I'm working with.

Thanks

Comment 13 Will Bending 2013-10-22 05:46:52 UTC
Created attachment 814852 [details]
GDB session better look at variables

I reproduced this again with gdb attached.  When I try to evaluate s = T->perm[T->count[k] + ((v - T->base[k]) >> (64 - k))]; in the crashing frame, GDB reports that it cannot access the memory.

Comment 14 Will Bending 2013-10-22 05:50:13 UTC
Created attachment 814854 [details]
C to calculate value s

This C program is what I used to calculate s value.  There is probably a way to make gdb do it, but this was faster for me.

Comment 15 Will Bending 2013-10-22 05:54:53 UTC
(In reply to Will Bending from comment #14)
> Created attachment 814854 [details]
> C to calculate value s
> 
> This C program is what I used to calculate s value.  There is probably a way
> to make gdb do it, but this was faster for me.

Correction: the program calculates the index into T->perm[], that is T->count[k] + ((v - T->base[k]) >> (64 - k)), not the value of s itself, although the contents of that memory location would be assigned to the variable s.

(gdb) print T->perm[2826676582]
Cannot access memory at address 0x7fb604f81670
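
The index arithmetic discussed in this comment can be reproduced outside GDB with a few lines of C.  What follows is only a minimal sketch, not the attached program and not lbzip2 source: the helper name perm_index and the integer widths (64-bit v and base[k], 32-bit count[k]) are assumptions chosen to match the (64 - k) shift in the expression.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from lbzip2): computes the index used in
   T->perm[T->count[k] + ((v - T->base[k]) >> (64 - k))].
   The integer widths are assumptions made for this sketch. */
static uint64_t perm_index(uint64_t v, uint64_t base_k, uint32_t count_k,
                           unsigned k)
{
    /* Shifting a 64-bit value by 64 or more bits is undefined in C,
       so this sketch only accepts 1 <= k <= 63. */
    assert(k >= 1 && k <= 63);
    return (uint64_t)count_k + ((v - base_k) >> (64 - k));
}

/* Usage example with made-up values. */
static void self_check_index(void)
{
    /* Top bit set, k = 1: the shifted difference is 1. */
    assert(perm_index(0x8000000000000000ULL, 0, 0, 1) == 1);
    /* Difference of one 4-bit code step, plus an offset of 5. */
    assert(perm_index(0xF000000000000000ULL, 0xE000000000000000ULL, 5, 4) == 6);
    /* If v is below base_k the unsigned subtraction wraps around and the
       index becomes very large. */
    assert(perm_index(0, 1ULL << 44, 0, 20) == 1048575);
}
```

The last check shows one way an inconsistent table entry could turn into an index far outside a small array; whether that is what happened in this crash is not established by the sketch.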

Comment 16 Will Bending 2013-10-22 05:57:34 UTC
Created attachment 814855 [details]
Process memory map

Here is the map of memory at time of the crash in attachment 814852 [details].
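
A memory map is useful here because it shows whether the faulting address falls inside any mapped region.  As a rough sketch of that check, assuming the usual "start-end perms ..." layout of a line in /proc/<pid>/maps, the hypothetical helper below (not part of lbzip2 or of any attachment) tests one line against one address.  The mapping line in the usage example is made up for illustration.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if addr lies inside the range described by
   one line of /proc/<pid>/maps ("start-end perms offset dev inode path"),
   0 if it does not, and -1 if the line cannot be parsed. */
static int addr_in_mapping(const char *maps_line, uint64_t addr)
{
    uint64_t start, end;

    if (sscanf(maps_line, "%" SCNx64 "-%" SCNx64, &start, &end) != 2)
        return -1;
    return addr >= start && addr < end;   /* the end address is exclusive */
}

/* Usage example with a made-up mapping line (not taken from the attachment). */
static void self_check_mapping(void)
{
    const char *line = "7fb5f8000000-7fb5f8021000 rw-p 00000000 00:00 0";

    assert(addr_in_mapping(line, 0x7fb5f8000010ULL) == 1);
    assert(addr_in_mapping(line, 0x7fb604f81670ULL) == 0);
    assert(addr_in_mapping("not a maps line", 0) == -1);
}
```

Running the helper over every line of the map and getting 0 each time would confirm that the address GDB complained about is simply not mapped.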

Comment 17 Will Bending 2013-10-23 14:59:22 UTC
(In reply to Will Bending from comment #16)
> Created attachment 814855 [details]
> Process memory map
> 
> Here is the map of memory at time of the crash in attachment 814852 [details].

Another correction: the memory map is in attachment 814855 [details].

I am working on reproducing it again.  So far it seems to crash reliably at the same place, although I have seen very different variable values in the scope of the frame where this illegal access happens.

Comment 18 Mikolaj Izdebski 2013-10-23 15:01:53 UTC
Thank you for helping with this bug.  I really would like to fix it, but I won't have time until Sunday.

Comment 19 Will Bending 2013-10-23 15:20:00 UTC
(In reply to Mikolaj Izdebski from comment #18)
> Thank you for helping with this bug.  I really would like to fix it, but I
> won't have time until Sunday.

I understand.  It is not holding me up on this project, as I switched to pigz, although I would prefer this algorithm.  It has a better compression ratio, and I can live with the performance loss because I can get more data on the LTO3 tape.

Let me know if I can get anything else useful out of GDB or /proc for you.  I am recording the next crash, so hopefully running it in reverse will show something more.

Comment 20 Will Bending 2013-10-23 19:52:58 UTC
Created attachment 815533 [details]
non optimized srpm

srpm used to build the binary with optimizations turned off.

%configure \
  --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= \
--disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin \
--sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include \
--libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var \
--sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info
  make %{?_smp_mflags} CFLAGS='-O0 -g -pipe -Wall -Wp,-fexceptions -fstack-protector --param=ssp-buffer-size=4 -grecord-gcc-switches  -m64 -mtune=generic' \
CXXFLAGS='-O0 -g -pipe -Wall -Wp,-fexceptions -fstack-protector --param=ssp-buffer-size=4 -grecord-gcc-switches  -m64 -mtune=generic' \
FFLAGS='-O0 -g -pipe -Wall -Wp,-fexceptions -fstack-protector --param=ssp-buffer-size=4 -grecord-gcc-switches  -m64 -mtune=generic -I/usr/lib64/gfortran/modules' \
FCFLAGS='-O0 -g -pipe -Wall -Wp,-fexceptions -fstack-protector --param=ssp-buffer-size=4 -grecord-gcc-switches  -m64 -mtune=generic -I/usr/lib64/gfortran/modules' \
LDFLAGS='-Wl,-z,relro '

Comment 21 Will Bending 2013-10-23 19:54:05 UTC
Created attachment 815534 [details]
non optimized binary rpm

non optimized binary rpm

Comment 22 Will Bending 2013-10-23 19:54:43 UTC
Created attachment 815535 [details]
non optimized debug symbols

non optimized debug symbols

Comment 23 Will Bending 2013-10-23 19:57:43 UTC
Created attachment 815536 [details]
gdb session non optimized binary

This is a good gdb session of the non-optimized binary having the crash.  All variables in frame 0 scope are shown.

Comment 24 Will Bending 2013-10-23 19:59:25 UTC
Created attachment 815537 [details]
assembler for non optimized binary

The assembler code for the non optimized binary.

Comment 25 Will Bending 2013-10-23 20:00:25 UTC
Created attachment 815538 [details]
process memory map non optimized binary

process memory map non optimized binary

Comment 26 Will Bending 2013-10-23 20:12:49 UTC
(In reply to Mikolaj Izdebski from comment #18)
> Thank you for helping with this bug.  I really would like to fix it, but I
> won't have time until Sunday.

I went ahead and rebuilt lbzip2 with optimizations switched off like you asked originally.  It reproduces in the same call, and the data structures are easy to look at.

So to me it would appear this has a very out-of-bounds k value when considering arrays base[] and count[].

I intend to leave this GDB session paused where it is if possible; if not, it is easy enough to reproduce.  Let me know if there is anything you want me to look at in the GDB session.  Unfortunately, reverse execution only takes me back to the point where GDB attached and interrupted the process; maybe I am using 'record' incorrectly.

Comment 27 Mikolaj Izdebski 2013-10-27 10:16:17 UTC
(In reply to Will Bending from comment #26)
> So to me it would appear this has a very out-of-bounds k value when
> considering arrays base[] and count[].

That seems to be the problem.  The base[] array seems to have an incorrect
element base[21], which causes k to run out of bounds (the valid range is
1 <= k <= 20).

I have prepared a fix.  Because I don't have a reproducer, I cannot
verify the fix myself.  Will, could you check if the following RPM
fixes the problem for you?

http://mizdebsk.fedorapeople.org/lbzip2-2.2-3.fc21.0.1.x86_66.rpm
http://mizdebsk.fedorapeople.org/lbzip2-2.2-3.fc21.0.1.src.rpm
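
To illustrate how a single bad table entry can push k out of range, here is a minimal sketch of a length search over left-justified code bounds.  It is not lbzip2's code and not the actual fix: the layout (codes of length k occupy the range from base[k] up to, but not including, base[k + 1], with base[21] acting as a terminating sentinel) and all names are assumptions made for this example.  The sketch bounds the loop explicitly, so a corrupted sentinel is reported instead of letting k run past 20.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_LEN 20   /* longest code length, per the bound 1 <= k <= 20 */

/* Hypothetical length search: returns the smallest k in 1..MAX_LEN with
   v < base[k + 1], or 0 if no such k exists.  An unbounded version of this
   loop would keep incrementing k past MAX_LEN whenever base[MAX_LEN + 1]
   is not larger than v. */
static unsigned code_length(const uint64_t base[MAX_LEN + 2], uint64_t v)
{
    for (unsigned k = 1; k <= MAX_LEN; k++) {
        if (v < base[k + 1])
            return k;
    }
    return 0;   /* sentinel did not stop the search: table is inconsistent */
}

/* Builds a small made-up table (one 1-bit code "0", one 2-bit code "10",
   no longer codes) with the given sentinel, then looks up v. */
static unsigned demo_lookup(uint64_t v, uint64_t sentinel)
{
    uint64_t base[MAX_LEN + 2] = {0};

    base[2] = 0x8000000000000000ULL;          /* end of the 1-bit range */
    for (unsigned k = 3; k <= MAX_LEN; k++)
        base[k] = 0xC000000000000000ULL;      /* end of the 2-bit range */
    base[MAX_LEN + 1] = sentinel;
    return code_length(base, v);
}

/* Usage example. */
static void self_check_length(void)
{
    assert(demo_lookup(0x4000000000000000ULL, UINT64_MAX) == 1);
    assert(demo_lookup(0x9000000000000000ULL, UINT64_MAX) == 2);
    /* With a sane sentinel the search still stops within bounds. */
    assert(demo_lookup(0xD000000000000000ULL, UINT64_MAX) == MAX_LEN);
    /* With a bad sentinel (0 here) nothing stops the search. */
    assert(demo_lookup(0xD000000000000000ULL, 0) == 0);
}
```

In the last case an unchecked loop would go on to read base[22] and beyond, which matches the kind of out-of-range k described in the comment above.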

Comment 29 Will Bending 2013-10-27 16:46:37 UTC
(In reply to Mikolaj Izdebski from comment #28)
> (In reply to Mikolaj Izdebski from comment #27)
> > http://mizdebsk.fedorapeople.org/lbzip2-2.2-3.fc21.0.1.x86_66.rpm
> > http://mizdebsk.fedorapeople.org/lbzip2-2.2-3.fc21.0.1.src.rpm
> 
> There is a typo.  Of course I meant:
> 
> http://mizdebsk.fedorapeople.org/lbzip2-2.2-3.fc21.0.1.x86_64.rpm
> http://mizdebsk.fedorapeople.org/lbzip2-2.2-3.fc21.0.1.src.rpm

Hey looks like you have a good fix.  Nice work.

I built a binary from this srpm and confirm I am not seeing the crash.  It looks like Amanda's amrecover completed successfully.

I will continue testing with other large restores from this data set and let you know results.

Thanks very much.

Comment 30 Mikolaj Izdebski 2013-10-27 18:49:15 UTC
Fixed in lbzip2-2.2-4

Comment 31 Mikolaj Izdebski 2013-10-27 18:52:34 UTC
(In reply to Will Bending from comment #29)
> Hey looks like you have a good fix.  Nice work.
> 
> I built a binary from this srpm and confirm I am not seeing the crash.  It
> looks like Amanda's amrecover completed successfully.
> 
> I will continue testing with other large restores from this data set and let
> you know results.
> 
> Thanks very much.

I'm glad the fix works for you.  I'll create an update soon.

Thank you for taking time in reporting this and providing all the details.  Without backtrace and data structure dumps I wouldn't be able to fix it.

Comment 32 Fedora Update System 2013-10-27 19:01:12 UTC
lbzip2-2.2-4.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/lbzip2-2.2-4.fc20

Comment 33 Fedora Update System 2013-10-27 19:01:55 UTC
lbzip2-2.2-4.fc19 has been submitted as an update for Fedora 19.
https://admin.fedoraproject.org/updates/lbzip2-2.2-4.fc19

Comment 34 Fedora Update System 2013-10-27 19:02:29 UTC
lbzip2-2.2-4.fc18 has been submitted as an update for Fedora 18.
https://admin.fedoraproject.org/updates/lbzip2-2.2-4.fc18

Comment 35 Will Bending 2013-10-28 00:38:02 UTC
(In reply to Mikolaj Izdebski from comment #31)
> (In reply to Will Bending from comment #29)
> > Hey looks like you have a good fix.  Nice work.
> > 
> > I built a binary from this srpm and confirm I am not seeing the crash.  It
> > looks like Amanda's amrecover completed successfully.
> > 
> > I will continue testing with other large restores from this data set and let
> > you know results.
> > 
> > Thanks very much.
> 
> I'm glad the fix works for you.  I'll create an update soon.
> 
> Thank you for taking time in reporting this and providing all the details. 
> Without backtrace and data structure dumps I wouldn't be able to fix it.

Glad to have helped.  I have tested several more times today and cannot reproduce the crash with this patch applied.  I will switch my backup compression method back to lbzip2 and finish evaluating Amanda.  Thanks for the quick fix.

Comment 36 Fedora Update System 2013-11-06 07:36:22 UTC
lbzip2-2.2-4.fc19 has been pushed to the Fedora 19 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 37 Fedora Update System 2013-11-06 07:38:19 UTC
lbzip2-2.2-4.fc18 has been pushed to the Fedora 18 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 38 Mikolaj Izdebski 2013-11-06 09:42:58 UTC
I believe that this bug is fixed in lbzip2-2.2-4,
which is available in updates for Fedora 19, so I am closing this bug now.

The build containing the fix can be found at Koji:
http://koji.fedoraproject.org/koji/buildinfo?buildID=474109

Comment 39 Fedora Update System 2013-11-10 06:35:50 UTC
lbzip2-2.2-4.fc20 has been pushed to the Fedora 20 stable repository.  If problems still persist, please make note of it in this bug report.

