Bug 851287 - [liblvm] stdio setvbuf & memory locking issues
Summary: [liblvm] stdio setvbuf & memory locking issues
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: lvm2
Version: 19
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Tony Asleson
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2012-08-23 17:09 UTC by Andy Grover
Modified: 2014-09-15 23:52 UTC (History)
11 users (show)

Fixed In Version: lvm2-2.02.98-1.fc19
Clone Of:
Environment:
Last Closed: 2014-09-15 23:15:52 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
core file (3.84 MB, application/x-gzip-compressed)
2012-08-23 21:06 UTC, Andy Grover
no flags Details
backtrace (25.17 KB, text/plain)
2012-08-23 21:36 UTC, Andy Grover
no flags Details

Description Andy Grover 2012-08-23 17:09:18 UTC
I am using lvm2-libs-2.02.95-6.fc17.x86_64 via python-lvm, in targetd, and experiencing Python interpreter crashes in different places. I believe there is memory corruption.

I've narrowed it down to if lvm_init() followed by lvm_quit() is called (via python-lvm "lvm_handle = lvm.Liblvm()" followed by "lvm_handle.close()")

lvm_init() without lvm_quit() doesn't crash, but of course is a leak. (targetd is a long-running daemon that repeatedly calls lvm_init/lvm_quit.)

I'm looking into it but wanted to open a bug too.

python-lvm:
https://github.com/agrover/python-lvm/blob/master/liblvm.c

targetd:
https://github.com/agrover/targetd/blob/master/targetd

Comment 1 Andy Grover 2012-08-23 21:06:26 UTC
Created attachment 606706 [details]
core file

Here's a core-file of the python executable when running targetd and hitting the issue. It's dying in a place in the interpreter where I can't believe there'd be a bug.

Comment 2 Andy Grover 2012-08-23 21:09:09 UTC
Also, adding dummy variables to targetd seems to sometimes make the crash go away. But, it never happens when lvm_quit() is not being called.

Comment 3 Alasdair Kergon 2012-08-23 21:33:00 UTC
I think it's important to check the right versions of all the libraries - and their dependencies are being used and the assumptions are consistent.  (reentrant? right udev lib linked?)

Try to reproduce outside python.

a quick unscientific comparison under valgrind with 2.02.97+ gives

==16906== Invalid write of size 4
==16906==    at 0x3C84283350: __GI_mempcpy (memcpy.S:91)
==16906==    by 0x3C84270FF6: _IO_file_xsputn@@GLIBC_2.2.5 (fileops.c:1329)
==16906==    by 0x3C842682DD: puts (ioputs.c:42)
==16906==    by 0x400B66: main <snip>
==16906==  Address 0x537ed82 is 4,098 bytes inside a block of size 8,192 free'd
==16906==    at 0x4A05187: free (vg_replace_malloc.c:325)
==16906==    by 0x4E6BF04: destroy_toolcontext (toolcontext.c:1565)
==16906==    by 0x4E5C870: lvm_quit (lvm_base.c:74)
==16906==    by 0x400B5C: main <snip>

and

==16906== Syscall param write(buf) points to unaddressable byte(s)
==16906==    at 0x3C842D34F0: __write_nocancel (syscall-template.S:82)
==16906==    by 0x3C842710A2: _IO_file_write@@GLIBC_2.2.5 (fileops.c:1268)
==16906==    by 0x3C84270CE9: new_do_write (fileops.c:522)
==16906==    by 0x3C84272224: _IO_do_write@@GLIBC_2.2.5 (fileops.c:495)
==16906==    by 0x3C842725A6: _IO_file_overflow@@GLIBC_2.2.5 (fileops.c:881)
==16906==    by 0x3C842683E0: puts (ioputs.c:43)
==16906==    by 0x400B66: main <snip>

which don't appear with 2.02.84.

Comment 4 Andy Grover 2012-08-23 21:36:53 UTC
Created attachment 606713 [details]
backtrace

Comment 5 Andy Grover 2012-08-23 23:00:41 UTC
yay I can bisect it, v2_02_84 seems fine. Will have the bisect results shortly.

Comment 6 Alasdair Kergon 2012-08-24 01:10:52 UTC
My guess (hinted at by the valgrind output) would be release 2.02.86 eee66d2a80df173126d0db4f345fbf601c50859b where setvbuf() is being used incorrectly.

toolcontext.c:

                if ((setvbuf(stdin, cmd->linebuffer, _IOLBF, linebuffer_size) ||
                     setvbuf(stdout, cmd->linebuffer + linebuffer_size,
                             _IOLBF, linebuffer_size))) {
                        log_sys_error("setvbuf", "");
                        goto out;
                }

and

                /* Reset stream buffering to defaults */
                setlinebuf(stdin);
                fflush(stdout);
                setlinebuf(stdout);


Both these invoke behaviour undefined by the standards.

Setting the buffer must be done *before* the streams are used.

Instead - in both places - the code should close the streams and open them afresh (from dup()s) before calling setvbuf/setlinebuf.

Comment 7 Bryn M. Reeves 2012-08-24 09:15:36 UTC
We'd run that codepath (via lvm2_init()->cmdlib_lvm2_init()->init_lvm()->create_toolcontext()) in the heap corruption problem that was reported on linux-lvm at the beginning of the month:

https://www.redhat.com/archives/linux-lvm/2012-July/msg00052.html

His working and non-working releases would also be in the right range for this (.66 and .96).

Comment 8 Zdenek Kabelac 2012-08-24 09:26:17 UTC
I think the key issue is - the  lvm is not designed for being called     multiple times with  lvm_init/exit()   - it should never be used this way.

There are multiple static variables for the lifetime of process which
are not reinitialized.  So any user of liblvm should stay with 1 call
for both these functions.

Issue with  buffer setting is there to be sure, we have buffer in memory critical situation (i.e. not getting stuck in out-of-memory case and being able to resume suspended devices which may hold swap)

Comment 9 Bryn M. Reeves 2012-08-24 09:35:25 UTC
Why provide *_exit/_quit at all then? If it can't fully tear down the context such that it can be re-initialised in the same process it seems pretty useless. 

Just exit the process and leave the c runtime to tidy up (maybe I missed it but that path does not seem to tear down any process-external resources e.g. files or locks and regular file system locks will be freed on exit anyway).

Even if this is intended - it should have a large warning sign to avoid surprising users of the interface - I doubt most people would expect a library to behave like that.

Comment 10 Alasdair Kergon 2012-08-24 13:22:49 UTC
As Andy pointed out on bug 835717 the library should be usable in a mode that doesn't touch stdin/stdout, and then this particular problem (from python) will go away.

It's not 'one size fits all'.

statics shouldn't be a problem for init/quit: if there are any that cause trouble, they need identifying and fixing.

Callers do need to weigh up whether to use the library or fork/exec (memory usage/performance).  As the lvmetad work progresses the balance may change.  Ultimately the library may become thinner with another daemon (alongside lvmetad) taking on increasing amounts of work.

Comment 11 Andy Grover 2012-08-25 00:14:02 UTC
yes eee66d2a80df173126d0db4f345fbf601c50859b is the commit that causes the issue.

Comment 12 Alasdair Kergon 2012-08-26 00:26:55 UTC
I've turned setvbuf off in liblvm upstream and I'm leaning towards removing this completely from the code.

Commit eee66d2a80df173126d0db4f345fbf601c50859b said:

    When glibc needs buffers for line buffering of input and output buffers, it
    allocates these buffers in such way it adds memory page for each such buffer
    and size of unlock memory check will mismatch by 1 or 2 pages.
    
    This happens when we print or read lines without '\n' so these buffers are
    used. To avoid this extra allocation, use setvbuf to set these buffers ahead.
    

But as we've seen, the original implementation was incorrect (it assumed nothing had ever used stdin/stdout before the setvbuf calls).  None of our code should ever be using stdin/stdout/stderr while devices are suspended (unless we hit irrecoverable errors) - assertions in the code should detect that.  When devices aren't suspended but memory is still locked, it's OK to use stdin/stdout/stderr and it's OK for them to allocate extra pages if that's what they are doing.  And since nothing should access the streams during activation it shouldn't really matter whether or not those new pages end up locked.  So I don't see justification for using setvbuf.

- Is memory left locked between any liblvm calls?
Regardless, the implications of memory locking ought to be documented in lvm2app.h.

Comment 13 Andy Grover 2012-08-26 00:49:01 UTC
I may not be up on all the issues but I agree, liblvm should avoid calling setvbuf.

Thanks Alasdair for the quick progress on this issue.

Will this make it into Fedora 18? :)

Comment 14 Zdenek Kabelac 2012-08-27 08:16:33 UTC
(In reply to comment #12)
  
> 
> But as we've seen, the original implementation was incorrect (it assumed
> nothing had ever used stdin/stdout before the setvbuf calls).  None of our

Yes - it's been written with assumption that initilization part is called
before any other action of  lvm library user.

> code should ever be using stdin/stdout/stderr while devices are suspended
> (unless we hit irrecoverable errors) - assertions in the code should detect
> that.  When devices aren't suspended but memory is still locked, it's OK to
> use stdin/stdout/stderr and it's OK for them to allocate extra pages if
> that's what they are doing.  And since nothing should access the streams
> during activation it shouldn't really matter whether or not those new pages
> end up locked.  So I don't see justification for using setvbuf.
> 
> - Is memory left locked between any liblvm calls?

Few things come to my mind here -

We need to be able to print error messages while being suspended (we do not have any internal buffering for this) and to succeed to resume, there should not be any memory allocation before resume - however glibc AFAIK has not been using 'standard' malloc mechanism for vbuf - but direct mmap trick - thus ignored our preallocated memory chunk.

Number of our test suite cases were exercising this behavior triggering error paths and invoking allocation of vbuf memory in the middle of suspend - at that time the easiest solution was to preallocate vbuf ahead - but the external user of library calling such init function past the generation of stdout was not expected.

> Regardless, the implications of memory locking ought to be documented in
> lvm2app.h.

Comment 15 Alasdair Kergon 2013-03-14 02:53:26 UTC
Summary: the bug was present in releases 2.02.86 to 2.02.97 inclusive.

It was fixed in 2.02.98.

We've kept this rawhide bugzilla open to consider the other matters brought up in the discussion.

Comment 16 Fedora End Of Life 2013-04-03 19:40:41 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 19 development cycle.
Changing version to '19'.

(As we did not run this process for some time, it could affect also pre-Fedora 19 development
cycle bugs. We are very sorry. It will help us with cleanup during Fedora 19 End Of Life. Thank you.)

More information and reason for this action is here:
https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora19


Note You need to log in before you can comment on or make changes to this bug.