I am using lvm2-libs-2.02.95-6.fc17.x86_64 via python-lvm, in targetd, and experiencing Python interpreter crashes in different places. I believe there is memory corruption. I've narrowed it down to if lvm_init() followed by lvm_quit() is called (via python-lvm "lvm_handle = lvm.Liblvm()" followed by "lvm_handle.close()") lvm_init() without lvm_quit() doesn't crash, but of course is a leak. (targetd is a long-running daemon that repeatedly calls lvm_init/lvm_quit.) I'm looking into it but wanted to open a bug too. python-lvm: https://github.com/agrover/python-lvm/blob/master/liblvm.c targetd: https://github.com/agrover/targetd/blob/master/targetd
Created attachment 606706 [details] core file Here's a core-file of the python executable when running targetd and hitting the issue. It's dying in a place in the interpreter where I can't believe there'd be a bug.
Also, adding dummy variables to targetd seems to sometimes make the crash go away. But, it never happens when lvm_quit() is not being called.
I think it's important to check the right versions of all the libraries - and their dependencies are being used and the assumptions are consistent. (reentrant? right udev lib linked?) Try to reproduce outside python. a quick unscientific comparison under valgrind with 2.02.97+ gives ==16906== Invalid write of size 4 ==16906== at 0x3C84283350: __GI_mempcpy (memcpy.S:91) ==16906== by 0x3C84270FF6: _IO_file_xsputn@@GLIBC_2.2.5 (fileops.c:1329) ==16906== by 0x3C842682DD: puts (ioputs.c:42) ==16906== by 0x400B66: main <snip> ==16906== Address 0x537ed82 is 4,098 bytes inside a block of size 8,192 free'd ==16906== at 0x4A05187: free (vg_replace_malloc.c:325) ==16906== by 0x4E6BF04: destroy_toolcontext (toolcontext.c:1565) ==16906== by 0x4E5C870: lvm_quit (lvm_base.c:74) ==16906== by 0x400B5C: main <snip> and ==16906== Syscall param write(buf) points to unaddressable byte(s) ==16906== at 0x3C842D34F0: __write_nocancel (syscall-template.S:82) ==16906== by 0x3C842710A2: _IO_file_write@@GLIBC_2.2.5 (fileops.c:1268) ==16906== by 0x3C84270CE9: new_do_write (fileops.c:522) ==16906== by 0x3C84272224: _IO_do_write@@GLIBC_2.2.5 (fileops.c:495) ==16906== by 0x3C842725A6: _IO_file_overflow@@GLIBC_2.2.5 (fileops.c:881) ==16906== by 0x3C842683E0: puts (ioputs.c:43) ==16906== by 0x400B66: main <snip> which don't appear with 2.02.84.
Created attachment 606713 [details] backtrace
yay I can bisect it, v2_02_84 seems fine. Will have the bisect results shortly.
My guess (hinted at by the valgrind output) would be release 2.02.86 eee66d2a80df173126d0db4f345fbf601c50859b where setvbuf() is being used incorrectly. toolcontext.c: if ((setvbuf(stdin, cmd->linebuffer, _IOLBF, linebuffer_size) || setvbuf(stdout, cmd->linebuffer + linebuffer_size, _IOLBF, linebuffer_size))) { log_sys_error("setvbuf", ""); goto out; } and /* Reset stream buffering to defaults */ setlinebuf(stdin); fflush(stdout); setlinebuf(stdout); Both these invoke behaviour undefined by the standards. Setting the buffer must be done *before* the streams are used. Instead - in both places - the code should close the streams and open them afresh (from dup()s) before calling setvbuf/setlinebuf.
We'd run that codepath (via lvm2_init()->cmdlib_lvm2_init()->init_lvm()->create_toolcontext()) in the heap corruption problem that was reported on linux-lvm at the beginning of the month: https://www.redhat.com/archives/linux-lvm/2012-July/msg00052.html His working and non-working releases would also be in the right range for this (.66 and .96).
I think the key issue is - the lvm is not designed for being called multiple times with lvm_init/exit() - it should never be used this way. There are multiple static variables for the lifetime of process which are not reinitialized. So any user of liblvm should stay with 1 call for both these functions. Issue with buffer setting is there to be sure, we have buffer in memory critical situation (i.e. not getting stuck in out-of-memory case and being able to resume suspended devices which may hold swap)
Why provide *_exit/_quit at all then? If it can't fully tear down the context such that it can be re-initialised in the same process it seems pretty useless. Just exit the process and leave the c runtime to tidy up (maybe I missed it but that path does not seem to tear down any process-external resources e.g. files or locks and regular file system locks will be freed on exit anyway). Even if this is intended - it should have a large warning sign to avoid surprising users of the interface - I doubt most people would expect a library to behave like that.
As Andy pointed out on bug 835717 the library should be usable in a mode that doesn't touch stdin/stdout, and then this particular problem (from python) will go away. It's not 'one size fits all'. statics shouldn't be a problem for init/quit: if there are any that cause trouble, they need identifying and fixing. Callers do need to weigh up whether to use the library or fork/exec (memory usage/performance). As the lvmetad work progresses the balance may change. Ultimately the library may become thinner with another daemon (alongside lvmetad) taking on increasing amounts of work.
yes eee66d2a80df173126d0db4f345fbf601c50859b is the commit that causes the issue.
I've turned setvbuf off in liblvm upstream and I'm leaning towards removing this completely from the code. Commit eee66d2a80df173126d0db4f345fbf601c50859b said: When glibc needs buffers for line buffering of input and output buffers, it allocates these buffers in such way it adds memory page for each such buffer and size of unlock memory check will mismatch by 1 or 2 pages. This happens when we print or read lines without '\n' so these buffers are used. To avoid this extra allocation, use setvbuf to set these buffers ahead. But as we've seen, the original implementation was incorrect (it assumed nothing had ever used stdin/stdout before the setvbuf calls). None of our code should ever be using stdin/stdout/stderr while devices are suspended (unless we hit irrecoverable errors) - assertions in the code should detect that. When devices aren't suspended but memory is still locked, it's OK to use stdin/stdout/stderr and it's OK for them to allocate extra pages if that's what they are doing. And since nothing should access the streams during activation it shouldn't really matter whether or not those new pages end up locked. So I don't see justification for using setvbuf. - Is memory left locked between any liblvm calls? Regardless, the implications of memory locking ought to be documented in lvm2app.h.
I may not be up on all the issues but I agree, liblvm should avoid calling setvbuf. Thanks Alasdair for the quick progress on this issue. Will this make it into Fedora 18? :)
(In reply to comment #12) > > But as we've seen, the original implementation was incorrect (it assumed > nothing had ever used stdin/stdout before the setvbuf calls). None of our Yes - it's been written with assumption that initilization part is called before any other action of lvm library user. > code should ever be using stdin/stdout/stderr while devices are suspended > (unless we hit irrecoverable errors) - assertions in the code should detect > that. When devices aren't suspended but memory is still locked, it's OK to > use stdin/stdout/stderr and it's OK for them to allocate extra pages if > that's what they are doing. And since nothing should access the streams > during activation it shouldn't really matter whether or not those new pages > end up locked. So I don't see justification for using setvbuf. > > - Is memory left locked between any liblvm calls? Few things come to my mind here - We need to be able to print error messages while being suspended (we do not have any internal buffering for this) and to succeed to resume, there should not be any memory allocation before resume - however glibc AFAIK has not been using 'standard' malloc mechanism for vbuf - but direct mmap trick - thus ignored our preallocated memory chunk. Number of our test suite cases were exercising this behavior triggering error paths and invoking allocation of vbuf memory in the middle of suspend - at that time the easiest solution was to preallocate vbuf ahead - but the external user of library calling such init function past the generation of stdout was not expected. > Regardless, the implications of memory locking ought to be documented in > lvm2app.h.
Summary: the bug was present in releases 2.02.86 to 2.02.97 inclusive. It was fixed in 2.02.98. We've kept this rawhide bugzilla open to consider the other matters brought up in the discussion.
This bug appears to have been reported against 'rawhide' during the Fedora 19 development cycle. Changing version to '19'. (As we did not run this process for some time, it could affect also pre-Fedora 19 development cycle bugs. We are very sorry. It will help us with cleanup during Fedora 19 End Of Life. Thank you.) More information and reason for this action is here: https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora19