Description of problem: This was actually reported against CentOS, but I'm filing this against RHEL because I believe the same bug could occur here. Version-Release number of selected component (if applicable): /usr/bin/platform-python 3.6.8-31.el8 libguestfs-1.40.2-25.module_el8.3.0+555+a55c8938.x86_64 How reproducible: 100% Steps to Reproduce: (Details to follow) Actual results: A crash was observed in kojid which uses the Python bindings to libguestfs. The logs leading up to the crash were: 2021-01-29 15:48:03,025 [DEBUG] {368614} koji.TaskManager:918 bins: {'1:i386': 1, '1:x86_64': 1, '1:noarch': 1, '2:i386': 1, '2:x86_64': 1, '2:noarch': 1, '7:i386': 1, '7:x86_64': 1, '7:noarch': 1} libguestfs: trace: set_verbose true libguestfs: trace: set_verbose = 0 libguestfs: create: flags = 0, handle = 0x7ff40c01b110, program = platform-python libguestfs: trace: set_program "kojid" libguestfs: trace: set_program = 0 libguestfs: trace: add_drive_ro "/var/builder/builder/tasks/5516/5516/output_image/7938c61b-46bb-4b6a-b9c8-008e2ae5e900.body" libguestfs: trace: add_drive "/var/builder/builder/tasks/5516/5516/output_image/7938c61b-46bb-4b6a-b9c8-008e2ae5e900.body" "readonly:true" libguestfs: creating COW overlay to protect original drive content libguestfs: trace: disk_format "/var/builder/builder/tasks/5516/5516/output_image/7938c61b-46bb-4b6a-b9c8-008e2ae5e900.body" libguestfs: command: run: qemu-img --help | grep -sqE -- '\binfo\b.*-U\b' libguestfs: command: run: qemu-img libguestfs: command: run: \ info libguestfs: command: run: \ -U libguestfs: command: run: \ --output json libguestfs: command: run: \ /var/builder/builder/tasks/5516/5516/output_image/7938c61b-46bb-4b6a-b9c8-008e2ae5e900.body libguestfs: parse_json: qemu-img info JSON output:\n{\n "virtual-size": 10737418240,\n "filename": "/var/builder/builder/tasks/5516/5516/output_image/7938c61b-46bb-4b6a-b9c8-008e2ae5e900.body",\n "format": "raw",\n "actual-size": 615018496,\n "dirty-flag": false\n}\n\n free(): invalid pointer The reporter saw the following stack trace: #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 #1 0x00007ff432b23c35 in __GI_abort () at abort.c:79 #2 0x00007ff432b7c987 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ff432c8911d "%s\n") at ../sysdeps/posix/libc_fatal.c:181 #3 0x00007ff432b83d8c in malloc_printerr (str=str@entry=0x7ff432c872fb "free(): invalid pointer") at malloc.c:5374 #4 0x00007ff432b8568c in _int_free (av=<optimized out>, p=<optimized out>, have_lock=<optimized out>) at malloc.c:4185 #5 0x00007ff41c461b99 in json_delete () from /lib64/libjansson.so.4 #6 0x00007ff41cd6d0c5 in json_decref (json=0x7ff40c013470) at /usr/include/jansson.h:129 #7 cleanup_json_t_decref (ptr=<synthetic pointer>) at info.c:52 #8 guestfs_impl_disk_format (g=g@entry=0x7ff40c01b110, filename=filename@entry=0x7ff40c027ad0 "/var/builder/builder/tasks/5516/5516/output_image/7938c61b-46bb-4b6a-b9c8-008e2ae5e900.body") at info.c:66 #9 0x00007ff41cd3f6b9 in guestfs_disk_format (g=g@entry=0x7ff40c01b110, filename=0x7ff40c027ad0 "/var/builder/builder/tasks/5516/5516/output_image/7938c61b-46bb-4b6a-b9c8-008e2ae5e900.body") at actions-6.c:65 #10 0x00007ff41cd74e41 in get_source_format_or_autodetect (g=0x7ff40c01b110, drv=<optimized out>) at launch-libvirt.c:201 #11 0x00007ff41cd76118 in create_cow_overlay_libvirt (g=0x7ff40c01b110, datav=0x7ff40c016360, drv=0x7ff40c2b9990) at launch-libvirt.c:270 #12 0x00007ff41cd6359a in create_overlay (g=0x7ff40c01b110, drv=0x7ff40c2b9990) at drives.c:87 #13 0x00007ff41cd637cb in create_drive_file (g=g@entry=0x7ff40c01b110, data=data@entry=0x7ff413c6b870) at drives.c:119 #14 0x00007ff41cd640d8 in guestfs_impl_add_drive_opts (g=g@entry=0x7ff40c01b110, filename=<optimized out>, filename@entry=0x7ff4180ba600 "/var/builder/builder/tasks/5516/5516/output_image/7938c61b-46bb-4b6a-b9c8-008e2ae5e900.body", optargs=optargs@entry=0x7ff413c6ba10) at drives.c:830 #15 0x00007ff41ccf3698 in guestfs_add_drive_opts_argv (g=0x7ff40c01b110, filename=0x7ff4180ba600 "/var/builder/builder/tasks/5516/5516/output_image/7938c61b-46bb-4b6a-b9c8-008e2ae5e900.body", optargs=optargs@entry=0x7ff413c6ba10) at actions-2.c:181 #16 0x00007ff41cd64a85 in guestfs_impl_add_drive_ro (filename=<optimized out>, g=<optimized out>) at drives.c:942 #17 guestfs_impl_add_cdrom (g=<optimized out>, filename=<optimized out>) at drives.c:1013 #18 0x00007ff41cccdf34 in guestfs_add_drive_ro (g=g@entry=0x7ff40c01b110, filename=0x7ff4180ba600 "/var/builder/builder/tasks/5516/5516/output_image/7938c61b-46bb-4b6a-b9c8-008e2ae5e900.body") at actions-0.c:188 #19 0x00007ff41d01233d in guestfs_int_py_add_drive_ro (self=<optimized out>, args=<optimized out>) at actions-0.c:200 ...
Reporter says libguestfs-test-tool output is fine. libjansson 2.11-3.el8
I need to reproduce this one. Setting Triaged.
Hi, we're seeing this issue in RHELBLD on a RHEL 8 host - see the linked RHELBLD issue.
I was never able to reproduce this bug, and therefore diagnose it any further. Some kind of minimal reproducer would be greatly helpful. You could start by modifying this example: https://github.com/libguestfs/libguestfs/blob/master/python/examples/inspect_vm.py I notice the actual code which crashes: https://github.com/redhat-imaging/imagefactory/blob/master/imgfac/FactoryUtils.py#L22 does not use python_return_dict=True (so it invokes the older, and these days less tested path). It also uses g.add_drive_ro which is an older synonym for g.add_drive_opts(disk, readonly=1), but also a less tested path.
I had a go at reproducing this, but I got nothing, on RHEL 8 or Fedora. I also examined the code (libguestfs.git/lib/info.c) but I don't see any obvious errors. As well as trying to find a reproducer, would suggest: - Upgrade libguestfs as you're using an old version. - Try downgrading jansson, since the crash is actually happening in that library. - Try a different version of Python, ie. platform-python vs one from a module. But what we really need most of all is a small reproducer.
FWIW my non-reproducer was simply: $ truncate -s 1G test.raw $ python3 ./test.py where test.py is: import guestfs g = guestfs.GuestFS() g.add_drive_ro("test.raw") g.launch() You can also try different python interpreters, etc. I could not reproduce the bug here.
A run under valgrind gave this error about the invalid free: ==2345280== Thread 2: ==2345280== Invalid free() / delete / delete[] / realloc() ==2345280== at 0x4C3610C: free (vg_replace_malloc.c:538) ==2345280== by 0x24240B98: json_delete (in /usr/lib64/libjansson.so.4.11.0) ==2345280== by 0x2396E0C4: ??? (in /usr/lib64/libguestfs.so.0.507.0) ==2345280== by 0x239406B8: guestfs_disk_format (in /usr/lib64/libguestfs.so.0.507.0) ==2345280== by 0x23975E40: ??? (in /usr/lib64/libguestfs.so.0.507.0) ==2345280== by 0x23977117: ??? (in /usr/lib64/libguestfs.so.0.507.0) ==2345280== by 0x23964599: ??? (in /usr/lib64/libguestfs.so.0.507.0) ==2345280== by 0x239647CA: ??? (in /usr/lib64/libguestfs.so.0.507.0) ==2345280== by 0x239650D7: ??? (in /usr/lib64/libguestfs.so.0.507.0) ==2345280== by 0x238F4697: guestfs_add_drive_opts_argv (in /usr/lib64/libguestfs.so.0.507.0) ==2345280== by 0x23965A84: ??? (in /usr/lib64/libguestfs.so.0.507.0) ==2345280== by 0x238CEF33: guestfs_add_drive_ro (in /usr/lib64/libguestfs.so.0.507.0) ==2345280== Address 0x1ddb7f31 is 1 bytes inside a block of size 128 alloc'd ==2345280== at 0x4C34F0B: malloc (vg_replace_malloc.c:307) ==2345280== by 0x2423AE30: ??? (in /usr/lib64/libjansson.so.4.11.0) ==2345280== by 0x2424020A: json_object (in /usr/lib64/libjansson.so.4.11.0) ==2345280== by 0x2423C6D4: ??? (in /usr/lib64/libjansson.so.4.11.0) ==2345280== by 0x2423C965: ??? (in /usr/lib64/libjansson.so.4.11.0) ==2345280== by 0x2423CC0C: json_loadb (in /usr/lib64/libjansson.so.4.11.0) ==2345280== by 0x2396DEF9: ??? (in /usr/lib64/libguestfs.so.0.507.0) ==2345280== by 0x239612AF: ??? (in /usr/lib64/libguestfs.so.0.507.0) ==2345280== by 0x23961C6C: ??? (in /usr/lib64/libguestfs.so.0.507.0) ==2345280== by 0x2396DD0C: ??? (in /usr/lib64/libguestfs.so.0.507.0) ==2345280== by 0x2396E005: ??? (in /usr/lib64/libguestfs.so.0.507.0) ==2345280== by 0x239406B8: guestfs_disk_format (in /usr/lib64/libguestfs.so.0.507.0) ==2345280== Things I notice here: - Pointer was allocated in json_object() (as expected). - Pointer has somehow been incremented by 1 before being passed to free.
Created attachment 1784004 [details] test.c The reason is ... #0 0x00007f1d7e1bd7a7 in json_object_iter_next () at /lib64/libjson-c.so.4 #1 0x00007f1d6a7d6efd in do_dump (json=json@entry=0x7f1d60186f70, flags=flags@entry=0, depth=depth@entry=0, parents=parents@entry=0x7f1d684e1650, dump=dump@entry=0x7f1d6a7d68c0 <dump_to_file>, data=data@entry=0x7f1d845b36e0 <_IO_2_1_stdout_>) at dump.c:400 If you look very closely you'll see frame 1 is in the Jansson library. Frame 0 is in a completely different and unrelated library (json-c) that happens to contain a symbol with the same name as one from Jansson (json_object_iter_next). Looking at /proc/pid/maps verifies that kojid is dynamically loading both libjansson and libjson-c. This causes memory corruption which creates the problem we've seen. CCing Dan Berrange. Dan: I'm fairly sure we've had this problem before ... with libvirt? Do you remember more about that? There are various reports of this on the web: https://github.com/json-c/json-c/issues/621 https://github.com/akheron/jansson/issues/536 A simple reproducer is that you can try on the Beaker machine is attached. You have to compile it this way to link both libraries: $ gcc -Wall -Werror test.c -ljson-c -ljansson -o test $ ./test Segmentation fault (core dumped) (Note if you link the libraries in the other order then it doesn't crash.) At the moment I don't know how to fix this, but at least I finally know what the problem is, which is a good start. Another thing I don't know is which particular Python dependency of Koji is using json-c.
Just for the record json-c is coming to kojid from dnf, while jansson is from guestfs.
@rjones thanks for this excellent debugging. It seems newer versions of json-c and jansson implement symbol versions, which would resolve this. Can RH8 get those updates in order to fix this issue?
I don't know if there is a specific plan to do that. A possible workaround is to try importing both json extensions somewhere early on in the Koji code. Try importing them in one order, and then the other way around, and one way round ought to work.
Created attachment 1804516 [details] libdnf vs guestfs invalid pointer
@rjones - you are correct. At least in the context of koji, I have managed to work around this issue by simply adding 'import guestfs' at the beginning of kojid I've added a simple python script that illustrates the issue @tkopecek do you see a better way to work around this bug, or should I open an issue for tracking purposes at https://pagure.io/koji/issues?
Let's open an issue for koji - it is not that easy (guestfs needn't be present for all builders, just image ones) but we should figure it out.
Ok, I opened https://pagure.io/koji/issue/2964.
Rich, I assigned this to you, since you identified and debugged the issue already, but my understanding is that the real fix for this would be elsewhere (updated packaging for jansson or json-c)? At any rate, this sounds sufficiently important that we need to prioritize it, even if we use this bug to track a workaround, and create another one to fix the underlying issue. Do you agree?
This isn't really a bug in libguestfs, it's a bug caused by two JSON libraries having conflicting symbols so if (any) program happens to load both libraries at the same time then bad things happen. kojid was already using one of the JSON libraries via Python, and libguestfs happened to be the thing that pulled in the other JSON library so we get the blame for causing the crash, but it could equally well have been caused by anything else using the other JSON library. Interestingly it's been fixed already (in the two JSON libraries) in RHEL 9 by some ELF trickery.
(In reply to Richard W.M. Jones from comment #18) > This isn't really a bug in libguestfs, it's a bug caused by > two JSON libraries having conflicting symbols so if (any) > program happens to load both libraries at the same time then > bad things happen. kojid was already using one of the JSON > libraries via Python, and libguestfs happened to be the thing > that pulled in the other JSON library so we get the blame for > causing the crash, but it could equally well have been caused > by anything else using the other JSON library. > > Interestingly it's been fixed already (in the two JSON libraries) > in RHEL 9 by some ELF trickery. We're in agreement. I just need to know how to wrap our work on this. i.e., is the issue opened elsewhere (https://pagure.io/koji/issue/2964) sufficient, or should we re-route this one to some RHEL toolchain component? Thanks.
So how can we progress here? Are we tracking the fix for this anywhere else? What is the completion criteria? I'm setting sev/prio to low, assuming that this is no longer blocking any team.
I don't know - there's nothing we can do to fix this in libguestfs, and in any case it's not a bug in libguestfs at all, so I'm going to close the bug.
PSA: See bug 2001062.