Bug 911609
| Field | Value |
|---|---|
| Summary | [abrt] libvirt-client-0.10.2-18.el6: remoteClientCloseFunc: Process /usr/bin/virsh was killed by signal 11 (SIGSEGV) |
| Product | Red Hat Enterprise Linux 6 |
| Reporter | David Jaša <djasa> |
| Component | libvirt |
| Assignee | Peter Krempa <pkrempa> |
| Status | CLOSED ERRATA |
| QA Contact | Virtualization Bugs <virt-bugs> |
| Severity | high |
| Priority | high |
| Version | 6.4 |
| CC | acathrow, cpelland, cwei, dallan, dyuan, eblake, mjenner, mzhan, pkrempa, ydu, zhwang |
| Target Milestone | rc |
| Keywords | ZStream |
| Hardware | x86_64 |
| OS | Unspecified |
| Whiteboard | abrt_hash:25a80aa5b774a648781787aa1f2a8213e8f3eb99 |
| Fixed In Version | libvirt-0.10.2-19.el6 |
| Doc Type | Bug Fix |
| Doc Text | Due to a race condition in the libvirt client library, any application using libvirt could terminate unexpectedly with a segmentation fault. This happened when one thread executed the connection close callback, while another one freed the connection object, and the connection callback thread then accessed memory that had been already freed. This update fixes the possibility of freeing the callback data when they are still being accessed. |
| Last Closed | 2013-11-21 08:45:15 UTC |
| Bug Blocks | 950599 |

Description
David Jaša
2013-02-15 13:04:49 UTC
Created attachment 697773 (File: maps)
Created attachment 697774 (File: var_log_messages)
Created attachment 697775 (File: open_fds)
Created attachment 697776 (File: environ)
Created attachment 697777 (File: dso_list)
Created attachment 697778 (File: sosreport.tar.xz)
Created attachment 697779 (File: backtrace)
Created attachment 697780 (File: build_ids)
Created attachment 697781 (File: limits)
Created attachment 697782 (File: cgroup)

This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux.

I tried this on RHEL 6.4, but I can't reproduce it with the following steps. Can you give me some advice, or suggest what other operations I could try while reproducing it? Thanks.

Version-Release number of selected component:
libvirt-client-0.10.2-18.el6
kernel-2.6.32-356.el6.x86_64

Steps:

1. Prepare a guest.

    # virsh list --all
     Id    Name                           State
    ----------------------------------------------------
     -     rhelnew1                       shut off

2. Start and destroy the guest many times, then check the guest status.

    # for i in {1..20};do virsh start rhelnew1;sleep 3;virsh list --all; virsh destroy rhelnew1;virsh list --all;done

3. Connect to the guest on the remote host, then run some operations on the guest.

    # virsh -c qemu+ssh://$hostIP/system
    Type:  'help' for help with commands
           'quit' to quit

    virsh # list --all
     Id    Name                           State
    ----------------------------------------------------
     -     rhelnew1                       shut off

    virsh # start rhelnew1
    virsh # list --all
     Id    Name                           State
    ----------------------------------------------------
     25    rhelnew1                       running

https://www.redhat.com/archives/libvir-list/2013-March/msg01517.html might be the right upstream patch for this.

The crash happens due to a race condition between two threads: one of them is executing the connection close callback, while the other one frees the connection object. The connection callback thread then accesses memory that has already been freed. There is only a very slight chance of this happening (the probability can be increased by adding wait states to the code).

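The race described above can be modelled outside libvirt. The following is only a minimal sketch in Python, not libvirt code: the `Connection` class, its `freed` flag and the `run_race` function are hypothetical names, and setting `freed = True` merely stands in for freeing memory in C. Two `threading.Event` objects play the role of the "wait states" that force the unlucky interleaving every time.

```python
import threading


class Connection:
    """Hypothetical stand-in for the connection object.

    `freed` models the memory having been released; in the real C
    client there is no such flag, the memory is simply gone.
    """

    def __init__(self):
        self.freed = False


def run_race():
    """Force the interleaving: callback thread pauses, other thread frees."""
    conn = Connection()
    callback_pending = threading.Event()  # wait state: callback is about to run
    conn_freed = threading.Event()        # wait state: connection was freed
    result = {}

    def callback_thread():
        # The close callback has been scheduled but has not touched conn yet.
        callback_pending.set()
        conn_freed.wait()
        # The real client would dereference freed memory at this point.
        result["touched_freed_conn"] = conn.freed

    def freeing_thread():
        callback_pending.wait()
        conn.freed = True                 # models freeing the connection
        conn_freed.set()

    threads = [
        threading.Thread(target=callback_thread),
        threading.Thread(target=freeing_thread),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["touched_freed_conn"]


if __name__ == "__main__":
    print("callback touched a freed connection:", run_race())
```

Without the two events the outcome would depend on scheduling, which matches the observation that the crash is rare unless delays are added.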
The fix for this issue is posted upstream as http://www.redhat.com/archives/libvir-list/2013-March/msg01539.html

Fixed upstream with:

commit 8ad126e695e5cef5da9d62ccfde7338317041e84
Author: Peter Krempa <pkrempa>
Date:   Fri Mar 29 18:21:19 2013 +0100

    rpc: Fix connection close callback race condition and memory corruption/crash

    Viktor's last effort to fix the race and memory corruption unfortunately wasn't complete in the case where the close callback was not registered in a connection. At that time, the trail of events that I'll describe later could still happen and corrupt the memory or cause a crash of the client (including the daemon in case of a p2p migration).

    Consider the following prerequisites and trail of events:
    Let's have a remote connection to a hypervisor that doesn't have a close callback registered and the client is using the event loop. The crash happens in cooperation of 2 threads. Thread E is the event loop and thread W is the worker that does some stuff. R denotes the remote client.

    1.) W - The client finishes everything and sheds the last reference on the client
    2.) W - The virObject stuff invokes virConnectDispose that invokes doRemoteClose
    3.) W - the remote close method invokes the REMOTE_PROC_CLOSE RPC method.
    4.) W - The thread is preempted at this point.
    5.) R - The remote side receives the close and closes the socket.
    6.) E - poll() wakes up due to the closed socket and invokes the close callback
    7.) E - The event loop is preempted right before remoteClientCloseFunc is called
    8.) W - The worker now finishes, and frees the conn object.
    9.) E - The remoteClientCloseFunc accesses the now-freed conn object in the attempt to retrieve pointer for the real close callback.
    10.) Kaboom, corrupted memory/segfault.

    This patch tries to fix this by introducing a new object that survives the freeing of the connection object.
    We can't increase the reference count on the connection object itself or the connection would never be closed, as the connection is closed only when the reference count reaches zero. The new object - virConnectCloseCallbackData - is a lockable object that keeps the pointers to the real user registered callback and ensures that the connection callback is either not called if the connection was already freed or that the connection isn't freed while this is being called.

commit 69ab07560a134e82e36b6391be3c806d3dbdb16c
Author: Peter Krempa <pkrempa>
Date:   Wed Mar 27 14:37:01 2013 +0100

    virsh: Register and unregister the close callback also in cmdConnect

    This patch improves the error message after disconnecting from the hypervisor and adds the close callback operations required not to leak the callback reference.

commit ca9e73ebb60e2efb1ea835e9a394a8b64ecb97c1
Author: Peter Krempa <pkrempa>
Date:   Wed Mar 27 14:22:47 2013 +0100

    virsh: Move cmdConnect from virsh-host.c to virsh.c

    The function is used to establish connection so it should be in the main virsh file. This movement also enables further improvements done in next patches.

    Note that the "connect" command has moved from the host section of virsh to the main section. It is now listed by 'virsh help virsh' instead of 'virsh help host'.

commit e964ba2786f6736613de1f14db4d3407f6928f50
Author: Viktor Mihajlovski <mihajlov.ibm.com>
Date:   Tue Mar 26 10:54:55 2013 +0100

    virsh: Unregister the connection close notifier upon termination

    Before closing the connection we unregister the close callback to prevent a reference leak.

    Further, the messages on virConnectClose != 0 are a bit more specific now.
    Signed-off-by: Viktor Mihajlovski <mihajlov.ibm.com>

commit 03a43efa86f5099d3f6df334f73961a535e488b5
Author: Viktor Mihajlovski <mihajlov.ibm.com>
Date:   Tue Mar 26 10:54:53 2013 +0100

    libvirt: Increase connection reference count for callbacks

    By adjusting the reference count of the connection object we prevent races between callback function and virConnectClose.

    Signed-off-by: Viktor Mihajlovski <mihajlov.ibm.com>

commit d0cc811ed02d49e60193dfe6601e53adadebb114
Author: Viktor Mihajlovski <mihajlov.ibm.com>
Date:   Tue Mar 26 10:54:54 2013 +0100

    remote: Don't call NULL closeFreeCallback

    Check function pointer before calling.

    Signed-off-by: Viktor Mihajlovski <mihajlov.ibm.com>

git describe: v1.0.4-57-g8ad126e

This patch - http://www.redhat.com/archives/libvir-list/2013-March/msg01683.html - when applied on the unfixed source tree makes the crash much more likely to happen.

Hi Peter,
I tried to reproduce this with virsh commands before, but I couldn't. Can you give me the steps to reproduce this bug, so that we can verify it correctly later? Thanks.

Hi Zhenfeng,
please see the patch linked in Comment 19. When you apply that patch to the unfixed tree, the race becomes much more reproducible (around 90%). That thread also contains more information on how the bug manifests and what stack traces to expect.

Hi Peter,
I just tried to rebuild the libvirtd package from libvirt-rhel.git; however, I ran into too many duplicates and dependencies during the rebuild. I spent the whole afternoon preparing for it and still haven't got it built, so could you spare some time to build a temporary package for me when you are free? Meanwhile, I'll keep working on it. Thanks.

(In reply to comment #23)
> Hi Peter,
> I just tried to rebuild the libvirtd package from libvirt-rhel.git; however, I
> ran into too many duplicates and dependencies during the rebuild.
> I spent the whole afternoon preparing for it and still haven't got it built, so could
> you spare some time to build a temporary package for me when you are
> free? Meanwhile, I'll keep working on it. Thanks.

Hi Zhenfeng,
I just rebuilt the libvirt rpm packages (with the patch from comment 19 applied; contact me if you want them) and reproduced this bug.

In one terminal:

    # for i in {1..100}; do virsh list --all & virsh list --all ;done

In another one:

    # virsh list
     Id    Name                           State
    ----------------------------------------------------

    DEBUG: Connection close called, sleeping
    DEBUG: calling the close callback
    DEBUG: Finishing close
    Segmentation fault (core dumped)

Hi Peter,
I tried to reproduce this bug with the libvirt-0.10.2-18.el6 build compiled by ydu in comment 24. Unfortunately, I can't reproduce the issue with ydu's steps in my environment; maybe my environment is a little different from ydu's. However, I do get another segmentation fault with migration. Since there is something wrong with abrt (909617), I can't get the core dump file from my machine, so I'm not sure whether this is the segmentation fault we expect. I collected some log messages; could you have a look at them?
Thanks.

Package versions:
kernel-2.6.32-356.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.355.el6.x86_64
libvirt-0.10.2-18.el6.x86_64

Steps:

    virsh migrate --live win qemu+ssh://xx.xx.xx.xx/system --verbose
    root.xx.xx's password:
    DEBUG: Connection close called, sleeping
    DEBUG: calling the close callback
    DEBUG: Finishing close
    error: cannot open file '/var/lib/libvirt/images/b.iso': No such file or directory
    DEBUG: Connection close called, sleeping
    DEBUG: Finishing close
    Segmentation fault (core dumped)

    Jul 11 20:16:07 zhwang64 kernel: virsh[4796]: segfault at 21 ip 0000000000000021 sp 00007fed37b53b58 error 14 in virsh[400000+58000]
    Jul 11 20:16:07 zhwang64 abrtd: Directory 'ccpp-2013-07-11-20:16:07-4795' creation detected
    Jul 11 20:16:07 zhwang64 abrt[4801]: Saved core dump of pid 4795 (/usr/bin/virsh) to /var/spool/abrt/ccpp-2013-07-11-20:16:07-4795 (31240192 bytes)
    Jul 11 20:16:08 zhwang64 abrtd: Package 'libvirt-client' isn't signed with proper key
    Jul 11 20:16:08 zhwang64 abrtd: 'post-create' on '/var/spool/abrt/ccpp-2013-07-11-20:16:07-4795' exited with 1
    Jul 11 20:16:08 zhwang64 abrtd: Deleting problem directory '/var/spool/abrt/ccpp-2013-07-11-20:16:07-4795'

    # ll /var/spool/abrt
    total 4
    -rw-------. 1 root root 14 Jul 11 20:16 last-ccpp
    cat /var/spool/abrt/last-ccpp
    /usr/bin/virsh

The problem was in the close callback, so every client that uses the callback was likely to crash. The patch I attached to the 6.4.z clone of this bug can be used to reproduce this with a 100% success rate. Your steps above use migration for that purpose, which is okay in terms of the original bug, but virtually every remote virsh command should crash.

Hi Peter,
Thanks for your reply. Sorry to say that I can still only reproduce this issue with migration; I can't reproduce it with other virsh commands even after many attempts.
I also found that the issue in comment 26 is gone after I updated libvirt to libvirt-0.10.2-19.el6, so can I mark this bug verified with the steps from comment 26? Thanks.

Since it works well on libvirt-0.10.2-19.el6 with the reproducer patch applied, and I have also confirmed this with Peter on IRC, I am marking this bug verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1581.html
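The approach described in the upstream commit message quoted in this report (a lockable object, virConnectCloseCallbackData, that outlives the connection and guards the close callback) can be sketched roughly as follows. This is a Python model under stated assumptions, not the libvirt implementation: `CloseCallbackData`, `Connection`, `invoke`, `detach` and `dispose` are hypothetical names, the `freed` flag stands in for freeing memory, and Python's garbage collector replaces the reference counting the C object would need.

```python
import threading


class CloseCallbackData:
    """Hypothetical model of a lockable callback holder that outlives the connection."""

    def __init__(self, conn, callback):
        self.lock = threading.Lock()
        self.conn = conn
        self.callback = callback

    def invoke(self):
        """Called by the event loop when the socket closes."""
        with self.lock:
            if self.conn is None or self.callback is None:
                return False              # connection already gone: do nothing
            self.callback(self.conn)      # conn cannot be freed while we hold the lock
            return True

    def detach(self):
        """Called while the connection is being disposed of."""
        with self.lock:                   # waits for a running callback to finish
            self.conn = None
            self.callback = None


class Connection:
    """Hypothetical connection; `freed` models the memory being released."""

    def __init__(self, callback):
        self.freed = False
        self.close_data = CloseCallbackData(self, callback)

    def dispose(self):
        self.close_data.detach()
        self.freed = True


def callback_after_dispose_is_skipped():
    seen = []
    conn = Connection(lambda c: seen.append(c.freed))
    data = conn.close_data                # the event loop keeps its own reference
    first = data.invoke()                 # connection alive: callback runs
    conn.dispose()
    second = data.invoke()                # connection gone: callback is skipped
    return first, second, seen


def dispose_waits_for_running_callback():
    in_callback, release = threading.Event(), threading.Event()
    observed = []

    def callback(conn):
        in_callback.set()
        release.wait()
        observed.append(conn.freed)       # dispose is still blocked on the lock

    conn = Connection(callback)
    loop = threading.Thread(target=conn.close_data.invoke)
    loop.start()
    in_callback.wait()
    worker = threading.Thread(target=conn.dispose)
    worker.start()
    release.set()
    loop.join()
    worker.join()
    return observed, conn.freed
```

The two helper functions exercise the two guarantees named in the commit message: the callback is not called once the connection is gone, and the connection is not freed while the callback is running.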