Bug 911609
| Summary: | [abrt] libvirt-client-0.10.2-18.el6: remoteClientCloseFunc: Process /usr/bin/virsh was killed by signal 11 (SIGSEGV) | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | David Jaša <djasa> |
| Component: | libvirt | Assignee: | Peter Krempa <pkrempa> |
| Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.4 | CC: | acathrow, cpelland, cwei, dallan, dyuan, eblake, mjenner, mzhan, pkrempa, ydu, zhwang |
| Target Milestone: | rc | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Unspecified | | |
| Whiteboard: | abrt_hash:25a80aa5b774a648781787aa1f2a8213e8f3eb99 | | |
| Fixed In Version: | libvirt-0.10.2-19.el6 | Doc Type: | Bug Fix |
| Doc Text: | Due to a race condition in the libvirt client library, any application using libvirt could terminate unexpectedly with a segmentation fault. This happened when one thread executed the connection close callback, while another one freed the connection object, and the connection callback thread then accessed memory that had been already freed. This update fixes the possibility of freeing the callback data when they are still being accessed. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2013-11-21 08:45:15 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 950599 | | |
| Attachments: | | | |
Description
David Jaša
2013-02-15 13:04:49 UTC
Created attachment 697773 (File: maps)
Created attachment 697774 (File: var_log_messages)
Created attachment 697775 (File: open_fds)
Created attachment 697776 (File: environ)
Created attachment 697777 (File: dso_list)
Created attachment 697778 (File: sosreport.tar.xz)
Created attachment 697779 (File: backtrace)
Created attachment 697780 (File: build_ids)
Created attachment 697781 (File: limits)
Created attachment 697782 (File: cgroup)
This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux.

I tried this on RHEL 6.4, but I can't reproduce it with the following steps. Can you give me some advice, or suggest other operations I could run while trying to reproduce it? Thanks.
Version-Release number of selected component:
libvirt-client-0.10.2-18.el6
kernel-2.6.32-356.el6.x86_64
Steps:
1. Prepare a guest
# virsh list --all
Id Name State
----------------------------------------------------
- rhelnew1 shut off
2. Start and destroy the guest many times, checking the guest status each time
# for i in {1..20};do virsh start rhelnew1;sleep 3;virsh list --all; virsh destroy rhelnew1;virsh list --all;done
3. Connect to the remote host, then perform some operations on the guest
# virsh -c qemu+ssh://$hostIP/system
Type: 'help' for help with commands
'quit' to quit
virsh # list --all
Id Name State
----------------------------------------------------
- rhelnew1 shut off
virsh # start rhelnew1
virsh # list --all
Id Name State
----------------------------------------------------
25 rhelnew1 running
https://www.redhat.com/archives/libvir-list/2013-March/msg01517.html might be the right upstream patch for this.

The crash happens due to a race condition between two threads: one of them is executing the connection close callback while the other one frees the connection object. The connection callback thread then accesses memory that has already been freed. There is only a very slight chance of this happening (the probability can be increased by adding wait states to the code). The fix for this issue is posted upstream as http://www.redhat.com/archives/libvir-list/2013-March/msg01539.html

Fixed upstream with:
commit 8ad126e695e5cef5da9d62ccfde7338317041e84
Author: Peter Krempa <pkrempa>
Date: Fri Mar 29 18:21:19 2013 +0100
rpc: Fix connection close callback race condition and memory corruption/crash
Viktor's last effort to fix the race and memory corruption unfortunately
wasn't complete in the case the close callback was not registered in a
connection. At that time, the trail of events that I'll describe later could
still happen and corrupt the memory or cause a crash of the client (including
the daemon in case of a p2p migration).
Consider the following prerequisites and trail of events:
Let's have a remote connection to a hypervisor that doesn't have a close
callback registered and the client is using the event loop. The crash happens in
cooperation of 2 threads. Thread E is the event loop and thread W is the worker
that does some stuff. R denotes the remote client.
1.) W - The client finishes everything and sheds the last reference on the client
2.) W - The virObject stuff invokes virConnectDispose that invokes doRemoteClose
3.) W - the remote close method invokes the REMOTE_PROC_CLOSE RPC method.
4.) W - The thread is preempted at this point.
5.) R - The remote side receives the close and closes the socket.
6.) E - poll() wakes up due to the closed socket and invokes the close callback
7.) E - The event loop is preempted right before remoteClientCloseFunc is called
8.) W - The worker now finishes, and frees the conn object.
9.) E - The remoteClientCloseFunc accesses the now-freed conn object in the
attempt to retrieve pointer for the real close callback.
10.) Kaboom, corrupted memory/segfault.
This patch tries to fix this by introducing a new object that survives the
freeing of the connection object. We can't increase the reference count on the
connection object itself or the connection would never be closed, as the
connection is closed only when the reference count reaches zero.
The new object - virConnectCloseCallbackData - is a lockable object that keeps
the pointers to the real user registered callback and ensures that the
connection callback is either not called if the connection was already freed or
that the connection isn't freed while this is being called.
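
The idea described in this commit message can be sketched as follows. This is only an illustration under stated assumptions, not libvirt's actual code: every name here (close_data, close_data_detach, close_data_fire, count_close) is hypothetical, and the real virConnectCloseCallbackData is built on libvirt's own object system rather than a hand-rolled reference count. The sketch shows a small lockable object that outlives the connection, so the event loop either sees a cleared connection pointer and skips the callback, or holds the lock while the callback runs.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical types and names; not the libvirt API. */
typedef void (*close_cb)(void *conn, int reason, void *opaque);

typedef struct {
    pthread_mutex_t lock;
    int refs;           /* one reference per holder (connection, RPC layer) */
    void *conn;         /* cleared when the connection is disposed */
    close_cb callback;  /* user-registered callback, may be NULL */
    void *opaque;
} close_data;

close_data *close_data_new(void *conn)
{
    close_data *d = calloc(1, sizeof(*d));
    if (!d)
        return NULL;
    pthread_mutex_init(&d->lock, NULL);
    d->refs = 1;
    d->conn = conn;
    return d;
}

void close_data_ref(close_data *d)
{
    pthread_mutex_lock(&d->lock);
    d->refs++;
    pthread_mutex_unlock(&d->lock);
}

/* Drops one reference; returns 1 if the object was freed, 0 otherwise. */
int close_data_unref(close_data *d)
{
    int remaining;

    pthread_mutex_lock(&d->lock);
    assert(d->refs > 0);
    remaining = --d->refs;
    pthread_mutex_unlock(&d->lock);

    if (remaining == 0) {
        pthread_mutex_destroy(&d->lock);
        free(d);
        return 1;
    }
    return 0;
}

/* Worker side: called while the connection object is being disposed. */
void close_data_detach(close_data *d)
{
    pthread_mutex_lock(&d->lock);
    d->conn = NULL;
    d->callback = NULL;
    pthread_mutex_unlock(&d->lock);
}

/* Event-loop side: called when the socket closes.
 * Returns 1 if the user callback ran, 0 if it was skipped. */
int close_data_fire(close_data *d, int reason)
{
    int fired = 0;

    pthread_mutex_lock(&d->lock);
    if (d->conn && d->callback) {
        /* The lock is held, so a concurrent detach waits until we finish. */
        d->callback(d->conn, reason, d->opaque);
        fired = 1;
    }
    pthread_mutex_unlock(&d->lock);
    return fired;
}

/* Example user callback: counts how often it was invoked. */
void count_close(void *conn, int reason, void *opaque)
{
    (void)conn;
    (void)reason;
    ++*(int *)opaque;
}
```

In this sketch the event loop keeps its own reference to the close_data object, so step 9 of the trail of events touches only memory that is still valid, while the connection itself can still be freed once its own reference count reaches zero.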
commit 69ab07560a134e82e36b6391be3c806d3dbdb16c
Author: Peter Krempa <pkrempa>
Date: Wed Mar 27 14:37:01 2013 +0100
virsh: Register and unregister the close callback also in cmdConnect
This patch improves the error message after disconnecting from the
hypervisor and adds the close callback operations required not to leak
the callback reference.
commit ca9e73ebb60e2efb1ea835e9a394a8b64ecb97c1
Author: Peter Krempa <pkrempa>
Date: Wed Mar 27 14:22:47 2013 +0100
virsh: Move cmdConnect from virsh-host.c to virsh.c
The function is used to establish connection so it should be in the main
virsh file. This movement also enables further improvements done in next
patches.
Note that the "connect" command has moved from the host section of virsh to the
main section. It is now listed by 'virsh help virsh' instead of 'virsh help
host'.
commit e964ba2786f6736613de1f14db4d3407f6928f50
Author: Viktor Mihajlovski <mihajlov.ibm.com>
Date: Tue Mar 26 10:54:55 2013 +0100
virsh: Unregister the connection close notifier upon termination
Before closing the connection we unregister the close callback
to prevent a reference leak.
Further, the messages on virConnectClose != 0 are a bit more specific
now.
Signed-off-by: Viktor Mihajlovski <mihajlov.ibm.com>
commit 03a43efa86f5099d3f6df334f73961a535e488b5
Author: Viktor Mihajlovski <mihajlov.ibm.com>
Date: Tue Mar 26 10:54:53 2013 +0100
libvirt: Increase connection reference count for callbacks
By adjusting the reference count of the connection object we
prevent races between callback function and virConnectClose.
Signed-off-by: Viktor Mihajlovski <mihajlov.ibm.com>
commit d0cc811ed02d49e60193dfe6601e53adadebb114
Author: Viktor Mihajlovski <mihajlov.ibm.com>
Date: Tue Mar 26 10:54:54 2013 +0100
remote: Don't call NULL closeFreeCallback
Check function pointer before calling.
Signed-off-by: Viktor Mihajlovski <mihajlov.ibm.com>
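
The "check function pointer before calling" fix in the last commit follows a common C pattern, sketched below. The names release_opaque and mark_released are hypothetical and chosen only for illustration; the actual libvirt change guards its own closeFreeCallback member.

```c
#include <stddef.h>

typedef void (*free_cb)(void *opaque);

/* Hypothetical helper: run the free callback only if one was supplied.
 * Returns 1 if the callback was invoked, 0 if it was NULL and skipped. */
int release_opaque(free_cb freecb, void *opaque)
{
    if (freecb == NULL)
        return 0;
    freecb(opaque);
    return 1;
}

/* Example free callback that records that it ran. */
void mark_released(void *opaque)
{
    *(int *)opaque = 1;
}
```

Callers that register a close callback without a free function then pass NULL safely instead of triggering a jump through a null pointer.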
git describe: v1.0.4-57-g8ad126e

This patch - http://www.redhat.com/archives/libvir-list/2013-March/msg01683.html - when applied to the unfixed source tree, makes the crash much more likely to happen.

Hi Peter, I tried to reproduce this with virsh commands before, but I couldn't. Can you give me the steps to reproduce this bug so that we can verify it correctly later? Thanks.

Hi Zhenfeng, please see the patch linked in Comment 19. When you apply that patch to the unfixed tree, the race becomes much more reproducible (around 90%). That thread also contains more information on how the bug manifests and what stack traces to expect.

Hi Peter, I tried to rebuild the libvirtd package from libvirt-rhel.git, but I ran into too many duplicates and dependency problems during the rebuild. I spent the whole afternoon preparing it and still don't have a build, so could you spare some time to build a temporary package for me when you are free? Meanwhile, I'll keep working on it. Thanks.

(In reply to comment #23)
> Hi Peter, I tried to rebuild the libvirtd package from libvirt-rhel.git, but
> I ran into too many duplicates and dependency problems during the rebuild. I
> spent the whole afternoon preparing it and still don't have a build, so could
> you spare some time to build a temporary package for me when you are free?
> Meanwhile, I'll keep working on it. Thanks.

Hi Zhenfeng, I just rebuilt the libvirt RPM packages with the patch from comment 19 applied (you can reach me to get them) and reproduced this bug.

In one terminal:
# for i in {1..100}; do virsh list --all & virsh list --all ;done

In another one:
# virsh list
Id Name State
----------------------------------------------------
DEBUG: Connection close called, sleeping
DEBUG: calling the close callback
DEBUG: Finishing close
Segmentation fault (core dumped)

Hi Peter, I tried to reproduce this bug with the libvirt-0.10.2-18.el6 build compiled by ydu in comment 24. Unfortunately, I can't reproduce the issue following ydu's comment in my environment; maybe my environment is a little different from ydu's. However, I can get another segmentation fault with migration. Since something is wrong with abrt (909617), I can't get the core dump file from my machine, so I'm not sure whether this is the segmentation fault we expect. I collected some log messages; can you have a look at them? Thanks.

Package versions:
kernel-2.6.32-356.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.355.el6.x86_64
libvirt-0.10.2-18.el6.x86_64

Steps:
virsh migrate --live win qemu+ssh://xx.xx.xx.xx/system --verbose
root.xx.xx's password:
DEBUG: Connection close called, sleeping
DEBUG: calling the close callback
DEBUG: Finishing close
error: cannot open file '/var/lib/libvirt/images/b.iso': No such file or directory
DEBUG: Connection close called, sleeping
DEBUG: Finishing close
Segmentation fault (core dumped)

Jul 11 20:16:07 zhwang64 kernel: virsh[4796]: segfault at 21 ip 0000000000000021 sp 00007fed37b53b58 error 14 in virsh[400000+58000]
Jul 11 20:16:07 zhwang64 abrtd: Directory 'ccpp-2013-07-11-20:16:07-4795' creation detected
Jul 11 20:16:07 zhwang64 abrt[4801]: Saved core dump of pid 4795 (/usr/bin/virsh) to /var/spool/abrt/ccpp-2013-07-11-20:16:07-4795 (31240192 bytes)
Jul 11 20:16:08 zhwang64 abrtd: Package 'libvirt-client' isn't signed with proper key
Jul 11 20:16:08 zhwang64 abrtd: 'post-create' on '/var/spool/abrt/ccpp-2013-07-11-20:16:07-4795' exited with 1
Jul 11 20:16:08 zhwang64 abrtd: Deleting problem directory '/var/spool/abrt/ccpp-2013-07-11-20:16:07-4795'

# ll /var/spool/abrt
total 4
-rw-------. 1 root root 14 Jul 11 20:16 last-ccpp
# cat /var/spool/abrt/last-ccpp
/usr/bin/virsh

The problem was in the close callback, so every client that uses the callback was likely to crash. The patch I attached to the 6.4.z clone of this bug can be used to reproduce this with a 100% success rate. Your steps above use migration for that purpose, which is okay in terms of the original bug, but it should crash for virtually every remote virsh command.

Hi Peter, thanks for your reply. Sorry to say that I can still only reproduce this issue with migration; I can't reproduce it with other virsh commands even after many attempts. I also found that the issue in comment 26 is gone after updating libvirt to libvirt-0.10.2-19.el6, so can I mark this bug verified using the steps from comment 26? Thanks.

Since it works well on libvirt-0.10.2-19.el6 with the reproducer patch applied, and I have also confirmed this with Peter on IRC, I am marking this bug verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1581.html