Bug 866524

Summary: use-after-free on virsh node-memory-tune
Product: Red Hat Enterprise Linux 6
Component: libvirt
Version: 6.5
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Target Milestone: rc
Reporter: Alex Jia <ajia>
Assignee: Osier Yang <jyang>
QA Contact: Virtualization Bugs <virt-bugs>
CC: acathrow, berrange, bili, dallan, dyasny, dyuan, eblake, mzhan, pkrempa
Fixed In Version: libvirt-0.10.2-12.el6
Doc Type: Bug Fix
Type: Bug
Last Closed: 2013-02-21 07:10:07 UTC
Bug Blocks: 888457    
Attachments:
  valgrind report (attachment 627492, comment 1)
  valgrind report (attachment 636527, comment 10)

Description Alex Jia 2012-10-15 14:44:52 UTC
Description of problem:
Valgrind reports a use-after-free memory error when an invalid number is passed to virsh node-memory-tune.

Version-Release number of selected component (if applicable):
# rpm -q libvirt
libvirt-0.10.2-3.el6.x86_64

How reproducible:
always

Steps to Reproduce:
# valgrind -v virsh node-memory-tune --shm-sleep-millisecs -1

  
Actual results:
# valgrind -v virsh node-memory-tune --shm-sleep-millisecs -1
error: invalid shm-sleep-millisecs number

In addition to the usage error, valgrind reports use-after-free memory errors; see the attached valgrind report.

Expected results:
The usage error is reported without any valgrind memory errors.

Additional info:

Comment 1 Alex Jia 2012-10-15 14:45:27 UTC
Created attachment 627492 [details]
valgrind report

Comment 3 Osier Yang 2012-10-16 10:43:13 UTC
(In reply to comment #0)
> Actual results:
> # valgrind -v virsh node-memory-tune --shm-sleep-millisecs -1

==8062==  Address 0x52e78a8 is 280 bytes inside a block of size 312 free'd
==8062==    at 0x4A0595D: free (vg_replace_malloc.c:366)
==8062==    by 0x4C83508: virFree (memory.c:309)
==8062==    by 0x4C94E7C: virObjectUnref (virobject.c:145)
==8062==    by 0x4D13FE7: virConnectClose (libvirt.c:1458)
==8062==    by 0x40C868: vshDeinit (virsh.c:2571)
==8062==    by 0x41069E: main (virsh.c:3009)

Hm, I don't see these errors with both libvirt and libvirt-client at 0.10.2-3.

Comment 4 Alex Jia 2012-10-16 11:32:41 UTC
(In reply to comment #3)
> 
> Hm, I don't see these errors with both libvirt and libvirt-client at
> 0.10.2-3.

I just confirmed this; it's not always reproducible. You may need to run valgrind several times, and then I believe you will see the memory errors reported earlier.

Comment 5 Osier Yang 2012-10-24 10:57:55 UTC
Even after trying 50 times, I can't see the error.

Comment 6 Alex Jia 2012-10-24 12:56:25 UTC
(In reply to comment #5)
> Even after trying 50 times, I can't see the error.

Which libvirt version are you trying? libvirt-0.10.2-3.el6.x86_64?

BTW, libvirt-0.10.2-5.el6.x86_64 is fine for me.

Comment 7 Osier Yang 2012-10-24 13:20:57 UTC
(In reply to comment #6)
> (In reply to comment #5)
> > Even after trying 50 times, I can't see the error.
> 
> Which libvirt version are you trying? libvirt-0.10.2-3.el6.x86_64?
> 
> BTW, libvirt-0.10.2-5.el6.x86_64 is fine for me.

I tried with the top of the RHEL git tree. To be sure, I tried again after installing 0.10.2-3 (both libvirt and libvirt-client) and can't reproduce it either. Am I too unlucky, or are you too lucky?

Comment 8 Alex Jia 2012-10-24 14:34:34 UTC
I reproduced it on libvirt-0.10.2-3.el6.x86_64 again, but it isn't always reproducible; I restarted the libvirtd service and then tried 4 times to reproduce it:


==27860== 1 errors in context 1 of 9:
==27860== Invalid write of size 4
==27860==    at 0x38BAA0A5E5: pthread_mutex_unlock (pthread_mutex_unlock.c:46)
==27860==    by 0x4D604B2: virNetClientCloseLocked (virnetclient.c:688)
==27860==    by 0x4D62998: virNetClientIncomingEvent (virnetclient.c:1859)
==27860==    by 0x4C7C6EE: virEventPollRunOnce (event_poll.c:485)
==27860==    by 0x4C7B486: virEventRunDefaultImpl (event.c:247)
==27860==    by 0x40CD11: vshEventLoop (virsh.c:2115)
==27860==    by 0x4C8CD48: virThreadHelper (threads-pthread.c:161)
==27860==    by 0x38BAA077F0: start_thread (pthread_create.c:301)
==27860==    by 0x33F68E570C: clone (clone.S:115)
==27860==  Address 0x52e5bb8 is 152 bytes inside a block of size 312 free'd
==27860==    at 0x4A0595D: free (vg_replace_malloc.c:366)
==27860==    by 0x4C83508: virFree (memory.c:309)
==27860==    by 0x4C94E7C: virObjectUnref (virobject.c:145)
==27860==    by 0x4D13FE7: virConnectClose (libvirt.c:1458)
==27860==    by 0x40C868: vshDeinit (virsh.c:2571)
==27860==    by 0x41069E: main (virsh.c:3009)

Comment 9 Alex Jia 2012-10-24 14:41:00 UTC
(In reply to comment #8)
> I reproduced it on libvirt-0.10.2-3.el6.x86_64 again, but it isn't always
> reproducible; I restarted the libvirtd service and then tried 4 times to
> reproduce it:

I also reproduced it on libvirt-0.10.2-5.el6.x86_64; an easy way to reproduce it is to restart the libvirtd service and then run valgrind several times.

Comment 10 Alex Jia 2012-11-01 09:47:18 UTC
Created attachment 636527 [details]
valgrind report

Reproduced it on libvirt-0.10.2-6.el6.x86_64.

Comment 13 Daniel Berrangé 2012-12-04 10:42:51 UTC
(In reply to comment #8)
> I reproduced it on libvirt-0.10.2-3.el6.x86_64 again, but it isn't always
> reproducible; I restarted the libvirtd service and then tried 4 times to
> reproduce it:
> 
> 
> ==27860== 1 errors in context 1 of 9:
> ==27860== Invalid write of size 4
> ==27860==    at 0x38BAA0A5E5: pthread_mutex_unlock
> (pthread_mutex_unlock.c:46)
> ==27860==    by 0x4D604B2: virNetClientCloseLocked (virnetclient.c:688)
> ==27860==    by 0x4D62998: virNetClientIncomingEvent (virnetclient.c:1859)
> ==27860==    by 0x4C7C6EE: virEventPollRunOnce (event_poll.c:485)
> ==27860==    by 0x4C7B486: virEventRunDefaultImpl (event.c:247)
> ==27860==    by 0x40CD11: vshEventLoop (virsh.c:2115)
> ==27860==    by 0x4C8CD48: virThreadHelper (threads-pthread.c:161)
> ==27860==    by 0x38BAA077F0: start_thread (pthread_create.c:301)
> ==27860==    by 0x33F68E570C: clone (clone.S:115)
> ==27860==  Address 0x52e5bb8 is 152 bytes inside a block of size 312 free'd
> ==27860==    at 0x4A0595D: free (vg_replace_malloc.c:366)
> ==27860==    by 0x4C83508: virFree (memory.c:309)
> ==27860==    by 0x4C94E7C: virObjectUnref (virobject.c:145)
> ==27860==    by 0x4D13FE7: virConnectClose (libvirt.c:1458)
> ==27860==    by 0x40C868: vshDeinit (virsh.c:2571)
> ==27860==    by 0x41069E: main (virsh.c:3009)

Thanks to compiler optimization, this stack trace is somewhat incomplete and misleading.

The pthread_mutex_unlock call in the first stack trace is actually being invoked by 'remoteClientCloseFunc', which is in turn called from 'virNetClientCloseLocked'. The mutex is the virConnectPtr mutex.

The problem is that this is happening after the last reference to virConnectPtr has been released.

The following ought to solve it

diff --git a/src/remote/remote_driver.c b/src/remote/remote_driver.c
index 072d347..96bf0d8 100644
--- a/src/remote/remote_driver.c
+++ b/src/remote/remote_driver.c
@@ -1001,6 +1001,10 @@ doRemoteClose(virConnectPtr conn, struct private_data *priv)
 
     virObjectUnref(priv->tls);
     priv->tls = NULL;
+    virNetClientSetCloseCallback(priv->client,
+                                 NULL,
+                                 NULL,
+                                 NULL);
     virNetClientClose(priv->client);
     virObjectUnref(priv->client);
     priv->client = NULL;


Though I have not verified this, since I'm unable to reproduce the race condition. I don't believe this is at all specific to the 'node-memory-tune' command in virsh; it ought to happen with pretty much any virsh command.
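[Editorial note: to make the failure mode concrete, here is a minimal, self-contained sketch of the race Daniel describes and of the shape of his fix, written with plain pthreads and hypothetical Conn/Client types rather than libvirt's real structures or APIs; only the overall pattern is taken from the comment above.]

#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins for virConnect and virNetClient. */
typedef struct {
    pthread_mutex_t lock;           /* plays the role of the virConnectPtr mutex */
} Conn;

typedef struct {
    void (*closeCb)(void *opaque);  /* close callback, as in virNetClient */
    void *opaque;                   /* the Conn registered at connect time */
} Client;

/* The close callback locks and unlocks the connection mutex; if the Conn
 * has already been freed, the unlock is the "Invalid write ... inside a
 * block ... free'd" that valgrind reported. */
static void connCloseFunc(void *opaque)
{
    Conn *conn = opaque;
    pthread_mutex_lock(&conn->lock);
    /* ... mark the connection closed ... */
    pthread_mutex_unlock(&conn->lock);
}

/* What the event loop does when it sees the connection drop. */
static void clientIncomingEvent(Client *client)
{
    if (client->closeCb)
        client->closeCb(client->opaque);
}

/* Teardown, following the shape of the patch above: detach the callback
 * before the last reference to the Conn is dropped. */
static void doClose(Client *client, Conn *conn)
{
    client->closeCb = NULL;          /* the added virNetClientSetCloseCallback(..., NULL) step */
    client->opaque = NULL;

    pthread_mutex_destroy(&conn->lock);
    free(conn);                      /* without the step above, a late
                                      * clientIncomingEvent() would now call
                                      * connCloseFunc() on freed memory */
}

int main(void)
{
    Conn *conn = malloc(sizeof(*conn));
    Client client = { connCloseFunc, conn };

    if (!conn)
        return 1;
    pthread_mutex_init(&conn->lock, NULL);
    clientIncomingEvent(&client);    /* safe: callback registered, conn alive */
    doClose(&client, conn);
    clientIncomingEvent(&client);    /* safe: callback was detached first */
    return 0;
}

In the real code the event loop runs in a separate thread (vshEventLoop in the traces above), which is what turns the late callback into an intermittent race rather than a deterministic crash.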

Comment 14 Osier Yang 2012-12-04 16:30:59 UTC
(In reply to comment #13)
> (In reply to comment #8)
> > [valgrind trace trimmed; see comment 8]
> 
> Thanks to compiler optimization, this stack trace is somewhat incomplete and
> misleading.
> 
> The pthread_mutex_unlock call in the first stack trace is actually being
> invoked by 'remoteClientCloseFunc', which is in turn called from
> 'virNetClientCloseLocked'. The mutex is the virConnectPtr mutex.

Right, I tracked to here.

> 
> The problem is that this is happening after the last reference to
> virConnectPtr has been released.
> 
> The following ought to solve it
> 
> diff --git a/src/remote/remote_driver.c b/src/remote/remote_driver.c
> index 072d347..96bf0d8 100644
> --- a/src/remote/remote_driver.c
> +++ b/src/remote/remote_driver.c
> @@ -1001,6 +1001,10 @@ doRemoteClose(virConnectPtr conn, struct private_data
> *priv)
>  
>      virObjectUnref(priv->tls);
>      priv->tls = NULL;
> +    virNetClientSetCloseCallback(priv->client,
> +                                 NULL,
> +                                 NULL,
> +                                 NULL);

Thanks, Daniel, I believe this fixes the problem, but I don't think it fixes
the root cause. I.e., should we lock the whole connection object when
disposing of it? If the event thread gets the lock before or during
doRemoteClose and does some tricks, doRemoteClose might run into a similar
situation again.


>      virNetClientClose(priv->client);
>      virObjectUnref(priv->client);
>      priv->client = NULL;
> 
> 
> Though I have not verified this, since I'm unable to reproduce the race
> condition. I don't believe this is at all specific to the 'node-memory-tune'
> command in virsh; it ought to happen with pretty much any virsh command.

Right, the bug title is misleading.
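[Editorial note: as a rough illustration of the locking Osier suggests here, again with hypothetical names rather than the patch that was actually committed: if both the event thread and the teardown path touch the callback only while holding the same lock, clearing it closes the window entirely instead of merely narrowing it.]

#include <pthread.h>
#include <stddef.h>

/* Hypothetical client type; names are illustrative, not libvirt's. */
typedef struct {
    pthread_mutex_t lock;          /* guards closeCb/opaque */
    void (*closeCb)(void *opaque);
    void *opaque;
} Client;

/* Teardown registers NULL through the same locked path the event
 * thread reads through, so once it returns, no later event can see
 * the stale opaque pointer. */
static void clientSetCloseCallback(Client *client,
                                   void (*cb)(void *), void *opaque)
{
    pthread_mutex_lock(&client->lock);
    client->closeCb = cb;
    client->opaque = opaque;
    pthread_mutex_unlock(&client->lock);
}

static void clientIncomingEvent(Client *client)
{
    pthread_mutex_lock(&client->lock);
    if (client->closeCb)
        client->closeCb(client->opaque); /* invoked under the lock; real code
                                          * must ensure the callback's own
                                          * locking cannot deadlock with this */
    pthread_mutex_unlock(&client->lock);
}

static void myCloseCb(void *opaque)
{
    (void)opaque;                        /* e.g. mark the connection closed */
}

int main(void)
{
    Client client = { PTHREAD_MUTEX_INITIALIZER, NULL, NULL };

    clientSetCloseCallback(&client, myCloseCb, NULL);
    clientIncomingEvent(&client);         /* runs myCloseCb */
    clientSetCloseCallback(&client, NULL, NULL); /* race-free detach at teardown */
    clientIncomingEvent(&client);         /* sees NULL, does nothing */
    return 0;
}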

Comment 15 Alex Jia 2012-12-05 02:31:10 UTC
(In reply to comment #13)
> Though I have not verified this, since I'm unable to reproduce the race
> condition. I don't believe this is at all specific to the 'node-memory-tune'
> command in virsh; it ought to happen with pretty much any virsh command.

Yes, I think so too; however, as it happens, I hit the issue while running the virsh node-memory-tune command.

Comment 16 Osier Yang 2012-12-05 15:11:51 UTC
Patches posted upstream:

https://www.redhat.com/archives/libvir-list/2012-December/msg00165.html

One is Daniel's method; the other is to lock the whole virConnect object.

Comment 17 Osier Yang 2012-12-05 16:16:54 UTC
(In reply to comment #16)
> Patches posted upstream:
> 
> https://www.redhat.com/archives/libvir-list/2012-December/msg00165.html

Please ignore that wrong link; the right one is:

https://www.redhat.com/archives/libvir-list/2012-December/msg00219.html

> 
> One is Daniel's method; the other is to lock the whole virConnect object.

Comment 18 Osier Yang 2012-12-05 16:48:33 UTC
(In reply to comment #17)
> Please ignore that wrong link; the right one is:
> 
> https://www.redhat.com/archives/libvir-list/2012-December/msg00219.html

Committed upstream. Commit ID: ce5a1f6f9

Comment 22 EricLee 2012-12-14 08:26:42 UTC
Verified it with libvirt-0.10.2-12.el6:

# service libvirtd restart
Stopping libvirtd daemon:                                  [  OK  ]
Starting libvirtd daemon:                                  [  OK  ]

Manually ran the following command about 10 times:
# valgrind -v virsh node-memory-tune --shm-sleep-millisecs -1

Did not see errors like:
==27860== 1 errors in context 1 of 9:
==27860== Invalid write of size 4
==27860==    at 0x38BAA0A5E5: pthread_mutex_unlock (pthread_mutex_unlock.c:46)
==27860==    by 0x4D604B2: virNetClientCloseLocked (virnetclient.c:688)
==27860==    by 0x4D62998: virNetClientIncomingEvent (virnetclient.c:1859)
==27860==    by 0x4C7C6EE: virEventPollRunOnce (event_poll.c:485)
==27860==    by 0x4C7B486: virEventRunDefaultImpl (event.c:247)
==27860==    by 0x40CD11: vshEventLoop (virsh.c:2115)
==27860==    by 0x4C8CD48: virThreadHelper (threads-pthread.c:161)
==27860==    by 0x38BAA077F0: start_thread (pthread_create.c:301)
==27860==    by 0x33F68E570C: clone (clone.S:115)
==27860==  Address 0x52e5bb8 is 152 bytes inside a block of size 312 free'd
==27860==    at 0x4A0595D: free (vg_replace_malloc.c:366)
==27860==    by 0x4C83508: virFree (memory.c:309)
==27860==    by 0x4C94E7C: virObjectUnref (virobject.c:145)
==27860==    by 0x4D13FE7: virConnectClose (libvirt.c:1458)
==27860==    by 0x40C868: vshDeinit (virsh.c:2571)
==27860==    by 0x41069E: main (virsh.c:3009)

So moving to VERIFIED.

Comment 23 errata-xmlrpc 2013-02-21 07:10:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0276.html