Bug 672226

Summary: virDomainCreateXML over remote driver sometimes hangs
Product: Red Hat Enterprise Linux 6 Reporter: Dan Kenigsberg <danken>
Component: libvirtAssignee: Daniel Berrangé <berrange>
Status: CLOSED ERRATA QA Contact: Virtualization Bugs <virt-bugs>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 6.0CC: abaron, dallan, dnaori, eblake, hateya, ilvovsky, jdenemar, jyang, mgoldboi, mjenner, mzhan, vbian, xen-maint, yoyzhang
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: libvirt-0.8.7-5.el6 Doc Type: Bug Fix
Doc Text:
Using SASL authentication with a single libvirt connection for multiple threads could result in libvirt hanging while waiting for a response from the libvirt daemon. SASL decoding has been fixed such that clients do not wait for further data while already decoded SASL data remains unprocessed, so libvirt no longer hangs in this situation.
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-05-19 13:26:17 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 669974    
Attachments:
Description Flags
"t a a bt full" in libvirtd
none
"t a a bt full" in vdsm
none
"t a a bt full" in libvirtd
none
multiple_thread script none

Description Dan Kenigsberg 2011-01-24 13:33:23 UTC
Description of problem:
Sometimes, a vdsm-issued virDomainCreateXML call succeeds to create a VM, but does not return. The libvirtd side shows nothing suspicious (to me), but the client side blocks on poll() - see attached backtrace.

Version-Release number of selected component (if applicable):
vdsm-4.9-43
libvirt-0.8.7-3.el6.x86_64

How reproducible:
Seldom. Start and destroy multiple VMs via vdsm.

Comment 1 Dan Kenigsberg 2011-01-24 13:35:48 UTC
Apparently, after a long wait, that poll() has ended.

Just before createXML we printed:

Thread-2506::DEBUG::2011-01-24 13:54:52,726::libvirtvm::755::vds.vmlog.b63a6c0e-afd1-4962-836e-3ea4bebcb884::(_run) DDD

Just after createXML we printed:

Thread-2506::DEBUG::2011-01-24 15:19:30,793::libvirtvm::759::vds.vmlog.b63a6c0e-afd1-4962-836e-3ea4bebcb884::(_run) DDD

Comment 2 Dave Allan 2011-01-24 21:07:38 UTC
Dan B, have any theories about what's going on here?

Comment 3 Daniel Berrangé 2011-01-25 10:38:17 UTC
The client side blocking on poll(), means it is waiting for the reply from the daemon. This suggests that the Create API is still running. Would need to capture the libvirtd stack trace at this point to see what its doing while the client is waiting.

Comment 4 Dave Allan 2011-01-25 14:56:20 UTC
Dan K., can you get a trace from libvirtd the next time you see this?

Comment 5 Dan Kenigsberg 2011-01-26 16:27:23 UTC
Created attachment 475433 [details]
"t a a bt full" in libvirtd

I hope this helps.

Comment 6 Daniel Berrangé 2011-01-27 13:19:22 UTC
That attachment shows libvirtd completely idle, in its normal waiting for more RPC calls state.

Can we get a trace of all threads in the client too at time of hang. I'm more suspicious of the pure python event loop impl in the client now, because of the problems just identified in  bug 624252 comment #38

Comment 7 Dan Kenigsberg 2011-01-31 08:46:13 UTC
Created attachment 476139 [details]
"t a a bt full" in vdsm

Comment 8 Dan Kenigsberg 2011-01-31 08:48:13 UTC
Created attachment 476141 [details]
"t a a bt full" in libvirtd

Happened again (libvirt-0.8.7-3.el6.x86_64), backtraces for client and server attached.

Comment 9 Daniel Berrangé 2011-02-01 11:02:18 UTC
When this occurs, does VDSM hang forever, or does it eventually recover after a long wait ?

Comment 10 Daniel Berrangé 2011-02-01 11:36:46 UTC
Can you also attach the full VDSM log, and the libvirtd daemon logs from the same period, so I can try and correlate what's going on between the two. If it happens again, it'd be helpful if I can login & poke libvirtd with GDB while it is still hung.

Comment 11 Daniel Berrangé 2011-02-01 13:00:57 UTC
Interesting points from debugging on a live system

 - libvirtd is completely idle. Analysing the logs shows that every single virDomainCreateXML call completed in < 2 seconds.

 - On client, there are about 50 threads, all blocked up on remoteIO() and one in remoteIOEventLoop().

 - All the blocked threads have transmitted all their data and are in 'mode = 1' which indicates waiting for 'rx' of reply data.

 - Poll() is listening on POLLIN on data, so if there was any data from libvirtd it would have seen it

 - saslDecodedLength = 256, and saslDecodedOffset = 96. This indicates we didn't consume all the decoded SASL data for inexplicable reasons.

The last point is the only sign of clear trouble thus far. We should not be sitting in poll() waiting to rx data, if we still have some previously decoded SASL data available.

Comment 12 Daniel Berrangé 2011-02-01 13:15:09 UTC
One idea worth testing - disable the SASL authentication on the UNIX socket. eg edit /etc/libvirt/libvirtd.conf and set

  auth_unix_rw = "none"
  auth_unix_ro = "none"

and restart libvirtd. This will cause libvirt to take a different code path for I/O which may avoid this problem.

Comment 13 Daniel Berrangé 2011-02-01 16:11:50 UTC
Confirmed that there is an I/O flaw in the libvirt client SASL codepath. The following patch upstream will fix it

http://www.redhat.com/archives/libvir-list/2011-February/msg00017.html

It will affect potentially any API call, not simply virDomainCreateXML. Depending entirely on luck the client might unblock itself automagically or not.

Disabling SASL auth on the server is a suitable workaround.

Comment 16 David Naori 2011-02-28 19:28:30 UTC
Verified on libvirt-0.8.7-8.el6.x86_64

Comment 17 Min Zhan 2011-03-04 06:20:51 UTC
I have reproduced this bug with libvirt-0.8.7-3.el6.x86_64, and verify this bug with libvirt-0.8.7-8.el6.x86_64.

Reproduce steps:

1. Modify libvirtd conf file 
# cat /etc/libvirt/libvirtd.conf
make sure:

auth_unix_ro = "sasl"

auth_unix_rw = "sasl"

2. # service libvirtd restart

3. In host use sasl encryption
# saslpasswd2 -a libvirt root
Password: xxxxxx
Again (for verification): xxxxxx

4. Setup libvirt-test-API environment, which can be found on libvirt.org (http://libvirt.org/git/?p=libvirt-test-API.git;a=summary)

The main script about multithread is: ~/libvirt-test-API/repos/Python/regression/multiple_thread_block_on_domain_create.py

you can set the guest name and number in ~/libvirt-test-API/case.conf
set guestnum: 0-10 means guest0,...,guest9

Final execute
# python libvirt-test-api.py

Then guest0,...,guest9 disk img will be created and installed in /var/lib/libvirt/images/guest*.

5. Since this bug is about multithread reply problem, use script and case.conf  to create some guests at a time, and keep these running at the same time, then repeat; finally the script is hang for waiting for the reply/return.

6. Check libvirtd status, found it is idle
# top -p `pidof libvirtd`

top - 16:57:41 up 19:23,  8 users,  load average: 0.00, 0.15, 2.19
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.2%us,  0.0%sy,  0.0%ni, 99.7%id,  0.2%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3756240k total,   495500k used,  3260740k free,     5052k buffers
Swap:  2047992k total,   222760k used,  1825232k free,    81596k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                         
16983 root      20   0  701m  29m   24 S  0.0  0.8   0:08.29 libvirtd   


Now I have tried several times about step5 in libvirt-0.8.7-8.el6.x86_64,and it works well now.

Comment 18 Min Zhan 2011-03-04 06:23:03 UTC
Created attachment 482231 [details]
multiple_thread script

Comment 19 Jiri Denemark 2011-05-03 12:33:35 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause
    enabling SASL authentication and using a single libvirt connection from
    multiple threads

Consequence
    libvirt client may hang while waiting for a reply from libvirtd

Fix
    SASL decoding was fixed so that clients are not waiting for more data even
    though existing SASL decoded data remains unprocessed

Result
    libvirt client does not hang when SASL authentication is enabled

Comment 22 Laura Bailey 2011-05-04 02:20:59 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,13 +1 @@
-Cause
+Using SASL authentication with a single libvirt connection for multiple threads could result in libvirt hanging while waiting for a response from the libvirt daemon. SASL decoding has been fixed such that clients do not wait for further data while already decoded SASL data remains unprocessed, so libvirt no longer hangs in this situation.-    enabling SASL authentication and using a single libvirt connection from
-    multiple threads
-
-Consequence
-    libvirt client may hang while waiting for a reply from libvirtd
-
-Fix
-    SASL decoding was fixed so that clients are not waiting for more data even
-    though existing SASL decoded data remains unprocessed
-
-Result
-    libvirt client does not hang when SASL authentication is enabled

Comment 23 errata-xmlrpc 2011-05-19 13:26:17 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0596.html