Bug 675698

Summary: libvirt dies a quickly - udevGetSCSIType:753 : Failed to find SCSI device type 12
Product: Red Hat Enterprise Linux 6 Reporter: Mark Wagner <mwagner>
Component: libvirtAssignee: Daniel Veillard <veillard>
Status: CLOSED ERRATA QA Contact: Virtualization Bugs <virt-bugs>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 6.1CC: berrange, ccui, dallan, dyuan, eblake, jdenemar, jyang, mjenner, mzhan, xen-maint, yoyzhang
Target Milestone: rcKeywords: TestBlocker
Target Release: ---   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: libvirt-0.8.7-6.el6 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-05-19 13:26:45 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 675756    
Bug Blocks: 684655, 698208    
Attachments:
Description Flags
Dump from debug output
none
Only initialize libpciaccess once none

Description Mark Wagner 2011-02-07 12:32:49 UTC
Description of problem:
When trying to run libvirtd, it appears to start but dies quickly. /var/log/messages 

Version-Release number of selected component (if applicable):
Happens with libvirt based on 0.8.7-4.el6 and 0.8.7-5.el6 

How reproducible:
Works when logging in enabled but fails otherwise

Steps to Reproduce:
1. service libvirtd restart 
2. service libvirtd status
3.
  
Actual results:
[root@perf22 ~]# service libvirtd start
Starting libvirtd daemon:                                  [  OK  ]
[root@perf22 ~]# 
[root@perf22 ~]# service libvirtd status
libvirtd dead but subsys locked


Expected results:
libvirtd should show up as OK


Additional info:

Comment 2 Mark Wagner 2011-02-07 12:34:51 UTC
Created attachment 477401 [details]
Dump from debug output

Here is a dump of the output when running in debug mode.  Note this case passes and libvirtd is running.

Comment 3 Daniel Berrangé 2011-02-07 14:28:37 UTC
There's something very nasty going on here, it is very non-deterministic, but one thing I am seeing is quite a few BADF errors. In one trace I saw that something closed our syslog() socket behind our back !

sendto(15, "<11>Feb  7 09:13:26 libvirtd: 09:13:26.000: 6056: error : virStateInitialize:1022 : Initialization of QEMU state driver failed\n", 127, MSG_NOSIGNAL, NULL, 0) = -1 EBADF (Bad file descriptor)
close(15)                               = -1 EBADF (Bad file descriptor)
socket(PF_FILE, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 15
connect(15, {sa_family=AF_FILE, path="/dev/log"}, 110) = 0
sendto(15, "<11>Feb  7 09:13:26 libvirtd: 09:13:26.000: 6056: error : virStateInitialize:1022 : Initialization of QEMU state driver failed\n", 127, MSG_NOSIGNAL, NULL, 0) = 127


So there must be some code doing a close() on a file descriptor it doesn't own. This seems to occurr in middle of dealing with udev driver startup.

Comment 4 Daniel Berrangé 2011-02-07 16:15:22 UTC
The problem here is libpciaccess.

libvirt calls the following 3 functions in a sequence:

 pci_system_init()
 pci_get_strings()
 pci_system_cleanup()


The 'pci_get_strings' method iterates over PCI devices opening the 'config' file in sysfs. For some reason it caches the 'fd' for the config file in a global variable. 'pci_system_cleanup' closes that cached fd, but doesn't reset it to -1, so when libvirt repeats that sequence of 3 libpciaccess calls, the FD is closed for a second time.

This can be trivially demonstrated

 pci_system_init()
 pci_get_strings()
 pci_system_cleanup()
 nullfd = open('/dev/null', O_RDONLY)
 pci_system_init()
 pci_get_strings()
 pci_system_cleanup()
 close(nullfd);

The final close() call will get an error, because the FD associated with nullfd was closed by the second pci_get_strings() call.

Most of the time in libvirt we are lucky, but if something run from the method udevProcessPCI(), issues a libvirt logging API call, and this is the first logging API call in the libvirtd daemon, then the libvirt logging FD is the one that ends up getting closed.

We need to get libpciaccess fixed. We should also be more efficient and not call pci_system_init/cleanup multiple times. We can just do it once at startup and shutdown of the udev driver, not once per PCI device.

Comment 5 Daniel Berrangé 2011-02-07 17:06:41 UTC
Workaround for libpciaccess brokenness in libvirt:

http://www.redhat.com/archives/libvir-list/2011-February/msg00159.html

Comment 6 Daniel Berrangé 2011-02-07 17:40:34 UTC
Created attachment 477465 [details]
Only initialize libpciaccess once

Comment 9 Cui Chun 2011-02-15 02:33:46 UTC
Hi Mark,

According to comments, this bug is fixed in libvirt-0.8.7-6.el6. 
But It is difficult to reproduce it with our existing test environment in libvirt-0.8.7-4.el6.
Can this bug be found with your test environment in libvirt-0.8.7-6.el6? 
Do you have any idea to how to reproduce it with root cause?

Thanks
Chun Cui

Comment 10 Cui Chun 2011-03-01 02:30:36 UTC
Mark,

Any update for this bug?

Comment 11 Mark Wagner 2011-03-15 14:27:39 UTC
I don't have time to go back and check that specific version. Running libvirtd  0.8.7-10.el6 and I do not see the problem.

Comment 12 Cui Chun 2011-03-16 02:03:50 UTC
Thanks Mark's information. 

Change it to verified according to comment 11.

Comment 13 Eric Blake 2011-03-16 19:43:52 UTC
looks like this patch caused a regression in the ability to connect to qemu:///session as non-root, per bug 684655

Comment 14 zhanghaiyan 2011-04-20 08:44:13 UTC
Hi, Mark

As now we are doing rhel6.1 snapshot4 full regression testing and we haven't got this bug reproduce environment and exact steps, could you please do a favor to verify this bug with libvirt-0.8.7-18.el6.x86_64 to see whether it is still working properly?
Because per to comment 13, we are worrying the patch fix for bug 684655 may cause some regression to this bug.
Thanks a lot in advance.

Regards,
Yoyo

Comment 15 Eric Blake 2011-05-03 13:55:40 UTC
This patch also caused the regression in bug 698208

Comment 18 errata-xmlrpc 2011-05-19 13:26:45 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0596.html