Description of problem: Upstream libvirt 0.8.0 introduced a minor API break in the definition of enum virErrorNumber, which is passed over the wire between server and client. Since libvirt was rebased from 0.6.3 to 0.8.2 in the move from RHEL 5.5 to RHEL 5.6, this means that anyone using RHEL 5.6 to talk to an external libvirt server running libvirt versions 0.7.1 through 0.7.7 may misinterpret the virErrorNumber and print a misleading message for failures related to any of the affected enum values. Fortunately, there is no way to crash a server or client due to the API difference; the extent of the problem is merely a misleading error message. Version-Release number of selected component (if applicable): libvirt-0.8.2-15.el5 How reproducible: so far, this problem was found by code inspection; I haven't tried to actually create a testcase where it makes a difference, but think this will work: Steps to Reproduce: 1. Use a RHEL 5.6 virsh client (libvirt 0.8.2) to connect to an external (non-RHEL) libvirt server, where the server is in the range of 0.7.1 to 0.7.7 inclusive. 2. Trigger an action that causes the server to report a failure with one of the affected virErrorNumber values. To be more specific: it should be possible to trigger VIR_ERR_CONFIG_UNSUPPORTED by specifying a timezone in the <clock> element of the XML for a xen domain (see xen/xend_internal.c:xenDaemonFormatSxpr). 3. The libvirt client on RHEL will issue a misleading message; in the VIR_ERR_CONFIG_UNSUPPORTED case, the client will use the error msaage for VIR_ERR_BUILD_FIREWALL instead. Actual results: Expected results: Additional info: An upstream patch to teach clients how to interoperate with older servers was posted here: https://www.redhat.com/archives/libvir-list/2010-December/msg00855.html This patch should be backported to RHEL 5.7, but the issue is probably not severe enough to warrant worrying about it in RHEL 5.6. In the converse direction, there's no easy way to make the server hand back modified enum values to an external libvirt 0.7.* client, but as 0.7.* was lacking APIs that would trigger the error messages that changed in 0.8.0, this is not likely to be much of a problem.
In POST: http://post-office.corp.redhat.com/archives/rhvirt-patches/2011-April/msg00437.html
Another case that might reproduce the issue, using qemu rather than requiring a libvirt server on a xen host, might be as follows (also untested by me): 1. Two machines, client RHEL 5 machine with pre- and post-patch builds to be tested, and server on any other machine running a self-built libvirt from an 0.7.7 tarball 2. start a qemu guest running on the second machine 3. attach gdb to the libvirtd on the second machine, and register a breakpoint in virCondWaitUntil, and resume the program 4. from the RHEL 5 client, create a remote connection to the second machine and trigger virDomainSuspend of the running domain (via 'virsh -c ssh://second/system pause domain') 5. when gdb breaks inside virCondWaitUntil on the server, corrupt the timespec before the pthread_cond_timedwait call (via 'p ts.tv_nsec=-10), then step past the system call (which should fail with EINVAL), then change errno (via 'p ret=ETIMEDOUT') prior to the code that assigns 'errno = ret'. This should be sufficient to make the caller (qemuDomainObjBeginJobWithDriver) see a failure and go ahead and report VIR_ERR_OPERATION_TIMEDOUT with "cannot acquire state change lock" 6. resume libvirt on the second machine, and the client on the first machine will see VIR_ERR_OPERATION_TIMEDOUT "Timed out during operation" if the libvirt client is patched, but VIR_ERR_INVALID_NWFILTER "Invalid network filter" if the client is not compensating for the older server's response being inconsistent with the client's expectations.
(In reply to comment #5) > 5. when gdb breaks inside virCondWaitUntil on the server, corrupt the timespec > before the pthread_cond_timedwait call (via 'p ts.tv_nsec=-10), then step past > the system call (which should fail with EINVAL), then change errno (via 'p > ret=ETIMEDOUT') prior to the code that assigns 'errno = ret'. This should be > sufficient to make the caller (qemuDomainObjBeginJobWithDriver) see a failure > and go ahead and report VIR_ERR_OPERATION_TIMEDOUT with "cannot acquire state > change lock" Hi Eric , Today I installed libvirt-debuginfo , and with the gdb libvirtd started on the second machine , I will get the following result : (gdb) p ts.tv_nsec=-10 No symbol "ts" in current context. (gdb) p ret='ETIMEDOUT' No symbol "ETIMEDOUT" in current context. So anything I missed to installed on the second machine ? And btw after perform 'p ts.tv_nsec=-10' what should I do to step past the system call (which should fail with EINVAL) ? Addtional info : What was installed on the second machine : libvirt-client-0.7.7-3.fc13.x86_64 libvirt-0.7.7-3.fc13.x86_64 libvirt-python-0.7.7-3.fc13.x86_64 libvirt-debuginfo-0.7.7-3.fc13.x86_64 qemu-common-0.12.3-8.fc13.x86_64 qemu-debuginfo-0.13.0-1.fc13.x86_64 gpxe-roms-qemu-1.0.0-1.fc13.noarch qemu-img-0.12.3-8.fc13.x86_64 qemu-system-x86-0.12.3-8.fc13.x86_64 qemu-kvm-0.12.3-8.fc13.x86_64 Thanks Vivian
Vivian, I guess your gdb didn't breaks at virCondWaitUntil. As virCondWaitUntil will only be invoked when there is already a running job, and you trying to start another job. <quote> 4. from the RHEL 5 client, create a remote connection to the second machine and trigger virDomainSuspend of the running domain (via 'virsh -c ssh://second/system pause domain') </quote> Change above step into following will work: 4. from the RHEL 5 client, connect to server and try to do a coredump ("virsh -c qemu+ssh://second/system dump $domain $domain.dump"), before the dumping completes, open another terminal, and try to pause the domain ("virsh -c qemu+ssh://second/system suspend $domain"). Then your gdb will breaks into virCondWaitUntil, then you can do as left steps as Eric wrote.
Tried following steps to verify this bug . And passed . [Verified versions] libvirt-0.7.7-3.fc13.x86_64 libvirt-0.8.2-19.el5 [Client] libvirt-0.8.2-19.el5 # virsh -c qemu+ssh://second/system suspend test root@second's password: error: Failed to suspend domain test error: Timed out during operation: cannot acquire state change lock [Server] libvirt-0.7.7-3.fc13.x86_64 (gdb) b virCondWaitUntil Breakpoint 1 at 0x42d450: file util/threads-pthread.c, line 96. (gdb) r Starting program: /usr/sbin/libvirtd [Thread debugging using libthread_db enabled] [New Thread 0x7ffff6f99710 (LWP 10251)] [New Thread 0x7ffff6598710 (LWP 10252)] [New Thread 0x7ffff5b97710 (LWP 10253)] Detaching after fork from child process 10254. [New Thread 0x7ffff5196710 (LWP 10255)] Detaching after fork from child process 10257. [New Thread 0x7fffeffff710 (LWP 10258)] [New Thread 0x7fffef5fe710 (LWP 10259)] Detaching after fork from child process 10260. Detaching after fork from child process 10261. Detaching after fork from child process 10262. Detaching after fork from child process 10263. Detaching after fork from child process 10264. Detaching after fork from child process 10265. Detaching after fork from child process 10266. Detaching after fork from child process 10267. Detaching after fork from child process 10268. Detaching after fork from child process 10269. Detaching after fork from child process 10270. Detaching after fork from child process 10271. Detaching after fork from child process 10272. Detaching after fork from child process 10273. Detaching after fork from child process 10274. Detaching after fork from child process 10275. Detaching after fork from child process 10276. Detaching after fork from child process 10277. 17:25:37.610: warning : qemudStartup:1150 : Unable to create cgroup for driver: No such device or address Detaching after fork from child process 10278. Detaching after fork from child process 10279. Detaching after fork from child process 10280. Detaching after fork from child process 10281. Detaching after fork from child process 10282. Detaching after fork from child process 10283. Detaching after fork from child process 10284. Detaching after fork from child process 10285. Detaching after fork from child process 10286. Detaching after fork from child process 10287. 17:25:37.883: warning : lxcStartup:1748 : Unable to create cgroup for driver: No such device or address [Switching to Thread 0x7ffff6598710 (LWP 10252)] Breakpoint 1, virCondWaitUntil (c=0x8dd900, m=0x8dd8a0, whenms=1307006777426) at util/threads-pthread.c:96 96 ts.tv_sec = whenms / 1000; (gdb) s 92 { (gdb) s 96 ts.tv_sec = whenms / 1000; (gdb) s 92 { (gdb) s 96 ts.tv_sec = whenms / 1000; (gdb) s 92 { (gdb) s 96 ts.tv_sec = whenms / 1000; (gdb) s 97 ts.tv_nsec = (whenms % 1000) * 1000; (gdb) p ts.tv_nsec=-10 $1 = -10 (gdb) s 99 if ((ret = pthread_cond_timedwait(&c->cond, &m->lock, &ts)) != 0) { (gdb) p errno=110 $2 = 110 (gdb) s 97 ts.tv_nsec = (whenms % 1000) * 1000; (gdb) p ts.tv_nsec=-10 $3 = -10 (gdb) s 99 if ((ret = pthread_cond_timedwait(&c->cond, &m->lock, &ts)) != 0) { (gdb) s pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:53 53 pushq %r12 (gdb) p errno=110 $4 = 110 (gdb) s 56 pushq %r13 (gdb) fin Run till exit from #0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:56 0x000000000042d492 in virCondWaitUntil (c=<value optimized out>, m=<value optimized out>, whenms=<value optimized out>) at util/threads-pthread.c:99 99 if ((ret = pthread_cond_timedwait(&c->cond, &m->lock, &ts)) != 0) { (gdb) p errno=110 $5 = 110 (gdb) s 100 errno = ret; (gdb) p errno=110 $6 = 110 (gdb) s __errno_location () at ../csu/errno-loc.c:34 34 { (gdb) s __errno_location () at ../csu/errno-loc.c:36 36 } (gdb) s virCondWaitUntil (c=<value optimized out>, m=<value optimized out>, whenms=<value optimized out>) at util/threads-pthread.c:101 101 return -1; (gdb) s 104 } (gdb) s qemuDomainObjBeginJobWithDriver (driver=0x8d9e30, obj=0x8dd8a0) at qemu/qemu_driver.c:377 377 virDomainObjUnref(obj); (gdb) s virDomainObjUnref (dom=0x8dd8a0) at conf/domain_conf.c:705 705 { (gdb) s 707 VIR_DEBUG("obj=%p refs=%d", dom, dom->refs); (gdb) s 705 { (gdb) s 706 dom->refs--; (gdb) s 707 VIR_DEBUG("obj=%p refs=%d", dom, dom->refs); (gdb) s virLogMessage (category=0x3d29b0e9f7 "file.conf/domain_conf.c", priority=1, funcname=0x3d29b10b60 "virDomainObjUnref", linenr=707, flags=0, fmt=0x3d29b0e9e8 "obj=%p refs=%d") at util/logging.c:517 517 long long linenr, int flags, const char *fmt, ...) { (gdb) s 524 if (!virLogInitialized) (gdb) s 518 char *str = NULL; (gdb) s 524 if (!virLogInitialized) (gdb) s 527 if (fmt == NULL) (gdb) s 533 fprio = virLogFiltersCheck(category); (gdb) s virLogFiltersCheck (category=0x3d29b0e9f7 "file.conf/domain_conf.c", priority=1, funcname=0x3d29b10b60 "virDomainObjUnref", linenr=707, flags=0, fmt=0x3d29b0e9e8 "obj=%p refs=%d") at util/logging.c:422 422 virLogLock(); (gdb) s virLogLock () at util/logging.c:138 138 virMutexLock(&virLogMutex); (gdb) s virMutexLock (m=0x6ca640) at util/threads-pthread.c:52 52 pthread_mutex_lock(&m->lock); (gdb) s __pthread_mutex_lock (mutex=0x6ca640) at pthread_mutex_lock.c:50 50 unsigned int type = PTHREAD_MUTEX_TYPE (mutex); (gdb) s 47 { (gdb) s 51 if (__builtin_expect (type & ~PTHREAD_MUTEX_KIND_MASK_NP, 0)) (gdb) s 50 unsigned int type = PTHREAD_MUTEX_TYPE (mutex); (gdb) s 51 if (__builtin_expect (type & ~PTHREAD_MUTEX_KIND_MASK_NP, 0)) (gdb) s 54 pid_t id = THREAD_GETMEM (THREAD_SELF, tid); (gdb) s 56 if (__builtin_expect (type, PTHREAD_MUTEX_TIMED_NP) (gdb) s 61 LLL_MUTEX_LOCK (mutex); (gdb) s 127 ++mutex->__data.__nusers; (gdb) s 125 mutex->__data.__owner = id; (gdb) s 127 ++mutex->__data.__nusers; (gdb) s __pthread_mutex_lock (mutex=0x6ca640) at pthread_mutex_lock.c:131 131 } (gdb) s virLogFiltersCheck (category=0x3d29b0e9f7 "file.conf/domain_conf.c", priority=1, funcname=0x3d29b10b60 "virDomainObjUnref", linenr=707, flags=0, fmt=0x3d29b0e9e8 "obj=%p refs=%d") at util/logging.c:423 423 for (i = 0;i < virLogNbFilters;i++) { (gdb) s virLogMessage (category=0x3d29b0e9f7 "file.conf/domain_conf.c", priority=1, funcname=0x3d29b10b60 "virDomainObjUnref", linenr=707, flags=0, fmt=0x3d29b0e9e8 "obj=%p refs=%d") at util/logging.c:533 533 fprio = virLogFiltersCheck(category); (gdb) s virLogFiltersCheck (category=0x3d29b0e9f7 "file.conf/domain_conf.c", priority=1, funcname=0x3d29b10b60 "virDomainObjUnref", linenr=707, flags=0, fmt=0x3d29b0e9e8 "obj=%p refs=%d") at util/logging.c:429 429 virLogUnlock(); (gdb) s virLogUnlock () at util/logging.c:142 142 virMutexUnlock(&virLogMutex); (gdb) s virMutexUnlock (m=0x6ca640) at util/threads-pthread.c:57 57 pthread_mutex_unlock(&m->lock); (gdb) s __pthread_mutex_unlock (mutex=0x6ca640) at pthread_mutex_unlock.c:290 290 return __pthread_mutex_unlock_usercnt (mutex, 1); (gdb) s __pthread_mutex_unlock_usercnt (mutex=0x6ca640) at pthread_mutex_unlock.c:289 289 { (gdb) s 290 return __pthread_mutex_unlock_usercnt (mutex, 1); (gdb) s __pthread_mutex_unlock_usercnt (mutex=0x6ca640) at pthread_mutex_unlock.c:37 37 int type = PTHREAD_MUTEX_TYPE (mutex); (gdb) s 38 if (__builtin_expect (type & ~PTHREAD_MUTEX_KIND_MASK_NP, 0)) (gdb) s 41 if (__builtin_expect (type, PTHREAD_MUTEX_TIMED_NP) (gdb) s 46 mutex->__data.__owner = 0; (gdb) s 49 --mutex->__data.__nusers; (gdb) s 52 lll_unlock (mutex->__data.__lock, PTHREAD_MUTEX_PSHARED (mutex)); (gdb) s virLogMessage (category=0x3d29b0e9f7 "file.conf/domain_conf.c", priority=1, funcname=0x3d29b10b60 "virDomainObjUnref", linenr=707, flags=0, fmt=0x3d29b0e9e8 "obj=%p refs=%d") at util/logging.c:535 535 if (priority < virLogDefaultPriority) (gdb) s 587 } (gdb) s virDomainObjUnref (dom=0x8dd8a0) at conf/domain_conf.c:708 708 if (dom->refs == 0) { (gdb) s 714 } (gdb) s qemuDomainObjBeginJobWithDriver (driver=0x8d9e30, obj=0x8dd8a0) at qemu/qemu_driver.c:378 378 if (errno == ETIMEDOUT) (gdb) s __errno_location () at ../csu/errno-loc.c:34 34 { (gdb) s __errno_location () at ../csu/errno-loc.c:36 36 } (gdb) s qemuDomainObjBeginJobWithDriver (driver=0x8d9e30, obj=<value optimized out>) at qemu/qemu_driver.c:379 379 qemuReportError(VIR_ERR_OPERATION_TIMEOUT, (gdb) s __dcgettext (domainname=0x48da09 "libvirt", msgid=0x494710 "cannot acquire state change lock", category=5) at dcgettext.c:52 52 { (gdb) c Continuing. 17:28:26.061: error : qemuDomainObjBeginJobWithDriver:380 : Timed out during operation: cannot acquire state change lock libvir: QEMU error : Timed out during operation: cannot acquire state change lock
In comment 9, server can be setup with Fedora 13(release version) and install libvirt-debuginfo-0.7.7-3.fc13.x86_64. First [Server] 1.# /etc/init.d/libvirtd stop 2.# gdb libivrtd (gdb) b virCondWaitUntil Breakpoint 1 at 0x42d450: file util/threads-pthread.c, line 96. (gdb) r Starting program: /usr/sbin/libvirtd Then [Client] 1. in one console # virsh -c qemu+ssh://server/system dump vm /home/vm.dump 2. in another console and before step1 finish # virsh -c qemu+ssh://server/system suspend vm Third [Server] Continue "First" part gdb process # gdb libivrtd (gdb) b virCondWaitUntil Breakpoint 1 at 0x42d450: file util/threads-pthread.c, line 96. (gdb) r Starting program: /usr/sbin/libvirtd ........... (gdb) s ........
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-1019.html