Bug 1199036

Summary: libvirtd crashes and is restarted when an active blockcommit is issued while a blockpull job is running
Product: Red Hat Enterprise Linux 7
Reporter: Shanzhi Yu <shyu>
Component: libvirt
Assignee: Peter Krempa <pkrempa>
Status: CLOSED ERRATA
QA Contact: Virtualization Bugs <virt-bugs>
Severity: high
Priority: high
Version: 7.1
CC: dyuan, eblake, jdenemar, jkurik, mst, mzhan, pkrempa, rbalakri, shyu, xuzhang, yanyang
Target Milestone: rc
Target Release: ---
Keywords: ZStream
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: libvirt-1.2.14-1.el7
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Cloned to: 1202719 (view as bug list)
Environment:
Last Closed: 2015-11-19 06:18:43 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1199182, 1202719

Attachments:
    script reproduce the new bug (no flags)

Description Shanzhi Yu 2015-03-05 11:14:18 UTC
Description of problem:


libvirtd crashes and is restarted when an active blockcommit is issued while a blockpull job is running.

Version-Release number of selected component (if applicable):

libvirt-1.2.8-16.el7_1.1.x86_64

How reproducible:

100%

Steps to Reproduce:

1. Prepare a guest, create an external snapshot, and note the PID of libvirtd

# pidof libvirtd
46363


# virsh list 
 Id    Name                           State
----------------------------------------------------
 8     r7                             running

# virsh snapshot-create-as r7 s1 --disk-only 
Domain snapshot s1 created

2. Do blockpull 
# virsh blockpull r7 vda --verbose --wait 
Block Pull: [ 15 %]


3. Do an active blockcommit before the blockpull job finishes.
# virsh blockcommit r7 vda --active --verbose --wait 

After a few seconds:

Error in step 2:
# virsh blockpull r7 vda --verbose --wait 
Block Pull: [ 15 %]error: failed to query job for disk vda
error: Timed out during operation: cannot acquire state change lock

Error in step 3:

# virsh blockcommit r7 vda --active --verbose --wait 
2015-03-05 11:02:16.929+0000: 46543: info : libvirt version: 1.2.8, package: 16.el7_1.1 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2015-02-25-05:03:17, x86-025.build.eng.bos.redhat.com)
2015-03-05 11:02:16.929+0000: 46543: warning : virKeepAliveTimerInternal:143 : No response from client 0x7f4166cce470 after 6 keepalive messages in 35 seconds
2015-03-05 11:02:16.929+0000: 46544: warning : virKeepAliveTimerInternal:143 : No response from client 0x7f4166cce470 after 6 keepalive messages in 35 seconds
error: internal error: received hangup / error event on socket
error: Failed to reconnect to the hypervisor


4. Check libvirtd status; note that the PID has changed, i.e. libvirtd crashed and was restarted by systemd (an API-level reproducer sketch follows the status output below)
# pidof libvirtd
46799

# systemctl status libvirtd.service
libvirtd.service - Virtualization daemon
   Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; enabled)
   Active: active (running) since Thu 2015-03-05 19:03:58 CST; 40s ago
     Docs: man:libvirtd(8)
           http://libvirt.org
 Main PID: 46799 (libvirtd)
   CGroup: /system.slice/libvirtd.service
           ├─ 1739 /sbin/dnsmasq --conf-file=/var/lib/libvirt/dnsmasq/default.conf --dhcp-script=/usr/libexec/libvirt_leaseshelper
           ├─ 1740 /sbin/dnsmasq --conf-file=/var/lib/libvirt/dnsmasq/default.conf --dhcp-script=/usr/libexec/libvirt_leaseshelper
           └─46799 /usr/sbin/libvirtd
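
The same race can also be hit through the libvirt C API directly (the backtrace below enters libvirtd via virDomainBlockCommit). The following is a minimal sketch, not part of the original report: it assumes a running domain named "r7" with disk target "vda" and an external snapshot already created, as in the steps above, and keeps error handling to a minimum.

/* Hypothetical reproducer sketch -- not from the original report.
 * Build (illustrative): gcc repro.c -o repro $(pkg-config --cflags --libs libvirt) */
#include <stdio.h>
#include <libvirt/libvirt.h>
#include <libvirt/virterror.h>

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (!conn) {
        fprintf(stderr, "failed to connect to qemu:///system\n");
        return 1;
    }

    virDomainPtr dom = virDomainLookupByName(conn, "r7");
    if (!dom) {
        fprintf(stderr, "domain 'r7' not found\n");
        virConnectClose(conn);
        return 1;
    }

    /* Start pulling the backing chain into the active layer (step 2). */
    if (virDomainBlockPull(dom, "vda", 0, 0) < 0)
        fprintf(stderr, "blockpull failed: %s\n", virGetLastErrorMessage());

    /* Request an active commit while the pull is still running (step 3).
     * On the broken build this is where libvirtd could crash; a fixed
     * libvirt rejects it with "already in active block job". */
    if (virDomainBlockCommit(dom, "vda", NULL, NULL, 0,
                             VIR_DOMAIN_BLOCK_COMMIT_ACTIVE) < 0)
        fprintf(stderr, "blockcommit failed: %s\n", virGetLastErrorMessage());

    virDomainFree(dom);
    virConnectClose(conn);
    return 0;
}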



Actual results:

libvirtd aborts (glibc reports an invalid free, see the backtrace below) and is restarted by systemd; both virsh commands fail with the errors shown above.

Expected results:

The second block job request is rejected cleanly (or both jobs are handled safely) and libvirtd does not crash.

Additional info:


(gdb) t a a bt

Thread 11 (Thread 0x7f0069b7f700 (LWP 46364)):
#0  0x00007f0075d09705 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f0078474a86 in virCondWait (c=c@entry=0x7f0079e51c00, m=m@entry=0x7f0079e51bd8) at util/virthread.c:153
#2  0x00007f0078474f3b in virThreadPoolWorker (opaque=opaque@entry=0x7f0079e46150) at util/virthreadpool.c:104
#3  0x00007f007847483e in virThreadHelper (data=<optimized out>) at util/virthread.c:197
#4  0x00007f0075d05df5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f007562c1ad in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7f006937e700 (LWP 46365)):
#0  0x00007f0075d09705 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f0078474a86 in virCondWait (c=c@entry=0x7f0079e51c00, m=m@entry=0x7f0079e51bd8) at util/virthread.c:153
#2  0x00007f0078474f3b in virThreadPoolWorker (opaque=opaque@entry=0x7f0079e45ff0) at util/virthreadpool.c:104
#3  0x00007f007847483e in virThreadHelper (data=<optimized out>) at util/virthread.c:197
#4  0x00007f0075d05df5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f007562c1ad in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7f0068b7d700 (LWP 46366)):
#0  0x00007f0075d09705 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f0078474a86 in virCondWait (c=c@entry=0x7f0079e51c00, m=m@entry=0x7f0079e51bd8) at util/virthread.c:153
#2  0x00007f0078474f3b in virThreadPoolWorker (opaque=opaque@entry=0x7f0079e45e90) at util/virthreadpool.c:104
#3  0x00007f007847483e in virThreadHelper (data=<optimized out>) at util/virthread.c:197
#4  0x00007f0075d05df5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f007562c1ad in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7f006837c700 (LWP 46367)):
#0  0x00007f00755b656c in free () from /lib64/libc.so.6
#1  0x00007f007841befa in virFree (ptrptr=ptrptr@entry=0x7f0048000f78) at util/viralloc.c:582
#2  0x00007f007846e4aa in virStorageSourceClear (def=0x7f0048000f70) at util/virstoragefile.c:2018
#3  0x00007f007846d9f3 in virStorageSourceFree (def=0x7f0048000f70) at util/virstoragefile.c:2042
#4  0x00007f0061b88e70 in qemuDomainBlockCommit (dom=<optimized out>, path=<optimized out>, base=<optimized out>, top=<optimized out>, bandwidth=0, 
    flags=<optimized out>) at qemu/qemu_driver.c:16243
#5  0x00007f0078529d4d in virDomainBlockCommit (dom=dom@entry=0x7f0048000b90, disk=0x7f0048000b70 "vda", base=0x0, top=0x0, bandwidth=0, flags=4)
    at libvirt.c:20226
#6  0x00007f0078f89bbc in remoteDispatchDomainBlockCommit (server=<optimized out>, msg=<optimized out>, args=0x7f0048000e70, rerr=0x7f006837bc80, 
    client=<optimized out>) at remote_dispatch.h:2597
#7  remoteDispatchDomainBlockCommitHelper (server=<optimized out>, client=<optimized out>, msg=<optimized out>, rerr=0x7f006837bc80, args=0x7f0048000e70, 
    ret=<optimized out>) at remote_dispatch.h:2567
#8  0x00007f0078571322 in virNetServerProgramDispatchCall (msg=0x7f0079e5f940, client=0x7f0079e60510, server=0x7f0079e51a80, prog=0x7f0079e5cca0)
    at rpc/virnetserverprogram.c:437
#9  virNetServerProgramDispatch (prog=0x7f0079e5cca0, server=server@entry=0x7f0079e51a80, client=0x7f0079e60510, msg=0x7f0079e5f940)
    at rpc/virnetserverprogram.c:307
#10 0x00007f0078fb23ed in virNetServerProcessMsg (msg=<optimized out>, prog=<optimized out>, client=<optimized out>, srv=0x7f0079e51a80) at rpc/virnetserver.c:172
#11 virNetServerHandleJob (jobOpaque=<optimized out>, opaque=0x7f0079e51a80) at rpc/virnetserver.c:193
#12 0x00007f0078474ea5 in virThreadPoolWorker (opaque=opaque@entry=0x7f0079e45d30) at util/virthreadpool.c:145
#13 0x00007f007847483e in virThreadHelper (data=<optimized out>) at util/virthread.c:197
#14 0x00007f0075d05df5 in start_thread () from /lib64/libpthread.so.0
#15 0x00007f007562c1ad in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f0067b7b700 (LWP 46368)):
#0  0x00007f0075d09705 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f0078474a86 in virCondWait (c=c@entry=0x7f0079e51c00, m=m@entry=0x7f0079e51bd8) at util/virthread.c:153
#2  0x00007f0078474f3b in virThreadPoolWorker (opaque=opaque@entry=0x7f0079e45ff0) at util/virthreadpool.c:104
#3  0x00007f007847483e in virThreadHelper (data=<optimized out>) at util/virthread.c:197
#4  0x00007f0075d05df5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f007562c1ad in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f006737a700 (LWP 46369)):
#0  0x00007f0075d09705 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f0078474a86 in virCondWait (c=c@entry=0x7f0079e51c98, m=m@entry=0x7f0079e51bd8) at util/virthread.c:153
#2  0x00007f0078474f5b in virThreadPoolWorker (opaque=opaque@entry=0x7f0079e45e90) at util/virthreadpool.c:104
#3  0x00007f007847483e in virThreadHelper (data=<optimized out>) at util/virthread.c:197
#4  0x00007f0075d05df5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f007562c1ad in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f0066b79700 (LWP 46370)):
#0  0x00007f0075d09705 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f0078474a86 in virCondWait (c=c@entry=0x7f0079e51c98, m=m@entry=0x7f0079e51bd8) at util/virthread.c:153
#2  0x00007f0078474f5b in virThreadPoolWorker (opaque=opaque@entry=0x7f0079e45ff0) at util/virthreadpool.c:104
#3  0x00007f007847483e in virThreadHelper (data=<optimized out>) at util/virthread.c:197
#4  0x00007f0075d05df5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f007562c1ad in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f0066378700 (LWP 46371)):
#0  0x00007f0075d09705 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f0078474a86 in virCondWait (c=c@entry=0x7f0079e51c98, m=m@entry=0x7f0079e51bd8) at util/virthread.c:153
#2  0x00007f0078474f5b in virThreadPoolWorker (opaque=opaque@entry=0x7f0079e45e90) at util/virthreadpool.c:104
#3  0x00007f007847483e in virThreadHelper (data=<optimized out>) at util/virthread.c:197
#4  0x00007f0075d05df5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f007562c1ad in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f0065b77700 (LWP 46372)):
#0  0x00007f0075d09705 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f0078474a86 in virCondWait (c=c@entry=0x7f0079e51c98, m=m@entry=0x7f0079e51bd8) at util/virthread.c:153
#2  0x00007f0078474f5b in virThreadPoolWorker (opaque=opaque@entry=0x7f0079e45d30) at util/virthreadpool.c:104
#3  0x00007f007847483e in virThreadHelper (data=<optimized out>) at util/virthread.c:197
#4  0x00007f0075d05df5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f007562c1ad in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f0065376700 (LWP 46373)):
#0  0x00007f0075d09705 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f0078474a86 in virCondWait (c=c@entry=0x7f0079e51c98, m=m@entry=0x7f0079e51bd8) at util/virthread.c:153
#2  0x00007f0078474f5b in virThreadPoolWorker (opaque=opaque@entry=0x7f0079e46150) at util/virthreadpool.c:104
#3  0x00007f007847483e in virThreadHelper (data=<optimized out>) at util/virthread.c:197
#4  0x00007f0075d05df5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f007562c1ad in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f0078f38880 (LWP 46363)):
#0  0x00007f0075621b7d in poll () from /lib64/libc.so.6
#1  0x00007f0078438cb1 in poll (__timeout=4341, __nfds=10, __fds=<optimized out>) at /usr/include/bits/poll2.h:46
#2  virEventPollRunOnce () at util/vireventpoll.c:643
#3  0x00007f00784377a2 in virEventRunDefaultImpl () at util/virevent.c:308
#4  0x00007f0078fb389d in virNetServerRun (srv=0x7f0079e51a80) at rpc/virnetserver.c:1139
#5  0x00007f0078f805b8 in main (argc=<optimized out>, argv=<optimized out>) at libvirtd.c:1507

Comment 1 Shanzhi Yu 2015-03-05 11:18:05 UTC
More info that may be useful:

(gdb) c
Continuing.
Detaching after fork from child process 19544.
Detaching after fork from child process 19554.

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fd64e66b700 (LWP 19317)]
0x00007fd6593828d7 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
55	  return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) t a a bt

Thread 11 (Thread 0x7fd64f66d700 (LWP 19315)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007fd65ae8e446 in virCondWait (c=c@entry=0x7fd65c9f3990, m=m@entry=0x7fd65c9f3968) at util/virthread.c:153
#2  0x00007fd65ae8e8fb in virThreadPoolWorker (opaque=opaque@entry=0x7fd65c9ff730) at util/virthreadpool.c:104
#3  0x00007fd65ae8e1fe in virThreadHelper (data=<optimized out>) at util/virthread.c:197
#4  0x00007fd65971252a in start_thread (arg=0x7fd64f66d700) at pthread_create.c:310
#5  0x00007fd65944e22d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 10 (Thread 0x7fd64ee6c700 (LWP 19316)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007fd65ae8e4b5 in virCondWaitUntil (c=c@entry=0x7fd62400b150, m=m@entry=0x7fd624000b80, whenms=whenms@entry=1425554002914) at util/virthread.c:168
#2  0x00007fd64881bfb0 in qemuDomainObjBeginJobInternal (driver=driver@entry=0x7fd64008ff60, obj=0x7fd624000b70, job=job@entry=QEMU_JOB_QUERY, 
    asyncJob=asyncJob@entry=QEMU_ASYNC_JOB_NONE) at qemu/qemu_domain.c:1347
#3  0x00007fd64881da4b in qemuDomainObjBeginJob (driver=driver@entry=0x7fd64008ff60, obj=<optimized out>, job=job@entry=QEMU_JOB_QUERY) at qemu/qemu_domain.c:1427
#4  0x00007fd6488878ef in qemuDomainGetBlockJobInfo (dom=<optimized out>, path=0x7fd63c001070 "vda", info=0x7fd64ee6bbb0, flags=0) at qemu/qemu_driver.c:15939
#5  0x00007fd65af2d4b4 in virDomainGetBlockJobInfo (dom=dom@entry=0x7fd63c001590, disk=0x7fd63c001070 "vda", info=info@entry=0x7fd64ee6bbb0, flags=0) at libvirt-domain.c:9523
#6  0x00007fd65b951abc in remoteDispatchDomainGetBlockJobInfo (server=<optimized out>, msg=<optimized out>, ret=0x7fd63c0016a0, args=0x7fd63c001600, rerr=0x7fd64ee6bcb0, 
    client=<optimized out>) at remote.c:2730
#7  remoteDispatchDomainGetBlockJobInfoHelper (server=<optimized out>, client=<optimized out>, msg=<optimized out>, rerr=0x7fd64ee6bcb0, args=0x7fd63c001600, 
    ret=0x7fd63c0016a0) at remote_dispatch.h:3982
#8  0x00007fd65af82489 in virNetServerProgramDispatchCall (msg=0x7fd65c9f54a0, client=0x7fd65c9fcb90, server=0x7fd65c9f3820, prog=0x7fd65c9fbaf0)
    at rpc/virnetserverprogram.c:437
#9  virNetServerProgramDispatch (prog=0x7fd65c9fbaf0, server=server@entry=0x7fd65c9f3820, client=0x7fd65c9fcb90, msg=0x7fd65c9f54a0) at rpc/virnetserverprogram.c:307
#10 0x00007fd65b971288 in virNetServerProcessMsg (msg=<optimized out>, prog=<optimized out>, client=<optimized out>, srv=0x7fd65c9f3820) at rpc/virnetserver.c:172
#11 virNetServerHandleJob (jobOpaque=<optimized out>, opaque=0x7fd65c9f3820) at rpc/virnetserver.c:193
#12 0x00007fd65ae8e85e in virThreadPoolWorker (opaque=opaque@entry=0x7fd65c9ff3b0) at util/virthreadpool.c:144
#13 0x00007fd65ae8e1fe in virThreadHelper (data=<optimized out>) at util/virthread.c:197
#14 0x00007fd65971252a in start_thread (arg=0x7fd64ee6c700) at pthread_create.c:310
#15 0x00007fd65944e22d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 9 (Thread 0x7fd64e66b700 (LWP 19317)):
#0  0x00007fd6593828d7 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007fd65938453a in __GI_abort () at abort.c:89
#2  0x00007fd6593c5da3 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7fd6594d52f0 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175
#3  0x00007fd6593d19f5 in malloc_printerr (ptr=<optimized out>, str=0x7fd6594d2f15 "free(): invalid pointer", action=<optimized out>) at malloc.c:4974
#4  _int_free (have_lock=0, p=<optimized out>, av=<optimized out>) at malloc.c:3841
#5  __GI___libc_free (mem=<optimized out>) at malloc.c:2951
#6  0x00007fd65ae34c2b in virFree (ptrptr=ptrptr@entry=0x7fd62400f7b8) at util/viralloc.c:582
#7  0x00007fd65ae87e3a in virStorageSourceClear (def=0x7fd62400f7b0) at util/virstoragefile.c:2024
#8  0x00007fd65ae87313 in virStorageSourceFree (def=0x7fd62400f7b0) at util/virstoragefile.c:2048
#9  0x00007fd648885021 in qemuDomainBlockCommit (dom=<optimized out>, path=<optimized out>, base=<optimized out>, top=<optimized out>, bandwidth=<optimized out>, 
    flags=<optimized out>) at qemu/qemu_driver.c:16601
#10 0x00007fd65af2e503 in virDomainBlockCommit (dom=dom@entry=0x7fd624000c80, disk=0x7fd62400d230 "vda", base=0x0, top=0x0, bandwidth=0, flags=4) at libvirt-domain.c:10067
#11 0x00007fd65b94c0ce in remoteDispatchDomainBlockCommit (server=<optimized out>, msg=<optimized out>, args=0x7fd624000a70, rerr=0x7fd64e66acb0, client=<optimized out>)
    at remote_dispatch.h:2594
#12 remoteDispatchDomainBlockCommitHelper (server=<optimized out>, client=<optimized out>, msg=<optimized out>, rerr=0x7fd64e66acb0, args=0x7fd624000a70, ret=<optimized out>)
    at remote_dispatch.h:2564
#13 0x00007fd65af82489 in virNetServerProgramDispatchCall (msg=0x7fd65c9f5510, client=0x7fd65c9fd660, server=0x7fd65c9f3820, prog=0x7fd65c9fbaf0)
    at rpc/virnetserverprogram.c:437
#14 virNetServerProgramDispatch (prog=0x7fd65c9fbaf0, server=server@entry=0x7fd65c9f3820, client=0x7fd65c9fd660, msg=0x7fd65c9f5510) at rpc/virnetserverprogram.c:307
#15 0x00007fd65b971288 in virNetServerProcessMsg (msg=<optimized out>, prog=<optimized out>, client=<optimized out>, srv=0x7fd65c9f3820) at rpc/virnetserver.c:172
#16 virNetServerHandleJob (jobOpaque=<optimized out>, opaque=0x7fd65c9f3820) at rpc/virnetserver.c:193
#17 0x00007fd65ae8e85e in virThreadPoolWorker (opaque=opaque@entry=0x7fd65c9ff230) at util/virthreadpool.c:144
#18 0x00007fd65ae8e1fe in virThreadHelper (data=<optimized out>) at util/virthread.c:197
#19 0x00007fd65971252a in start_thread (arg=0x7fd64e66b700) at pthread_create.c:310
#20 0x00007fd65944e22d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

This backtrace is from Fedora 21 with the latest upstream libvirt.

Comment 2 Peter Krempa 2015-03-16 11:03:01 UTC
Fixed upstream:

commit 51f9f03a4ca50b070c0fbfb29748d49f583e15e1
Author: Peter Krempa <pkrempa>
Date:   Fri Mar 13 17:22:04 2015 +0100

    qemu: Disallow concurrent block jobs on a single disk
    
    While qemu may be prepared to do this libvirt is not. Forbid the block
    ops until we fix our code.

commit 1a92c719101e5bfa6fe2b78006ad04c7f075ea28
Author: Peter Krempa <pkrempa>
Date:   Fri Mar 13 17:00:03 2015 +0100

    qemu: event: Don't fiddle with disk backing trees without a job
    
    Surprisingly we did not grab a VM job when a block job finished and we'd
    happily rewrite the backing chain data. This made it possible to crash
    libvirt when queueing two backing chains tightly and other badness.
    
    To fix it, add yet another handler to the helper thread that handles
    monitor events that require a job.
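
The idea behind the second commit can be shown with a small self-contained sketch. This is not libvirt's internal API; every name below (demo_*) is a hypothetical stand-in. The point is simply that the helper-thread handler for a block-job event must first acquire the per-domain job, the same mutual exclusion that API calls such as the blockcommit in the backtrace already hold, before it touches the disk's backing chain.

/* Conceptual sketch only -- hypothetical names, not libvirt code. */
#include <pthread.h>
#include <stdbool.h>

struct demo_domain {
    pthread_mutex_t lock;
    pthread_cond_t job_cond;
    bool job_active;                /* one job at a time per domain */
    /* ... backing chain data would live here ... */
};

/* Acquire the single per-domain job, waiting if an API call holds it. */
static void demo_begin_job(struct demo_domain *dom)
{
    pthread_mutex_lock(&dom->lock);
    while (dom->job_active)
        pthread_cond_wait(&dom->job_cond, &dom->lock);
    dom->job_active = true;
    pthread_mutex_unlock(&dom->lock);
}

static void demo_end_job(struct demo_domain *dom)
{
    pthread_mutex_lock(&dom->lock);
    dom->job_active = false;
    pthread_cond_signal(&dom->job_cond);
    pthread_mutex_unlock(&dom->lock);
}

/* Helper-thread handler for a "block job completed" monitor event:
 * only while holding the job is it safe to rewrite the backing chain. */
static void demo_handle_blockjob_completed(struct demo_domain *dom)
{
    demo_begin_job(dom);
    /* ... update / free backing chain entries here ... */
    demo_end_job(dom);
}

int main(void)
{
    struct demo_domain dom = {
        .lock = PTHREAD_MUTEX_INITIALIZER,
        .job_cond = PTHREAD_COND_INITIALIZER,
        .job_active = false,
    };
    demo_handle_blockjob_completed(&dom);
    return 0;
}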

Comment 3 Shanzhi Yu 2015-03-16 14:38:17 UTC
Peter,

I'm not sure the new error is caused by this patch series, but I got the error below when testing with the steps in
https://bugzilla.redhat.com/show_bug.cgi?id=1199182#c14

2015-03-16 14:30:45.282+0000: 25137: error : qemuDomainDiskBlockJobIsActive:2774 : Operation not supported: disk 'vda' already in active block job

This error comes from your new patch.

# git describe 
v1.2.13-181-g4bca619


# virsh blockjob testvm3 vda 
No current block job for vda

# virsh blockcommit --active --verbose testvm3 vda --shallow --pivot
error: Operation not supported: disk 'vda' already in active block job


# virsh dumpxml testvm3 |grep disk -A 15
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/tmp/images/c/../b/b'/>
      <backingStore type='file' index='1'>
        <format type='qcow2'/>
        <source file='/tmp/images/c/../b/../a/a'/>
        <backingStore type='network' index='2'>
          <format type='qcow2'/>
          <source protocol='gluster' name='gluster-vol1/r7-qcow2.img'>
            <host name='10.66.5.38'/>
          </source>
          <backingStore/>
        </backingStore>
      </backingStore>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </disk>

Comment 4 Peter Krempa 2015-03-16 14:55:02 UTC
(In reply to Shanzhi Yu from comment #3)
> Peter,
> 
> I'm not sure the new error is caused by this patches serial, but I got below
> error when do test just as steps in 
> https://bugzilla.redhat.com/show_bug.cgi?id=1199182#c14
> 
> 2015-03-16 14:30:45.282+0000: 25137: error :
> qemuDomainDiskBlockJobIsActive:2774 : Operation not supported: disk 'vda'
> already in active block job
> 
> This error come from your new patch.
> 
> # git describe 
> v1.2.13-181-g4bca619
> 
> 
> # virsh blockjob testvm3 vda 
> No current block job for vda
> 
> # virsh blockcommit --active --verbose testvm3 vda --shallow --pivot
> error: Operation not supported: disk 'vda' already in active block job
> 

This error is caused by patch "qemu: Disallow concurrent block jobs on a single disk". What steps did you do before the error happened? One of the previous block jobs probably didn't reset the interlocking flag correctly.

Comment 5 Shanzhi Yu 2015-03-16 15:16:50 UTC
Just run the attached script, which was used in bug 1199182.

# sh testvm3-start.sh 
Domain testvm3 destroyed

Formatting 'images/a/a', fmt=qcow2 size=10485760000 backing_file='gluster://10.66.5.38/gluster-vol1/r7-qcow2.img' backing_fmt='qcow2' encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
Formatting 'b', fmt=qcow2 size=10485760000 backing_file='../a/a' backing_fmt='qcow2' encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
Formatting 'c', fmt=qcow2 size=10485760000 backing_file='../b/b' backing_fmt='qcow2' encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
Domain testvm3 created from /dev/stdin

Block Commit: [100 %]
Successfully pivoted
Block Commit: [100 %]
Successfully pivoted
Block Commit: [100 %]
Successfully pivoted
Domain snapshot 1426518648 created
Domain snapshot 1426518651 created
Domain snapshot 1426518655 created
Block Commit: [100 %]
Successfully pivoted
error: internal error: unable to execute QEMU command 'block-commit': Top image file /tmp/images/c/../b/b not found

error: Operation not supported: disk 'vda' already in active block job


The first error is caused by bug 1199182; the second error is caused by the new patch.

After running the script, I did some manual checks:

# virsh blockjob testvm3 vda 
No current block job for vda

# virsh blockcommit testvm3 vda --pivot 
error: Operation not supported: disk 'vda' already in active block job
 
# virsh dumpxml testvm3|grep disk -A 14
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/tmp/images/c/../b/b'/>
      <backingStore type='file' index='1'>
        <format type='qcow2'/>
        <source file='/tmp/images/c/../b/../a/a'/>
        <backingStore type='network' index='2'>
          <format type='qcow2'/>
          <source protocol='gluster' name='gluster-vol1/r7-qcow2.img'>
            <host name='10.66.5.38'/>
          </source>
          <backingStore/>
        </backingStore>
      </backingStore>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </disk>

Comment 6 Shanzhi Yu 2015-03-16 15:17:25 UTC
Created attachment 1002350 [details]
script reproduce the new bug

Comment 7 Peter Krempa 2015-03-16 16:30:04 UTC
(In reply to Shanzhi Yu from comment #5)
> Just run the attached script, which used in bug 1199182. 
> 
> # sh testvm3-start.sh 
> Domain testvm3 destroyed
> 
> Formatting 'images/a/a', fmt=qcow2 size=10485760000
> backing_file='gluster://10.66.5.38/gluster-vol1/r7-qcow2.img'
> backing_fmt='qcow2' encryption=off cluster_size=65536 lazy_refcounts=off
> refcount_bits=16
> Formatting 'b', fmt=qcow2 size=10485760000 backing_file='../a/a'
> backing_fmt='qcow2' encryption=off cluster_size=65536 lazy_refcounts=off
> refcount_bits=16
> Formatting 'c', fmt=qcow2 size=10485760000 backing_file='../b/b'
> backing_fmt='qcow2' encryption=off cluster_size=65536 lazy_refcounts=off
> refcount_bits=16
> Domain testvm3 created from /dev/stdin
> 
> Block Commit: [100 %]
> Successfully pivoted
> Block Commit: [100 %]
> Successfully pivoted
> Block Commit: [100 %]
> Successfully pivoted
> Domain snapshot 1426518648 created
> Domain snapshot 1426518651 created
> Domain snapshot 1426518655 created
> Block Commit: [100 %]
> Successfully pivoted
> error: internal error: unable to execute QEMU command 'block-commit': Top
> image file /tmp/images/c/../b/b not found

... this was the problem that caused ...


> 
> error: Operation not supported: disk 'vda' already in active block job
> 
> 
> The first error is caused by bug 1199182, the second error is caused by the
> new patch.
> 
> After run the script, make some check by manual
> 
> # virsh blockjob testvm3 vda 
> No current block job for vda
> 
> # virsh blockcommit testvm3 vda --pivot 
> error: Operation not supported: disk 'vda' already in active block job

... this problem later, as the disk was falsely marked as having an active block job.


Fixed upstream:

commit ee744b5b387b5123ee40683c52ab40783ffc3020
Author: Peter Krempa <pkrempa>
Date:   Mon Mar 16 16:52:44 2015 +0100

    qemu: block-commit: Mark disk in block jobs only on successful command
    
    Patch 51f9f03a4ca50b070c0fbfb29748d49f583e15e1 introduces a regression
    where if a blockCommit operation fails the disk is still marked as being
    part of a block job but can't be unmarked later.
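
Taken together with comment 4, the pattern of the two patches can be illustrated by a short stand-alone sketch (hypothetical demo_* names, not the actual upstream code): a per-disk flag blocks a second job while one is active, and, after this fix, the flag is only set once the monitor command has actually succeeded, so a failed command can no longer leave the disk stuck as "already in active block job".

/* Conceptual sketch only -- hypothetical names, not libvirt code. */
#include <stdbool.h>
#include <stdio.h>

struct demo_disk {
    const char *dst;        /* e.g. "vda" */
    bool blockjob;          /* per-disk "active block job" interlock */
};

/* Pretend monitor calls: one simulates the qemu command failing. */
static int demo_monitor_block_commit_fail(struct demo_disk *disk)
{
    (void)disk;
    return -1;
}

static int demo_monitor_block_commit_ok(struct demo_disk *disk)
{
    (void)disk;
    return 0;
}

/* Start a block commit: refuse if a job is active, and mark the disk
 * only after the monitor command succeeded (the ee744b5b fix). */
static int demo_block_commit(struct demo_disk *disk,
                             int (*monitor_cmd)(struct demo_disk *))
{
    if (disk->blockjob) {
        fprintf(stderr, "disk '%s' already in active block job\n", disk->dst);
        return -1;
    }
    if (monitor_cmd(disk) < 0)
        return -1;          /* failed: interlock is NOT set */
    disk->blockjob = true;  /* success: disk now owns the block job slot */
    return 0;
}

int main(void)
{
    struct demo_disk vda = { .dst = "vda", .blockjob = false };

    demo_block_commit(&vda, demo_monitor_block_commit_fail); /* fails, no mark */
    demo_block_commit(&vda, demo_monitor_block_commit_ok);   /* succeeds, marks */
    demo_block_commit(&vda, demo_monitor_block_commit_ok);   /* rejected: busy */
    return 0;
}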

Comment 9 Eric Blake 2015-03-16 21:47:11 UTC
Potential 7.1.z patches: http://post-office.corp.redhat.com/archives/rhvirt-patches/2015-March/msg00447.html

Comment 10 Shanzhi Yu 2015-03-17 04:26:01 UTC
(In reply to Peter Krempa from comment #7)
> (In reply to Shanzhi Yu from comment #5)
> > The first error is caused by bug 1199182, the second error is caused by the
> > new patch.
> > 
> > After run the script, make some check by manual
> > 
> > # virsh blockjob testvm3 vda 
> > No current block job for vda
> > 
> > # virsh blockcommit testvm3 vda --pivot 
> > error: Operation not supported: disk 'vda' already in active block job
> 
> ... this problem later as the disk was falsely marked as having a active
> block job.
> 
> 
> Fixed upstream:
> 
> commit ee744b5b387b5123ee40683c52ab40783ffc3020
> Author: Peter Krempa <pkrempa>
> Date:   Mon Mar 16 16:52:44 2015 +0100
> 
>     qemu: block-commit: Mark disk in block jobs only on successful command
>     
>     Patch 51f9f03a4ca50b070c0fbfb29748d49f583e15e1 introduces a regression
>     where if a blockCommit operation fails the disk is still marked as being
>     part of a block job but can't be unmarked later.

OK, this patch fixes the above problem. But there is still a case where, if aborting the job with --pivot fails, libvirt keeps a "Block Commit" job lingering in the background. Until that job is aborted, other block operations are blocked.

See the steps below:

# sh testvm3-start.sh
+ cd /tmp
+ virsh -k0 destroy testvm3
Domain testvm3 destroyed

+ rm -rf images
+ mkdir -p images images/a images/b images/c
+ qemu-img create -f qcow2 images/a/a -b gluster://10.66.5.38/gluster-vol1/r7-qcow2.img -o backing_fmt=qcow2
Formatting 'images/a/a', fmt=qcow2 size=10485760000 backing_file='gluster://10.66.5.38/gluster-vol1/r7-qcow2.img' backing_fmt='qcow2' encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
+ cd images/b
+ qemu-img create -f qcow2 -o backing_file=../a/a,backing_fmt=qcow2 b
Formatting 'b', fmt=qcow2 size=10485760000 backing_file='../a/a' backing_fmt='qcow2' encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
+ cd images/c
+ qemu-img create -f qcow2 -o backing_file=../b/b,backing_fmt=qcow2 c
Formatting 'c', fmt=qcow2 size=10485760000 backing_file='../b/b' backing_fmt='qcow2' encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
+ setenforce 0
+ virsh -k0 create /dev/stdin
Domain testvm3 created from /dev/stdin

+ virsh -k0 blockcommit --active --verbose testvm3 vda --shallow --pivot
Block Commit: [100 %]
Successfully pivoted
+ virsh -k0 blockcommit --active --verbose testvm3 vda --shallow --pivot
Block Commit: [100 %]
Successfully pivoted
+ virsh blockjob testvm3 vda
No current block job for vda
+ virsh -k0 blockcommit --active --verbose testvm3 vda --shallow --pivot
Block Commit: [100 %]error: failed to pivot job for disk vda
error: internal error: unable to execute QEMU command 'block-job-complete': The active block job for device 'drive-virtio-disk0' cannot be completed

+ exit 0

# virsh blockjob testvm3 vda 


# virsh blockjob testvm3 vda 


# virsh blockjob testvm3 vda 
Block Commit: [100 %]

# virsh dumpxml testvm3 | grep mirror


# virsh blockjob testvm3 vda --pivot 
error: Requested operation is not valid: pivot of disk 'vda' requires an active copy job

# virsh snapshot-create-as testvm3 --no-metadata  --disk-only 
error: internal error: unable to execute QEMU command 'transaction': Device 'drive-virtio-disk0' is busy: block device is in use by block job: commit

# virsh blockjob testvm3 vda --abort 

# virsh snapshot-create-as testvm3 --no-metadata  --disk-only 
Domain snapshot 1426566021 created


Based on the latest libvirt.git:
# git describe 
v1.2.13-196-ga7d6b94

qemu-kvm-2.2.0-5.fc21.x86_64

Comment 11 Peter Krempa 2015-03-17 08:49:16 UTC
(In reply to Shanzhi Yu from comment #10)
> (In reply to Peter Krempa from comment #7)
> > (In reply to Shanzhi Yu from comment #5)

> 
> Ok, this patch fix above problem. But there still exist a case, when fail to
> abort job with pivot, libvirt will keep a "Block Commit" job underground.
> Before abort that job, other ops will be block
> 
> See below steps:

... 

> + virsh blockjob testvm3 vda
> No current block job for vda
> + virsh -k0 blockcommit --active --verbose testvm3 vda --shallow --pivot
> Block Commit: [100 %]error: failed to pivot job for disk vda
> error: internal error: unable to execute QEMU command 'block-job-complete':
> The active block job for device 'drive-virtio-disk0' cannot be completed

After this command fails the following piece of code is executed:
    if (ret < 0) {
        /* On failure, qemu abandons the mirror, and reverts back to
         * the source disk (RHEL 6.3 has a bug where the revert could
         * cause catastrophic failure in qemu, but we don't need to
         * worry about it here as it is not an upstream qemu problem.  */
        /* XXX should we be parsing the exact qemu error, or calling
         * 'query-block', to see what state we really got left in
         * before killing the mirroring job?

The comment above hints that a not-so-ancient qemu may actually not abandon the mirror but keep it active, so clearing the mirror job state here is not the right thing to do.

         * XXX We want to revoke security labels and disk lease, as
         * well as audit that revocation, before dropping the original
         * source.  But it gets tricky if both source and mirror share
         * common backing files (we want to only revoke the non-shared
         * portion of the chain); so for now, we leak the access to
         * the original.  */
        virStorageSourceFree(disk->mirror);
        disk->mirror = NULL;
        disk->mirrorState = VIR_DOMAIN_DISK_MIRROR_STATE_NONE;
        disk->mirrorJob = VIR_DOMAIN_BLOCK_JOB_TYPE_UNKNOWN;
        disk->blockjob = false;
    }

> 
> # virsh blockjob testvm3 vda --pivot 
> error: Requested operation is not valid: pivot of disk 'vda' requires an
> active copy job

At that point the above operation fails a check in libvirt, because the mirror job info was already cleared by the code above, but ...

> 
> # virsh snapshot-create-as testvm3 --no-metadata  --disk-only 
> error: internal error: unable to execute QEMU command 'transaction': Device
> 'drive-virtio-disk0' is busy: block device is in use by block job: commit

... this fails a check in qemu, because the mirror is actually still in place.

> 
> # virsh blockjob testvm3 vda --abort 

This command does not have such a check in libvirt, so it completes successfully and kills the mirror.

> 
> # virsh snapshot-create-as testvm3 --no-metadata  --disk-only 
> Domain snapshot 1426566021 created

And now you are able to execute block operations again.

> 
> 
> Base on latest libvirt.git
> # git describe 
> v1.2.13-196-ga7d6b94
> 
> qemu-kvm-2.2.0-5.fc21.x86

I think this needs a new bugzilla.

Comment 14 Shanzhi Yu 2015-03-19 08:12:28 UTC
I filed a bug for the issue in comment 11; see bug 1202704.

Comment 16 Yang Yang 2015-06-03 07:27:16 UTC
Verified on libvirt-1.2.15-2.el7.x86_64.

Steps (a client-side probe sketch follows step 9):

1. Do blockcommit while there is a blockpull job running already
# virsh blockpull simple vda --wait --verbose
Block Pull: [ 39 %]

# virsh blockcommit simple vda --wait --verbose --active
error: Operation not supported: disk 'vda' already in active block job

2. do blockpull while there is a blockpull job running already
# virsh blockpull simple vda --wait --verbose 
Block Pull: [ 36 %]

# virsh blockpull simple vda --wait --verbose 
error: Operation not supported: disk 'vda' already in active block job

3. do blockcopy while there is a blockpull job running already
# virsh blockpull simple vda --wait --verbose 
Block Pull: [ 20 %]

# virsh blockcopy simple vda --shallow /tmp/copy
error: Operation not supported: disk 'vda' already in active block job

4. Do blockpull while there is a blockcommit job running already
# virsh blockcommit simple vda --wait --verbose --active --bandwidth 1
Block Commit: [  2 %]

# virsh blockpull simple vda --wait --verbose
error: block copy still active: disk 'vda' already in active block job

5. do blockcommit while there is a blockcommit job running already
# virsh blockcommit simple vda --wait --verbose --active --bandwidth 1
Block Commit: [  1 %]

# virsh blockcommit simple vda --wait --verbose --active
error: block copy still active: disk 'vda' already in active block job

6. do blockcopy while there is a blockcommit job running already
# virsh blockcommit simple vda --wait --verbose --active --bandwidth 1
Block Commit: [  1 %]

# virsh blockcopy simple vda --shallow /tmp/copy
error: block copy still active: disk 'vda' already in active block job


7. do blockpull while there is a blockcopy job running already
# virsh blockcopy simple vda /tmp/copy
Block Copy started
# virsh blockjob simple vda
Block Copy: [ 11 %]

# virsh blockpull simple vda --wait --verbose 
error: block copy still active: disk 'vda' already in active block job

8. do blockcommit while there is a blockcopy job running already
# virsh blockcopy simple vda /tmp/copy
Block Copy started
# virsh blockjob simple vda
Block Copy: [ 11 %]

# virsh blockcommit simple vda --wait --verbose --active
error: block copy still active: disk 'vda' already in active block job

9. do blockcopy while there is a blockcopy job running already
# virsh blockcopy simple vda --shallow /tmp/copy
Block Copy started

# virsh blockcopy simple vda /tmp/copy1
error: block copy still active: disk 'vda' already in active block job
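
All nine combinations above now fail with a clean error instead of crashing the daemon. From the client side, an application can also probe for an existing job with virDomainGetBlockJobInfo before requesting a new one; the sketch below is illustrative only (the domain name "simple" and disk "vda" are taken from the steps above) and is not part of the verification.

/* Hypothetical client-side probe -- illustrative only. */
#include <stdio.h>
#include <libvirt/libvirt.h>
#include <libvirt/virterror.h>

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (!conn)
        return 1;

    virDomainPtr dom = virDomainLookupByName(conn, "simple");
    if (!dom) {
        virConnectClose(conn);
        return 1;
    }

    virDomainBlockJobInfo info;
    int rc = virDomainGetBlockJobInfo(dom, "vda", &info, 0);

    if (rc < 0) {
        fprintf(stderr, "query failed: %s\n", virGetLastErrorMessage());
    } else if (rc == 1) {
        /* A job is already running; a new pull/commit/copy would be
         * rejected with "disk 'vda' already in active block job". */
        printf("block job active: type=%d cur=%llu end=%llu\n",
               info.type, info.cur, info.end);
    } else {
        /* rc == 0: no job on vda, safe to start one, e.g. an active commit. */
        if (virDomainBlockCommit(dom, "vda", NULL, NULL, 0,
                                 VIR_DOMAIN_BLOCK_COMMIT_ACTIVE) < 0)
            fprintf(stderr, "blockcommit failed: %s\n",
                    virGetLastErrorMessage());
    }

    virDomainFree(dom);
    virConnectClose(conn);
    return 0;
}

The probe is only advisory: another client could start a job between the query and the commit, so the server-side check added by these patches remains the authoritative guard.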

Comment 18 errata-xmlrpc 2015-11-19 06:18:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2202.html