Description of problem:
Incorrectly tearing down a tap:aio:... device causes the backend to hang and prevents further management operations on the host (including listing and creating domains).

Version-Release number of selected component (if applicable):
Present since forever.

How reproducible:
100%

Actual results:
The backend of the domain's xvdb device will never be torn down, and the domain will remain in "xm list" as a zombie until xend is restarted. Even then, most "xm" commands will no longer work on the host. The following messages will appear in /var/log/messages:

kernel: INFO: task xenwatch:21 blocked for more than 120 seconds.
kernel: xenwatch D ffff8801de591100 0 21 19 22 (L-TLB)
kernel: ffff8801de5abdc0 0000000000000246 0000000000000009 ffff8801de591100
kernel: 0000000000000009 ffff8801de591100 ffff8800a8476040 0000000000000924
kernel: ffff8801de5912e8 ffffffffffffffff
kernel: Call Trace:
kernel: [<ffffffff802893ce>] enqueue_task+0x41/0x56
kernel: [<ffffffff8029d110>] keventd_create_kthread+0x0/0xc4
kernel: [<ffffffff88755901>] :blktap:tap_blkif_free+0x72/0x97
kernel: [<ffffffff8029d328>] autoremove_wake_function+0x0/0x2e
kernel: [<ffffffff887555e2>] :blktap:tap_frontend_changed+0x1d5/0x231
kernel: [<ffffffff803ba494>] xenbus_read_driver_state+0x26/0x3b

Expected results:
The backend can tolerate guests with erroneous backend behavior.

Additional info:
The root cause of the failure is that in step 4a we skip the "Closing" phase of the xenbus protocol, where the kernel thread is released:

    case XenbusStateClosing:
        if (be->blkif->xenblkd) {
            kthread_stop(be->blkif->xenblkd);
            be->blkif->xenblkd = NULL;
        }
        tap_blkif_free(be->blkif);

This code is never executed for the first thread. Then, at step 4d, another thread is started, and be->blkif->xenblkd now points to the new thread only. At step 5, the frontend goes to the Closing state, and the code above _is_ executed. The second xenblkd thread _is_ stopped when the Closing state is reached, but the leaked one keeps a reference to be->blkif and thus tap_blkif_free hangs. The xenwatch thread is then blocked in tap_blkif_free and cannot process any further xenbus events.
Upstream commit: http://xenbits.xen.org/linux-2.6.18-xen.hg?rev/59f097ef181b
This issue has been addressed in the following products:

  Red Hat Enterprise Linux 5

Via RHSA-2011:0004 https://rhn.redhat.com/errata/RHSA-2011-0004.html