VM fails to shut down, getting stuck in 'Powering down' status. This happens because its 'qemu-kvm' process is in a zombie/defunct state:

~~~
# more ps-Ll.txt
F S   UID    PID   PPID    LWP  C PRI  NI ADDR      SZ WCHAN  TTY   TIME CMD
6 Z   107  20631      1  20631  0  80   0 -          0 do_exi ?     8:45 [qemu-kvm] <defunct>
3 D   107  20631      1  20635  0  80   0 -    2386845 fuse_r ?     1:12 [qemu-kvm] <defunct>
~~~

The customer has collected a crash dump of the affected VM host, and also statedumps from all the glusterfs processes running on this machine while the problem is present.

Thread ID 20635 is the one of interest:

~~~
crash> bt 20635
PID: 20635  TASK: ffff9ed3926eb0c0  CPU: 7  COMMAND: "IO iothread1"
 #0 [ffff9ec8e351fa28] __schedule at ffffffff91967747
 #1 [ffff9ec8e351fab0] schedule at ffffffff91967c49
 #2 [ffff9ec8e351fac0] __fuse_request_send at ffffffffc09d24e5 [fuse]
 #3 [ffff9ec8e351fb30] fuse_request_send at ffffffffc09d26e2 [fuse]
 #4 [ffff9ec8e351fb40] fuse_send_write at ffffffffc09dbc76 [fuse]
 #5 [ffff9ec8e351fb70] fuse_direct_io at ffffffffc09dc0d6 [fuse]
 #6 [ffff9ec8e351fc58] __fuse_direct_write at ffffffffc09dc562 [fuse]
 #7 [ffff9ec8e351fca8] fuse_direct_IO at ffffffffc09dd3ca [fuse]
 #8 [ffff9ec8e351fd70] generic_file_direct_write at ffffffff913b8663
 #9 [ffff9ec8e351fdc8] fuse_file_aio_write at ffffffffc09ddbd5 [fuse]
#10 [ffff9ec8e351fe60] do_io_submit at ffffffff91497a73
#11 [ffff9ec8e351ff40] sys_io_submit at ffffffff91497f40
#12 [ffff9ec8e351ff50] tracesys at ffffffff9197505b (via system_call)
    RIP: 00007f9ff0758697  RSP: 00007f9db86814b8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 0000000000000001  RCX: ffffffffffffffff
    RDX: 00007f9db86814d0  RSI: 0000000000000001  RDI: 00007f9ff268e000
    RBP: 0000000000000080   R8: 0000000000000080   R9: 000000000000006a
    R10: 0000000000000078  R11: 0000000000000246  R12: 00007f9db86814c0
    R13: 0000560264b9b518  R14: 0000560264b9b4f0  R15: 00007f9db8681bb0
    ORIG_RAX: 00000000000000d1  CS: 0033  SS: 002b
~~~

From the core, this is the file the above thread is writing to:

~~~
crash> files -d 0xffff9ec8e8f9f740
DENTRY           INODE            SUPERBLK         TYPE PATH
ffff9ec8e8f9f740 ffff9ed39e705700 ffff9ee009adc000 REG  /rhev/data-center/mnt/glusterSD/172.16.20.21:_vmstore2/e5dd645f-88bb-491c-9145-38fa229cbc4d/images/8e84c1ed-48ba-4b82-9882-c96e6f260bab/29bba0a1-6c7b-4358-9ef2-f8080405778d
~~~

So in this case we're accessing the vmstore2 volume. This is the glusterfs client process:

~~~
root  4863  0.0  0.0 1909580 49316 ?  S<sl Apr04  25:51 /usr/sbin/glusterfs --volfile-server=172.16.20.21 --volfile-server=172.16.20.22 --volfile-server=172.16.20.23 --volfile-id=/vmstore2 /rhev/data-center/mnt/glusterSD/172.16.20.21:_vmstore2
~~~

Backtrace of this process in the crash dump:

~~~
crash> bt 4863
PID: 4863  TASK: ffff9edfa9ff9040  CPU: 11  COMMAND: "glusterfs"
 #0 [ffff9ed3a332fc28] __schedule at ffffffff91967747
 #1 [ffff9ed3a332fcb0] schedule at ffffffff91967c49
 #2 [ffff9ed3a332fcc0] futex_wait_queue_me at ffffffff9130cf76
 #3 [ffff9ed3a332fd00] futex_wait at ffffffff9130dc5b
 #4 [ffff9ed3a332fe48] do_futex at ffffffff9130f9a6
 #5 [ffff9ed3a332fed8] sys_futex at ffffffff9130fec0
 #6 [ffff9ed3a332ff50] system_call_fastpath at ffffffff91974ddb
    RIP: 00007f6e5eeccf47  RSP: 00007ffdd311c7d0  RFLAGS: 00000246
    RAX: 00000000000000ca  RBX: 00007f6e59496700  RCX: ffffffffffffffff
    RDX: 0000000000001308  RSI: 0000000000000000  RDI: 00007f6e594969d0
    RBP: 00007f6e60552780   R8: 0000000000000000   R9: 00007f6e5e6e314d
    R10: 0000000000000000  R11: 0000000000000246  R12: 00007f6e59496d28
    R13: 0000000000000000  R14: 0000000000000006  R15: 00007ffdd311c920
    ORIG_RAX: 00000000000000ca  CS: 0033  SS: 002b
~~~

We have a few pending frames in this process.
Reviewing the corresponding statedump:

~~~
# grep complete=0 glusterdump.4863.dump.1556091368 -c
7
~~~

Looking for these pending frames in the statedump:

~~~
[global.callpool.stack.1]
stack=0x7f6e4007c828
uid=107
gid=107
pid=20635
unique=5518502
lk-owner=bd2351a6cc7fcb8b
op=WRITE
type=1
cnt=6

[global.callpool.stack.1.frame.1]
frame=0x7f6dec04de38
ref_count=0
translator=vmstore2-write-behind
complete=0
parent=vmstore2-open-behind
wind_from=default_writev_resume
wind_to=(this->children->xlator)->fops->writev
unwind_to=default_writev_cbk

[global.callpool.stack.1.frame.2]
frame=0x7f6dec0326f8
ref_count=1
translator=vmstore2-open-behind
complete=0
parent=vmstore2-md-cache
wind_from=mdc_writev
wind_to=(this->children->xlator)->fops->writev
unwind_to=mdc_writev_cbk

[global.callpool.stack.1.frame.3]
frame=0x7f6dec005bf8
ref_count=1
translator=vmstore2-md-cache
complete=0
parent=vmstore2-io-threads
wind_from=default_writev_resume
wind_to=(this->children->xlator)->fops->writev
unwind_to=default_writev_cbk

[global.callpool.stack.1.frame.4]
frame=0x7f6e400ab0f8
ref_count=1
translator=vmstore2-io-threads
complete=0
parent=vmstore2
wind_from=io_stats_writev
wind_to=(this->children->xlator)->fops->writev
unwind_to=io_stats_writev_cbk

[global.callpool.stack.1.frame.5]
frame=0x7f6e4007c6c8
ref_count=1
translator=vmstore2
complete=0
parent=fuse
wind_from=fuse_write_resume
wind_to=FIRST_CHILD(this)->fops->writev
unwind_to=fuse_writev_cbk

[global.callpool.stack.1.frame.6]
frame=0x7f6e4002cb98
ref_count=1
translator=fuse
complete=0
~~~

So I believe we're pending in the 'write-behind' translator. I'd appreciate some help figuring out the cause of the hang.

Thank you,
Natalia
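For bulk triage, the pending frames can be pulled out programmatically instead of by eye. Below is a minimal sketch (my own helper, not a gluster tool) that parses the key=value sections of a statedump and lists the frames marked complete=0; the `sample` string is an excerpt of the dump above:

```python
import re

def pending_frames(dump_text):
    """Return (translator, wind_from) pairs for statedump frames with complete=0."""
    frames = []
    # statedump sections are blocks of key=value lines under a [section] header
    for section in re.split(r'\n(?=\[)', dump_text):
        kv = dict(re.findall(r'^(\S+?)=(\S+)$', section, re.M))
        if kv.get('complete') == '0':
            frames.append((kv.get('translator'), kv.get('wind_from')))
    return frames

# Excerpt of glusterdump.4863.dump.1556091368 quoted in this report
sample = """\
[global.callpool.stack.1.frame.1]
frame=0x7f6dec04de38
ref_count=0
translator=vmstore2-write-behind
complete=0
parent=vmstore2-open-behind
wind_from=default_writev_resume
"""

print(pending_frames(sample))
# [('vmstore2-write-behind', 'default_writev_resume')]
```

Running it over the full dump should report one tuple per `complete=0` frame, matching the grep count of 7.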
I do see a write request hung in write-behind. Details of the write request from the statedump:

~~~
[xlator.performance.write-behind.wb_inode]
path=/e5dd645f-88bb-491c-9145-38fa229cbc4d/images/8e84c1ed-48ba-4b82-9882-c96e6f260bab/29bba0a1-6c7b-4358-9ef2-f8080405778d
inode=0x7f6e40060888
gfid=6348d15d-7b17-4993-9da9-3f588c2ad5a8
window_conf=1048576
window_current=0
transit-size=0
dontsync=0

[.WRITE]
unique=5518502
refcount=1
wound=no
generation-number=0
req->op_ret=131072
req->op_errno=0
sync-attempts=0
sync-in-progress=no
size=131072
offset=4184756224
lied=0
append=0
fulfilled=0
go=0
~~~

I'll go through this and try to come up with an RCA.

--- Additional comment from Raghavendra G on 2019-04-29 07:21:50 UTC ---

There is a race in the way O_DIRECT writes are handled. Assume two overlapping write requests w1 and w2.

* w1 is issued and is in the wb_inode->wip queue, as the response is still pending from the bricks. Also, the wb_request_unref in wb_do_winds is not yet invoked:

~~~
list_for_each_entry_safe (req, tmp, tasks, winds) {
        list_del_init (&req->winds);

        if (req->op_ret == -1) {
                call_unwind_error_keep_stub (req->stub, req->op_ret,
                                             req->op_errno);
        } else {
                call_resume_keep_stub (req->stub);
        }

        wb_request_unref (req);
}
~~~

* w2 is issued and wb_process_queue is invoked. w2 is not picked up for winding as w1 is still in wb_inode->wip. w2 is added to the todo list and wb_writev for w2 returns.
* The response to w1 is received and invokes wb_request_unref. Assume the wb_request_unref in wb_do_winds (see the first point) has not yet been invoked. Since there is one more refcount, the wb_request_unref in wb_writev_cbk of w1 doesn't remove w1 from wip.
* wb_process_queue is invoked as part of wb_writev_cbk of w1. But it fails to wind w2, as w1 is still in wip.
* wb_request_unref is invoked on w1 as part of wb_do_winds. w1 is removed from all queues, including wip.
* After this point there is no invocation of wb_process_queue unless a new request is issued from the application, leaving w2 hung until the next request.
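To convince myself of the interleaving above, here is a minimal Python model of the queues and refcounts (a toy simulation under assumed semantics, not gluster code; the names merely mirror the write-behind functions):

```python
class Req:
    def __init__(self, name):
        self.name = name
        self.refcount = 1          # base ref held while the request lives
        self.in_wip = False
        self.wound = False

wip, todo = [], []

def wb_process_queue():
    # wind any todo request that does not overlap a request in flight;
    # here we assume all requests overlap, as in the reported scenario
    for req in list(todo):
        if any(r is not req for r in wip):
            continue               # skip: an overlapping request is in wip
        todo.remove(req)
        wip.append(req)
        req.in_wip = True
        req.refcount += 1          # extra ref taken for the wind
        req.wound = True

def wb_request_unref(req):
    req.refcount -= 1
    if req.refcount == 0 and req.in_wip:
        wip.remove(req)            # only the LAST unref leaves wip
        req.in_wip = False

def wb_writev_cbk(req):
    wb_request_unref(req)          # does NOT drop to 0: wb_do_winds'
                                   # unref has not run yet (the race)
    wb_process_queue()             # w2 skipped: w1 still in wip

# -- replay the reported interleaving --
w1, w2 = Req("w1"), Req("w2")
todo.append(w1); wb_process_queue()   # w1 wound, now in wip
todo.append(w2); wb_process_queue()   # w2 blocked behind w1
wb_writev_cbk(w1)                     # reply for w1 arrives early
wb_request_unref(w1)                  # late unref from wb_do_winds
print(w2.wound, w2 in todo)           # False True -> w2 is hung
```

With this ordering nothing ever reprocesses the queue after w1 finally leaves wip, so w2 stays in todo forever, matching the hung WRITE (wound=no) seen in the statedump.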
This bug is similar to bz 1626780 and bz 1379655. Though the issue is similar, the fixes for those two bzs won't fix the current bug, so this bug is not a duplicate. It will require a new fix, and I'll post a patch to Gerrit shortly.
REVIEW: https://review.gluster.org/22668 (performance/write-behind: remove request from wip list in wb_writev_cbk) posted (#1) for review on release-6 by Raghavendra G
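For illustration only, the effect of the change named in the patch title (removing the request from the wip list in wb_writev_cbk, before reprocessing the queue) can be shown on a toy model of the queues; this is a sketch with assumed semantics, not the actual patch:

```python
class Req:
    def __init__(self, name):
        self.name, self.refcount = name, 1
        self.in_wip = self.wound = False

wip, todo = [], []

def wb_process_queue():
    for req in list(todo):
        if any(r is not req for r in wip):
            continue                   # overlapping request still in flight
        todo.remove(req); wip.append(req)
        req.in_wip = req.wound = True
        req.refcount += 1              # extra ref taken for the wind

def wb_request_unref(req):
    req.refcount -= 1

def wb_writev_cbk(req):
    if req.in_wip:                     # the fix: leave wip as soon as the
        wip.remove(req)                # reply arrives, regardless of the
        req.in_wip = False             # outstanding refcount
    wb_request_unref(req)
    wb_process_queue()                 # w2 no longer conflicts and winds

w1, w2 = Req("w1"), Req("w2")
todo.append(w1); wb_process_queue()    # w1 wound
todo.append(w2); wb_process_queue()    # w2 waits behind w1
wb_writev_cbk(w1)                      # reply for w1 arrives early
wb_request_unref(w1)                   # late unref from wb_do_winds
print(w2.wound)                        # True -> no hang
```

Because membership in wip now tracks "reply outstanding" rather than "refcount nonzero", the late unref from wb_do_winds can no longer leave a window where the queue is never reprocessed.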
This bug has been moved to https://github.com/gluster/glusterfs/issues/1080 and will be tracked there from now on. Visit the GitHub issue for further details.