Bug 1707195 - VM stuck in a shutdown because of a pending fuse request
Summary: VM stuck in a shutdown because of a pending fuse request
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: GlusterFS
Classification: Community
Component: write-behind
Version: 6
Hardware: Unspecified
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On: 1702686 1705865 1707198
Blocks: 1707200
 
Reported: 2019-05-07 02:46 UTC by Raghavendra G
Modified: 2020-03-12 15:00 UTC
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1705865
Clones: 1707200
Environment:
Last Closed: 2020-03-12 15:00:57 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Links
Gluster.org Gerrit 22668 (status: Abandoned): performance/write-behind: remove request from wip list in wb_writev_cbk (last updated 2019-10-03 19:32:03 UTC)

Comment 1 Raghavendra G 2019-05-07 02:48:00 UTC
The VM fails to shut down and remains stuck in the 'Powering down' status because its 'qemu-kvm' process has ended up in a zombie/defunct state:

more ps-Ll.txt
F S   UID   PID  PPID   LWP  C PRI  NI ADDR SZ WCHAN  TTY        TIME CMD
6 Z   107 20631     1 20631  0  80   0 -     0 do_exi ?          8:45 [qemu-kvm] <defunct>
3 D   107 20631     1 20635  0  80   0 - 2386845 fuse_r ?        1:12 [qemu-kvm] <defunct>

The customer has collected a crash dump of the affected VM and statedumps from all the glusterfs processes running on this machine while the problem is present.

Thread ID 20635 is the one of interest:
   
crash> bt 20635
PID: 20635  TASK: ffff9ed3926eb0c0  CPU: 7   COMMAND: "IO iothread1"
 #0 [ffff9ec8e351fa28] __schedule at ffffffff91967747
 #1 [ffff9ec8e351fab0] schedule at ffffffff91967c49
 #2 [ffff9ec8e351fac0] __fuse_request_send at ffffffffc09d24e5 [fuse]
 #3 [ffff9ec8e351fb30] fuse_request_send at ffffffffc09d26e2 [fuse]
 #4 [ffff9ec8e351fb40] fuse_send_write at ffffffffc09dbc76 [fuse]
 #5 [ffff9ec8e351fb70] fuse_direct_io at ffffffffc09dc0d6 [fuse]
 #6 [ffff9ec8e351fc58] __fuse_direct_write at ffffffffc09dc562 [fuse]
 #7 [ffff9ec8e351fca8] fuse_direct_IO at ffffffffc09dd3ca [fuse]
 #8 [ffff9ec8e351fd70] generic_file_direct_write at ffffffff913b8663
 #9 [ffff9ec8e351fdc8] fuse_file_aio_write at ffffffffc09ddbd5 [fuse]
#10 [ffff9ec8e351fe60] do_io_submit at ffffffff91497a73
#11 [ffff9ec8e351ff40] sys_io_submit at ffffffff91497f40
#12 [ffff9ec8e351ff50] tracesys at ffffffff9197505b (via system_call)
    RIP: 00007f9ff0758697  RSP: 00007f9db86814b8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 0000000000000001  RCX: ffffffffffffffff
    RDX: 00007f9db86814d0  RSI: 0000000000000001  RDI: 00007f9ff268e000
    RBP: 0000000000000080   R8: 0000000000000080   R9: 000000000000006a
    R10: 0000000000000078  R11: 0000000000000246  R12: 00007f9db86814c0
    R13: 0000560264b9b518  R14: 0000560264b9b4f0  R15: 00007f9db8681bb0
    ORIG_RAX: 00000000000000d1  CS: 0033  SS: 002b

From the core, this is the file the above process is writing to:

crash> files -d 0xffff9ec8e8f9f740
     DENTRY           INODE           SUPERBLK     TYPE PATH
ffff9ec8e8f9f740 ffff9ed39e705700 ffff9ee009adc000 REG  /rhev/data-center/mnt/glusterSD/172.16.20.21:_vmstore2/e5dd645f-88bb-491c-9145-38fa229cbc4d/images/8e84c1ed-48ba-4b82-9882-c96e6f260bab/29bba0a1-6c7b-4358-9ef2-f8080405778d

So in this case we're accessing the vmstore2 volume.

This is the glusterfs process:

root      4863  0.0  0.0 1909580 49316 ?       S<sl Apr04  25:51 /usr/sbin/glusterfs --volfile-server=172.16.20.21 --volfile-server=172.16.20.22 --volfile-server=172.16.20.23 --volfile-id=/vmstore2 /rhev/data-center/mnt/glusterSD/172.16.20.21:_vmstore2

Backtrace of this process in the crash dump:

crash> bt 4863
PID: 4863   TASK: ffff9edfa9ff9040  CPU: 11  COMMAND: "glusterfs"
 #0 [ffff9ed3a332fc28] __schedule at ffffffff91967747
 #1 [ffff9ed3a332fcb0] schedule at ffffffff91967c49
 #2 [ffff9ed3a332fcc0] futex_wait_queue_me at ffffffff9130cf76
 #3 [ffff9ed3a332fd00] futex_wait at ffffffff9130dc5b
 #4 [ffff9ed3a332fe48] do_futex at ffffffff9130f9a6
 #5 [ffff9ed3a332fed8] sys_futex at ffffffff9130fec0
 #6 [ffff9ed3a332ff50] system_call_fastpath at ffffffff91974ddb
    RIP: 00007f6e5eeccf47  RSP: 00007ffdd311c7d0  RFLAGS: 00000246
    RAX: 00000000000000ca  RBX: 00007f6e59496700  RCX: ffffffffffffffff
    RDX: 0000000000001308  RSI: 0000000000000000  RDI: 00007f6e594969d0
    RBP: 00007f6e60552780   R8: 0000000000000000   R9: 00007f6e5e6e314d
    R10: 0000000000000000  R11: 0000000000000246  R12: 00007f6e59496d28
    R13: 0000000000000000  R14: 0000000000000006  R15: 00007ffdd311c920
    ORIG_RAX: 00000000000000ca  CS: 0033  SS: 002b

We have a few pending frames in this process. Reviewing the corresponding statedump:

grep complete=0 glusterdump.4863.dump.1556091368 -c
7

Looking for these pending frames in the statedump:

~~~

[global.callpool.stack.1]
stack=0x7f6e4007c828
uid=107
gid=107
pid=20635
unique=5518502
lk-owner=bd2351a6cc7fcb8b
op=WRITE
type=1
cnt=6

[global.callpool.stack.1.frame.1]
frame=0x7f6dec04de38
ref_count=0
translator=vmstore2-write-behind
complete=0
parent=vmstore2-open-behind
wind_from=default_writev_resume
wind_to=(this->children->xlator)->fops->writev
unwind_to=default_writev_cbk

[global.callpool.stack.1.frame.2]
frame=0x7f6dec0326f8
ref_count=1
translator=vmstore2-open-behind
complete=0
parent=vmstore2-md-cache
wind_from=mdc_writev
wind_to=(this->children->xlator)->fops->writev
unwind_to=mdc_writev_cbk

[global.callpool.stack.1.frame.3]
frame=0x7f6dec005bf8
ref_count=1
translator=vmstore2-md-cache
complete=0
parent=vmstore2-io-threads
wind_from=default_writev_resume
wind_to=(this->children->xlator)->fops->writev
unwind_to=default_writev_cbk

[global.callpool.stack.1.frame.4]
frame=0x7f6e400ab0f8
ref_count=1
translator=vmstore2-io-threads
complete=0
parent=vmstore2
wind_from=io_stats_writev
wind_to=(this->children->xlator)->fops->writev
unwind_to=io_stats_writev_cbk

[global.callpool.stack.1.frame.5]
frame=0x7f6e4007c6c8
ref_count=1
translator=vmstore2
complete=0
parent=fuse
wind_from=fuse_write_resume
wind_to=FIRST_CHILD(this)->fops->writev
unwind_to=fuse_writev_cbk

[global.callpool.stack.1.frame.6]
frame=0x7f6e4002cb98
ref_count=1
translator=fuse
complete=0

~~~

So I believe the request is stuck in the 'write-behind' translator. I'd appreciate some help figuring out the cause of the hang.

Thank you,

Natalia

Comment 2 Raghavendra G 2019-05-07 02:48:45 UTC
I do see a write request hung in write-behind. Details of the write request from the statedump:

[xlator.performance.write-behind.wb_inode]
path=/e5dd645f-88bb-491c-9145-38fa229cbc4d/images/8e84c1ed-48ba-4b82-9882-c96e6f260bab/29bba0a1-6c7b-4358-9ef2-f8080405778d
inode=0x7f6e40060888
gfid=6348d15d-7b17-4993-9da9-3f588c2ad5a8
window_conf=1048576
window_current=0
transit-size=0
dontsync=0

[.WRITE]
unique=5518502
refcount=1
wound=no
generation-number=0
req->op_ret=131072
req->op_errno=0
sync-attempts=0
sync-in-progress=no
size=131072
offset=4184756224
lied=0
append=0
fulfilled=0
go=0

I'll go through this and try to come up with an RCA.

--- Additional comment from Raghavendra G on 2019-04-29 07:21:50 UTC ---

There is a race in the way O_DIRECT writes are handled. Assume two overlapping write requests w1 and w2.

* w1 is issued and is in the wb_inode->wip queue, as the response from the bricks is still pending. Also, the wb_request_unref in wb_do_winds has not been invoked yet:

        list_for_each_entry_safe (req, tmp, tasks, winds) {
                list_del_init (&req->winds);

                if (req->op_ret == -1) {
                        call_unwind_error_keep_stub (req->stub, req->op_ret,
                                                     req->op_errno);
                } else {
                        call_resume_keep_stub (req->stub);
                }

                wb_request_unref (req);
        }

* w2 is issued and wb_process_queue is invoked. w2 is not picked up for winding, as w1 is still in wb_inode->wip. w2 is added to the todo list and wb_writev for w2 returns.

* The response to w1 is received and invokes wb_request_unref. Assume the wb_request_unref in wb_do_winds (see the first point) has not been invoked yet. Since one more refcount is still held, the wb_request_unref in wb_writev_cbk of w1 does not remove w1 from wip.
* wb_process_queue is invoked as part of wb_writev_cbk of w1, but it fails to wind w2 because w1 is still in wip.
* wb_request_unref is then invoked on w1 as part of wb_do_winds, and w1 is removed from all queues, including wip.
* After this point wb_process_queue is not invoked again unless the application issues a new request, so w2 stays hung until the next request arrives (see the sketch below).
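
For what it's worth, here is a minimal standalone C model of that sequence. It is not the glusterfs code; every name in it (struct request, in_wip, process_queue, ...) is invented for illustration. It only shows why clearing the wip membership inside the completion callback, which is what the patch referenced below aims at ("remove request from wip list in wb_writev_cbk"), lets the parked w2 be wound even when the deferred unref from wb_do_winds runs late:

~~~
/* Standalone model of the race described above -- NOT the glusterfs code.
 * All names are invented for illustration. Build: cc -o wbrace wbrace.c */
#include <stdio.h>
#include <stdbool.h>

struct request {
        const char *name;
        int         refcount;   /* references held on the request */
        bool        in_wip;     /* request sits on the in-flight (wip) list */
};

static struct request w1 = { "w1", 0, false };
static bool w2_pending = false;   /* w2 parked on the todo list */
static bool w2_wound   = false;

/* Models wb_process_queue: wind w2 only if no overlapping request is in flight. */
static void process_queue (void)
{
        if (w2_pending && !w1.in_wip) {
                w2_pending = false;
                w2_wound = true;
                printf ("w2 wound to the brick\n");
        }
}

/* Models wb_request_unref: the request leaves all lists only when refcount hits 0. */
static void request_unref (struct request *req)
{
        if (--req->refcount == 0) {
                req->in_wip = false;
                printf ("%s dropped from all lists (refcount reached 0)\n", req->name);
        }
}

/* Models wb_writev_cbk for w1. */
static void writev_cbk (struct request *req, bool fixed)
{
        if (fixed)
                req->in_wip = false;  /* the fix: leave wip before re-processing the queue */
        request_unref (req);
        process_queue ();
}

int main (void)
{
        bool fixed = true;        /* flip to false to reproduce the hang */

        w1.refcount = 2;          /* one ref from the wind path, one for the in-flight write */
        w1.in_wip = true;

        w2_pending = true;        /* w2 arrives, overlaps w1, parked on todo */

        writev_cbk (&w1, fixed);  /* response to w1 arrives first */
        request_unref (&w1);      /* deferred unref from the wind path (wb_do_winds) */

        /* Without the fix nothing calls process_queue() again, so w2 never gets wound. */
        printf ("w2 %s\n", w2_wound ? "completed" : "is stuck until the next request");
        return 0;
}
~~~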

This bug is similar to bz 1626780 and bz 1379655. Though the issue is similar, the fixes for those two bzs won't fix the current bug, so this is not a duplicate. It will require a new fix, and I'll post a patch to Gerrit shortly.

Comment 3 Worker Ant 2019-05-07 02:50:50 UTC
REVIEW: https://review.gluster.org/22668 (performance/write-behind: remove request from wip list in wb_writev_cbk) posted (#1) for review on release-6 by Raghavendra G

Comment 4 Worker Ant 2020-03-12 15:00:57 UTC
This bug has been moved to https://github.com/gluster/glusterfs/issues/1080 and will be tracked there from now on. Visit the GitHub issue URL for further details.

