Bug 1471737

Summary: file being created by dd doesn't automatically release control when the volume is down
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Nag Pavan Chilakam <nchilaka>
Component: write-behind
Assignee: Csaba Henk <csaba>
Status: CLOSED NOTABUG
QA Contact: Rahul Hinduja <rhinduja>
Severity: medium
Priority: medium
Version: rhgs-3.3
CC: csaba, jahernan, rhs-bugs, storage-qa-internal
Target Milestone: ---
Keywords: Triaged, ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-01-20 12:20:47 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Nag Pavan Chilakam 2017-07-17 11:21:47 UTC
Description of problem:
======================
I have seen that when I am writing to a file using dd and I stop the volume, dd never returns control of the CLI to the user until the user intervenes.
However, when there is a transport endpoint error, why are we still holding the lock?
It gets released only when the user terminates the command (Ctrl+C).

Any file created thereafter, using dd, touch or any other means, fails with "Transport endpoint is not connected", which is correct.


Version-Release number of selected component (if applicable):
=======
3.8.4-33

How reproducible:
========
always

Steps to Reproduce:
1. Create a 1x3 volume and FUSE-mount it.
2. Create files using dd in a loop, e.g.:
for i in {1..10000};do dd if=/dev/urandom of=file.$i bs=1024 count=100000;done

3. Stop the volume using the 'gluster volume stop' command.
4. Notice that the CLI prompt is never returned until the user intervenes.
I waited for about 5 minutes.


The open file handle shows a transport endpoint error:

[root@dhcp35-103 glusterfs]#  lsof|grep file.5
lsof: WARNING: can't stat() fuse.glusterfs file system /mnt/vol_3
      Output information may be incomplete.
dd        12428          root    1w  unknown                                         /mnt/vol_3/dd2/file.5 (stat: Transport endpoint is not connected)




Actual results:
===========
dd keeps holding the file handle and the shell prompt is not returned until the user terminates the command.

Expected results:
======
Shell control must be returned to the user once there is a transport endpoint error.

Additional info:
==============
This can be a problem especially if I/O is being done through another application, where the user may not be able to identify the failure.




The FUSE mount logs also point to the right errors:

[2017-07-17 11:09:19.826321] W [socket.c:595:__socket_rwv] 0-vol_3-client-0: readv on 10.70.35.45:49154 failed (Connection reset by peer)
[2017-07-17 11:09:19.826399] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-vol_3-client-0: disconnected from vol_3-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2017-07-17 11:09:19.827417] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f0ef8dae1e2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f0ef8b738ae] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f0ef8b739be] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x90)[0x7f0ef8b75130] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7f0ef8b75be0] ))))) 0-vol_3-client-0: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2017-07-17 11:09:19.824810 (xid=0x1b0f9)
[2017-07-17 11:09:19.827457] W [MSGID: 114031] [client-rpc-fops.c:855:client3_3_writev_cbk] 0-vol_3-client-0: remote operation failed [Transport endpoint is not connected]
[2017-07-17 11:09:19.827506] W [MSGID: 114061] [client-common.c:460:client_pre_fsync] 0-vol_3-client-0:  (5bd904f5-384b-4cbf-b0ae-1849f7e0ad36) remote_fd is -1. EBADFD [File descriptor in bad state]
[2017-07-17 11:09:19.827561] W [MSGID: 108035] [afr-transaction.c:2231:afr_changelog_fsync_cbk] 0-vol_3-replicate-0: fsync(5bd904f5-384b-4cbf-b0ae-1849f7e0ad36) failed on subvolume vol_3-client-0. Transaction was WRITE [File descriptor in bad state]
[2017-07-17 11:09:20.010047] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 0-vol_3-client-0: remote operation failed [Transport endpoint is not connected]
[2017-07-17 11:09:21.846196] W [socket.c:595:__socket_rwv] 0-vol_3-client-2: readv on 10.70.35.122:49154 failed (Connection reset by peer)
[2017-07-17 11:09:21.846287] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-vol_3-client-2: disconnected from vol_3-client-2. Client process will keep trying to connect to glusterd until brick's port is available
[2017-07-17 11:09:21.846292] W [socket.c:595:__socket_rwv] 0-vol_3-client-1: readv on 10.70.35.130:49154 failed (Connection reset by peer)
[2017-07-17 11:09:21.846328] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-vol_3-client-1: disconnected from vol_3-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2017-07-17 11:09:21.846417] W [MSGID: 108001] [afr-common.c:4820:afr_notify] 0-vol_3-replicate-0: Client-quorum is not met
[2017-07-17 11:09:21.846560] E [MSGID: 108006] [afr-common.c:4731:afr_notify] 0-vol_3-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2017-07-17 11:09:21.847039] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f0ef8dae1e2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f0ef8b738ae] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f0ef8b739be] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x90)[0x7f0ef8b75130] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7f0ef8b75be0] ))))) 0-vol_3-client-2: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2017-07-17 11:09:21.844728 (xid=0x1c6c1)
[2017-07-17 11:09:21.847047] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f0ef8dae1e2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f0ef8b738ae] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f0ef8b739be] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x90)[0x7f0ef8b75130] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7f0ef8b75be0] ))))) 0-vol_3-client-1: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2017-07-17 11:09:21.844687 (xid=0x1c6c3)
[2017-07-17 11:09:21.847064] W [MSGID: 114031] [client-rpc-fops.c:855:client3_3_writev_cbk] 0-vol_3-client-2: remote operation failed [Transport endpoint is not connected]
[2017-07-17 11:09:21.847072] W [MSGID: 114031] [client-rpc-fops.c:855:client3_3_writev_cbk] 0-vol_3-client-1: remote operation failed [Transport endpoint is not connected]
[2017-07-17 11:09:21.847099] W [MSGID: 114061] [client-common.c:460:client_pre_fsync] 0-vol_3-client-1:  (de308b78-ace3-40b5-bbb1-d658cdf1fb9c) remote_fd is -1. EBADFD [File descriptor in bad state]
[2017-07-17 11:09:21.847120] W [MSGID: 108035] [afr-transaction.c:2231:afr_changelog_fsync_cbk] 0-vol_3-replicate-0: fsync(de308b78-ace3-40b5-bbb1-d658cdf1fb9c) failed on subvolume vol_3-client-1. Transaction was WRITE [File descriptor in bad state]
[2017-07-17 11:09:21.847136] W [MSGID: 114061] [client-common.c:460:client_pre_fsync] 0-vol_3-client-2:  (de308b78-ace3-40b5-bbb1-d658cdf1fb9c) remote_fd is -1. EBADFD [File descriptor in bad state]
[2017-07-17 11:09:21.847151] W [MSGID: 108035] [afr-transaction.c:2231:afr_changelog_fsync_cbk] 0-vol_3-replicate-0: fsync(de308b78-ace3-40b5-bbb1-d658cdf1fb9c) failed on subvolume vol_3-client-2. Transaction was WRITE [File descriptor in bad state]
[2017-07-17 11:09:21.847166] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 0-vol_3-client-1: remote operation failed [Transport endpoint is not connected]
[2017-07-17 11:09:21.847185] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 0-vol_3-client-2: remote operation failed [Transport endpoint is not connected]
[2017-07-17 11:09:21.847246] I [MSGID: 108006] [afr-common.c:4874:afr_local_init] 0-vol_3-replicate-0: no subvolumes up
The message "I [MSGID: 108006] [afr-common.c:4874:afr_local_init] 0-vol_3-replicate-0: no subvolumes up" repeated 38282 times between [2017-07-17 11:09:21.847246] and [2017-07-17 11:09:30.576305]
[2017-07-17 11:09:30.576570] E [MSGID: 114058] [client-handshake.c:1537:client_query_portmap_cbk] 0-vol_3-client-0: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2017-07-17 11:09:30.576701] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-vol_3-client-0: disconnected from vol_3-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2017-07-17 11:09:30.577094] I [MSGID: 108006] [afr-common.c:4874:afr_local_init] 0-vol_3-replicate-0: no subvolumes up
The message "I [MSGID: 108006] [afr-common.c:4874:afr_local_init] 0-vol_3-replicate-0: no subvolumes up" repeated 3457 times between [2017-07-17 11:09:30.577094] and [2017-07-17 11:09:32.579105]
[2017-07-17 11:09:32.579465] E [MSGID: 114058] [client-handshake.c:1537:client_query_portmap_cbk] 0-vol_3-client-2: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2017-07-17 11:09:32.579573] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-vol_3-client-2: disconnected from vol_3-client-2. Client process will keep trying to connect to glusterd until brick's port is available
[2017-07-17 11:09:32.580279] I [MSGID: 108006] [afr-common.c:4874:afr_local_init] 0-vol_3-replicate-0: no subvolumes up
[2017-07-17 11:09:32.581463] I [MSGID: 108006] [afr-common.c:4874:afr_local_init] 0-vol_3-replicate-0: no subvolumes up
[2017-07-17 11:09:32.581700] E [MSGID: 114058] [client-handshake.c:1537:client_query_portmap_cbk] 0-vol_3-client-1: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2017-07-17 11:09:32.581782] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-vol_3-client-1: disconnected from vol_3-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2017-07-17 11:09:32.582687] I [MSGID: 108006] [afr-common.c:4874:afr_local_init] 0-vol_3-replicate-0: no subvolumes up
The message "I [MSGID: 108006] [afr-common.c:4874:afr_local_init] 0-vol_3-replicate-0: no subvolumes up" repeated 61047 times between [2017-07-17 11:09:32.582687] and [2017-07-17 11:10:30.728966]
[2017-07-17 11:10:30.729836] I [MSGID: 108006] [afr-common.c:4874:afr_local_init] 0-vol_3-replicate-0: no subvolumes up



version:
3.8.4-33

Comment 4 Ravishankar N 2018-11-14 11:54:03 UTC
This behaviour seems to be due to write-behind returning success for the writes. I was able to reproduce similar behaviour using a C program that does write(2) in a while loop. Even after killing all 3 bricks of a 1x3 volume, further write syscalls on the same fd succeed:
---------------------------------------------------------------------------------
(gdb) bt
#0  fuse_writev_cbk (frame=0x7fd9d8004780, cookie=0x7fd9d80042a0, this=0x12a2100, op_ret=17, op_errno=0, stbuf=0x7fd9e0a812a0, postbuf=0x7fd9e0a812a0, xdata=0x0) at fuse-bridge.c:2398
#1  0x00007fd9e1fa8cda in io_stats_writev_cbk (frame=0x7fd9d80042a0, cookie=0x7fd9d8004490, this=0x7fd9dc01aac0, op_ret=17, op_errno=0, prebuf=0x7fd9e0a812a0, postbuf=0x7fd9e0a812a0, 
    xdata=0x0) at io-stats.c:2107
#2  0x00007fd9e21d9192 in mdc_writev_cbk (frame=0x7fd9d8004490, cookie=0x7fd9d8003ad0, this=0x7fd9dc019480, op_ret=17, op_errno=0, prebuf=0x7fd9e0a812a0, postbuf=0x7fd9e0a812a0, xdata=0x0)
    at md-cache.c:2116
#3  0x00007fd9f0fdc121 in default_writev_cbk (frame=0x7fd9d8003ad0, cookie=0x7fd9d8003cc0, this=0x7fd9dc017e40, op_ret=17, op_errno=0, prebuf=0x7fd9e0a812a0, postbuf=0x7fd9e0a812a0, 
    xdata=0x0) at defaults.c:1242
#4  0x00007fd9f0fdc121 in default_writev_cbk (frame=0x7fd9d8003cc0, cookie=0x7fd9d8003dd0, this=0x7fd9dc016790, op_ret=17, op_errno=0, prebuf=0x7fd9e0a812a0, postbuf=0x7fd9e0a812a0, 
    xdata=0x0) at defaults.c:1242
#5  0x00007fd9e280a0bc in ioc_writev_cbk (frame=0x7fd9d8003dd0, cookie=0x7fd9d80020e0, this=0x7fd9dc0151c0, op_ret=17, op_errno=0, prebuf=0x7fd9e0a812a0, postbuf=0x7fd9e0a812a0, xdata=0x0)
    at io-cache.c:1234
#6  0x00007fd9e2a1f697 in ra_writev_cbk (frame=0x7fd9d80020e0, cookie=0x7fd9d8002300, this=0x7fd9dc013bf0, op_ret=17, op_errno=0, prebuf=0x7fd9e0a812a0, postbuf=0x7fd9e0a812a0, xdata=0x0)
    at read-ahead.c:656
#7  0x00007fd9e2c30029 in wb_do_unwinds (wb_inode=0x7fd9dc06a900, lies=0x7fd9e0a81380) at write-behind.c:1204
#8  0x00007fd9e2c315b4 in wb_process_queue (wb_inode=0x7fd9dc06a900) at write-behind.c:1707
#9  0x00007fd9e2c31f1c in wb_writev (frame=0x7fd9d8002300, this=0x7fd9dc012480, fd=0x7fd9d800ec50, vector=0x7fd9d80045c0, count=1, offset=211, flags=32769, iobref=0x7fd9d8003a00, 
    xdata=0x0) at write-behind.c:1822
#10 0x00007fd9e2a1fb4d in ra_writev (frame=0x7fd9d80020e0, this=0x7fd9dc013bf0, fd=0x7fd9d800ec50, vector=0x7fd9d80045c0, count=1, offset=211, flags=32769, iobref=0x7fd9d8003a00, 
    xdata=0x0) at read-ahead.c:684
#11 0x00007fd9e280a70e in ioc_writev (frame=0x7fd9d8003dd0, this=0x7fd9dc0151c0, fd=0x7fd9d800ec50, vector=0x7fd9d80045c0, count=1, offset=211, flags=32769, iobref=0x7fd9d8003a00, 
    xdata=0x0) at io-cache.c:1275
#12 0x00007fd9e25fac2a in qr_writev (frame=0x7fd9d8003cc0, this=0x7fd9dc016790, fd=0x7fd9d800ec50, iov=0x7fd9d80045c0, count=1, offset=211, flags=32769, iobref=0x7fd9d8003a00, xdata=0x0)
    at quick-read.c:639
#13 0x00007fd9f0fe5bd9 in default_writev_resume (frame=0x7fd9d8003ad0, this=0x7fd9dc017e40, fd=0x7fd9d800ec50, vector=0x7fd9d80045c0, count=1, off=211, flags=32769, iobref=0x7fd9d8003a00, 
    xdata=0x0) at defaults.c:1849
#14 0x00007fd9f0f4f601 in call_resume_wind (stub=0x7fd9d80193a0) at call-stub.c:2045
#15 0x00007fd9f0f5f3ec in call_resume (stub=0x7fd9d80193a0) at call-stub.c:2512
#16 0x00007fd9e23ed91b in open_and_resume (this=0x7fd9dc017e40, fd=0x7fd9d800ec50, stub=0x7fd9d80193a0) at open-behind.c:246
#17 0x00007fd9e23ee87c in ob_writev (frame=0x7fd9d8003ad0, this=0x7fd9dc017e40, fd=0x7fd9d800ec50, iov=0x7fd9d8009c90, count=1, offset=211, flags=32769, iobref=0x7fd9d8003a00, xdata=0x0)
    at open-behind.c:424
#18 0x00007fd9e21d94e7 in mdc_writev (frame=0x7fd9d8004490, this=0x7fd9dc019480, fd=0x7fd9d800ec50, vector=0x7fd9d8009c90, count=1, offset=211, flags=32769, iobref=0x7fd9d8003a00, 
    xdata=0x0) at md-cache.c:2134
#19 0x00007fd9e1fb88c0 in io_stats_writev (frame=0x7fd9d80042a0, this=0x7fd9dc01aac0, fd=0x7fd9d800ec50, vector=0x7fd9d8009c90, count=1, offset=211, flags=32769, iobref=0x7fd9d8003a00, 
    xdata=0x0) at io-stats.c:2955
#20 0x00007fd9f0fec869 in default_writev (frame=0x7fd9d80042a0, this=0x7fd9dc01c5b0, fd=0x7fd9d800ec50, vector=0x7fd9d8009c90, count=1, off=211, flags=32769, iobref=0x7fd9d8003a00, 
    xdata=0x0) at defaults.c:2543
#21 0x00007fd9e1d89957 in meta_writev (frame=0x7fd9d80042a0, this=0x7fd9dc01c5b0, fd=0x7fd9d800ec50, iov=0x7fd9d8009c90, count=1, offset=211, flags=32769, iobref=0x7fd9d8003a00, xdata=0x0)
    at meta.c:142
#22 0x00007fd9e8303687 in fuse_write_resume (state=0x7fd9d8009580) at fuse-bridge.c:2457
#23 0x00007fd9e82f72c7 in fuse_fop_resume (state=0x7fd9d8009580) at fuse-bridge.c:625
#24 0x00007fd9e82f4a2b in fuse_resolve_done (state=0x7fd9d8009580) at fuse-resolve.c:663
#25 0x00007fd9e82f4b01 in fuse_resolve_all (state=0x7fd9d8009580) at fuse-resolve.c:690
#26 0x00007fd9e82f4a0c in fuse_resolve (state=0x7fd9d8009580) at fuse-resolve.c:654
#27 0x00007fd9e82f4ad8 in fuse_resolve_all (state=0x7fd9d8009580) at fuse-resolve.c:686
#28 0x00007fd9e82f4b5f in fuse_resolve_continue (state=0x7fd9d8009580) at fuse-resolve.c:706
#29 0x00007fd9e82f47c5 in fuse_resolve_fd (state=0x7fd9d8009580) at fuse-resolve.c:566
#30 0x00007fd9e82f49ba in fuse_resolve (state=0x7fd9d8009580) at fuse-resolve.c:643
#31 0x00007fd9e82f4a83 in fuse_resolve_all (state=0x7fd9d8009580) at fuse-resolve.c:679
#32 0x00007fd9e82f4b9d in fuse_resolve_and_resume (state=0x7fd9d8009580, fn=0x7fd9e83030d5 <fuse_write_resume>) at fuse-resolve.c:718
---Type <return> to continue, or q <return> to quit---
#33 0x00007fd9e830387a in fuse_write (this=0x12a2100, finh=0x7fd9d80046a0, msg=0x7fd9ef22a000) at fuse-bridge.c:2509
#34 0x00007fd9e83105a2 in fuse_thread_proc (data=0x12a2100) at fuse-bridge.c:5083
#35 0x00007fd9efd7fe25 in start_thread () from /lib64/libpthread.so.0
#36 0x00007fd9ef648bad in clone () from /lib64/libc.so.6
--------------------------------------------------------------------------------
I think the guarantee write-behind gives is that it will fail the next fsync call, but I'm moving this to the write-behind component to confirm whether this is a bug that needs to be fixed.
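
For reference, a minimal reproducer along the lines described above could look like this (a sketch only; the mount path, file name and buffer size are illustrative, not the exact program used):
---------------------------------------------------------------------------------
/* Writes to a file on the FUSE mount in a loop. With write-behind enabled,
 * write(2) keeps returning success (17 bytes here) even after all bricks of
 * the volume have been killed. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const char *path = "/mnt/vol_3/reproducer.txt"; /* hypothetical path */
        char buf[17];
        int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        memset(buf, 'a', sizeof(buf));
        for (;;) {
                /* Kill the bricks while this runs: write() still reports 17. */
                if (write(fd, buf, sizeof(buf)) < 0) {
                        perror("write");
                        break;
                }
                sleep(1);
        }
        close(fd);
        return 0;
}
---------------------------------------------------------------------------------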

Comment 5 Yaniv Kaul 2019-06-13 08:41:13 UTC
Latest status?

Comment 6 Csaba Henk 2019-07-12 16:41:59 UTC
The proposal is to close this as NOTABUG, as it's the user application that does not fully comply with POSIX semantics, not glusterfs. However, there are some minor questions to clarify before doing this.

Comment 7 Xavi Hernandez 2019-12-09 16:01:33 UTC
(In reply to Csaba Henk from comment #6)
> The proposal is to close this as NOTABUG, as it's the user application that
> does not fully comply with POSIX semantics, not glusterfs. However, there
> are some minor questions to clarify before doing this.

I don't understand. Shouldn't gluster return ENOTCONN as soon as there are no backends to use? What's wrong with 'dd'?

I think write-behind should detect that bricks are offline (probably by handling event CHILD_DOWN), flush all pending data, mark any open fd as invalid and fail any future request with ENOTCONN.
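
A rough illustration of what such handling could look like in a translator's notify hook (a sketch only, assuming the glusterfs xlator headers; wb_fail_queued_writes and wb_mark_fds_bad are hypothetical helpers, not existing write-behind internals):
---------------------------------------------------------------------------------
int32_t
notify (xlator_t *this, int32_t event, void *data, ...)
{
        switch (event) {
        case GF_EVENT_CHILD_DOWN:
                /* Bricks are gone: stop acknowledging cached writes. */
                wb_fail_queued_writes (this, ENOTCONN); /* hypothetical helper */
                wb_mark_fds_bad (this);                 /* hypothetical helper */
                break;
        default:
                break;
        }
        return default_notify (this, event, data);
}
---------------------------------------------------------------------------------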

Comment 8 Csaba Henk 2019-12-30 13:15:40 UTC
POSIX VFS semantics give you a chance to make sure you get notified about the success of your writes via fsync, which dd does not do. Write-behind cannot reply with an error if it is not being asked.
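
The pattern Csaba refers to, expressed in C: a cached write() may report success, but fsync() forces the data out and is where ENOTCONN/EIO would surface when the bricks are down (a sketch, not taken from any particular application):
---------------------------------------------------------------------------------
#include <stdio.h>
#include <unistd.h>

/* Write (a single attempt) followed by fsync; any error from either call is
 * reported to the caller. */
int write_durably(int fd, const void *buf, size_t len)
{
        if (write(fd, buf, len) < 0) {
                perror("write");
                return -1;
        }
        if (fsync(fd) < 0) {
                perror("fsync");   /* the transport error shows up here */
                return -1;
        }
        return 0;
}
---------------------------------------------------------------------------------
Note that dd itself can ask for this behaviour with conv=fsync (one fsync at the end) or oflag=dsync/oflag=sync (sync on every output block), which would make the failure visible to the user.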

Comment 9 Xavi Hernandez 2020-01-09 16:37:59 UTC
(In reply to Csaba Henk from comment #8)
> POSIX VFS semantics give you a chance to make sure you get notified about
> the success of your writes via fsync, which dd does not do. Write-behind
> cannot reply with an error if it is not being asked.

I think that if we can detect an error we should report it as soon as possible, but even if we ignore it, how is it possible that 'dd' keeps running forever? Sooner or later the write-behind cache should get full and some write requests should be sent to the bricks, which will fail. At that point any future request should fail (either with ENOTCONN or EIO). I don't see any benefit in hiding this problem from the user.

Comment 10 Xavi Hernandez 2020-01-20 12:20:47 UTC
Based on Csaba's comments #6 and #8, I'm closing this bug as NOTABUG. We'll need to decide what guarantees gluster provides in order to decide how to behave in these cases, even if the behaviour is POSIX-compliant as in this case.

Comment 11 Csaba Henk 2020-02-25 18:44:17 UTC
Removing needinfo on me as bug has been closed.