Description of problem:
=======================
IO hung while doing an in-service update from RHGS 3.1.3 to 3.2. At the time of the hang the servers were already on RHGS 3.2 bits and the client was still on 3.1.3 bits.

Some dev debug details on the live setup:
=========================================
[root@dhcp gluster]# cat /proc/6370/stack
[<ffffffff811a490b>] pipe_wait+0x5b/0x80
[<ffffffff811a4caa>] pipe_write+0x37a/0x6b0
[<ffffffff8119996a>] do_sync_write+0xfa/0x140
[<ffffffff81199c68>] vfs_write+0xb8/0x1a0
[<ffffffff8119a7a1>] sys_write+0x51/0xb0
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

[root@dhcp gluster]# cat /proc/6369/stack
[<ffffffffa02592ad>] __fuse_request_send+0xed/0x2b0 [fuse]
[<ffffffffa0259482>] fuse_request_send+0x12/0x20 [fuse]
[<ffffffffa0260176>] fuse_flush+0x106/0x140 [fuse]
[<ffffffff8119683c>] filp_close+0x3c/0x90
[<ffffffff81196935>] sys_close+0xa5/0x100
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

[root@dhcp gluster]# lsof /mnt
COMMAND  PID USER  FD  TYPE DEVICE  SIZE/OFF                 NODE NAME
bash    6240 root cwd   DIR   0,20      4096                    1 /mnt
tar     6369 root cwd   DIR   0,20      4096                    1 /mnt
xz      6370 root cwd   DIR   0,20      4096                    1 /mnt
xz      6370 root  0r   REG   0,20  91976832 11426561208144685685 /mnt/linux-4.8.11.tar.xz (deleted)

Version-Release number of selected component (if applicable):
=============================================================
Server: glusterfs-3.8.4-5.el6rhs.x86_64
Client: glusterfs-3.7.9-12.el6.x86_64

How reproducible:
=================
One time

Steps to Reproduce:
===================
1. Do an in-service update from 3.1.3 to 3.2
2.
3.

Actual results:
===============
IO hung.

Expected results:
=================
IO should not hang.

Additional info:
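For context, the lsof and stack output above point to a kernel-tarball untar as the hung IO: xz (PID 6370) is blocked in pipe_write feeding tar (PID 6369), which is itself stuck in fuse_flush while closing a file on the FUSE mount. A minimal sketch of that workload (assuming the tarball was already copied to the mount; paths taken from the lsof output):

# on the 3.1.3 client, with the volume fuse-mounted at /mnt
cd /mnt
tar xJf linux-4.8.11.tar.xz    # tar forks xz; xz streams the decompressed data to tar over a pipe
# start the rolling in-service update of the servers while the untar is running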
Some more details:

[root@ ~]# lsof /mnt/
COMMAND   PID USER  FD  TYPE DEVICE  SIZE/OFF                 NODE NAME
bash    32183 root cwd   DIR   0,20      4096                    1 /mnt
bash    32369 root cwd   DIR   0,20      4096 10728560248349618169 /mnt/tmp
tar     32403 root cwd   DIR   0,20      4096 10728560248349618169 /mnt/tmp
xz      32404 root cwd   DIR   0,20      4096 10728560248349618169 /mnt/tmp
xz      32404 root  0r   REG   0,20  91976832 10482997327777766662 /mnt/tmp/linux-4.8.11.tar.xz

[root@ ~]# cat /proc/32404/stack
[<ffffffff811a490b>] pipe_wait+0x5b/0x80
[<ffffffff811a4caa>] pipe_write+0x37a/0x6b0
[<ffffffff8119996a>] do_sync_write+0xfa/0x140
[<ffffffff81199c68>] vfs_write+0xb8/0x1a0
[<ffffffff8119a7a1>] sys_write+0x51/0xb0
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

[root@ ~]# cat /proc/32403/stack
[<ffffffffa0259181>] wait_answer_interruptible+0x81/0xc0 [fuse]
[<ffffffffa025939b>] __fuse_request_send+0x1db/0x2b0 [fuse]
[<ffffffffa0259482>] fuse_request_send+0x12/0x20 [fuse]
[<ffffffffa0260176>] fuse_flush+0x106/0x140 [fuse]
[<ffffffff8119683c>] filp_close+0x3c/0x90
[<ffffffff81196935>] sys_close+0xa5/0x100
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
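If the hang is hit again, the same diagnostics can be collected with commands along these lines (a sketch; <PID> and <volname> are placeholders):

lsof /mnt                             # identify the processes stuck on the mount
cat /proc/<PID>/stack                 # a kernel stack ending in fuse_flush / __fuse_request_send means the request is stuck inside gluster
gluster volume status <volname>       # on a server: confirm which bricks are online mid-update
gluster volume statedump <volname>    # on a server: dumps brick state under /var/run/gluster (pending locks/fops)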
Given that this has been hit one more time, it needs to be fixed, considering the severity and the fact that it impacts the upgrade path. Providing dev_ack.
This is a bug in 3.1.3 as per my observations and testing. There is a fix that is missing in 3.1.3; the same issue is fixed in 3.2.0. [1] is the link to the upstream fix and [2] to the downstream one. The patch explains the scenario very well. I applied [1] on 3.1.3, upgraded the servers to the glusterfs-3.8.4-7.el6rhs.x86_64 build, and tried to reproduce the issue with single and multiple clients. I did not hit the issue in either case. [3] is the link to the custom build I used while trying to reproduce the issue; it includes [1]. The linux kernel untar took ~30 mins in both cases. Could you please try to reproduce the issue with [3] and confirm whether we hit it again or not?

[1] http://review.gluster.org/#/c/15579/
[2] https://code.engineering.redhat.com/gerrit/#/c/91956/
[3] https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12279228
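For the retest, roughly the following should do (a sketch; the exact client package NVRs come from [3], the server build is the one mentioned above):

# on the client: install the custom 3.1.3-based client build from [3], then confirm versions on both ends
rpm -qa | grep glusterfs            # client: custom build from [3]; servers: glusterfs-3.8.4-7.el6rhs.x86_64
# rerun the untar on the fuse mount while the servers are being updated; ~30 min was the baseline in my runs
cd /mnt && time tar xJf linux-4.8.11.tar.xz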
Verified this issue multiple times, updating from 3.1.3 bits to 3.2.0 (glusterfs-3.8.4-10) and from glusterfs-3.8.4-7 to glusterfs-3.8.4-10. In both cases the update went through cleanly and the reported issue was not seen. Moving to verified state.
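For context, each in-service update run was done node by node, roughly along these lines (a sketch of the standard rolling update flow, not a verbatim record of this run; <volname> is a placeholder):

# on each server node, one at a time
service glusterd stop
pkill glusterfs; pkill glusterfsd      # stop self-heal/NFS and brick processes before updating
yum update 'glusterfs*'                # move the node to the target build
service glusterd start
gluster volume heal <volname> info     # wait for pending heals to drain before taking the next node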
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days