Description of problem:
This is a split from bug 859589.

Version-Release number of selected component (if applicable):
3.3.0rhsvirt1-9, glusterfs-3.3.0.6rhs-3

How reproducible:
Mostly

Steps to Reproduce:
1. Created a 6x2 distribute-replicate volume
2. Tagged it with group virt
3. Set storage.owner-uid and storage.owner-gid to 36:36
4. FUSE mounted the volume and created 8 qcow2 images of 20G each on it
5. Changed the mount directory ownership to qemu:qemu
6. Created 8 application VMs running RHEL 6.3, with the above qcow2 images as disk backends
7. Added a pair of bricks to the volume
8. Issued rebalance fix-layout followed by rebalance
9. Saw the error messages below in some (4) of the VMs

Actual results:
VMs hang for a brief period, with hung-task messages in dmesg:

INFO: task sadc:4263 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
INFO: task jbd2/dm-0-8:365 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
INFO: task master:3935 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
INFO: task jbd2/dm-0-8:365 blocked for more than 120 seconds.

Expected results:
Activity in the VMs should not be affected.

Additional info:
sosreports available at:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/859589/

This was detected in SMB testing as well.

Steps:
1. Create a 6x2 distribute-replicate volume and start the volume
2. SMB mount on the RHEL client and perform dd:
   dd if=/dev/zero of=/mntpt/file1 bs=10M count=2048
3. On the server, add 2 bricks to make the volume 7x2 and perform the rebalance operation
4. dd hangs and the dd process cannot be killed

dmesg on client:

INFO: task dd:1878 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
dd            D 0000000000000000     0  1878   1789 0x00000084
 ffff880037ff1cd8 0000000000000086 0000000000000000 ffff880037ff1c78
 ffff88007ce15d70 ffffffff00000009 ffff88007cde3d70 0000000000000008
 ffff88007b929098 ffff880037ff1fd8 000000000000fb88 ffff88007b929098
Call Trace:
 [<ffffffff810a1ac9>] ? ktime_get_ts+0xa9/0xe0
 [<ffffffff81119db0>] ? sync_page+0x0/0x50
 [<ffffffff8150e0f3>] io_schedule+0x73/0xc0
 [<ffffffff81119ded>] sync_page+0x3d/0x50
 [<ffffffff8150eaaf>] __wait_on_bit+0x5f/0x90
 [<ffffffff8111a023>] wait_on_page_bit+0x73/0x80
 [<ffffffff81096ce0>] ? wake_bit_function+0x0/0x50
 [<ffffffff8112f025>] ? pagevec_lookup_tag+0x25/0x40
 [<ffffffff8111a44b>] wait_on_page_writeback_range+0xfb/0x190
 [<ffffffff8111a50f>] filemap_fdatawait+0x2f/0x40
 [<ffffffff8111ab34>] filemap_write_and_wait+0x44/0x60
 [<ffffffffa02f37d5>] cifs_flush+0x55/0x80 [cifs]
 [<ffffffff8117dd9c>] filp_close+0x3c/0x90
 [<ffffffff8117de95>] sys_close+0xa5/0x100
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

statedump and sosreport are at:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/859589/smb_test/
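The SMB reproduction above can be sketched as a command sequence. This is an illustrative sketch only: the volume name, server and brick paths, Samba share name, mount point, and SMB user below are placeholders assumed for the example, not values recorded in the original report.

```shell
# All names here (testvol, serverN, /rhs/brickN, smbuser, /mntpt) are
# placeholders -- the original report does not record the exact values.
VOL=testvol

# 1. Create a 6x2 distribute-replicate volume (12 bricks, replica 2) and start it
gluster volume create $VOL replica 2 \
    server1:/rhs/brick1 server2:/rhs/brick1 \
    server3:/rhs/brick1 server4:/rhs/brick1 \
    server1:/rhs/brick2 server2:/rhs/brick2 \
    server3:/rhs/brick2 server4:/rhs/brick2 \
    server1:/rhs/brick3 server2:/rhs/brick3 \
    server3:/rhs/brick3 server4:/rhs/brick3
gluster volume start $VOL

# 2. On the RHEL client, mount the volume over SMB and start a large dd
#    (share name assumed to follow the gluster-<volume> convention)
mount -t cifs //server1/gluster-$VOL /mntpt -o username=smbuser
dd if=/dev/zero of=/mntpt/file1 bs=10M count=2048 &

# 3. On the server, add a pair of bricks (making the volume 7x2) and rebalance
gluster volume add-brick $VOL server1:/rhs/brick4 server2:/rhs/brick4
gluster volume rebalance $VOL start
gluster volume rebalance $VOL status

# 4. On the client: dd hangs and cannot be killed; dmesg shows
#    "blocked for more than 120 seconds" messages as above
```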
Comment copied from https://bugzilla.redhat.com/show_bug.cgi?id=859589#c41 <snip> Here is the technical background on why, normally, at least one flush() may take extra time. When a rebalance starts migrating a file, the clients, because we handle open-fd rebalancing, learn that the destination file is under migration. Knowing this, before returning the flush() call to the application, the client does an open() on the destination file and, on success, issues the same flush() on the destination too. That flush() in turn has to wait in the long queue of activities handled by the glusterfsd process. This is why one can see 'blocked for more than 120 seconds' logs in dmesg while a rebalance is in progress. As for the claim that it happens only with a few VMs, or only with many: it depends on the amount of I/O waiting on the server side and on the overall system setup. This is exactly why we recommend running the rebalance operation when the load on the system is low (read: weekends/holidays, etc.). Also note that a rebalance is advised every time an add-brick is done (which is itself a rare operation). I propose that we take the 'blocker' flag off this bug as long as none of the VMs hang permanently and we are not hitting a non-recoverable state. </snip>
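The maintenance flow the comment recommends (rebalance after every add-brick, run during a low-load window) corresponds to the following CLI sequence. The volume name and brick paths are placeholder names assumed for illustration:

```shell
# Placeholder volume/brick names; run during a low-load window, as advised above.
VOL=testvol

# add-brick is the (rare) trigger after which a rebalance is advised
gluster volume add-brick $VOL server1:/rhs/brick-new server2:/rhs/brick-new

# First fix the directory layout so the new bricks participate in placement...
gluster volume rebalance $VOL fix-layout start

# ...then migrate existing data. Open fds are handled by open-fd rebalancing,
# during which clients mirror flush() to the migration destination -- the
# source of the extra flush() latency described in the comment.
gluster volume rebalance $VOL start

# Poll until the migration completes
gluster volume rebalance $VOL status
```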
Amar, Naga has identified this bug as a Known Issue for the Update 5 release. Could you provide the inputs in the Doc Text field? Thanks, Divya
The product version of Red Hat Storage on which this issue was reported has reached End Of Life (EOL) [1]; hence this bug report is being closed. If the issue is still observed on a current version of Red Hat Storage, please file a new bug report against the current version. [1] https://rhn.redhat.com/errata/RHSA-2014-0821.html