Bug 1162507 - Call traces (blocked for more than 120 seconds) found in App VMs while rebalance operation is in progress
Summary: Call traces (blocked for more than 120 seconds) found in App VMs while rebalance operation is in progress
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: distribute
Version: 2.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Bug Updates Notification Mailing List
QA Contact: storage-qa-internal@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1286087
 
Reported: 2014-11-11 08:16 UTC by SATHEESARAN
Modified: 2015-11-27 10:42 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1286087
Environment:
Last Closed: 2015-11-27 10:41:57 UTC
Embargoed:


Attachments
/var/log/messages as seen on appvm1 (97.92 KB, text/plain)
2014-11-11 08:29 UTC, SATHEESARAN
no flags Details
/var/log/messages as seen on appvm2 (57.73 KB, text/plain)
2014-11-11 08:29 UTC, SATHEESARAN
no flags Details
sosreport from RHS Node 1 (6.98 MB, application/x-xz)
2014-11-11 08:32 UTC, SATHEESARAN
no flags Details
sosreport from RHS Node 2 (10.35 MB, application/x-xz)
2014-11-11 08:34 UTC, SATHEESARAN
no flags Details

Description SATHEESARAN 2014-11-11 08:16:04 UTC
Description of problem:
-----------------------
A few App VMs have their disk images on a distributed-replicate gluster volume. These VMs were performing IO operations simulated with the 'dd' command.

While rebalance was in progress, after adding a pair of bricks to this volume,
call traces appeared in /var/log/messages and in /var/log/dmesg.
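
For reference, a minimal way to spot these hung-task reports on a guest (the log paths are the RHEL 6 defaults):

# Search the persistent syslog for hung-task reports
grep -B 2 -A 15 "blocked for more than 120 seconds" /var/log/messages

# The same traces also land in the kernel ring buffer
dmesg | grep -A 15 "blocked for more than 120 seconds"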

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
glusterfs-3.4.0.70rhs-1.el6rhs

How reproducible:
-----------------
Tried once, but this should be consistently reproducible.

Steps to Reproduce:
-------------------
1. Create a 2x2 distributed-replicate volume
2. Optimize the volume for virt-store by tagging it with the 'virt' group
3. Use this volume as a 'DataStore' in a RHEV-M Datacenter
4. Create a few App VMs with their disk images on that 'DataStore' backed by the gluster volume
5. Install RHEL 6.6 on each VM and reboot after installation
6. Run a sample IO load on the VM, simulated with the 'dd' command,
continuously creating files (see the command sketch after this list)
7. Add a pair of bricks to this volume
8. Perform rebalance on the volume
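
For reference, the steps above roughly map to the following command sequence. This is a sketch only: the hostnames (rhs1, rhs2), brick paths, and the volume name 'vmstore' are illustrative, not taken from this setup.

# 1-2. Create a 2x2 distributed-replicate volume and tag it with the 'virt' group
gluster volume create vmstore replica 2 \
    rhs1:/rhs/brick1/vmstore rhs2:/rhs/brick1/vmstore \
    rhs1:/rhs/brick2/vmstore rhs2:/rhs/brick2/vmstore
gluster volume set vmstore group virt
gluster volume start vmstore

# 6. Sample IO load inside the guest: continuously create files with 'dd'
while true; do
    dd if=/dev/urandom of=/var/tmp/file.$(date +%s) bs=1M count=100 oflag=direct
    sleep 1
done

# 7-8. Add a replica pair of bricks and start the rebalance
gluster volume add-brick vmstore rhs1:/rhs/brick3/vmstore rhs2:/rhs/brick3/vmstore
gluster volume rebalance vmstore start
gluster volume rebalance vmstore status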

Actual results:
---------------
Observed some call traces in /var/log/messages and in /var/log/dmesg

Expected results:
-----------------
No call traces should be observed

Comment 1 SATHEESARAN 2014-11-11 08:20:21 UTC
Additional Information
----------------------

Call traces as seen in AppVms
------------------------------

AppVM1
-------
Nov 10 11:53:28 appvm1 kernel: INFO: task sadc:22619 blocked for more than 120 seconds.
Nov 10 11:53:28 appvm1 kernel:      Not tainted 2.6.32-504.el6.x86_64 #1
Nov 10 11:53:28 appvm1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 10 11:53:28 appvm1 kernel: sadc          D 0000000000000000     0 22619  22615 0x00000084
Nov 10 11:53:28 appvm1 kernel: ffff88003a80fdc8 0000000000000082 ffff88003a80fe70 0000000000000001
Nov 10 11:53:28 appvm1 kernel: 000000000000000e ffff88003a80fde8 ffff88003a80fd88 ffffffff81123d80
Nov 10 11:53:28 appvm1 kernel: ffff88003e0dedd8 0000000101383693 ffff88003b39a5f8 ffff88003a80ffd8
Nov 10 11:53:28 appvm1 kernel: Call Trace:
Nov 10 11:53:28 appvm1 kernel: [<ffffffff81123d80>] ? find_get_pages_tag+0x40/0x130
Nov 10 11:53:28 appvm1 kernel: [<ffffffff8109ee2e>] ? prepare_to_wait+0x4e/0x80
Nov 10 11:53:28 appvm1 kernel: [<ffffffffa00945a5>] jbd2_log_wait_commit+0xc5/0x140 [jbd2]
Nov 10 11:53:28 appvm1 kernel: [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
Nov 10 11:53:28 appvm1 kernel: [<ffffffffa0094938>] jbd2_complete_transaction+0x68/0xb0 [jbd2]
Nov 10 11:53:28 appvm1 kernel: [<ffffffffa00b0061>] ext4_sync_file+0x121/0x1d0 [ext4]
Nov 10 11:53:28 appvm1 kernel: [<ffffffff811c07f1>] vfs_fsync_range+0xa1/0x100
Nov 10 11:53:28 appvm1 kernel: [<ffffffff811c08bd>] vfs_fsync+0x1d/0x20
Nov 10 11:53:28 appvm1 kernel: [<ffffffff811c08fe>] do_fsync+0x3e/0x60
Nov 10 11:53:28 appvm1 kernel: [<ffffffff811c0933>] sys_fdatasync+0x13/0x20
Nov 10 11:53:28 appvm1 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Nov 10 12:33:27 appvm1 kernel: INFO: task master:1446 blocked for more than 120 seconds.
Nov 10 12:33:27 appvm1 kernel:      Not tainted 2.6.32-504.el6.x86_64 #1
Nov 10 12:33:27 appvm1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 10 12:33:27 appvm1 kernel: master        D 0000000000000000     0  1446      1 0x00000084
Nov 10 12:33:27 appvm1 kernel: ffff88003a20ba48 0000000000000082 ffff88003a20b9c8 0000000300000001
Nov 10 12:33:27 appvm1 kernel: ffff88003a20b9c8 ffff88003bad5a40 ffff88003a20b9c8 ffffffff811240de
Nov 10 12:33:27 appvm1 kernel: 0000000000013760 0000000000180087 ffff88003ae27af8 ffff88003a20bfd8
Nov 10 12:33:27 appvm1 kernel: Call Trace:
Nov 10 12:33:27 appvm1 kernel: [<ffffffff811240de>] ? find_get_page+0x1e/0xa0
Nov 10 12:33:27 appvm1 kernel: [<ffffffff8109ee2e>] ? prepare_to_wait+0x4e/0x80
Nov 10 12:33:27 appvm1 kernel: [<ffffffffa008e09d>] do_get_write_access+0x29d/0x520 [jbd2]
Nov 10 12:33:27 appvm1 kernel: [<ffffffff8109eb80>] ? wake_bit_function+0x0/0x50
Nov 10 12:33:27 appvm1 kernel: [<ffffffffa008e471>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
Nov 10 12:33:27 appvm1 kernel: [<ffffffffa00dbd98>] __ext4_journal_get_write_access+0x38/0x80 [ext4]
Nov 10 12:33:27 appvm1 kernel: [<ffffffffa00b5bb3>] ext4_reserve_inode_write+0x73/0xa0 [ext4]
Nov 10 12:33:27 appvm1 kernel: [<ffffffffa00b5c2c>] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
Nov 10 12:33:27 appvm1 kernel: [<ffffffffa008d495>] ? jbd2_journal_start+0xb5/0x100 [jbd2]
Nov 10 12:33:27 appvm1 kernel: [<ffffffffa00b5f20>] ext4_dirty_inode+0x40/0x60 [ext4]
Nov 10 12:33:27 appvm1 kernel: [<ffffffff811ba68b>] __mark_inode_dirty+0x3b/0x160
Nov 10 12:33:27 appvm1 kernel: [<ffffffff811aad72>] file_update_time+0xf2/0x170
Nov 10 12:33:27 appvm1 kernel: [<ffffffff81198ec0>] pipe_write+0x300/0x6a0
Nov 10 12:33:27 appvm1 kernel: [<ffffffff8118dd5a>] do_sync_write+0xfa/0x140
Nov 10 12:33:27 appvm1 kernel: [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
Nov 10 12:33:27 appvm1 kernel: [<ffffffff8123a5eb>] ? selinux_file_permission+0xfb/0x150
Nov 10 12:33:27 appvm1 kernel: [<ffffffff8122d446>] ? security_file_permission+0x16/0x20
Nov 10 12:33:27 appvm1 kernel: [<ffffffff8118e058>] vfs_write+0xb8/0x1a0
Nov 10 12:33:27 appvm1 kernel: [<ffffffff8118ea21>] sys_write+0x51/0x90
Nov 10 12:33:27 appvm1 kernel: [<ffffffff810e5a7e>] ? __audit_syscall_exit+0x25e/0x290
Nov 10 12:33:27 appvm1 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Nov 10 12:33:27 appvm1 kernel: INFO: task qmgr:1460 blocked for more than 120 seconds.

AppVM2
------
Nov 10 13:44:08 appvm2 kernel: INFO: task sadc:30563 blocked for more than 120 seconds.
Nov 10 13:44:08 appvm2 kernel:      Not tainted 2.6.32-504.el6.x86_64 #1
Nov 10 13:44:08 appvm2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 10 13:44:08 appvm2 kernel: sadc          D 0000000000000000     0 30563  30559 0x00000084
Nov 10 13:44:08 appvm2 kernel: ffff88007d197dc8 0000000000000086 0000000000000000 0000000000000001
Nov 10 13:44:08 appvm2 kernel: 000000000000000e ffff88007d197de8 00001bd101703ffc ffffffff81123d80
Nov 10 13:44:08 appvm2 kernel: ffff88007dd25608 0000000101ce1aae ffff88007bc33098 ffff88007d197fd8
Nov 10 13:44:08 appvm2 kernel: Call Trace:
Nov 10 13:44:08 appvm2 kernel: [<ffffffff81123d80>] ? find_get_pages_tag+0x40/0x130
Nov 10 13:44:08 appvm2 kernel: [<ffffffffa00945a5>] jbd2_log_wait_commit+0xc5/0x140 [jbd2]
Nov 10 13:44:08 appvm2 kernel: [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
Nov 10 13:44:08 appvm2 kernel: [<ffffffffa0094938>] jbd2_complete_transaction+0x68/0xb0 [jbd2]
Nov 10 13:44:08 appvm2 kernel: [<ffffffffa00b0061>] ext4_sync_file+0x121/0x1d0 [ext4]
Nov 10 13:44:08 appvm2 kernel: [<ffffffff811c07f1>] vfs_fsync_range+0xa1/0x100
Nov 10 13:44:08 appvm2 kernel: [<ffffffff811c08bd>] vfs_fsync+0x1d/0x20
Nov 10 13:44:08 appvm2 kernel: [<ffffffff811c08fe>] do_fsync+0x3e/0x60
Nov 10 13:44:08 appvm2 kernel: [<ffffffff811c0933>] sys_fdatasync+0x13/0x20
Nov 10 13:44:08 appvm2 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Nov 10 14:54:08 appvm2 kernel: INFO: task sadc:30837 blocked for more than 120 seconds.
Nov 10 14:54:08 appvm2 kernel:      Not tainted 2.6.32-504.el6.x86_64 #1
Nov 10 14:54:08 appvm2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 10 14:54:08 appvm2 kernel: sadc          D 0000000000000000     0 30837  30833 0x00000084
Nov 10 14:54:08 appvm2 kernel: ffff88007b21ddc8 0000000000000086 0000000000000000 0000000000000001
Nov 10 14:54:08 appvm2 kernel: 000000000000000e ffff88007b21dde8 00001fa6594c9608 ffffffff81123d80
Nov 10 14:54:08 appvm2 kernel: ffff88007dd25608 00000001020e6a88 ffff88007b2ab098 ffff88007b21dfd8
Nov 10 14:54:08 appvm2 kernel: Call Trace:
Nov 10 14:54:08 appvm2 kernel: [<ffffffff81123d80>] ? find_get_pages_tag+0x40/0x130
Nov 10 14:54:08 appvm2 kernel: [<ffffffffa00945a5>] jbd2_log_wait_commit+0xc5/0x140 [jbd2]
Nov 10 14:54:08 appvm2 kernel: [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
Nov 10 14:54:08 appvm2 kernel: [<ffffffffa0094938>] jbd2_complete_transaction+0x68/0xb0 [jbd2]
Nov 10 14:54:08 appvm2 kernel: [<ffffffffa00b0061>] ext4_sync_file+0x121/0x1d0 [ext4]
Nov 10 14:54:08 appvm2 kernel: [<ffffffff811c07f1>] vfs_fsync_range+0xa1/0x100
Nov 10 14:54:08 appvm2 kernel: [<ffffffff811c08bd>] vfs_fsync+0x1d/0x20
Nov 10 14:54:08 appvm2 kernel: [<ffffffff811c08fe>] do_fsync+0x3e/0x60
Nov 10 14:54:08 appvm2 kernel: [<ffffffff811c0933>] sys_fdatasync+0x13/0x20
Nov 10 14:54:08 appvm2 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

Comment 2 SATHEESARAN 2014-11-11 08:29:04 UTC
Created attachment 956174 [details]
/var/log/messages as seen on appvm1

/var/log/messages file as seen on appvm1

Comment 3 SATHEESARAN 2014-11-11 08:29:46 UTC
Created attachment 956175 [details]
/var/log/messages as seen on appvm2

/var/log/messages as seen on appvm2

Comment 4 SATHEESARAN 2014-11-11 08:32:15 UTC
Created attachment 956176 [details]
sosreport from RHS Node 1

sosreport from rhs node1

Comment 5 SATHEESARAN 2014-11-11 08:34:59 UTC
Created attachment 956177 [details]
sosreport from RHS Node 2

sosreport from rhs node 2

Comment 6 SATHEESARAN 2014-11-11 09:17:13 UTC
This issue is similar to one seen with RHS 2.0, which was tracked in this bug - https://bugzilla.redhat.com/show_bug.cgi?id=979901

I am not sure whether both issues are the same.

Apart from the error messages and a momentary VM hang, everything returns to normal.
I could reboot the VM and there were no other problems, hence this bug is filed with medium severity.

If the issue seen in this bug is the same as BZ 979901, then it needs to be added to the known issues.
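
As the trace text itself notes, the 120-second reporting threshold is tunable inside the guest. A sketch of adjusting it (this only changes the reporting; it does not remove the underlying IO stall):

# Disable the hung-task messages entirely, as the kernel log line above suggests
echo 0 > /proc/sys/kernel/hung_task_timeout_secs

# Or raise the threshold, e.g. to 600 seconds, to tolerate longer rebalance-induced stalls
sysctl -w kernel.hung_task_timeout_secs=600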

Comment 7 SATHEESARAN 2014-11-11 09:17:49 UTC
The comment https://bugzilla.redhat.com/show_bug.cgi?id=859589#c41 also provides some information on this issue.

Comment 8 SATHEESARAN 2014-11-12 05:51:52 UTC
The issue is already documented in the Release Notes as a 'known issue'.

As RHS 2.0 has reached End Of Life (EOL), this bug can be used for tracking the issue with RHS 2.1.x.

Comment 9 Susant Kumar Palai 2015-11-27 10:41:57 UTC
Cloning to 3.1. To be fixed in a future release.

