Bug 595507 - basic I/O to cmirrors hangs and causes backtraces
basic I/O to cmirrors hangs and causes backtraces
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: lvm2 (Show other bugs)
6.0
All Linux
high Severity high
: rc
: ---
Assigned To: Jonathan Earl Brassow
Corey Marthaler
: Regression, TestBlocker, ZStream
Depends On:
Blocks: 641812
  Show dependency treegraph
 
Reported: 2010-05-24 16:25 EDT by Corey Marthaler
Modified: 2011-05-19 10:25 EDT (History)
12 users (show)

See Also:
Fixed In Version: lvm2-2.02.82-1.el6
Doc Type: Bug Fix
Doc Text:
Under some circumstances, creating cluster mirrors with the '--nosync' option may cause I/O to become extremely slow. Note that this issue only effects I/O immediately after the creation of the mirror, and only when '--nosync' is used. To work around this issue, run the following command after the creating the mirror. lvchange --refresh <VG>/<LV>
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-05-19 10:25:54 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
Patch to fix problem (1.78 KB, patch)
2010-08-30 16:14 EDT, Jonathan Earl Brassow
no flags Details | Diff


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2011:0772 normal SHIPPED_LIVE lvm2 bug fix and enhancement update 2011-05-18 14:08:31 EDT

  None (edit)
Description Corey Marthaler 2010-05-24 16:25:44 EDT
Description of problem:
The following test doesn't fail, but takes quite awhile and causes timeout stack traces on the console.

SCENARIO - [block_io_to_mirrors]
Create 5 mirrors with differing sizes and then do block level I/O...
        - write to mirror and verify data after each creation       
        - second, write and verify randomly from all volumes        

Making mirror block_mirror100M of origin volume
taft-04: lvcreate -m 1 -n block_mirror100M -L 100M --nosync mirror_sanity
  WARNING: New mirror won't be synchronised. Don't read what you didn't write!
Running block level I/O to mirror 100M                                        
b_iogen starting up with the following:                                       

Iterations:      Infinite
Seed:            5943    
Offset-mode:     random  
Single Pass:     off     
Overlap Flag:    on      
Mintrans:        512000  
Maxtrans:        5120000 
Syscalls:        write  writev 
Flags:          direct         

Test Devices:

Path                                                      Size
                                                        (bytes)
---------------------------------------------------------------
/dev/mirror_sanity/block_mirror100M                                        104857600
[...]

INFO: task b_doio:6240 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
b_doio        D ffff88021fc24500     0  6240   6238 0x00000080
 ffff880213801b18 0000000000000086 0000000000000000 ffffffffa00042cc
 ffff880213801ae8 00000000c4d86ceb 0000000000000002 0000000100130963
 ffff880210115a98 ffff880213801fd8 000000000000fd38 ffff880210115a98
Call Trace:
 [<ffffffffa00042cc>] ? dm_table_unplug_all+0x5c/0xd0 [dm_mod]
 [<ffffffff814c8c83>] io_schedule+0x73/0xc0
 [<ffffffff8119854e>] __blockdev_direct_IO+0x6fe/0xc30
 [<ffffffff811248fd>] ? zone_statistics+0x7d/0xa0
 [<ffffffff811856fc>] ? xattr_getsecurity+0x3c/0xa0
 [<ffffffff8119625e>] blkdev_direct_IO+0x4e/0x50
 [<ffffffff81195490>] ? blkdev_get_blocks+0x0/0xc0
 [<ffffffff81106cc2>] generic_file_direct_write+0xc2/0x190
 [<ffffffff8117b41c>] ? file_update_time+0xfc/0x170
 [<ffffffff81108405>] __generic_file_aio_write+0x345/0x480
 [<ffffffff811959ec>] blkdev_aio_write+0x3c/0xa0
 [<ffffffff81161c2a>] do_sync_write+0xfa/0x140
 [<ffffffff8108cbb0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff811fd5fb>] ? selinux_file_permission+0xfb/0x150
 [<ffffffff811f0be6>] ? security_file_permission+0x16/0x20
 [<ffffffff81161f28>] vfs_write+0xb8/0x1a0
 [<ffffffff810cf102>] ? audit_syscall_entry+0x252/0x280
 [<ffffffff81162961>] sys_write+0x51/0x90
 [<ffffffff81013172>] system_call_fastpath+0x16/0x1b


Version-Release number of selected component (if applicable):
2.6.32-25.el6.x86_64

lvm2-2.02.65-1.el6    BUILT: Tue May 18 04:46:06 CDT 2010
lvm2-libs-2.02.65-1.el6    BUILT: Tue May 18 04:46:06 CDT 2010
lvm2-cluster-2.02.65-1.el6    BUILT: Tue May 18 04:46:06 CDT 2010
device-mapper-1.02.48-1.el6    BUILT: Tue May 18 04:46:06 CDT 2010
device-mapper-libs-1.02.48-1.el6    BUILT: Tue May 18 04:46:06 CDT 2010
device-mapper-event-1.02.48-1.el6    BUILT: Tue May 18 04:46:06 CDT 2010
device-mapper-event-libs-1.02.48-1.el6    BUILT: Tue May 18 04:46:06 CDT 2010
cmirror-2.02.65-1.el6    BUILT: Wed May 19 11:19:57 CDT 2010
Comment 2 Jonathan Earl Brassow 2010-06-30 09:20:19 EDT
single machine or cluster mirror?
Comment 3 Corey Marthaler 2010-06-30 11:07:26 EDT
Cluster mirrors, I wasn't able to repo this in single mode, but was able to repo right away again today with cluster mirrors on the following build.

 2.6.32-25.el6.x86_64

lvm2-2.02.68-1.el6    BUILT: Thu Jun 24 10:38:40 CDT 2010
lvm2-libs-2.02.68-1.el6    BUILT: Thu Jun 24 10:38:40 CDT 2010
lvm2-cluster-2.02.68-1.el6    BUILT: Thu Jun 24 10:38:40 CDT 2010
udev-147-2.18.el6    BUILT: Fri Jun 11 07:47:21 CDT 2010
device-mapper-1.02.50-1.el6    BUILT: Thu Jun 24 10:38:40 CDT 2010
device-mapper-libs-1.02.50-1.el6    BUILT: Thu Jun 24 10:38:40 CDT 2010
device-mapper-event-1.02.50-1.el6    BUILT: Thu Jun 24 10:38:40 CDT 2010
device-mapper-event-libs-1.02.50-1.el6    BUILT: Thu Jun 24 10:38:40 CDT 2010
cmirror-2.02.68-1.el6    BUILT: Thu Jun 24 10:38:40 CDT 2010


Running this cmdline should hit this issue:
./mirror_sanity -l $sts-root -r /usr/tests/sts-rhel6.0 -R $cluster.xml -e block_io_to_mirrors
Comment 4 Corey Marthaler 2010-07-02 12:57:55 EDT
I'm hitting this when attempting simple FS I/O cases as well.

SCENARIO - [fs_io_checkit]
Create mirror with simple fs data (checkit) and then verify it
grant-02: lvcreate -m 1 -n fs_mirror -L 20G --nosync mirror_sanity
  WARNING: New mirror won't be synchronised. Don't read what you didn't write!
Placing a lock_dlm gfs2 filesystem on fs_mirror volume
mkfs.gfs2 -j 3 -p lock_dlm -t GRANT:gfs2 /dev/mirror_sanity/fs_mirror -O
Mounting fs_mirror volume

Writing files to /mnt/fs_mirror/grant-01
checkit starting with:
CREATE
Num files:          500
Random Seed:        16454
Verify XIOR Stream: /tmp/checkit_fs_mirror1
Working dir:        /mnt/fs_mirror/grant-01

Jul  2 11:44:49 grant-01 qarshd[16453]: Running cmdline: /usr/tests/sts-rhel6.0/bin/checkit -w /mnt/fs_mirror/grant-01 -f /tmp/checkit_fs_mirror1 -n 500
Jul  2 11:45:53 grant-01 qarshd[16453]: Sending child 16454 signal 2
Jul  2 11:47:04 grant-01 kernel: INFO: task gfs2_logd:16445 blocked for more than 120 seconds.
Jul  2 11:47:04 grant-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul  2 11:47:04 grant-01 kernel: gfs2_logd     D 0000000000000002     0 16445      2 0x00000080
Jul  2 11:47:04 grant-01 kernel: ffff88011a3adcf0 0000000000000046 ffff88011a3adcb0 ffffffffa000434c
Jul  2 11:47:04 grant-01 kernel: ffff880117bb5358 ffff880117bb4b00 ffff880117bb4b00 0000000000000008
Jul  2 11:47:04 grant-01 kernel: ffff880117bb50b8 ffff88011a3adfd8 000000000000fcb8 ffff880117bb50b8
Jul  2 11:47:04 grant-01 kernel: Call Trace:
Jul  2 11:47:04 grant-01 kernel: [<ffffffffa000434c>] ? dm_table_unplug_all+0x5c/0xd0 [dm_mod]
Jul  2 11:47:04 grant-01 kernel: [<ffffffff81098239>] ? ktime_get_ts+0xa9/0xe0
Jul  2 11:47:04 grant-01 kernel: [<ffffffff81192aa0>] ? sync_buffer+0x0/0x50
Jul  2 11:47:04 grant-01 kernel: [<ffffffff814c7e53>] io_schedule+0x73/0xc0
Jul  2 11:47:04 grant-01 kernel: [<ffffffff81192ae0>] sync_buffer+0x40/0x50
Jul  2 11:47:04 grant-01 kernel: [<ffffffff814c86df>] __wait_on_bit+0x5f/0x90
Jul  2 11:47:04 grant-01 kernel: [<ffffffff81192aa0>] ? sync_buffer+0x0/0x50
Jul  2 11:47:04 grant-01 kernel: [<ffffffff814c8788>] out_of_line_wait_on_bit+0x78/0x90
[...]


lvm2-2.02.69-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
lvm2-libs-2.02.69-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
lvm2-cluster-2.02.69-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
device-mapper-1.02.51-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
device-mapper-libs-1.02.51-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
device-mapper-event-1.02.51-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
device-mapper-event-libs-1.02.51-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
cmirror-2.02.69-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
Comment 5 Jonathan Earl Brassow 2010-07-28 16:33:07 EDT
Not sure what is causing this, but will try to fix by last snapshot (or before).
Comment 8 Corey Marthaler 2010-08-24 14:48:39 EDT
Even a simple mkfs can cause this.

Placing a lock_dlm gfs2 filesystem on split_io_verify volume
mkfs.gfs2 -j 3 -p lock_dlm -t GRANT:gfs2 /dev/mirror_sanity/split_io_verify -O

Aug 23 18:03:46 grant-02 qarshd[5779]: Running cmdline: mkfs.gfs2 -j 3 -p lock_dlm -t GRANT:gfs2 /dev/mirror_sanity/split_io_verify -O
Aug 23 18:14:55 grant-02 kernel: INFO: task mkfs.gfs2:5780 blocked for more than 120 seconds.
Aug 23 18:14:55 grant-02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 23 18:14:55 grant-02 kernel: mkfs.gfs2     D ffff88011fc24300     0  5780   5779 0x00000080
Aug 23 18:14:55 grant-02 kernel: ffff880218f75c08 0000000000000082 ffff880218f75bd0 ffff880218f75bcc
Aug 23 18:14:55 grant-02 kernel: 0000000000000000 ffff88011fc24300 ffff880028216980 0000000100a252cd
Aug 23 18:14:55 grant-02 kernel: ffff88021aa6c638 ffff880218f75fd8 0000000000010518 ffff88021aa6c638
Aug 23 18:14:55 grant-02 kernel: Call Trace:
Aug 23 18:14:55 grant-02 kernel: [<ffffffff8110cdc0>] ? sync_page+0x0/0x50
Aug 23 18:14:55 grant-02 kernel: [<ffffffff814c4253>] io_schedule+0x73/0xc0
Aug 23 18:14:55 grant-02 kernel: [<ffffffff8110cdfd>] sync_page+0x3d/0x50
Aug 23 18:14:55 grant-02 kernel: [<ffffffff814c4acf>] __wait_on_bit+0x5f/0x90
Aug 23 18:14:55 grant-02 kernel: [<ffffffff8119e9e0>] ? blkdev_get_block+0x0/0x70
Aug 23 18:14:55 grant-02 kernel: [<ffffffff8110cfb3>] wait_on_page_bit+0x73/0x80
Aug 23 18:14:55 grant-02 kernel: [<ffffffff81091a30>] ? wake_bit_function+0x0/0x50
Aug 23 18:14:55 grant-02 kernel: [<ffffffff8111df77>] ? __writepage+0x17/0x40
Aug 23 18:14:55 grant-02 kernel: [<ffffffff8111f108>] write_cache_pages+0x198/0x3f0
Aug 23 18:14:55 grant-02 kernel: [<ffffffff8111df60>] ? __writepage+0x0/0x40
Aug 23 18:14:55 grant-02 kernel: [<ffffffff8111f384>] generic_writepages+0x24/0x30
Aug 23 18:14:55 grant-02 kernel: [<ffffffff8111f3b1>] do_writepages+0x21/0x40
Aug 23 18:14:55 grant-02 kernel: [<ffffffff8110d49b>] __filemap_fdatawrite_range+0x5b/0x60
Aug 23 18:14:55 grant-02 kernel: [<ffffffff8110d4fa>] filemap_write_and_wait_range+0x5a/0x90
Aug 23 18:14:55 grant-02 kernel: [<ffffffff811966be>] vfs_fsync_range+0x7e/0xe0
Aug 23 18:14:55 grant-02 kernel: [<ffffffff8119678d>] vfs_fsync+0x1d/0x20
Aug 23 18:14:55 grant-02 kernel: [<ffffffff811967ce>] do_fsync+0x3e/0x60
Aug 23 18:14:55 grant-02 kernel: [<ffffffff81196820>] sys_fsync+0x10/0x20
Aug 23 18:14:55 grant-02 kernel: [<ffffffff81013172>] system_call_fastpath+0x16/0x1b
Aug 23 18:14:55 grant-02 kernel: INFO: task flush-253:7:5781 blocked for more than 120 seconds.
Comment 9 Corey Marthaler 2010-08-24 17:11:46 EDT
There should be a 6.0 release note for this issue.
Comment 11 Corey Marthaler 2010-08-30 11:32:49 EDT
Just a note that this issue had been seen before with helter_skelter, a test that does not use --nosync mirrors. This bug is not --nosync specific.
Comment 13 Jonathan Earl Brassow 2010-08-30 16:12:28 EDT
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
An issue has been found, when creating cluster mirrors with the '--nosync' option, that can cause I/O to become extremely slow.  This seems to affect I/O only immediately following the creation of the mirror and only when '--nosync' is used.  A 'lvchange --refresh <VG>/<LV>' after creating the mirror will circumvent the issue.
Comment 14 Jonathan Earl Brassow 2010-08-30 16:14:10 EDT
Created attachment 442025 [details]
Patch to fix problem

Problem + solution description in patch header.
Comment 17 Ryan Lerch 2010-09-23 23:26:44 EDT
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1,3 @@
-An issue has been found, when creating cluster mirrors with the '--nosync' option, that can cause I/O to become extremely slow.  This seems to affect I/O only immediately following the creation of the mirror and only when '--nosync' is used.  A 'lvchange --refresh <VG>/<LV>' after creating the mirror will circumvent the issue.+Under some circumstances, creating cluster mirrors with the '--nosync' option may cause I/O to become extremely slow. Note that this issue only effects I/O immediately after the creation of the mirror, and only when '--nosync' is used. To work around this issue, run the following command after the creating the mirror.
+
+lvchange --refresh <VG>/<LV>
Comment 20 Corey Marthaler 2011-03-16 14:39:39 EDT
This issue no longer exists in the latest 6.1 cmirror testing. Marking verifed.

2.6.32-94.el6.x86_64

lvm2-2.02.83-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
lvm2-libs-2.02.83-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
lvm2-cluster-2.02.83-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
udev-147-2.31.el6    BUILT: Wed Jan 26 05:39:15 CST 2011
device-mapper-1.02.62-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
device-mapper-libs-1.02.62-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
device-mapper-event-1.02.62-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
device-mapper-event-libs-1.02.62-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
cmirror-2.02.83-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
Comment 21 errata-xmlrpc 2011-05-19 10:25:54 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0772.html

Note You need to log in before you can comment on or make changes to this bug.