595507 – basic I/O to cmirrors hangs and causes backtraces

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 595507 - basic I/O to cmirrors hangs and causes backtraces

Summary: basic I/O to cmirrors hangs and causes backtraces

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	lvm2
Sub Component:
Version:	6.0
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Jonathan Earl Brassow
QA Contact:	Corey Marthaler
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	641812
TreeView+	depends on / blocked

Reported:	2010-05-24 20:25 UTC by Corey Marthaler
Modified:	2011-05-19 14:25 UTC (History)
CC List:	12 users (show)
Fixed In Version:	lvm2-2.02.82-1.el6
Doc Type:	Bug Fix
Doc Text:	Under some circumstances, creating cluster mirrors with the '--nosync' option may cause I/O to become extremely slow. Note that this issue only effects I/O immediately after the creation of the mirror, and only when '--nosync' is used. To work around this issue, run the following command after the creating the mirror. lvchange --refresh <VG>/<LV>
Clone Of:
Environment:
Last Closed:	2011-05-19 14:25:54 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
Patch to fix problem (1.78 KB, patch) 2010-08-30 20:14 UTC, Jonathan Earl Brassow	no flags	Details \| Diff
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2011:0772	0	normal	SHIPPED_LIVE	lvm2 bug fix and enhancement update	2011-05-18 18:08:31 UTC

Description Corey Marthaler 2010-05-24 20:25:44 UTC

Description of problem:
The following test doesn't fail, but takes quite awhile and causes timeout stack traces on the console.

SCENARIO - [block_io_to_mirrors]
Create 5 mirrors with differing sizes and then do block level I/O...
        - write to mirror and verify data after each creation       
        - second, write and verify randomly from all volumes        

Making mirror block_mirror100M of origin volume
taft-04: lvcreate -m 1 -n block_mirror100M -L 100M --nosync mirror_sanity
  WARNING: New mirror won't be synchronised. Don't read what you didn't write!
Running block level I/O to mirror 100M                                        
b_iogen starting up with the following:                                       

Iterations:      Infinite
Seed:            5943    
Offset-mode:     random  
Single Pass:     off     
Overlap Flag:    on      
Mintrans:        512000  
Maxtrans:        5120000 
Syscalls:        write  writev 
Flags:          direct         

Test Devices:

Path                                                      Size
                                                        (bytes)
---------------------------------------------------------------
/dev/mirror_sanity/block_mirror100M                                        104857600
[...]

INFO: task b_doio:6240 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
b_doio        D ffff88021fc24500     0  6240   6238 0x00000080
 ffff880213801b18 0000000000000086 0000000000000000 ffffffffa00042cc
 ffff880213801ae8 00000000c4d86ceb 0000000000000002 0000000100130963
 ffff880210115a98 ffff880213801fd8 000000000000fd38 ffff880210115a98
Call Trace:
 [<ffffffffa00042cc>] ? dm_table_unplug_all+0x5c/0xd0 [dm_mod]
 [<ffffffff814c8c83>] io_schedule+0x73/0xc0
 [<ffffffff8119854e>] __blockdev_direct_IO+0x6fe/0xc30
 [<ffffffff811248fd>] ? zone_statistics+0x7d/0xa0
 [<ffffffff811856fc>] ? xattr_getsecurity+0x3c/0xa0
 [<ffffffff8119625e>] blkdev_direct_IO+0x4e/0x50
 [<ffffffff81195490>] ? blkdev_get_blocks+0x0/0xc0
 [<ffffffff81106cc2>] generic_file_direct_write+0xc2/0x190
 [<ffffffff8117b41c>] ? file_update_time+0xfc/0x170
 [<ffffffff81108405>] __generic_file_aio_write+0x345/0x480
 [<ffffffff811959ec>] blkdev_aio_write+0x3c/0xa0
 [<ffffffff81161c2a>] do_sync_write+0xfa/0x140
 [<ffffffff8108cbb0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff811fd5fb>] ? selinux_file_permission+0xfb/0x150
 [<ffffffff811f0be6>] ? security_file_permission+0x16/0x20
 [<ffffffff81161f28>] vfs_write+0xb8/0x1a0
 [<ffffffff810cf102>] ? audit_syscall_entry+0x252/0x280
 [<ffffffff81162961>] sys_write+0x51/0x90
 [<ffffffff81013172>] system_call_fastpath+0x16/0x1b


Version-Release number of selected component (if applicable):
2.6.32-25.el6.x86_64

lvm2-2.02.65-1.el6    BUILT: Tue May 18 04:46:06 CDT 2010
lvm2-libs-2.02.65-1.el6    BUILT: Tue May 18 04:46:06 CDT 2010
lvm2-cluster-2.02.65-1.el6    BUILT: Tue May 18 04:46:06 CDT 2010
device-mapper-1.02.48-1.el6    BUILT: Tue May 18 04:46:06 CDT 2010
device-mapper-libs-1.02.48-1.el6    BUILT: Tue May 18 04:46:06 CDT 2010
device-mapper-event-1.02.48-1.el6    BUILT: Tue May 18 04:46:06 CDT 2010
device-mapper-event-libs-1.02.48-1.el6    BUILT: Tue May 18 04:46:06 CDT 2010
cmirror-2.02.65-1.el6    BUILT: Wed May 19 11:19:57 CDT 2010

Comment 2 Jonathan Earl Brassow 2010-06-30 13:20:19 UTC

single machine or cluster mirror?

Comment 3 Corey Marthaler 2010-06-30 15:07:26 UTC

Cluster mirrors, I wasn't able to repo this in single mode, but was able to repo right away again today with cluster mirrors on the following build.

 2.6.32-25.el6.x86_64

lvm2-2.02.68-1.el6    BUILT: Thu Jun 24 10:38:40 CDT 2010
lvm2-libs-2.02.68-1.el6    BUILT: Thu Jun 24 10:38:40 CDT 2010
lvm2-cluster-2.02.68-1.el6    BUILT: Thu Jun 24 10:38:40 CDT 2010
udev-147-2.18.el6    BUILT: Fri Jun 11 07:47:21 CDT 2010
device-mapper-1.02.50-1.el6    BUILT: Thu Jun 24 10:38:40 CDT 2010
device-mapper-libs-1.02.50-1.el6    BUILT: Thu Jun 24 10:38:40 CDT 2010
device-mapper-event-1.02.50-1.el6    BUILT: Thu Jun 24 10:38:40 CDT 2010
device-mapper-event-libs-1.02.50-1.el6    BUILT: Thu Jun 24 10:38:40 CDT 2010
cmirror-2.02.68-1.el6    BUILT: Thu Jun 24 10:38:40 CDT 2010


Running this cmdline should hit this issue:
./mirror_sanity -l $sts-root -r /usr/tests/sts-rhel6.0 -R $cluster.xml -e block_io_to_mirrors

Comment 4 Corey Marthaler 2010-07-02 16:57:55 UTC

I'm hitting this when attempting simple FS I/O cases as well.

SCENARIO - [fs_io_checkit]
Create mirror with simple fs data (checkit) and then verify it
grant-02: lvcreate -m 1 -n fs_mirror -L 20G --nosync mirror_sanity
  WARNING: New mirror won't be synchronised. Don't read what you didn't write!
Placing a lock_dlm gfs2 filesystem on fs_mirror volume
mkfs.gfs2 -j 3 -p lock_dlm -t GRANT:gfs2 /dev/mirror_sanity/fs_mirror -O
Mounting fs_mirror volume

Writing files to /mnt/fs_mirror/grant-01
checkit starting with:
CREATE
Num files:          500
Random Seed:        16454
Verify XIOR Stream: /tmp/checkit_fs_mirror1
Working dir:        /mnt/fs_mirror/grant-01

Jul  2 11:44:49 grant-01 qarshd[16453]: Running cmdline: /usr/tests/sts-rhel6.0/bin/checkit -w /mnt/fs_mirror/grant-01 -f /tmp/checkit_fs_mirror1 -n 500
Jul  2 11:45:53 grant-01 qarshd[16453]: Sending child 16454 signal 2
Jul  2 11:47:04 grant-01 kernel: INFO: task gfs2_logd:16445 blocked for more than 120 seconds.
Jul  2 11:47:04 grant-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul  2 11:47:04 grant-01 kernel: gfs2_logd     D 0000000000000002     0 16445      2 0x00000080
Jul  2 11:47:04 grant-01 kernel: ffff88011a3adcf0 0000000000000046 ffff88011a3adcb0 ffffffffa000434c
Jul  2 11:47:04 grant-01 kernel: ffff880117bb5358 ffff880117bb4b00 ffff880117bb4b00 0000000000000008
Jul  2 11:47:04 grant-01 kernel: ffff880117bb50b8 ffff88011a3adfd8 000000000000fcb8 ffff880117bb50b8
Jul  2 11:47:04 grant-01 kernel: Call Trace:
Jul  2 11:47:04 grant-01 kernel: [<ffffffffa000434c>] ? dm_table_unplug_all+0x5c/0xd0 [dm_mod]
Jul  2 11:47:04 grant-01 kernel: [<ffffffff81098239>] ? ktime_get_ts+0xa9/0xe0
Jul  2 11:47:04 grant-01 kernel: [<ffffffff81192aa0>] ? sync_buffer+0x0/0x50
Jul  2 11:47:04 grant-01 kernel: [<ffffffff814c7e53>] io_schedule+0x73/0xc0
Jul  2 11:47:04 grant-01 kernel: [<ffffffff81192ae0>] sync_buffer+0x40/0x50
Jul  2 11:47:04 grant-01 kernel: [<ffffffff814c86df>] __wait_on_bit+0x5f/0x90
Jul  2 11:47:04 grant-01 kernel: [<ffffffff81192aa0>] ? sync_buffer+0x0/0x50
Jul  2 11:47:04 grant-01 kernel: [<ffffffff814c8788>] out_of_line_wait_on_bit+0x78/0x90
[...]


lvm2-2.02.69-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
lvm2-libs-2.02.69-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
lvm2-cluster-2.02.69-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
device-mapper-1.02.51-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
device-mapper-libs-1.02.51-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
device-mapper-event-1.02.51-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
device-mapper-event-libs-1.02.51-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
cmirror-2.02.69-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010

Comment 5 Jonathan Earl Brassow 2010-07-28 20:33:07 UTC

Not sure what is causing this, but will try to fix by last snapshot (or before).

Comment 8 Corey Marthaler 2010-08-24 18:48:39 UTC

Even a simple mkfs can cause this.

Placing a lock_dlm gfs2 filesystem on split_io_verify volume
mkfs.gfs2 -j 3 -p lock_dlm -t GRANT:gfs2 /dev/mirror_sanity/split_io_verify -O

Aug 23 18:03:46 grant-02 qarshd[5779]: Running cmdline: mkfs.gfs2 -j 3 -p lock_dlm -t GRANT:gfs2 /dev/mirror_sanity/split_io_verify -O
Aug 23 18:14:55 grant-02 kernel: INFO: task mkfs.gfs2:5780 blocked for more than 120 seconds.
Aug 23 18:14:55 grant-02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 23 18:14:55 grant-02 kernel: mkfs.gfs2     D ffff88011fc24300     0  5780   5779 0x00000080
Aug 23 18:14:55 grant-02 kernel: ffff880218f75c08 0000000000000082 ffff880218f75bd0 ffff880218f75bcc
Aug 23 18:14:55 grant-02 kernel: 0000000000000000 ffff88011fc24300 ffff880028216980 0000000100a252cd
Aug 23 18:14:55 grant-02 kernel: ffff88021aa6c638 ffff880218f75fd8 0000000000010518 ffff88021aa6c638
Aug 23 18:14:55 grant-02 kernel: Call Trace:
Aug 23 18:14:55 grant-02 kernel: [<ffffffff8110cdc0>] ? sync_page+0x0/0x50
Aug 23 18:14:55 grant-02 kernel: [<ffffffff814c4253>] io_schedule+0x73/0xc0
Aug 23 18:14:55 grant-02 kernel: [<ffffffff8110cdfd>] sync_page+0x3d/0x50
Aug 23 18:14:55 grant-02 kernel: [<ffffffff814c4acf>] __wait_on_bit+0x5f/0x90
Aug 23 18:14:55 grant-02 kernel: [<ffffffff8119e9e0>] ? blkdev_get_block+0x0/0x70
Aug 23 18:14:55 grant-02 kernel: [<ffffffff8110cfb3>] wait_on_page_bit+0x73/0x80
Aug 23 18:14:55 grant-02 kernel: [<ffffffff81091a30>] ? wake_bit_function+0x0/0x50
Aug 23 18:14:55 grant-02 kernel: [<ffffffff8111df77>] ? __writepage+0x17/0x40
Aug 23 18:14:55 grant-02 kernel: [<ffffffff8111f108>] write_cache_pages+0x198/0x3f0
Aug 23 18:14:55 grant-02 kernel: [<ffffffff8111df60>] ? __writepage+0x0/0x40
Aug 23 18:14:55 grant-02 kernel: [<ffffffff8111f384>] generic_writepages+0x24/0x30
Aug 23 18:14:55 grant-02 kernel: [<ffffffff8111f3b1>] do_writepages+0x21/0x40
Aug 23 18:14:55 grant-02 kernel: [<ffffffff8110d49b>] __filemap_fdatawrite_range+0x5b/0x60
Aug 23 18:14:55 grant-02 kernel: [<ffffffff8110d4fa>] filemap_write_and_wait_range+0x5a/0x90
Aug 23 18:14:55 grant-02 kernel: [<ffffffff811966be>] vfs_fsync_range+0x7e/0xe0
Aug 23 18:14:55 grant-02 kernel: [<ffffffff8119678d>] vfs_fsync+0x1d/0x20
Aug 23 18:14:55 grant-02 kernel: [<ffffffff811967ce>] do_fsync+0x3e/0x60
Aug 23 18:14:55 grant-02 kernel: [<ffffffff81196820>] sys_fsync+0x10/0x20
Aug 23 18:14:55 grant-02 kernel: [<ffffffff81013172>] system_call_fastpath+0x16/0x1b
Aug 23 18:14:55 grant-02 kernel: INFO: task flush-253:7:5781 blocked for more than 120 seconds.

Comment 9 Corey Marthaler 2010-08-24 21:11:46 UTC

There should be a 6.0 release note for this issue.

Comment 11 Corey Marthaler 2010-08-30 15:32:49 UTC

Just a note that this issue had been seen before with helter_skelter, a test that does not use --nosync mirrors. This bug is not --nosync specific.

Comment 13 Jonathan Earl Brassow 2010-08-30 20:12:28 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
An issue has been found, when creating cluster mirrors with the '--nosync' option, that can cause I/O to become extremely slow.  This seems to affect I/O only immediately following the creation of the mirror and only when '--nosync' is used.  A 'lvchange --refresh <VG>/<LV>' after creating the mirror will circumvent the issue.

Comment 14 Jonathan Earl Brassow 2010-08-30 20:14:10 UTC

Created attachment 442025 [details]
Patch to fix problem

Problem + solution description in patch header.

Comment 17 Ryan Lerch 2010-09-24 03:26:44 UTC

    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1,3 @@
-An issue has been found, when creating cluster mirrors with the '--nosync' option, that can cause I/O to become extremely slow.  This seems to affect I/O only immediately following the creation of the mirror and only when '--nosync' is used.  A 'lvchange --refresh <VG>/<LV>' after creating the mirror will circumvent the issue.+Under some circumstances, creating cluster mirrors with the '--nosync' option may cause I/O to become extremely slow. Note that this issue only effects I/O immediately after the creation of the mirror, and only when '--nosync' is used. To work around this issue, run the following command after the creating the mirror.
+
+lvchange --refresh <VG>/<LV>

Comment 20 Corey Marthaler 2011-03-16 18:39:39 UTC

This issue no longer exists in the latest 6.1 cmirror testing. Marking verifed.

2.6.32-94.el6.x86_64

lvm2-2.02.83-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
lvm2-libs-2.02.83-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
lvm2-cluster-2.02.83-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
udev-147-2.31.el6    BUILT: Wed Jan 26 05:39:15 CST 2011
device-mapper-1.02.62-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
device-mapper-libs-1.02.62-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
device-mapper-event-1.02.62-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
device-mapper-event-libs-1.02.62-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
cmirror-2.02.83-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011

Comment 21 errata-xmlrpc 2011-05-19 14:25:54 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0772.html

Note You need to log in before you can comment on or make changes to this bug.