Description of problem:
It appears that 461289 may have never gone away. I've reproduced this corruption every time I've tried this scenario.

SCENARIO - [fs_io_A] Create snapshots of origin with fs data, and then verify that data on snapshots

Making origin volume
Placing an ext filesystem on origin volume
mke2fs 1.39 (29-May-2006)
Mounting origin volume
Writing files to /mnt/origin
checkit starting with:
CREATE
Num files:     500
Random Seed:   13469
Verify XIOR Stream: /tmp/checkit_origin_1
Working dir:   /mnt/origin
Checking files on /mnt/origin
checkit starting with:
VERIFY
Verify XIOR Stream: /tmp/checkit_origin_1
Working dir:   /mnt/origin
Making 1st snapshot of origin volume
Mounting 1st snap volume
Checking files on /mnt/fs_snap1
checkit starting with:
VERIFY
Verify XIOR Stream: /tmp/checkit_origin_1
Working dir:   /mnt/fs_snap1
Writing files to /mnt/origin
checkit starting with:
CREATE
Num files:     500
Random Seed:   13539
Verify XIOR Stream: /tmp/checkit_origin_2
Working dir:   /mnt/origin
Making 2nd snapshot of origin volume
Mounting 2nd snap volume
Writing files to /mnt/origin
checkit starting with:
CREATE
Num files:     500
Random Seed:   13605
Verify XIOR Stream: /tmp/checkit_origin_3
Working dir:   /mnt/origin
Making 3rd snapshot of origin volume
Mounting 3rd snap volume
Checking files on /mnt/fs_snap1
checkit starting with:
VERIFY
Verify XIOR Stream: /tmp/checkit_origin_1
Working dir:   /mnt/fs_snap1
Checking files on /mnt/fs_snap2
checkit starting with:
VERIFY
Verify XIOR Stream: /tmp/checkit_origin_1
Working dir:   /mnt/fs_snap2
Checking files on /mnt/fs_snap2
checkit starting with:
VERIFY
Verify XIOR Stream: /tmp/checkit_origin_2
Working dir:   /mnt/fs_snap2

*** DATA COMPARISON ERROR [file:xywjvucpowuwokipvvwengsbxvabicwhcuqarupjslgjx] ***
Corrupt regions follow - unprintable chars are represented as '.'
-----------------------------------------------------------------
corrupt bytes starting at file offset 188416
1st 32 expected bytes:  66666666666666666666666666666666
1st 32 actual bytes:    !..."...#...$...%...&...'...(...

Version-Release number of selected component (if applicable):
2.6.18-256.el5
lvm2-2.02.84-3.el5                  BUILT: Wed Apr 27 03:42:24 CDT 2011
lvm2-cluster-2.02.84-3.el5          BUILT: Wed Apr 27 03:42:43 CDT 2011
device-mapper-1.02.63-2.el5         BUILT: Fri Mar  4 10:23:17 CST 2011
device-mapper-event-1.02.63-2.el5   BUILT: Fri Mar  4 10:23:17 CST 2011
cmirror-1.1.39-10.el5               BUILT: Wed Sep  8 16:32:05 CDT 2010
kmod-cmirror-0.1.22-3.el5           BUILT: Tue Dec 22 13:39:47 CST 2009

How reproducible:
Every time
Is it an ext2 or ext3 filesystem?
Please send me the programs you used to reproduce this bug, with a description of how to run them on a plain RHEL 5 system.
Here's what the test case does:

[root@taft-01 ~]# pvscan
  PV /dev/sda2   VG VolGroup00   lvm2 [68.12 GB / 0    free]
  PV /dev/sdb1                   lvm2 [135.67 GB]
  PV /dev/sdc1                   lvm2 [135.67 GB]
  PV /dev/sdd1                   lvm2 [135.67 GB]
  PV /dev/sde1                   lvm2 [135.67 GB]
  PV /dev/sdf1                   lvm2 [135.67 GB]
  PV /dev/sdg1                   lvm2 [135.67 GB]
  PV /dev/sdh1                   lvm2 [135.67 GB]
  Total: 8 [1017.78 GB] / in use: 1 [68.12 GB] / in no VG: 7 [949.66 GB]

[root@taft-01 ~]# vgcreate taft /dev/sd[bcdefgh]1
  Volume group "taft" successfully created

[root@taft-01 ~]# lvcreate -n origin -L 4G taft
  Logical volume "origin" created

[root@taft-01 ~]# mkfs /dev/taft/origin
mke2fs 1.39 (29-May-2006)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
524288 inodes, 1048576 blocks
52428 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=1073741824
32 block groups
32768 blocks per group, 32768 fragments per group
16384 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736

Writing inode tables: done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 30 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
[root@taft-01 ~]# mkdir -p /mnt/origin
[root@taft-01 ~]# mount /dev/taft/origin /mnt/origin

# FIRST WRITE TO ORIGIN
[root@taft-01 ~]# /usr/tests/sts-rhel5.6/bin/checkit -w /mnt/origin -f /tmp/checkit_1 -n 500
checkit starting with:
CREATE
Num files:     500
Random Seed:   27109
Verify XIOR Stream: /tmp/checkit_1
Working dir:   /mnt/origin

# VERIFY FIRST WRITE FROM ORIGIN
[root@taft-01 ~]# /usr/tests/sts-rhel5.6/bin/checkit -w /mnt/origin -f /tmp/checkit_1 -n 500 -v
checkit starting with:
VERIFY
Verify XIOR Stream: /tmp/checkit_1
Working dir:   /mnt/origin

[root@taft-01 ~]# lvcreate -s /dev/taft/origin -c 128 -n snap1 -L 1G
  Logical volume "snap1" created
[root@taft-01 ~]# mkdir -p /mnt/snap1
[root@taft-01 ~]# mount /dev/taft/snap1 /mnt/snap1

# VERIFY FIRST WRITE FROM SNAP 1
[root@taft-01 ~]# /usr/tests/sts-rhel5.6/bin/checkit -w /mnt/snap1 -f /tmp/checkit_1 -n 500 -v
checkit starting with:
VERIFY
Verify XIOR Stream: /tmp/checkit_1
Working dir:   /mnt/snap1

# SECOND WRITE TO ORIGIN
[root@taft-01 ~]# /usr/tests/sts-rhel5.6/bin/checkit -w /mnt/origin/ -f /tmp/checkit_2 -n 500
checkit starting with:
CREATE
Num files:     500
Random Seed:   27183
Verify XIOR Stream: /tmp/checkit_2
Working dir:   /mnt/origin/

# VERIFY SECOND WRITE FROM ORIGIN
[root@taft-01 ~]# /usr/tests/sts-rhel5.6/bin/checkit -w /mnt/origin/ -f /tmp/checkit_2 -n 500 -v
checkit starting with:
VERIFY
Verify XIOR Stream: /tmp/checkit_2
Working dir:   /mnt/origin/

[root@taft-01 ~]# lvcreate -s /dev/taft/origin -c 128 -n snap2 -L 1G
  Logical volume "snap2" created
[root@taft-01 ~]# mkdir -p /mnt/snap2
[root@taft-01 ~]# mount /dev/taft/snap2 /mnt/snap2

# VERIFY FIRST WRITE FROM SNAP 1
[root@taft-01 ~]# /usr/tests/sts-rhel5.6/bin/checkit -w /mnt/snap1 -f /tmp/checkit_1 -n 500 -v
checkit starting with:
VERIFY
Verify XIOR Stream: /tmp/checkit_1
Working dir:   /mnt/snap1

# VERIFY FIRST WRITE FROM SNAP 2
[root@taft-01 ~]# /usr/tests/sts-rhel5.6/bin/checkit -w /mnt/snap2 -f /tmp/checkit_1 -n 500 -v
checkit starting with:
VERIFY
Verify XIOR Stream: /tmp/checkit_1
Working dir:   /mnt/snap2

# VERIFY SECOND WRITE FROM SNAP 2
[root@taft-01 ~]# /usr/tests/sts-rhel5.6/bin/checkit -w /mnt/snap2 -f /tmp/checkit_2 -n 500 -v
checkit starting with:
VERIFY
Verify XIOR Stream: /tmp/checkit_2
Working dir:   /mnt/snap2

*** DATA COMPARISON ERROR [file:haeifcdyljqfynateuofomxylheimhgthkeophpbqiiqwkmbf] ***
Corrupt regions follow - unprintable chars are represented as '.'
-----------------------------------------------------------------
corrupt bytes starting at file offset 475136
1st 32 expected bytes:  SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
1st 32 actual bytes:    33333333333333333333333333333333
You can get the above checkit test from the latest sts-rhel5.6 rpm:

[root@taft-01 ~]# cat /etc/yum.repos.d/qe-rhel5.repo
[qe-rhel5]
name=qe-rhel5
baseurl=http://sts.lab.msp.redhat.com/dist/brewroot/repos/qe-rhel5/$basearch
enabled=1
gpgcheck=0
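For environments without access to that rpm, the checkit pattern can be approximated with a short standalone script: fill files from a seeded PRNG during the CREATE pass, then regenerate the identical byte stream during VERIFY and diff it against what the filesystem returns. The file names, counts, and sizes below are invented for illustration and are not checkit's actual on-disk format:

```python
import os
import random

def write_files(workdir, seed, num_files=5, size=4096):
    """CREATE pass: write files of deterministic pseudo-random data."""
    rng = random.Random(seed)
    os.makedirs(workdir, exist_ok=True)
    for i in range(num_files):
        data = bytes(rng.randrange(256) for _ in range(size))
        with open(os.path.join(workdir, "file%d" % i), "wb") as f:
            f.write(data)

def verify_files(workdir, seed, num_files=5, size=4096):
    """VERIFY pass: regenerate the same stream and report corrupt offsets."""
    rng = random.Random(seed)
    errors = []
    for i in range(num_files):
        expected = bytes(rng.randrange(256) for _ in range(size))
        with open(os.path.join(workdir, "file%d" % i), "rb") as f:
            actual = f.read()
        if actual != expected:
            # First differing offset, analogous to checkit's
            # "corrupt bytes starting at file offset N" report.
            offset = next((j for j in range(min(len(actual), len(expected)))
                           if actual[j] != expected[j]),
                          min(len(actual), len(expected)))
            errors.append(("file%d" % i, offset))
    return errors
```

A clean verify returns an empty list; corruption of the kind shown above would surface as (filename, offset) pairs.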
I just noticed this output in /var/log/messages after the creation of the second snapshot volume.

May  5 14:24:56 taft-01 kernel: Incorrect number of segments after building list
May  5 14:24:56 taft-01 kernel: counted 10, received 9
May  5 14:24:56 taft-01 kernel: req nr_sec 768, cur_nr_sec 8
May  5 14:24:56 taft-01 kernel: Incorrect number of segments after building list
May  5 14:24:56 taft-01 kernel: counted 9, received 8
May  5 14:24:56 taft-01 kernel: req nr_sec 768, cur_nr_sec 8
May  5 14:24:56 taft-01 kernel: Incorrect number of segments after building list
May  5 14:24:56 taft-01 kernel: counted 10, received 9
May  5 14:24:56 taft-01 kernel: req nr_sec 768, cur_nr_sec 8
This is a kernel error from the SCSI layer. If it is a regression, can you check whether an older kernel works?
I tried it here and it works as expected. What are the underlying devices? Can you provide lvmdump?
[root@taft-01 ~]# cat /proc/scsi/scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 06 Lun: 00
  Vendor: PE/PV    Model: 1x6 SCSI BP      Rev: 1.0
  Type:   Processor                        ANSI SCSI revision: 02
Host: scsi0 Channel: 02 Id: 00 Lun: 00
  Vendor: MegaRAID Model: LD 0 RAID0  69G  Rev: 521S
  Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi1 Channel: 00 Id: 00 Lun: 00
  Vendor: COMPAQ   Model: MSA1000          Rev: 2.38
  Type:   RAID                             ANSI SCSI revision: 04
Host: scsi1 Channel: 00 Id: 00 Lun: 01
  Vendor: COMPAQ   Model: MSA1000 VOLUME   Rev: 2.38
  Type:   Direct-Access                    ANSI SCSI revision: 04
Host: scsi1 Channel: 00 Id: 00 Lun: 02
  Vendor: COMPAQ   Model: MSA1000 VOLUME   Rev: 2.38
  Type:   Direct-Access                    ANSI SCSI revision: 04
Host: scsi1 Channel: 00 Id: 00 Lun: 03
  Vendor: COMPAQ   Model: MSA1000 VOLUME   Rev: 2.38
  Type:   Direct-Access                    ANSI SCSI revision: 04
Host: scsi1 Channel: 00 Id: 00 Lun: 04
  Vendor: COMPAQ   Model: MSA1000 VOLUME   Rev: 2.38
  Type:   Direct-Access                    ANSI SCSI revision: 04
Host: scsi1 Channel: 00 Id: 00 Lun: 05
  Vendor: COMPAQ   Model: MSA1000 VOLUME   Rev: 2.38
  Type:   Direct-Access                    ANSI SCSI revision: 04
Host: scsi1 Channel: 00 Id: 00 Lun: 06
  Vendor: COMPAQ   Model: MSA1000 VOLUME   Rev: 2.38
  Type:   Direct-Access                    ANSI SCSI revision: 04
Host: scsi1 Channel: 00 Id: 00 Lun: 07
  Vendor: COMPAQ   Model: MSA1000 VOLUME   Rev: 2.38
  Type:   Direct-Access                    ANSI SCSI revision: 04
This was reproduced on Winchester devices as well.

[root@grant-01 ~]# cat /proc/scsi/scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: WDC WD800JD-75MS Rev: 10.0
  Type:   Direct-Access                    ANSI SCSI revision: 05
Host: scsi4 Channel: 00 Id: 00 Lun: 00
  Vendor: WINSYS   Model: FC3454           Rev: 342S
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi4 Channel: 00 Id: 00 Lun: 01
  Vendor: WINSYS   Model: FC3454           Rev: 342S
  Type:   Direct-Access                    ANSI SCSI revision: 03
These are the drivers on these machines:

May  6 10:44:50 taft-02 kernel: qla2xxx 0000:0b:02.0:
May  6 10:44:50 taft-02 kernel: QLogic Fibre Channel HBA Driver: 8.03.07.00.05.07
May  6 10:44:50 taft-02 kernel:   QLogic QLA2340 - 133MHz PCI-X to 2Gb FC, Single Channel
May  6 10:44:50 taft-02 kernel:   ISP2312: PCI-X (133 MHz) @ 0000:0b:02.0 hdma+, host#=1, fw=3.03.26 IPX

May  6 10:26:30 grant-01 kernel: qla2xxx 0000:06:00.1:
May  6 10:26:30 grant-01 kernel: QLogic Fibre Channel HBA Driver: 8.03.07.00.05.07
May  6 10:26:30 grant-01 kernel:   QLogic QLE2462 - PCI-Express Dual Channel 4Gb Fibre Channel HBA
May  6 10:26:30 grant-01 kernel:   ISP2432: PCIe (2.5Gb/s x4) @ 0000:06:00.1 hdma+, host#=5, fw=5.03.16 (496)
Created attachment 497412 [details]
Full log from grant-01 starting up and then the bug occurring
Reassigning to the maintainers of that QLogic SCSI driver. The messages "Incorrect number of segments after building list", "Buffer I/O error on device dm-2", and "lost page write due to I/O error on dm-2" indicate that this is a SCSI driver error and has nothing to do with lvm or snapshots.
Correct, I have a machine with both SATA disks and a QLogic HBA. It works with the SATA disks, but fails when running over a VG on the QLogic:

VERIFY SECOND WRITE FROM SNAP 2
checkit starting with:
VERIFY
Verify XIOR Stream: /tmp/checkit_2
Working dir:   /mnt/snap2

*** DATA COMPARISON ERROR [file:lmerougbglbjaqostjoovpnpajiuwjuiyydpvjrcwulxnbqh] ***
Corrupt regions follow - unprintable chars are represented as '.'
-----------------------------------------------------------------
corrupt bytes starting at file offset 225280
1st 32 expected bytes:  BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
1st 32 actual bytes:    YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY

kernel: Incorrect number of segments after building list
kernel: counted 16, received 15
kernel: req nr_sec 1024, cur_nr_sec 8

Using:
03:00.0 Fibre Channel: QLogic Corp. SP232-based 4Gb Fibre Channel to PCI Express HBA (rev 02)
Chad posted the following updates for RHEL 5.7:

bug #660386 - [Qlogic 5.7 feat] qla2xxx: update driver to 8.03.07.00.05.07.
  - in kernel-2.6.18-248.el5
bug #686462 - [QLogic 5.7 FEAT] qla2xxx: Update driver to 8.03.07.03.05.07-k.
  - seems these changes never went in?
bug #537277 - KERNEL: QLA2XXX 0000:0E:00.0: RISC PAUSED -- HCCR=0, DUMPING FIRMWARE!
  - in kernel-2.6.18-248.el5
bug #682305 - [QLogic 5.7 feat] qla2xxx: Update firmware for 4G and 8G HBAs to 5.03.16.
  - in kernel-2.6.18-254.el5

NOTE: all the DM updates for 5.7 went in kernel-2.6.18-249.el5.
So testing a kernel < 2.6.18-248.el5 (to avoid the qlogic update) would avoid testing the new DM changes.
(In reply to comment #16)
> NOTE: all the DM updates for 5.7 went in kernel-2.6.18-249.el5
> So testing a kernel < 2.6.18-248.el5 (to avoid the qlogic update) would avoid
> testing the new DM changes.

It would be great to run this snapshot test against 2.6.18-256.el5 but with the qla2xxx driver and firmware from < 2.6.18-248.el5. That means manually unloading the qla2xxx kernel module and loading the older qla2xxx with insmod.
I am able to reproduce it with 2.6.18-250.el5 but not with 2.6.18-248.el5 (249 is not in Brew). I'll try to use the old qla2xxx with a recent kernel now...
Hm, I loaded qla2xxx from 2.6.18-247 on the -259 kernel and it still fails. Strange...
(In reply to comment #19)
> Hm, I loaded qla2xxx from 2.6.18-247 on 259 kernel ant it still fails.
> Strange..

Do we see this failure on RHEL 5.6 GA?
> Do we see this failure on RHEL 5.6 GA?

No, I tried an updated 5.6 (kernel-2.6.18-238.5.1.el5.x86_64) and cannot reproduce it, so 5.6 should be safe. I will probably run a bisect; maybe it is the block layer changes (there are some between -248 and -250...).
Cc'ing Jeff Moyer, since his 5.7 block changes from bug #638988 were included in -250.el5. Could they be related to the messages highlighted in comment #6? The full log is attached in comment #12.
Yeah, the messages in comment #6 look exactly like bug 638988, so I suspect the changes in the block layer fix this problem.
(In reply to comment #23)
> Yeah, the messages in comment #6 look exactly like bug 638988, so I suspect the
> changes in the block layer fix this problem.

That is odd, because those messages were from a kernel that had those block changes (2.6.18-256.el5). The full /var/log/messages is attached in comment #12.
Sorry, I managed to miss that. I can try to reproduce this on one of my boxes and collect a crashdump for analysis. I'd also like to know whether we can reproduce this problem on other vendors' storage adapters.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Interesting, this is what I got after bisect...

15f06bd61ba8ef12ca68f27c9c7d89dcc680b3b5 is first bad commit
commit 15f06bd61ba8ef12ca68f27c9c7d89dcc680b3b5
Author: Jeff Moyer <jmoyer>
Date:   Fri Feb 4 23:19:21 2011 -0500

    [block] reduce stack footprint of blk_recount_segments()
(In reply to comment #28)
> Interesting, this is what I got after bisect...
>
> 15f06bd61ba8ef12ca68f27c9c7d89dcc680b3b5 is first bad commit
> commit 15f06bd61ba8ef12ca68f27c9c7d89dcc680b3b5
> Author: Jeff Moyer <jmoyer>
> Date: Fri Feb 4 23:19:21 2011 -0500
>
> [block] reduce stack footprint of blk_recount_segments()

Does the test pass if that one commit is reverted? (Being patch 7/7, I'd imagine it reverts cleanly.)
This commit is missing in the tree:

commit 59247eaea50cc68cc6ce3d3fd3855f3301b65c96
Author: Jens Axboe
Date:   Fri Mar 6 08:55:24 2009 +0100

    block: fix missing bio back/front segment size setting in blk_recount_segments()

    Commit 1e42807918d17e8c93bf14fbb74be84b141334c1 introduced a bug where we
    don't get front/back segment sizes in the bio in blk_recount_segments().
    Fix this by tracking the back bio as well as the front bio in
    __blk_recalc_rq_segments(), this also cleans up the interface by getting
    rid of the segment size pointer passing.
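The class of failure that commit addresses can be illustrated with a simplified model of physical-segment counting. This is a schematic sketch, not the kernel's code: the structure and merge rule are invented. The point is that the segment count depends on correctly tracking the size of the segment being built; if that running size is stale (as when front/back segment sizes are not recorded during a recount), the block layer and the driver arrive at different counts, which is exactly the "counted 10, received 9" disagreement in the logs:

```c
#include <assert.h>
#include <stddef.h>

/* One data buffer in a request, by physical address and length. */
struct buf { unsigned long addr; unsigned len; };

/* Count physical segments: adjacent, physically contiguous buffers are
 * merged into one segment as long as the merged size stays within
 * max_seg.  The running size `cur` is the state that must be tracked
 * accurately across the walk; working from a stale value here is how a
 * recount can disagree with the count the driver receives. */
static int count_segments(const struct buf *v, size_t n, unsigned max_seg)
{
    if (n == 0)
        return 0;
    int segs = 1;
    unsigned cur = v[0].len;              /* size of the segment being built */
    for (size_t i = 1; i < n; i++) {
        int contiguous = v[i - 1].addr + v[i - 1].len == v[i].addr;
        if (contiguous && cur + v[i].len <= max_seg) {
            cur += v[i].len;              /* merge into the current segment */
        } else {
            segs++;                       /* start a new segment */
            cur = v[i].len;
        }
    }
    return segs;
}
```

With the same buffer list, a different size cap (or a wrong `cur`) changes the count, so any two parties computing it must agree on that state.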
Created attachment 497498 [details] Backported patch This patch seems to fix the issue. (Not much tested though.)
Patch(es) available in kernel-2.6.18-261.el5 You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5 Detailed testing feedback is always welcomed.
Fix verified in 2.6.18-261.el5. The test case no longer fails.

2.6.18-261.el5
lvm2-2.02.84-3.el5                  BUILT: Wed Apr 27 03:42:24 CDT 2011
lvm2-cluster-2.02.84-3.el5          BUILT: Wed Apr 27 03:42:43 CDT 2011
device-mapper-1.02.63-2.el5         BUILT: Fri Mar  4 10:23:17 CST 2011
device-mapper-event-1.02.63-2.el5   BUILT: Fri Mar  4 10:23:17 CST 2011
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html