Bug 165717
| Field | Value |
|---|---|
| Summary | ext on top of mirror attempts to access beyond end of device: dm-5: rw=0, want=16304032720, limit=20971520 |
| Product | Red Hat Enterprise Linux 4 |
| Component | kernel |
| Version | 4.0 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | medium |
| Reporter | Corey Marthaler <cmarthal> |
| Assignee | Jonathan Earl Brassow <jbrassow> |
| QA Contact | Brian Brock <bbrock> |
| CC | agk |
| Fixed In Version | RHSA-2005-514 |
| Doc Type | Bug Fix |
| Bug Blocks | 156322 |
| Last Closed | 2005-10-05 13:50:32 UTC |
Description (Corey Marthaler, 2005-08-11 14:59:58 UTC)
```
[root@link-08 ~]# dmsetup table
GFS-mirror_mlog: 0 8192 linear 8:65 384
GFS-mirror: 0 20971520 mirror disk 2 253:2 1024 2 253:3 0 253:4 0
GFS-mirror_mimage_1: 0 20971520 linear 8:17 384
GFS-mirror_mimage_0: 0 20971520 linear 8:1 384
```

There are no snapshots on this system or any md raid. The I/O is multiple processes writing to the fs at once.

But you were doing pvmoves as well? If so, is the bug reproducible on a mirror device without doing pvmoves too?

I was attempting pvmoves; however, no data ever got written out as far as I could tell. All I ever saw was:

```
No data to move for GFS
Skipping mirror LV mirror
Skipping mirror log LV mirror_mlog
Skipping mirror image LV mirror_mimage_0
Skipping mirror image LV mirror_mimage_1
```

OK, then don't use pvmove for now... one less thing to worry about when reproducing this bug. And if you get it again, unmount the filesystem and run fsck on it, capturing the output. Can you try a test similar to the one described on bug 160275 (lots of files; find & md5sum) and see if you can reproduce it more quickly with a different test like that?

After 24 hours of not seeing these issues (and ending up with a full fs) I killed all I/O, unmounted, deleted the lv and vg, and then proceeded to rebuild everything. I recreated the vg with only 3 devs this time (instead of 5) and made the lv/mirror/fs 75G instead of 10. This is on link-08 (x86-64) with one CPU, running a 2.6.9-15.EL kernel compiled with patches:

```
patch -p1 < /pub/164630.patch
patch -p1 < /pub/165242.patch
```

(as posted on rhkernel) The I/O load again consisted of many fsstress/genesis/accordion processes along with pvmoves of those three devices.
```
[root@link-08 redhat]# dmsetup table
link08-mirror_mimage_1: 0 157286400 linear 8:17 384
link08-mirror_mimage_0: 0 157286400 linear 8:1 384
GFS-mirror_mlog: 0 8192 linear 8:65 384
link08-mirror_mlog: 0 8192 linear 8:33 384
GFS-mirror_mimage_1: 0 20971520 linear 8:17 384
GFS-mirror_mimage_0: 0 20971520 linear 8:1 384
VolGroup00-LogVol01: 0 4063232 linear 3:6 110756224
VolGroup00-LogVol00: 0 110755840 linear 3:6 384
link08-mirror: 0 157286400 mirror disk 2 253:5 1024 2 253:6 0 253:7 0

[root@link-08 ~]# lvs --all
  LV                VG         Attr   LSize  Origin Snap% Move Log         Copy%
  LogVol00          VolGroup00 -wi-ao 52.81G
  LogVol01          VolGroup00 -wi-ao  1.94G
  mirror            link08     mwi-ao 75.00G                   mirror_mlog  6.05
  [mirror_mimage_0] link08     iwi-ao 75.00G
  [mirror_mimage_1] link08     iwi-ao 75.00G
  [mirror_mlog]     link08     lwi-ao  4.00M

[root@link-08 ~]# pvscan
  PV /dev/hda6   VG VolGroup00   lvm2 [54.81 GB / 64.00 MB free]
  PV /dev/sda1   VG link08       lvm2 [135.66 GB / 60.66 GB free]
  PV /dev/sdb1   VG link08       lvm2 [135.66 GB / 60.66 GB free]
  PV /dev/sdc1   VG link08       lvm2 [135.66 GB / 135.65 GB free]
  PV /dev/sdd1                   lvm2 [135.66 GB]
  PV /dev/sde1                   lvm2 [135.66 GB]

Aug 12 04:07:21 link-08 kernel: EXT2-fs error (device dm-8): ext2_new_block: Allocating block in system zone - block = 4030465
Aug 12 04:07:21 link-08 kernel: EXT2-fs error (device dm-8): ext2_new_block: Allocating block in system zone - block = 4030470
Aug 12 04:07:21 link-08 kernel: EXT2-fs error (device dm-8): ext2_free_blocks: Freeing blocks in system zones - Block = 4030471, count = 1
Aug 12 04:07:21 link-08 kernel: EXT2-fs error (device dm-8): ext2_new_block: Allocating block in system zone - block = 4030471
Aug 12 04:07:21 link-08 kernel: EXT2-fs error (device dm-8): ext2_new_block: Allocating block in system zone - block = 4030473
Aug 12 04:07:21 link-08 kernel: EXT2-fs error (device dm-8): ext2_free_blocks: Freeing blocks in system zones - Block = 4030474, count = 1
Aug 12 04:07:21 link-08 kernel: EXT2-fs error (device dm-8): ext2_new_block: Allocating block in system zone - block = 4030478
Aug 12 04:07:21 link-08 kernel: EXT2-fs error (device dm-8): ext2_new_block: Allocating block in system zone - block = 4030481
Aug 12 04:07:21 link-08 kernel: EXT2-fs error (device dm-8): ext2_new_block: Allocating block in system zone - block = 4030486
Aug 12 04:07:21 link-08 kernel: EXT2-fs error (device dm-8): ext2_free_blocks: Freeing blocks in system zones - Block = 4030487, count = 1
[...]
```

One new piece of information here: "Copy% 6.05", i.e. a mirror sync was happening. Perhaps there's a race in this code while a sync is in progress?

Reproduced this again on link-08 about 5 minutes after remaking the fs, mounting it, and starting I/O. Again there was a mirror sync in progress, so it appears that may be needed to cause this.

```
[root@link-08 ~]# dmsetup status
link08-mirror_mimage_1: 0 157286400 linear
link08-mirror_mimage_0: 0 157286400 linear
link08-mirror_mlog: 0 8192 linear
VolGroup00-LogVol01: 0 4063232 linear
VolGroup00-LogVol00: 0 110755840 linear
link08-mirror: 0 157286400 mirror 2 253:3 253:4 96045/153600 1 AA 3 disk 253:2 A

Aug 12 07:08:27 link-08 kernel: attempt to access beyond end of device
Aug 12 07:08:27 link-08 kernel: dm-5: rw=0, want=13613869888, limit=157286400
Aug 12 07:08:27 link-08 kernel: EXT2-fs error (device dm-5): ext2_free_blocks: Freeing blocks not in datazone - block = 1701733735, count = 1
Aug 12 07:08:27 link-08 kernel: EXT2-fs error (device dm-5): ext2_free_blocks: Freeing blocks not in datazone - block = 712206707, count = 1
Aug 12 07:08:27 link-08 kernel: EXT2-fs error (device dm-5): ext2_free_blocks: Freeing blocks not in datazone - block = 808532547, count = 1
Aug 12 07:08:27 link-08 kernel: EXT2-fs error (device dm-5): ext2_free_blocks: Freeing blocks not in datazone - block = 976630069, count = 1
[...]
```

OK, so it's starting to be reproducible. Good. After you get the failure like that:
Stop everything running, unmount the filesystem, reboot, then run an fsck on the filesystem and see whether it reports any errors or not. Also, can you go back to a kernel *without* the U2 mirroring patch but *with* all the other bug fixes (rh_inc/rh_dec etc.) and see if you can reproduce it?

Hit this on x86 now as well (link-12, UP), running a 2.6.9-15.EL kernel compiled with patches:

```
patch -p1 < /pub/164630.patch
patch -p1 < /pub/165242.patch
```

(as posted on rhkernel) Saw the bug within seconds of remaking and starting I/O to the fs.

```
[root@link-12 ~]# dmsetup status
gfs-link11: 0 57344 mirror 2 253:5 253:6 56/56 1 AA 3 disk 253:4 A
gfs-link11_mlog: 0 8192 linear
gfs-link11_mimage_1: 0 57344 linear
gfs-link11_mimage_0: 0 57344 linear
coreyvg-mirror_mimage_1: 0 42680320 linear
coreyvg-mirror_mimage_1: 42680320 42680320 linear
coreyvg-mirror_mimage_1: 85360640 42680320 linear
coreyvg-mirror_mimage_1: 128040960 29245440 linear
coreyvg-mirror: 0 157286400 mirror 2 253:1 253:2 3397/153600 1 AA 3 disk 253:0 A
coreyvg-mirror_mimage_0: 0 42680320 linear
coreyvg-mirror_mimage_0: 42680320 42680320 linear
coreyvg-mirror_mimage_0: 85360640 42680320 linear
coreyvg-mirror_mimage_0: 128040960 29245440 linear
coreyvg-mirror_mlog: 0 8192 linear

Aug 12 10:13:39 link-12 kernel: EXT2-fs error (device dm-3): ext2_new_block: Allocating block in system zone - block = 14778370
Aug 12 10:13:39 link-12 kernel: EXT2-fs error (device dm-3): ext2_new_block: Allocating block in system zone - block = 14778371
Aug 12 10:13:39 link-12 kernel: EXT2-fs error (device dm-3): ext2_new_block: Allocating block in system zone - block = 14778372
Aug 12 10:13:39 link-12 kernel: EXT2-fs error (device dm-3): ext2_new_block: Allocating block in system zone - block = 14778375
Aug 12 10:13:39 link-12 kernel: EXT2-fs error (device dm-3): ext2_new_block: Allocating block in system zone - block = 14778378
Aug 12 10:13:39 link-12 kernel: EXT2-fs error (device dm-3): ext2_new_block: Allocating block in system zone - block = 14778382
Aug 12 10:13:39 link-12 kernel: EXT2-fs error (device dm-3): ext2_new_block: Allocating block in system zone - block = 14778385
Aug 12 10:13:39 link-12 kernel: EXT2-fs error (device dm-3): ext2_new_block: Allocating block in system zone - block = 14778390
[...]
```

I'll reboot now and see what happens with fsck...

Created attachment 117767 [details]
Fix for 165717 - turn off read balancing
I've disabled the feature that contains the bug. However, I'm still not sure
_why_ the bug occurred.
Note that read balancing has never been in mirroring before, so this is not a
regression.
Well, I understand now... do_reads(), the only place that calls rh_in_sync(), was checking to see if what was returned was RH_CLEAN. But rh_in_sync() expects to be called as a predicate, returning 1 if in sync, and RH_CLEAN is 0. We were getting exactly the opposite result of what we were expecting in do_reads()!

This bug was introduced by me (as can be seen in the mirror patch). It was part of a completely different scheme for handling cluster mirrors that agk and I rejected; this offending remnant of that code made it into the final patch.

Moving back to RHEL4U2Proposed, as the problem is a regression from pvmove capabilities in the RHEL4 U1 code.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-514.html