Bug 512387
| Field | Value |
| --- | --- |
| Summary | max_phys_segments violation with dm-linear + md raid1 + cciss |
| Product | Red Hat Enterprise Linux 5 |
| Component | kernel |
| Version | 5.4 |
| Hardware | i686 |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Keywords | Regression |
| Reporter | Charlie Brady <charlieb-fedora-bugzilla> |
| Assignee | Milan Broz <mbroz> |
| QA Contact | Red Hat Kernel QE team <kernel-qe> |
| CC | agk, charlieb-fedora-bugzilla, coughlan, dledford, dmair, dzickus, jzapleta, mbroz, mgahagan, mnovacek, paul.clements, peterm, pvrabec, syeghiay, tao, thenzl |
| Target Milestone | rc |
| Target Release | --- |
| Doc Type | Bug Fix |
| Clone Of | 471639 |
| Last Closed | 2009-09-02 08:29:48 UTC |
Description
Charlie Brady, 2009-07-17 16:24:31 UTC
2.6.30.1 kernel appears to be OK.

I am not able to reproduce it (on the same hw where I run the test mentioned above). IIRC there can be another path which can possibly violate max_phys_segments; I think this could be a different bug (it just triggers the same BUG_ON()). Please can you provide more system info? lvmdump, if possible. Why are you using the PAE kernel - is there >4G memory? If so, please can you try booting with the non-PAE kernel and run the test again?

> Please can you provide more system info? lvmdump, if possible.

Sure, but next Thursday will be the earliest possibility. There are two LVs: one is swap, and the other is the root partition.

> Why are you using PAE kernel

We have a variety of systems, some of which have >4G, and we wish to use the same kernel on all. Is PAE deprecated with <4G? IIRC, this system does have 8G.

> - is there >4G memory?

The crash was produced with mem=2G (to keep the dump size manageable). I'll try the non-PAE kernel as well.

> If so, please can you try boot with non-PAE kernel and run test again?
non-PAE kernel also crashes.
[root@localhost ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 cciss/c0d1p1[1] cciss/c0d0p1[0]
104320 blocks [2/2] [UU]
md2 : active raid1 cciss/c0d1p2[1] cciss/c0d0p2[0]
292824704 blocks [2/2] [UU]
[======>..............] resync = 33.3% (97756800/292824704) finish=35.2min speed=92325K/sec
unused devices: <none>
[root@localhost ~]# lvdisplay
--- Logical volume ---
LV Name /dev/main/root
VG Name main
LV UUID hMXbly-QE7d-cQKW-jK0X-11ww-HDK3-bkGora
LV Write Access read/write
LV Status available
# open 1
LV Size 273.84 GB
Current LE 8763
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:0
--- Logical volume ---
LV Name /dev/main/swap
VG Name main
LV UUID tl0q2j-SUsL-5XaS-5kR9-xrKA-qNWt-zcfXqr
LV Write Access read/write
LV Status available
# open 1
LV Size 5.41 GB
Current LE 173
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:1
[root@localhost ~]# fdisk -l /dev/sda
[root@localhost ~]# fdisk -l /dev/cciss/c0d0
Disk /dev/cciss/c0d0: 299.9 GB, 299966445568 bytes
255 heads, 63 sectors/track, 36468 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/cciss/c0d0p1 * 1 13 104391 fd Linux raid autodetect
/dev/cciss/c0d0p2 14 36468 292824787+ fd Linux raid autodetect
[root@localhost ~]# fdisk -l /dev/cciss/c0d1
Disk /dev/cciss/c0d1: 299.9 GB, 299966445568 bytes
255 heads, 63 sectors/track, 36468 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/cciss/c0d1p1 * 1 13 104391 fd Linux raid autodetect
/dev/cciss/c0d1p2 14 36468 292824787+ fd Linux raid autodetect
[root@localhost ~]# mount
/dev/mapper/main-root on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/md1 on /boot type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
[root@localhost ~]#
Created attachment 354764 [details]
lvdump output.
I've confirmed (not surprisingly) that there is no kernel crash with 2.6.30.2. I tried to boot 2.6.19.7 (built with oldconfig + defaults), but without success - the VG "main" wasn't found. Ditto with 2.6.18.8 and 2.6.25.20.

Milan, are you waiting for any more information from me? Is there anything further I can do to help?

Thanks for the info, I'll need to analyse that first. The crash log is slightly different (mainly bio_add_page() appears here).

BTW, Milan, Martin Peterson is aware of this issue. I saw him last week in Montreal and showed him the stacktrace.

Sorry, Martin Petersen.

I'm looking at the source of 2.6.18-128.1.6. I see that nr_phys_segments is only ever set within block/ll_rw_blk.c. Requests there should have a ceiling of q->max_phys_segments, so we are interested in tracking what sets/changes q->max_phys_segments.

bash-3.2$ grep -r max_phys_segments .
./Documentation/block/biodoc.txt: blk_queue_max_phys_segments(q, max_segments)
./Documentation/block/biodoc.txt:blk_queue_max_phys_segments() : Sets an upper limit on the maximum number
./block/ll_rw_blk.c: blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
./block/ll_rw_blk.c: * blk_queue_max_phys_segments - set max phys segments for a request for this queue
./block/ll_rw_blk.c:void blk_queue_max_phys_segments(request_queue_t *q, unsigned short max_segments)
./block/ll_rw_blk.c: q->max_phys_segments = max_segments;
./block/ll_rw_blk.c:EXPORT_SYMBOL(blk_queue_max_phys_segments);
./block/ll_rw_blk.c: t->max_phys_segments = min(t->max_phys_segments,b->max_phys_segments);
./block/ll_rw_blk.c: if (req->nr_phys_segments + nr_phys_segs > q->max_phys_segments) {
./block/ll_rw_blk.c: || req->nr_phys_segments + nr_phys_segs > q->max_phys_segments) {
./block/ll_rw_blk.c: if (total_phys_segments > q->max_phys_segments)
./block/ll_rw_blk.c: blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
./drivers/block/DAC960.c: blk_queue_max_phys_segments(RequestQueue, Controller->DriverScatterGatherLimit);
./drivers/block/cciss.c: blk_queue_max_phys_segments(disk->queue, MAXSGENTRIES);
./drivers/block/cciss.c: blk_queue_max_phys_segments(q, MAXSGENTRIES);
./drivers/block/cpqarray.c: blk_queue_max_phys_segments(q, SG_MAX);
./drivers/block/paride/pf.c: blk_queue_max_phys_segments(pf_queue, cluster);
./drivers/block/pktcdvd.c: * max_phys_segments value.
./drivers/block/pktcdvd.c: if ((pd->settings.size << 9) / CD_FRAMESIZE <= q->max_phys_segments) {
./drivers/block/pktcdvd.c: } else if ((pd->settings.size << 9) / PAGE_SIZE <= q->max_phys_segments) {
./drivers/block/pktcdvd.c: printk("pktcdvd: cdrom max_phys_segments too small\n");
./drivers/block/sx8.c: blk_queue_max_phys_segments(q, CARM_MAX_REQ_SG);
./drivers/block/ub.c: blk_queue_max_phys_segments(q, UB_MAX_REQ_SG);
./drivers/block/viodasd.c: blk_queue_max_phys_segments(q, VIOMAXBLOCKDMA);
./drivers/cdrom/viocd.c: blk_queue_max_phys_segments(q, 1);
./drivers/ide/ide-probe.c: blk_queue_max_phys_segments(q, max_sg_entries);
./drivers/md/dm-table.c: lhs->max_phys_segments =
./drivers/md/dm-table.c: min_not_zero(lhs->max_phys_segments, rhs->max_phys_segments);
./drivers/md/dm-table.c: rs->max_phys_segments =
./drivers/md/dm-table.c: min_not_zero(rs->max_phys_segments,
./drivers/md/dm-table.c: q->max_phys_segments);
./drivers/md/dm-table.c: if (!rs->max_phys_segments)
./drivers/md/dm-table.c: rs->max_phys_segments = MAX_PHYS_SEGMENTS;
./drivers/md/dm-table.c: q->max_phys_segments = t->limits.max_phys_segments;
./drivers/message/i2o/i2o_block.c: blk_queue_max_phys_segments(queue, I2O_MAX_PHYS_SEGMENTS);
./drivers/message/i2o/i2o_block.c: osm_debug("max sectors = %d\n", queue->max_phys_segments);
./drivers/mmc/mmc_queue.c: blk_queue_max_phys_segments(mq->queue, host->max_phys_segs);
./drivers/s390/block/dasd.c: blk_queue_max_phys_segments(device->request_queue, -1L);
./drivers/s390/char/tape_block.c: blk_queue_max_phys_segments(blkdat->request_queue, -1L);
./drivers/scsi/sg.c: q->max_phys_segments);
./drivers/scsi/sg.c: sdp->sg_tablesize = min(q->max_hw_segments, q->max_phys_segments);
./drivers/scsi/st.c: SDp->request_queue->max_phys_segments);
./drivers/scsi/scsi_lib.c: blk_queue_max_phys_segments(q, SCSI_MAX_PHYS_SEGMENTS);
./drivers/xen/blkfront/vbd.c: blk_queue_max_phys_segments(rq, BLKIF_MAX_SEGMENTS_PER_REQUEST);
./fs/bio.c: if (nr_pages > q->max_phys_segments)
./fs/bio.c: nr_pages = q->max_phys_segments;
./fs/bio.c: while (bio->bi_phys_segments >= q->max_phys_segments
./fs/ocfs2/cluster/heartbeat.c: if (max_pages > q->max_phys_segments)
./fs/ocfs2/cluster/heartbeat.c: max_pages = q->max_phys_segments;
./include/linux/blkdev.h: unsigned short max_phys_segments;
./include/linux/blkdev.h:extern void blk_queue_max_phys_segments(request_queue_t *, unsigned short);
./include/linux/device-mapper.h: unsigned short max_phys_segments;
./include/linux/i2o.h:/* defines for max_sectors and max_phys_segments */
./include/linux/mmc/host.h: unsigned short max_phys_segs; /* see blk_queue_max_phys_segments */
bash-3.2$

Of these, drivers/block/cciss.c appears trivial, and drivers/md/dm-table.c appears most interesting.
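The dm-table.c hits above are device-mapper's limit-stacking code. A condensed sketch of how the per-target restrictions are merged, assuming the 2.6.18-era source with the earlier propagation patch applied (illustrative, not verbatim); note that bounce_pfn is not among the merged fields:

```c
/*
 * Condensed sketch of the dm-table.c limit stacking seen in the grep
 * hits above (modelled on the 2.6.18-era source; not verbatim).
 * Each target's restrictions are folded into the table's combined
 * restrictions with min_not_zero(); bounce_pfn is notably absent,
 * so each queue in the stack keeps its own bounce limit.
 */
#define min_not_zero(l, r) ((l) == 0 ? (r) : ((r) == 0 ? (l) : min(l, r)))

static void combine_restrictions_low(struct io_restrictions *lhs,
				     struct io_restrictions *rhs)
{
	lhs->max_sectors =
		min_not_zero(lhs->max_sectors, rhs->max_sectors);

	lhs->max_phys_segments =
		min_not_zero(lhs->max_phys_segments, rhs->max_phys_segments);

	lhs->max_hw_segments =
		min_not_zero(lhs->max_hw_segments, rhs->max_hw_segments);

	lhs->max_segment_size =
		min_not_zero(lhs->max_segment_size, rhs->max_segment_size);

	lhs->seg_boundary_mask =
		min_not_zero(lhs->seg_boundary_mask, rhs->seg_boundary_mask);

	/* no bounce_pfn here: it is never propagated up the stack */
}
```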
fyi: I was able to reproduce this with some effort on one of the systems (with the -128 and -155 kernels).

Seems like another path where code can violate max_phys_segments for a bio.

What's the problem: the cciss driver fails on

BUG_ON(creq->nr_phys_segments > MAXSGENTRIES);

because there is a stack of devices (cciss -> md_raid1 -> dm_linear) and some path miscalculates and violates the segment count in the bio.
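For context, the check that fires sits in the cciss request routine. A condensed sketch, modelled on the 2.6.18-era drivers/block/cciss.c (simplified, not the verbatim driver code):

```c
/*
 * Condensed sketch of the cciss request path (modelled on 2.6.18-era
 * drivers/block/cciss.c; simplified, not verbatim). The controller
 * command block holds at most MAXSGENTRIES scatter-gather entries,
 * which is also what the driver advertises to the block layer via
 * blk_queue_max_phys_segments(), so a request arriving with a larger
 * (stale) segment count is fatal.
 */
static void do_cciss_request(request_queue_t *q)
{
	struct request *creq;

	creq = elv_next_request(q);
	if (!creq)
		return;

	/* a bio cloned through md/dm with a stale, over-counted
	   nr_phys_segments can exceed the advertised limit here */
	BUG_ON(creq->nr_phys_segments > MAXSGENTRIES);

	/* ... map creq into a controller command and submit it ... */
}
```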
The number of segments depends on queue restrictions; according to blk_recount_segments() it depends on these queue parameters:

- q->bounce_pfn
- q->max_segment_size
- q->seg_boundary_mask

The last two are correctly propagated through the stack now (thanks to the previous patch), but bounce_pfn can be different:

- all dm devices set BLK_BOUNCE_ANY here
- md does not set this value explicitly, so the block layer sets BLK_BOUNCE_HIGH
- cciss sets it explicitly from its dma_mask

So in reality, every device in the stack can have a different bounce_pfn!

The real problem is when a virtual block device (MD or DM) clones a bio request using the __bio_clone() function - this recalculates the segment count for the _new_ clone using the _old_ block device's queue and sets BIO_SEG_VALID. DM correctly resets this flag, so after remapping (i.e. a queue change) the segments are recalculated. MD does not do this.

Now, the segment count can be recalculated using a more restrictive bounce_pfn (so every page in high memory is counted as a possible new segment because of possible bouncing), and because BIO_SEG_VALID is set, this segment count is _not_ recalculated for the underlying device queue later.

The upstream way to fix it is simply to not recalculate segments in bio_clone:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5d84070ee0a433620c57e85dac7f82faaec5fbb3

Exactly the same patch fixes it in RHEL5 for me.
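For reference, the gist of that upstream commit as it maps onto the 2.6.18-era fs/bio.c (a sketch of the resulting function, not the verbatim patch; see the linked commit for the authoritative version):

```c
/*
 * Sketch of __bio_clone() after the upstream fix (2.6.18-era fs/bio.c;
 * illustrative, not verbatim). The segment recalculation against the
 * *source* device's queue is dropped, so BIO_SEG_VALID is no longer
 * set on the clone and the counts are recomputed against whichever
 * queue the clone is finally submitted to.
 */
void __bio_clone(struct bio *bio, struct bio *bio_src)
{
	memcpy(bio->bi_io_vec, bio_src->bi_io_vec,
		bio_src->bi_max_vecs * sizeof(struct bio_vec));

	/*
	 * most users will be overriding ->bi_bdev with a new target,
	 * so we don't set nor calculate new physical/hw segment counts here
	 */
	bio->bi_sector = bio_src->bi_sector;
	bio->bi_bdev = bio_src->bi_bdev;
	bio->bi_flags |= 1 << BIO_CLONED;
	bio->bi_rw = bio_src->bi_rw;
	bio->bi_vcnt = bio_src->bi_vcnt;
	bio->bi_size = bio_src->bi_size;
	bio->bi_idx = bio_src->bi_idx;

	/* removed: bio_phys_segments(q, bio); bio_hw_segments(q, bio); */
}
```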
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Charlie, please can you verify that this experimental kernel build fixes the issue for you (attached only i686 build)?

see http://people.redhat.com/mbroz/pkg/kernel/

(In reply to comment #16)
> Charlie,
> please can you verify that this experimental kernel build fixes the issue for
> you (attached only i686 build)?

The dd command no longer crashes the kernel. It'll take longer for me to test whether the DB2 install will do so, but signs do look good :-) Thanks.

As expected, the DB2 install completes successfully with the updated kernel.

> see http://people.redhat.com/mbroz/pkg/kernel/

If you still have it, could you please add the kernel-PAE-devel rpm? Thanks.
Is there any chance that this will be included in a RHEL5.3 update kernel?
Just want to report that I've seen this problem on the 2.6.18-128 kernel using lvm -> raid1 -> SCSI/nbd. There is no panic, but the error is:

Incorrect number of segments after building list
counted 94, received 64
req nr_sec 992, cur_nr_sec 8

coming from SCSI. The patch that Milan Broz gave me fixed the issue. I hope this gets into the next Red Hat kernel. I'll attach the patch, which went in upstream.

Created attachment 355604 [details]
patch file
(In reply to comment #22)
> > see http://people.redhat.com/mbroz/pkg/kernel/
>
> If you still have it, could you please add the kernel-PAE-devel rpm? Thanks.

It's there (but I'll remove these rpms soon).

> Is there any chance that this will be included in a RHEL5.3 update kernel?

Please use standard Red Hat support channels for that question.

in kernel-2.6.18-162.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.

Reproduced on 2.6.18-128.el5PAE (RHEL 5.3); NOT reproduced on the 2.6.18-161.el5PAE kernel. Reproduced on the 2.6.18-160.el5 kernel as well.

Reproducer, bonnie++ and tiobench over lvm+md{raid0,raid1,raid5} on AACRAID and MEGARAID_SAS went fine for kernel 2.6.18-161.el5. Checked also for the MPTSAS driver with / on lvm2 on dm-raid0.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html