+++ This bug was initially created as a clone of Bug #471639 +++

> After long fiddling with creating bio requests I found an extremely
> simple way to reproduce it:
> - create a raid1 array over a CCISS device
> - use this array for a VG
> - create an LV there
> and simply repeat this several times until it oopses :-)
>
> dd iflag=direct if=/dev/<vg>/<lv> of=/dev/null bs=128000 count=10

The same procedure still triggers a kernel BUG with the 5.3 and 5.4 beta kernels.

------------[ cut here ]------------
kernel BUG at drivers/block/cciss.c:2862!
invalid opcode: 0000 [#1]
SMP
last sysfs file: /devices/pci0000:00/0000:00:08.0/0000:02:00.1/irq
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video hwmon backlight sbs i2c_ec i2c_core button battery asus_acpi ac parport_pc lp parport sr_mod cdrom sg pcspkr hpilo bnx2 serio_raw dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata cciss sd_mod scsi_mod raid1 ext3 jbd uhci_hcd ohci_hcd ehci_hcd
CPU:    10
EIP:    0060:[<f88c4f31>]    Not tainted VLI
EFLAGS: 00010012   (2.6.18-155.el5PAE #1)
EIP is at do_cciss_request+0x46/0x3a3 [cciss]
eax: f7c2a9e0   ebx: f7bbe7e4   ecx: 00000000   edx: 00000000
esi: f7c1da00   edi: 0000000c   ebp: 00000001   esp: f65e1abc
ds: 007b   es: 007b   ss: 0068
Process dd (pid: 4374, ti=f65e1000 task=f7579550 task.ti=f65e1000)
Stack: f7913800 f7bbe7e4 f7bd0000 00001116 f7c2a9e0 c04e6425 f7bbe7e4 f69a877c
       c04e4b3b f7c2a9e0 f69edb38 00000000 f69a877c c04e5974 f7913800 f7c2a9e0
       f7c2a9e0 f7bbe7e4 f7c2a9e0 c04e5a7a 00000000 006629b0 f7913800 f69edb38
Call Trace:
 [<c04e6425>] cfq_set_request+0x0/0x31f
 [<c04e4b3b>] cfq_resort_rr_list+0x23/0x8b
 [<c04e5974>] cfq_add_crq_rb+0xba/0xc3
 [<c04e5a7a>] cfq_insert_request+0x42/0x498
 [<c04db3a0>] elv_insert+0xc7/0x160
 [<c04df21d>] __make_request+0x2fb/0x344
 [<c04dd1ca>] generic_make_request+0x255/0x265
 [<c0478a60>] __bio_clone+0x6f/0x8a
 [<f884800e>] make_request+0x174/0x543 [raid1]
 [<c04dd1ca>] generic_make_request+0x255/0x265
 [<c0478a60>] __bio_clone+0x6f/0x8a
 [<f88e7442>] __map_bio+0x44/0x103 [dm_mod]
 [<f88e80d8>] __split_bio+0x428/0x438 [dm_mod]
 [<c0461a9e>] __handle_mm_fault+0x79c/0xcf8
 [<f88e87a2>] dm_request+0xe2/0xe8 [dm_mod]
 [<c04dd1ca>] generic_make_request+0x255/0x265
 [<c042c529>] lock_timer_base+0x15/0x2f
 [<c042c9e4>] del_timer+0x41/0x47
 [<c04ddf45>] __generic_unplug_device+0x1d/0x1f
 [<c04deeba>] generic_unplug_device+0x1f/0x2c
 [<f88472b4>] unplug_slaves+0x4f/0x83 [raid1]
 [<f8847300>] raid1_unplug+0xe/0x1a [raid1]
 [<f88e98fd>] dm_table_unplug_all+0x2d/0x60 [dm_mod]
 [<c0478e37>] bio_add_page+0x25/0x2e
 [<f88e7c7b>] dm_unplug_all+0x17/0x21 [dm_mod]
 [<c04df515>] blk_backing_dev_unplug+0x2f/0x32
 [<c04945a8>] __blockdev_direct_IO+0x9a9/0xba9
 [<c047adf5>] blkdev_direct_IO+0x30/0x35
 [<c047ad10>] blkdev_get_blocks+0x0/0xb5
 [<c0455d9a>] generic_file_direct_IO+0xd0/0x118
 [<c0455ffc>] __generic_file_aio_read+0xd2/0x198
 [<c0617192>] lock_kernel+0x16/0x25
 [<c04570ae>] generic_file_read+0x0/0xab
 [<c0457145>] generic_file_read+0x97/0xab
 [<c0434cef>] autoremove_wake_function+0x0/0x2d
 [<c04738a5>] vfs_read+0x9f/0x141
 [<c0473cf3>] sys_read+0x3c/0x63
 [<c0404f17>] syscall_call+0x7/0xb
 =======================
Code: 08 8b 82 f8 00 00 00 84 c0 0f 88 65 03 00 00 8b 44 24 04 e8 87 62 c1 c7 85 c0 89 44 24 10 0f 84 50 03 00 00 66 83 78 54 1f 76 08 <0f> 0b 2e 0b 20 93 8c f8 8b 44 24 08 ba 01 00 00 00 e8 ff d3 ff
EIP: [<f88c4f31>] do_cciss_request+0x46/0x3a3 [cciss] SS:ESP 0068:f65e1abc

--- Additional comment from charlieb-fedora-bugzilla.org.au on 2009-07-17 11:32:15 EDT ---

(In reply to comment #23)
> The same procedure still creates a kernel oops with 5.3 and 5.4beta kernels.

"still" or "again" - I didn't go back to 2.6.18-92.1.13.el5 to check.

FWIW, I first saw this problem when installing DB2.
2.6.30.1 kernel appears to be OK.
I am not able to reproduce it (on the same hardware where I ran the test mentioned above).

IIRC there can be another path which can possibly violate max_phys_segments; I think this could be a different bug (it just triggers the same BUG_ON()).

Please can you provide more system info? lvmdump, if possible.

Why are you using the PAE kernel - is there >4G memory? If so, please can you try booting the non-PAE kernel and run the test again?
> Please can you provide more system info? lvmdump, if possible.

Sure, but next Thursday will be the earliest possibility. There are two LVs; one is swap, and the other is the root partition.

> Why are you using the PAE kernel

We have a variety of systems, some of which have >4G, and we wish to use the same kernel on all. Is PAE deprecated with <4G? IIRC, this system does have 8G.

> - is there >4G memory?

The crash was produced with mem=2G (to keep the dump size manageable). I'll try the non-PAE kernel as well.
> If so, please can you try booting the non-PAE kernel and run the test again?

The non-PAE kernel also crashes.

[root@localhost ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 cciss/c0d1p1[1] cciss/c0d0p1[0]
      104320 blocks [2/2] [UU]

md2 : active raid1 cciss/c0d1p2[1] cciss/c0d0p2[0]
      292824704 blocks [2/2] [UU]
      [======>..............]  resync = 33.3% (97756800/292824704) finish=35.2min speed=92325K/sec

unused devices: <none>

[root@localhost ~]# lvdisplay
  --- Logical volume ---
  LV Name                /dev/main/root
  VG Name                main
  LV UUID                hMXbly-QE7d-cQKW-jK0X-11ww-HDK3-bkGora
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                273.84 GB
  Current LE             8763
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:0

  --- Logical volume ---
  LV Name                /dev/main/swap
  VG Name                main
  LV UUID                tl0q2j-SUsL-5XaS-5kR9-xrKA-qNWt-zcfXqr
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                5.41 GB
  Current LE             173
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:1

[root@localhost ~]# fdisk -l /dev/sda

[root@localhost ~]# fdisk -l /dev/cciss/c0d0

Disk /dev/cciss/c0d0: 299.9 GB, 299966445568 bytes
255 heads, 63 sectors/track, 36468 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

             Device Boot      Start         End      Blocks   Id  System
/dev/cciss/c0d0p1   *              1          13      104391   fd  Linux raid autodetect
/dev/cciss/c0d0p2                 14       36468   292824787+  fd  Linux raid autodetect

[root@localhost ~]# fdisk -l /dev/cciss/c0d1

Disk /dev/cciss/c0d1: 299.9 GB, 299966445568 bytes
255 heads, 63 sectors/track, 36468 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

             Device Boot      Start         End      Blocks   Id  System
/dev/cciss/c0d1p1   *              1          13      104391   fd  Linux raid autodetect
/dev/cciss/c0d1p2                 14       36468   292824787+  fd  Linux raid autodetect

[root@localhost ~]# mount
/dev/mapper/main-root on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/md1 on /boot type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
[root@localhost ~]#
Created attachment 354764 [details]
lvmdump output.
I've confirmed (not surprisingly) that there is no kernel crash with 2.6.30.2. I tried to boot 2.6.19.7 (built with oldconfig + defaults), but without success - the vg main wasn't found. Ditto with 2.6.18.8 and 2.6.25.20.
Milan, are you waiting for any more information from me? Is there anything further I can do to help?
Thanks for the info; I'll need to analyse that first. The crash log is slightly different (mainly in that bio_add_page() appears here).
BTW, Milan, Martin Peterson is aware of this issue. I saw him last week in Montreal and showed him the stacktrace.
Sorry, Martin Petersen.
I'm looking at the source of 2.6.18-128.1.6. I see that nr_phys_segments is only ever set within block/ll_rw_blk.c. Requests there should have a ceiling of q->max_phys_segments, so we are interested in tracking what sets/changes q->max_phys_segments.

bash-3.2$ grep -r max_phys_segments .
./Documentation/block/biodoc.txt:	blk_queue_max_phys_segments(q, max_segments)
./Documentation/block/biodoc.txt:blk_queue_max_phys_segments() : Sets an upper limit on the maximum number
./block/ll_rw_blk.c:	blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
./block/ll_rw_blk.c: *    blk_queue_max_phys_segments - set max phys segments for a request for this queue
./block/ll_rw_blk.c:void blk_queue_max_phys_segments(request_queue_t *q, unsigned short max_segments)
./block/ll_rw_blk.c:	q->max_phys_segments = max_segments;
./block/ll_rw_blk.c:EXPORT_SYMBOL(blk_queue_max_phys_segments);
./block/ll_rw_blk.c:	t->max_phys_segments = min(t->max_phys_segments,b->max_phys_segments);
./block/ll_rw_blk.c:	if (req->nr_phys_segments + nr_phys_segs > q->max_phys_segments) {
./block/ll_rw_blk.c:	    || req->nr_phys_segments + nr_phys_segs > q->max_phys_segments) {
./block/ll_rw_blk.c:	if (total_phys_segments > q->max_phys_segments)
./block/ll_rw_blk.c:	blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
./drivers/block/DAC960.c:  blk_queue_max_phys_segments(RequestQueue, Controller->DriverScatterGatherLimit);
./drivers/block/cciss.c:	blk_queue_max_phys_segments(disk->queue, MAXSGENTRIES);
./drivers/block/cciss.c:	blk_queue_max_phys_segments(q, MAXSGENTRIES);
./drivers/block/cpqarray.c:	blk_queue_max_phys_segments(q, SG_MAX);
./drivers/block/paride/pf.c:	blk_queue_max_phys_segments(pf_queue, cluster);
./drivers/block/pktcdvd.c: *   max_phys_segments value.
./drivers/block/pktcdvd.c:	if ((pd->settings.size << 9) / CD_FRAMESIZE <= q->max_phys_segments) {
./drivers/block/pktcdvd.c:	} else if ((pd->settings.size << 9) / PAGE_SIZE <= q->max_phys_segments) {
./drivers/block/pktcdvd.c:		printk("pktcdvd: cdrom max_phys_segments too small\n");
./drivers/block/sx8.c:	blk_queue_max_phys_segments(q, CARM_MAX_REQ_SG);
./drivers/block/ub.c:	blk_queue_max_phys_segments(q, UB_MAX_REQ_SG);
./drivers/block/viodasd.c:	blk_queue_max_phys_segments(q, VIOMAXBLOCKDMA);
./drivers/cdrom/viocd.c:	blk_queue_max_phys_segments(q, 1);
./drivers/ide/ide-probe.c:	blk_queue_max_phys_segments(q, max_sg_entries);
./drivers/md/dm-table.c:	lhs->max_phys_segments =
./drivers/md/dm-table.c:		min_not_zero(lhs->max_phys_segments, rhs->max_phys_segments);
./drivers/md/dm-table.c:	rs->max_phys_segments =
./drivers/md/dm-table.c:		min_not_zero(rs->max_phys_segments,
./drivers/md/dm-table.c:			     q->max_phys_segments);
./drivers/md/dm-table.c:	if (!rs->max_phys_segments)
./drivers/md/dm-table.c:		rs->max_phys_segments = MAX_PHYS_SEGMENTS;
./drivers/md/dm-table.c:	q->max_phys_segments = t->limits.max_phys_segments;
./drivers/message/i2o/i2o_block.c:	blk_queue_max_phys_segments(queue, I2O_MAX_PHYS_SEGMENTS);
./drivers/message/i2o/i2o_block.c:	osm_debug("max sectors = %d\n", queue->max_phys_segments);
./drivers/mmc/mmc_queue.c:	blk_queue_max_phys_segments(mq->queue, host->max_phys_segs);
./drivers/s390/block/dasd.c:	blk_queue_max_phys_segments(device->request_queue, -1L);
./drivers/s390/char/tape_block.c:	blk_queue_max_phys_segments(blkdat->request_queue, -1L);
./drivers/scsi/sg.c:			q->max_phys_segments);
./drivers/scsi/sg.c:	sdp->sg_tablesize = min(q->max_hw_segments, q->max_phys_segments);
./drivers/scsi/st.c:	    SDp->request_queue->max_phys_segments);
./drivers/scsi/scsi_lib.c:	blk_queue_max_phys_segments(q, SCSI_MAX_PHYS_SEGMENTS);
./drivers/xen/blkfront/vbd.c:	blk_queue_max_phys_segments(rq, BLKIF_MAX_SEGMENTS_PER_REQUEST);
./fs/bio.c:	if (nr_pages > q->max_phys_segments)
./fs/bio.c:		nr_pages = q->max_phys_segments;
./fs/bio.c:	while (bio->bi_phys_segments >= q->max_phys_segments
./fs/ocfs2/cluster/heartbeat.c:	if (max_pages > q->max_phys_segments)
./fs/ocfs2/cluster/heartbeat.c:		max_pages = q->max_phys_segments;
./include/linux/blkdev.h:	unsigned short		max_phys_segments;
./include/linux/blkdev.h:extern void blk_queue_max_phys_segments(request_queue_t *, unsigned short);
./include/linux/device-mapper.h:	unsigned short		max_phys_segments;
./include/linux/i2o.h:/* defines for max_sectors and max_phys_segments */
./include/linux/mmc/host.h:	unsigned short		max_phys_segs;	/* see blk_queue_max_phys_segments */
bash-3.2$

Of these, drivers/block/cciss.c appears trivial, and drivers/md/dm-table.c appears most interesting.
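To make the chain explicit, here is a condensed sketch pieced together from the grep hits above (2.6.18-era code, paraphrased rather than quoted verbatim, so treat names and surrounding details as approximate):

    /* Driver side: cciss advertises its per-request scatter-gather limit
     * on the queue (drivers/block/cciss.c). */
    blk_queue_max_phys_segments(q, MAXSGENTRIES);

    /* Block core (block/ll_rw_blk.c): the merge paths are supposed to
     * honour that limit, so no request handed to the driver should ever
     * exceed it.  Condensed - the real check sits in the ll_new_*_segment
     * helpers. */
    static int ll_back_merge_fn(request_queue_t *q, struct request *req,
                                struct bio *bio)
    {
            int nr_phys_segs = bio_phys_segments(q, bio);

            if (req->nr_phys_segments + nr_phys_segs > q->max_phys_segments)
                    return 0;               /* refuse the merge */
            /* ... */
            return 1;
    }

    /* Driver side again: cciss trusts that guarantee and asserts it -
     * this is the BUG_ON at drivers/block/cciss.c:2862 in the oops above. */
    BUG_ON(creq->nr_phys_segments > MAXSGENTRIES);

As far as I can see, if a bio already carries a stale, over-counted segment count (counted against a different queue's limits), it can reach the driver without tripping those merge checks, and the driver's BUG_ON is the first place the violation is noticed.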
FYI: I was able to reproduce that with some effort on one of the systems (with the -128 and -155 kernels). It seems like another path where code can violate max_phys_segments for a bio.
What's the problem: the cciss driver fails on

  BUG_ON(creq->nr_phys_segments > MAXSGENTRIES);

Because there is a stack of devices (cciss -> md raid1 -> dm-linear), some path miscalculates and violates the segment count in the bio.

The number of segments depends on queue restrictions; according to blk_recount_segments() it depends on these queue parameters:
  q->bounce_pfn
  q->max_segment_size
  q->seg_boundary_mask

The last two are correctly propagated through the stack now (thanks to the previous patch), but bounce_pfn can be different:
 - all dm devices set BLK_BOUNCE_ANY here
 - md does not set this value explicitly, so the block layer sets BLK_BOUNCE_HIGH
 - cciss uses an explicit setting derived from its dma_mask

So in reality, every device in the stack can have a different bounce_pfn!

The real problem is when a virtual block device (MD or DM) clones a bio request using the __bio_clone() function - this recalculates the segment count for the _new_ clone using the _old_ block device's queue and sets BIO_SEG_VALID. DM correctly resets this flag here, so after remapping (i.e. a queue change) the segments are recalculated; MD does not do this.

Now, the segment count can be recalculated using a more restrictive bounce_pfn (so all pages in high memory are counted as possible new segments because of possible bouncing), and because BIO_SEG_VALID is set, this segment count is _not_ recalculated for the underlying device's queue later.

The upstream way to fix it is simply to not recalculate segments in bio_clone:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5d84070ee0a433620c57e85dac7f82faaec5fbb3

Exactly the same patch fixes it in RHEL5 for me.
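To illustrate what that commit does, here is a rough sketch of __bio_clone() in fs/bio.c with the change applied (abridged and paraphrased from the 2.6.18-era code, not the literal upstream diff - see the commit above for the authoritative version):

    /* fs/bio.c - abridged sketch, not the literal upstream diff */
    void __bio_clone(struct bio *bio, struct bio *bio_src)
    {
            /* Copy the vector and the positional fields of the source bio. */
            memcpy(bio->bi_io_vec, bio_src->bi_io_vec,
                   bio_src->bi_max_vecs * sizeof(struct bio_vec));

            bio->bi_sector = bio_src->bi_sector;
            bio->bi_bdev   = bio_src->bi_bdev;
            bio->bi_flags |= 1 << BIO_CLONED;
            bio->bi_rw     = bio_src->bi_rw;
            bio->bi_vcnt   = bio_src->bi_vcnt;
            bio->bi_size   = bio_src->bi_size;
            bio->bi_idx    = bio_src->bi_idx;

            /*
             * The removed code looked roughly like this:
             *
             *      request_queue_t *q = bdev_get_queue(bio_src->bi_bdev);
             *      bio_phys_segments(q, bio);
             *      bio_hw_segments(q, bio);
             *
             * i.e. it counted segments against the *source* device's queue
             * (including its bounce_pfn) and marked the bio BIO_SEG_VALID,
             * so the count was never redone for the queue the clone ends up
             * on.  With these calls gone, the clone's segment counts stay
             * invalid and blk_recount_segments() recomputes them against the
             * correct queue when they are actually needed.
             */
    }

The effect is that md's failure to reset BIO_SEG_VALID after remapping no longer matters, because the clone never carries a "valid" segment count computed against the wrong queue in the first place.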
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Charlie, please can you verify that this experimental kernel build fixes the issue for you (attached only i686 build)? see http://people.redhat.com/mbroz/pkg/kernel/
(In reply to comment #16)
> Charlie,
> please can you verify that this experimental kernel build fixes the issue for
> you (attached only i686 build)?

The dd command no longer crashes the kernel. It'll take longer for me to test whether the DB2 install will do so, but the signs do look good :-) Thanks.
As expected, the DB2 install completes successfully with the updated kernel.
> see http://people.redhat.com/mbroz/pkg/kernel/

If you still have it, could you please add the kernel-PAE-devel rpm? Thanks.

Is there any chance that this will be included in a RHEL5.3 update kernel?
Just want to report that I've seen this problem on the 2.6.18-128 kernel using the stack lvm -> raid1 -> SCSI/nbd.

There is no panic, but the error (coming from SCSI) is:

Incorrect number of segments after building list
counted 94, received 64
req nr_sec 992, cur_nr_sec 8

The patch that Milan Broz gave me fixed the issue. I hope this gets into the next Red Hat kernel. I'll attach the patch, which went in upstream.
Created attachment 355604 [details]
patch file
(In reply to comment #22)
> > see http://people.redhat.com/mbroz/pkg/kernel/
>
> If you still have it, could you please add the kernel-PAE-devel rpm? Thanks.

It's there. (But I'll remove these rpms soon.)

> Is there any chance that this will be included in a RHEL5.3 update kernel?

Please use standard Red Hat support channels for that question.
in kernel-2.6.18-162.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.
Reproduced on the 2.6.18-128.el5PAE kernel (RHEL 5.3).
NOT reproduced on the 2.6.18-161.el5PAE kernel.
Reproduced on the 2.6.18-160.el5 kernel as well.
The reproducer, bonnie++, and tiobench over lvm+md{raid0,raid1,raid5} on AACRAID and MEGARAID_SAS went fine with kernel 2.6.18-161.el5. Also checked with the MPTSAS driver with / on lvm2 on dm-raid0.
An advisory has been issued which should help with the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html