Bug 471639 - max_phys_segments violation with dm-linear + md raid1 + cciss
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Hardware: All
OS: Linux
Priority: urgent
Severity: high
Target Milestone: rc
Assigned To: Milan Broz
QA Contact: Martin Jenner
Reported: 2008-11-14 14:27 EST by Bryn M. Reeves
Modified: 2013-02-28 23:07 EST (History)

Doc Type: Bug Fix
Clone Of:
: 512387 (view as bug list)
Last Closed: 2009-01-20 14:39:21 EST

Attachments:
- log from BUG in comment #0 (4.33 KB, text/plain), 2008-11-14 14:30 EST, Bryn M. Reeves
- Proposed patch (NOT FINAL yet!) (2.70 KB, patch), 2008-11-30 14:45 EST, Milan Broz

Description Bryn M. Reeves 2008-11-14 14:27:16 EST
Description of problem:
We're seeing a max_phys_segments violation for a request submitted to CCISS. The system is configured with MD RAID1 volumes over the CCISS drives, with a pair of volume groups configured on these RAID volumes.


        |  dm  |
        |  md  |
        | cciss|

This triggers the BUG_ON() in do_cciss_request:

Kernel BUG at drivers/block/cciss.c:2571
invalid opcode: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:1c.1/0000:04:00.0/0000:05:00.0/irq
Modules linked in: mptctl ipmi_si(U) ipmi_devintf(U) ipmi_msghandler(U) hp_ilo(U) sunrpc bonding ip_conntrack_netbios_ns ipt_REJECT
xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 xfrm_nalgo c
rypto_api dm_multipath video sbs backlight i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parport st sg a
ta_piix libata shpchp i5000_edac pcspkr e1000e ide_cd bnx2 edac_mc serio_raw cdrom floppy dm_snapshot dm_zero dm_mirror dm_mod mptsp
i scsi_transport_spi mptsas mptscsih scsi_transport_sas mptbase cciss sd_mod scsi_mod raid1 ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 15791, comm: tar Tainted: G      2.6.18-92.1.13.el5 #1
RIP: 0010:[<ffffffff880c7a4d>]  [<ffffffff880c7a4d>] :cciss:do_cciss_request+0x4b/0x406
RSP: 0018:ffff8100a01771a8  EFLAGS: 00010012
RAX: ffff8100a2f1d178 RBX: 0000000000000003 RCX: ffff8100a2f1d178
RDX: 000000000000000e RSI: ffff8100a2f1d178 RDI: 0000000000000000
RBP: ffff8100a2f1d178 R08: ffff8100a2f1d178 R09: 5e00000000000000
R10: ffff8100beab0e50 R11: ffff8100bb9269b0 R12: ffff8100a2f1d178
R13: 00000000000b20c9 R14: ffff8100bf370000 R15: 0000000000001000
FS:  00002b49a941cc10(0000) GS:ffffffff803a0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000fff3521c CR3: 00000000a0385000 CR4: 00000000000006e0
Process tar (pid: 15791, threadinfo ffff8100a0176000, task ffff8100bb8d7860)
Stack:  ffff8100be400270 ffffffff880c7df6 ffff8100bf6cec98 0000100000000002
ffff8100bf6cec98 00000000beab0e50 ffff81000a8f6780 0000100000000000
0000000000000000 0000000000000000 ffff81000a8f7eb0 0000100000000000
Call Trace:
[<ffffffff880c7df6>] :cciss:do_cciss_request+0x3f4/0x406
[<ffffffff800da621>] __bio_clone+0x71/0x8a
[<ffffffff88076083>] :raid1:make_request+0x196/0x606
[<ffffffff8001c15b>] cfq_set_request+0x246/0x389
[<ffffffff8008a74c>] enqueue_task+0x41/0x56
[<ffffffff80023036>] mempool_alloc+0x31/0xe7
[<ffffffff80023036>] mempool_alloc+0x31/0xe7
[<ffffffff8001c15b>] cfq_set_request+0x246/0x389
[<ffffffff8003d6c6>] lock_timer_base+0x1b/0x3c
[<ffffffff8003d6c6>] lock_timer_base+0x1b/0x3c
[<ffffffff80031fff>] del_timer+0x4e/0x57
[<ffffffff80137edd>] elv_insert+0x12a/0x1f7
[<ffffffff8000bd0a>] __make_request+0x3c9/0x418
[<ffffffff8001bc13>] generic_make_request+0x1e8/0x1ff
[<ffffffff800da621>] __bio_clone+0x71/0x8a
[<ffffffff88076083>] :raid1:make_request+0x196/0x606
[<ffffffff8804f9de>] :ext3:ext3_get_blocks_handle+0x9b1/0x9f1
[<ffffffff8814e465>] :dm_mod:__map_bio+0xbb/0x119
[<ffffffff8001bc13>] generic_make_request+0x1e8/0x1ff
[<ffffffff8814e465>] :dm_mod:__map_bio+0xbb/0x119
[<ffffffff8814ef3c>] :dm_mod:__split_bio+0x188/0x3be
[<ffffffff8814f938>] :dm_mod:dm_request+0x101/0x110
[<ffffffff8001bc13>] generic_make_request+0x1e8/0x1ff
[<ffffffff8804fc80>] :ext3:ext3_get_block+0x0/0xe3
[<ffffffff80033080>] submit_bio+0xcd/0xd4
[<ffffffff800e96e4>] mpage_bio_submit+0x22/0x26
[<ffffffff8804fc80>] :ext3:ext3_get_block+0x0/0xe3
[<ffffffff80038bca>] mpage_readpages+0xb2/0x132
[<ffffffff8804fc80>] :ext3:ext3_get_block+0x0/0xe3
[<ffffffff8000f095>] __alloc_pages+0x65/0x2ce
[<ffffffff800129dc>] __do_page_cache_readahead+0x101/0x1d9
[<ffffffff8006d940>] do_gettimeofday+0x50/0x92
[<ffffffff80028866>] sync_page+0x0/0x42
[<ffffffff8005a5d0>] getnstimeofday+0x10/0x28
[<ffffffff8009ff32>] ktime_get_ts+0x1a/0x4e
[<ffffffff800ba35b>] delayacct_end+0x5d/0x86
[<ffffffff8003205b>] blockable_page_cache_readahead+0x53/0xb2
[<ffffffff8002edd6>] make_ahead_window+0x82/0x9e
[<ffffffff80013b19>] page_cache_readahead+0x17f/0x1af
[<ffffffff8000be7f>] do_generic_mapping_read+0x126/0x3f8
[<ffffffff8000cd7b>] file_read_actor+0x0/0x154
[<ffffffff8000c29d>] __generic_file_aio_read+0x14c/0x190
[<ffffffff80016809>] generic_file_aio_read+0x34/0x39
[<ffffffff8000caa4>] do_sync_read+0xc7/0x104
[<ffffffff8000df9a>] current_fs_time+0x3b/0x40
[<ffffffff8009df87>] autoremove_wake_function+0x0/0x2e
[<ffffffff8000b337>] vfs_read+0xcb/0x171
[<ffffffff8001173f>] sys_read+0x45/0x6e
[<ffffffff8005d28d>] tracesys+0xd5/0xe0

This is:

static void do_cciss_request(request_queue_t *q)
{
        ctlr_info_t *h = q->queuedata;
        CommandList_struct *c;
        sector_t start_blk;
        int seg;
        struct request *creq;
        u64bit temp64;
        struct scatterlist tmp_sg[MAXSGENTRIES];
        drive_info_struct *drv;
        int i, dir;
        ...
        creq = elv_next_request(q);
        ...
        BUG_ON(creq->nr_phys_segments > MAXSGENTRIES); <<<<

        if ((c = cmd_alloc(h, 1)) == NULL)
                goto full;


drivers/block/cciss_cmd.h:#define MAXSGENTRIES            31

This request was submitted with 32 entries:

    crash> struct request ffff8100a2f1d178
    struct request {
      queuelist = {
        next = 0xffff8100bf6cec98,
        prev = 0xffff8100bf6cec98
      },
      donelist = {
        next = 0xffff8100a2f1d188,
        prev = 0xffff8100a2f1d188
      },
      flags = 2372,
      sector = 729289,
      nr_sectors = 250,
      current_nr_sectors = 8,
      hard_sector = 729289,
      hard_nr_sectors = 250,
      hard_cur_sectors = 8,
      bio = 0xffff8100bf8125c0,
      biotail = 0xffff8100bf8125c0,
      elevator_private = 0xffff8100bb9269b0,
      completion_data = 0x0,
      rq_status = 1,
      errors = 0,
      rq_disk = 0xffff8100beb41400,
      start_time = 4295138633,
     *nr_phys_segments = 32,*
      nr_hw_segments = 32,
      ...

Version-Release number of selected component (if applicable):

How reproducible:
Unknown - this is seen reliably at one site but attempts to reproduce in the lab have so far failed.

Steps to Reproduce:
1. Configure LVM over MD RAID1 over CCISS
2. Submit I/O

Actual results:
Above BUG

Expected results:
No BUG, device limits correctly enforced across the whole stack.

Additional info:
We've not so far been able to reproduce this. We have seen similar problems in the past in device-mapper, e.g.:


I'm not sure exactly where the limit is not being enforced here. In the kernel.org bugzilla mentioned above, the problem was in the dm-crypt write path, which did not properly enforce limits when cloning and modifying bios while performing encryption.
Comment 1 Bryn M. Reeves 2008-11-14 14:30:28 EST
Created attachment 323633 [details]
log from BUG in comment #0

Log from above BUG.
Comment 7 Milan Broz 2008-11-30 14:32:45 EST
Simple analysis:

The oops happens during a read request initiated from the page readahead code.

The page vectors there are allocated correctly through bio_add_page(), and the device stack does not modify these vectors: every layer uses bio_clone(), which clones only the bio header, not the page data itself.

(One exception is page bouncing, but that surely does not apply here because, according to the sos report, there is no high memory on the affected system.)

So the problem is at the beginning: the request is allocated with a different number of physical segments than the final request queue sees.

Simulating a similar bio request on the device stack, I found that the
request queue parameters are *not* correctly propagated through the stack:

From similar configuration on x86_64 system (LVM over MD over SCSI):

request_queue   max_segment_size  seg_boundary_mask
SCSI                65536             0xffffffff
MD RAID1                0                      0
LVM                 65536                 -1 (64bit)

Unfortunately, bio_add_page() (resp. bio_phys_segments()) calculates the
number of physical segments according to these parameters.

During generic_make_request() the segment count is recalculated and *can
increase* bio->bi_phys_segments, which later becomes the request's
nr_phys_segments seen by the bottom queue.

	- the generic block layer code does not initialize some values
	if a stacked driver uses the simplified request queue constructor
	blk_alloc_queue() and sets its own request function

	- the parameters above remain wrong even if MD calls
	blk_queue_stack_limits() later
	(DM uses its own function for that, which sets the limits "correctly",
	unfortunately using the uninitialized queue below...)


After long fiddling with creating bio requests, I found an extremely
simple way to reproduce it:
	- create a raid1 array over a CCISS device
	- use this array for a VG
	- create an LV there
and simply repeat the following several times until it oopses :-)

	dd iflag=direct if=/dev/<vg>/<lv> of=/dev/null bs=128000 count=10

This command generates a bio with 250 sectors, allocated in 32 pages
(the last page uses only 1024 bytes).
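The page arithmetic can be checked directly; a quick sketch (plain arithmetic, nothing kernel-specific):

```python
SECTOR = 512
PAGE = 4096
MAXSGENTRIES = 31  # from drivers/block/cciss_cmd.h

nr_sectors = 250                   # from dd bs=128000: 128000 / 512
total_bytes = nr_sectors * SECTOR  # 128000 bytes per bio

# 31 full pages plus a partial last page.
full_pages, remainder = divmod(total_bytes, PAGE)
pages = full_pages + (1 if remainder else 0)

print(total_bytes)  # 128000
print(pages)        # 32 pages in the bio
print(remainder)    # 1024 bytes used in the last page
print(pages > MAXSGENTRIES)  # True: one page per segment already violates CCISS
```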

(I have also reproducer in-kernel code to do this more straightforward.)

At the LVM layer the bio is allocated with 31 segments (still OK for
CCISS); unfortunately, at a lower layer it is recalculated to 32 segments,
which violates the CCISS limit and triggers the BUG_ON().
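A toy model of the recount (my own simplification, not the kernel's blk_recount_segments() code) illustrates how uninitialized queue limits inflate the count: adjacent biovec chunks merge into one physical segment only when they are physically contiguous, the merged segment fits max_segment_size, and it does not cross the segment boundary, so a queue reporting max_segment_size = 0 and seg_boundary_mask = 0 merges nothing:

```python
PAGE = 4096

def count_segments(chunks, phys_contig, max_segment_size, seg_boundary_mask):
    """Count physical segments for a list of chunk sizes in bytes.

    phys_contig[i] is True when chunk i is physically contiguous with
    chunk i-1 (a toy stand-in for BIOVEC_PHYS_MERGEABLE).  A chunk merges
    into the current segment only if it is contiguous, the merged size
    fits max_segment_size, and the boundary mask allows any merging at all
    (a crude stand-in for the real address-crossing check).
    """
    segments = 0
    seg_size = 0
    for i, size in enumerate(chunks):
        if (i > 0 and phys_contig[i]
                and seg_size + size <= max_segment_size
                and seg_boundary_mask != 0):
            seg_size += size   # merge into the current segment
        else:
            segments += 1      # start a new segment
            seg_size = size
    return segments

# 31 full pages plus a final 1024-byte chunk, as in the dd reproducer.
chunks = [PAGE] * 31 + [1024]
# Assume only the last chunk happens to be contiguous with its predecessor.
contig = [False] * 31 + [True]

# Upper (LVM) queue, sane limits: the last two chunks merge -> 31 segments.
print(count_segments(chunks, contig, 65536, 0xffffffff))  # 31

# Queue with uninitialized limits: nothing merges -> 32 segments.
print(count_segments(chunks, contig, 0, 0))               # 32
```

Under these assumed parameters the model reproduces the observed discrepancy: bio_add_page() against the upper queue counts 31 segments, while the recount against the uninitialized queue yields 32, one more than MAXSGENTRIES.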

Unfortunately, the problem exists upstream too. I'll attach a patch to fix
it, but it needs to be discussed with the block layer maintainer first
(the setting of the boundary mask is full of magic constants anyway...)
Comment 10 Milan Broz 2008-11-30 14:45:49 EST
Created attachment 325139 [details]
Proposed patch (NOT FINAL yet!)

This is a backport of the patch for RHEL5 testing; the original patch needs to be sent upstream first.
Comment 17 Don Zickus 2008-12-09 16:05:00 EST
in kernel-2.6.18-126.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Comment 21 errata-xmlrpc 2009-01-20 14:39:21 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

Comment 23 Charlie Brady 2009-07-17 11:29:11 EDT
> After long fiddling with creating bio requests I found extremely
> simple way how to reproduce it:
> - create raid1 array over CCISS device
> - use this array for VG
> - create LV there
> and simply several time repeat this until it oopses:-)
> dd iflag=direct if=/dev/<vg>/<lv> of=/dev/null bs=128000 count=10

The same procedure still creates a kernel oops with 5.3 and 5.4beta kernels.

------------[ cut here ]------------
kernel BUG at drivers/block/cciss.c:2862!
invalid opcode: 0000 [#1]
last sysfs file: /devices/pci0000:00/0000:00:08.0/0000:02:00.1/irq
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc
ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink
iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables
x_tables ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video hwmon
backlight sbs i2c_ec i2c_core button battery asus_acpi ac parport_pc lp
parport sr_mod cdrom sg pcspkr hpilo bnx2 serio_raw dm_raid45 dm_message
dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod
ata_piix libata cciss sd_mod scsi_mod raid1 ext3 jbd uhci_hcd ohci_hcd
ehci_hcd
CPU:    10
EIP:    0060:[<f88c4f31>]    Not tainted VLI
EFLAGS: 00010012   (2.6.18-155.el5PAE #1)
EIP is at do_cciss_request+0x46/0x3a3 [cciss]
eax: f7c2a9e0   ebx: f7bbe7e4   ecx: 00000000   edx: 00000000
esi: f7c1da00   edi: 0000000c   ebp: 00000001   esp: f65e1abc
ds: 007b   es: 007b   ss: 0068
Process dd (pid: 4374, ti=f65e1000 task=f7579550 task.ti=f65e1000)
Stack: f7913800 f7bbe7e4 f7bd0000 00001116 f7c2a9e0 c04e6425 f7bbe7e4 
       c04e4b3b f7c2a9e0 f69edb38 00000000 f69a877c c04e5974 f7913800 
       f7c2a9e0 f7bbe7e4 f7c2a9e0 c04e5a7a 00000000 006629b0 f7913800 
Call Trace:
 [<c04e6425>] cfq_set_request+0x0/0x31f
 [<c04e4b3b>] cfq_resort_rr_list+0x23/0x8b
 [<c04e5974>] cfq_add_crq_rb+0xba/0xc3
 [<c04e5a7a>] cfq_insert_request+0x42/0x498
 [<c04db3a0>] elv_insert+0xc7/0x160
 [<c04df21d>] __make_request+0x2fb/0x344
 [<c04dd1ca>] generic_make_request+0x255/0x265
 [<c0478a60>] __bio_clone+0x6f/0x8a
 [<f884800e>] make_request+0x174/0x543 [raid1]
 [<c04dd1ca>] generic_make_request+0x255/0x265
 [<c0478a60>] __bio_clone+0x6f/0x8a
 [<f88e7442>] __map_bio+0x44/0x103 [dm_mod]
 [<f88e80d8>] __split_bio+0x428/0x438 [dm_mod]
 [<c0461a9e>] __handle_mm_fault+0x79c/0xcf8
 [<f88e87a2>] dm_request+0xe2/0xe8 [dm_mod]
 [<c04dd1ca>] generic_make_request+0x255/0x265
 [<c042c529>] lock_timer_base+0x15/0x2f
 [<c042c9e4>] del_timer+0x41/0x47
 [<c04ddf45>] __generic_unplug_device+0x1d/0x1f
 [<c04deeba>] generic_unplug_device+0x1f/0x2c
 [<f88472b4>] unplug_slaves+0x4f/0x83 [raid1]
 [<f8847300>] raid1_unplug+0xe/0x1a [raid1]
 [<f88e98fd>] dm_table_unplug_all+0x2d/0x60 [dm_mod]
 [<c0478e37>] bio_add_page+0x25/0x2e
 [<f88e7c7b>] dm_unplug_all+0x17/0x21 [dm_mod]
 [<c04df515>] blk_backing_dev_unplug+0x2f/0x32
 [<c04945a8>] __blockdev_direct_IO+0x9a9/0xba9
 [<c047adf5>] blkdev_direct_IO+0x30/0x35
 [<c047ad10>] blkdev_get_blocks+0x0/0xb5
 [<c0455d9a>] generic_file_direct_IO+0xd0/0x118
 [<c0455ffc>] __generic_file_aio_read+0xd2/0x198
 [<c0617192>] lock_kernel+0x16/0x25
 [<c04570ae>] generic_file_read+0x0/0xab
 [<c0457145>] generic_file_read+0x97/0xab
 [<c0434cef>] autoremove_wake_function+0x0/0x2d
 [<c04738a5>] vfs_read+0x9f/0x141
 [<c0473cf3>] sys_read+0x3c/0x63
 [<c0404f17>] syscall_call+0x7/0xb
Code: 08 8b 82 f8 00 00 00 84 c0 0f 88 65 03 00 00 8b 44 24 04 e8 87 62 c1
c7 85 c0 89 44 24 10 0f 84 50 03 00 00 66 83 78 54 1f 76 08 <0f> 0b 2e 0b
20 93 8c f8 8b 44 24 08 ba 01 00 00 00 e8 ff d3 ff
EIP: [<f88c4f31>] do_cciss_request+0x46/0x3a3 [cciss] SS:ESP 0068:f65e1abc
Comment 24 Charlie Brady 2009-07-17 11:32:15 EDT
(In reply to comment #23)

> The same procedure still creates a kernel oops with 5.3 and 5.4beta kernels.

"still" or "again" - I didn't go back to 2.6.18-92.1.13.el5 to check.

FWIW I first saw this problem when installing DB2.
