Bug 471639

Summary: max_phys_segments violation with dm-linear + md raid1 + cciss
Product: Red Hat Enterprise Linux 5
Reporter: Bryn M. Reeves <bmr>
Component: kernel
Assignee: Milan Broz <mbroz>
Status: CLOSED ERRATA
QA Contact: Martin Jenner <mjenner>
Severity: high
Priority: urgent
Version: 5.2
CC: agk, charlieb-fedora-bugzilla, dledford, dmair, dzickus, mbroz, mgahagan, peterm, pvrabec, syeghiay, tao, thenzl
Target Milestone: rc
Hardware: All
OS: Linux
Doc Type: Bug Fix
Cloned as: 512387 (view as bug list)
Last Closed: 2009-01-20 19:39:21 UTC

Attachments:
log from BUG in comment #0 (see comment 1)
Proposed patch (NOT FINAL yet!) (see comment 10)

Description Bryn M. Reeves 2008-11-14 19:27:16 UTC
Description of problem:
We're seeing a max_phys_segments violation for a request submitted to CCISS. The system is configured with MD RAID1 volumes over the CCISS drives, with a pair of volume groups configured on these RAID volumes.

I.e.:

        +------+
        |  dm  |
        +--+---+
           |
        +--+---+
        |  md  |
        +--+---+
           |
        +--+---+
        | cciss|
        +------+

This triggers the BUG_ON() in do_cciss_request:

Kernel BUG at drivers/block/cciss.c:2571
invalid opcode: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:1c.1/0000:04:00.0/0000:05:00.0/irq
CPU 0
Modules linked in: mptctl ipmi_si(U) ipmi_devintf(U) ipmi_msghandler(U) hp_ilo(U) sunrpc bonding ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 xfrm_nalgo crypto_api dm_multipath video sbs backlight i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parport st sg ata_piix libata shpchp i5000_edac pcspkr e1000e ide_cd bnx2 edac_mc serio_raw cdrom floppy dm_snapshot dm_zero dm_mirror dm_mod mptspi scsi_transport_spi mptsas mptscsih scsi_transport_sas mptbase cciss sd_mod scsi_mod raid1 ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 15791, comm: tar Tainted: G      2.6.18-92.1.13.el5 #1
RIP: 0010:[<ffffffff880c7a4d>]  [<ffffffff880c7a4d>] :cciss:do_cciss_request+0x4b/0x406
RSP: 0018:ffff8100a01771a8  EFLAGS: 00010012
RAX: ffff8100a2f1d178 RBX: 0000000000000003 RCX: ffff8100a2f1d178
RDX: 000000000000000e RSI: ffff8100a2f1d178 RDI: 0000000000000000
RBP: ffff8100a2f1d178 R08: ffff8100a2f1d178 R09: 5e00000000000000
R10: ffff8100beab0e50 R11: ffff8100bb9269b0 R12: ffff8100a2f1d178
R13: 00000000000b20c9 R14: ffff8100bf370000 R15: 0000000000001000
FS:  00002b49a941cc10(0000) GS:ffffffff803a0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000fff3521c CR3: 00000000a0385000 CR4: 00000000000006e0
Process tar (pid: 15791, threadinfo ffff8100a0176000, task ffff8100bb8d7860)
Stack:  ffff8100be400270 ffffffff880c7df6 ffff8100bf6cec98 0000100000000002
ffff8100bf6cec98 00000000beab0e50 ffff81000a8f6780 0000100000000000
0000000000000000 0000000000000000 ffff81000a8f7eb0 0000100000000000
Call Trace:
[<ffffffff880c7df6>] :cciss:do_cciss_request+0x3f4/0x406
[<ffffffff800da621>] __bio_clone+0x71/0x8a
[<ffffffff88076083>] :raid1:make_request+0x196/0x606
[<ffffffff8001c15b>] cfq_set_request+0x246/0x389
[<ffffffff8008a74c>] enqueue_task+0x41/0x56
[<ffffffff80023036>] mempool_alloc+0x31/0xe7
[<ffffffff80023036>] mempool_alloc+0x31/0xe7
[<ffffffff8001c15b>] cfq_set_request+0x246/0x389
[<ffffffff8003d6c6>] lock_timer_base+0x1b/0x3c
[<ffffffff8003d6c6>] lock_timer_base+0x1b/0x3c
[<ffffffff80031fff>] del_timer+0x4e/0x57
[<ffffffff80137edd>] elv_insert+0x12a/0x1f7
[<ffffffff8000bd0a>] __make_request+0x3c9/0x418
[<ffffffff8001bc13>] generic_make_request+0x1e8/0x1ff
[<ffffffff800da621>] __bio_clone+0x71/0x8a
[<ffffffff88076083>] :raid1:make_request+0x196/0x606
[<ffffffff8804f9de>] :ext3:ext3_get_blocks_handle+0x9b1/0x9f1
[<ffffffff8814e465>] :dm_mod:__map_bio+0xbb/0x119
[<ffffffff8001bc13>] generic_make_request+0x1e8/0x1ff
[<ffffffff8814e465>] :dm_mod:__map_bio+0xbb/0x119
[<ffffffff8814ef3c>] :dm_mod:__split_bio+0x188/0x3be
[<ffffffff8814f938>] :dm_mod:dm_request+0x101/0x110
[<ffffffff8001bc13>] generic_make_request+0x1e8/0x1ff
[<ffffffff8804fc80>] :ext3:ext3_get_block+0x0/0xe3
[<ffffffff80033080>] submit_bio+0xcd/0xd4
[<ffffffff800e96e4>] mpage_bio_submit+0x22/0x26
[<ffffffff8804fc80>] :ext3:ext3_get_block+0x0/0xe3
[<ffffffff80038bca>] mpage_readpages+0xb2/0x132
[<ffffffff8804fc80>] :ext3:ext3_get_block+0x0/0xe3
[<ffffffff8000f095>] __alloc_pages+0x65/0x2ce
[<ffffffff800129dc>] __do_page_cache_readahead+0x101/0x1d9
[<ffffffff8006d940>] do_gettimeofday+0x50/0x92
[<ffffffff80028866>] sync_page+0x0/0x42
[<ffffffff8005a5d0>] getnstimeofday+0x10/0x28
[<ffffffff8009ff32>] ktime_get_ts+0x1a/0x4e
[<ffffffff800ba35b>] delayacct_end+0x5d/0x86
[<ffffffff8003205b>] blockable_page_cache_readahead+0x53/0xb2
[<ffffffff8002edd6>] make_ahead_window+0x82/0x9e
[<ffffffff80013b19>] page_cache_readahead+0x17f/0x1af
[<ffffffff8000be7f>] do_generic_mapping_read+0x126/0x3f8
[<ffffffff8000cd7b>] file_read_actor+0x0/0x154
[<ffffffff8000c29d>] __generic_file_aio_read+0x14c/0x190
[<ffffffff80016809>] generic_file_aio_read+0x34/0x39
[<ffffffff8000caa4>] do_sync_read+0xc7/0x104
[<ffffffff8000df9a>] current_fs_time+0x3b/0x40
[<ffffffff8009df87>] autoremove_wake_function+0x0/0x2e
[<ffffffff8000b337>] vfs_read+0xcb/0x171
[<ffffffff8001173f>] sys_read+0x45/0x6e
[<ffffffff8005d28d>] tracesys+0xd5/0xe0

This is:

static void do_cciss_request(request_queue_t *q)
{
        ctlr_info_t *h = q->queuedata;
        CommandList_struct *c;
        sector_t start_blk;
        int seg;
        struct request *creq;
        u64bit temp64;
        struct scatterlist tmp_sg[MAXSGENTRIES];
        drive_info_struct *drv;
        int i, dir;
[snip]
        BUG_ON(creq->nr_phys_segments > MAXSGENTRIES); <<<<

        if ((c = cmd_alloc(h, 1)) == NULL)
                goto full;

MAXSGENTRIES is 31:

drivers/block/cciss_cmd.h:#define MAXSGENTRIES            31
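
For context, the driver advertises this limit to the block layer when it sets up its request queues; in the RHEL5 cciss driver this is done along these lines (paraphrased, not an exact quote of the driver source):

	/* tell the block layer how many scatter-gather entries one
	 * controller command can carry */
	blk_queue_max_phys_segments(disk->queue, MAXSGENTRIES);
	blk_queue_max_hw_segments(disk->queue, MAXSGENTRIES);

So any request that reaches do_cciss_request() with more than 31 physical segments means a limit was lost somewhere above the driver.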

This request was submitted with 32 entries:

    crash> struct  request ffff8100a2f1d178
    struct request {
     queuelist = {
       next = 0xffff8100bf6cec98,
       prev = 0xffff8100bf6cec98
     },
     donelist = {
       next = 0xffff8100a2f1d188,
       prev = 0xffff8100a2f1d188
     },
     flags = 2372,
     sector = 729289,
     nr_sectors = 250,
     current_nr_sectors = 8,
     hard_sector = 729289,
     hard_nr_sectors = 250,
     hard_cur_sectors = 8,
     bio = 0xffff8100bf8125c0,
     biotail = 0xffff8100bf8125c0,
     elevator_private = 0xffff8100bb9269b0,
     completion_data = 0x0,
     rq_status = 1,
     errors = 0,
     rq_disk = 0xffff8100beb41400,
     start_time = 4295138633,
    *nr_phys_segments = 32,*
     nr_hw_segments = 32,    

Version-Release number of selected component (if applicable):
2.6.18-92.1.13.el5

How reproducible:
Unknown - this is seen reliably at one site but attempts to reproduce in the lab have so far failed.

Steps to Reproduce:
1. Configure LVM over MD RAID1 over CCISS
2. Submit I/O

  
Actual results:
Above BUG

Expected results:
No BUG, device limits correctly enforced across the whole stack.

Additional info:
So far we have not been able to reproduce this ourselves. We have seen similar problems in device-mapper in the past, e.g.:

  http://bugzilla.kernel.org/show_bug.cgi?id=7763

I'm not sure exactly where the limit is failing to be enforced here. In the kernel.org bugzilla above, the problem was in the dm-crypt write path, which did not properly enforce limits when cloning and modifying bios during encryption.

Comment 1 Bryn M. Reeves 2008-11-14 19:30:28 UTC
Created attachment 323633 [details]
log from BUG in comment #0

Log from above BUG.

Comment 7 Milan Broz 2008-11-30 19:32:45 UTC
Simple analysis:

The oops happens during a read request initiated from the page readahead code.

The page vectors are correctly allocated through the bio_add_page() call,
and the device stack does not modify these vectors: every layer uses
bio_clone(), which clones only the bio header, not the page vectors
themselves.

(One exception is page bouncing, but that does not apply here because,
according to the sos report, there is no high memory on the affected system.)

So the problem starts at the top: the bio is built with a different
number of physical segments than the final request queue sees.

Simulating a similar bio request on the device stack, I found that the
request queue parameters are *not* correctly propagated through the stack:

From a similar configuration on an x86_64 system (LVM over MD over SCSI):

request_queue    max_segment_size    seg_boundary_mask
SCSI             65536               0xffffffff
MD RAID1         0                   0
LVM              65536               -1 (64-bit)

Unfortunately, bio_add_page() (via bio_phys_segments()) calculates the
number of physical segments according to these parameters.

During generic_make_request() the segment count is recalculated and *can
increase* bio->bi_phys_segments, which later becomes the request's
nr_phys_segments.
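
To illustrate, here is a simplified sketch (not the exact 2.6.18 code) of the merge test applied when (re)counting physical segments; two adjacent page vectors count as one segment only if the queue parameters permit merging:

	/*
	 * Simplified sketch of the physical-segment merge test -- not the
	 * exact kernel code.  With max_segment_size == 0 and
	 * seg_boundary_mask == 0, as on the MD queue above, no two vectors
	 * can ever merge, so a recount against a lower queue can come out
	 * higher than the count bio_add_page() computed against the
	 * top-level queue.
	 */
	static int phys_segments_mergeable(struct request_queue *q,
	                                   struct bio_vec *prev,
	                                   struct bio_vec *next)
	{
	        unsigned long start = bvec_to_phys(prev);
	        unsigned long len = prev->bv_len + next->bv_len;

	        if (!BIOVEC_PHYS_MERGEABLE(prev, next))
	                return 0;       /* pages not physically contiguous */
	        if (len > q->max_segment_size)
	                return 0;       /* merged segment would be too large */
	        if ((start | q->seg_boundary_mask) !=
	            ((start + len - 1) | q->seg_boundary_mask))
	                return 0;       /* would cross the DMA boundary */
	        return 1;
	}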

BUGS:
	- the generic block layer code does not initialize some queue
	values when a stacked driver uses the simplified queue constructor
	blk_alloc_queue() and sets its own request function through
	blk_queue_make_request()

	- the parameters above remain wrong even if MD calls
	blk_queue_stack_limits() later
	(DM uses its own function for that, which sets the limits
	"correctly", but unfortunately against the uninitialized queue
	below it...)
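
For illustration, the direction a fix can take (a sketch of the idea only; the actual proposed patch is attachment 325139 [details], and the exact constants and placement may differ):

	/* sketch: give queues set up via blk_queue_make_request() sane
	 * defaults instead of leaving these fields zeroed */
	blk_queue_max_segment_size(q, MAX_SEGMENT_SIZE);
	blk_queue_segment_boundary(q, 0xffffffffUL);

	/* sketch: and have blk_queue_stack_limits() propagate these two
	 * fields as well, taking the stricter of top (t) and bottom (b) */
	t->max_segment_size = min(t->max_segment_size, b->max_segment_size);
	t->seg_boundary_mask = min(t->seg_boundary_mask, b->seg_boundary_mask);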

Reproducer:

After a lot of fiddling with constructing bio requests, I found an
extremely simple way to reproduce it:
	- create a raid1 array over a CCISS device
	- use this array for a VG
	- create an LV there
and then simply repeat the following until it oopses :-)

	dd iflag=direct if=/dev/<vg>/<lv> of=/dev/null bs=128000 count=10

This command generates a bio with 250 sectors allocated across 32 pages:
128000 bytes is 250 512-byte sectors, and 128000 / 4096 = 31 full pages
plus a 1024-byte tail, so the last, 32nd page holds only 1024 bytes.

(I also have in-kernel reproducer code that exercises this more directly.)
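
For reference, the steps above spelled out as commands (a sketch; the device, VG, and LV names here are examples only):

	# raid1 array over CCISS logical drives (example device names)
	mdadm --create /dev/md0 --level=1 --raid-devices=2 \
	      /dev/cciss/c0d1 /dev/cciss/c0d2
	# volume group and logical volume on top of the array
	pvcreate /dev/md0
	vgcreate testvg /dev/md0
	lvcreate -L 1G -n testlv testvg
	# repeat until it oopses
	dd iflag=direct if=/dev/testvg/testlv of=/dev/null bs=128000 count=10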

At the LVM layer the bio is counted as 31 segments (still OK for CCISS),
because the top-level queue parameters allow two physically adjacent pages
to merge into one segment; at the lower layer it is recalculated to 32
segments, which violates the CCISS limit and triggers the BUG_ON().

Unfortunately, the problem exists upstream too. I'll attach a patch to fix
it, but it needs to be discussed with the block layer maintainer first.
(The setting of the boundary mask is full of magic constants anyway...)

Comment 10 Milan Broz 2008-11-30 19:45:49 UTC
Created attachment 325139 [details]
Proposed patch (NOT FINAL yet!)

This is a backport of the patch to RHEL5 for testing; the original patch needs to be sent upstream first.

Comment 17 Don Zickus 2008-12-09 21:05:00 UTC
in kernel-2.6.18-126.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 21 errata-xmlrpc 2009-01-20 19:39:21 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html

Comment 23 Charlie Brady 2009-07-17 15:29:11 UTC
> After a lot of fiddling with constructing bio requests, I found an
> extremely simple way to reproduce it:
> - create a raid1 array over a CCISS device
> - use this array for a VG
> - create an LV there
> and then simply repeat the following until it oopses :-)
>
> dd iflag=direct if=/dev/<vg>/<lv> of=/dev/null bs=128000 count=10

The same procedure still creates a kernel oops with 5.3 and 5.4beta kernels.

------------[ cut here ]------------
kernel BUG at drivers/block/cciss.c:2862!
invalid opcode: 0000 [#1]
SMP
last sysfs file: /devices/pci0000:00/0000:00:08.0/0000:02:00.1/irq
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video hwmon backlight sbs i2c_ec i2c_core button battery asus_acpi ac parport_pc lp parport sr_mod cdrom sg pcspkr hpilo bnx2 serio_raw dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata cciss sd_mod scsi_mod raid1 ext3 jbd uhci_hcd ohci_hcd ehci_hcd
CPU:    10
EIP:    0060:[<f88c4f31>]    Not tainted VLI
EFLAGS: 00010012   (2.6.18-155.el5PAE #1)
EIP is at do_cciss_request+0x46/0x3a3 [cciss]
eax: f7c2a9e0   ebx: f7bbe7e4   ecx: 00000000   edx: 00000000
esi: f7c1da00   edi: 0000000c   ebp: 00000001   esp: f65e1abc
ds: 007b   es: 007b   ss: 0068
Process dd (pid: 4374, ti=f65e1000 task=f7579550 task.ti=f65e1000)
Stack: f7913800 f7bbe7e4 f7bd0000 00001116 f7c2a9e0 c04e6425 f7bbe7e4 f69a877c
       c04e4b3b f7c2a9e0 f69edb38 00000000 f69a877c c04e5974 f7913800 f7c2a9e0
       f7c2a9e0 f7bbe7e4 f7c2a9e0 c04e5a7a 00000000 006629b0 f7913800 f69edb38
Call Trace:
 [<c04e6425>] cfq_set_request+0x0/0x31f
 [<c04e4b3b>] cfq_resort_rr_list+0x23/0x8b
 [<c04e5974>] cfq_add_crq_rb+0xba/0xc3
 [<c04e5a7a>] cfq_insert_request+0x42/0x498
 [<c04db3a0>] elv_insert+0xc7/0x160
 [<c04df21d>] __make_request+0x2fb/0x344
 [<c04dd1ca>] generic_make_request+0x255/0x265
 [<c0478a60>] __bio_clone+0x6f/0x8a
 [<f884800e>] make_request+0x174/0x543 [raid1]
 [<c04dd1ca>] generic_make_request+0x255/0x265
 [<c0478a60>] __bio_clone+0x6f/0x8a
 [<f88e7442>] __map_bio+0x44/0x103 [dm_mod]
 [<f88e80d8>] __split_bio+0x428/0x438 [dm_mod]
 [<c0461a9e>] __handle_mm_fault+0x79c/0xcf8
 [<f88e87a2>] dm_request+0xe2/0xe8 [dm_mod]
 [<c04dd1ca>] generic_make_request+0x255/0x265
 [<c042c529>] lock_timer_base+0x15/0x2f
 [<c042c9e4>] del_timer+0x41/0x47
 [<c04ddf45>] __generic_unplug_device+0x1d/0x1f
 [<c04deeba>] generic_unplug_device+0x1f/0x2c
 [<f88472b4>] unplug_slaves+0x4f/0x83 [raid1]
 [<f8847300>] raid1_unplug+0xe/0x1a [raid1]
 [<f88e98fd>] dm_table_unplug_all+0x2d/0x60 [dm_mod]
 [<c0478e37>] bio_add_page+0x25/0x2e
 [<f88e7c7b>] dm_unplug_all+0x17/0x21 [dm_mod]
 [<c04df515>] blk_backing_dev_unplug+0x2f/0x32
 [<c04945a8>] __blockdev_direct_IO+0x9a9/0xba9
 [<c047adf5>] blkdev_direct_IO+0x30/0x35
 [<c047ad10>] blkdev_get_blocks+0x0/0xb5
 [<c0455d9a>] generic_file_direct_IO+0xd0/0x118
 [<c0455ffc>] __generic_file_aio_read+0xd2/0x198
 [<c0617192>] lock_kernel+0x16/0x25
 [<c04570ae>] generic_file_read+0x0/0xab
 [<c0457145>] generic_file_read+0x97/0xab
 [<c0434cef>] autoremove_wake_function+0x0/0x2d
 [<c04738a5>] vfs_read+0x9f/0x141
 [<c0473cf3>] sys_read+0x3c/0x63
 [<c0404f17>] syscall_call+0x7/0xb
 =======================
Code: 08 8b 82 f8 00 00 00 84 c0 0f 88 65 03 00 00 8b 44 24 04 e8 87 62 c1 c7 85 c0 89 44 24 10 0f 84 50 03 00 00 66 83 78 54 1f 76 08 <0f> 0b 2e 0b 20 93 8c f8 8b 44 24 08 ba 01 00 00 00 e8 ff d3 ff
EIP: [<f88c4f31>] do_cciss_request+0x46/0x3a3 [cciss] SS:ESP 0068:f65e1abc

Comment 24 Charlie Brady 2009-07-17 15:32:15 UTC
(In reply to comment #23)

> The same procedure still creates a kernel oops with 5.3 and 5.4beta kernels.

"still" or "again" - I didn't go back to 2.6.18-92.1.13.el5 to check.

FWIW I first saw this problem when installing DB2.