Bug 512387 - max_phys_segments violation with dm-linear + md raid1 + cciss
Summary: max_phys_segments violation with dm-linear + md raid1 + cciss
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.4
Hardware: i686
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Assignee: Milan Broz
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2009-07-17 16:24 UTC by Charlie Brady
Modified: 2013-03-01 04:07 UTC (History)
16 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 471639
Environment:
Last Closed: 2009-09-02 08:29:48 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
lvdump output. (17.11 KB, application/x-gzip)
2009-07-22 18:24 UTC, Charlie Brady
no flags Details
patch file (1.54 KB, patch)
2009-07-29 20:08 UTC, Paul Clements
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2009:1243 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 5.4 kernel security and bug fix update 2009-09-01 08:53:34 UTC

Description Charlie Brady 2009-07-17 16:24:31 UTC
+++ This bug was initially created as a clone of Bug #471639 +++

> After long fiddling with creating bio requests I found an extremely
> simple way to reproduce it:
> - create a raid1 array over a CCISS device
> - use this array for a VG
> - create an LV there
> and simply repeat this several times until it oopses :-)
>
> dd iflag=direct if=/dev/<vg>/<lv> of=/dev/null bs=128000 count=10

The same procedure still triggers the kernel BUG crash with the 5.3 and 5.4 beta kernels.



------------[ cut here ]------------
kernel BUG at drivers/block/cciss.c:2862!
invalid opcode: 0000 [#1]
SMP
last sysfs file: /devices/pci0000:00/0000:00:08.0/0000:02:00.1/irq
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc
ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink
iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables
x_tables ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video hwmon
backlight sbs i2c_ec i2c_core button battery asus_acpi ac parport_pc lp
parport sr_mod cdrom sg pcspkr hpilo bnx2 serio_raw dm_raid45 dm_message
dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod
ata_piix libata cciss sd_mod scsi_mod raid1 ext3 jbd uhci_hcd ohci_hcd
ehci_hcd
CPU:    10
EIP:    0060:[<f88c4f31>]    Not tainted VLI
EFLAGS: 00010012   (2.6.18-155.el5PAE #1)
EIP is at do_cciss_request+0x46/0x3a3 [cciss]
eax: f7c2a9e0   ebx: f7bbe7e4   ecx: 00000000   edx: 00000000
esi: f7c1da00   edi: 0000000c   ebp: 00000001   esp: f65e1abc
ds: 007b   es: 007b   ss: 0068
Process dd (pid: 4374, ti=f65e1000 task=f7579550 task.ti=f65e1000)
Stack: f7913800 f7bbe7e4 f7bd0000 00001116 f7c2a9e0 c04e6425 f7bbe7e4 f69a877c
       c04e4b3b f7c2a9e0 f69edb38 00000000 f69a877c c04e5974 f7913800 f7c2a9e0
       f7c2a9e0 f7bbe7e4 f7c2a9e0 c04e5a7a 00000000 006629b0 f7913800 f69edb38
Call Trace:
 [<c04e6425>] cfq_set_request+0x0/0x31f
 [<c04e4b3b>] cfq_resort_rr_list+0x23/0x8b
 [<c04e5974>] cfq_add_crq_rb+0xba/0xc3
 [<c04e5a7a>] cfq_insert_request+0x42/0x498
 [<c04db3a0>] elv_insert+0xc7/0x160
 [<c04df21d>] __make_request+0x2fb/0x344
 [<c04dd1ca>] generic_make_request+0x255/0x265
 [<c0478a60>] __bio_clone+0x6f/0x8a
 [<f884800e>] make_request+0x174/0x543 [raid1]
 [<c04dd1ca>] generic_make_request+0x255/0x265
 [<c0478a60>] __bio_clone+0x6f/0x8a
 [<f88e7442>] __map_bio+0x44/0x103 [dm_mod]
 [<f88e80d8>] __split_bio+0x428/0x438 [dm_mod]
 [<c0461a9e>] __handle_mm_fault+0x79c/0xcf8
 [<f88e87a2>] dm_request+0xe2/0xe8 [dm_mod]
 [<c04dd1ca>] generic_make_request+0x255/0x265
 [<c042c529>] lock_timer_base+0x15/0x2f
 [<c042c9e4>] del_timer+0x41/0x47
 [<c04ddf45>] __generic_unplug_device+0x1d/0x1f
 [<c04deeba>] generic_unplug_device+0x1f/0x2c
 [<f88472b4>] unplug_slaves+0x4f/0x83 [raid1]
 [<f8847300>] raid1_unplug+0xe/0x1a [raid1]
 [<f88e98fd>] dm_table_unplug_all+0x2d/0x60 [dm_mod]
 [<c0478e37>] bio_add_page+0x25/0x2e
 [<f88e7c7b>] dm_unplug_all+0x17/0x21 [dm_mod]
 [<c04df515>] blk_backing_dev_unplug+0x2f/0x32
 [<c04945a8>] __blockdev_direct_IO+0x9a9/0xba9
 [<c047adf5>] blkdev_direct_IO+0x30/0x35
 [<c047ad10>] blkdev_get_blocks+0x0/0xb5
 [<c0455d9a>] generic_file_direct_IO+0xd0/0x118
 [<c0455ffc>] __generic_file_aio_read+0xd2/0x198
 [<c0617192>] lock_kernel+0x16/0x25
 [<c04570ae>] generic_file_read+0x0/0xab
 [<c0457145>] generic_file_read+0x97/0xab
 [<c0434cef>] autoremove_wake_function+0x0/0x2d
 [<c04738a5>] vfs_read+0x9f/0x141
 [<c0473cf3>] sys_read+0x3c/0x63
 [<c0404f17>] syscall_call+0x7/0xb
 =======================
Code: 08 8b 82 f8 00 00 00 84 c0 0f 88 65 03 00 00 8b 44 24 04 e8 87 62 c1 c7 85 c0 89 44 24 10 0f 84 50 03 00 00 66 83 78 54 1f 76 08 <0f> 0b 2e 0b 20 93 8c f8 8b 44 24 08 ba 01 00 00 00 e8 ff d3 ff
EIP: [<f88c4f31>] do_cciss_request+0x46/0x3a3 [cciss] SS:ESP 0068:f65e1abc

--- Additional comment from charlieb-fedora-bugzilla.org.au on 2009-07-17 11:32:15 EDT ---

(In reply to comment #23)

> The same procedure still creates a kernel oops with 5.3 and 5.4beta kernels.

"still" or "again" - I didn't go back to 2.6.18-92.1.13.el5 to check.

FWIW I first saw this problem when installing DB2.

Comment 1 Charlie Brady 2009-07-17 16:44:50 UTC
2.6.30.1 kernel appears to be OK.

Comment 2 Milan Broz 2009-07-17 22:41:50 UTC
I am not able to reproduce it (on the same hardware where I ran the test mentioned above).

IIRC there can be another path which can violate max_phys_segments; I think this could be a different bug (it just triggers the same BUG_ON()).

Can you please provide more system info? An lvmdump, if possible.

Why are you using the PAE kernel - is there >4G of memory? If so, can you please try booting a non-PAE kernel and running the test again?

Comment 3 Charlie Brady 2009-07-18 15:00:18 UTC
> Please can you provide more system info? lvmdump, if possible.

Sure, but next Thursday will be the earliest possibility. There are two LVs: one is swap, and the other is the root partition.

> Why are you using PAE kernel

We have a variety of systems, some of which have >4G, and we wish to use the same kernel on all. Is PAE deprecated with <4G? IIRC, this system does have 8G.

> - is there >4G memory?

The crash was produced with mem=2G (to keep the dump size manageable).

I'll try non-PAE kernel as well.

Comment 4 Charlie Brady 2009-07-22 18:23:14 UTC
> If so, please can you try boot with non-PAE kernel and run test again?

non-PAE kernel also crashes.

[root@localhost ~]# cat /proc/mdstat 
Personalities : [raid1] 
md1 : active raid1 cciss/c0d1p1[1] cciss/c0d0p1[0]
      104320 blocks [2/2] [UU]
      
md2 : active raid1 cciss/c0d1p2[1] cciss/c0d0p2[0]
      292824704 blocks [2/2] [UU]
      [======>..............]  resync = 33.3% (97756800/292824704) finish=35.2min speed=92325K/sec
      
unused devices: <none>
[root@localhost ~]# lvdisplay 
  --- Logical volume ---
  LV Name                /dev/main/root
  VG Name                main
  LV UUID                hMXbly-QE7d-cQKW-jK0X-11ww-HDK3-bkGora
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                273.84 GB
  Current LE             8763
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:0
   
  --- Logical volume ---
  LV Name                /dev/main/swap
  VG Name                main
  LV UUID                tl0q2j-SUsL-5XaS-5kR9-xrKA-qNWt-zcfXqr
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                5.41 GB
  Current LE             173
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:1
   
[root@localhost ~]# fdisk -l /dev/sda
[root@localhost ~]# fdisk -l /dev/cciss/c0d0

Disk /dev/cciss/c0d0: 299.9 GB, 299966445568 bytes
255 heads, 63 sectors/track, 36468 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

           Device Boot      Start         End      Blocks   Id  System
/dev/cciss/c0d0p1   *           1          13      104391   fd  Linux raid autodetect
/dev/cciss/c0d0p2              14       36468   292824787+  fd  Linux raid autodetect
[root@localhost ~]# fdisk -l /dev/cciss/c0d1

Disk /dev/cciss/c0d1: 299.9 GB, 299966445568 bytes
255 heads, 63 sectors/track, 36468 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

           Device Boot      Start         End      Blocks   Id  System
/dev/cciss/c0d1p1   *           1          13      104391   fd  Linux raid autodetect
/dev/cciss/c0d1p2              14       36468   292824787+  fd  Linux raid autodetect
[root@localhost ~]# mount
/dev/mapper/main-root on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/md1 on /boot type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
[root@localhost ~]#

Comment 5 Charlie Brady 2009-07-22 18:24:23 UTC
Created attachment 354764 [details]
lvdump output.

Comment 6 Charlie Brady 2009-07-22 22:05:57 UTC
I've confirmed (not surprisingly) that there is no kernel crash with 2.6.30.2.

I tried to boot 2.6.19.7 (built with oldconfig + defaults), but without success - the VG 'main' wasn't found. Ditto with 2.6.18.8 and 2.6.25.20.

Comment 7 Charlie Brady 2009-07-24 14:45:18 UTC
Milan, are you waiting for any more information from me? Is there anything further I can do to help?

Comment 8 Milan Broz 2009-07-24 15:27:54 UTC
Thanks for the info, I'll need to analyse it first.

The crash log is slightly different (mainly, bio_add_page() appears here).

Comment 9 Charlie Brady 2009-07-24 16:31:54 UTC
BTW, Milan, Martin Peterson is aware of this issue. I saw him last week in Montreal and showed him the stacktrace.

Comment 10 Charlie Brady 2009-07-24 16:32:28 UTC
Sorry, Martin Petersen.

Comment 11 Charlie Brady 2009-07-24 18:22:37 UTC
I'm looking at the source of 2.6.18-128.1.6. I see that nr_phys_segments is only ever set within block/ll_rw_blk.c. Requests there should have a ceiling of q->max_phys_segments, so we are interested in tracking what sets/changes q->max_phys_segments.

bash-3.2$ grep -r max_phys_segments .
./Documentation/block/biodoc.txt:	blk_queue_max_phys_segments(q, max_segments)
./Documentation/block/biodoc.txt:blk_queue_max_phys_segments() : Sets an upper limit on the maximum number
./block/ll_rw_blk.c:	blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
./block/ll_rw_blk.c: * blk_queue_max_phys_segments - set max phys segments for a request for this queue
./block/ll_rw_blk.c:void blk_queue_max_phys_segments(request_queue_t *q, unsigned short max_segments)
./block/ll_rw_blk.c:	q->max_phys_segments = max_segments;
./block/ll_rw_blk.c:EXPORT_SYMBOL(blk_queue_max_phys_segments);
./block/ll_rw_blk.c:	t->max_phys_segments = min(t->max_phys_segments,b->max_phys_segments);
./block/ll_rw_blk.c:	if (req->nr_phys_segments + nr_phys_segs > q->max_phys_segments) {
./block/ll_rw_blk.c:	    || req->nr_phys_segments + nr_phys_segs > q->max_phys_segments) {
./block/ll_rw_blk.c:	if (total_phys_segments > q->max_phys_segments)
./block/ll_rw_blk.c:	blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
./drivers/block/DAC960.c:	blk_queue_max_phys_segments(RequestQueue, Controller->DriverScatterGatherLimit);
./drivers/block/cciss.c:		blk_queue_max_phys_segments(disk->queue, MAXSGENTRIES);
./drivers/block/cciss.c:		blk_queue_max_phys_segments(q, MAXSGENTRIES);
./drivers/block/cpqarray.c:	blk_queue_max_phys_segments(q, SG_MAX);
./drivers/block/paride/pf.c:	blk_queue_max_phys_segments(pf_queue, cluster);
./drivers/block/pktcdvd.c: * max_phys_segments value.
./drivers/block/pktcdvd.c:	if ((pd->settings.size << 9) / CD_FRAMESIZE <= q->max_phys_segments) {
./drivers/block/pktcdvd.c:	} else if ((pd->settings.size << 9) / PAGE_SIZE <= q->max_phys_segments) {
./drivers/block/pktcdvd.c:		printk("pktcdvd: cdrom max_phys_segments too small\n");
./drivers/block/sx8.c:		blk_queue_max_phys_segments(q, CARM_MAX_REQ_SG);
./drivers/block/ub.c:	blk_queue_max_phys_segments(q, UB_MAX_REQ_SG);
./drivers/block/viodasd.c:	blk_queue_max_phys_segments(q, VIOMAXBLOCKDMA);
./drivers/cdrom/viocd.c:	blk_queue_max_phys_segments(q, 1);
./drivers/ide/ide-probe.c:	blk_queue_max_phys_segments(q, max_sg_entries);
./drivers/md/dm-table.c:	lhs->max_phys_segments =
./drivers/md/dm-table.c:		min_not_zero(lhs->max_phys_segments, rhs->max_phys_segments);
./drivers/md/dm-table.c:		rs->max_phys_segments =
./drivers/md/dm-table.c:			min_not_zero(rs->max_phys_segments,
./drivers/md/dm-table.c:				     q->max_phys_segments);
./drivers/md/dm-table.c:	if (!rs->max_phys_segments)
./drivers/md/dm-table.c:		rs->max_phys_segments = MAX_PHYS_SEGMENTS;
./drivers/md/dm-table.c:	q->max_phys_segments = t->limits.max_phys_segments;
./drivers/message/i2o/i2o_block.c:	blk_queue_max_phys_segments(queue, I2O_MAX_PHYS_SEGMENTS);
./drivers/message/i2o/i2o_block.c:	osm_debug("max sectors = %d\n", queue->max_phys_segments);
./drivers/mmc/mmc_queue.c:	blk_queue_max_phys_segments(mq->queue, host->max_phys_segs);
./drivers/s390/block/dasd.c:	blk_queue_max_phys_segments(device->request_queue, -1L);
./drivers/s390/char/tape_block.c:	blk_queue_max_phys_segments(blkdat->request_queue, -1L);
./drivers/scsi/sg.c:					q->max_phys_segments);
./drivers/scsi/sg.c:	sdp->sg_tablesize = min(q->max_hw_segments, q->max_phys_segments);
./drivers/scsi/st.c:		SDp->request_queue->max_phys_segments);
./drivers/scsi/scsi_lib.c:	blk_queue_max_phys_segments(q, SCSI_MAX_PHYS_SEGMENTS);
./drivers/xen/blkfront/vbd.c:	blk_queue_max_phys_segments(rq, BLKIF_MAX_SEGMENTS_PER_REQUEST);
./fs/bio.c:	if (nr_pages > q->max_phys_segments)
./fs/bio.c:		nr_pages = q->max_phys_segments;
./fs/bio.c:	while (bio->bi_phys_segments >= q->max_phys_segments
./fs/ocfs2/cluster/heartbeat.c:	if (max_pages > q->max_phys_segments)
./fs/ocfs2/cluster/heartbeat.c:		max_pages = q->max_phys_segments;
./include/linux/blkdev.h:	unsigned short		max_phys_segments;
./include/linux/blkdev.h:extern void blk_queue_max_phys_segments(request_queue_t *, unsigned short);
./include/linux/device-mapper.h:	unsigned short		max_phys_segments;
./include/linux/i2o.h:/* defines for max_sectors and max_phys_segments */
./include/linux/mmc/host.h:	unsigned short		max_phys_segs;	/* see blk_queue_max_phys_segments */
bash-3.2$
 
Of these, drivers/block/cciss.c appears trivial, and drivers/md/dm-table.c appears most interesting.
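
For reference, the dm-table.c hits above are device-mapper's limit-stacking logic. Annotated and rearranged here from the grep output (not verbatim 2.6.18 source; the surrounding functions are omitted, and lhs/rhs/rs/t/q are the variables as they appear in the hits):

/* When device-mapper combines the restrictions of stacked devices,
 * each limit takes the most restrictive non-zero value: */
	lhs->max_phys_segments =
		min_not_zero(lhs->max_phys_segments, rhs->max_phys_segments);

/* A limit that was never set falls back to the global default: */
	if (!rs->max_phys_segments)
		rs->max_phys_segments = MAX_PHYS_SEGMENTS;

/* ...and the combined value is finally applied to the dm device's own queue: */
	q->max_phys_segments = t->limits.max_phys_segments;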

Comment 12 Milan Broz 2009-07-24 21:26:17 UTC
FYI: I was able to reproduce it with some effort on one of the systems (with the -128 and -155 kernels).

It seems like another path where code can violate max_phys_segments for a bio.

Comment 13 Milan Broz 2009-07-27 08:28:13 UTC
What's the problem:

The cciss driver fails on:
  BUG_ON(creq->nr_phys_segments > MAXSGENTRIES);

Because there is a stack of devices (cciss -> md raid1 -> dm-linear),
some path miscalculates the segment count in the bio and violates the limit.

The number of segments depends on queue restrictions; according to blk_recount_segments(), it depends on these queue parameters (a simplified sketch follows the list):

  q->bounce_pfn
  q->max_segment_size
  q->seg_boundary_mask
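
A simplified sketch of that counting logic (paraphrased from the 2.6.18-era blk_recount_segments() in block/ll_rw_blk.c, with the hw-segment handling and clustering flags trimmed), showing where each of the three fields enters:

static void recount_phys_segments_sketch(request_queue_t *q, struct bio *bio)
{
	struct bio_vec *bv, *bvprv = NULL;
	int i, nr_phys_segs = 0, seg_size = 0;
	int high, highprv = 1;

	bio_for_each_segment(bv, bio, i) {
		/* a page above q->bounce_pfn may be bounced later, so it
		 * is never merged into the previous segment */
		high = page_to_pfn(bv->bv_page) > q->bounce_pfn;
		if (high || highprv)
			goto new_segment;
		/* merging is also limited by the maximum segment size
		 * and by the DMA segment boundary of this queue */
		if (seg_size + bv->bv_len > q->max_segment_size)
			goto new_segment;
		if (!BIOVEC_PHYS_MERGEABLE(bvprv, bv) ||
		    !BIOVEC_SEG_BOUNDARY(q, bvprv, bv))
			goto new_segment;

		seg_size += bv->bv_len;
		bvprv = bv;
		highprv = high;
		continue;
new_segment:
		nr_phys_segs++;
		seg_size = bv->bv_len;
		bvprv = bv;
		highprv = high;
	}

	bio->bi_phys_segments = nr_phys_segs;
	bio->bi_flags |= (1 << BIO_SEG_VALID);
}

Because bounce_pfn feeds directly into the first test, the same bio can legitimately count as a different number of segments on different queues.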

The last two are now correctly propagated through the stack (thanks to the previous patch), but bounce_pfn can be different:

- all dm devices set BLK_BOUNCE_ANY here
- md does not set this value explicitly, so the block layer defaults it to BLK_BOUNCE_HIGH
- cciss sets it explicitly from its dma_mask

So in reality, every device in the stack can have a different bounce_pfn!

The real problem is when a virtual block device (MD or DM) clones a bio request using the __bio_clone() function - this recalculates the segment count for the _new_ clone using the _old_ block device's queue and sets BIO_SEG_VALID.

DM correctly resets this flag here, so after remapping (i.e. a queue change) the segments are recalculated.
MD does not do this.

Now, the segment count can be calculated using a more restrictive bounce_pfn (so every page in high memory is counted as a possible new segment because of possible bouncing), and because BIO_SEG_VALID is set, the segment count is _not_ recalculated for the underlying device's queue later.

The upstream fix is simply to not recalculate segments in bio_clone:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5d84070ee0a433620c57e85dac7f82faaec5fbb3
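
The gist of that change (paraphrased here, not the verbatim diff): __bio_clone() no longer looks up the old device's queue and recomputes the counts, so the clone is not marked BIO_SEG_VALID and the segments get recalculated later against the queue the bio is actually submitted to.

 void __bio_clone(struct bio *bio, struct bio *bio_src)
 {
-	request_queue_t *q = bdev_get_queue(bio_src->bi_bdev);
-
 	memcpy(bio->bi_io_vec, bio_src->bi_io_vec,
 		bio_src->bi_max_vecs * sizeof(struct bio_vec));
 	...
 	bio->bi_idx = bio_src->bi_idx;
-	bio_phys_segments(q, bio);
-	bio_hw_segments(q, bio);
 }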

Exactly the same patch fixes it in RHEL5 for me.

Comment 15 RHEL Program Management 2009-07-27 08:54:06 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 16 Milan Broz 2009-07-27 08:59:54 UTC
Charlie, 
please can you verify that this experimental kernel build fixes the issue for you (attached only i686 build)?

see http://people.redhat.com/mbroz/pkg/kernel/

Comment 18 Charlie Brady 2009-07-27 15:30:16 UTC
(In reply to comment #16)
> Charlie, 
> please can you verify that this experimental kernel build fixes the issue for
> you (attached only i686 build)?

The dd command no longer crashes the kernel. It'll take longer for me to test whether DB2 install will do so, but signs do look good :-)  Thanks.

Comment 19 Charlie Brady 2009-07-28 13:09:32 UTC
As expected, DB2 install completes successfully with the updated kernel.

Comment 22 Charlie Brady 2009-07-29 15:00:48 UTC
> see http://people.redhat.com/mbroz/pkg/kernel/ 

If you still have it, could you please add the kernel-PAE-devel rpm? Thanks.

Is there any chance that this will be included in a RHEL5.3 update kernel?

Comment 23 Paul Clements 2009-07-29 20:07:41 UTC
Just want to report that I've seen this problem on the 2.6.18-128 kernel using

lvm -> raid1 -> SCSI/nbd

There is no panic, but the error is:

Incorrect number of segments after building list
counted 94, received 64
req nr_sec 992, cur_nr_sec 8 

coming from SCSI. The patch that Milan Broz gave me fixed the issue. I hope this gets into the next Red Hat kernel. I'll attach the patch, which went in upstream.

Comment 24 Paul Clements 2009-07-29 20:08:46 UTC
Created attachment 355604 [details]
patch file

Comment 25 Milan Broz 2009-07-30 10:26:12 UTC
(In reply to comment #22)
> > see http://people.redhat.com/mbroz/pkg/kernel/ 
> 
> If you still have it, could you please add the kernel-PAE-devel rpm? Thanks.

It's there (but I'll remove the RPMs soon).

> Is there any chance that this will be included in a RHEL5.3 update kernel?  

Please use standard Red Hat support channels for that question.

Comment 28 Don Zickus 2009-08-05 14:09:10 UTC
in kernel-2.6.18-162.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 30 michal novacek 2009-08-05 15:38:53 UTC
Reproduced on 2.6.18-128.el5PAE (rhel 5.3)

NOT reproduced on 2.6.18-161.el5PAE kernel.

Comment 34 michal novacek 2009-08-07 09:54:11 UTC
Reproduced on the 2.6.18-160.el5 kernel as well.

Comment 35 michal novacek 2009-08-10 17:35:17 UTC
The reproducer, bonnie++ and tiobench over lvm+md{raid0,raid1,raid5}
on AACRAID and MEGARAID_SAS went fine with kernel 2.6.18-161.el5.

Also checked with the MPTSAS driver, with / on lvm2 on dm-raid0.

Comment 37 errata-xmlrpc 2009-09-02 08:29:48 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html

