Description of problem: Adding a new lun resulted in file corruption on already mounted and active lun's on an Oracle production server. Version-Release number of selected component (if applicable): OS : Red Hat Enterprise Linux Server release 5.5 (Tikanga) Kernel : 2.6.18-194.11.4.el5 Hardware : HP ProLiant BL460c G1 HBA : QLogic Corp. ISP2432-based 4Gb Fibre Channel to PCI Express HBA Filesystem : ext3 How reproducible: Not tested (yet) Steps to Reproduce: ]# multipath -ll - rescan hba's to detect new presented lun ]# echo "- - -" > /sys/class/scsi_host/host0/scan ]# echo "- - -" > /sys/class/scsi_host/host1/scan ]# multipath -ll - adding alias for new lun ]# vi /etc/multipath.conf ]# multipath -r - bio log events start occuring: 2010-09-30 08:39:14 0 4 10.x.x.x bio too big device dm-5 (176 > 64) kernel: 2010-09-30 08:39:14 0 4 10.x.x.x bio too big device dm-5 (72 > 64) kernel: 2010-09-30 08:39:14 3 3 10.x.x.x dm-6: devmap already registered multipathd: 2010-09-30 08:39:14 3 5 10.x.x.x dm-6: add map (uevent) multipathd: 2010-09-30 08:39:14 3 3 10.x.x.x dm-5: devmap already registered multipathd: 2010-09-30 08:39:14 3 5 10.x.x.x dm-5: add map (uevent) multipathd: 2010-09-30 08:39:14 3 3 10.x.x.x dm-4: devmap already registered multipathd: 2010-09-30 08:39:14 3 5 10.x.x.x dm-4: add map (uevent) multipathd: 2010-09-30 08:39:14 0 4 10.x.x.x bio too big device dm-0 (1024 > 64) kernel: 2010-09-30 08:39:14 0 4 10.x.x.x bio too big device dm-0 (1024 > 64) kernel: Actual results: File corruption on already active and mounted lun's Expected results: Solid operation and no file corruption. This procedure has been followed many times on other servers without problems. Is this procedure oke, or are we very unlucky to run into a bug which seldom occurs? Additional info: Server is a boot from SAN setup with several big data lun's presented. bio events where reported for each lun including the OS lun (dm-0) but file corruption seemed to have occurred only on dm-5, one of the Oracle data lun's. Oracle uses 3 big ext3 filesystems without ASM. Info on the dm-5 device: /dev/dm-5 hosts LVM2 volumegroup vg_data004, size 1.95T (ext3) LVM2 pv - vg - lv are all on this lun Multipath info on dm-5 : data004 (3600508b40010889b000090000c650000) dm-5 HP,HSV210 [size=2.0T][features=1 queue_if_no_path][hwhandler=0][rw] \_ round-robin 0 [prio=100][active] \_ 0:0:3:13 sdl 8:176 [active][ready] \_ 1:0:3:13 sdx 65:112 [active][ready] \_ round-robin 0 [prio=20][enabled] \_ 0:0:2:13 sdf 8:80 [active][ready] \_ 1:0:2:13 sdr 65:16 [active][ready]
How can above bug be related to m17n-db?
Moving this to kernel people in case they can help here.
The same server yesterday (Februari 2nd), now running Red Hat Enterprise Linux Server release 5.6 Kernel 2.6.18-238.el5 x86_64 Via Qlogic SanSurfer software (Remote Agent for Linux Version 1.00.1280-19) a HP technical consultant modified the following parameter: SanSurfer : "HBA -> Port -> Parameters -> Advanced HBA Parameters -> Execution Throttle" changed from 16 to 32 "-> Save" This server has one HBA with 2 ports. After modifying the first port, the following showed up in /var/log/messages: Feb 2 15:34:52 scomp1037 kernel: qla2xxx 0000:10:00.0: Performing ISP error recovery - ha= ffff81082d5bc4f8. Feb 2 15:34:53 scomp1037 kernel: rport-0:0-0: blocked FC remote port time out: saving binding Feb 2 15:34:53 scomp1037 kernel: rport-0:0-1: blocked FC remote port time out: saving binding Feb 2 15:34:53 scomp1037 kernel: rport-0:0-2: blocked FC remote port time out: saving binding Feb 2 15:34:53 scomp1037 kernel: rport-0:0-3: blocked FC remote port time out: saving binding Feb 2 15:34:53 scomp1037 kernel: sd 0:0:3:1: SCSI error: return code = 0x00010000 Feb 2 15:34:53 scomp1037 kernel: end_request: I/O error, dev sdh, sector 11776429 Feb 2 15:34:53 scomp1037 kernel: device-mapper: multipath: Failing path 8:112. Feb 2 15:34:53 scomp1037 kernel: sd 0:0:3:1: SCSI error: return code = 0x00010000 Feb 2 15:34:53 scomp1037 kernel: end_request: I/O error, dev sdh, sector 11825493 Feb 2 15:34:53 scomp1037 kernel: sd 0:0:3:1: SCSI error: return code = 0x00010000 Feb 2 15:34:53 scomp1037 kernel: end_request: I/O error, dev sdh, sector 11923829 Feb 2 15:34:53 scomp1037 kernel: sd 0:0:3:1: SCSI error: return code = 0x00010000 Feb 2 15:34:53 scomp1037 multipathd: dm-2: add map (uevent) Feb 2 15:34:53 scomp1037 kernel: end_request: I/O error, dev sdh, sector 21410149 Feb 2 15:34:53 scomp1037 kernel: sd 0:0:3:1: SCSI error: return code = 0x00010000 Feb 2 15:34:53 scomp1037 kernel: end_request: I/O error, dev sdh, sector 31895325 Feb 2 15:34:53 scomp1037 kernel: sd 0:0:3:10: SCSI error: return code = 0x00010000 Feb 2 15:34:53 scomp1037 kernel: end_request: I/O error, dev sdj, sector 1002495848 Feb 2 15:34:53 scomp1037 kernel: device-mapper: multipath: Failing path 8:144. Feb 2 15:34:53 scomp1037 kernel: sd 0:0:3:10: SCSI error: return code = 0x00010000 Feb 2 15:34:53 scomp1037 kernel: end_request: I/O error, dev sdj, sector 6730900 Feb 2 15:34:53 scomp1037 kernel: sd 0:0:3:11: SCSI error: return code = 0x00010000 Feb 2 15:34:53 scomp1037 kernel: end_request: I/O error, dev sdk, sector 1431896872 Feb 2 15:34:53 scomp1037 kernel: device-mapper: multipath: Failing path 8:160. Feb 2 15:34:53 scomp1037 kernel: sd 0:0:3:11: SCSI error: return code = 0x00010000 Feb 2 15:34:53 scomp1037 kernel: end_request: I/O error, dev sdk, sector 2042291448 Feb 2 15:34:53 scomp1037 kernel: sd 0:0:3:11: SCSI error: return code = 0x00010000 Feb 2 15:34:53 scomp1037 kernel: end_request: I/O error, dev sdk, sector 2042291704 Feb 2 15:34:53 scomp1037 kernel: sd 0:0:3:11: SCSI error: return code = 0x00010000 Feb 2 15:34:53 scomp1037 kernel: end_request: I/O error, dev sdk, sector 2042292184 Feb 2 15:34:53 scomp1037 kernel: sd 0:0:3:11: SCSI error: return code = 0x00010000 Feb 2 15:34:53 scomp1037 kernel: end_request: I/O error, dev sdk, sector 2042292504 Feb 2 15:34:53 scomp1037 kernel: sd 0:0:3:11: SCSI error: return code = 0x00010000 Feb 2 15:34:53 scomp1037 kernel: end_request: I/O error, dev sdk, sector 2042292792 Feb 2 15:34:53 scomp1037 kernel: sd 0:0:3:11: SCSI error: return code = 0x00010000 Feb 2 15:34:53 scomp1037 kernel: end_request: I/O error, dev sdk, sector 2042294040 Feb 2 15:34:53 scomp1037 kernel: sd 0:0:3:11: SCSI error: return code = 0x00010000 Feb 2 15:34:53 scomp1037 kernel: end_request: I/O error, dev sdk, sector 2599415176 Feb 2 15:34:53 scomp1037 kernel: sd 0:0:3:11: SCSI error: return code = 0x00010000 Feb 2 15:34:53 scomp1037 kernel: end_request: I/O error, dev sdk, sector 2599425888 Feb 2 15:34:53 scomp1037 kernel: sd 0:0:3:11: SCSI error: return code = 0x00010000 Feb 2 15:34:53 scomp1037 kernel: end_request: I/O error, dev sdk, sector 2599576024 Feb 2 15:34:53 scomp1037 kernel: sd 0:0:3:11: SCSI error: return code = 0x00010000 Feb 2 15:34:53 scomp1037 kernel: end_request: I/O error, dev sdk, sector 3061938608 Feb 2 15:34:53 scomp1037 kernel: sd 0:0:3:12: SCSI error: return code = 0x00010000 Feb 2 15:34:53 scomp1037 kernel: end_request: I/O error, dev sdl, sector 1403427096 Feb 2 15:34:53 scomp1037 kernel: device-mapper: multipath: Failing path 8:176. Feb 2 15:34:53 scomp1037 kernel: qla2xxx 0000:10:00.0: LOOP UP detected (4 Gbps). Feb 2 15:34:58 scomp1037 multipathd: dm-2: devmap already registered Feb 2 15:34:58 scomp1037 multipathd: dm-4: add map (uevent) Feb 2 15:34:58 scomp1037 multipathd: dm-4: devmap already registered Feb 2 15:34:58 scomp1037 multipathd: 8:112: mark as failed Feb 2 15:34:58 scomp1037 multipathd: system: remaining active paths: 3 Feb 2 15:34:58 scomp1037 multipathd: dm-5: add map (uevent) Feb 2 15:34:58 scomp1037 multipathd: dm-5: devmap already registered Feb 2 15:34:58 scomp1037 multipathd: 8:144: mark as failed Feb 2 15:34:58 scomp1037 multipathd: datanew1: remaining active paths: 3 Feb 2 15:34:58 scomp1037 multipathd: 8:160: mark as failed Feb 2 15:34:58 scomp1037 multipathd: data002: remaining active paths: 3 Feb 2 15:34:58 scomp1037 multipathd: 8:176: mark as failed Feb 2 15:34:58 scomp1037 multipathd: data003: remaining active paths: 3 Feb 2 15:34:58 scomp1037 multipathd: dm-6: add map (uevent) Feb 2 15:34:58 scomp1037 multipathd: dm-6: devmap already registered Feb 2 15:34:58 scomp1037 multipathd: sdh: tur checker reports path is up Feb 2 15:34:58 scomp1037 multipathd: 8:112: reinstated Feb 2 15:34:58 scomp1037 multipathd: system: remaining active paths: 4 Feb 2 15:34:58 scomp1037 multipathd: /sbin/mpath_prio_alua exitted with 5 Feb 2 15:34:58 scomp1037 multipathd: error calling out /sbin/mpath_prio_alua /dev/sda Feb 2 15:34:58 scomp1037 multipathd: system: load table [0 52428800 multipath 0 0 3 2 round-robin 0 1 1 8:0 100 round-robin 0 2 1 8:112 100 65:80 100 round-robin 0 Feb 2 15:34:58 scomp1037 multipathd: sdj: tur checker reports path is up Feb 2 15:34:58 scomp1037 multipathd: 8:144: reinstated Feb 2 15:34:58 scomp1037 multipathd: datanew1: remaining active paths: 4 Feb 2 15:34:58 scomp1037 multipathd: /sbin/mpath_prio_alua exitted with 5 Feb 2 15:34:58 scomp1037 multipathd: error calling out /sbin/mpath_prio_alua /dev/sdc Feb 2 15:34:58 scomp1037 multipathd: datanew1: load table [0 1048576000 multipath 0 0 3 2 round-robin 0 1 1 8:32 100 round-robin 0 2 1 8:144 100 65:112 100 round-ro Feb 2 15:34:58 scomp1037 multipathd: sdk: tur checker reports path is up Feb 2 15:34:58 scomp1037 multipathd: 8:160: reinstated Feb 2 15:34:58 scomp1037 multipathd: data002: remaining active paths: 4 Feb 2 15:34:58 scomp1037 multipathd: /sbin/mpath_prio_alua exitted with 5 Feb 2 15:34:58 scomp1037 multipathd: error calling out /sbin/mpath_prio_alua /dev/sdd Feb 2 15:34:58 scomp1037 multipathd: data002: load table [0 4194304000 multipath 0 0 3 2 round-robin 0 1 1 8:48 100 round-robin 0 2 1 8:160 100 65:128 100 round-rob Feb 2 15:34:58 scomp1037 multipathd: sdl: tur checker reports path is up Feb 2 15:34:58 scomp1037 multipathd: 8:176: reinstated Feb 2 15:34:58 scomp1037 multipathd: data003: remaining active paths: 4 Feb 2 15:34:58 scomp1037 multipathd: /sbin/mpath_prio_alua exitted with 5 Feb 2 15:34:58 scomp1037 multipathd: error calling out /sbin/mpath_prio_alua /dev/sde Feb 2 15:34:58 scomp1037 multipathd: data003: load table [0 4194304000 multipath 0 0 3 2 round-robin 0 1 1 8:64 100 round-robin 0 2 1 8:176 100 65:144 100 round-rob Feb 2 15:34:58 scomp1037 multipathd: dm-2: add map (uevent) Feb 2 15:34:58 scomp1037 multipathd: dm-2: devmap already registered Feb 2 15:34:58 scomp1037 multipathd: dm-2: add map (uevent) Feb 2 15:34:58 scomp1037 multipathd: dm-2: devmap already registered Feb 2 15:34:58 scomp1037 multipathd: dm-4: add map (uevent) Feb 2 15:34:58 scomp1037 multipathd: dm-4: devmap already registered Feb 2 15:34:58 scomp1037 multipathd: dm-4: add map (uevent) Feb 2 15:34:58 scomp1037 multipathd: dm-4: devmap already registered Feb 2 15:34:58 scomp1037 multipathd: dm-5: add map (uevent) Feb 2 15:34:58 scomp1037 multipathd: dm-5: devmap already registered Feb 2 15:34:58 scomp1037 multipathd: dm-5: add map (uevent) Feb 2 15:34:58 scomp1037 multipathd: dm-5: devmap already registered Feb 2 15:34:58 scomp1037 multipathd: dm-6: add map (uevent) Feb 2 15:34:58 scomp1037 multipathd: dm-6: devmap already registered Feb 2 15:34:58 scomp1037 multipathd: dm-6: add map (uevent) Feb 2 15:34:58 scomp1037 multipathd: dm-6: devmap already registered - This line is from a Oracle log 2011-02-02 15:34:59 IO errors.... instance terminated Feb 2 15:34:59 scomp1037 kernel: bio too big device dm-4 (518 > 64) Feb 2 15:34:59 scomp1037 kernel: bio too big device dm-6 (256 > 64) Feb 2 15:34:59 scomp1037 kernel: bio too big device dm-6 (256 > 64) Feb 2 15:34:59 scomp1037 kernel: bio too big device dm-4 (88 > 64) Feb 2 15:34:59 scomp1037 kernel: bio too big device dm-4 (256 > 64) Feb 2 15:34:59 scomp1037 kernel: bio too big device dm-4 (192 > 64) Feb 2 15:34:59 scomp1037 kernel: bio too big device dm-6 (224 > 64) Feb 2 15:34:59 scomp1037 kernel: bio too big device dm-6 (96 > 64) Feb 2 15:34:59 scomp1037 kernel: bio too big device dm-6 (256 > 64) Feb 2 15:34:59 scomp1037 kernel: bio too big device dm-6 (256 > 64) Feb 2 15:34:59 scomp1037 kernel: bio too big device dm-4 (88 > 64) Feb 2 15:34:59 scomp1037 kernel: bio too big device dm-4 (208 > 64) Feb 2 15:34:59 scomp1037 kernel: bio too big device dm-4 (128 > 64) Feb 2 15:34:59 scomp1037 kernel: bio too big device dm-4 (88 > 64) Feb 2 15:34:59 scomp1037 kernel: bio too big device dm-4 (96 > 64) A server reboot was necessary.
"2011-02-03 07:39:52 EST" is reproduce-able , w've cloned the system disk, presented it on an other server (same type of blade ) , fully updated the system via RHN : kernel :2.6.18-238.1.1.el5 ]# rpm -qfi /etc/init.d/multipathd Name : device-mapper-multipath Relocations: (not relocatable) Version : 0.4.7 Vendor: Red Hat, Inc. Release : 42.el5 Build Date: Tue 07 Dec 2010 07:44:21 PM CET ]# modinfo qla2xxx filename: /lib/modules/2.6.18-238.1.1.el5/kernel/drivers/scsi/qla2xxx/qla2xxx.ko version: 8.03.01.05.05.06-k license: GPL description: QLogic Fibre Channel HBA Driver Same procedure to modify "Execution Throttle" and BIO messages appeared. Next step is to install HP qlogic modules and see if this solves this problem.
Maybe it's related to Oracle itself. As soon as Oracle performs an IO request (not sure yet if it's only write), the bio messages start to appear. Creating, deleting files on the console seems not to trigger any errors on the same partition/lun. What is also interesting, is that Oracle seems to ignore setting the max_sectors_kb Linux kernel parameter. We use a script to adjust this parameter in order to be able to reduce IO load on the HP EVA's SAN systems. A HP san/IO expert noticed that even when we shrink the parameter, the OS honors this but Oracle still bypasses the OS sending it's own blocksize ?! ]# cat modparam.sh #!/bin/bash # default 512 maxsize=32 case "$1" in "-b" | "-B" ) echo "Setting max_sectors_kb to $maxsize ..." for i in `ls -a -d /sys/block/sd*` ; do echo "$i/queue/max_sectors_kb" ; echo $maxsize > $i/queue/max_sectors_kb ; done ;; * ) for i in `ls -a -d /sys/block/sd*` ; do echo "$i/queue/scheduler"; cat $i/queue/scheduler ; done for i in `ls -a -d /sys/block/sd*` ; do echo "$i/queue/max_sectors_kb"; cat $i/queue/max_sectors_kb ; done ;; esac
Today again a bio error: Jun 1 17:00:26 scomp1037 multipathd: system: load table [0 52428800 multipath 0 0 2 1 round-robin 0 2 1 8:112 100 65:80 100 round-robin 0 2 1 8:0 100 8:224 100] Jun 1 17:00:26 scomp1037 multipathd: dm-2: add map (uevent) Jun 1 17:00:26 scomp1037 multipathd: dm-2: devmap already registered Jun 1 17:01:16 scomp1037 multipathd: data004: load table [0 4194304000 multipath 0 0 2 1 round-robin 0 2 1 65:160 100 8:192 100 round-robin 0 2 1 8:80 100 65:48 100 Jun 1 17:01:16 scomp1037 multipathd: samba: load table [0 1048576000 multipath 0 0 2 1 round-robin 0 2 1 8:128 100 65:96 100 round-robin 0 2 1 8:16 100 8:240 100] Jun 1 17:01:16 scomp1037 multipathd: data002: load table [0 4194304000 multipath 0 0 2 1 round-robin 0 2 1 8:160 100 65:128 100 round-robin 0 2 1 8:48 100 65:16 100 Jun 1 17:01:16 scomp1037 multipathd: dm-0: add map (uevent) Jun 1 17:01:16 scomp1037 multipathd: dm-0: devmap already registered Jun 1 17:01:16 scomp1037 multipathd: dm-3: add map (uevent) Jun 1 17:01:16 scomp1037 multipathd: dm-3: devmap already registered Jun 1 17:01:16 scomp1037 multipathd: dm-5: add map (uevent) Jun 1 17:01:16 scomp1037 multipathd: dm-5: devmap already registered Jun 1 17:05:35 scomp1037 kernel: bio too big device dm-5 (288 > 256) ]# uname -a Linux scomp1037.wurnet.nl 2.6.18-238.9.1.el5 #1 SMP Fri Mar 18 12:42:39 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux Could priority be raised to high please?
Another case of this seen on production server with ext3 file system used by Oracle. 2.6.18-238.5.1.el5 #1 SMP Mon Feb 21 05:52:39 EST 2011 x86_64 x86_64 x86_64 GNU/Linux Red Hat Enterprise Linux Server release 5.6 (Tikanga) HP ProLiant DL380 G6 QLogic Fibre Channel HBA Driver: 8.03.01.05.05.06-k device-mapper-multipath-0.4.7-42.el5_6.1.x86_64 Ext3 file system built on lvm volume group. Jul 21 08:01:13 host kernel: bio too big device dm-9 (957 > 256) Jul 21 08:01:14 host kernel: bio too big device dm-10 (1017 > 256) << MANY LOGGED >>
I am seeing this error on our rhel5u6 . Sep 1 17:40:13 host kernel: bio too big device dm-8 (416 > 256) Sep 1 17:40:13 host kernel: bio too big device dm-8 (512 > 256) Sep 1 17:40:13 host kernel: bio too big device dm-8 (512 > 256) Sep 1 17:40:13 host kernel: bio too big device dm-8 (416 > 256) Sep 1 17:40:13 host kernel: bio too big device dm-8 (512 > 256) Aug 31 21:43:31 host multipathd: /sbin/mpath_prio_alua exitted with 1 Aug 31 21:43:31 host multipathd: error calling out /sbin/mpath_prio_alua /dev/sdu Aug 31 21:43:31 host multipathd: dm-12: devmap already registered Aug 31 20:11:33 host multipathd: /sbin/mpath_prio_alua exitted with 1 Aug 31 20:11:33 host multipathd: error calling out /sbin/mpath_prio_alua /dev/sdq Aug 31 20:11:33 host multipathd: dm-8: devmap already registered Kernel 2.6.18-238.9.1.el5 ext3 file system with LVM striping . Mainly seeing this error up on Read. We are also setting the max_sectors_kb 128 as per EVA best practice . Storage eva6400 with full 8 path and a total of 10 LUN . Could you please advice on the resolution of this error?
Well maybe time someone from HP contacts RedHat direct about this issue please? I've filed this bug almost a year ago without any response from RedHat so far.
If the underlying device limits are being made more restrictive due to something you do externally, linux has no mechanism to propagate those changes up the device stack automatically and you must update them yourself. Best not to try making such changes while I/O is flowing through the stack. If you can't manage to set the lower limits manually (through /sys perhaps) in advance of the external change, then for the dm layers of the stack (including multipath), reload the tables and suspend them, make the external change, then resume them.
(In device-mapper, a 'resume' that follows a 'reload' will reset the limits based on the current state of the devices directly referenced in the table.)
(In reply to comment #11) > If the underlying device limits are being made more restrictive due to > something you do externally, linux has no mechanism to propagate those changes > up the device stack automatically and you must update them yourself. Best not > to try making such changes while I/O is flowing through the stack. If you > can't manage to set the lower limits manually (through /sys perhaps) in advance > of the external change, then for the dm layers of the stack (including > multipath), reload the tables and suspend them, make the external change, then > resume them. Thanks for you time! We do not make any changes externally like on the EVA or such. We use /sys/block/sd* max_sectors_kb to change this parameter (see my previous post 2011-03-15 12:08:15 EDT) , but probably this is to low level or high level with regards to multipath to be propagated correct?
If you change it at one level of the stack that change does NOT propagate automatically - a dm multipath layer above will only notice the change if you follow it with a dm reload+resume at that layer. You'll need to set it at *every* level of the stack the same way and to avoid problems with I/O in flight you should make the changes top-down not bottom up.
BTW I intend to close this bugzilla soon as WONT_FIX. In the long term (years rather than months), I hope we can find a way to make changes propagate automatically, but that's a matter for upstream work and it is not easy (or it would already have been done). In the mean time, there are manual (or scriptable) workarounds as I indicated above provided you know in advance that the change is about to occur.
A question we could investigate: whether we could have an option to recalculate the limits on (some?) resumes that don't follow a reload.