+++ This bug was initially created as a clone of Bug #585430 +++

Description of problem:
We do not support LUN remapping, but if somehow it were to happen, we could silently corrupt data. This is a request to add a printk to indicate that we got REPORTED_LUNS_DATA_CHANGED but we cannot handle it and it is not supported. This way at least our support team can identify the problem without having to look through logs for 4 months :)

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
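Once the printk is in place, a support engineer could scan for it directly instead of combing months of logs. A minimal sketch, assuming the message text matches the warning shown later in this thread:

# Scan current and rotated logs for the LUN-remap warning:
grep -h "LUN assignments on this target have changed" /var/log/messages*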
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
This is fixed in kernel-2.6.18-223.el5. You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.
Mike,

Can you specify what "remap LUN" means? I tried it this way, but found no issue on RHEL 5.5 GA:

1. Map LUN name V0048 to the host as LUN 11.

[root@storageqe-05 ~]# multipath -l mpath1
mpath1 (20090ef1270000030) dm-19 IQSTOR,iQ2880
[size=3.9G][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=0][active]
 \_ 1:0:0:11 sdat 66:208 [active][undef]
\_ round-robin 0 [prio=0][enabled]
 \_ 1:0:1:11 sdbp 68:48 [active][undef]
\_ round-robin 0 [prio=0][enabled]
 \_ 0:0:0:11 sdb 8:16 [active][undef]
\_ round-robin 0 [prio=0][enabled]
 \_ 0:0:1:11 sdx 65:112 [active][undef]

2. Map LUN name V0049 to the host as LUN 22:

[root@storageqe-05 ~]# multipath -l mpath10
mpath10 (20090ef1270000031) dm-20 IQSTOR,iQ2880
[size=3.9G][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=0][enabled]
 \_ 1:0:0:22 sdau 66:224 [active][undef]
\_ round-robin 0 [prio=0][enabled]
 \_ 1:0:1:22 sdbq 68:64 [active][undef]
\_ round-robin 0 [prio=0][enabled]
 \_ 0:0:0:22 sdc 8:32 [active][undef]
\_ round-robin 0 [prio=0][enabled]
 \_ 0:0:1:22 sdy 65:128 [active][undef]

3. Create a file on mpath1 and mpath10 as an identifier.
4. Remap LUN V0049 to the host as LUN 11, and V0048 to the host as LUN 21.
5. Both mpath1 and mpath10 report path down.
6. Reboot the host. It comes back online with the correct mpath1 and mpath10, with the LUN IDs changed.

[root@storageqe-05 ~]# multipath -l mpath1
mpath1 (20090ef1270000030) dm-20 IQSTOR,iQ2880
[size=3.9G][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=0][active]
 \_ 1:0:0:21 sdau 66:224 [active][undef]
\_ round-robin 0 [prio=0][enabled]
 \_ 1:0:1:21 sdbq 68:64 [active][undef]
\_ round-robin 0 [prio=0][enabled]
 \_ 0:0:0:21 sdc 8:32 [active][undef]
\_ round-robin 0 [prio=0][enabled]
 \_ 0:0:1:21 sdy 65:128 [active][undef]

[root@storageqe-05 ~]# multipath -l mpath10
mpath10 (20090ef1270000031) dm-19 IQSTOR,iQ2880
[size=3.9G][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=0][active]
 \_ 1:0:0:11 sdat 66:208 [active][undef]
\_ round-robin 0 [prio=0][enabled]
 \_ 1:0:1:11 sdbp 68:48 [active][undef]
\_ round-robin 0 [prio=0][enabled]
 \_ 0:0:0:11 sdb 8:16 [active][undef]
\_ round-robin 0 [prio=0][enabled]
 \_ 0:0:1:11 sdx 65:112 [active][undef]

Please let me know if you need more information.
(In reply to comment #5)
> 5. Both mpath1 and mpath10 report path down.

Do the paths come back up? How do you know they are down? Is that what is seen in multipath's output? What do you see in /var/log/messages? Are there errors, or do you see the rport getting deleted? Normally you would not see the path go down.

> 6. Reboot the host. It comes back online with the correct mpath1 and mpath10, with the LUN IDs

You do not want to reboot the box. Rebooting would be a workaround for the problem. The problem is that if the paths came back up (or never went down) and you then did IO to mpath1, the IO would get sent to LUN 11, which is now mapped to V0049's storage. So if you later rebooted and looked at V0049, you would see IO that should have been sent to V0048.

You do not need multipath for this, btw. Just access the SCSI disk directly.
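A minimal sketch of that direct-disk check, assuming hypothetical device names sdb/sdc for the two LUNs (reads only, no multipath needed):

# Before the remap, save the first sector of each device:
dd if=/dev/sdb of=/tmp/sdb.sector0 bs=512 count=1
dd if=/dev/sdc of=/tmp/sdc.sector0 bs=512 count=1
# After the remap, re-read and compare; a mismatch means the same
# device node now routes IO to the other volume:
dd if=/dev/sdb bs=512 count=1 2>/dev/null | cmp - /tmp/sdb.sector0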
Oh yeah, when you do the remap and then send IO to the disk, you should see the error messages added in this patch in /var/log/messages:

http://patchwork.usersys.redhat.com/patch/28250/
Mike,

multipath blocks access to /dev/sdN and checks the WWID, so with multipath enabled customers will not hit this LUN remap issue.

I tried to reproduce on kernel 2.6.18-233.el5 with multipath disabled, and it does not look good.

Before remap
===============================
scsi_device:0:0:0:22 /dev/sdc 20090ef1270000031 V0049
scsi_device:0:0:0:11 /dev/sdb 20090ef1270000030 V0048

mount /dev/sdc1 /tmp/0049
ls -l /tmp/0049 -> got file "this is V0049"
mount /dev/sdb1 /tmp/0048
ls -l /tmp/0048 -> got file "this is V0048"

After remapping V0049 to LUN 11
===============================
sdc is gone. sdb goes to V0049:
mount /dev/sdb1 /mnt
ls -l /mnt -> got file "this is V0049"

==============================================================
No log for REPORT_LUNS_DATA_CHANGED found in /var/log/messages.

Is there anything I missed?
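For reference, the WWID that multipath keys on can also be read per path, to prove which volume a device node really points at after the remap. A sketch, assuming the RHEL 5 scsi_id invocation:

# The WWID survives the remap even though the LUN number changes:
scsi_id -g -u -s /block/sdb
# -> 20090ef1270000030 means sdb is V0048, whatever LUN it is mapped
#    to right now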
(In reply to comment #8)
> After remapping V0049 to LUN 11
> ===============================
> sdc is gone. sdb goes to V0049:

If you leave sdb mounted and try to access it after the remap, do you get IO errors?

> mount /dev/sdb1 /mnt
> ls -l /mnt -> got file "this is V0049"
>
> ==============================================================
> No log for REPORT_LUNS_DATA_CHANGED found in /var/log/messages.
>
> Is there anything I missed?

The target might be forcing us to delete the devices instead of sending the sense error. That is fine, since IO would not be routed incorrectly. Could you attach /var/log/messages?
Created attachment 465123 [details]
dmesg and /var/log/message when re-map

Before remap:
  sdb -> V0049 LUN 11
  sdc -> V0048 LUN 21 (no LUN mapping)

After remap:
  sdb -> V0048 LUN 11
  sdc -> NULL, as no LUN is 21
  V0049 is LUN 22 (no LUN mapping)

I kept both sdb and sdc mounted during the remap, and ls always shows the correct one (filesystem cache). I unmounted the two and mounted again; only sdb could be mounted, and it turned out to be V0048.

dmesg and /var/log/messages are attached. Please let me know how I can help.
(In reply to comment #10)
> Before remap:
>   sdb -> V0049 LUN 11
>   sdc -> V0048 LUN 21 (no LUN mapping)
>
> After remap:
>   sdb -> V0048 LUN 11
>   sdc -> NULL, as no LUN is 21
>   V0049 is LUN 22 (no LUN mapping)
>
> I kept both sdb and sdc mounted during the remap, and ls always shows the
> correct one (filesystem cache).

If at this point you write some file to the FS mounted over /dev/sdb1, what happens? Do you get IO errors? If not, and you then unmount the FS and remount /dev/sdb1, do you see the new file? If so, then that is the corruption we are looking for. The user thought they were writing to V0049 (the original volume that was mapped to LUN 11), but it got written to V0048 instead.
(In reply to comment #11)
> If at this point you write some file to the FS mounted over /dev/sdb1, what
> happens? Do you get IO errors? If not, and you then unmount the FS and remount
> /dev/sdb1, do you see the new file? If so, then that is the corruption we are
> looking for.

Or, if when you try to remount the FS you get errors from ext3 about the journal, some superblocks, or some files being messed up, that is another case of corruption we could expect. Or if the write fails, we would expect that too.
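Spelling out the check Mike describes as a sketch (hypothetical mount point and device; adjust to the remapped LUN):

# While /dev/sdb1 is still mounted from before the remap:
dd if=/dev/urandom of=/mnt/canary bs=1MB count=1   # note any IO errors
umount /mnt
mount /dev/sdb1 /mnt
ls -l /mnt/canary
# canary present after the remount => the write landed on the volume
# now behind sdb (silent corruption); ext3/journal errors or a failed
# write are the other expected outcomes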
The filesystem goes to read-only mode after the LUN remap. I will try again and let you know.
Mike Christie,

Sorry for the system change. I reproduced the problem twice with these steps, with the same output:

V0018 -> LUN 03 sdd
V0019 -> LUN 04 sde

#before re-map
#disable multipath by command: multipath -F
mkfs.ext3 /dev/sdd1
mkfs.ext3 /dev/sde1
mount /dev/sdd1 /tmp/V0018
mount /dev/sde1 /tmp/V0019
dd if=/dev/urandom count=100 bs=1MB of=/tmp/V0018/V0018
dd if=/dev/urandom count=100 bs=1MB of=/tmp/V0019/V0019
md5sum /tmp/V00*/V*
=============================
7ecadf25c6e21a2ae62da97bf5062f8e  /tmp/V0018/V0018
fd08907ce7ab97eed23f94dda5520485  /tmp/V0019/V0019
=============================

#reboot host
#host online, then disable multipath
multipath -F
mount /dev/sdd1 /tmp/V0018
mount /dev/sde1 /tmp/V0019
md5sum /tmp/V00*/V*
=============================
7ecadf25c6e21a2ae62da97bf5062f8e  /tmp/V0018/V0018
fd08907ce7ab97eed23f94dda5520485  /tmp/V0019/V0019
=============================

#No I/O during re-map
#remap V0018 to LUN 05
#remap V0019 to LUN 03
#remap V0018 to LUN 04
# Now: V0018 -> LUN 04, V0019 -> LUN 03
#Sleep 10 minutes.
dd if=/dev/urandom count=10 bs=1MB of=/tmp/V0018/suppose_to_V0018
dd if=/dev/urandom count=10 bs=1MB of=/tmp/V0019/suppose_to_V0019
#No error found in /var/log/messages
umount /tmp/V00*
#umount logged no error in /var/log/messages
mount /dev/sdd1 /tmp/V0018
mount /dev/sde1 /tmp/V0019
ls -l /tmp/V00*
=============================
/tmp/V0018:
total 107568
drwx------ 2 root root     16384 Dec 15 04:25 lost+found
-rw-r--r-- 1 root root  10000000 Dec 16 02:44 suppose_to_V0018
-rw-r--r-- 1 root root 100000000 Dec 15 04:26 V0018

/tmp/V0019:
total 107568
drwx------ 2 root root     16384 Dec 15 04:25 lost+found
-rw-r--r-- 1 root root  10000000 Dec 16 02:45 suppose_to_V0019
-rw-r--r-- 1 root root 100000000 Dec 15 04:26 V0019
=============================
md5sum /tmp/V00*/V*
=============================
md5sum: /tmp/V0018/V0018: Input/output error
md5sum: /tmp/V0019/V0019: Input/output error
=============================

#got these logs in /var/log/messages:
Dec 16 02:47:28 storageqe-06 kernel: attempt to access beyond end of device
Dec 16 02:47:28 storageqe-06 kernel: sde1: rw=0, want=26300803416, limit=41801067

/var/log/messages was uploaded.

I am confused: where would we get this REPORTED_LUNS_DATA_CHANGED error? Does it come from the block layer, the SCSI layer, or the filesystem layer?
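The "attempt to access beyond end of device" line suggests ext3 is applying metadata from the old volume to whatever device is now behind sde. One hedged way to confirm the mismatch, using standard RHEL 5 tools:

# Compare the size ext3 recorded at mkfs time with the size of the
# device currently behind the node:
dumpe2fs -h /dev/sde1 | grep "Block count"
blockdev --getsize64 /dev/sde1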
Created attachment 469085 [details]
/var/log/message for re-map LUN
(In reply to comment #14)
> I am confused: where would we get this REPORTED_LUNS_DATA_CHANGED error?
> Does it come from the block layer, the SCSI layer, or the filesystem layer?

The SCSI layer. After the remap you would expect the first IO sent to get failed, and you would see that error message. Not all targets support this; it looks like the one you are using does not. If you have a NetApp target you will see it (that is what I did the work against).
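So on a target that does send the sense data, a sketch of the trigger would be as simple as one read after the remap (hypothetical device name):

# The first IO after the remap should surface the warning:
dd if=/dev/sdb of=/dev/null bs=512 count=1
dmesg | grep "LUN assignments"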
I just retested the current kernel with a NetApp target and got:

Dec 17 10:38:59 noisymax kernel: sd 12:0:0:2: Warning! Received an indication that the LUN assignments on this target have changed. The Linux SCSI layer does not automatically remap LUN assignments.
With a NetApp target, I have reproduced this issue, and the new kernel provides the REPORTED_LUNS_DATA_CHANGED error.

Mike,

One more thing I need to bother you with again: I swapped two LUNs (LUN 0 and LUN 10), but only got one error message, for LUN 0, in messages:

Dec 19 21:51:37 storageqe-08 kernel: sd 7:0:1:0: Warning! Received an indication that the LUN assignments on this target have changed. The Linux SCSI layer does not automatically remap LUN assignments.

Both filesystems on these LUNs went read-only.
(In reply to comment #20)
> With a NetApp target, I have reproduced this issue, and the new kernel
> provides the REPORTED_LUNS_DATA_CHANGED error.
>
> Mike,
> One more thing I need to bother you with again: I swapped two LUNs (LUN 0 and
> LUN 10), but only got one error message, for LUN 0, in messages:

Did you do IO to both LUNs or just one? What triggers the error message is the first IO sent to the device after the remap.
dd if=/dev/urandom count=10 bs=1MB of=/tmp/boot/suppose_to_boot
dd if=/dev/urandom count=10 bs=1MB of=/tmp/test/suppose_to_test

The first one finished with no dd error, but /var/log/messages reported an error at the filesystem level. The second one got an input/output error. After that, both mount points went read-only.

If needed, I can build up the system again and provide you the detailed log. Let me know.
What were the commands you used on the netapp target? I will reproduce here.
Oh wait. It seems it is expected to see the "Warning! Received an indication that the LUN assignments...." message only once. And I guess, depending on the timing and the commands you run on the target, you might see IO errors from the device like "Sense Key : Illegal Request", or you might see the FS figure out that something is wrong when doing FS IO.

So it looks like you got what was expected.
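In other words, the symptoms to look for in the logs after a remap would be something like (message texts as seen earlier in this thread):

grep "LUN assignments on this target have changed" /var/log/messages
grep "Illegal Request" /var/log/messages
# plus possible ext3 complaints (journal/superblock errors, read-only
# remounts) if the FS notices first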
As comment #24 mentioned, the REPORTED_LUNS_DATA_CHANGED error message is only reported once. Hence, changing this bug to VERIFIED status.
Previously:
/vol/flex/storageqe_08_boot -> LUN0
/vol/flex/storageqe_08_test -> LUN10

The commands I am using on the NetApp are:

lun unmap /vol/flex/storageqe_08_boot storageqe_08_boot
lun unmap /vol/flex/storageqe_08_test storageqe_08_boot
lun map /vol/flex/storageqe_08_test storageqe_08_boot 0
lun map /vol/flex/storageqe_08_boot storageqe_08_boot 10

The ext3 filesystems were mounted during the re-mapping. These commands were used to generate IO after the remap:

dd if=/dev/urandom count=10 bs=1MB of=/tmp/boot/suppose_to_boot
dd if=/dev/urandom count=10 bs=1MB of=/tmp/test/suppose_to_test

As the error message indicates "sd 7:0:1:0", i.e. the re-mapped LUN 0: can we get an error message for each re-mapped LUN?
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html