Bug 585431 - Add log message for unhandled sense error REPORTED_LUNS_DATA_CHANGED
Summary: Add log message for unhandled sense error REPORTED_LUNS_DATA_CHANGED
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.7
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
: ---
Assignee: Mike Christie
QA Contact: Gris Ge
URL:
Whiteboard:
Depends On: 585430
Blocks:
 
Reported: 2010-04-24 01:02 UTC by Mike Christie
Modified: 2011-01-13 21:29 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 585430
Environment:
Last Closed: 2011-01-13 21:29:10 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
dmesg and /var/log/message when re-map (12.21 KB, text/plain)
2010-12-07 02:07 UTC, Gris Ge
no flags Details
/var/log/message for re-map LUN (1.13 MB, text/plain)
2010-12-16 08:07 UTC, Gris Ge
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2011:0017 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 5.6 kernel security and bug fix update 2011-01-13 10:37:42 UTC

Description Mike Christie 2010-04-24 01:02:50 UTC
+++ This bug was initially created as a clone of Bug #585430 +++

Description of problem:

We do not support LUN remapping, but if it somehow were to happen, we could silently corrupt data.

This is a request to add a printk indicating that we got REPORTED_LUNS_DATA_CHANGED but cannot handle it because it is not supported. That way, at least our support team can identify the problem without having to look through logs for 4 months :)
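Once such a printk lands, support can spot the condition without reading whole logs. A minimal sketch, assuming only that the warning contains a stable phrase (the retest later in this bug shows the text that shipped):

```shell
# Count occurrences of the unhandled-remap warning in a syslog-format
# file. The phrase matches the message text shown later in this bug.
scan_for_lun_remap_warning() {
    # $1 = path to a log file, e.g. /var/log/messages
    grep -c "LUN assignments on this target have changed" "$1"
}
```

Run as, for example, `scan_for_lun_remap_warning /var/log/messages`.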


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 RHEL Program Management 2010-05-20 12:49:13 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 3 Jarod Wilson 2010-09-21 21:00:17 UTC
in kernel-2.6.18-223.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 5 Gris Ge 2010-12-02 03:27:36 UTC
Mike,
Can you explain what remapping a LUN means?

I tried it in this way, but no issue found in RHEL 5.5 GA:
1. Map LUN name V0048 to host as LUN 11.
    [root@storageqe-05 ~]# multipath -l mpath1
    mpath1 (20090ef1270000030) dm-19 IQSTOR,iQ2880
    [size=3.9G][features=0][hwhandler=0][rw]
    \_ round-robin 0 [prio=0][active]
     \_ 1:0:0:11 sdat 66:208 [active][undef]
    \_ round-robin 0 [prio=0][enabled]
     \_ 1:0:1:11 sdbp 68:48  [active][undef]
    \_ round-robin 0 [prio=0][enabled]
     \_ 0:0:0:11 sdb  8:16   [active][undef]
    \_ round-robin 0 [prio=0][enabled]
     \_ 0:0:1:11 sdx  65:112 [active][undef]
2. Map LUN name V0049 to host as LUN 22:

    [root@storageqe-05 ~]# multipath -l mpath10
    mpath10 (20090ef1270000031) dm-20 IQSTOR,iQ2880
    [size=3.9G][features=0][hwhandler=0][rw]
    \_ round-robin 0 [prio=0][enabled]
     \_ 1:0:0:22 sdau 66:224 [active][undef]
    \_ round-robin 0 [prio=0][enabled]
     \_ 1:0:1:22 sdbq 68:64  [active][undef]
    \_ round-robin 0 [prio=0][enabled]
     \_ 0:0:0:22 sdc  8:32   [active][undef]
    \_ round-robin 0 [prio=0][enabled]
     \_ 0:0:1:22 sdy  65:128 [active][undef]
3. Create a file in mpath1 and mpath10 as an identifier.

4. Remap LUN V0049 to host as LUN 11, and V0048 to host as LUN 21.

5. Both these mpath1 and mpath10 report path down.

6. Reboot this host. It comes back online with the correct mpath1 and mpath10, with the LUN IDs changed.
    [root@storageqe-05 ~]# multipath -l mpath1
    mpath1 (20090ef1270000030) dm-20 IQSTOR,iQ2880
    [size=3.9G][features=0][hwhandler=0][rw]
    \_ round-robin 0 [prio=0][active]
     \_ 1:0:0:21 sdau 66:224 [active][undef]
    \_ round-robin 0 [prio=0][enabled]
     \_ 1:0:1:21 sdbq 68:64  [active][undef]
    \_ round-robin 0 [prio=0][enabled]
     \_ 0:0:0:21 sdc  8:32   [active][undef]
    \_ round-robin 0 [prio=0][enabled]
     \_ 0:0:1:21 sdy  65:128 [active][undef]
    [root@storageqe-05 ~]# multipath -l mpath10
    mpath10 (20090ef1270000031) dm-19 IQSTOR,iQ2880
    [size=3.9G][features=0][hwhandler=0][rw]
    \_ round-robin 0 [prio=0][active]
     \_ 1:0:0:11 sdat 66:208 [active][undef]
    \_ round-robin 0 [prio=0][enabled]
     \_ 1:0:1:11 sdbp 68:48  [active][undef]
    \_ round-robin 0 [prio=0][enabled]
     \_ 0:0:0:11 sdb  8:16   [active][undef]
    \_ round-robin 0 [prio=0][enabled]
     \_ 0:0:1:11 sdx  65:112 [active][undef]

Please let me know if you need more information.

Comment 6 Mike Christie 2010-12-03 02:21:19 UTC
(In reply to comment #5)
> 5. Both these mpath1 and mpath10 report path down.
> 

Do the paths come back up? How do you know they are down? Is that what is seen in multipath's output? What do you see in /var/log/messages? Are there errors, or do you see the rport getting deleted? Normally you would not see the path go down.


> 6. Reboot this host. It come online with correct mpath1 and mpath10 with LUN ID

You do not want to reboot the box. Rebooting would be a workaround for the problem.

The problem would be that if the paths came back up, or never went down, and you then did IO to mpath1, the IO would get sent to LUN 11, which is now mapped to V0049's storage. So if you later rebooted and looked at V0049, you would see IO that should have gone to V0048.


You do not need multipath for this btw. Just access the scsi disk directly.

Comment 7 Mike Christie 2010-12-03 02:22:02 UTC
Oh yeah, when you do the remap and then send IO to the disk, you should see the error messages added in this patch
http://patchwork.usersys.redhat.com/patch/28250/
in /var/log/messages.

Comment 8 Gris Ge 2010-12-03 07:43:55 UTC
Mike,

multipath blocks access to /dev/sdN and checks the WWID, so customers with multipath enabled will not hit this LUN re-map issue.

I tried to reproduce this on kernel 2.6.18-233.el5 with multipath disabled, but it doesn't look good.

Before remap
===============================
scsi_device:0:0:0:22	/dev/sdc	20090ef1270000031	V0049
scsi_device:0:0:0:11	/dev/sdb	20090ef1270000030	V0048

mount /dev/sdc1 /tmp/0049
ls -l /tmp/0049 -> got file "this is V0049"
mount /dev/sdb1 /tmp/0048
ls -l /tmp/0048 -> got file "this is V0048"

After re-map V0049 to LUN11,
===============================
sdc gone.
sdb goes to V0049:
mount /dev/sdb1 /mnt
ls -l /mnt -> got file "this is V0049"

==============================================================
No log for REPORT_LUNS_DATA_CHANGED found in /var/log/messages

Is there anything I missed?

Comment 9 Mike Christie 2010-12-03 21:02:17 UTC
(In reply to comment #8)
> After re-map V0049 to LUN11,
> ===============================
> sdc gone.
> sdb goes to V0049:

If you leave sdb mounted and try to access it after the remap do you get IO errors?

> mount /dev/sdb1 /mnt
> ls -l /mnt -> got file "this is V0049"
> 
> ==============================================================
> No log for REPORT_LUNS_DATA_CHANGED found in /var/log/messages
> 
> Is there anything I missed?

The target might be forcing us to delete the devices instead of sending the sense error. That is fine since IO would not be routed incorrectly.

Could you attach the /var/log/messages?
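A rough way to tell the two cases apart from a saved log. This is only a sketch: the remap-warning phrase matches the message added by the patch, while the device-removal patterns are assumptions about typical RHEL 5 log text, not exact strings.

```shell
# Triage a messages file: did the host log the unhandled-remap warning,
# or did the target force the devices to be deleted instead?
# The removal patterns (rport deletion, sd cache sync) are assumptions.
classify_remap_log() {
    # $1 = path to a log file
    if grep -q "LUN assignments on this target have changed" "$1"; then
        echo "sense-warning"
    elif grep -Eq "rport.*(delet|blocked)|Synchronizing SCSI cache" "$1"; then
        echo "device-removed"
    else
        echo "no-evidence"
    fi
}
```

For example, `classify_remap_log /var/log/messages` after a re-map.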

Comment 10 Gris Ge 2010-12-07 02:07:48 UTC
Created attachment 465123 [details]
dmesg and /var/log/message when re-map

Before remap:

sdb -> V0049 LUN 11
sdc -> V0048 LUN 21 (no LUN mapping)


After remap:

sdb -> V0048 LUN 11
sdc -> NULL as no LUN is 21

V0049 is LUN 22 (no LUN mapping)

I kept both sdb and sdc mounted during the re-map, and ls always showed the correct one (filesystem cache). I unmounted the two and mounted them again; only sdb could be mounted, and I found it's V0048.

dmesg and /var/log/messages are attached.

Please let me know if I could help.

Comment 11 Mike Christie 2010-12-07 02:32:54 UTC
(In reply to comment #10)
> Created attachment 465123 [details]
> dmesg and /var/log/message when re-map
> 
> Before remap:
> 
> sdb -> V0049 LUN 11
> sdc -> V0048 LUN 21 (no LUN mapping)
> 
> 
> After remap:
> 
> sdb -> V0048 LUN 11
> sdc -> NULL as no LUN is 21
> 
> V0049 is LUN 22 (no LUN mapping)
> 
> I keep both sdb and sdc mounted during the re-map, and ls always show to
> correct one (filesystem cache).

If at this point you write some file to the FS mounted over /dev/sdb1, what happens? Do you get IO errors? If not, and you then unmount the FS and remount /dev/sdb1, do you see the new file? If so, then that is the corruption we are looking for. The user thought they were writing to V0049 (the original volume that was mapped to LUN 11), but it got written to V0048 instead.

Comment 12 Mike Christie 2010-12-07 02:35:33 UTC
(In reply to comment #11)
> 
> If at this point you write some file to the FS mounted over /dev/sdb1, what
> happens? Do you get IO errors? If not and you then unmount the FS and remount
> /dev/sdb1 do you see the new file? If so then that is the corruption we are
> looking for.

Or if, when you try to remount the FS, you get errors from EXT3 about the journal, some super blocks, or some files being messed up, that is another case of corruption we could expect.

Or if the write fails, we would expect that too.

Comment 13 Gris Ge 2010-12-15 03:56:28 UTC
The filesystem goes into read-only mode after the LUN re-map.
I will try again and let you know.

Comment 14 Gris Ge 2010-12-16 08:05:46 UTC
Mike Christie,

Sorry for the system change.

I reproduced the problem with these steps twice, with the same output.

These are the steps I am using:
V0018 -> LUN 03 sdd
V0019 -> LUN 04 sde

#before re-map
#disable multipath with the command:
multipath -F
mkfs.ext3 /dev/sdd1
mkfs.ext3 /dev/sde1
mount /dev/sdd1 /tmp/V0018
mount /dev/sde1 /tmp/V0019
dd if=/dev/urandom count=100 bs=1MB of=/tmp/V0018/V0018
dd if=/dev/urandom count=100 bs=1MB of=/tmp/V0019/V0019
=============================
7ecadf25c6e21a2ae62da97bf5062f8e  /tmp/V0018/V0018
fd08907ce7ab97eed23f94dda5520485  /tmp/V0019/V0019
=============================

#reboot host
#host online then disable multipath
multipath -F
mount /dev/sdd1 /tmp/V0018
mount /dev/sde1 /tmp/V0019
md5sum /tmp/V00*/V*
=============================
7ecadf25c6e21a2ae62da97bf5062f8e  /tmp/V0018/V0018
fd08907ce7ab97eed23f94dda5520485  /tmp/V0019/V0019
=============================

#No I/O during re-map
#remap V0018 to LUN 05
#remap V0019 to LUN 03
#remap V0018 to LUN 04
# Now: V0018 -> LUN 04, V0019 -> LUN 03
#Sleep 10 minutes.

dd if=/dev/urandom count=10 bs=1MB of=/tmp/V0018/suppose_to_V0018
dd if=/dev/urandom count=10 bs=1MB of=/tmp/V0019/suppose_to_V0019

#No error found in /var/log/message

umount /tmp/V00*
#umount got no error in /var/log/message
mount /dev/sdd1 /tmp/V0018
mount /dev/sde1 /tmp/V0019
ls -l /tmp/V00*
=============================
/tmp/V0018:
total 107568
drwx------ 2 root root     16384 Dec 15 04:25 lost+found
-rw-r--r-- 1 root root  10000000 Dec 16 02:44 suppose_to_V0018
-rw-r--r-- 1 root root 100000000 Dec 15 04:26 V0018

/tmp/V0019:
total 107568
drwx------ 2 root root     16384 Dec 15 04:25 lost+found
-rw-r--r-- 1 root root  10000000 Dec 16 02:45 suppose_to_V0019
-rw-r--r-- 1 root root 100000000 Dec 15 04:26 V0019
=============================

md5sum /tmp/V00*/V*
=============================
md5sum: /tmp/V0018/V0018: Input/output error
md5sum: /tmp/V0019/V0019: Input/output error
=============================
#got these log in /var/log/message

Dec 16 02:47:28 storageqe-06 kernel: attempt to access beyond end of device
Dec 16 02:47:28 storageqe-06 kernel: sde1: rw=0, want=26300803416, limit=41801067

/var/log/messages was uploaded.

I am confused: where would we get this REPORTED_LUNS_DATA_CHANGED error?
Does it come from the block layer, SCSI layer, or filesystem layer?

Comment 15 Gris Ge 2010-12-16 08:07:27 UTC
Created attachment 469085 [details]
/var/log/message for re-map LUN

Comment 16 Mike Christie 2010-12-16 20:32:02 UTC
(In reply to comment #14)
> I am confused, where could we got this REPORTED_LUNS_DATA_CHANGED error?
> Does that from block layer, scsi layer, or filesystem layer?

SCSI layer. After the remap, you would expect the first IO sent to fail, and you would see that error message.

Not all targets support this; it looks like the one you are using does not. If you have a NetApp target, you will see it (that is what I did the work against).
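To make sure a fresh command actually reaches the device (rather than being satisfied from the page cache), a direct read can serve as that first IO. A sketch, assuming GNU dd; the device path is whatever disk was remapped:

```shell
# Issue one fresh read to a disk so any pending unit attention surfaces
# on that first IO. O_DIRECT bypasses the page cache; fall back to a
# plain read where direct IO is unsupported.
probe_first_io() {
    # $1 = block device (or any readable file, for testing)
    dd if="$1" of=/dev/null bs=512 count=1 iflag=direct 2>/dev/null ||
        dd if="$1" of=/dev/null bs=512 count=1 2>/dev/null
}
```

For example, `probe_first_io /dev/sdb`, then check the tail of /var/log/messages.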

Comment 17 Mike Christie 2010-12-16 23:43:57 UTC
I just retested the current kernel with a netapp target and got:


Dec 17 10:38:59 noisymax kernel: sd 12:0:0:2: Warning! Received an indication that the LUN assignments on this target have changed. The Linux SCSI layer does not automatically remap LUN assignments.

Comment 20 Gris Ge 2010-12-20 02:59:53 UTC
With a NetApp target, I have reproduced this issue, and the new kernel provides the REPORTED_LUNS_DATA_CHANGED error message.

Mike,
One more thing I need to bother you with:
I swapped two LUNs (LUN 0 and LUN 10), but only got one error message, for LUN 0, in the messages file:

Dec 19 21:51:37 storageqe-08 kernel: sd 7:0:1:0: Warning! Received an indication that the LUN assignments on this target have changed. The Linux SCSI layer does not automatically remap LUN assignments.

Both filesystems on these LUNs went read-only.

Comment 21 Mike Christie 2010-12-21 01:35:45 UTC
(In reply to comment #20)
> With NetApp Target, I have reproduced this issue and the new kernel provide 
> REPORTED_LUNS_DATA_CHANGED error.
> 
> Mike,
> One more thing need to bother you again:
> I swap two LUNs (LUN 0 and LUN 10 ), but only got 1 error message for LUN 0 in
> messages:
> 

Did you do IO to both LUNs or just one? What triggers the error message is the first IO sent to the device after the remap.
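Following that logic, one warning per remapped device should appear once each device has seen its first IO. A sketch that tallies the warnings per H:C:T:L from a saved log (the message phrasing follows the example in comment 17):

```shell
# Count the unhandled-remap warnings per SCSI device address so it is
# easy to see which remapped LUNs have logged one and which have not.
count_remap_warnings() {
    # $1 = path to a log file, e.g. /var/log/messages
    grep "LUN assignments on this target have changed" "$1" |
        sed -n 's/.*\(sd [0-9]*:[0-9]*:[0-9]*:[0-9]*\).*/\1/p' |
        sort | uniq -c
}
```

For example, `count_remap_warnings /var/log/messages` after doing IO to both LUNs.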

Comment 22 Gris Ge 2010-12-21 02:48:48 UTC
dd if=/dev/urandom count=10 bs=1MB of=/tmp/boot/suppose_to_boot
dd if=/dev/urandom count=10 bs=1MB of=/tmp/test/suppose_to_test

The first dd finished with no error, but /var/log/messages reported an error from the filesystem level.
The second dd got an input/output error.

After that, both mount points went read-only.

If need, I can build up the system again and provide you the detailed log.
Let me know.

Comment 23 Mike Christie 2010-12-22 00:48:44 UTC
What were the commands you used on the NetApp target? I will reproduce it here.

Comment 24 Mike Christie 2010-12-22 01:01:32 UTC
Oh wait. It seems it is expected to see the "Warning! Received an indication that the LUN assignments...." message only once.

And I guess, depending on the timing and the commands you run on the target, you might see IO errors from the device like "Sense Key : Illegal Request", or you might see the FS figure out that something is wrong if doing FS IO.

So it looks like you got what was expected.

Comment 25 Gris Ge 2010-12-22 02:34:20 UTC
As Comment #24 mentioned, the REPORTED_LUNS_DATA_CHANGED error message will only be reported once. Hence, changing this bug to verified status.

Comment 26 Gris Ge 2010-12-22 02:43:46 UTC
Previously
/vol/flex/storageqe_08_boot -> LUN0
/vol/flex/storageqe_08_test -> LUN10

The commands I am using on the NetApp are:

lun unmap /vol/flex/storageqe_08_boot storageqe_08_boot
lun unmap /vol/flex/storageqe_08_test storageqe_08_boot

lun map /vol/flex/storageqe_08_test storageqe_08_boot 0
lun map /vol/flex/storageqe_08_boot storageqe_08_boot 10

ext3 filesystem was mounted when re-mapping.

These commands are used to generate I/O after re-map:
dd if=/dev/urandom count=10 bs=1MB of=/tmp/boot/suppose_to_boot
dd if=/dev/urandom count=10 bs=1MB of=/tmp/test/suppose_to_test

As the error message indicates "sd 7:0:1:0", the re-mapped LUN 0:
Can we get an error message for each re-mapped LUN?

Comment 28 errata-xmlrpc 2011-01-13 21:29:10 UTC
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html

