| Summary: | Target device mapped to multiple luns | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Dinesh Surpur <dinesh.surpur> |
| Component: | device-mapper-multipath | Assignee: | Ben Marzinski <bmarzins> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Storage QE <storage-qe> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | Version: | 5.6 |
| CC: | agk, bdonahue, bmarzins, bmr, dinesh.surpur, dwysocha, hector.arteaga, heinzm, mchristi, prajnoha, prockai, zkabelac | | |
| Target Milestone: | rc | Flags: | bmarzins: needinfo? (dinesh.surpur) |
| Target Release: | --- | Hardware: | x86_64 |
| OS: | Linux | Whiteboard: | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-04-15 19:50:01 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | 558956 (message log), 558958 (actions taken when the bug was hit), 565707 (iscsi_logout_in_multipath.txt), 915419 (comment moved to attachment) | | |
Description
Dinesh Surpur
2012-02-02 02:39:47 UTC
Created attachment 558956 [details]
message log
Created attachment 558958 [details]
actions taken when the bug was hit
Forgot to list the Ethernet adapter information used: on the host there is a 10G QLogic QLE8142, driver version 8.03.01.05.05.06-k.

I'm currently looking into this. However, I see messages like these in your output:

    Warning! Received an indication that the LUN assignments on this target have changed. The Linux SCSI layer does not automatically remap LUN assignments.

It's not immediately obvious that these are the cause of your problem, but they certainly could be. If, for instance, SCSI device sdc starts pointing at a new LUN without first being removed from the system, you will get these messages, and unless you completely remove your multipath devices and remake them, this will cause corruption. multipath, and a great deal of other things in Linux, are not designed to handle a SCSI device switching which LUN it is pointing to underneath them.

Sometimes the 3PAR array will send a "LUNs have changed" unit attention if nodes are rebooted, but there would have been no difference in the I_T_L_Q nexus association, meaning all LUN details would have stayed the same within a REPORT LUNS response frame. If you look at the multipath mapping, everything is the same, except that sdc gets mapped to both LUN 10 and LUN 11. I think this happened while I was performing an iSCSI logout/login test. Also, I have been trying to recreate this condition, but have not been able to as of yet.

Mike, should it be safe to log out of the iSCSI devices while they are in use by multipath, or will that always run the risk of something like this happening?

(In reply to comment #5)
> Sometimes the 3PAR array will send a "LUNs have changed" unit attention if nodes are rebooted, but there would have been no difference in the I_T_L_Q nexus association, meaning all LUN details would have stayed the same within a REPORT LUNS response frame.

What about between the logout and relogin? Did you remap?

> If you look at the multipath mapping, everything is the same, except that sdc gets mapped to both LUN 10 and LUN 11. I think this happened while I was performing an iSCSI logout/login test.

Maybe there are two issues.

LUN remapping: Hey, so just to make sure I got it, you did not remap the LUNs, right? They should have all come back as LUN 10? If you did not remap the LUN to 11, then I have no idea how that happened. It should not have happened; I have never seen something like that before.

sdc hanging around: doing a logout while something above the device is using it is a problem. It has the same issues as when dev_loss_tmo fires for FC and removes devices. For RHEL 5, we switched the default for dev_loss_tmo so that it did not remove devices. When devices are removed while something above them is still using them, it is common to see a device hanging around like this. Basically, some apps were not handling the kobject delete/remove events, or there was a race where the add and delete came so quickly that something would mess up (not sure if it was a userspace app, the kobject code, or what). In the docs we advise users to bring everything above the SCSI devices (iSCSI, FC, SAS, etc.) down before tearing down the low-level devices. There are no checks for this, though; it is one of those things we allow the admin to do.

After you do iscsiadm -m node --logout, run iscsiadm -m session -P 3 and send the output. Then after you relogin, run that iscsiadm -m session -P 3 command again and send the output.
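[Editor's note: the data-capture procedure suggested above can be scripted roughly as follows. This is a minimal sketch, not part of the bug report: the output directory and file names are placeholders I have added, while the iscsiadm and multipath invocations are the ones quoted in this bug.]

```bash
#!/bin/bash
# Capture SCSI/multipath state around an iSCSI logout/login cycle,
# as suggested in the comment above. Output location is a placeholder.
set -x
OUT=/tmp/bz-multipath-$(date +%s)
mkdir -p "$OUT"

# State before the logout
multipath -ll              > "$OUT/multipath-before.txt"
iscsiadm -m session -P 3   > "$OUT/session-before.txt"

# Log out of all iSCSI nodes, then record what the SCSI layer sees
iscsiadm -m node --logout
iscsiadm -m session -P 3   > "$OUT/session-after-logout.txt" 2>&1
multipath -ll -v3          > "$OUT/multipath-after-logout.txt"

# Log back in and record the state again
iscsiadm -m node --login
iscsiadm -m session -P 3   > "$OUT/session-after-login.txt"
multipath -ll -v3          > "$OUT/multipath-after-login.txt"
```

Comparing the sdX-to-LUN mapping in the before and after output is what shows whether a device node such as sdc was reused for a different LUN.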
Oh yeah, for the sdc hanging around issue, I think it might be a userspace and/or kobject event race issue. If there is nothing with a refcount/open on the device, then when we log back in we should get the same sdX values as before, because the SCSI layer reuses them if possible (if they are completely free, meaning all refcounts/opens have been released). In the second multipath output we got all different names except sdc, so maybe something was not processing the adds/deletes quickly enough (the SCSI layer was adding devices while multipath or something else was still trying to handle the deletion/remove event and multipath had not done a close/release yet) and then also dropped the sdc event.

Created attachment 915419 [details]
Comment
(This comment was longer than 65,535 characters and has been moved to an attachment by Red Hat Bugzilla.)
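[Editor's note: one way to check the "refcount/open on the device" condition described above before logging out is to look at the disk's sysfs holders, device-mapper tables, and open file handles. This is a rough illustration added here, not something from the bug; the device name sdc and the specific commands are my choices.]

```bash
#!/bin/bash
# Rough check for open references on a SCSI disk before an iSCSI logout.
# "sdc" is only an example; pass the device name as the first argument.
dev=${1:-sdc}

# Upper-layer block devices (e.g. device-mapper maps) holding this disk
echo "holders of /dev/$dev:"
ls /sys/block/$dev/holders/ 2>/dev/null

# Device-mapper tables that reference this disk's major:minor
majmin=$(cat /sys/block/$dev/dev)
echo "dm tables using $majmin:"
dmsetup table | grep -w "$majmin"

# Processes with the device node itself open
lsof /dev/$dev 2>/dev/null
```

If any of these report users, the device is not "completely free" in the sense described above, and its sdX name may not be reusable cleanly after a relogin.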
Sorry, I should have been more clear. Could you send the iscsiadm output and the multipath output?

Created attachment 565707 [details]
iscsi_logout_in_multipath.txt
Attachment "iscsi_logout_in_multipath.txt contains the output of the following commands: multipath -ll iscsiadm -m session iscsiadm -m node --logout multipath -ll -v3 iscsiadm -m node --login multipath -ll -v3 multipath -ll iscsiadm -m session iscsiadm -m session -P 3 It's been sometime since we have had any update on this issue. We need to find out what is going on with this. Because there was a data corruption involved here we feel this is a very critical issue. Please let us know if there is any more information we may provide to you. I think the log output in comment #12 got corrupted. It only has some iscsiadm -m session -P 3 output. If the LUN hasn't been remapped, then it looks like somehow sdc must be getting added to a multipath device's paths list when it shouldn't be. Unfortunately, looking at the messages, there are places where multipathd's log buffer filled up before all the message could get sent to syslog, and so messages are missing, making it impossible to tell exactly what's happening. I shouldn't be too hard to put in a check so that before multipathd is reloads the device table it double checks that it doesn't have a path that shouldn't be there (it's wwid doesn't match). This should catch the issue that I can see here, although without knowing how that device got associated with the wrong device, I can't be certain that there aren't other ways that this issue could crop up. Have you been able to reproduce this at all? If you can reproduce this, I can get you a test package that has the log buffer size increased, and a check for the paths before the table is reloaded, to verify that fixes your issue. There's also a couple of log messages I'd like to add to see if I can't find where this goes wrong in the first place. But if you can't reproduce this, tracking it down is going to be a lot trickier. This is difficult to reproduce, though possible. Please give me the test build with the increased log buffer size and I'll try to reproduce the issue. Thanks, Tim test packages are available at: http://people.redhat.com/bmarzins/device-mapper-multipath/rpms/x86_64/ and http://people.redhat.com/bmarzins/device-mapper-multipath/rpms/i386/ They are the -0.4.7-48.el5.bz786649 packages. Hi Ben, I know it's been sometime, but I finally had a chance to get back to this. Though difficult I'm able to recreate this issue, however I was not able to get to your test packages. Can you please make your test packages available again. Thanks, Tim Actually, they still are available. The location just changed to: http://people.redhat.com/bmarzins/device-mapper-multipath/rpms/RHEL5/x86_64/ and http://people.redhat.com/bmarzins/device-mapper-multipath/rpms/RHEL5/i386/ Hi Ben, I know it's been a while since I've updated this bug. But, I have loaded these packages on number of host that seen this issue, and performed a number of iscsiadm login/logout tests and I have not seeen this issue again. This issue is very rare and it would take me a while to actually see it. Typically I would see it after a once after a few weeks of testing. So, with the testing that has been done so far, I would say this issue is fixed. Can we expect to see this fix in a released version of device-mapper? Thanks for your help, Tim The problem is I didn't fix any issue in those packages. I just added warning messages, so that you would hopefully see either a message like: WARNING, path's mpp doesn't match or WARNING, path wwid doesn't match if the problem reproduced. 
It's possible that adding the checks before these print statements changed some timing, but the checks are really small, so it isn't too likely. I think it's more likely that the problem simply hasn't happened again because it's not easy to reproduce. Oh, I did not realizes that the private build was just a debug build. This issue is very difficult to reproduce. In my configuration there are 28 ESX 4.1 guest host that have your private build, and there are another 28 guest that do not. In my testing I focused mostly on the 28 hosts that have the private build, performing logins/logouts continuosly for mutiple days. Also performed a numer of test where target ports were removed, switch ports taken offline/online and hosts reoots. To be honest with I just did not see this issue again since I loaded up your private build. I guess we need ot continue testing this until we hit it. I'll send the logs to you when we see this again. Thanks, Tim Oh, I did not realizes that the private build was just a debug build. This issue is very difficult to reproduce. In my configuration there are 28 ESX 4.1 guest host that have your private build, and there are another 28 guest that do not. In my testing I focused mostly on the 28 hosts that have the private build, performing logins/logouts continuosly for mutiple days. Also performed a numer of test where target ports were removed, switch ports taken offline/online and hosts reoots. To be honest with I just did not see this issue again since I loaded up your private build. I guess we need ot continue testing this until we hit it. I'll send the logs to you when we see this again. Thanks, Tim This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux release for currently deployed products. This request is not yet committed for inclusion in a release. Have you hit this issue again? |