Bug 884427
| Summary: | Unable to change node startup mode to manual due to ISCSI database error | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Gadi Ickowicz <gickowic> | ||||||
| Component: | iscsi-initiator-utils | Assignee: | Chris Leech <cleech> | ||||||
| Status: | CLOSED ERRATA | QA Contact: | Bruno Goncalves <bgoncalv> | ||||||
| Severity: | high | Docs Contact: | |||||||
| Priority: | medium | ||||||||
| Version: | 6.3 | CC: | abaron, agrover, bgoncalv, cleech, coughlan, cpelland, ddumas, eedri, gickowic, hateya, kzhang, nlevinki | ||||||
| Target Milestone: | rc | Keywords: | TestBlocker | ||||||
| Target Release: | --- | ||||||||
| Hardware: | x86_64 | ||||||||
| OS: | Linux | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | iscsi-initiator-utils-6.2.0.873-10.el6 | Doc Type: | Bug Fix | ||||||
| Doc Text: |
Cause: Database errors could occur if multiple node records, with and without an set target portal group tag, were created for the same iSCSI target portal. VDSM was doing this by creating static records after discovery.
Consequence: Depending on the filesystem dependent return order of the readdir syscall, sometimes a error would occur causing an update operation to fail.
Fix: Prevent multiple node records in different formats from existing simultaneously, detected at record creation time.
Result: Duplicate node entries should no longer exist in the iSCSI database, and updates to records should not result in a database error.
|
Story Points: | --- | ||||||
| Clone Of: | |||||||||
| : | 1066115 (view as bug list) | Environment: | |||||||
| Last Closed: | 2013-11-21 23:08:52 UTC | Type: | Bug | ||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Embargoed: | |||||||||
| Bug Depends On: | |||||||||
| Bug Blocks: | 1066115 | ||||||||
| Attachments: |
|
||||||||
Can I get any more information on reproducing this? Is this reproducible? How frequently has it been seen? Has this been reproduced with a simple script around iscsiadm as in the steps listed, or has it only been seen with vdsm? (are the steps listed above really steps to reproduce?) I've been running the listed reproduction steps in a loop for hours, with no hint of anything failing. Looking at iscsiadm, the error indicates that there was a stat(3) failure on the node record file (/var/lib/iscsi/nodes/iqn.2008-05.com.xtremio:90e2ba1cbfc4/10.35.160.7,3260) but unfortunately does not tell us why stat(3) failed. If the node record file didn't exist at all, iscsiadm shouldn't be this deep into the database iterators - it should have failed with a "no records found" error message. The attachment with the contents of the node record portion of the database does not include a matching record to what is listed in the reproduction steps. Was it pulled off of the failing system, immediately after attempting to setup iqn.2008-05.com.xtremio:90e2ba1cbfc4? It almost looks as if some process deleted the node record in the middle of the update operation, as that sort of race condition is the only thing I can see right now that would cause the described failure case. Haim, wasn't this recently reproduced? can you provide more details as requested above? (In reply to comment #2) > Can I get any more information on reproducing this? > > Is this reproducible? How frequently has it been seen? > > Has this been reproduced with a simple script around iscsiadm as in the > steps listed, or has it only been seen with vdsm? (are the steps listed > above really steps to reproduce?) > > I've been running the listed reproduction steps in a loop for hours, with no > hint of anything failing. > > Looking at iscsiadm, the error indicates that there was a stat(3) failure on > the node record file > (/var/lib/iscsi/nodes/iqn.2008-05.com.xtremio:90e2ba1cbfc4/10.35.160.7,3260) > but unfortunately does not tell us why stat(3) failed. > > If the node record file didn't exist at all, iscsiadm shouldn't be this deep > into the database iterators - it should have failed with a "no records > found" error message. > > The attachment with the contents of the node record portion of the database > does not include a matching record to what is listed in the reproduction > steps. Was it pulled off of the failing system, immediately after > attempting to setup iqn.2008-05.com.xtremio:90e2ba1cbfc4? > > It almost looks as if some process deleted the node record in the middle of > the update operation, as that sort of race condition is the only thing I can > see right now that would cause the described failure case. This was reproduced only through vdsm, with a specific automated test. However, once this state is reached manually running the commands yields the same errors. The node is not there because it is not added due to the error A careful look at an strace of the failing update command has been very helpful. I believe that at the start of the update command there are two different records for the same target portal in the filesystem. An "old style" portal configuration file that looks like [1] [1] /var/lib/iscsi/nodes/iqn.1992-08.com.netapp:sn.151753773/10.16.41.222,3260 and a new target portal directory with an interface-bound configuration file that looks like [2] [2] /var/lib/iscsi/nodes/iqn.1992-08.com.netapp:sn.151753773/10.16.41.222,3260,1/default If the update command reads in [2] first (that might depend on the return order from readdir), then there's a problem. When it tries to write back the updated record, there's a check for an old style entry [1], and then unlinks it to instead write it back as a new style in [2]. After the updated record is written to [2], the iterator loop looks for [1] because it was previously returned from readdir. That's where it is failing (but to further confuse the issue, the update has actually happened). Now I need to figure out why both of these configuration files seem to exist at the same time. Turns out what I was missing was there in Bruno's comment, ST discovery is run in addition to a manually node creation command (discovery is creating node records, so the manual node creation isn't really needed). Looks like discovery is creating node records in a different format than op=new, then depending on the return order of readdir (unspecified, depends on filesystem) you might hit the bug I described in comment #12. Now that I know what's going on, I should be able to fix this without much trouble. Created attachment 791019 [details]
sorting readdir replacement via LD_PRELOAD
This is how I forced reproduction of the issue for testing. I used the attached spd_readdir code with LD_PRELOAD to change the behavior of readdir(3) to always sort directory contents into the order that triggered this problem.
Build spd_readdir.o with
gcc -shared -ldl -fPIC spd_readdir.c -o spd_readdir.o
Then run this test script (with appropriate target name and portal address set), which duplicates the vdsm steps involved.
---
#!/bin/sh
export LD_PRELOAD=${PWD}/spd_readdir.o
TARGET=iqn.1992-01.com.example:target.0
PORTAL=192.168.122.1:3260
delete_all () {
p=$1
echo "== Clearing DB of all node records"
iscsiadm -m node -o delete
echo "== Clearing DB of all discovery records"
iscsiadm -m discoverydb -o show > /dev/null
if [ $? -eq 0 ]; then
iscsiadm -m discoverydb -t st -p $p -o delete
else
echo "scsiadm: No discovery records found"
fi
}
discovery () {
portal=$1
echo "== Performing SendTargets Discovery against ${portal}"
iscsiadm -m discoverydb -t st -p $portal -o new
iscsiadm -m discoverydb -t st -p $portal --discover
}
new_static () {
target=$1
portal=$2
echo "== New static node ${target} ${portal}"
iscsiadm -m node -o new -T $target -p $portal
}
show () {
n=$1
echo "== Showing $n for all nodes"
iscsiadm -m node -o show | grep $n
}
update () {
target=$1
portal=$2
n=$3
v=$4
echo "== Setting ${portal} $n -> $v"
iscsiadm -m node -o update -T $target -p $portal -n $n -v $v
}
delete_all $PORTAL
discovery $PORTAL
new_static $TARGET $PORTAL
update $TARGET $PORTAL node.startup manual
It seems that now, after updating the target I can't delete the target portal. # iscsiadm -m discovery -I default -p na3170b.lab.bos.redhat.com -t st10.16.41.222:3260,1 iqn.1992-08.com.netapp:sn.151753773 10.16.43.127:3260,1 iqn.1992-08.com.netapp:sn.151753773 # iscsiadm -m node -T iqn.1992-08.com.netapp:sn.151753773 -p 10.16.41.222 -I default -o new && iscsiadm -m node -T iqn.1992-08.com.netapp:sn.151753773 -n node.startup -v manual --op=update -p 10.16.41.222 -I default iscsiadm: using tpgt 1 from existing record New iSCSI node [tcp:[hw=,ip=,net_if=,iscsi_if=default] 10.16.41.222,3260,-1 iqn.1992-08.com.netapp:sn.151753773] added # iscsiadm -m node --targetname "iqn.1992-08.com.netapp:sn.151753773" -o delete # iscsiadm -m discoverydb --portal "na3170b.lab.bos.redhat.com" -t sendtargets -o deleteiscsiadm: Could not stat /var/lib/iscsi/nodes/iqn.1992-08.com.netapp:sn.151753773/10.16.41.222,3260,1/default to delete node: No such file or directory iscsiadm: Could not delete node /var/lib/iscsi/nodes/iqn.1992-08.com.netapp:sn.151753773/10.16.41.222,3260/default ------- # iscsiadm -m discoverydb --portal "na3170b.lab.bos.redhat.com" -t sendtargets # BEGIN RECORD 6.2.0-873.5.el6 discovery.startup = manual discovery.type = sendtargets discovery.sendtargets.address = na3170b.lab.bos.redhat.com discovery.sendtargets.port = 3260 discovery.sendtargets.auth.authmethod = None discovery.sendtargets.auth.username = <empty> discovery.sendtargets.auth.password = <empty> discovery.sendtargets.auth.username_in = <empty> discovery.sendtargets.auth.password_in = <empty> discovery.sendtargets.timeo.login_timeout = 15 discovery.sendtargets.use_discoveryd = No discovery.sendtargets.discoveryd_poll_inval = 30 discovery.sendtargets.reopen_max = 5 discovery.sendtargets.timeo.auth_timeout = 45 discovery.sendtargets.timeo.active_timeout = 30 discovery.sendtargets.iscsi.MaxRecvDataSegmentLength = 32768 # END RECORD ls /var/lib/iscsi/send_targets/* iqn.1992-08.com.netapp:sn.151753773,10.16.41.222,3260,1,default Fix has been updated to remove old discoverydb links when overwriting with a static record. Reproduced on iscsi-initiator-utils-6.2.0.873-9.el6.x86_64: #iscsiadm -m node -T iqn.1992-08.com.netapp:sn.151753773 -p 10.16.41.222 -I default -o new && iscsiadm -m node -T iqn.1992-08.com.netapp:sn.151753773 -n node.startup -v manual --op=update -p 10.16.41.222 -I default iscsiadm: using tpgt 1 from existing record New iSCSI node [tcp:[hw=,ip=,net_if=,iscsi_if=default] 10.16.41.222,3260,-1 iqn.1992-08.com.netapp:sn.151753773] added #iscsiadm -m node -I default --targetname iqn.1992-08.com.netapp:sn.151753773 -o delete #iscsiadm -m discoverydb -t sendtargets --portal na3170b.lab.bos.redhat.com -o delete iscsiadm: Could not stat /var/lib/iscsi/nodes/iqn.1992-08.com.netapp:sn.151753773/10.16.41.222,3260,1/default to delete node: No such file or directory iscsiadm: Could not delete node /var/lib/iscsi/nodes/iqn.1992-08.com.netapp:sn.151753773/10.16.41.222,3260/default -------- Verified fix on iscsi-initiator-utils-6.2.0.873-10.el6.x86_64 #iscsiadm -m node -T iqn.1992-08.com.netapp:sn.151753773 -p 10.16.41.222 -I default -o new && iscsiadm -m node -T iqn.1992-08.com.netapp:sn.151753773 -n node.startup -v manual --op=update -p 10.16.41.222 -I default iscsiadm: using tpgt 1 from existing record New iSCSI node [tcp:[hw=,ip=,net_if=,iscsi_if=default] 10.16.41.222,3260,-1 iqn.1992-08.com.netapp:sn.151753773] added #iscsiadm -m node -I default --targetname iqn.1992-08.com.netapp:sn.151753773 -o delete #iscsiadm -m discoverydb -t sendtargets --portal na3170b.lab.bos.redhat.com -o delete Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1700.html |
Created attachment 658584 [details] iscsi nodes Description of problem: After creating a new node and logging in, attempting to update that nodes' startup mode to to manual fails with the following error: iscsiadm: iface iter could not stat /var/lib/iscsi/nodes/iqn.2008-05.com.xtremio:90e2ba1cbfc4/10.35.160.7,3260.\niscsiadm: Could not execute operation on all records: encountered iSCSI database failure Version-Release number of selected component (if applicable): iscsi-initiator-utils-6.2.0.872-41.el6.x86_64 How reproducible: ? Steps to Reproduce: vdsm runs the following commands during connectStorageServer 1. /usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2008-05.com.xtremio:90e2ba1cbfc4 -I default -p 10.35.160.7:3260 --op=new 2. /usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2008-05.com.xtremio:90e2ba1cbfc4 -I default -p 10.35.160.7:3260 -l 3. /usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2008-05.com.xtremio:90e2ba1cbfc4 -I default -p 10.35.160.7:3260 -n node.startup -v manual --op=update Actual results: step 3 fails sometimes ** tar with the contents of /var/lib/iscsi/nodes is attached ** Additional info: