Bug 884427 - Unable to change node startup mode to manual due to ISCSI database error
Unable to change node startup mode to manual due to ISCSI database error
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: iscsi-initiator-utils (Show other bugs)
6.3
x86_64 Linux
medium Severity high
: rc
: ---
Assigned To: Chris Leech
Bruno Goncalves
: TestBlocker
Depends On:
Blocks: 1066115
  Show dependency treegraph
 
Reported: 2012-12-06 03:44 EST by Gadi Ickowicz
Modified: 2014-08-21 21:41 EDT (History)
12 users (show)

See Also:
Fixed In Version: iscsi-initiator-utils-6.2.0.873-10.el6
Doc Type: Bug Fix
Doc Text:
Cause: Database errors could occur if multiple node records, with and without an set target portal group tag, were created for the same iSCSI target portal. VDSM was doing this by creating static records after discovery. Consequence: Depending on the filesystem dependent return order of the readdir syscall, sometimes a error would occur causing an update operation to fail. Fix: Prevent multiple node records in different formats from existing simultaneously, detected at record creation time. Result: Duplicate node entries should no longer exist in the iSCSI database, and updates to records should not result in a database error.
Story Points: ---
Clone Of:
: 1066115 (view as bug list)
Environment:
Last Closed: 2013-11-21 18:08:52 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
iscsi nodes (20.00 KB, application/x-tar)
2012-12-06 03:44 EST, Gadi Ickowicz
no flags Details
sorting readdir replacement via LD_PRELOAD (5.72 KB, text/plain)
2013-08-27 10:44 EDT, Chris Leech
no flags Details

  None (edit)
Description Gadi Ickowicz 2012-12-06 03:44:10 EST
Created attachment 658584 [details]
iscsi nodes

Description of problem:
After creating a new node and logging in, attempting to update that nodes' startup mode to to manual fails with the following error: 

iscsiadm: iface iter could not stat /var/lib/iscsi/nodes/iqn.2008-05.com.xtremio:90e2ba1cbfc4/10.35.160.7,3260.\niscsiadm: Could not execute operation on all records: encountered iSCSI database failure

Version-Release number of selected component (if applicable):
iscsi-initiator-utils-6.2.0.872-41.el6.x86_64

How reproducible:
?

Steps to Reproduce:
vdsm runs the following commands during connectStorageServer
1. /usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2008-05.com.xtremio:90e2ba1cbfc4 -I default -p 10.35.160.7:3260 --op=new
2. /usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2008-05.com.xtremio:90e2ba1cbfc4 -I default -p 10.35.160.7:3260 -l
3. /usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2008-05.com.xtremio:90e2ba1cbfc4 -I default -p 10.35.160.7:3260 -n node.startup -v manual --op=update
  
Actual results:
step 3 fails sometimes

** tar with the contents of /var/lib/iscsi/nodes is attached **


Additional info:
Comment 2 Chris Leech 2012-12-17 14:45:49 EST
Can I get any more information on reproducing this?

Is this reproducible?  How frequently has it been seen?

Has this been reproduced with a simple script around iscsiadm as in the steps listed, or has it only been seen with vdsm?  (are the steps listed above really steps to reproduce?)

I've been running the listed reproduction steps in a loop for hours, with no hint of anything failing.

Looking at iscsiadm, the error indicates that there was a stat(3) failure on the node record file (/var/lib/iscsi/nodes/iqn.2008-05.com.xtremio:90e2ba1cbfc4/10.35.160.7,3260) but unfortunately does not tell us why stat(3) failed.

If the node record file didn't exist at all, iscsiadm shouldn't be this deep into the database iterators - it should have failed with a "no records found" error message.

The attachment with the contents of the node record portion of the database does not include a matching record to what is listed in the reproduction steps.  Was it pulled off of the failing system, immediately after attempting to setup iqn.2008-05.com.xtremio:90e2ba1cbfc4?

It almost looks as if some process deleted the node record in the middle of the update operation, as that sort of race condition is the only thing I can see right now that would cause the described failure case.
Comment 3 Ayal Baron 2012-12-23 03:04:03 EST
Haim, wasn't this recently reproduced? can you provide more details as requested above?
Comment 4 Gadi Ickowicz 2012-12-23 05:37:03 EST
(In reply to comment #2)
> Can I get any more information on reproducing this?
> 
> Is this reproducible?  How frequently has it been seen?
> 
> Has this been reproduced with a simple script around iscsiadm as in the
> steps listed, or has it only been seen with vdsm?  (are the steps listed
> above really steps to reproduce?)
> 
> I've been running the listed reproduction steps in a loop for hours, with no
> hint of anything failing.
> 
> Looking at iscsiadm, the error indicates that there was a stat(3) failure on
> the node record file
> (/var/lib/iscsi/nodes/iqn.2008-05.com.xtremio:90e2ba1cbfc4/10.35.160.7,3260)
> but unfortunately does not tell us why stat(3) failed.
> 
> If the node record file didn't exist at all, iscsiadm shouldn't be this deep
> into the database iterators - it should have failed with a "no records
> found" error message.
> 
> The attachment with the contents of the node record portion of the database
> does not include a matching record to what is listed in the reproduction
> steps.  Was it pulled off of the failing system, immediately after
> attempting to setup iqn.2008-05.com.xtremio:90e2ba1cbfc4?
> 
> It almost looks as if some process deleted the node record in the middle of
> the update operation, as that sort of race condition is the only thing I can
> see right now that would cause the described failure case.

This was reproduced only through vdsm, with a specific automated test. However, once this state is reached manually running the commands yields the same errors. The node is not there because it is not added due to the error
Comment 12 Chris Leech 2013-01-28 14:08:40 EST
A careful look at an strace of the failing update command has been very
helpful.

I believe that at the start of the update command there are two different
records for the same target portal in the filesystem.  An "old style" portal
configuration file that looks like [1] 

  [1] /var/lib/iscsi/nodes/iqn.1992-08.com.netapp:sn.151753773/10.16.41.222,3260

and a new target portal directory with an interface-bound configuration file
that looks like [2]

  [2] /var/lib/iscsi/nodes/iqn.1992-08.com.netapp:sn.151753773/10.16.41.222,3260,1/default

If the update command reads in [2] first (that might depend on the return order
from readdir), then there's a problem.  When it tries to write back the updated
record, there's a check for an old style entry [1], and then unlinks it to
instead write it back as a new style in [2].  After the updated record is
written to [2], the iterator loop looks for [1] because it was previously
returned from readdir.  That's where it is failing (but to further confuse the
issue, the update has actually happened).

Now I need to figure out why both of these configuration files seem to exist at
the same time.
Comment 15 Chris Leech 2013-07-26 18:57:32 EDT
Turns out what I was missing was there in Bruno's comment, ST discovery is run in addition to a manually node creation command (discovery is creating node records, so the manual node creation isn't really needed).

Looks like discovery is creating node records in a different format than op=new, then depending on the return order of readdir (unspecified, depends on filesystem) you might hit the bug I described in comment #12.

Now that I know what's going on, I should be able to fix this without much trouble.
Comment 20 Chris Leech 2013-08-27 10:44:53 EDT
Created attachment 791019 [details]
sorting readdir replacement via LD_PRELOAD

This is how I forced reproduction of the issue for testing.  I used the attached spd_readdir code with LD_PRELOAD to change the behavior of readdir(3) to always sort directory contents into the order that triggered this problem.

Build spd_readdir.o with

  gcc -shared -ldl -fPIC spd_readdir.c -o spd_readdir.o

Then run this test script (with appropriate target name and portal address set), which duplicates the vdsm steps involved.

---

#!/bin/sh

export LD_PRELOAD=${PWD}/spd_readdir.o

TARGET=iqn.1992-01.com.example:target.0 
PORTAL=192.168.122.1:3260

delete_all () {
	p=$1
	echo "== Clearing DB of all node records"
	iscsiadm -m node -o delete

	echo "== Clearing DB of all discovery records"
	iscsiadm -m discoverydb -o show > /dev/null
	if [ $? -eq 0 ]; then
		iscsiadm -m discoverydb -t st -p $p -o delete
	else
		echo "scsiadm: No discovery records found"
	fi
}

discovery () {
	portal=$1
	echo "== Performing SendTargets Discovery against ${portal}"
	iscsiadm -m discoverydb -t st -p $portal -o new
	iscsiadm -m discoverydb -t st -p $portal --discover
}

new_static () {
	target=$1
	portal=$2
	echo "== New static node ${target} ${portal}"
	iscsiadm -m node -o new -T $target -p $portal
}

show () {
	n=$1
	echo "== Showing $n for all nodes"
	iscsiadm -m node -o show | grep $n
}

update () {
	target=$1
	portal=$2
	n=$3
	v=$4
	echo "== Setting ${portal} $n -> $v"
	iscsiadm -m node -o update -T $target -p $portal -n $n -v $v
}

delete_all $PORTAL
discovery $PORTAL
new_static $TARGET $PORTAL
update $TARGET $PORTAL node.startup manual
Comment 21 Bruno Goncalves 2013-09-02 09:14:41 EDT
It seems that now, after updating the target I can't delete the target portal.


# iscsiadm -m discovery -I default -p na3170b.lab.bos.redhat.com -t st10.16.41.222:3260,1 iqn.1992-08.com.netapp:sn.151753773
10.16.43.127:3260,1 iqn.1992-08.com.netapp:sn.151753773


# iscsiadm -m node -T iqn.1992-08.com.netapp:sn.151753773  -p 10.16.41.222 -I default -o new && iscsiadm -m node -T iqn.1992-08.com.netapp:sn.151753773  -n node.startup -v manual --op=update -p 10.16.41.222 -I default
iscsiadm: using tpgt 1 from existing record
New iSCSI node [tcp:[hw=,ip=,net_if=,iscsi_if=default] 10.16.41.222,3260,-1 iqn.1992-08.com.netapp:sn.151753773] added

# iscsiadm -m node --targetname "iqn.1992-08.com.netapp:sn.151753773" -o delete

# iscsiadm -m discoverydb  --portal "na3170b.lab.bos.redhat.com" -t sendtargets -o deleteiscsiadm: Could not stat /var/lib/iscsi/nodes/iqn.1992-08.com.netapp:sn.151753773/10.16.41.222,3260,1/default to delete node: No such file or directory

iscsiadm: Could not delete node /var/lib/iscsi/nodes/iqn.1992-08.com.netapp:sn.151753773/10.16.41.222,3260/default


-------
# iscsiadm -m discoverydb  --portal "na3170b.lab.bos.redhat.com" -t sendtargets 
# BEGIN RECORD 6.2.0-873.5.el6
discovery.startup = manual
discovery.type = sendtargets
discovery.sendtargets.address = na3170b.lab.bos.redhat.com
discovery.sendtargets.port = 3260
discovery.sendtargets.auth.authmethod = None
discovery.sendtargets.auth.username = <empty>
discovery.sendtargets.auth.password = <empty>
discovery.sendtargets.auth.username_in = <empty>
discovery.sendtargets.auth.password_in = <empty>
discovery.sendtargets.timeo.login_timeout = 15
discovery.sendtargets.use_discoveryd = No
discovery.sendtargets.discoveryd_poll_inval = 30
discovery.sendtargets.reopen_max = 5
discovery.sendtargets.timeo.auth_timeout = 45
discovery.sendtargets.timeo.active_timeout = 30
discovery.sendtargets.iscsi.MaxRecvDataSegmentLength = 32768
# END RECORD

ls /var/lib/iscsi/send_targets/*
iqn.1992-08.com.netapp:sn.151753773,10.16.41.222,3260,1,default
Comment 24 Chris Leech 2013-10-08 11:45:56 EDT
Fix has been updated to remove old discoverydb links when overwriting with a static record.
Comment 26 Bruno Goncalves 2013-10-09 11:07:24 EDT
Reproduced on iscsi-initiator-utils-6.2.0.873-9.el6.x86_64:

#iscsiadm -m node -T iqn.1992-08.com.netapp:sn.151753773  -p 10.16.41.222 -I default -o new && iscsiadm -m node -T iqn.1992-08.com.netapp:sn.151753773 -n node.startup -v manual --op=update -p 10.16.41.222 -I default

iscsiadm: using tpgt 1 from existing record
New iSCSI node [tcp:[hw=,ip=,net_if=,iscsi_if=default] 10.16.41.222,3260,-1 iqn.1992-08.com.netapp:sn.151753773] added

#iscsiadm -m node -I default --targetname iqn.1992-08.com.netapp:sn.151753773 -o delete

#iscsiadm -m discoverydb -t sendtargets --portal na3170b.lab.bos.redhat.com -o delete
iscsiadm: Could not stat /var/lib/iscsi/nodes/iqn.1992-08.com.netapp:sn.151753773/10.16.41.222,3260,1/default to delete node: No such file or directory

iscsiadm: Could not delete node /var/lib/iscsi/nodes/iqn.1992-08.com.netapp:sn.151753773/10.16.41.222,3260/default

--------
Verified fix on iscsi-initiator-utils-6.2.0.873-10.el6.x86_64

#iscsiadm -m node -T iqn.1992-08.com.netapp:sn.151753773  -p 10.16.41.222 -I default -o new && iscsiadm -m node -T iqn.1992-08.com.netapp:sn.151753773 -n node.startup -v manual --op=update -p 10.16.41.222 -I default

iscsiadm: using tpgt 1 from existing record
New iSCSI node [tcp:[hw=,ip=,net_if=,iscsi_if=default] 10.16.41.222,3260,-1 iqn.1992-08.com.netapp:sn.151753773] added

#iscsiadm -m node -I default --targetname iqn.1992-08.com.netapp:sn.151753773 -o delete

#iscsiadm -m discoverydb -t sendtargets --portal na3170b.lab.bos.redhat.com -o delete
Comment 27 errata-xmlrpc 2013-11-21 18:08:52 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1700.html

Note You need to log in before you can comment on or make changes to this bug.