884427 – Unable to change node startup mode to manual due to ISCSI database error

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 884427 - Unable to change node startup mode to manual due to ISCSI database error

Summary: Unable to change node startup mode to manual due to ISCSI database error

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	iscsi-initiator-utils
Sub Component:
Version:	6.3
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Chris Leech
QA Contact:	Bruno Goncalves
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1066115
TreeView+	depends on / blocked

Reported:	2012-12-06 08:44 UTC by Gadi Ickowicz
Modified:	2014-08-22 01:41 UTC (History)
CC List:	12 users (show)
Fixed In Version:	iscsi-initiator-utils-6.2.0.873-10.el6
Doc Type:	Bug Fix
Doc Text:	Cause: Database errors could occur if multiple node records, with and without an set target portal group tag, were created for the same iSCSI target portal. VDSM was doing this by creating static records after discovery. Consequence: Depending on the filesystem dependent return order of the readdir syscall, sometimes a error would occur causing an update operation to fail. Fix: Prevent multiple node records in different formats from existing simultaneously, detected at record creation time. Result: Duplicate node entries should no longer exist in the iSCSI database, and updates to records should not result in a database error.
Clone Of:
Clones:	1066115 (view as bug list)
Environment:
Last Closed:	2013-11-21 23:08:52 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
iscsi nodes (20.00 KB, application/x-tar) 2012-12-06 08:44 UTC, Gadi Ickowicz	no flags	Details
sorting readdir replacement via LD_PRELOAD (5.72 KB, text/plain) 2013-08-27 14:44 UTC, Chris Leech	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2013:1700	0	normal	SHIPPED_LIVE	iscsi-initiator-utils bug fix and enhancement update	2013-11-20 21:52:07 UTC

Description Gadi Ickowicz 2012-12-06 08:44:10 UTC

Created attachment 658584 [details]
iscsi nodes

Description of problem:
After creating a new node and logging in, attempting to update that nodes' startup mode to to manual fails with the following error: 

iscsiadm: iface iter could not stat /var/lib/iscsi/nodes/iqn.2008-05.com.xtremio:90e2ba1cbfc4/10.35.160.7,3260.\niscsiadm: Could not execute operation on all records: encountered iSCSI database failure

Version-Release number of selected component (if applicable):
iscsi-initiator-utils-6.2.0.872-41.el6.x86_64

How reproducible:
?

Steps to Reproduce:
vdsm runs the following commands during connectStorageServer
1. /usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2008-05.com.xtremio:90e2ba1cbfc4 -I default -p 10.35.160.7:3260 --op=new
2. /usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2008-05.com.xtremio:90e2ba1cbfc4 -I default -p 10.35.160.7:3260 -l
3. /usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.2008-05.com.xtremio:90e2ba1cbfc4 -I default -p 10.35.160.7:3260 -n node.startup -v manual --op=update
  
Actual results:
step 3 fails sometimes

** tar with the contents of /var/lib/iscsi/nodes is attached **


Additional info:

Comment 2 Chris Leech 2012-12-17 19:45:49 UTC

Can I get any more information on reproducing this?

Is this reproducible?  How frequently has it been seen?

Has this been reproduced with a simple script around iscsiadm as in the steps listed, or has it only been seen with vdsm?  (are the steps listed above really steps to reproduce?)

I've been running the listed reproduction steps in a loop for hours, with no hint of anything failing.

Looking at iscsiadm, the error indicates that there was a stat(3) failure on the node record file (/var/lib/iscsi/nodes/iqn.2008-05.com.xtremio:90e2ba1cbfc4/10.35.160.7,3260) but unfortunately does not tell us why stat(3) failed.

If the node record file didn't exist at all, iscsiadm shouldn't be this deep into the database iterators - it should have failed with a "no records found" error message.

The attachment with the contents of the node record portion of the database does not include a matching record to what is listed in the reproduction steps.  Was it pulled off of the failing system, immediately after attempting to setup iqn.2008-05.com.xtremio:90e2ba1cbfc4?

It almost looks as if some process deleted the node record in the middle of the update operation, as that sort of race condition is the only thing I can see right now that would cause the described failure case.

Comment 3 Ayal Baron 2012-12-23 08:04:03 UTC

Haim, wasn't this recently reproduced? can you provide more details as requested above?

Comment 4 Gadi Ickowicz 2012-12-23 10:37:03 UTC

(In reply to comment #2)
> Can I get any more information on reproducing this?
> 
> Is this reproducible?  How frequently has it been seen?
> 
> Has this been reproduced with a simple script around iscsiadm as in the
> steps listed, or has it only been seen with vdsm?  (are the steps listed
> above really steps to reproduce?)
> 
> I've been running the listed reproduction steps in a loop for hours, with no
> hint of anything failing.
> 
> Looking at iscsiadm, the error indicates that there was a stat(3) failure on
> the node record file
> (/var/lib/iscsi/nodes/iqn.2008-05.com.xtremio:90e2ba1cbfc4/10.35.160.7,3260)
> but unfortunately does not tell us why stat(3) failed.
> 
> If the node record file didn't exist at all, iscsiadm shouldn't be this deep
> into the database iterators - it should have failed with a "no records
> found" error message.
> 
> The attachment with the contents of the node record portion of the database
> does not include a matching record to what is listed in the reproduction
> steps.  Was it pulled off of the failing system, immediately after
> attempting to setup iqn.2008-05.com.xtremio:90e2ba1cbfc4?
> 
> It almost looks as if some process deleted the node record in the middle of
> the update operation, as that sort of race condition is the only thing I can
> see right now that would cause the described failure case.

This was reproduced only through vdsm, with a specific automated test. However, once this state is reached manually running the commands yields the same errors. The node is not there because it is not added due to the error

Comment 12 Chris Leech 2013-01-28 19:08:40 UTC

A careful look at an strace of the failing update command has been very
helpful.

I believe that at the start of the update command there are two different
records for the same target portal in the filesystem.  An "old style" portal
configuration file that looks like [1] 

  [1] /var/lib/iscsi/nodes/iqn.1992-08.com.netapp:sn.151753773/10.16.41.222,3260

and a new target portal directory with an interface-bound configuration file
that looks like [2]

  [2] /var/lib/iscsi/nodes/iqn.1992-08.com.netapp:sn.151753773/10.16.41.222,3260,1/default

If the update command reads in [2] first (that might depend on the return order
from readdir), then there's a problem.  When it tries to write back the updated
record, there's a check for an old style entry [1], and then unlinks it to
instead write it back as a new style in [2].  After the updated record is
written to [2], the iterator loop looks for [1] because it was previously
returned from readdir.  That's where it is failing (but to further confuse the
issue, the update has actually happened).

Now I need to figure out why both of these configuration files seem to exist at
the same time.

Comment 15 Chris Leech 2013-07-26 22:57:32 UTC

Turns out what I was missing was there in Bruno's comment, ST discovery is run in addition to a manually node creation command (discovery is creating node records, so the manual node creation isn't really needed).

Looks like discovery is creating node records in a different format than op=new, then depending on the return order of readdir (unspecified, depends on filesystem) you might hit the bug I described in comment #12.

Now that I know what's going on, I should be able to fix this without much trouble.

Comment 20 Chris Leech 2013-08-27 14:44:53 UTC

Created attachment 791019 [details]
sorting readdir replacement via LD_PRELOAD

This is how I forced reproduction of the issue for testing.  I used the attached spd_readdir code with LD_PRELOAD to change the behavior of readdir(3) to always sort directory contents into the order that triggered this problem.

Build spd_readdir.o with

  gcc -shared -ldl -fPIC spd_readdir.c -o spd_readdir.o

Then run this test script (with appropriate target name and portal address set), which duplicates the vdsm steps involved.

---

#!/bin/sh

export LD_PRELOAD=${PWD}/spd_readdir.o

TARGET=iqn.1992-01.com.example:target.0 
PORTAL=192.168.122.1:3260

delete_all () {
	p=$1
	echo "== Clearing DB of all node records"
	iscsiadm -m node -o delete

	echo "== Clearing DB of all discovery records"
	iscsiadm -m discoverydb -o show > /dev/null
	if [ $? -eq 0 ]; then
		iscsiadm -m discoverydb -t st -p $p -o delete
	else
		echo "scsiadm: No discovery records found"
	fi
}

discovery () {
	portal=$1
	echo "== Performing SendTargets Discovery against ${portal}"
	iscsiadm -m discoverydb -t st -p $portal -o new
	iscsiadm -m discoverydb -t st -p $portal --discover
}

new_static () {
	target=$1
	portal=$2
	echo "== New static node ${target} ${portal}"
	iscsiadm -m node -o new -T $target -p $portal
}

show () {
	n=$1
	echo "== Showing $n for all nodes"
	iscsiadm -m node -o show | grep $n
}

update () {
	target=$1
	portal=$2
	n=$3
	v=$4
	echo "== Setting ${portal} $n -> $v"
	iscsiadm -m node -o update -T $target -p $portal -n $n -v $v
}

delete_all $PORTAL
discovery $PORTAL
new_static $TARGET $PORTAL
update $TARGET $PORTAL node.startup manual

Comment 21 Bruno Goncalves 2013-09-02 13:14:41 UTC

It seems that now, after updating the target I can't delete the target portal.


# iscsiadm -m discovery -I default -p na3170b.lab.bos.redhat.com -t st10.16.41.222:3260,1 iqn.1992-08.com.netapp:sn.151753773
10.16.43.127:3260,1 iqn.1992-08.com.netapp:sn.151753773


# iscsiadm -m node -T iqn.1992-08.com.netapp:sn.151753773  -p 10.16.41.222 -I default -o new && iscsiadm -m node -T iqn.1992-08.com.netapp:sn.151753773  -n node.startup -v manual --op=update -p 10.16.41.222 -I default
iscsiadm: using tpgt 1 from existing record
New iSCSI node [tcp:[hw=,ip=,net_if=,iscsi_if=default] 10.16.41.222,3260,-1 iqn.1992-08.com.netapp:sn.151753773] added

# iscsiadm -m node --targetname "iqn.1992-08.com.netapp:sn.151753773" -o delete

# iscsiadm -m discoverydb  --portal "na3170b.lab.bos.redhat.com" -t sendtargets -o deleteiscsiadm: Could not stat /var/lib/iscsi/nodes/iqn.1992-08.com.netapp:sn.151753773/10.16.41.222,3260,1/default to delete node: No such file or directory

iscsiadm: Could not delete node /var/lib/iscsi/nodes/iqn.1992-08.com.netapp:sn.151753773/10.16.41.222,3260/default


-------
# iscsiadm -m discoverydb  --portal "na3170b.lab.bos.redhat.com" -t sendtargets 
# BEGIN RECORD 6.2.0-873.5.el6
discovery.startup = manual
discovery.type = sendtargets
discovery.sendtargets.address = na3170b.lab.bos.redhat.com
discovery.sendtargets.port = 3260
discovery.sendtargets.auth.authmethod = None
discovery.sendtargets.auth.username = <empty>
discovery.sendtargets.auth.password = <empty>
discovery.sendtargets.auth.username_in = <empty>
discovery.sendtargets.auth.password_in = <empty>
discovery.sendtargets.timeo.login_timeout = 15
discovery.sendtargets.use_discoveryd = No
discovery.sendtargets.discoveryd_poll_inval = 30
discovery.sendtargets.reopen_max = 5
discovery.sendtargets.timeo.auth_timeout = 45
discovery.sendtargets.timeo.active_timeout = 30
discovery.sendtargets.iscsi.MaxRecvDataSegmentLength = 32768
# END RECORD

ls /var/lib/iscsi/send_targets/*
iqn.1992-08.com.netapp:sn.151753773,10.16.41.222,3260,1,default

Comment 24 Chris Leech 2013-10-08 15:45:56 UTC

Fix has been updated to remove old discoverydb links when overwriting with a static record.

Comment 26 Bruno Goncalves 2013-10-09 15:07:24 UTC

Reproduced on iscsi-initiator-utils-6.2.0.873-9.el6.x86_64:

#iscsiadm -m node -T iqn.1992-08.com.netapp:sn.151753773  -p 10.16.41.222 -I default -o new && iscsiadm -m node -T iqn.1992-08.com.netapp:sn.151753773 -n node.startup -v manual --op=update -p 10.16.41.222 -I default

iscsiadm: using tpgt 1 from existing record
New iSCSI node [tcp:[hw=,ip=,net_if=,iscsi_if=default] 10.16.41.222,3260,-1 iqn.1992-08.com.netapp:sn.151753773] added

#iscsiadm -m node -I default --targetname iqn.1992-08.com.netapp:sn.151753773 -o delete

#iscsiadm -m discoverydb -t sendtargets --portal na3170b.lab.bos.redhat.com -o delete
iscsiadm: Could not stat /var/lib/iscsi/nodes/iqn.1992-08.com.netapp:sn.151753773/10.16.41.222,3260,1/default to delete node: No such file or directory

iscsiadm: Could not delete node /var/lib/iscsi/nodes/iqn.1992-08.com.netapp:sn.151753773/10.16.41.222,3260/default

--------
Verified fix on iscsi-initiator-utils-6.2.0.873-10.el6.x86_64

#iscsiadm -m node -T iqn.1992-08.com.netapp:sn.151753773  -p 10.16.41.222 -I default -o new && iscsiadm -m node -T iqn.1992-08.com.netapp:sn.151753773 -n node.startup -v manual --op=update -p 10.16.41.222 -I default

iscsiadm: using tpgt 1 from existing record
New iSCSI node [tcp:[hw=,ip=,net_if=,iscsi_if=default] 10.16.41.222,3260,-1 iqn.1992-08.com.netapp:sn.151753773] added

#iscsiadm -m node -I default --targetname iqn.1992-08.com.netapp:sn.151753773 -o delete

#iscsiadm -m discoverydb -t sendtargets --portal na3170b.lab.bos.redhat.com -o delete

Comment 27 errata-xmlrpc 2013-11-21 23:08:52 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1700.html

Note You need to log in before you can comment on or make changes to this bug.