Bug 1751761 - Nova introduced a bottleneck by unnecessarily serializing attach/detach operations
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 17.0 (Wallaby)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: Upstream M1
Target Release: 17.0
Assignee: Lee Yarwood
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Depends On:
Blocks: 1941951
Reported: 2019-09-12 14:18 UTC by Gorka Eguileor
Modified: 2023-03-21 19:21 UTC
CC: 14 users

Fixed In Version: openstack-nova-22.1.0-0.20210309122122.31889ce.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1941951 (view as bug list)
Environment:
Last Closed: 2022-09-21 12:07:58 UTC
Target Upstream Version: Wallaby
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1800515 0 None None None 2019-09-12 14:18:15 UTC
OpenStack gerrit 614190 0 None MERGED Use os-brick locking for volume attach and detach 2021-02-08 15:28:43 UTC
Red Hat Issue Tracker OSP-457 0 None None None 2022-04-13 19:58:04 UTC
Red Hat Product Errata RHEA-2022:6543 0 None None None 2022-09-21 12:09:15 UTC

Description Gorka Eguileor 2019-09-12 14:18:15 UTC
Cinder introduced "shared_targets" and "service_uuid" fields in volumes to allow volume consumers to protect themselves from unintended leftover devices when handling iSCSI connections with shared targets.

The way to protect against the automatic scans that race with detach/map operations is to lock so that only one attach or detach operation per server runs at any given time.

When using an up-to-date Open-iSCSI initiator we don't need locks, because it can disable automatic LUN scans (the real cause of the leftover devices), and OS-Brick already supports this feature.

Currently Nova blindly locks whenever "shared_targets" is set to True, even when the iSCSI initiator and OS-Brick already prevent such races, which introduces unnecessary serialization of volume connections.

The current Nova code also serializes all connections for non-iSCSI backends, such as RBD, which don't report "shared_targets" because the field is meaningless to them.

Comment 3 Matthew Booth 2019-10-02 13:29:46 UTC
I've slept since I discussed this last. Remind me what shared targets means in the context of iscsi? IIRC shared targets means you attach 1 thing and you get all the things. E.g. when you mount an NFS export you get all the volumes on that export even if you only need 1. Is shared targets the same with iscsi?

Comment 4 Gorka Eguileor 2019-10-02 13:53:45 UTC
Yes, it is a similar concept. Some iSCSI backends have a 1:1 relationship between the iSCSI target-portal and the volume/LUN (in that case the initiator must log in once per volume), whereas others share the same target-portal for all volumes/LUNs (we log in once and get all the LUNs mapped there).
There was a race condition for shared targets between the mapping/unmapping at the backend and the attach/detach on the host, caused by iSCSI AEN/AER and the Open-iSCSI initiator behavior, that resulted in leftover devices on the host. To fix it I added a feature to the Open-iSCSI initiator and support for it in OS-Brick (backported downstream all the way back to OSP8), so at RH we no longer had these issues, and upstream anyone using a modern iSCSI initiator would not have them either.
About 6 months later Nova added a big lock around the mapping/unmapping + attaching/detaching operations. It serialized them for every backend that didn't explicitly report that it had no shared targets (i.e. the driver set the field to True or didn't set it at all), regardless of the transport protocol used by the backend (shared targets mean nothing to Ceph or FC backends), and it also didn't check whether the initiator running on the host had the new feature that makes the lock unnecessary.
So the lock is really only necessary when doing iSCSI with an initiator that lacks the manual-scans feature, and that is exactly the check the os-brick context manager performs.

Comment 5 Matthew Booth 2019-10-02 14:49:07 UTC
Oh, nice! Locking in os-brick had previously been hard-NAKed. This was always where it needed to live, as the locking requirements are specific not just to the backend, but also to the bugs present in that backend. Attempting to do it in Nova, or even worse in c-vol, made this a minefield. I suggest you might want to add some intent to the interface, e.g.:

brick_utils.guard_attach(volume), and
brick_utils.guard_detach(volume)

In the first instance these can both be aliases for guard_connection, but IIRC there was at least 1 driver which could handle concurrency in one operation but not the other.
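The alias idea above might look like the sketch below. This is a minimal hypothetical illustration, assuming `guard_connection` takes only the volume; the single process-wide lock stands in for whatever locking os-brick actually does.

```python
import threading
from contextlib import contextmanager

_lock = threading.Lock()


@contextmanager
def guard_connection(volume):
    # Simplified stand-in for the real os-brick guard: one
    # process-wide lock serializing the critical section.
    with _lock:
        yield


# In the first instance both intents map onto the same guard; a backend
# that tolerates concurrent attaches but not concurrent detaches could
# later give them distinct implementations without changing callers.
guard_attach = guard_connection
guard_detach = guard_connection
```

Callers then express intent (`with guard_attach(volume): ...`) even while the underlying behavior is identical, which is the point of the suggestion.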

Anyway, assuming the locking is correct in os-brick I'm all in favour of this in Nova.

Comment 17 errata-xmlrpc 2022-09-21 12:07:58 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543

