Bug 1624093 - [tcmu-runner] tcmu_rbd_lock_break fails with "Could not break lock from XYZ (Err -16)"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: iSCSI
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: z1
Target Release: 3.2
Assignee: Jason Dillaman
QA Contact: Manohar Murthy
URL:
Whiteboard:
Depends On: 1623737
Blocks:
Reported: 2018-08-30 22:53 UTC by Jason Dillaman
Modified: 2019-03-07 15:51 UTC (History)

Fixed In Version: RHEL: ceph-12.2.8-70.el7cp Ubuntu: ceph_12.2.8-55redhat1
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-07 15:50:55 UTC
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 34534 0 None None None 2018-08-30 22:53:56 UTC
Red Hat Product Errata RHBA-2019:0475 0 None None None 2019-03-07 15:51:06 UTC

Description Jason Dillaman 2018-08-30 22:53:24 UTC
Description of problem:
After a failover event in which the active path moves from gateway A to gateway B, gateway B blacklists gateway A to prevent data corruption. If no further IO is sent to gateway A (e.g. because its network is unreachable by the initiators), gateway A may not detect that it has been blacklisted and therefore won't attempt to recover by re-opening the RBD image associated with the LUN.

If the blacklist entry is manually removed or is allowed to expire after the default 24 hours, and the IO path between the initiators and gateway A is then restored, gateway A will attempt to acquire the exclusive lock on the LUN by breaking the lock held by gateway B. However, the internal state of librbd on gateway A is inconsistent (it still reports is_lock_owner=1 in the log below), so the attempt to break the peer gateway's lock fails with -16 (EBUSY) while tcmu-runner gets stuck in a loop retrying the break:

2018-08-30 17:49:15.396290 7f7c9affd700 10 librbd::ManagedLock: 0x7f7c940588d0 break_lock
2018-08-30 17:49:15.396295 7f7c9affd700 20 librbd::ManagedLock: 0x7f7c940588d0 is_lock_owner=1
2018-08-30 17:49:15.396296 7f7c9affd700 -1 librbd: failed to break lock: (16) Device or resource busy

The only recovery for the LUN on gateway A is to restart tcmu-runner. 
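
To confirm from a log file that a gateway is in this state, the three-line pattern above can be matched mechanically. The following is a minimal sketch only: the function name, the regular expression, and the SAMPLE list are assumptions made for illustration and are not part of tcmu-runner or librbd. It counts break_lock attempts that report is_lock_owner=1 and then fail with errno 16 (EBUSY).

```python
import errno
import re

# Log prefix as seen in this report: timestamp, thread id, debug level, message.
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<thread>[0-9a-f]+)\s+(?P<level>-?\d+) (?P<msg>.*)$"
)

# The "Err -16" in the bug title is -EBUSY.
BREAK_FAILED = f"failed to break lock: ({errno.EBUSY})"


def count_stuck_break_attempts(lines):
    """Count break_lock attempts that saw is_lock_owner=1 and then failed with EBUSY."""
    stuck = 0
    owner_seen = False
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue
        msg = match.group("msg")
        if "librbd::ManagedLock" in msg and msg.endswith("break_lock"):
            owner_seen = False  # a new break attempt starts
        elif "is_lock_owner=1" in msg:
            owner_seen = True
        elif BREAK_FAILED in msg:
            if owner_seen:
                stuck += 1
            owner_seen = False
    return stuck


# The log excerpt from this report, used as sample input.
SAMPLE = [
    "2018-08-30 17:49:15.396290 7f7c9affd700 10 librbd::ManagedLock: 0x7f7c940588d0 break_lock",
    "2018-08-30 17:49:15.396295 7f7c9affd700 20 librbd::ManagedLock: 0x7f7c940588d0 is_lock_owner=1",
    "2018-08-30 17:49:15.396296 7f7c9affd700 -1 librbd: failed to break lock: (16) Device or resource busy",
]

print(count_stuck_break_attempts(SAMPLE))
```

A count that keeps growing while IO to the LUN hangs would be consistent with the loop described here. The sketch assumes librbd debug logging is enabled at a level that emits the ManagedLock lines.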

Version-Release number of selected component (if applicable):
3.0, 3.1

How reproducible:
100%

Steps to Reproduce:
1. Force the multipath layer to fail over to a non-primary path in a way that does not fail the gateway serving the primary path. For example, run ifdown on the network link to the primary gateway. Do not do something like rebooting the gateway.
2. Wait more than 24 hours for the blacklist to expire, or manually remove the blacklist entry for the primary gateway.
3. Fix the problem that caused the failover. In this example, just restore the network link.
4. Check that the multipath device has failed back.
5. Perform a basic IO test.
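
For step 2, the manual removal can be scripted. This is a rough sketch, assuming the Luminous-era `ceph osd blacklist ls` and `ceph osd blacklist rm` CLI subcommands are available on an admin node. The helper names and the example address are hypothetical, and the real entry for the primary gateway has to be taken from the `ls` output.

```python
import subprocess


def blacklist_ls_command():
    """Build the CLI call that lists the current blacklist entries."""
    return ["ceph", "osd", "blacklist", "ls"]


def blacklist_rm_command(entity_addr):
    """Build the CLI call that removes one blacklist entry."""
    return ["ceph", "osd", "blacklist", "rm", entity_addr]


def remove_blacklist_entry(entity_addr, run=subprocess.run):
    """Remove one entry; `run` is injectable so the call can be dry-run."""
    return run(blacklist_rm_command(entity_addr), check=True)


# Dry run with a placeholder address (hypothetical, not from this report).
print(" ".join(blacklist_rm_command("192.0.2.10:0/1234")))
```

Passing a stub as `run` prints or records the command without touching the cluster, which is useful when checking which entry would be removed before doing it for real.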

Actual results:
The IO is sent to the failed-back path and hangs, with the logs repeating lines similar to those above.

Expected results:
The IO completes on the failed-back path successfully.

Additional info:

Comment 6 errata-xmlrpc 2019-03-07 15:50:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0475

