Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1941054

Summary: [OSP13] Can't boot vm since live migration - InvalidBDM: Unable to update attachment.(Invalid volume: duplicate connectors detected on volume)
Product: Red Hat OpenStack Reporter: ggrimaux
Component: openstack-nova Assignee: OSP DFG:Compute <osp-dfg-compute>
Status: CLOSED DUPLICATE QA Contact: OSP DFG:Compute <osp-dfg-compute>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 13.0 (Queens) CC: abishop, amoralej, dasmith, eglynn, jhakimra, kchamart, lyarwood, mabrams, sbauza, sgordon, vromanso
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-03-29 09:34:45 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description ggrimaux 2021-03-19 20:25:56 UTC
Description of problem:
OSP13 z10

The client performed a live migration, and afterwards the instance could not be started.
Original issue: NoFibreChannelVolumeDeviceFound: Unable to find a Fibre Channel volume device.

Instance: 84e56525-f6fb-4891-82cb-8206bdf30724 
Volumes: 4 volumes attached to this instance.

We found that the first volume (vda) had an empty attached_host.
I made a backup of the database and then changed the attached_host value directly in the database.
A hard reboot still failed with the same error message.
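The manual edit described above amounts to something like the following. This is a sketch only: the table and column names follow the Queens-era cinder schema, and the hostname and volume ID are hypothetical placeholders, not values taken from this case.

```sql
-- Sketch of the manual attached_host fix (back up the database first,
-- as was done here). Hostname and volume ID are hypothetical.
UPDATE cinder.volume_attachment
   SET attached_host = 'overcloud-compute-0.localdomain'
 WHERE volume_id = '<vda-volume-id>'
   AND deleted = 0;
```

As the next comments show, this alone was not enough, because the attachment row was also missing its connector and connection_info data.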

While looking at the cinder.volume_attachment row for that volume, I noticed that both the connector and connection_info columns were empty.
So I tried shelving and unshelving the instance.
Shelving worked, but now we can't unshelve it.

The unshelve attempt now fails with this error in nova-compute.log on the compute node:
2021-03-19 16:17:08.937 8 ERROR oslo_messaging.rpc.server InvalidBDM: Unable to update attachment.(Invalid volume: duplicate connectors detected on volume 5ab79eb6-0d13-47a3-9104-3ad55d5f1650). (HTTP
500) (Request-ID: req-6d5dfc41-8dc5-428f-a200-b87309877068)

I looked inside the database, and in nova.block_device_mapping I saw one entry for this volume (which is vdb).

The client told me he thinks the volume is used by another instance (multiattach was set to true) but can't confirm it either way.

I found the code where we get this error:
https://github.com/openstack/cinder/blob/master/cinder/volume/api.py#L2251

We need your help to fix this situation.

We have /var/log from all 3 controller nodes (sosreport fails to run on the controller nodes).
We have a sosreport from the compute node.

We have 2 mysqldumps: one taken before I touched anything, and one reflecting the latest situation.

I am available on IRC and will do a remote session with the client on Monday to work on this.

Thank you.

Version-Release number of selected component (if applicable):
OSP13 z10.

How reproducible:
N/A

Steps to Reproduce:
1. Try to unshelve instance

Actual results:
Can't boot/unshelve instance

Expected results:
Instance boot.

Additional info:
We have logs/sos and mysqldump.

Comment 1 Alan Bishop 2021-03-20 04:16:47 UTC
This looks an awful lot like bug #1851260 and the associated LP bug https://bugs.launchpad.net/nova/+bug/1780973.

The fix for bug #1851260 is in openstack-nova-17.0.12-1.el7ost. From the sosreport for *this* bz, I located the nova-compute container version, downloaded it, and verified it already contains the version that fixes that other bug. So I'm not sure what's going on.

Regardless, I think DFG:Compute should start the investigation, and move it back if it's actually a cinder or os-brick issue.

Comment 3 ggrimaux 2021-03-22 14:03:01 UTC
Issue is now resolved.
Instance is up and active.

To fix this, you need to delete the other active cinder.volume_attachment rows for each volume.

Keep only the reserved one, which should match the attachment_id from nova.block_device_mapping.
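The cleanup described above can be sketched roughly as follows. This is an illustration only, not the exact statements run in this case: the table and column names follow the Queens-era nova and cinder schemas, the instance and volume UUIDs are the ones quoted earlier in this bug, and the attachment ID placeholder must be filled in from step 1. Verify against your own deployment (and back up the database) before running anything like this.

```sql
-- 1) Find the attachment ID nova expects for the instance:
SELECT attachment_id, volume_id
  FROM nova.block_device_mapping
 WHERE instance_uuid = '84e56525-f6fb-4891-82cb-8206bdf30724'
   AND deleted = 0;

-- 2) List the attachments cinder holds for the affected volume;
--    duplicates here are what trigger "duplicate connectors detected":
SELECT id, attach_status, attached_host
  FROM cinder.volume_attachment
 WHERE volume_id = '5ab79eb6-0d13-47a3-9104-3ad55d5f1650'
   AND deleted = 0;

-- 3) Soft-delete every attachment except the reserved one whose id
--    matches the attachment_id from step 1:
UPDATE cinder.volume_attachment
   SET deleted = 1, deleted_at = NOW()
 WHERE volume_id = '5ab79eb6-0d13-47a3-9104-3ad55d5f1650'
   AND id <> '<attachment_id-from-step-1>'
   AND deleted = 0;
```

Repeat steps 2 and 3 for each of the instance's volumes that has duplicate attachment rows.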

Comment 4 Lee Yarwood 2021-03-29 09:34:45 UTC
As discussed, I'm closing this as a duplicate of bug #1874432, which improved our live-migration rollback handling of volume attachments. Note that the fix covers failures during the live migration itself, as in this case, not just during pre-live-migration as in bug #1874432.

*** This bug has been marked as a duplicate of bug 1874432 ***