Description of problem:
OSP13 z10
The client performed a live migration, and afterwards the instance could not be started.
Original issue: NoFibreChannelVolumeDeviceFound: Unable to find a Fibre Channel volume device.
Instance: 84e56525-f6fb-4891-82cb-8206bdf30724
Volumes: 4 volumes attached to this instance.
We found that the first volume (vda) had an empty attached_host.
I made a backup of the database, then changed the attached_host value directly in the database.
A hard reboot still failed with the same error message.
While looking at the cinder.volume_attachment row for that volume, I noticed that both the connector and connection_info columns were empty.
So I tried shelving and unshelving the instance.
Shelving worked, but now we can't unshelve it.
When we try to unshelve it, nova-compute.log on the compute node now shows:
2021-03-19 16:17:08.937 8 ERROR oslo_messaging.rpc.server InvalidBDM: Unable to update attachment.(Invalid volume: duplicate connectors detected on volume 5ab79eb6-0d13-47a3-9104-3ad55d5f1650). (HTTP 500) (Request-ID: req-6d5dfc41-8dc5-428f-a200-b87309877068)
I looked inside the database, and in nova.block_device_mapping I saw one entry for this volume (which is vdb).
The client told me he thinks the volume may be in use by another instance (multiattach was set to true) but can't confirm it either way.
I found the code where this error is raised:
https://github.com/openstack/cinder/blob/master/cinder/volume/api.py#L2251
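A minimal sketch of the check behind that error (function and field names here are simplified for illustration, not the actual cinder code; see the link above for the real implementation): when an attachment is updated, cinder compares the new connector against the connectors of the volume's other attachments and refuses the update if another attachment already recorded the same connector.

```python
# Hypothetical, simplified version of cinder's duplicate-connector check.
class InvalidVolume(Exception):
    pass


def check_duplicate_connectors(volume_id, attachments, attachment_id, connector):
    """Raise if another attachment on this volume already uses this connector."""
    for attachment in attachments:
        if attachment["id"] == attachment_id:
            continue  # skip the attachment being updated
        if attachment.get("connector") == connector:
            raise InvalidVolume(
                "duplicate connectors detected on volume %s" % volume_id)


# Example: two attachment rows left over from a failed live migration,
# both recorded with the same host connector, as in this bug.
attachments = [
    {"id": "att-1", "connector": {"host": "compute-0"}},
    {"id": "att-2", "connector": {"host": "compute-0"}},
]
try:
    check_duplicate_connectors("5ab79eb6-0d13-47a3-9104-3ad55d5f1650",
                               attachments, "att-1", {"host": "compute-0"})
except InvalidVolume as exc:
    print(exc)
```

This is consistent with the symptom: stale attachment rows from the failed live migration make every later attachment update trip the check.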
We need your help to fix this situation.
We have /var/log from all 3 controller nodes (sosreport has issues running on the controller nodes).
We have sosreport from compute node.
We have two mysqldumps: one taken before I touched anything and one reflecting the latest state.
I am available on IRC and will do a remote session with the client on Monday to work on this.
Thank you.
Version-Release number of selected component (if applicable):
OSP13 z10.
How reproducible:
N/A
Steps to Reproduce:
1. Try to unshelve instance
Actual results:
Can't boot/unshelve instance
Expected results:
Instance boots.
Additional info:
We have logs/sos and mysqldump.
This looks an awful lot like bug #1851260 and the associated LP bug https://bugs.launchpad.net/nova/+bug/1780973.
The fix for bug #1851260 is in openstack-nova-17.0.12-1.el7ost. From the sosreport for *this* bz, I located the nova-compute container version, downloaded it, and verified it already contains the fix for that other bug. So I'm not sure what's going on.
Regardless, I think DFG:Compute should start the investigation, and move it back if it's actually a cinder or os-brick issue.
Issue is now resolved.
Instance is up and active.
To fix this, you need to delete the extra active cinder.volume_attachment rows for each volume.
You keep only the reserved one, which should match the attachment_id in nova.block_device_mapping.
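The cleanup above can be sketched as follows, using an in-memory SQLite table that mimics cinder.volume_attachment (column names simplified; on a real deployment this would be a DELETE against the cinder MySQL database, and only after taking a backup): for each volume, drop every attachment row except the reserved one whose id matches the attachment_id recorded in nova.block_device_mapping.

```python
import sqlite3

# In-memory stand-in for the cinder.volume_attachment table (simplified schema).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE volume_attachment (
    id TEXT PRIMARY KEY, volume_id TEXT, attach_status TEXT)""")
conn.executemany(
    "INSERT INTO volume_attachment VALUES (?, ?, ?)",
    [("att-old", "vol-1", "attached"),   # stale row from the failed migration
     ("att-new", "vol-1", "reserved")])  # matches nova.block_device_mapping

# attachment_id as recorded in nova.block_device_mapping for this volume.
bdm_attachment_id = "att-new"
conn.execute(
    "DELETE FROM volume_attachment WHERE volume_id = ? AND id != ?",
    ("vol-1", bdm_attachment_id))
conn.commit()

rows = conn.execute(
    "SELECT id, attach_status FROM volume_attachment").fetchall()
print(rows)  # only the reserved attachment remains
```

With only the reserved attachment left per volume, the duplicate-connector check no longer fires and the unshelve can complete.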
As discussed, I'm closing this as a duplicate of bug #1874432, which improved our live-migration rollback handling of volume attachments. Note that the fix covers failures during the actual live migration, as in this case, not just pre-live-migration as in the case of bug #1874432.
*** This bug has been marked as a duplicate of bug 1874432 ***