Description of problem:

Stack update fails when ansible tries to gather facts, more specifically ansible_devices, from compute10. It only fails on this node, even though it is configured exactly like the rest of the ~20 compute nodes.

What is the business impact? Please also provide timeframe information.

This is not critical, but it is blocking the validation process needed to hand the cloud over to the customer and close the project.

Customer confirmed that it is hitting the same error as the following KCS, but can't reboot the node as the workaround suggests:
https://access.redhat.com/solutions/6996321

Version-Release number of selected component (if applicable):

Red Hat OpenStack Platform release 16.2.3 (Train)

How reproducible:

Steps to Reproduce:
1. Run stack update.

Actual results:

Error

Expected results:

No error from the ansible update

Additional info:

- Templates, scripts and SOS reports are available on the case.
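For reference, the hang can usually be reproduced outside of the full stack update by gathering only the device facts against the affected node. This is just a sketch; the inventory path (~/tripleo-ansible-inventory.yaml) and the host name (compute10) are assumptions and may differ in this deployment:

ansible -i ~/tripleo-ansible-inventory.yaml compute10 -m setup -a 'filter=ansible_devices'

If a block device on the node is hung, this command hangs (or times out) in the same way the stack update does, while the other computes return their devices facts immediately.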
This seems to be caused by an underlying storage problem. Is there some hung network storage attached to the node? The Ansible failure looks like a symptom of that underlying issue, and the reboot would just remove the broken mount. If you want to fix it without rebooting, you need to identify which storage device is having issues and try to fix that. Note that the KCS also says the same thing: "the issue can be worked around by either finding the dead iscsi path and deleting it OR rebooting the compute."
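A quick way to look for the misbehaving device on the node itself, assuming the problem is a hung SCSI (iSCSI/FC) block device rather than, say, an NFS mount. These are generic sysfs/kernel-log checks, not taken from the sosreport on this case:

# per-device SCSI state; anything other than "running" (e.g. "offline", "blocked") is suspect
grep -H . /sys/block/sd*/device/state

# recent kernel messages about failing paths or aborted commands
dmesg | grep -iE 'sd[a-z]+|rport|i/o error|abort' | tail -50

Note that a dead path does not always show up in the device state, so the multipathd checks described in the next comment are still the more reliable way to find it.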
Since we are using multipathed connections, we can find the iSCSI/FC paths that are down using the multipath daemon, because it is already monitoring the devices:

sudo multipathd list paths | grep -v ready

We may also have paths that are not part of any multipath device; we can see which ones those are with:

sudo multipathd list paths | grep orphan

Then we can issue a SCSI command to confirm whether they are responsive; for example, for the sdXYZ device it would be:

sudo /lib/udev/scsi_id --page 0x83 --whitelisted /dev/sdXYZ

By now we should have a list of devices that are not responding, and we can try to fix the underlying network issues.

If we cannot fix the underlying network issues and we want to remove the devices, we should first determine whether an instance is using them. We list the instances on the host with:

sudo virsh list

Then we see the devices each instance is using with:

sudo virsh domblklist <instance-name>

If a failed device is being used by an instance, then we should probably delete the Nova instance to remove the device. If a failed device is not being used directly by an instance (for example because the instance is using the multipathed device), we can remove the device itself.

Be very, very careful when removing devices: removal does not check for holders, so even if a Nova instance is using the device it will still be removed.

How to remove the devices depends on the storage transport protocol in use, how many devices are currently connected, etc. I believe this is an FC storage array, so for the sdXYZ failed device we would call:

echo 1 | sudo tee /sys/block/sdXYZ/device/delete
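A suggested pre-check before issuing the delete, so we do not remove a path that something still holds (sdXYZ is the same placeholder as above):

# empty output means no device-mapper/multipath holder is stacked on this path
ls /sys/block/sdXYZ/holders/

# double-check whether the path still appears in any multipath map
sudo multipath -ll | grep -B 4 sdXYZ

If the path is still part of a multipath map, it is safer to remove it from the map first (multipathd del path sdXYZ) and only then issue the sysfs delete shown above. After the cleanup, re-running "sudo multipathd list paths | grep -v ready" should confirm that no dead paths remain.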