Bug 2209629 - [OSP16.2] Deployment is failing on one compute with "The error was: 'dict object' has no attribute 'devices'"
Summary: [OSP16.2] Deployment is failing on one compute with "The error was: 'dict object' has no attribute 'devices'"
Keywords:
Status: NEW
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: tripleo-ansible
Version: 16.2 (Train)
Hardware: Unspecified
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Alan Bishop
QA Contact: Joe H. Rahme
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-05-24 09:51 UTC by Ricardo Ramos Thomas
Modified: 2023-07-14 09:59 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-25344 0 None None None 2023-05-24 09:52:10 UTC

Description Ricardo Ramos Thomas 2023-05-24 09:51:30 UTC
Description of problem:

Stack update fails when Ansible tries to gather facts, specifically ansible_devices, from compute10. It fails only on this node, even though the node is configured exactly like the other ~20 compute nodes.
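
To reproduce the failure in isolation, hardware fact gathering (which produces ansible_facts['devices']) can be re-run against just the affected node. A minimal sketch, assuming the node is reachable as "compute10" in the deployment's Ansible inventory; the inventory path below is an assumption, adjust it to the actual tripleo-ansible inventory file:

  # Gather only the hardware fact subset from compute10; if this hangs or fails,
  # the problem is on the node itself, not in the stack update playbooks
  ansible -i ~/tripleo-ansible-inventory.yaml compute10 \
      -m setup -a 'gather_subset=hardware'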

What is the business impact? Please also provide timeframe information.
This is not critical, but it is blocking the validation process needed to hand the cloud over to the customer and close the project.


The customer confirms they are hitting the same error as the following KCS, but they cannot reboot the node as the workaround suggests.


https://access.redhat.com/solutions/6996321



Version-Release number of selected component (if applicable):

Red Hat OpenStack Platform release 16.2.3 (Train)

How reproducible:


Steps to Reproduce:
1. Run a stack update.

Actual results:

The stack update fails on compute10 during Ansible fact gathering with: "The error was: 'dict object' has no attribute 'devices'"

Expected results:

No errors from the Ansible stack update.

Additional info:

- Templates, scripts and SOS reports are available on the case.

Comment 3 Brendan Shephard 2023-05-25 21:59:53 UTC
This seems to be caused by an underlying storage problem. Is there some hung network storage attached to the node? The Ansible failure is a symptom of that underlying issue; the reboot would just remove the broken mount. To fix it without rebooting, you would need to identify which storage device is having issues and repair it. Note that the KCS also says the same thing: "the issue can be worked around by either finding the dead iscsi path and deleting it OR rebooting the compute."
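
A few read-only checks can help confirm a hung storage device without rebooting. This is only a sketch of the kind of triage meant above, assuming standard RHEL tooling on the compute node:

  # Kernel messages about blocked tasks or I/O errors usually point at the hung device
  dmesg -T | grep -iE 'blocked for more than|I/O error|rejecting I/O' | tail -n 20
  # Processes stuck in uninterruptible sleep (D state) are typically waiting on dead storage
  ps -eo pid,stat,wchan:30,cmd | awk '$2 ~ /^D/'
  # Multipath path members reported as failed or faulty, if multipath is in use
  sudo multipath -ll | grep -iE 'failed|faulty'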

Comment 5 Gorka Eguileor 2023-06-06 14:34:12 UTC
Since we are using multipathed connections, we can find the iSCSI/FC paths that are down via the multipath daemon, which is already monitoring the devices:

  sudo multipathd list paths | grep -v ready
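
For scripting the later checks, the non-ready paths can be collected into a variable. A small sketch, assuming the device name is the second column of the "multipathd list paths" output (the column layout can vary between versions, so verify against the header line):

  # Skip the header line and keep the device name of every path not reported as "ready"
  failed_paths=$(sudo multipathd list paths | awk 'NR>1 && !/ready/ {print $2}')
  echo "Suspect paths: ${failed_paths}"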

There may also be paths that are not part of any multipath device; we can see which ones those are with:

  sudo multipathd list paths | grep orphan

We can then issue a SCSI command to confirm that each suspect path is responsive; for example, for the sdXYZ device it would be:

  sudo /lib/udev/scsi_id --page 0x83 --whitelisted /dev/sdXYZ
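
When probing several suspect devices, wrapping the call in a timeout avoids having the shell block on a truly dead path. A sketch reusing the hypothetical failed_paths list gathered above:

  # scsi_id against a dead path can hang for a long time; bound it with a timeout
  for dev in ${failed_paths}; do
      echo "== /dev/${dev}"
      sudo timeout 10 /lib/udev/scsi_id --page 0x83 --whitelisted "/dev/${dev}" \
          || echo "/dev/${dev} did not respond"
  done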

By now we should have a list of devices that are not responding, and we can try to fix the underlying network issues.
If we cannot fix the underlying network issues and want to remove the devices instead, we should first determine whether an instance is using them.
We list the instances on the host with:

  sudo virsh list

Then we see the devices each instance is using with:

  sudo virsh domblklist <instance-name>

If a failed device is being used directly by an instance, then we should probably delete the Nova instance to remove the device.
If a failed device is not being used directly by an instance (for example, because the instance is using the multipathed device on top of it), we can remove the device itself.
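
To tell the two cases apart, sysfs shows which multipath map (if any) a failed path belongs to, and that map can then be compared against the instances' disk lists. A sketch, with sdXYZ standing in for the failed device as above:

  # Which device-mapper device (multipath map) holds this path, if any
  ls /sys/block/sdXYZ/holders/
  cat /sys/block/sdXYZ/holders/dm-*/dm/name 2>/dev/null
  # Then check whether that dm/mpath device shows up as a disk source of any instance
  for vm in $(sudo virsh list --name); do
      echo "== ${vm}"
      sudo virsh domblklist "${vm}"
  done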

Be very, very careful when removing devices: removal does not check for holders, so the device will be removed even if a Nova instance is still using it.
How to remove the devices depends on the storage transport protocol in use, how many devices are currently connected, and so on.
I believe this is an FC storage array, so for the failed device sdXYZ we would run:

  echo 1 | sudo tee /sys/block/sdXYZ/device/delete
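
One possible ordering, not confirmed for this environment: telling multipathd to drop the failed path first keeps it from re-probing the device while it is being deleted:

  # Remove the failed path from its multipath map, then delete the SCSI device
  sudo multipathd del path /dev/sdXYZ
  echo 1 | sudo tee /sys/block/sdXYZ/device/delete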

