Description of problem:

Stack update fails when ansible tries to gather facts, more specifically ansible_devices, from compute10. It only fails on this node, even though it is configured exactly like the rest of the ~20 compute nodes.

What is the business impact? Please also provide timeframe information.

This is not critical, but it is blocking the validation process needed to hand the cloud over to the customer and close the project.

Customer confirmed that it is hitting the same error as the following KCS, but can't reboot the node as the workaround suggests:
https://access.redhat.com/solutions/6996321

Version-Release number of selected component (if applicable):

Red Hat OpenStack Platform release 16.2.3 (Train)

How reproducible:

Steps to Reproduce:
1. Run stack update.

Actual results:

Error

Expected results:

No error from the ansible update

Additional info:

- Templates, scripts and SOS reports are available on the case.
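For reference, the hang can usually be reproduced outside of the full stack update by gathering only the device facts against the affected node. This is just a sketch; the inventory path (~/tripleo-ansible-inventory.yaml) and the host name (compute10) are assumptions and may differ in this deployment:

ansible -i ~/tripleo-ansible-inventory.yaml compute10 -m setup -a 'filter=ansible_devices'

If a block device on the node is hung, this command hangs (or times out) in the same way the stack update does, while the other computes return their devices facts immediately.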
This seems to be caused by an underlying storage problem. Is there some hung network storage attached to the node? The Ansible failure looks like a symptom of that underlying issue, and the reboot would just remove the broken mount. If you want to fix it without rebooting, you need to identify which storage device is having issues and try to fix that. Note that the KCS also says the same thing: "the issue can be worked around by either finding the dead iscsi path and deleting it OR rebooting the compute."
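A quick way to look for the misbehaving device on the node itself, assuming the problem is a hung SCSI (iSCSI/FC) block device rather than, say, an NFS mount. These are generic sysfs/kernel-log checks, not taken from the sosreport on this case:

# per-device SCSI state; anything other than "running" (e.g. "offline", "blocked") is suspect
grep -H . /sys/block/sd*/device/state

# recent kernel messages about failing paths or aborted commands
dmesg | grep -iE 'sd[a-z]+|rport|i/o error|abort' | tail -50

Note that a dead path does not always show up in the device state, so the multipathd checks described in the next comment are still the more reliable way to find it.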
Since we are using multipathed connections, we can find the iSCSI/FC paths that are down using the multipath daemon, because it is already monitoring the devices:

sudo multipathd list paths | grep -v ready

We may also have paths that are not part of any multipath device; we can see which ones those are with:

sudo multipathd list paths | grep orphan

Then we can issue a SCSI command to confirm whether they are responsive; for example, for the sdXYZ device it would be:

sudo /lib/udev/scsi_id --page 0x83 --whitelisted /dev/sdXYZ

By now we should have a list of devices that are not responding, and we can try to fix the underlying network issues.

If we cannot fix the underlying network issues and we want to remove the devices, we should first determine whether an instance is using them. We list the instances on the host with:

sudo virsh list

Then we see the devices each instance is using with:

sudo virsh domblklist <instance-name>

If a failed device is being used by an instance, then we should probably delete the Nova instance to remove the device. If a failed device is not being used directly by an instance (for example because the instance is using the multipathed device), we can remove the device itself.

Be very, very careful when removing devices: removal does not check for holders, so even if a Nova instance is using the device it will still be removed.

How to remove the devices depends on the storage transport protocol in use, how many devices are currently connected, etc. I believe this is an FC storage array, so for the sdXYZ failed device we would call:

echo 1 | sudo tee /sys/block/sdXYZ/device/delete
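A suggested pre-check before issuing the delete, so we do not remove a path that something still holds (sdXYZ is the same placeholder as above):

# empty output means no device-mapper/multipath holder is stacked on this path
ls /sys/block/sdXYZ/holders/

# double-check whether the path still appears in any multipath map
sudo multipath -ll | grep -B 4 sdXYZ

If the path is still part of a multipath map, it is safer to remove it from the map first (multipathd del path sdXYZ) and only then issue the sysfs delete shown above. After the cleanup, re-running "sudo multipathd list paths | grep -v ready" should confirm that no dead paths remain.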