Bug 2166300

Summary: Ansible setup is easily broken when iSCSI devices die in weird ways
Product: Red Hat OpenStack
Reporter: David Hill <dhill>
Component: ansible
Assignee: OSP Team <rhos-maint>
Status: NEW ---
QA Contact: Nobody <nobody>
Severity: high
Docs Contact:
Priority: unspecified
Version: 16.2 (Train)
CC: cgussobo, eharney, jjoyce, jschluet, slinaber, tvignaud
Target Milestone: ---
Flags: dhill: needinfo-
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description David Hill 2023-02-01 13:14:51 UTC
Description of problem:
Ansible setup breaks easily when iSCSI devices die in unusual ways, and I suspect we hit the bug described in [1]. I'm fairly sure rebooting the affected compute node would have resolved the issue, but after lengthy troubleshooting (hours rather than minutes) we found a dead iSCSI device with broken symlinks in /sys/ ... Nothing in the sosreport says much about it, except perhaps this:

logs/journalctl_--no-pager:Oct 15 15:54:55 comp systemd-udevd[103774]: inotify_add_watch(7, /dev/sdde2, 10) failed: No such file or directory
logs/journalctl_--no-pager:Oct 15 15:54:55 comp multipathd[1040]: sdde [70:192]: path added to devmap 360002ac0000000000000058d0002447c
logs/journalctl_--no-pager:Oct 15 15:54:55 comp systemd-udevd[103772]: inotify_add_watch(7, /dev/sdde1, 10) failed: No such file or directory
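
For anyone hitting the same thing, here is a rough sketch of how broken symlinks like the ones mentioned above could be located without reading through /sys/ by hand. The function name and the choice of directory to scan are my own assumptions; the report does not say where exactly under /sys/ the broken links were found.

```python
import os


def find_dangling_symlinks(root):
    """Return the sorted paths of symlinks under root whose targets are missing."""
    dangling = []
    # followlinks=False: do not descend through symlinks, since sysfs is full
    # of links that point back up the tree.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself; exists() follows it, so a
            # link that exists but resolves to nothing is dangling.
            if os.path.islink(path) and not os.path.exists(path):
                dangling.append(path)
    return sorted(dangling)
```

Calling `find_dangling_symlinks("/sys/block")` on the affected compute node would be one place to start (again, the directory is a guess). An empty list only means nothing dangling was found under that particular root.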



[1] https://github.com/ansible/ansible/issues/77037

Version-Release number of selected component (if applicable):


How reproducible:
This time 


Steps to Reproduce:
1. Unknown; perhaps tear down an iSCSI session the wrong way

Actual results:
The overcloud deployment fails on a compute node with:
(undercloud) [stack@director scripts]$ openstack overcloud failures
|-> Failures for host: overcloud-compute-10
|--> Task: set allowed_devices
|---> _ansible_no_log: false
|---> msg: "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'devices'\n\nThe error appears to be in '/usr/share/ansible/roles/tripleo_lvmfilter/tasks/main.yml': line 39, column 7, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n      lvm2_physical_devices_facts:\n    - name: set allowed_devices\n      ^ here\n"
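
The message above says a dict had no `devices` attribute when the task was templated. A quick way to check whether the device facts are missing on a host is sketched below. It assumes the dict in question is the gathered facts, which the error does not confirm, and that the facts were dumped to a dict first. The function name is made up; both the `devices` and `ansible_devices` spellings are accepted because the key carries the `ansible_` prefix in raw setup output.

```python
def has_device_facts(facts):
    """Return True if a gathered-facts dict carries a device inventory.

    `facts` is assumed to be the contents of the "ansible_facts" mapping,
    e.g. loaded from a JSON dump of the setup module's output.
    """
    return "devices" in facts or "ansible_devices" in facts


# Hypothetical facts from a healthy host and from a host where device
# discovery produced nothing.
healthy = {"ansible_devices": {"sda": {}}, "ansible_hostname": "comp"}
broken = {"ansible_hostname": "comp"}

print(has_device_facts(healthy))
print(has_device_facts(broken))
```

If the check returns False for the failing compute node, that would be consistent with fact gathering having given up on the device inventory, though it does not prove the dead iSCSI device is the cause.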


Expected results:
No failures

Additional info:

Comment 3 Conrado Gusso Bozza 2023-03-02 16:08:57 UTC
The failure occurs when running an overcloud deploy to apply some network changes.

The network changes are needed to fix another issue with interface names caused by an FFU from OSP13 to OSP16.