Description of problem:

"vdsm-tool unconfigure" is executed from the playbook "ovirt-host-remove.yml" when a host is removed from the Manager. This takes care of removing the host's entries from the OVN databases. However, if the host is down/inaccessible when it is removed, the playbook fails to execute, leaving a stale chassis entry in the southbound database. If we then reinstall the OS and add the host back to the same environment, duplicate entries are created in the southbound database.

===
~]# ovn-sbctl show
Chassis "ed1f1757-d14e-4e29-b661-989ad5f2f88c"
    hostname: "dhcp0-5.ansirhv.redhat.com"
    Encap geneve
        ip: "192.168.0.8"
        options: {csum="true"}
    Port_Binding "3fa19ad2-51c3-4a02-85f3-09842d9faeba"
Chassis "efd7ba6f-3c47-4e6d-abeb-3781fc21a668"   ======> Duplicate entry which was not removed.
    hostname: "dhcp0-2.ansirhv.redhat.com"
    Encap geneve
        ip: "192.168.0.9"
        options: {csum="true"}
Chassis "7e56082d-f847-4dda-b14b-b4f1f7ce3c65"
    hostname: "dhcp0-2.ansirhv.redhat.com"
    Encap geneve
        ip: "192.168.0.9"
        options: {csum="true"}
    Port_Binding "888e67e8-7d58-4afb-8b0f-cd233a3cdcc0"
===

There will be duplicate Geneve tunnels with remote IP 192.168.0.9 on every other host in this cluster.

===
dhcp0-5 ~]# ovs-vsctl show
26366dc3-8297-4346-9e16-16a943eafa0a
    Bridge br-int
        fail_mode: secure
        Port "vnet0"
            Interface "vnet0"
        Port br-int
            Interface br-int
                type: internal
        Port "ovn-efd7ba-0"
            Interface "ovn-efd7ba-0"
                type: geneve
                options: {csum="true", key=flow, remote_ip="192.168.0.9"}
        Port "ovn-7e5608-0"
            Interface "ovn-7e5608-0"
                type: geneve
                options: {csum="true", key=flow, remote_ip="192.168.0.9"}
                error: "could not add network device ovn-7e5608-0 to ofproto (File exists)"
    ovs_version: "2.11.0"
===

Because of this, table 32 will not have a rule for forwarding packets to the Geneve tunnel of the host with the duplicate entries (192.168.0.9).
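Stale chassis rows can be spotted by listing (name, hostname) pairs from the southbound DB and flagging hostnames that appear more than once. This is a minimal sketch assuming a POSIX shell; the `find_dup_chassis` helper name is illustrative, and the commented ovn-sbctl invocation is one hypothetical way to produce its input:

```shell
# find_dup_chassis: read "name,hostname" CSV lines on stdin and print any
# chassis whose hostname appears more than once -- a likely stale entry.
find_dup_chassis() {
  awk -F, 'seen[$2]++ { print "duplicate hostname:", $2, "chassis:", $1 }'
}

# Hypothetical invocation on the host holding the OVN southbound DB:
#   ovn-sbctl --format=csv --no-headings --columns=name,hostname list Chassis \
#     | find_dup_chassis
```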
====
[root@dhcp0-5 ~]# ovs-ofctl dump-flows br-int | grep table=32
 cookie=0x0, duration=16589.727s, table=32, n_packets=0, n_bytes=0, idle_age=16589, priority=150,reg10=0x10/0x10 actions=resubmit(,33)
 cookie=0x0, duration=16589.727s, table=32, n_packets=0, n_bytes=0, idle_age=16589, priority=150,reg10=0x2/0x2 actions=resubmit(,33)
 cookie=0x0, duration=16589.727s, table=32, n_packets=17, n_bytes=1044, idle_age=2451, priority=0 actions=resubmit(,33)
====

So the VMs on the host with the duplicate entries (192.168.0.9) will not be able to communicate with any other host in the cluster, because the OpenFlow rule to forward packets to the 192.168.0.9 host does not exist on any host in the cluster.

Version-Release number of selected component (if applicable):
ovirt-provider-ovn-1.2.20-1.el7ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. Shut down the host before removing it from the Manager so that the playbook fails.
2. Reinstall this host and add it back to the same RHV-M.
3. Check the output of "ovs-vsctl show" and "ovn-sbctl show" to see duplicate entries.
4. Run a VM on this host and check connectivity with VMs running on other hosts in the cluster. It should fail.

Actual results:
Duplicate chassis entries in the southbound database if the host is down while it is removed from the Manager, breaking network connectivity between the VMs.

Expected results:
There can be cases, such as a corrupted, non-bootable, or unrecoverable host OS, where the user has to forcefully remove the host from the Manager. In these cases, the user will reinstall the host and add it back to the Manager. This results in duplicate entries in the database, which break connectivity between the VMs. I think we should remove the chassis entries from the OVN database even if the host is not accessible while removing it from the portal.

Additional info:
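The duplicate-tunnel symptom above can be checked per host with a small filter over (port, remote_ip) pairs. The `dup_tunnel_ips` helper is a sketch assuming a POSIX shell; the commented ovs-vsctl line is one hypothetical way to produce its input and its field extraction may need adjusting:

```shell
# dup_tunnel_ips: read "port remote_ip" pairs on stdin and print every remote
# IP that has more than one geneve tunnel port -- the symptom shown above,
# where ovn-efd7ba-0 and ovn-7e5608-0 both point at 192.168.0.9.
dup_tunnel_ips() {
  awk '{ n[$2]++; p[$2] = p[$2] " " $1 }
       END { for (ip in n) if (n[ip] > 1) print ip ":" p[ip] }'
}

# Hypothetical way to list geneve interfaces on a cluster host (the options
# column still needs the remote_ip value extracted):
#   ovs-vsctl --bare --columns=name,options find Interface type=geneve
```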
Idea: generate a warning containing the steps for manual removal if the host is not reachable.
Dominik, Martin, this looks like the ovirt-engine Ansible-based host removal code. Can you please have a look? Not related at all to ovirt-host-deploy, right?
This bug is about propagating a problem in packaging/playbooks/roles/ovirt-provider-ovn-driver/tasks/unconfigure.yml back to the user.
Michael, do you think it is required to provide a script like https://gerrit.ovirt.org/#/c/106175/ to remove the host by hostname from the OVN sbdb, or do you think a hint like "Removing the host from OVN db might have failed. Please use 'ovn-sbctl chassis-del' to remove the chassis from OVN db manually" would be enough?
(In reply to Dominik Holler from comment #4)
> Michael, do you think it is required to provide a script like
> https://gerrit.ovirt.org/#/c/106175/ to remove the host by hostname from the
> OVN sbdb, or do you think a hint like "Removing the host from OVN db might
> have failed. Please use 'ovn-sbctl chassis-del' to remove the chassis from
> OVN db manually" would be enough?

A hint like "Removing the host from OVN db might have failed. Please use 'ovn-sbctl chassis-del' to remove the chassis from OVN db manually" sounds very good to me.
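For the manual cleanup step the hint refers to, 'ovn-sbctl chassis-del' is the real command; the wrapper function and DRY_RUN guard below are illustrative additions so the sketch can be exercised without a live OVN southbound DB:

```shell
# remove_stale_chassis CHASSIS: delete a stale chassis row from the OVN
# southbound DB via 'ovn-sbctl chassis-del' (the command suggested in the
# hint above). With DRY_RUN=1 it only prints the command it would run.
remove_stale_chassis() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "would run: ovn-sbctl chassis-del $1"
  else
    ovn-sbctl chassis-del "$1"
  fi
}

# Example, using the duplicate chassis name from this report:
#   remove_stale_chassis efd7ba6f-3c47-4e6d-abeb-3781fc21a668
```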
I believe this should be ON_QA
(In reply to Michael Burman from comment #6)
> I believe this should be ON_QA

Yes, it was merged before tagging last week's build.
Verified on - rhvm-4.4.0-0.31.master.el8ev.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:3247