Bug 1758289

Summary: [Warn] Duplicate chassis entries in southbound database if the host is down while removing the host from Manager
Product: Red Hat Enterprise Virtualization Manager Reporter: nijin ashok <nashok>
Component: ovirt-engine Assignee: eraviv
Status: CLOSED ERRATA QA Contact: Michael Burman <mburman>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.3.4 CC: dholler, dougsland, eraviv, lsurette, mburman, mperina, pelauter, rdlugyhe, srevivo
Target Milestone: ovirt-4.4.1   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: ovirt-engine-4.4.0_beta2 Doc Type: Enhancement
Doc Text:
If a host is down or unreachable while you remove it from the RHV Manager, stale chassis entries for that host can remain in the OVN southbound database. If you later add the host back to the RHV Manager, these stale entries can cause networking issues, such as duplicate Geneve tunnels. With this enhancement, when host removal fails to clean up the OVN database, the RHV Manager prints a message to the Events tab and the log. The message notifies users of the issue and explains how to remove the stale entries manually before adding the host back to RHV Manager.
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-08-04 13:20:47 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Network RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description nijin ashok 2019-10-03 18:48:05 UTC
Description of problem:

The playbook "ovirt-host-remove.yml" runs "vdsm-tool unconfigure" when a host is removed from the Manager, which takes care of removing the host's entries from the OVN databases. However, if the host is down or inaccessible when it is removed, the playbook fails to execute, leaving a stale chassis entry in the southbound database.

If we then reinstall the OS on the host and add it back to the same environment, duplicate entries are created in the southbound database.

===
~]# ovn-sbctl show
Chassis "ed1f1757-d14e-4e29-b661-989ad5f2f88c"
    hostname: "dhcp0-5.ansirhv.redhat.com"
    Encap geneve
        ip: "192.168.0.8"
        options: {csum="true"}
    Port_Binding "3fa19ad2-51c3-4a02-85f3-09842d9faeba"    
Chassis "efd7ba6f-3c47-4e6d-abeb-3781fc21a668"               ======> Duplicate entry which was not removed.
    hostname: "dhcp0-2.ansirhv.redhat.com"
    Encap geneve
        ip: "192.168.0.9"
        options: {csum="true"}
Chassis "7e56082d-f847-4dda-b14b-b4f1f7ce3c65"
    hostname: "dhcp0-2.ansirhv.redhat.com"
    Encap geneve
        ip: "192.168.0.9"
        options: {csum="true"}
    Port_Binding "888e67e8-7d58-4afb-8b0f-cd233a3cdcc0"
===

Every other host in this cluster ends up with duplicate Geneve tunnels pointing to remote IP 192.168.0.9.

===
dhcp0-5 ~]# ovs-vsctl show
26366dc3-8297-4346-9e16-16a943eafa0a
    Bridge br-int
        fail_mode: secure
        Port "vnet0"
            Interface "vnet0"
        Port br-int
            Interface br-int
                type: internal
        Port "ovn-efd7ba-0"
            Interface "ovn-efd7ba-0"
                type: geneve
                options: {csum="true", key=flow, remote_ip="192.168.0.9"}
        Port "ovn-7e5608-0"
            Interface "ovn-7e5608-0"
                type: geneve
                options: {csum="true", key=flow, remote_ip="192.168.0.9"}
                error: "could not add network device ovn-7e5608-0 to ofproto (File exists)"
    ovs_version: "2.11.0"
====
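A quick way to spot such duplicate tunnels on any host is to look for remote_ip values that occur more than once in the bridge dump. This is a sketch that assumes the "ovs-vsctl show" output format shown above:

```shell
# Extract every remote_ip="..." option from the bridge dump and print
# only the values that appear more than once, i.e. duplicate tunnels
# to the same remote chassis IP:
ovs-vsctl show | grep -o 'remote_ip="[^"]*"' | sort | uniq -d
```

On the host above this prints remote_ip="192.168.0.9" once, flagging the duplicated tunnel.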

Because of this, table 32 has no rule for forwarding packets to the Geneve tunnel of the host with the duplicate entries (192.168.0.9).

====
[root@dhcp0-5 ~]#ovs-ofctl dump-flows br-int|grep table=32
 cookie=0x0, duration=16589.727s, table=32, n_packets=0, n_bytes=0, idle_age=16589, priority=150,reg10=0x10/0x10 actions=resubmit(,33)
 cookie=0x0, duration=16589.727s, table=32, n_packets=0, n_bytes=0, idle_age=16589, priority=150,reg10=0x2/0x2 actions=resubmit(,33)
 cookie=0x0, duration=16589.727s, table=32, n_packets=17, n_bytes=1044, idle_age=2451, priority=0 actions=resubmit(,33)
====

As a result, VMs on the host with duplicate entries (192.168.0.9) cannot communicate with VMs on any other host in the cluster, because no host in the cluster has an OpenFlow rule to forward packets to 192.168.0.9.


Version-Release number of selected component (if applicable):

ovirt-provider-ovn-1.2.20-1.el7ev.noarch


How reproducible:

100%

Steps to Reproduce:

1. Shut down the host before removing it from the Manager, so that the playbook fails.

2. Reinstall this host and add it back to the same RHV-M.

3. Check the output of "ovs-vsctl show" and "ovn-sbctl show" to see the duplicate entries.

4. Run a VM on this host and check its connectivity to VMs running on other hosts in the cluster. It fails.


Actual results:

If the host is down while it is removed from the Manager, duplicate chassis entries remain in the southbound database, breaking network connectivity between the VMs.

Expected results:

There can be cases, such as a corrupted, unbootable, or otherwise unrecoverable host OS, where the user has to forcefully remove the host from the Manager. In these cases the user will reinstall the host and add it back to the Manager, which results in duplicate entries in the database and breaks connectivity between the VMs. I think we should remove the chassis entries from the OVN database even if the host is not accessible while removing it from the portal.

Additional info:

Comment 1 Dominik Holler 2019-10-15 13:07:46 UTC
Idea: generate a warning containing the steps for manual removal if the host is not reachable.

Comment 2 Sandro Bonazzola 2019-12-09 14:24:47 UTC
Dominik, Martin, this looks like the ovirt-engine Ansible-based host removal code. Can you please have a look?
Not related at all to ovirt-host-deploy, right?

Comment 3 Dominik Holler 2019-12-09 14:28:40 UTC
This bug is about propagating a problem in
packaging/playbooks/roles/ovirt-provider-ovn-driver/tasks/unconfigure.yml
back to the user.

Comment 4 Dominik Holler 2020-01-08 10:09:00 UTC
Michael, do you think it is required to provide a script like https://gerrit.ovirt.org/#/c/106175/ to remove the host by hostname from the OVN southbound DB, or do you think that a hint like "Removing the host from OVN db might have failed. Please use 'ovn-sbctl chassis-del' to remove the chassis from OVN db manually" would be enough?

Comment 5 Michael Burman 2020-01-08 11:41:10 UTC
(In reply to Dominik Holler from comment #4)
> Michael, do you think it is rewired to provide a script like
> https://gerrit.ovirt.org/#/c/106175/ to remove the host by hostname from ovn
> sbdb, or do you think that a hint like "Removing the host from OVN db might
> have failed. Please use 'ovn-sbctl chassis-del' to remove the chassis from
> OVN db manually" ?

A hint like "Removing the host from OVN db might have failed. Please use 'ovn-sbctl chassis-del' to remove the chassis from OVN db manually" sounds very good to me.
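For reference, the manual cleanup the hint points at could look like the sketch below, run against the OVN southbound database on the central host. The chassis UUID is the stale one from the report above; substitute your own, and note the list/uniq pipeline assumes "ovn-sbctl list" supports the usual --bare/--columns options.

```shell
# List hostnames that appear on more than one chassis (stale duplicates):
ovn-sbctl --bare --columns=hostname list Chassis | sort | uniq -d

# Delete the stale chassis by its UUID (or chassis name), taken from
# 'ovn-sbctl show' output; efd7ba6f-... is the stale entry in this report:
ovn-sbctl chassis-del efd7ba6f-3c47-4e6d-abeb-3781fc21a668
```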

Comment 6 Michael Burman 2020-04-05 13:57:49 UTC
I believe this should be ON_QA

Comment 7 Dominik Holler 2020-04-06 06:37:32 UTC
(In reply to Michael Burman from comment #6)
> I believe this should be ON_QA

Yes, it was merged before tagging last week's build.

Comment 8 Michael Burman 2020-04-06 10:16:27 UTC
Verified on - rhvm-4.4.0-0.31.master.el8ev.noarch

Comment 13 errata-xmlrpc 2020-08-04 13:20:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3247