Bug 1846338 - Host monitoring does not report bond mode 1 active slave after engine is alive some time
Summary: Host monitoring does not report bond mode 1 active slave after engine is alive some time
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm-jsonrpc-java
Classification: oVirt
Component: Core
Version: 1.5.5
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.4.4
Target Release: 1.5.7
Assignee: Artur Socha
QA Contact: Michael Burman
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-06-11 12:18 UTC by msheena
Modified: 2020-12-21 12:36 UTC (History)
6 users

Fixed In Version: vdsm-jsonrpc-java-1.5.7, ovirt-engine-4.4.4.2
Clone Of:
Environment:
Last Closed: 2020-12-21 12:36:26 UTC
oVirt Team: Infra
Embargoed:
pm-rhel: ovirt-4.4+
aoconnor: blocker-


Attachments (Terms of Use)
engine.log with debug enabled (845.63 KB, application/x-xz)
2020-07-30 15:34 UTC, Dominik Holler
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 111923 0 None MERGED Fix for negative subscription count 2021-01-26 08:32:40 UTC
oVirt gerrit 111951 0 master MERGED Fix for negative subscription count 2021-01-26 08:33:22 UTC
oVirt gerrit 111955 0 master MERGED Bump vdsm-jsonrpc-java version to 1.5.7 2021-01-26 08:32:39 UTC

Description msheena 2020-06-11 12:18:36 UTC
Description of problem
======================
Given I have a RHV environment with a host,
When I create a bond in mode 1 (active-backup) with 2 slaves,
Then the bond is created successfully, but the active slave is not reported.

Version-Release number of selected component (if applicable)
============================================================
ovirt-engine-4.4.1.2-0.10.el8ev.noarch

How reproducible
================
Reproduces in long-running environments.
I have yet to determine at what point the host monitoring stops functioning in this particular aspect.
It was observed that restarting ovirt-engine.service gets the host monitoring working again, but not indefinitely - at some stage the issue will reproduce again.

Steps to Reproduce
==================
1. Create a bond in mode 1 with 2 slaves.
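A minimal sketch, assuming a Linux host using the kernel bonding driver, of how the active slave of an active-backup bond can be read and switched through sysfs (the bond name 'bond1' and the NIC names are placeholders for this illustration):

# Run on the host, not on the engine: read and switch the active slave of
# an active-backup (mode 1) bond via the bonding sysfs interface.
BOND = 'bond1'  # placeholder bond name
SYSFS = '/sys/class/net/{}/bonding/active_slave'.format(BOND)

def get_active_slave():
    with open(SYSFS) as f:
        return f.read().strip()

def set_active_slave(nic):
    # equivalent to: echo <nic> > /sys/class/net/bond1/bonding/active_slave
    with open(SYSFS, 'w') as f:
        f.write(nic)

if __name__ == '__main__':
    current = get_active_slave()
    print('active slave before:', current)
    set_active_slave('eth1' if current == 'eth0' else 'eth0')  # placeholder NICs
    print('active slave after:', get_active_slave())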

Actual results
==============
The active slave is not reported, regardless of how much time passes after the bond creation. (Manually triggering refresh capabilities causes RHV to report the active slave.)
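For reference, a hedged sketch of triggering 'refresh capabilities' programmatically with ovirt-engine-sdk4; the engine URL, credentials and the host name are placeholders for your environment:

# Hedged sketch: trigger a capabilities refresh for one host through the
# oVirt REST API using ovirt-engine-sdk4.
import ovirtsdk4 as sdk

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',  # placeholder
    username='admin@internal',
    password='secret',
    insecure=True,  # lab setup; use ca_file in production
)
try:
    hosts_service = connection.system_service().hosts_service()
    host = hosts_service.list(search='name=host_mixed_1')[0]  # placeholder host
    # corresponds to 'Refresh Capabilities' in the Administration Portal
    hosts_service.host_service(host.id).refresh()
finally:
    connection.close()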

Expected results
================
As described in [1]:
Shortly after the bond is created, RHV reports the active slave.

[1] - https://bugzilla.redhat.com/show_bug.cgi?id=1801794

Comment 3 RHEL Program Management 2020-06-15 08:43:44 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 6 Sandro Bonazzola 2020-07-02 13:20:55 UTC
Missed final 4.4.1 build, can you re-target to 4.4.2 or should it block 4.4.1 GA?

Comment 10 Dominik Holler 2020-07-30 15:34:49 UTC
Created attachment 1702968 [details]
engine.log with debug enabled

The attached engine.log contains many "|net|host_conn|no_id" events, which do not trigger refresh caps as expected.

The situation is created in the context of https://gerrit.ovirt.org/#/c/110474/10 by:

# Test from the gerrit change referenced above (oVirt system tests network
# suite). netlib, clusterlib, netattachlib, hostlib, sshlib and the constants
# BOND_NAME, SLAVE1, SLAVE2, BOND_OPTIONS, CMD come from that suite; CMD is
# presumably the host-side command that switches the bond's active slave,
# and sleep is time.sleep.
def test_bond_active_slave(system, default_data_center, default_cluster,
                           host_0_up):
    with netlib.new_network('test-bond', default_data_center) as test_bond:
        with clusterlib.network_assignment(default_cluster, test_bond):
            attachment = netattachlib.NetworkAttachmentData(test_bond,
                                                            BOND_NAME)
            bond_data = netattachlib.BondingData(
                BOND_NAME, slave_names=[SLAVE1, SLAVE2], options=BOND_OPTIONS
            )
            with hostlib.setup_networks(host_0_up, [attachment],
                                        bonding_data=[bond_data]):
                bond = hostlib.Bond(host_0_up)
                bond.import_by_name(BOND_NAME)
                bond.wait_for_up_status()
                sleep(20)
                initial_active_slave = bond.get_active_slave()
                inactive_slave = bond.get_inactive_slaves()[0]
                sleep(20)
                # switch the active slave on the host; this should emit a
                # |net|host_conn|no_id event and trigger a refresh of caps
                sshlib.exec_command(
                    host_0_up.address, host_0_up.root_password,
                    CMD + ' ' + BOND_NAME + ' ' + inactive_slave.name
                )
                slaves = [initial_active_slave.name, inactive_slave.name]
                # sleep until engine refreshes capabilities, flipping the
                # active slave back and forth on every iteration
                for i in range(100):
                    new_active_slave = bond.get_active_slave()
                    print('{}: {} -> {}'.format(
                        i, initial_active_slave.id, new_active_slave.id
                    ))
                    sshlib.exec_command(
                        host_0_up.address, host_0_up.root_password,
                        CMD + ' ' + BOND_NAME + ' ' + slaves[i % 2]
                    )
                    sleep(10)
                assert new_active_slave.id != initial_active_slave.id


https://gerrit.ovirt.org/#/c/110474/10 seems to work on machines with more computing power, but fails on machines with less computing power. It looks like the event reaches org.ovirt.vdsm.jsonrpc.client, but it is unclear whether it was delivered to HostConnectionRefresher.

Comment 11 Dominik Holler 2020-07-31 10:21:03 UTC
https://gerrit.ovirt.org/#/c/110564/2 shows that the scenario works fine if only a single host is added.

Comment 12 Michael Burman 2020-08-06 08:40:21 UTC
This bug also affects DHCPv4/6 events to the engine.
After the engine has been alive for some time without a restart, the DHCPv4/6 events do not arrive at the engine on change or update.
Restarting the engine fixes it immediately.

Comment 13 Artur Socha 2020-08-06 12:52:42 UTC
Michael, please make a thread dump next time you suspect this particular issue. The root cause is most likely on the engine's side. Another thing that comes to mind is GC stats. They would allow me to check that there are not too many full GC cycles freezing the engine's threads and thereby causing timeouts. Perhaps such events are not handled properly and the connection is being closed as a result.
Well, that's just another theory based on the observation that the issue manifests itself in long-running environments, but who knows, perhaps it will get us closer to the root cause.
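A minimal sketch of collecting the diagnostics asked for above (a thread dump plus GC statistics) for the ovirt-engine JVM; it assumes the JDK tools jstack and jstat are installed, and the pgrep pattern for finding the engine process is an assumption that may need adjusting:

# Grab a thread dump and GC statistics for the ovirt-engine JVM.
import subprocess

def engine_pid():
    # assumption: the engine runs as a jboss-modules java process
    out = subprocess.check_output(['pgrep', '-f', 'jboss-modules.*ovirt-engine'])
    return out.split()[0].decode()

pid = engine_pid()
with open('/tmp/engine-threaddump.txt', 'wb') as f:
    f.write(subprocess.check_output(['jstack', '-l', pid]))  # thread dump
# -gcutil prints heap/GC utilization; 5 samples, 10 seconds apart
print(subprocess.check_output(['jstat', '-gcutil', pid, '10s', '5']).decode())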

Comment 15 Artur Socha 2020-10-23 07:56:21 UTC
An update:
I have one environment, besides the QE one, where the issue reappears about a day after a restart. Remote debug is enabled, which I suspect helps to recreate the problem.
So far these are my findings:
- on the host (vdsm) the notification is fired and correctly written to the socket (betterAsyncore.py)
- the notification is received and processed by the engine's jsonrpc java client; debug logs:
2020-10-23 09:50:52,603+02 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received: {"jsonrpc": "2.0", "method": "|net|host_conn|no_id", "params": {"notify_time": 11884568459}}
2020-10-23 09:50:59,877+02 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received: {"jsonrpc": "2.0", "method": "|net|host_conn|no_id", "params": {"notify_time": 11884575732}}

- this notification message does not reach HostConnectionRefresher, or HostConnectionRefresher is somehow blocked. That is exactly why the 'refresh capabilities' call is not made. I am going to dig into it now.
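To narrow this down, a hedged sketch that correlates the two in a DEBUG-level engine.log: it counts how many '|net|host_conn|no_id' notifications are followed by a capabilities refresh within a time window. The 'GetCapabilities' marker is an assumption about how the refresh shows up in the log and may need adjusting:

# Scan engine.log (DEBUG level) and report notifications without a
# following capabilities refresh.
import re
from datetime import datetime, timedelta

TS = re.compile(r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})')
WINDOW = timedelta(seconds=30)

events, refreshes = [], []
with open('engine.log') as log:
    for line in log:
        m = TS.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), '%Y-%m-%d %H:%M:%S,%f')
        if '|net|host_conn|no_id' in line:
            events.append(ts)
        elif 'GetCapabilities' in line:  # assumed marker for the refresh
            refreshes.append(ts)

unanswered = [e for e in events
              if not any(e <= r <= e + WINDOW for r in refreshes)]
print('{} events, {} without a refresh within {}s'.format(
    len(events), len(unanswered), WINDOW.seconds))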

Comment 16 Artur Socha 2020-10-27 13:13:58 UTC
The core issue was found in vdsm-jsonrpc-java and fixed there [1]. Once the new jsonrpc client is released, I will bump it in ovirt-engine.

Comment 17 Artur Socha 2020-10-27 13:25:44 UTC
The issue was found in the vdsm-jsonrpc-java code. It is addressed by this patch [1], and once vdsm-jsonrpc-java is released, it will be bumped in ovirt-engine.

[1] https://gerrit.ovirt.org/111923/
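For illustration only (this is not the vdsm-jsonrpc-java code; the actual fix is the patch above): a sketch of the general failure class where an event-demand counter that underflows below zero silently stops event delivery:

# Illustration of a negative demand/subscription counter suppressing events.
class Subscription:
    def __init__(self, subscriber):
        self.subscriber = subscriber
        self.requested = 0

    def request(self, n):
        self.requested += n

    def on_event(self, event):
        if self.requested <= 0:
            return            # no demand: event is silently dropped
        self.requested -= 1
        self.subscriber(event)

sub = Subscription(print)
sub.request(1)
sub.on_event('e1')    # delivered, requested -> 0
sub.requested -= 1    # buggy extra decrement elsewhere: requested -> -1
sub.request(1)        # requested -> 0, still no demand
sub.on_event('e2')    # dropped; every later event is dropped the same way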

Comment 18 Michael Burman 2020-11-15 09:33:40 UTC
Hi Artur,

rhvm-4.4.4-0.1.el8ev.noarch shipped with vdsm-jsonrpc-java-1.5.5-1.el8ev.noarch.
There is no higher version of vdsm-jsonrpc-java available.

Comment 19 Michael Burman 2020-11-15 10:31:32 UTC
Because vdsm-jsonrpc-java 1.5.7 is not available in RHV 4.4.4, this can't be ON_QA.
The relevant test keeps failing.
We added an engine restart to overcome this bug; when the bug moved to ON_QA, the restart step was skipped and the test failed.
This should go back to MODIFIED until a newer version of vdsm-jsonrpc-java is shipped to QE.

Comment 21 Michael Burman 2020-12-01 07:28:28 UTC
Verified on rhvm-4.4.4.2-0.1.el8ev.noarch and vdsm-jsonrpc-java-1.5.7-1.el8ev.noarch.

We haven't seen this issue since the fix. All automated tests pass now without the need to restart the engine service.

Comment 22 Sandro Bonazzola 2020-12-21 12:36:26 UTC
This bugzilla is included in oVirt 4.4.4 release, published on December 21st 2020.

Since the problem described in this bug report should be resolved in oVirt 4.4.4 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

