Description of problem
======================
Given I have a RHV environment with a host,
When I create a bond in mode 1 (active-backup) with 2 slaves,
Then the bond is created successfully, but the active slave is not reported.

Version-Release number of selected component (if applicable)
============================================================
ovirt-engine-4.4.1.2-0.10.el8ev.noarch

How reproducible
================
Reproduces in long-running environments. I have yet to determine when the host monitoring stops functioning in this particular aspect. It was observed that restarting ovirt-engine.service gets the host monitoring working again, but not indefinitely: at some point the issue reproduces again.

Steps to Reproduce
==================
1. Create a bond in mode 1 with 2 slaves.

Actual results
==============
The active slave is not reported, regardless of how much time passes since the bond creation. (Manually triggering "refresh capabilities" does cause RHV to report the active slave.)

Expected results
================
As described in [1]: momentarily after creating the bond, RHV reports the active slave.

[1] - https://bugzilla.redhat.com/show_bug.cgi?id=1801794
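For reference, the kernel itself exposes a bond's active slave under the bonding sysfs tree on the host, which is a convenient way to cross-check what RHV reports. A minimal sketch in Python; the `sysfs_root` parameter is my own addition for testability (on a real host it defaults to /sys):

```python
import os

# Read a bond's active slave from the kernel's bonding sysfs attribute,
# e.g. /sys/class/net/bond0/bonding/active_slave.
# `sysfs_root` is parameterized only so the function can be tested
# without a real bond interface.
def active_slave(bond_name, sysfs_root='/sys'):
    path = os.path.join(
        sysfs_root, 'class/net', bond_name, 'bonding', 'active_slave'
    )
    with open(path) as f:
        return f.read().strip()
```

Comparing this value against the slave reported in the RHV UI/API is how the discrepancy described above can be confirmed.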
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
Missed the final 4.4.1 build; can you re-target to 4.4.2, or should this block 4.4.1 GA?
Created attachment 1702968 [details]
engine.log with debug enabled

The attached engine.log contains many "|net|host_conn|no_id" entries, which do not trigger a refresh-capabilities call as expected.

The situation is created in the context of https://gerrit.ovirt.org/#/c/110474/10 by:

def test_bond_active_slave(system, default_data_center, default_cluster, host_0_up):
    with netlib.new_network('test-bond', default_data_center) as test_bond:
        with clusterlib.network_assignment(default_cluster, test_bond):
            attachment = netattachlib.NetworkAttachmentData(test_bond, BOND_NAME)
            bond_data = netattachlib.BondingData(
                BOND_NAME, slave_names=[SLAVE1, SLAVE2], options=BOND_OPTIONS
            )
            with hostlib.setup_networks(host_0_up, [attachment], bonding_data=[bond_data]):
                bond = hostlib.Bond(host_0_up)
                bond.import_by_name(BOND_NAME)
                bond.wait_for_up_status()
                sleep(20)
                initial_active_slave = bond.get_active_slave()
                inactive_slave = bond.get_inactive_slaves()[0]
                sleep(20)
                sshlib.exec_command(
                    host_0_up.address,
                    host_0_up.root_password,
                    CMD + ' ' + BOND_NAME + ' ' + inactive_slave.name
                )
                slaves = [initial_active_slave.name, inactive_slave.name]
                # sleep until engine refreshes capabilities
                for i in range(100):
                    new_active_slave = bond.get_active_slave()
                    print('{}: {} -> {}'.format(
                        i, initial_active_slave.id, new_active_slave.id
                    ))
                    sshlib.exec_command(
                        host_0_up.address,
                        host_0_up.root_password,
                        CMD + ' ' + BOND_NAME + ' ' + slaves[i % 2]
                    )
                    sleep(10)
                assert new_active_slave.id != initial_active_slave.id

https://gerrit.ovirt.org/#/c/110474/10 seems to work on machines with much computing power, but fails on machines with less computing power. It looks like the event reaches org.ovirt.vdsm.jsonrpc.client, but it is unclear whether it was delivered to HostConnectionRefresher.
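As a side note, the fixed sleep() calls are what make the test above sensitive to machine speed; a poll-with-timeout helper is the usual way to make such a test tolerant of slower engines. A hypothetical sketch (the name `wait_until` and its defaults are mine, not part of the test suite):

```python
import time

# Poll `predicate` (a zero-argument callable) until it returns a truthy
# value or `timeout` seconds elapse. Returns True on success, False on
# timeout. Slower machines simply use more of the budget instead of
# failing a fixed sleep.
def wait_until(predicate, timeout=120, interval=5):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False
```

For example, `sleep(20)` followed by `bond.get_active_slave()` could become `wait_until(lambda: bond.get_active_slave() is not None, timeout=120)`.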
https://gerrit.ovirt.org/#/c/110564/2 shows that the scenario is just fine, if only a single host is added.
This bug also affects DHCPv4/v6 events to the engine. After the engine has been alive for some time without a restart, the DHCPv4/v6 events stop arriving at the engine on change or update. Restarting the engine fixes it immediately.
Micheal, please take a thread dump next time you suspect this particular issue. The root cause is most likely on the engine's side. One other thing that comes to mind is GC stats. They would allow me to check that there are not too many full GC cycles freezing the engine's threads and thereby causing timeouts. Perhaps such events are not handled properly and the connection is closed as a result. That is just another theory based on the information that the issue manifests itself in long-running environments, but who knows, perhaps it will get us closer to the root cause.
An update: I have one environment besides the QE one where the issue reappears about one day after a restart. Remote debug is enabled, which I suspect helps to recreate the problem. So far these are my findings:

- On the host (vdsm), the notification is fired and correctly written to the socket (betterAsyncore.py).
- The notification is received and processed by the engine's jsonrpc java client. Debug logs:

  2020-10-23 09:50:52,603+02 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received: {"jsonrpc": "2.0", "method": "|net|host_conn|no_id", "params": {"notify_time": 11884568459}}
  2020-10-23 09:50:59,877+02 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received: {"jsonrpc": "2.0", "method": "|net|host_conn|no_id", "params": {"notify_time": 11884575732}}

- This notification message does not reach HostConnectionRefresher, or HostConnectionRefresher is somehow blocked. That is exactly why the 'refresh capabilities' call is not made.

I am going to dig into it now.
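To make the suspected failure mode concrete, here is a toy model (in Python, not the actual vdsm-jsonrpc-java code; all names are illustrative assumptions) of a receiver/dispatcher split: a message can be "received" and logged by the receiver while never reaching a subscriber if the dispatch step is blocked or never runs.

```python
import queue

# Toy model: receive() enqueues decoded notifications (analogous to the
# "Message received" debug line from ResponseWorker); dispatch_pending()
# hands them to subscribers (analogous to HostConnectionRefresher).
class ToyDispatcher:
    def __init__(self):
        self.events = queue.Queue()
        self.subscribers = []

    def receive(self, msg):
        # The message is now "received" as far as logging is concerned...
        self.events.put(msg)

    def dispatch_pending(self):
        # ...but it only reaches subscribers if this step actually runs.
        while not self.events.empty():
            msg = self.events.get()
            for handler in self.subscribers:
                handler(msg)
```

If dispatch_pending() is never invoked (e.g. the dispatcher thread is dead or blocked), receive() keeps succeeding and the receiver keeps logging, which matches the observed gap between the ResponseWorker log lines and the missing refresh-capabilities calls.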
The core issue was found in vdsm-jsonrpc-java and fixed [1] there. Once the new jsonrpc client is released I will bump it in ovirt-engine.
The issue was found in the vdsm-jsonrpc-java code. It is addressed by this patch [1], and once jsonrpc-java is released it will be bumped in ovirt-engine.

[1] https://gerrit.ovirt.org/111923/
Hi Arthur,

rhvm-4.4.4-0.1.el8ev.noarch shipped with vdsm-jsonrpc-java-1.5.5-1.el8ev.noarch. There is no higher version of vdsm-jsonrpc-java available.
Because vdsm-jsonrpc-java 1.5.7 is not available in RHV 4.4.4, this can't be ON_QA: the relevant test keeps failing. We had added an engine restart to work around this bug; when the bug moved to ON_QA, the restart step was skipped and the test failed. This should go back to MODIFIED until a newer version of vdsm-jsonrpc-java is shipped to QE.
Verified on rhvm-4.4.4.2-0.1.el8ev.noarch with vdsm-jsonrpc-java-1.5.7-1.el8ev.noarch. We haven't seen this issue since the fix. All automated tests are now passing without the need to restart the engine service.
This bugzilla is included in oVirt 4.4.4 release, published on December 21st 2020. Since the problem described in this bug report should be resolved in oVirt 4.4.4 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.