Bug 1534658 - [downstream clone - 4.1.9] Host installation made all other hosts timeout - add logging to make heartbeat issues more visible
Summary: [downstream clone - 4.1.9] Host installation made all other hosts timeout - add logging to make heartbeat issues more visible
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm-jsonrpc-java
Version: 4.0.7
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium
Target Milestone: ovirt-4.1.9
Assignee: Piotr Kliczewski
QA Contact: Petr Matyáš
URL:
Whiteboard:
Depends On: 1502391
Blocks:
 
Reported: 2018-01-15 17:00 UTC by rhev-integ
Modified: 2021-06-10 14:12 UTC
CC List: 11 users

Fixed In Version: vdsm-jsonrpc-java-1.3.16-1.el7ev
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1502391
Environment:
Last Closed: 2018-01-24 14:42:48 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments


Links
- Red Hat Product Errata RHBA-2018:0131 (SHIPPED_LIVE): vdsm-jsonrpc-java bug fix and enhancement update for RHV 4.1.9 (last updated 2018-01-24 19:38:16 UTC)
- oVirt gerrit 84984 (master, MERGED): heartbeat: provide more information when exceeded (last updated 2018-01-15 17:05:07 UTC)
- oVirt gerrit 85495 (ovirt-4.1, MERGED): heartbeat: provide more information when exceeded (last updated 2018-01-15 17:05:07 UTC)

Description rhev-integ 2018-01-15 17:00:36 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1502391 +++
======================================================================

Description of problem:

A host was installed and the engine failed to connect to it. Three minutes later, all 25 other hosts, across multiple clusters and DCs, went Not Responding. Restarting the engine corrected the problem for all hosts.

1. Host installation finishes:
2017-10-11 22:13:35,695 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) [7e14403e] Correlation ID: 7e14403e, Call Stack: null, Custom Event ID: -1, Message: Installing Host clove. Stage: Termination.

2. Engine tries to connect to it:
2017-10-11 22:13:35,749 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to /172.16.38.206

3. PollVDS Timeout

2017-10-11 22:13:38,749 WARN  [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (org.ovirt.thread.pool-6-thread-8) [] Retry failed
2017-10-11 22:13:38,749 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.pool-6-thread-18) [7e14403e] Command 'PollVDSCommand(HostName = clove, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='85fcbdde-a545-429f-aca2-96d242f0872d'})' execution failed: VDSGenericException: VDSNetworkException: Timeout during xml-rpc call
2017-10-11 22:13:38,749 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.pool-6-thread-18) [7e14403e] Timeout waiting for VDSM response: null

4. Last PollVDS attempt; installation fails

2017-10-11 22:15:37,827 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.pool-6-thread-18) [7e14403e] Command 'PollVDSCommand(HostName = clove, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='85fcbdde-a545-429f-aca2-96d242f0872d'})' execution failed: VDSGenericException: VDSNetworkException: Timeout during xml-rpc call
2017-10-11 22:15:37,827 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (org.ovirt.thread.pool-6-thread-18) [7e14403e] Timeout waiting for VDSM response: null
2017-10-11 22:15:37,828 ERROR [org.ovirt.engine.core.bll.hostdeploy.InstallVdsInternalCommand] (org.ovirt.thread.pool-6-thread-18) [7e14403e] Host installation failed for host '85fcbdde-a545-429f-aca2-96d242f0872d', 'clove': Network error during communication with the host

5. Engine starts fencing the installed host

2017-10-11 22:15:01,783 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-99) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Executing power management status on Host clove using Proxy Host elphias and Fence Agent ipmilan:172.16.208.27.

6. All other hosts lose connection, one by one. 25 hosts.

2017-10-11 22:16:36,057 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler55) [43b80e86] Command 'GetAllVmStatsVDSCommand(HostName = jailin, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='70584133-f92d-4453-86b2-a8c8a449b3f7', vds='Host[jailin,70584133-f92d-4453-86b2-a8c8a449b3f7]'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2017-10-11 22:16:36,057 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler93) [] Command 'GetAllVmStatsVDSCommand(HostName = haister, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='c551463c-8201-4b24-acce-b0c36c031a09', vds='Host[haister,c551463c-8201-4b24-acce-b0c36c031a09]'})' execution failed: VDSGenericException: VDSNetworkException: Vds timeout occured
2017-10-11 22:16:36,560 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler81) [7fce1c8b] Command 'GetStatsVDSCommand(HostName = arnault, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='15f76737-f372-439b-8504-dc3797c0951c', vds='Host[arnault,15f76737-f372-439b-8504-dc3797c0951c]'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2017-10-11 22:16:37,058 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler64) [43b80e86] Command 'GetAllVmStatsVDSCommand(HostName = ballmer, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='ade4e274-ac8f-4451-a22f-5352eaa81d98', vds='Host[ballmer,ade4e274-ac8f-4451-a22f-5352eaa81d98]'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2017-10-11 22:16:37,058 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler83) [7fce1c8b] Command 'GetAllVmStatsVDSCommand(HostName = bezas, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='a30db754-ef47-4b73-bbca-19b3bfee3b3e', vds='Host[bezas,a30db754-ef47-4b73-bbca-19b3bfee3b3e]'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues

Some of these messages report a vds timeout, which is set to 180 s. Counting back from them points to 22:13:36, which is right around the first PollVDS timeout and the initial connection to the host being installed.
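
As a quick sanity check on that arithmetic, here is a trivial java.time sketch; the timestamps are taken from the log excerpts above and the 180 s value is the vds timeout mentioned here:

import java.time.Duration;
import java.time.LocalTime;

public class TimeoutBackCheck {
    public static void main(String[] args) {
        // First "Vds timeout occured" failure seen in the engine log above
        LocalTime firstFailure = LocalTime.parse("22:16:36");
        // The vds timeout in this environment is 180 seconds
        LocalTime requestSent = firstFailure.minus(Duration.ofSeconds(180));
        // Prints 22:13:36, right around the first PollVDS timeout and the
        // initial connection attempt to the freshly installed host
        System.out.println(requestSent);
    }
}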

7. Does not recover by itself.

8. Restarting the ovirt-engine service reconnects all hosts.

Version-Release number of selected component (if applicable):
ovirt-engine-4.0.7.4-0.1.el7ev.noarch
vdsm-jsonrpc-java-1.2.10-1.el7ev.noarch

How reproducible:
Unknown

(Originally by Germano Veit Michel)

Comment 3 rhev-integ 2018-01-15 17:00:59 UTC
There were some stuck ProcessOvfUpdateForStorageDomain commands, but these have been there for quite some time and don't look related.

(Originally by Germano Veit Michel)

Comment 4 rhev-integ 2018-01-15 17:01:12 UTC
Piotr, could you please take a look?

(Originally by Martin Perina)

Comment 5 rhev-integ 2018-01-15 17:01:23 UTC
This looks like an issue we had before, which was already fixed. To make sure it is really the same issue, please provide the vdsm logs from the host being installed, and server.log.

(Originally by Piotr Kliczewski)

Comment 6 rhev-integ 2018-01-15 17:01:33 UTC
(In reply to Piotr Kliczewski from comment #4)
> This looks like an issue we had before, which was already fixed. To make
> sure it is really the same issue, please provide the vdsm logs from the
> host being installed, and server.log.

Hi Piotr,

I checked server.log previously and it was clean; that's why I did not upload it. Attaching it now. VDSM logs will follow soon (keeping the needinfo on me).

Could you please point to the bug/patch that you think already fixed it?

(Originally by Germano Veit Michel)

Comment 9 rhev-integ 2018-01-15 17:02:06 UTC
From the vdsm logs I see very strange vdsm behavior:
1. Vdsm was started at 2017-10-12 22:13:35,669+0200
2. The engine connected at 2017-10-11 22:13:36,037+0200
3. The engine got disconnected at 2017-10-12 12:38:33,063+0200
4. Vdsm was restarted at 2017-10-12 12:49:45,843+0200 and became fully functional

Please note that between 1 and 3 vdsm never received a CONNECT frame, and there is no sign of mom (localhost) connecting to vdsm.
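
For context, the frame referred to here is the STOMP CONNECT frame that normally opens the engine-to-vdsm session once the TLS handshake completes (the channel is JSON-RPC over STOMP, see the "SSL Stomp Reactor" thread in the log above). Per the STOMP 1.2 specification it looks roughly like this; the header values are generic placeholders, not the ones the engine actually sends, and ^@ marks the terminating NUL byte:

CONNECT
accept-version:1.2
host:<vdsm-host>
heart-beat:<outgoing-ms>,<incoming-ms>

^@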

From the engine perspective, I see that the engine attempted to call the ping method for some time but failed, which resulted in the fencing flow at 2017-10-11 22:15:01,784. The host was switched to maintenance at 2017-10-11 22:16:25,393. There was also this balancing info:

2017-10-11 22:16:27,144 WARN  [org.ovirt.engine.core.bll.scheduling.policyunits.EvenGuestDistributionBalancePolicyUnit] (DefaultQuartzScheduler93) [] There is no host with less than 33 running guests
2017-10-11 22:16:27,144 WARN  [org.ovirt.engine.core.bll.scheduling.policyunits.CpuAndMemoryBalancingPolicyUnit] (DefaultQuartzScheduler93) [] All candidate hosts have been filtered, can't balance the cluster 'Tele2Cloud3' based on the CPU usage, will try memory based approach

I see that all the other hosts started to fail one by one at 2017-10-11 22:16:36,057, and the failures ended at 2017-10-11 22:18:27,617. The reason for all of the failures was no response within the higher-level timeout.

Please provide the vdsm log for one of the hosts that started to fail after the failed installation.

(Originally by Piotr Kliczewski)

Comment 11 rhev-integ 2018-01-15 17:02:26 UTC
There is nothing special in the log. I can see that the last request from the engine arrived at 2017-10-11 22:13:28,083+0200 and later the connection was closed by the engine at 2017-10-11 22:16:38,160+0200. I haven't seen similar behavior before.

Here is what we know:
1. The host was installed and the handshake was successful, but no CONNECT frame was sent
2. After the host install issue, other hosts slowly started to time out according to the timeout defined in the db

Checking the engine log a bit further, I see an enormous number of OVF stream uploads; around the time the engine was about to connect to vdsm, these log entries stopped appearing. Here are the last entries:

2017-10-11 22:13:38,339 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler79) [43b80e86] Command 'ProcessOvfUpdateForStorageDomain' (id: '7c2f78e9-c18b-4610-9f57-35f34a38dd27') waiting on child command id: '95057e8d-4906-4d8c-99c8-f7c25db0fcc0' type:'UploadStream' to complete
2017-10-11 22:14:16,403 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler89) [7fce1c8b] Command 'ProcessOvfUpdateForStorageDomain' (id: '144d5adf-5777-41b6-8a33-73c59d783a47') waiting on child command id: '84800851-e7a3-4638-98e3-9016c493c0f4' type:'UploadStream' to complete 

Do we know anything about these massive upload operations? They seem to occupy many threads.

(Originally by Piotr Kliczewski)

Comment 12 rhev-integ 2018-01-15 17:02:33 UTC
(In reply to Piotr Kliczewski from comment #10)
> There is nothing special in the log. I can see that the last request from
> the engine arrived at 2017-10-11 22:13:28,083+0200 and later the connection
> was closed by the engine at 2017-10-11 22:16:38,160+0200. I haven't seen
> similar behavior before.

Yes. I too only saw the engine not talking to any hosts; no clues anywhere else.

> 
> Here is what we know:
> 1. The host was installed and the handshake was successful, but no CONNECT
> frame was sent
> 2. After the host install issue, other hosts slowly started to time out
> according to the timeout defined in the db
> 
> Checking the engine log a bit further, I see an enormous number of OVF
> stream uploads; around the time the engine was about to connect to vdsm,
> these log entries stopped appearing. Here are the last entries:
> 
> 2017-10-11 22:13:38,339 INFO 
> [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback]
> (DefaultQuartzScheduler79) [43b80e86] Command
> 'ProcessOvfUpdateForStorageDomain' (id:
> '7c2f78e9-c18b-4610-9f57-35f34a38dd27') waiting on child command id:
> '95057e8d-4906-4d8c-99c8-f7c25db0fcc0' type:'UploadStream' to complete
> 2017-10-11 22:14:16,403 INFO 
> [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback]
> (DefaultQuartzScheduler89) [7fce1c8b] Command
> 'ProcessOvfUpdateForStorageDomain' (id:
> '144d5adf-5777-41b6-8a33-73c59d783a47') waiting on child command id:
> '84800851-e7a3-4638-98e3-9016c493c0f4' type:'UploadStream' to complete 
> 
> Do we know anything about these massive upload operations? They seem to
> occupy many threads.

We don't. The engine had been waiting for all these commands to complete for a very long time. Our oldest logs still show them waiting, so the information about why they are still there has been rotated out of the logs. It's been like that for at least a month. After the engine restart the OVF updates were cleaned up.

I understand we both see an issue, but there are no clues about where it's coming from. As an engine restart fixed it, it probably indicates something wrong in the engine itself, right? Do you have any ideas about what could trigger this, so we can try to reproduce it? If not, I'm afraid we don't have enough data to tackle this bug.

(Originally by Germano Veit Michel)

Comment 13 rhev-integ 2018-01-15 17:02:42 UTC
(In reply to Germano Veit Michel from comment #11)
> 
> We don't. The engine had been waiting for all these commands to complete
> for a very long time. Our oldest logs still show them waiting, so the
> information about why they are still there has been rotated out of the
> logs. It's been like that for at least a month. After the engine restart
> the OVF updates were cleaned up.
> 
> I understand we both see an issue, but there are no clues about where it's
> coming from. As an engine restart fixed it, it probably indicates something
> wrong in the engine itself, right? Do you have any ideas about what could
> trigger this, so we can try to reproduce it? If not, I'm afraid we don't
> have enough data to tackle this bug.

I agree, there is an engine-side issue here, but I am not able to say what is causing it. I haven't seen anything similar before. The only thing we can do now is try to reproduce it. Do you have any suggestions on how to do that?

(Originally by Piotr Kliczewski)

Comment 14 rhev-integ 2018-01-15 17:02:50 UTC
(In reply to Piotr Kliczewski from comment #12)
> I agree, there is an engine-side issue here, but I am not able to say what
> is causing it. I haven't seen anything similar before. The only thing we
> can do now is try to reproduce it. Do you have any suggestions on how to do
> that?

No idea. Except for those OVF streams (which we don't know how they got into that state) I don't see anything exceptional. Do you think they could be related? How would they block host<->engine communication? I assume it's not easy to reproduce, otherwise it would be happening all the time.

(Originally by Germano Veit Michel)

Comment 15 rhev-integ 2018-01-15 17:02:55 UTC
I am not sure whether the OVF streams are affecting the communication. For some reason the responses never arrived, and for the installed host we saw that the CONNECT frame was not sent. The OVF stream upload uses an HTTP client and a different physical connection. I do not see any clues in the logs as to what could be going on. Maybe the cause is the same for both.

I am not able to come up with steps to reproduce, so I suggest closing this with insufficient data and reopening when we see it one more time. What do you think?

(Originally by Piotr Kliczewski)

Comment 16 rhev-integ 2018-01-15 17:03:10 UTC
Are there any log prints we can add?

(Originally by Oved Ourfali)

Comment 17 rhev-integ 2018-01-15 17:03:17 UTC
Based on the logs provided I am not sure what happened. Debug-level logs would certainly be more helpful, but I am not able to point to any specific place that would help.

(Originally by Piotr Kliczewski)

Comment 18 rhev-integ 2018-01-15 17:03:27 UTC
Moran, let's close this bug for now with INSUFFICIENT_DATA and reopen it when we get reproduction steps or debug logs.

(Originally by Martin Perina)

Comment 19 rhev-integ 2018-01-15 17:03:34 UTC
(In reply to Piotr Kliczewski from comment #14)
> I am not able to come up with steps to reproduce, so I suggest closing this
> with insufficient data and reopening when we see it one more time. What do
> you think?

Agreed. It would be nice to have better logging of heartbeat issues though, as these happen quite frequently (most of the time due to network issues) and we are usually in the dark about them. Not sure what could be done.

(Originally by Germano Veit Michel)

Comment 20 rhev-integ 2018-01-15 17:03:44 UTC
I am happy to provide more logging around heartbeat issues. Please suggest information that would be helpful and I will make sure it is included.

(Originally by Piotr Kliczewski)

Comment 21 rhev-integ 2018-01-15 17:03:53 UTC
(In reply to Piotr Kliczewski from comment #19)
> I am happy to provide more logging around heartbeat issues. Please suggest
> information that would be helpful and I will make sure it is included.

Hi Piotr,

I don't know the heartbeat mechanism well enough to comment on this. But some logging that helps us determine whether the host (or vdsm) stopped the heartbeats could help, for example warnings like: heartbeat not sent within N seconds, heartbeat not received in the last N seconds, heartbeat thread blocked.

I'm not sure if this makes sense, and I know some of these may be impossible to code, as they are not just a matter of logging.

(Originally by Germano Veit Michel)

Comment 22 rhev-integ 2018-01-15 17:04:02 UTC
OK, I will add such logging to the code.

(Originally by Piotr Kliczewski)
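
For illustration only, here is a minimal sketch of the kind of warning discussed above: when the heartbeat limit is exceeded, report how long each direction of the connection has been silent instead of just dropping it. The class and method names are hypothetical and are not vdsm-jsonrpc-java's actual API; the merged gerrit patches ("heartbeat: provide more information when exceeded") are the authoritative change.

import java.util.logging.Logger;

// Hypothetical sketch; names are illustrative, not the library's real API.
class HeartbeatWatchdog {
    private static final Logger log = Logger.getLogger("HeartbeatWatchdog");

    private final String peer;          // e.g. the host address the reactor talks to
    private final int heartbeatMillis;  // configured heartbeat interval in ms
    private volatile long lastReceived; // last time any bytes arrived from the peer
    private volatile long lastSent;     // last time any bytes were sent to the peer

    HeartbeatWatchdog(String peer, int heartbeatMillis) {
        this.peer = peer;
        this.heartbeatMillis = heartbeatMillis;
        long now = System.currentTimeMillis();
        this.lastReceived = now;
        this.lastSent = now;
    }

    void onBytesReceived() { lastReceived = System.currentTimeMillis(); }
    void onBytesSent()     { lastSent = System.currentTimeMillis(); }

    // Called periodically by the reactor loop; logs the silence on each
    // direction when the limit is exceeded instead of failing silently.
    boolean heartbeatExceeded() {
        long now = System.currentTimeMillis();
        long quietIncoming = now - lastReceived;
        long quietOutgoing = now - lastSent;
        if (quietIncoming > heartbeatMillis) {
            log.warning(String.format(
                    "Heartbeat exceeded for %s: nothing received for %d ms, "
                            + "nothing sent for %d ms (limit %d ms)",
                    peer, quietIncoming, quietOutgoing, heartbeatMillis));
            return true;
        }
        return false;
    }
}

A warning along these lines would at least show whether the peer went quiet on the receive side, the send side, or both, which is exactly the information that was missing when the other hosts timed out here.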

Comment 27 rhev-integ 2018-01-15 17:04:42 UTC
Hi Piotr, 

The patch is on the 4.1 tree but there is no z-stream clone of this BZ. Is this really going out with 4.1.9?

I ask because we have a recurrence and it would be nice to confirm the fix; it's a good chance to catch the problem.

Also, is it safe to run 4.1.8 with vdsm-jsonrpc-java-1.3.16-1.el7ev?

Thank you!

(Originally by Germano Veit Michel)

Comment 28 rhev-integ 2018-01-15 17:04:49 UTC
(In reply to Germano Veit Michel from comment #26)
> The patch is on the 4.1 tree but there is no z-stream clone of this BZ. Is
> this really going out with 4.1.9?

As I understand it, this is the z-stream bug for 4.1, so there is no need to clone. I think Martin can clarify if I am wrong.

> 
> Also, is it safe to run 4.1.8 with vdsm-jsonrpc-java-1.3.16-1.el7ev?
> 

Yes, there are no API changes, so it is safe.

(Originally by Piotr Kliczewski)

Comment 30 Petr Matyáš 2018-01-17 12:19:00 UTC
Verified on vdsm-jsonrpc-java-1.3.16-1.el7ev.noarch

Comment 33 errata-xmlrpc 2018-01-24 14:42:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0131

Comment 34 Franta Kust 2019-05-16 12:55:16 UTC
BZ<2>Jira re-sync

