Bug 1404082 - Host is stuck in Non-Responsive status
Summary: Host is stuck in Non-Responsive status
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.1.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Artur Socha
QA Contact: Petr Matyáš
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-12-13 02:49 UTC by Germano Veit Michel
Modified: 2024-03-25 14:58 UTC
CC List: 15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-13 08:28:35 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments
ovirt engine log (1.94 MB, text/plain)
2019-06-10 17:42 UTC, Michael Watters

Description Germano Veit Michel 2016-12-13 02:49:43 UTC
Description of problem:

If I start the Manager before starting one of the Hosts, and the Host is not already in Maintenance mode, it gets stuck in the Non Responsive state forever until I restart the engine. I can't even switch it to Maintenance mode.

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.0-0.0.master.20161201071307.gita5ff876.el7.centos.noarch
vdsm-4.18.999-1020.git1ff41b1.el7.centos.x86_64
CentOS 7.2

How reproducible:
100%

Steps to Reproduce:
1. Shut down the Manager, leaving Host X NOT in Maintenance Mode
2. Shut down Host X
3. Start Manager, wait for it to come up and set Host X as Non Responsive
4. Start Host X

Actual results:
Host is Non Responsive and cannot be switched to Maintenance Mode

Expected results:
Host Connects

Additional info:
1. Engine Comes up and Host is powered off (Host goes Non-Responsive: OK, expected)
2. Power up the Host
3. Host remains Non-Responsive; the network connection is fine and the vdsm port is open, but the engine continues to time out.

2016-12-13 02:32:35,244Z ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler4) [] Command 'GetAllVmStatsVDSCommand(HostName = h2, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='14b8697c-4955-4d4c-ace0-cb6f7699945a'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues

4. Try to switch Host to Maintenance, still fails

2016-12-13 02:33:12,681Z INFO  [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (default task-20) [c06a8970-601e-48af-aea2-67fc957f984d] Running command: MaintenanceNumberOfVdssCommand internal: false. Entities affected :  ID: 14b8697c-4955-4d4c-ace0-cb6f7699945a Type: VDSAction group MANIPULATE_HOST with role type ADMIN

2016-12-13 02:33:12,944Z INFO  [org.ovirt.engine.core.bll.MaintenanceVdsCommand] (default task-20) [c06a8970-601e-48af-aea2-67fc957f984d] Running command: MaintenanceVdsCommand internal: true. Entities affected :  ID: 14b8697c-4955-4d4c-ace0-cb6f7699945a Type: VDS

2016-12-13 02:33:13,043Z INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SetHaMaintenanceModeVDSCommand] (default task-20) [c06a8970-601e-48af-aea2-67fc957f984d] START, SetHaMaintenanceModeVDSCommand(HostName = h2, SetHaMaintenanceModeVDSCommandParameters:{runAsync='true', hostId='14b8697c-4955-4d4c-ace0-cb6f7699945a'}), log id: 4c4a223a

2016-12-13 02:36:10,616Z ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SetHaMaintenanceModeVDSCommand] (default task-20) [c06a8970-601e-48af-aea2-67fc957f984d] Command 'SetHaMaintenanceModeVDSCommand(HostName = h2, SetHaMaintenanceModeVDSCommandParameters:{runAsync='true', hostId='14b8697c-4955-4d4c-ace0-cb6f7699945a'})' execution failed: VDSGenericException: VDSNetworkException: Vds timeout occured

-> Host is running fine, vdsm is listening on 54321
-> Host and manager can ping each other

5. Just restarting ovirt-engine makes the Host go Up on the first connection retry; nothing else changed.

Comment 1 Germano Veit Michel 2016-12-13 04:09:53 UTC
Ok,

So here is what I did

1. shut everything down
2. truncate vdsm.log and engine.log
3. followed reproduction steps
   * We need to leave a few minutes between starting the engine and the host; otherwise the host connects fine and the issue does not reproduce.

4. Engine started at 2016-12-13 13:47:56
5. Engine fails to connect to h2 at 2016-12-13 13:48:35 (it's powered off)
6. Powered host on at 

[root@h2 ~]# date
Tue Dec 13 13:56:43 AEST 2016

[root@h2 ~]# lsof | grep vdsm | grep LISTEN
vdsm      1546                 vdsm   23u     IPv6              28998       0t0        TCP *:54321 (LISTEN)

vdsm already listening...

7. At this point engine.log looks very quiet. Is it even trying to connect to h2?

8. At 14:05, the engine is still failing to connect to Host H2. That's almost 10 minutes AFTER the host (and vdsm) came up.

2016-12-13 14:05:56,273+10 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler9) [] Command 'GetAllVmStatsVDSCommand(HostName = h2, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='14b8697c-4955-4d4c-ace0-cb6f7699945a'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2016-12-13 14:05:56,274+10 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher] (DefaultQuartzScheduler9) [] Failed to fetch vms info for host 'h2' - skipping VMs monitoring.

9. There is no communication issue, port is open, network is fine.

10. I just restart ovirt-engine

[root@m ~]# date
Tue Dec 13 14:07:12 AEST 2016
[root@m ~]# systemctl restart ovirt-engine

2016-12-13 14:08:26,500+10 INFO  [org.ovirt.engine.core.bll.Backend] (ServerService Thread Pool -- 63) [] Running ovirt-engine 4.1.0-0.0.master.20161201071307.gita5ff876.el7.centos

And the Host is up on first connect!

2016-12-13 14:08:41,124+10 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler8) [604eab5c] Correlation ID: 5f42e05f, Call Stack: null, Custom Event ID: -1, Message: Status of host h2 was set to Up.

Comment 4 Piotr Kliczewski 2016-12-13 08:44:02 UTC
Looking at the logs I can see:

The engine was not able to connect to the h2 host since it started. I can see plenty of host monitoring attempts (every 3 seconds):

2016-12-13 13:48:50,773+10 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to h2.rhv41.rhev/192.168.41.20
2016-12-13 13:48:53,847+10 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler5) [a7d6ca6] Command 'GetCapabilitiesVDSCommand(HostName = h2, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='14b8697c-4955-4d4c-ace0-cb6f7699945a', vds='Host[h2,14b8697c-4955-4d4c-ace0-cb6f7699945a]'})' execution failed: org.ovirt.vdsm.jsonrpc.client.ClientConnectionException: Connection failed
2016-12-13 13:48:53,919+10 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler5) [a7d6ca6] Failure to refresh host 'h2' runtime info: org.ovirt.vdsm.jsonrpc.client.ClientConnectionException: Connection failed

until:

2016-12-13 13:51:33,834+10 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to h2.rhv41.rhev/192.168.41.20
2016-12-13 13:51:35,432+10 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (DefaultQuartzScheduler7) [54a13ade] FINISH, ConnectStoragePoolVDSCommand, log id: 48225d39
2016-12-13 13:51:35,433+10 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler7) [54a13ade] IrsBroker::Failed::GetStoragePoolInfoVDS: IRSGenericException: IRSErrorException: IRSNonOperationalException: Could not connect host to Data Center(Storage issue)
2016-12-13 13:51:35,435+10 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler6) [] Command 'GetAllVmStatsVDSCommand(HostName = h2, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='14b8697c-4955-4d4c-ace0-cb6f7699945a'})' execution failed: VDSGenericException: VDSNetworkException: Vds timeout occured
2016-12-13 13:51:35,435+10 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher] (DefaultQuartzScheduler6) [] Failed to fetch vms info for host 'h2' - skipping VMs monitoring.

After this there is no attempt to connect to the host again. Vdsm log confirms this. It was started at:

2016-12-13 13:56:07,705 INFO

and the only connection attempt was at:

2016-12-13 14:08:32,049 INFO  (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::ffff:192.168.41.30:43218 (protocoldetector:72)
2016-12-13 14:08:32,273 INFO  (Reactor thread) [ProtocolDetector.Detector] Detected protocol stomp from ::ffff:192.168.41.30:43218 (protocoldetector:127)

Based on the logs we know that the engine did not attempt to connect after 2016-12-13 13:51:35,435+10.

@Michal - why did we attempt to monitor so often, and why did we stop after "Failed to fetch vms info for host 'h2' - skipping VMs monitoring."?

@Germano - I do not see any attempt to manually activate the host.

Comment 5 Germano Veit Michel 2016-12-13 09:22:01 UTC
(In reply to Piotr Kliczewski from comment #4)
> @Germano - I do not see any attempt to manually activate the host.

Hi Piotr,

Indeed. This time I was more focused on giving you clean logs and forgot to try switching to maintenance and activating. But from what I remember it just times out and won't switch to maintenance (i.e.: look at HaMaintenance timing out in comment #0).

I can prepare new logs tomorrow if you think it's necessary. But the problem already seems quite clear from your previous comment. Let me know if you want another round of logs.

Cheers,

Comment 6 Piotr Kliczewski 2016-12-13 10:15:32 UTC
Germano,

Please provide the engine log with manual activation. It would be great to have it with debug log level set.

Thanks,

Comment 7 Piotr Kliczewski 2016-12-13 10:47:15 UTC
And it would be great to get wireshark log as well.

Comment 8 Germano Veit Michel 2016-12-14 02:27:54 UTC
Here we go again.

-> Engine debug enabled
-> Logs truncated
-> tcpdump on manager since engine start

All shut down.

2016-12-14 11:42:18 -> Engine goes Up

2016-12-14 11:42:28 -> Fails to talk to H2 for the first time (still off)

  ... GetCapabilitiesVDSCommand every few seconds

2016-12-14 11:48:27 -> Last GetCapabilities, why (?)

  ... GetAllVmStatsVDSCommand continues running every few minutes

  ... No signs of GetCapabilities|HostMonitoring

  !!! Here is the last time the engine tries to resolve H2 MAC-ARP 
       so apparently it's the last time it tried to connect
  ... wireshark t=370
  
2016-12-14 12:00:00 -> H2 is powered on
  ... wireshark t=1125

2016-12-14 12:01:24 -> H2 vdsm is starting

  ... wait

2016-12-14 12:06:28 -> GetAllVmStatsVDSCommand still failing
                       we are not even connected...

  ... Host still Non Responding
  ... no tries to resolve ARP or TCP SYN.
       
2016-12-14 12:08:13 -> Try to switch H2 to maintenance

2016-12-14 12:10:04 -> Fails to switch to maintenance (timeout)

2016-12-14 12:14:22 -> Restart ovirt-engine

  ... FINALLY tries to resolve H2 MAC and sends TCP SYN, which connects
  ... wireshark t=1926

2016-12-14 12:14:44 -> H2 is Up

It indeed looks like the engine gave up at 11:48:27 and did not try to connect to the Host anymore.

Funny thing is, apparently I can't reproduce this if the Hosted-Engine is in global maintenance mode. But might have been bad luck.

Comment 12 Piotr Kliczewski 2016-12-14 09:35:33 UTC
Based on the info provided I see that we have 2 issues:
- host monitoring stopped connection attempts (hopefully Michal can share some thoughts why)
- moving the host to maintenance times out because the following call never reaches vdsm:

2016-12-14 12:08:33,234+10 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.StompCommonClient] (default task-10) [cc040a56-155c-4f8a-8d8b-55379f3a5a5a] Message sent: SEND
destination:jms.topic.vdsm_requests
content-length:139
reply-to:jms.topic.vdsm_responses

<JsonRpcRequest id: "24569c64-67a0-451d-a9dd-4380bcb60b59", method: Host.setHaMaintenanceMode, params: {mode=LOCAL, enabled=true}> 

This call times out and fails SetHaMaintenanceModeVDSCommand.
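
For illustration only, a minimal generic sketch (made-up class and method names, not the engine's actual jsonrpc client) of how such a timeout surfaces: the request id is registered, the frame is handed to the transport, and the caller waits on a future that only completes when a response with the same id comes back. If the frame never reaches vdsm, the wait expires and the command fails with a timeout.

import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class RpcTimeoutSketch {
    private final ConcurrentHashMap<String, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

    public String call(String method, String paramsJson, long timeoutSeconds) throws Exception {
        String id = UUID.randomUUID().toString();
        CompletableFuture<String> response = new CompletableFuture<>();
        pending.put(id, response);
        send(id, method, paramsJson);   // handed to the transport; may never actually be delivered
        try {
            return response.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            pending.remove(id);
            throw new TimeoutException("no response for " + method + " (id=" + id + ")");
        }
    }

    // Called by the transport layer when a response frame with a matching id arrives.
    public void onResponse(String id, String resultJson) {
        CompletableFuture<String> f = pending.remove(id);
        if (f != null) {
            f.complete(resultJson);
        }
    }

    private void send(String id, String method, String paramsJson) {
        // placeholder: serialize a JSON-RPC request and hand it to the broker/transport
    }
}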

Comment 13 Michal Skrivanek 2016-12-14 13:27:09 UTC
I didn't have a chance to analyze the logs much, but if host monitoring stops being scheduled I would suspect a problem similar to one that used to exist in VM monitoring: when an exception occurs and is not handled properly, it propagates all the way up in the monitoring thread, which then does not schedule a new run in the next 15s. Once that happens once, no one else schedules it and the monitoring dies.
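
To illustrate the suspected failure mode, a minimal generic sketch (made-up names, not the actual HostMonitoring code): a task that schedules its own next run dies permanently if an exception escapes before the re-scheduling call, while re-scheduling from a finally block keeps it alive.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SelfReschedulingMonitorSketch implements Runnable {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void start() {
        scheduleNextRun();
    }

    private void scheduleNextRun() {
        scheduler.schedule(this, 15, TimeUnit.SECONDS);
    }

    @Override
    public void run() {
        // Fragile variant: if refresh() throws and nothing catches it here,
        // scheduleNextRun() is never reached and monitoring silently stops.
        // Robust variant: always re-schedule, whatever refresh() does.
        try {
            refresh();
        } catch (Exception e) {
            System.err.println("refresh failed, keeping monitor alive: " + e);
        } finally {
            scheduleNextRun();
        }
    }

    private void refresh() {
        // placeholder for the periodic host refresh work (hypothetical)
    }
}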
Ravi?

Comment 14 Ravi Nori 2016-12-14 17:27:28 UTC
This is what I see in the debug logs 

there is a last call to GetCapabilities at 2016-12-14 11:48:27 

After that, HostMonitoring refreshVdsStats is invoked every 15 seconds, so HostMonitoring has not died.

From the code I see that we do not call GetCapabilities from HostMonitoring if the host is in status UP, PreparingForMaintenance, Error or NonOperational. So we do not have an issue in HostMonitoring.
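
A hypothetical sketch of that status gate (illustrative names only, not the real engine classes): with UP, PreparingForMaintenance, Error and NonOperational in the skip set, a NonResponsive host should still get capability refresh attempts every cycle, which makes the silence in the engine log surprising.

import java.util.EnumSet;
import java.util.Set;

public class CapabilitiesGateSketch {
    enum HostStatus { UP, PREPARING_FOR_MAINTENANCE, ERROR, NON_OPERATIONAL, NON_RESPONSIVE, MAINTENANCE }

    // Statuses for which the capabilities refresh is skipped, per comment 14.
    private static final Set<HostStatus> SKIP_CAPABILITIES_REFRESH = EnumSet.of(
            HostStatus.UP, HostStatus.PREPARING_FOR_MAINTENANCE,
            HostStatus.ERROR, HostStatus.NON_OPERATIONAL);

    static boolean shouldRefreshCapabilities(HostStatus status) {
        return !SKIP_CAPABILITIES_REFRESH.contains(status);
    }

    public static void main(String[] args) {
        // A NON_RESPONSIVE host is not in the skip set, so by this logic it
        // should still get GetCapabilities-style refresh attempts each cycle.
        System.out.println(shouldRefreshCapabilities(HostStatus.NON_RESPONSIVE)); // true
    }
}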

Comment 15 Oved Ourfali 2016-12-14 17:54:04 UTC
Michal, perhaps someone in your team can take a deeper look at the logs?

Comment 16 Oved Ourfali 2016-12-14 18:36:13 UTC
Martin, as this seems related to the HA maintenance mode it can be great if you take a look as well.

Comment 17 Michal Skrivanek 2016-12-16 15:04:25 UTC
the debug log from Dec 14th is truncated too much, it doesn't contain the time when host h2 stopped connecting. If this is reproducible please capture that time as well.

If not, my guess is that it could be caused by an uncaught exception in host monitoring.
Looking at the original log from the 13th, I can see a problem at 2016-12-13 14:07:19,214+10 that looks like some quartz/infra problem scheduling the host monitoring.

Comment 18 Martin Sivák 2016-12-16 15:49:09 UTC
Yeah, I do not see anything related to HA maintenance in the vdsm log either.

Comment 19 Germano Veit Michel 2016-12-16 21:44:24 UTC
(In reply to Michal Skrivanek from comment #17)
> the debug log from Dec 14th is truncated too much, it doesn't contain the
> time when host h2 stopped connecting. If this is reproducible please capture
> that time as well.

There is no such thing as "stopped connecting". As I stated in the bug description, this is from a cold start of both host and manager, and it is 100% reproducible with no immediate previous problems.

1. All connected
2. Shutdown manager
3. Shutdown both hosts
4. Start host 1
5. Start manager (connected fine to host 1)
6. Wait...
7. Start host 2 (BUG!)

There is no "stopped connecting" in the sequence above. The problem is failing to make the first connection.

I was thinking this might be reproducible in other ways too. I'll try on Monday to see if I can hit this without restarting anything, just by cutting the network between host and manager for a few minutes, without any shutdowns. I actually hope it doesn't reproduce, because it would be a much worse problem.

Comment 20 Germano Veit Michel 2016-12-16 22:00:02 UTC
(In reply to Martin Sivák from comment #18)
> Yeah, I do not see anything related to HA maintenance in the vdsm log either.

Hmm... Are you talking about the SetHaMaintenanceModeVDSCommand that should have arrived from the engine? Yes, it never arrived because the host was not connected in the first place (look at tcpdump and comment #8).

Comment 21 Piotr Kliczewski 2016-12-19 08:51:18 UTC
@Martin The host was down when SetHaMaintenanceModeVDSCommand attempted to talk to vdsm.

Comment 22 Oved Ourfali 2016-12-21 09:31:41 UTC
Meital - can you try and reproduce it in your lab?

Comment 23 meital avital 2017-01-09 08:27:38 UTC
Israel - please try to reproduce it

Comment 24 Israel Pinto 2017-01-09 09:22:38 UTC
Steps to Reproduce:
1. Shut down the Manager, leaving Host X NOT in Maintenance Mode
2. Shut down Host X
3. Start Manager, wait for it to come up and set Host X as Non Responsive
4. Start Host X

Actual results:
Host became Up after starting it.

engine:
oVirt Engine Version: 4.2.0-0.0.master.20170104114928.git5490b36.el7.centos
host: 
OS Version:RHEL - 7.3 - 7.el7
OS Description:Red Hat Enterprise Linux Server 7.3 (Maipo)
Kernel Version:3.10.0 - 514.el7.bug1404060_20.x86_64
KVM Version:2.6.0 - 28.el7_3.2.bz1406126dgilbertg
LIBVIRT Version:libvirt-2.0.0-10.el7_3.2
VDSM Version: vdsm-4.20.0-154.git21b2663.el7.centos

Comment 25 Piotr Kliczewski 2017-01-09 10:18:43 UTC
Israel,

As far as I understand the test that Germano did, we need to wait ~5 mins in NonResponsive. From the logs I saw that monitoring stopped calling the hosts.
Can you please try with this delay?

Comment 26 Israel Pinto 2017-01-09 10:53:48 UTC
(In reply to Piotr Kliczewski from comment #25)
> Israel,
> 
> As far as I understand the test that Germano did, we need to wait ~5 mins in
> NonResponsive. From the logs I saw that monitoring stopped calling the hosts.
> Can you please try with this delay?

How much time should it stay in NonResponsive?

Comment 27 Piotr Kliczewski 2017-01-09 11:29:24 UTC
Let's try with 5 - 10 mins.

Comment 28 Israel Pinto 2017-01-09 11:53:40 UTC
(In reply to Piotr Kliczewski from comment #27)
> Let's try with 5 - 10 mins.

Hi Piotr, 
Rechecked it with a delay of 6 mins (host in NonResponsive for about 6 min).
Then started the host; the host became active after it came up.

Comment 29 Piotr Kliczewski 2017-01-09 12:23:47 UTC
Germano,

Based on comment #28 can you please check latest 4.1 or 4.2? According to Israel the problem is not there.

Comment 30 Germano Veit Michel 2017-01-11 05:41:23 UTC
(In reply to Piotr Kliczewski from comment #29)
> Germano,
> 
> Based on comment #28 can you please check latest 4.1 or 4.2? According to
> Israel the problem is not there.

Sure. 

Doing it with master... Might not finish today though. Will report back soon.

Comment 31 Germano Veit Michel 2017-01-12 03:25:02 UTC
Good and bad news. Mixed results.

Versions:
ovirt-engine-4.2.0-0.0.master.20170110224030.gitcbcc146.el7.centos.noarch
vdsm-4.20.0-176.gitd845c5f.el7.centos.x86_64

It seems that the bug I initially described is fixed. The engine does not seem to give up anymore, and the kind of behavior we had in comment #8 is gone. I see the engine trying to connect to the host all the time. But the host still doesn't activate.

On engine side I get lots of these:
2017-01-12 13:14:49,344+10 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler5) [] Command 'GetAllVmStatsVDSCommand(HostName = Host1, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='7d1a4ecc-8591-4e43-914c-225bbf521164'})' execution failed: VDSGenericException: VDSNetworkException: Unrecognized message received 

On the vdsm side, lots of these:
2017-01-12 13:14:29,373 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::ffff:10.64.24.183', 34896, 0, 0) at 0x24d1a28>: unexpected eof (betterAsyncore:119)

This seems completely unrelated to the initial BZ. If you want me to open a different BZ for it then that's OK. Not sure if it's worth it since it's a snapshot of the master branch. Anyway, the initial problem does indeed seem to be fixed.

I'll try to run the same test with latest 4.1 later today.

Comment 32 Germano Veit Michel 2017-01-12 07:07:49 UTC
Seems all good on:
ovirt-engine-4.1.0-0.4.master.20170111194713.gitb7b4769.el7.centos.noarch.

Any idea on what might have fixed it?

Comment 33 Piotr Kliczewski 2017-01-12 08:04:34 UTC
Germano,

We have a regression on master. It is going to be fixed today.

Comment 34 Oved Ourfali 2017-01-16 12:22:34 UTC
I don't think anything specific fixed it.
The regression was only on master.
Closing as worksforme, but please re-open if it happens again.

Comment 35 Nicolas Ecarnot 2019-03-19 08:28:21 UTC
Hello,

On 4.3.1.1-1.el7, I seem to hit the same issue: one host gets stuck in non-responsive status, whatever I try:
- restart vdsmd, mom, sanlock
- reboot host

Indeed, the host is fully pingable from the engine (not HE).

The only way out is to restart the engine.

Should I open another bug or could this one be reused?

Comment 36 Piotr Kliczewski 2019-03-19 08:44:47 UTC
Nicolas, it would be great to open a new BZ and attach all the logs.

Comment 37 Michael Watters 2019-06-10 17:42:20 UTC
Just ran into this issue today after attempting to put three of our hosts into maintenance mode.  Engine log is attached.

Comment 38 Michael Watters 2019-06-10 17:42:54 UTC
Created attachment 1579135 [details]
ovirt engine log

Comment 47 Martin Perina 2020-11-13 08:28:35 UTC
Moving this bug to closed/worksforme, investigation of the new issue will continue in BZ1897114

