Description of problem:
If I start the Manager before starting one of the Hosts, and the Host is not already in Maintenance mode, it gets stuck in Non Responsive state forever until I restart the engine. I can't even switch it to Maintenance mode.

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.0-0.0.master.20161201071307.gita5ff876.el7.centos.noarch
vdsm-4.18.999-1020.git1ff41b1.el7.centos.x86_64
CentOS 7.2

How reproducible:
100%

Steps to Reproduce:
1. Shut down the Manager, leaving Host X NOT in Maintenance mode
2. Shut down Host X
3. Start the Manager, wait for it to come up and set Host X as Non Responsive
4. Start Host X

Actual results:
Host stays Non Responsive and cannot be switched to Maintenance mode

Expected results:
Host connects

Additional info:
1. Engine comes up and the Host is powered off (Host goes Non Responsive: OK, expected)
2. Power up the Host
3. Host stays Non Responsive; the network connection is fine and the vdsm port is open, but the engine keeps timing out:

2016-12-13 02:32:35,244Z ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler4) [] Command 'GetAllVmStatsVDSCommand(HostName = h2, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='14b8697c-4955-4d4c-ace0-cb6f7699945a'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues

4. Trying to switch the Host to Maintenance also fails:

2016-12-13 02:33:12,681Z INFO  [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (default task-20) [c06a8970-601e-48af-aea2-67fc957f984d] Running command: MaintenanceNumberOfVdssCommand internal: false. Entities affected : ID: 14b8697c-4955-4d4c-ace0-cb6f7699945a Type: VDSAction group MANIPULATE_HOST with role type ADMIN
2016-12-13 02:33:12,944Z INFO  [org.ovirt.engine.core.bll.MaintenanceVdsCommand] (default task-20) [c06a8970-601e-48af-aea2-67fc957f984d] Running command: MaintenanceVdsCommand internal: true. Entities affected : ID: 14b8697c-4955-4d4c-ace0-cb6f7699945a Type: VDS
2016-12-13 02:33:13,043Z INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SetHaMaintenanceModeVDSCommand] (default task-20) [c06a8970-601e-48af-aea2-67fc957f984d] START, SetHaMaintenanceModeVDSCommand(HostName = h2, SetHaMaintenanceModeVDSCommandParameters:{runAsync='true', hostId='14b8697c-4955-4d4c-ace0-cb6f7699945a'}), log id: 4c4a223a
2016-12-13 02:36:10,616Z ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SetHaMaintenanceModeVDSCommand] (default task-20) [c06a8970-601e-48af-aea2-67fc957f984d] Command 'SetHaMaintenanceModeVDSCommand(HostName = h2, SetHaMaintenanceModeVDSCommandParameters:{runAsync='true', hostId='14b8697c-4955-4d4c-ace0-cb6f7699945a'})' execution failed: VDSGenericException: VDSNetworkException: Vds timeout occured

-> Host is running fine, vdsm is listening on 54321
-> Host and manager can ping each other

5. Just restart ovirt-engine, and the Host simply goes Up on the first connection retry; nothing else changed.
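A quick way to rule out plain network problems, independently of the engine, is a bare TCP probe from the manager to the vdsm port. The sketch below is only illustrative (the host name is the one from the logs in this report and port 54321 is the vdsm port mentioned above; the script itself is not part of oVirt):

#!/usr/bin/env python3
# Minimal reachability probe: can the manager open a TCP connection to vdsm?
# Illustrative only; host name and port are taken from this bug report.
import socket
import sys

HOST = "h2.rhv41.rhev"   # host FQDN as seen in the logs (adjust as needed)
PORT = 54321             # vdsm's jsonrpc/stomp port

def probe(host, port, timeout=5.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as err:
        print("connect failed: %s" % err, file=sys.stderr)
        return False

if __name__ == "__main__":
    ok = probe(HOST, PORT)
    print("vdsm port %s on %s is %s" % (PORT, HOST, "reachable" if ok else "NOT reachable"))
    sys.exit(0 if ok else 1)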
Ok, so here is what I did:

1. Shut everything down
2. Truncated vdsm.log and engine.log
3. Followed the reproduction steps
   * We need to give a few minutes between starting the engine and the host, otherwise it works.
4. Engine started at 2016-12-13 13:47:56
5. Engine fails to connect to h2 at 2016-12-13 13:48:35 (it's powered off)
6. Powered the host on at:

[root@h2 ~]# date
Tue Dec 13 13:56:43 AEST 2016
[root@h2 ~]# lsof | grep vdsm | grep LISTEN
vdsm  1546  vdsm  23u  IPv6  28998  0t0  TCP *:54321 (LISTEN)

   vdsm is already listening...

7. At this point engine.log looks very quiet. Is it even trying to connect to h2?
8. At 14:05 the engine is still failing to connect to host h2. That's almost 10 minutes AFTER the host (and vdsm) was up.

2016-12-13 14:05:56,273+10 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler9) [] Command 'GetAllVmStatsVDSCommand(HostName = h2, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='14b8697c-4955-4d4c-ace0-cb6f7699945a'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2016-12-13 14:05:56,274+10 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher] (DefaultQuartzScheduler9) [] Failed to fetch vms info for host 'h2' - skipping VMs monitoring.

9. There is no communication issue; the port is open and the network is fine.
10. I just restart ovirt-engine:

[root@m ~]# date
Tue Dec 13 14:07:12 AEST 2016
[root@m ~]# systemctl restart ovirt-engine

2016-12-13 14:08:26,500+10 INFO  [org.ovirt.engine.core.bll.Backend] (ServerService Thread Pool -- 63) [] Running ovirt-engine 4.1.0-0.0.master.20161201071307.gita5ff876.el7.centos

And the Host is Up on the first connect!

2016-12-13 14:08:41,124+10 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler8) [604eab5c] Correlation ID: 5f42e05f, Call Stack: null, Custom Event ID: -1, Message: Status of host h2 was set to Up.
Looking at the logs I can see that the engine was not able to connect to the h2 host since it started. I can see plenty of host-monitoring attempts (every 3 seconds):

2016-12-13 13:48:50,773+10 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to h2.rhv41.rhev/192.168.41.20
2016-12-13 13:48:53,847+10 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler5) [a7d6ca6] Command 'GetCapabilitiesVDSCommand(HostName = h2, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='14b8697c-4955-4d4c-ace0-cb6f7699945a', vds='Host[h2,14b8697c-4955-4d4c-ace0-cb6f7699945a]'})' execution failed: org.ovirt.vdsm.jsonrpc.client.ClientConnectionException: Connection failed
2016-12-13 13:48:53,919+10 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler5) [a7d6ca6] Failure to refresh host 'h2' runtime info: org.ovirt.vdsm.jsonrpc.client.ClientConnectionException: Connection failed

until:

2016-12-13 13:51:33,834+10 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to h2.rhv41.rhev/192.168.41.20
2016-12-13 13:51:35,432+10 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (DefaultQuartzScheduler7) [54a13ade] FINISH, ConnectStoragePoolVDSCommand, log id: 48225d39
2016-12-13 13:51:35,433+10 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler7) [54a13ade] IrsBroker::Failed::GetStoragePoolInfoVDS: IRSGenericException: IRSErrorException: IRSNonOperationalException: Could not connect host to Data Center(Storage issue)
2016-12-13 13:51:35,435+10 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler6) [] Command 'GetAllVmStatsVDSCommand(HostName = h2, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='14b8697c-4955-4d4c-ace0-cb6f7699945a'})' execution failed: VDSGenericException: VDSNetworkException: Vds timeout occured
2016-12-13 13:51:35,435+10 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher] (DefaultQuartzScheduler6) [] Failed to fetch vms info for host 'h2' - skipping VMs monitoring.

After this there is no attempt to connect to the host again. The vdsm log confirms this: vdsm was started at 2016-12-13 13:56:07,705 and the only connection attempt it saw was:

2016-12-13 14:08:32,049 INFO  (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::ffff:192.168.41.30:43218 (protocoldetector:72)
2016-12-13 14:08:32,273 INFO  (Reactor thread) [ProtocolDetector.Detector] Detected protocol stomp from ::ffff:192.168.41.30:43218 (protocoldetector:127)

Based on the logs we know that the engine did not attempt to connect after 2016-12-13 13:51:35,435+10.

@Michal - why did we attempt to monitor so often, and why did we stop after "Failed to fetch vms info for host 'h2' - skipping VMs monitoring."?

@Germano - I do not see any attempt to manually activate the host.
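To see at a glance when the engine last tried to reach the host, it is enough to pull the final ReactorClient "Connecting to ..." line out of engine.log. A throwaway sketch along those lines (the log path is the usual default and the host name is an assumption; adjust both as needed):

#!/usr/bin/env python3
# Print the last time the engine's ReactorClient tried to connect to a given host.
# Illustrative helper only; the log path and host name below are assumptions.
import re
import sys

LOG = "/var/log/ovirt-engine/engine.log"
HOST = "h2"

pattern = re.compile(r"^(?P<ts>\S+ \S+).*ReactorClient.*Connecting to .*%s" % re.escape(HOST))

last = None
with open(LOG) as f:
    for line in f:
        m = pattern.match(line)
        if m:
            last = m.group("ts")

if last:
    print("last connection attempt to %s at %s" % (HOST, last))
else:
    print("no connection attempts to %s found in %s" % (HOST, LOG))
    sys.exit(1)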
(In reply to Piotr Kliczewski from comment #4)
> @Germano - I do not see any attempt to manually activate the host.

Hi Piotr,

Indeed. This time I was more focused on giving you clean logs and forgot to try to switch to Maintenance and activate. But from what I remember it just times out and won't switch to Maintenance (i.e. look at SetHaMaintenanceMode timing out in comment #0).

I can prepare new logs tomorrow if you think it's necessary, but the problem already seems quite clear from your previous comment. Let me know if you want another round of logs.

Cheers,
Germano,

Please provide the engine log with manual activation. It would be great to have it with the debug log level set.

Thanks,
And it would be great to get a wireshark capture as well.
Here we go again.

-> Engine debug enabled
-> Logs truncated
-> tcpdump running on the manager since engine start
-> All shut down.

2016-12-14 11:42:18 -> Engine goes Up
2016-12-14 11:42:28 -> Fails to talk to H2 for the first time (still off)
... GetCapabilitiesVDSCommand every few seconds
2016-12-14 11:48:27 -> Last GetCapabilities, why (?)
... GetAllVmStatsVDSCommand continues running every few minutes
... No signs of GetCapabilities|HostMonitoring !!!
    Here is the last time the engine tries to resolve H2's MAC via ARP, so apparently it's the last time it tried to connect
... wireshark t=370

2016-12-14 12:00:00 -> H2 is powered on
... wireshark t=1125
2016-12-14 12:01:24 -> H2 vdsm is starting
... wait
2016-12-14 12:06:28 -> GetAllVmStatsVDSCommand still failing; we are not even connected...
... Host still Non Responsive
... no attempts to resolve ARP or send a TCP SYN

2016-12-14 12:08:13 -> Try to switch H2 to Maintenance
2016-12-14 12:10:04 -> Fails to switch to Maintenance (timeout)

2016-12-14 12:14:22 -> Restart ovirt-engine
... FINALLY tries to resolve H2's MAC and sends a TCP SYN, which connects
... wireshark t=1926
2016-12-14 12:14:44 -> H2 is Up

It looks like the engine indeed gave up at 11:48:27 and did not try to connect to the host anymore.

Funny thing is, apparently I can't reproduce this if the Hosted Engine is in global maintenance mode. But that might have been bad luck.
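For completeness, a capture that shows both the ARP resolution and the vdsm traffic can be taken on the manager with a filter on port 54321 plus ARP. A minimal wrapper sketch, assuming the interface and output path below (only the port number comes from this report):

#!/usr/bin/env python3
# Capture engine <-> vdsm traffic plus ARP on the manager, for later analysis in wireshark.
# Illustrative wrapper only; interface and output path are assumptions, 54321 is the vdsm port.
import subprocess

IFACE = "any"
OUTFILE = "/tmp/engine-vdsm.pcap"
# vdsm jsonrpc/stomp port plus ARP, so both the TCP SYNs and the MAC resolution are visible
BPF_FILTER = "port 54321 or arp"

cmd = ["tcpdump", "-i", IFACE, "-w", OUTFILE, BPF_FILTER]
print("running:", " ".join(cmd))
subprocess.call(cmd)   # runs until interrupted with Ctrl-C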
Based on the info provided I see that we have 2 issues:

- host monitoring stopped its connection attempts (hopefully Michal can share some thoughts on why)
- moving the host to Maintenance times out because the following call never reaches vdsm:

2016-12-14 12:08:33,234+10 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.StompCommonClient] (default task-10) [cc040a56-155c-4f8a-8d8b-55379f3a5a5a] Message sent: SEND
destination:jms.topic.vdsm_requests
content-length:139
reply-to:jms.topic.vdsm_responses

<JsonRpcRequest id: "24569c64-67a0-451d-a9dd-4380bcb60b59", method: Host.setHaMaintenanceMode, params: {mode=LOCAL, enabled=true}>

This call times out and fails SetHaMaintenanceModeVDSCommand.
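For anyone less familiar with the engine/vdsm transport: the message above is an ordinary JSON-RPC request wrapped in a STOMP SEND frame. A rough, illustrative reconstruction of what goes on the wire (the destinations, method name and params are copied from the logged frame; the builder itself is not the engine's actual client code):

#!/usr/bin/env python3
# Build the kind of STOMP SEND frame the engine uses to deliver a JSON-RPC request
# to vdsm. Illustrative only -- destinations and method name are copied from the
# logged frame above; this is not the engine's actual client code.
import json
import uuid

def stomp_send_frame(body, destination, reply_to):
    headers = [
        "SEND",
        "destination:%s" % destination,
        "reply-to:%s" % reply_to,
        "content-length:%d" % len(body),
    ]
    # A STOMP frame is: command, headers, blank line, body, then a NUL byte.
    return "\n".join(headers) + "\n\n" + body + "\x00"

request = json.dumps({
    "jsonrpc": "2.0",
    "id": str(uuid.uuid4()),
    "method": "Host.setHaMaintenanceMode",
    "params": {"mode": "LOCAL", "enabled": True},
})

frame = stomp_send_frame(request,
                         destination="jms.topic.vdsm_requests",
                         reply_to="jms.topic.vdsm_responses")
print(frame)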
I didn't have a chance to analyze the logs much, but if the host monitoring stops being scheduled I would suspect a problem similar to the one that used to exist in VM monitoring - when an exception occurs and is not caught properly, it propagates all the way up in the monitoring thread and the thread does not schedule a new run in the next 15s. Once it fails to do that once, no one else does it and the monitoring dies.

Ravi?
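The failure mode described here is the usual hazard of self-rescheduling monitors: each run is responsible for arming the next one, so an exception that escapes before the re-arm silently kills the whole chain. A minimal, hypothetical sketch of the pattern and its fix (not the engine's actual Java code):

#!/usr/bin/env python3
# Minimal illustration of a self-rescheduling monitor and how one escaped exception
# can kill it permanently. Hypothetical sketch, not the engine's actual code.
import threading
import traceback

INTERVAL = 15  # seconds, matching the monitoring interval mentioned above

def refresh_host():
    """Placeholder for the real monitoring work (e.g. a capabilities refresh)."""
    raise RuntimeError("transient failure while refreshing the host")

def fragile_monitor():
    # BUG: if refresh_host() raises, we never reach the re-scheduling line below,
    # so no further runs are ever scheduled and monitoring silently dies.
    refresh_host()
    threading.Timer(INTERVAL, fragile_monitor).start()

def robust_monitor():
    try:
        refresh_host()
    except Exception:
        # Swallow and log; a single bad cycle must not stop future cycles.
        traceback.print_exc()
    finally:
        threading.Timer(INTERVAL, robust_monitor).start()

# Kicking off the fragile variant on a timer thread, e.g.
#   threading.Timer(0, fragile_monitor).start()
# runs it exactly once: the exception escapes on the timer thread and no further
# run is ever scheduled. robust_monitor() keeps re-arming itself every INTERVAL seconds.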
This is what I see in the debug logs: there is a last call to GetCapabilities at 2016-12-14 11:48:27. After that, HostMonitoring refreshVdsStats is still being invoked every 15 seconds, so HostMonitoring has not died.

From the code I see that we do not call GetCapabilities from HostMonitoring if the host is in status UP, PreparingForMaintenance, Error or NonOperational. So we do not have an issue in HostMonitoring.
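In other words, the refresh cycle itself keeps firing every 15 seconds; only the capabilities refresh is gated on host status. A simplified, hypothetical paraphrase of that gating (the status names come from the comment above; everything else is invented for illustration):

#!/usr/bin/env python3
# Simplified paraphrase of the status gating described above -- not the engine's
# actual HostMonitoring code. Status names come from the comment; the rest is invented.
from enum import Enum

class VdsStatus(Enum):
    UP = "Up"
    PREPARING_FOR_MAINTENANCE = "PreparingForMaintenance"
    ERROR = "Error"
    NON_OPERATIONAL = "NonOperational"
    NON_RESPONSIVE = "NonResponsive"

# Statuses for which the monitoring cycle skips the GetCapabilities refresh
SKIP_CAPABILITIES_REFRESH = {
    VdsStatus.UP,
    VdsStatus.PREPARING_FOR_MAINTENANCE,
    VdsStatus.ERROR,
    VdsStatus.NON_OPERATIONAL,
}

def monitoring_cycle(status, refresh_capabilities, refresh_stats):
    # Stats are refreshed on every 15s cycle regardless of status...
    refresh_stats()
    # ...but capabilities are only refreshed for statuses outside the skip set.
    # A NonResponsive host is not in the skip set, so it would still be probed.
    if status not in SKIP_CAPABILITIES_REFRESH:
        refresh_capabilities()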
Michal, perhaps someone in your team can take a deeper look at the logs?
Martin, as this seems related to the HA maintenance mode, it would be great if you could take a look as well.
The debug log from Dec 14th is truncated too much; it doesn't contain the time when host h2 stopped connecting. If this is reproducible, please capture that time as well. If not, I can guess that it could be caused by an uncaught exception in host monitoring.

Looking at the original log from the 13th, I can see a problem at 2016-12-13 14:07:19,214+10 that looks like some quartz/infra problem scheduling the host monitoring.
Yeah, I do not see anything related to HA maintenance in the vdsm log either.
(In reply to Michal Skrivanek from comment #17)
> The debug log from Dec 14th is truncated too much; it doesn't contain the
> time when host h2 stopped connecting. If this is reproducible, please
> capture that time as well.

There is no such thing as "stopped connecting". As I stated in the bug description, this is from a cold start on both the host and the manager, and it is 100% reproducible with no immediate previous problems:

1. All connected
2. Shut down the manager
3. Shut down both hosts
4. Start host 1
5. Start the manager (it connects fine to host 1)
6. Wait...
7. Start host 2 (BUG!)

There is no "stopped connecting" in the sequence above. The problem is failing to connect for the first time.

I was thinking this might be reproducible in other ways too. On Monday I'll try to hit this without restarting anything, just by cutting the network between the host and the manager for a few minutes, without any shutdowns. I actually hope it doesn't reproduce, because that would be a much worse problem.
(In reply to Martin Sivák from comment #18)
> Yeah, I do not see anything related to HA maintenance in the vdsm log either.

Hmm... Are you talking about the SetHaMaintenanceModeVDSCommand that should have arrived from the engine? Yes, it never arrived, because the host was not connected in the first place (look at the tcpdump and comment #8).
@Martin
The host was down when SetHaMaintenanceModeVDSCommand attempted to talk to vdsm.
Meital - can you try and reproduce it in your lab?
Israel - please try to reproduce it
Steps to Reproduce:
1. Shut down the Manager, leaving Host X NOT in Maintenance mode
2. Shut down Host X
3. Start the Manager, wait for it to come up and set Host X as Non Responsive
4. Start Host X

Actual results:
Host becomes Up after starting it.

engine:
oVirt Engine Version: 4.2.0-0.0.master.20170104114928.git5490b36.el7.centos

host:
OS Version: RHEL - 7.3 - 7.el7
OS Description: Red Hat Enterprise Linux Server 7.3 (Maipo)
Kernel Version: 3.10.0 - 514.el7.bug1404060_20.x86_64
KVM Version: 2.6.0 - 28.el7_3.2.bz1406126dgilbertg
LIBVIRT Version: libvirt-2.0.0-10.el7_3.2
VDSM Version: vdsm-4.20.0-154.git21b2663.el7.centos
Israel,

As far as I understand the test that Germano did, we need to wait ~5 minutes in Non Responsive. From the logs I saw that the monitoring stopped calling the hosts. Can you please retry with this delay?
(In reply to Piotr Kliczewski from comment #25)
> As far as I understand the test that Germano did, we need to wait ~5 minutes
> in Non Responsive. From the logs I saw that the monitoring stopped calling
> the hosts. Can you please retry with this delay?

How much time should it stay in Non Responsive?
Let's try with 5 - 10 mins.
(In reply to Piotr Kliczewski from comment #27)
> Let's try with 5 - 10 mins.

Hi Piotr,

Rechecked with a delay of 6 minutes (the host was in Non Responsive for about 6 minutes), then started the host. The host became Active after it was up.
Germano,

Based on comment #28, can you please check the latest 4.1 or 4.2? According to Israel the problem is not there.
(In reply to Piotr Kliczewski from comment #29)
> Germano,
>
> Based on comment #28, can you please check the latest 4.1 or 4.2? According
> to Israel the problem is not there.

Sure. Doing it with master... Might not finish today though. Will report back soon.
Good and bad news. Mixed results.

Versions:
ovirt-engine-4.2.0-0.0.master.20170110224030.gitcbcc146.el7.centos.noarch
vdsm-4.20.0-176.gitd845c5f.el7.centos.x86_64

It seems that the bug I initially described is fixed. The engine no longer seems to give up, and the kind of behavior we had in comment #8 is gone. I see the engine trying to connect to the host all the time. But the host still doesn't activate.

On the engine side I get lots of these:

2017-01-12 13:14:49,344+10 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler5) [] Command 'GetAllVmStatsVDSCommand(HostName = Host1, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='7d1a4ecc-8591-4e43-914c-225bbf521164'})' execution failed: VDSGenericException: VDSNetworkException: Unrecognized message received

On the vdsm side, lots of these:

2017-01-12 13:14:29,373 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::ffff:10.64.24.183', 34896, 0, 0) at 0x24d1a28>: unexpected eof (betterAsyncore:119)

This seems completely unrelated to the initial BZ. If you want me to open a different BZ for it, that's fine. Not sure if it's worth it, since this is a snapshot of the master branch.

Anyway, the initial problem seems to be indeed fixed. I'll try to run the same test with the latest 4.1 later today.
Seems all good on ovirt-engine-4.1.0-0.4.master.20170111194713.gitb7b4769.el7.centos.noarch.

Any idea what might have fixed it?
Germano,

We have a regression on master. It is going to be fixed today.
I don't think anything specific fixed it; the regression was only on master. Closing as worksforme, but please re-open if it happens again.
Hello,

On 4.3.1.1-1.el7, I seem to hit the same issue: one host gets stuck in Non Responsive status, whatever I try:
- restart vdsmd, mom, sanlock
- reboot the host

Also, the host is fully pingable from the engine (not HE). The only way out is to restart the engine.

Should I open another bug or could this one be reused?
Nicolas,

It would be great to open a new BZ and attach all the logs.
Just ran into this issue today after attempting to put three of our hosts into maintenance mode. Engine log is attached.
Created attachment 1579135 [details]
ovirt engine log
Moving this bug to closed/worksforme; investigation of the new issue will continue in BZ1897114.