| Summary: | Engine does not start restore action if the host is not responding | ||
|---|---|---|---|
| Product: | [oVirt] ovirt-engine | Reporter: | Artyom <alukiano> |
| Component: | Backend.Core | Assignee: | Martin Perina <mperina> |
| Status: | CLOSED WORKSFORME | QA Contact: | Petr Matyáš <pmatyas> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | ||
| Version: | 4.1.0 | CC: | alukiano, bugs, mgoldboi, mperina, oourfali, pstehlik |
| Target Milestone: | ovirt-4.1.0-beta | Keywords: | Regression |
| Target Release: | --- | Flags: | oourfali: ovirt-4.1?, gklein: blocker?, gklein: planning_ack?, gklein: devel_ack?, gklein: testing_ack? |
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-01-05 15:51:23 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | |||
Can you please provide the complete log? Also, can you specify which cluster fencing policy options you are using?

Created attachment 1224755 [details]
engine log and fence policy screenshot

We issue a power management stop, and then status until the host is indeed down. In your case the returned status is ON, which means the host wasn't powered off. Are you sure the power management settings are correct and for the same host?

I just stopped the host network. Is the behavior for such an issue different from 4.0? I expect that:
1) Host moves to the connecting state
2) Host moves to the non-responding state
3) The engine runs soft fencing
4) The engine runs a power management restart command on the host

So the restart you mention in step 4 consists of the following (there are more steps, but these are the power management ones; a simplified sketch of this sequence is included after this exchange):
1. Power off.
2. Status until the host is indeed powered off.
3. Power on.
4. Status until the host is indeed powered on (I think we actually do status here, but Martin will know best).

It seems the log contains several hosts. From what I see:
host_mixed_1: works fine? I only see power management status happening twice.
host_mixed_2: I see we issued stop, status and start, and the host was powered on.
host_mixed_3: I see at the end of the log that we issued a STOP command. Is that the host you had issues with?

Can you please make it easier and try the following:
Use two hosts.
Disconnect one from the network.
Wait for a few minutes (as the fencing flow does take time).
Check the status, let's say after 10 minutes.
Report back here with the log of the test. That way we'll see less noise and we might be able to find your issue.

Created attachment 1224785 [details]
engine log (you can start to look from Sun Nov 27 11:36:25 IST 2016)
1) Environment has two hosts (host_mixed_2, host_mixed_3)
2) Stop the network on host_mixed_3
3) Wait 15 minutes
Host host_mixed_3 is in the non-responding state, and I can see via the power management remote viewer that no restart happened on the host.
Thanks.
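For readers following along, here is a minimal sketch of the stop / status-until-off / start / status-until-on sequence described above. It is an illustration only, built around a generic, hypothetical fence-agent interface; it is not the engine's actual StopVdsCommand/StartVdsCommand implementation, and the timeout and polling intervals are invented.

```java
// Illustrative sketch only -- not the engine's actual fencing code.
// FenceAgent is a hypothetical stand-in for whatever power-management
// agent (ipmilan, ilo4, drac, ...) is configured for the host.
import java.time.Duration;

public class FenceRestartSketch {

    /** Hypothetical abstraction over a power-management agent. */
    interface FenceAgent {
        void powerOff();
        void powerOn();
        /** Returns "on" or "off" as reported by the agent. */
        String status();
    }

    /** Restart = stop, wait until off, start, wait until on. */
    static void restart(FenceAgent agent) throws InterruptedException {
        agent.powerOff();
        waitForStatus(agent, "off", Duration.ofMinutes(5));
        agent.powerOn();
        waitForStatus(agent, "on", Duration.ofMinutes(5));
    }

    static void waitForStatus(FenceAgent agent, String expected, Duration timeout)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (System.nanoTime() < deadline) {
            if (expected.equalsIgnoreCase(agent.status())) {
                return;
            }
            Thread.sleep(10_000); // poll the agent every 10 seconds
        }
        throw new IllegalStateException("Host never reached power state: " + expected);
    }
}
```

The point of the intermediate status polling is exactly what is discussed above: if the agent keeps reporting ON after the stop, the restart flow cannot proceed to the power-on step.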
From the last lines I see:
2016-11-27 11:41:23,537+02 INFO [org.ovirt.engine.core.bll.pm.StopVdsCommand] (org.ovirt.thread.pool-6-thread-32) [5eabacc9] Power-Management: STOP of host 'host_mixed_3' initiated.
2016-11-27 11:41:23,595+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-32) [5eabacc9] Correlation ID: 5eabacc9, Call Stack: null, Custom Event ID: -1, Message: Power management stop of Host host_mixed_3 initiated.
2016-11-27 11:41:23,629+02 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (org.ovirt.thread.pool-6-thread-32) [5eabacc9] START, SetVdsStatusVDSCommand(HostName = host_mixed_3, SetVdsStatusVDSCommandParameters:{runAsync='true', hostId='fd66bfae-e845-408c-94e7-dbb4c354a3e5', status='Reboot', nonOperationalReason='NONE', stopSpmFailureLogged='false', maintenanceReason='null'}), log id: b98d4e1
2016-11-27 11:43:18,327+02 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler3) [] Check for available updates is skipped for host 'host_mixed_3' due to unsupported host status 'NonResponsive'
2016-11-27 11:43:33,994+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler10) [3e4f6531] Command 'GetAllVmStatsVDSCommand(HostName = host_mixed_3, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='fd66bfae-e845-408c-94e7-dbb4c354a3e5'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2016-11-27 11:43:33,997+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher] (DefaultQuartzScheduler10) [3e4f6531] Failed to fetch vms info for host 'host_mixed_3' - skipping VMs monitoring.
2016-11-27 11:46:49,079+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler8) [6f96573e] Command 'GetAllVmStatsVDSCommand(HostName = host_mixed_3, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='fd66bfae-e845-408c-94e7-dbb4c354a3e5'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2016-11-27 11:46:49,079+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher] (DefaultQuartzScheduler8) [6f96573e] Failed to fetch vms info for host 'host_mixed_3' - skipping VMs monitoring.
2016-11-27 11:50:04,154+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler10) [3e4f6531] Command 'GetAllVmStatsVDSCommand(HostName = host_mixed_3, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='fd66bfae-e845-408c-94e7-dbb4c354a3e5'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2016-11-27 11:50:04,155+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher] (DefaultQuartzScheduler10) [3e4f6531] Failed to fetch vms info for host 'host_mixed_3' - skipping VMs monitoring
So the power management stop was indeed initiated, but perhaps never returned?
Can you attach the host log of host_mixed_2 to check what's wrong with the power management operation?
Created attachment 1225166 [details]
engine and vdsm logs for both hosts (you can start looking from Mon Nov 28 09:17:05 IST 2016)
The same scenario as in comment 6:
1) Environment has two hosts (host_mixed_2, host_mixed_3)
2) Stop the network on host_mixed_3
3) Wait 15 minutes
Host host_mixed_3 is in the non-responding state, and I can see via the power management remote viewer that no restart happened on the host.

There's something really strange in all those logs. Is it possible to get complete full logs (boot/server/engine) or access to the system where the issue happened? For example:
1. At 2016-11-27 11:41:23,536+02 we executed StopVdsCommand
2. At 2016-11-27 11:41:23,629+02 we executed SetVdsStatus to Reboot
3. After that there is nothing about actually executing the PM stop, nor any other entries for the same correlation id - wasn't the engine killed? (A throwaway helper for this kind of correlation-id check is sketched after this exchange.)
The same weird behaviour appears for fencing around 2016-11-28 09:17: the expected log entries about the commands executed one by one from NonRespondingTreatment are simply missing from the log.

No, the engine was UP. I can provide you the environment; I believe that will be the best solution - ping me when you want to use it and I will provide the environment. I tried to reproduce the bug on the downstream build rhevm-4.1.0-0.3.beta2.el7.noarch and it looks like everything is back to normal, so you can close the bug.
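As an aside, the check described above (looking for further engine.log entries carrying the same correlation id after the initial StopVdsCommand line) can be automated with a few lines of Java. This is a throwaway sketch; the log path and the correlation id 5eabacc9 are just examples taken from this report.

```java
// Throwaway helper: print every engine.log line containing a given
// correlation id, to see whether a fence flow continued after it started.
// The default path and correlation id below are only examples.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class GrepCorrelationId {
    public static void main(String[] args) throws IOException {
        Path log = Path.of(args.length > 0 ? args[0] : "engine.log");
        String correlationId = args.length > 1 ? args[1] : "5eabacc9";
        try (Stream<String> lines = Files.lines(log)) {
            lines.filter(line -> line.contains(correlationId))
                 .forEach(System.out::println);
        }
    }
}
```

A complete restart flow would be expected to show additional entries after the stop was initiated; in the attached logs, as noted above, they are missing.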
Created attachment 1223689 [details]
engine.log

Description of problem:
Engine does not try to restart the host via power management when the host does not respond

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.0-0.0.master.20161123111311.git6f8ac51.el7.centos.noarch

How reproducible:
Always

Steps to Reproduce:
1. Configure power management on the host
2. Stop the network on the host
3.

Actual results:
Engine does not try to restart the host via power management

Expected results:
After soft fencing fails, the engine must try to restart the host via power management

Additional info:
I also do not see the engine trying to restore the connection after the network is restored (vdsm is up and works fine)
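To make the expected-results check easier to reproduce, one option is to poll the host status through the engine's REST API while the host's network is down. The sketch below uses only the JDK HTTP client; the engine URL, credentials and host name are placeholders, it assumes the engine's CA certificate is already trusted by the JVM, and it is meant only as a rough verification aid, not as part of the reported flow.

```java
// Rough verification aid: watch what the engine reports as the host status
// while the host's network is stopped. Engine URL, credentials and host name
// are placeholders; assumes the engine CA certificate is trusted by the JVM.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class WatchHostStatus {
    public static void main(String[] args) throws Exception {
        String engine = "https://engine.example.com/ovirt-engine/api";
        String auth = Base64.getEncoder()
                .encodeToString("admin@internal:password".getBytes());
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(engine + "/hosts?search=name%3Dhost_mixed_3"))
                .header("Authorization", "Basic " + auth)
                .header("Accept", "application/xml")
                .GET()
                .build();

        // Poll roughly every minute for 15 minutes. With a working fence flow
        // the <status> element should move from non_responsive back to up
        // after the engine restarts the host via power management.
        for (int i = 0; i < 15; i++) {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("poll " + i + ": HTTP " + response.statusCode());
            System.out.println(response.body());
            Thread.sleep(60_000);
        }
    }
}
```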