Bug 1398204

Summary: Engine does not start the restore action when a host is not responding
Product: [oVirt] ovirt-engine
Component: Backend.Core
Version: 4.1.0
Hardware: x86_64
OS: Linux
Status: CLOSED WORKSFORME
Severity: urgent
Priority: high
Keywords: Regression
Target Milestone: ovirt-4.1.0-beta
Target Release: ---
Reporter: Artyom <alukiano>
Assignee: Martin Perina <mperina>
QA Contact: Petr Matyáš <pmatyas>
CC: alukiano, bugs, mgoldboi, mperina, oourfali, pstehlik
Flags: oourfali: ovirt-4.1?, gklein: blocker?, gklein: planning_ack?, gklein: devel_ack?, gklein: testing_ack?
Doc Type: If docs needed, set a value
Story Points: ---
Last Closed: 2017-01-05 15:51:23 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: Infra
Cloudforms Team: ---
Attachments:
- engine.log
- engine log and fence policy screenshot
- engine log (you can start to look from Sun Nov 27 11:36:25 IST 2016)
- engine and vdsm logs for both hosts (you can start looking from Mon Nov 28 09:17:05 IST 2016)

Description Artyom 2016-11-24 09:50:31 UTC
Created attachment 1223689 [details]
engine.log

Description of problem:
The engine does not try to restart a host via power management when the host does not respond

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.0-0.0.master.20161123111311.git6f8ac51.el7.centos.noarch

How reproducible:
Always

Steps to Reproduce:
1. Configure power management on the host
2. Stop network on the host
3.

Actual results:
Engine does not try to restart the host via power management

Expected results:
After soft fencing fails, the engine must try to restart the host via power management

Additional info:
I also do not see the engine try to restore the connection after the network is restored (vdsm is up and works fine)
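For orientation, the escalation the reporter expects (reconnect, then soft fencing, then a power management restart) can be sketched roughly as follows. This is a hypothetical illustration only, not the actual oVirt engine code; the `Host` class and every method name in it are assumptions made for this sketch:

```python
# Hypothetical sketch of the non-responding host treatment, NOT the
# actual oVirt engine implementation. All names here are assumptions.

class Host:
    def __init__(self, reachable, ssh_ok, pm_configured):
        self.reachable = reachable          # engine can talk to vdsm
        self.ssh_ok = ssh_ok                # SSH soft fencing can succeed
        self.pm_configured = pm_configured  # power management is set up
        self.status = "Up"

    def reconnect(self):
        # First the engine simply retries communication with the host.
        return self.reachable

    def soft_fence(self):
        # Soft fencing restarts vdsmd over SSH; it needs the network up.
        return self.reachable and self.ssh_ok

    def pm_restart(self):
        # Hard fencing: restart the host through its fence agent.
        self.status = "Rebooting"


def handle_non_responsive(host):
    host.status = "Connecting"
    if host.reconnect():
        host.status = "Up"
        return "recovered"
    host.status = "NonResponsive"
    if host.soft_fence():
        host.status = "Up"
        return "soft-fenced"
    if host.pm_configured:
        host.pm_restart()   # the step the report says never happens
        return "hard-fenced"
    return "unfenced"
```

With the network down (`reachable=False`), both the reconnect and the soft fence fail, so a host with power management configured should end up hard-fenced; the report is that this last step never runs.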

Comment 1 Oved Ourfali 2016-11-26 07:53:02 UTC
Can you please provide the complete log? 
Also, can you specify which cluster fencing policy options you are using?

Comment 2 Artyom 2016-11-27 08:07:15 UTC
Created attachment 1224755 [details]
engine log and fence policy screenshot

Comment 3 Oved Ourfali 2016-11-27 08:26:14 UTC
We issue a power management stop, and then poll status until the host is indeed down.
In your case the return status is ON, which means the host wasn't powered off.
Are you sure the power management settings are correct and for the same host?

Comment 4 Artyom 2016-11-27 08:51:10 UTC
I just stopped the host network. Has the behavior for such an issue changed from 4.0?
I expect that:
1) The host moves to the Connecting state
2) The host moves to the Non-Responsive state
3) The engine runs soft fencing
4) The engine runs a power management restart command on the host

Comment 5 Oved Ourfali 2016-11-27 09:04:21 UTC
So the restart you mention in step 4 consists of the following (there are more steps, but these are the power management ones):
1. Power off.
2. Status until the host is indeed powered off
3. Power on.
4. Status until the host is indeed powered on (I think we actually do status here, but Martin will know best).
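The four-step sequence above can be sketched as follows. This is a hypothetical illustration, not the actual `StopVdsCommand`/`StartVdsCommand` code; the agent classes and method names are assumptions made for this sketch:

```python
# Hypothetical sketch of the restart sequence listed above
# (stop, status until off, start, status until on).

def wait_for_state(agent, wanted, retries=3):
    # Poll the agent's power status a bounded number of times.
    for _ in range(retries):
        if agent.status() == wanted:
            return True
    return False

def pm_restart(agent, retries=3):
    agent.stop()                                   # 1. power off
    if not wait_for_state(agent, "off", retries):  # 2. confirm it is off
        # This branch matches the observation in comment 3: status keeps
        # returning ON, so the restart never proceeds to power-on.
        return False
    agent.start()                                  # 3. power on
    return wait_for_state(agent, "on", retries)    # 4. confirm it is on


class FakeAgent:
    """Well-behaved agent: stop/start change the reported state."""
    def __init__(self):
        self.state = "on"
    def stop(self):
        self.state = "off"
    def start(self):
        self.state = "on"
    def status(self):
        return self.state


class StuckAgent(FakeAgent):
    """Models the reported failure: status stays ON after stop."""
    def stop(self):
        pass
```

Under these assumptions, `pm_restart(FakeAgent())` succeeds, while `pm_restart(StuckAgent())` bails out after the stop, which is the situation comment 3 describes.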

Seems like in the log you have several hosts.
From what I see:
host_mixed_1:
seems to work fine? I only see the power management status check happening twice.

host_mixed_2:
I see we issued stop, status, and start, and the host was powered on.

host_mixed_3:
At the end of the log I see that we issued a STOP command.
Is that the host you had issues with?

Can you please make it simpler and try the following:
1. Use two hosts.
2. Disconnect one from the network.
3. Wait a few minutes (the fencing flow does take time).
4. Check the status after, let's say, 10 minutes.
5. Report back here with the log of the test.

That way we'll see less noise and we might be able to find your issue.

Comment 6 Artyom 2016-11-27 09:53:56 UTC
Created attachment 1224785 [details]
engine log(you can start to look from Sun Nov 27 11:36:25 IST 2016)

1) Environment has two hosts (host_mixed_2, host_mixed_3)
2) Stop the network on host_mixed_3
3) Wait 15 minutes

Host host_mixed_3 is in the non-responding state, and I can see via the power management remote viewer that no restart happened on the host.

Comment 7 Oved Ourfali 2016-11-27 10:19:18 UTC
Thanks.
From the last lines I see:
2016-11-27 11:41:23,537+02 INFO  [org.ovirt.engine.core.bll.pm.StopVdsCommand] (org.ovirt.thread.pool-6-thread-32) [5eabacc9] Power-Management: STOP of host 'host_mixed_3' initiated.
2016-11-27 11:41:23,595+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-32) [5eabacc9] Correlation ID: 5eabacc9, Call Stack: null, Custom Event ID: -1, Message: Power management stop of Host host_mixed_3 initiated.
2016-11-27 11:41:23,629+02 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (org.ovirt.thread.pool-6-thread-32) [5eabacc9] START, SetVdsStatusVDSCommand(HostName = host_mixed_3, SetVdsStatusVDSCommandParameters:{runAsync='true', hostId='fd66bfae-e845-408c-94e7-dbb4c354a3e5', status='Reboot', nonOperationalReason='NONE', stopSpmFailureLogged='false', maintenanceReason='null'}), log id: b98d4e1
2016-11-27 11:43:18,327+02 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler3) [] Check for available updates is skipped for host 'host_mixed_3' due to unsupported host status 'NonResponsive'
2016-11-27 11:43:33,994+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler10) [3e4f6531] Command 'GetAllVmStatsVDSCommand(HostName = host_mixed_3, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='fd66bfae-e845-408c-94e7-dbb4c354a3e5'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2016-11-27 11:43:33,997+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher] (DefaultQuartzScheduler10) [3e4f6531] Failed to fetch vms info for host 'host_mixed_3' - skipping VMs monitoring.
2016-11-27 11:46:49,079+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler8) [6f96573e] Command 'GetAllVmStatsVDSCommand(HostName = host_mixed_3, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='fd66bfae-e845-408c-94e7-dbb4c354a3e5'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2016-11-27 11:46:49,079+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher] (DefaultQuartzScheduler8) [6f96573e] Failed to fetch vms info for host 'host_mixed_3' - skipping VMs monitoring.
2016-11-27 11:50:04,154+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler10) [3e4f6531] Command 'GetAllVmStatsVDSCommand(HostName = host_mixed_3, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='fd66bfae-e845-408c-94e7-dbb4c354a3e5'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2016-11-27 11:50:04,155+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher] (DefaultQuartzScheduler10) [3e4f6531] Failed to fetch vms info for host 'host_mixed_3' - skipping VMs monitoring

So the power management stop was indeed initiated, but perhaps it never returned?
Can you attach the host log of host_mixed_2 to check what's wrong with the power management operation?

Comment 8 Artyom 2016-11-28 08:22:45 UTC
Created attachment 1225166 [details]
engine and vdsm logs for both hosts(you can start looking from Mon Nov 28 09:17:05 IST 2016)

The same scenario as in comment 6:

1) Environment has two hosts (host_mixed_2, host_mixed_3)
2) Stop the network on host_mixed_3
3) Wait 15 minutes

Host host_mixed_3 is in the non-responding state, and I can see via the power management remote viewer that no restart happened on the host.

Comment 9 Martin Perina 2016-12-09 14:17:06 UTC
There's something really strange in all those logs. Is it possible to get the complete full logs (boot/server/engine) or access to the system where the issue happened?

For example:

  1. On 2016-11-27 11:41:23,536+02 we executed StopVdsCommand
  2. On 2016-11-27 11:41:23,629+02 we executed SetVdsStatus to Reboot
  3. And after that there is nothing about actually executing the PM stop, nor any other notes with the same correlation id - wasn't the engine killed?

The same weird behaviour occurs for fencing around 2016-11-28 09:17; the expected logs about commands executed one by one from NonRespondingTreatment are just missing.

Comment 10 Artyom 2016-12-13 12:31:06 UTC
No, the engine was UP.
I can provide you the environment; I believe that will be the best solution. Ping me when you want to use it and I will set it up.

Comment 11 Artyom 2017-01-05 15:49:14 UTC
I tried to reproduce the bug on the downstream build rhevm-4.1.0-0.3.beta2.el7.noarch and it looks like everything is back to normal, so you can close the bug.