Bug 1477700

Summary: Host enters a power management restart loop
Product: [oVirt] ovirt-engine
Reporter: Artyom <alukiano>
Component: BLL.Infra
Assignee: Eli Mesika <emesika>
Status: CLOSED CURRENTRELEASE
QA Contact: Petr Matyáš <pmatyas>
Severity: high
Docs Contact:
Priority: high
Version: 4.1.4.2
CC: alukiano, bugs, emesika, lsvaty, mperina, oourfali, pmatyas
Target Milestone: ovirt-4.1.7
Keywords: Automation
Target Release: 4.1.7.1
Flags: rule-engine: ovirt-4.1+, rule-engine: exception+, lsvaty: testing_ack+
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1487981
Environment:
Last Closed: 2017-11-13 12:29:51 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1487981
Attachments:
engine log (flags: none)

Description Artyom 2017-08-02 16:15:08 UTC
Created attachment 1308337: engine log

Description of problem:
The host enters a power management restart loop, apparently because of a NullPointerException in VdsNotRespondingTreatmentCommand:
2017-08-01 13:05:31,410+03 ERROR [org.ovirt.engine.core.bll.pm.VdsNotRespondingTreatmentCommand] (org.ovirt.thread.pool-6-thread-2) [] Exception: java.lang.NullPointerException
    at org.ovirt.engine.core.bll.pm.VdsNotRespondingTreatmentCommand.executeCommand(VdsNotRespondingTreatmentCommand.java:174) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1251) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1391) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:2055) [bll.jar:]
...

Version-Release number of selected component (if applicable):
rhevm-4.1.4.2-0.1.el7.noarch

How reproducible:
25%

Steps to Reproduce:
1. Configure power management on host
2. Bring down the ovirtmgmt interface on the host: # ip link set down ovirtmgmt
3.

Actual results:
The engine restarts the host via power management, but only about two minutes later it restarts the host via power management again. This puts the host into a power management loop, so it never reaches the UP state:
2017-08-01 13:02:36,695+03 INFO  [org.ovirt.engine.core.bll.pm.StopVdsCommand] (org.ovirt.thread.pool-6-thread-5) [934039e] Power-Management: STOP of host 'host_mixed_1' initiated.
2017-08-01 13:03:09,587+03 INFO  [org.ovirt.engine.core.bll.pm.StartVdsCommand] (org.ovirt.thread.pool-6-thread-5) [] Power-Management: START of host 'host_mixed_1' initiated.
...
2017-08-01 13:03:18,797+03 INFO  [org.ovirt.engine.core.bll.pm.StartVdsCommand] (org.ovirt.thread.pool-6-thread-10) [] Waiting 300 seconds, for server to finish reboot process.
...
2017-08-01 13:04:54,074+03 INFO  [org.ovirt.engine.core.bll.pm.StopVdsCommand] (org.ovirt.thread.pool-6-thread-2) [466bccca] Power-Management: STOP of host 'host_mixed_1' initiated.
2017-08-01 13:05:21,121+03 INFO  [org.ovirt.engine.core.bll.pm.StartVdsCommand] (org.ovirt.thread.pool-6-thread-2) [] Power-Management: START of host 'host_mixed_1' initiated.
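
To make the timing in the excerpt above concrete, the following minimal sketch parses two of the quoted timestamps and computes how far into the announced 300-second wait the second STOP was issued. The class and method names are illustrative only and are not part of ovirt-engine; the timestamp pattern is an assumption based on the log lines shown here.

```java
import java.time.Duration;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

// Illustrative helper, not engine code.
class FenceIntervalCheck {
    // Assumed layout of the engine log timestamps, e.g. "2017-08-01 13:02:36,695+03".
    static final DateTimeFormatter LOG_TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSSX");

    // Elapsed time between two log timestamps.
    static Duration between(String earlier, String later) {
        return Duration.between(
                OffsetDateTime.parse(earlier, LOG_TS),
                OffsetDateTime.parse(later, LOG_TS));
    }

    public static void main(String[] args) {
        // "Waiting 300 seconds, for server to finish reboot process."
        String waitStarted = "2017-08-01 13:03:18,797+03";
        // Second "Power-Management: STOP ... initiated."
        String secondStop = "2017-08-01 13:04:54,074+03";

        Duration elapsed = between(waitStarted, secondStop);
        System.out.println("Second STOP issued " + elapsed.toMillis()
                + " ms into the 300 s wait");
    }
}
```

By this calculation the second STOP arrives roughly 95 seconds after the 300-second wait was logged, and about 137 seconds after the first STOP.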

Expected results:
I believe the full interval (300 seconds) should be applied between power management operations.

Additional info:

Comment 2 Red Hat Bugzilla Rules Engine 2017-08-14 14:21:48 UTC
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Comment 13 Martin Perina 2017-08-23 13:48:51 UTC
So here's the issue:

When the host is properly started during NotRespondingTreatment, we change its status to Reboot and execute sleepOnReboot, which waits in a different thread. In the meantime, if the server is really slow to boot and doesn't finish rebooting within 2 minutes, monitoring tries to monitor the host even in Reboot status, and when the host doesn't respond to communication, it is fenced again. Eli, please take a look and try to find out why monitoring tries to monitor a host in Reboot status. If there is a reason for it, we will need to execute sleepOnReboot synchronously in StartVdsCommand so that the command does not end before this timeout.
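
One way to picture the first option raised in this comment (monitoring leaving a host in Reboot status alone until the wait has elapsed) is the sketch below. It is only an illustration under stated assumptions: RebootGraceGuard, HostStatus, and mayFence are hypothetical names, not ovirt-engine classes, and this thread does not say which approach the actual fix took.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical guard, not engine code: decides whether a non-responding
// host may be fenced, honoring a grace period after a reboot was started.
class RebootGraceGuard {
    enum HostStatus { UP, NON_RESPONSIVE, REBOOT }

    private final Duration grace;

    RebootGraceGuard(Duration grace) {
        this.grace = grace;
    }

    boolean mayFence(HostStatus status, Instant rebootStartedAt, Instant now) {
        if (status == HostStatus.REBOOT && rebootStartedAt != null) {
            // Still rebooting: allow fencing only once the grace period is over.
            return !now.isBefore(rebootStartedAt.plus(grace));
        }
        // Outside a reboot, only a non-responsive host is a fencing candidate.
        return status == HostStatus.NON_RESPONSIVE;
    }

    public static void main(String[] args) {
        RebootGraceGuard guard = new RebootGraceGuard(Duration.ofSeconds(300));
        Instant rebootStart = Instant.parse("2017-08-01T10:03:18Z");

        // 95 seconds into the wait: fencing is suppressed.
        System.out.println(guard.mayFence(HostStatus.REBOOT, rebootStart,
                rebootStart.plusSeconds(95)));
        // After the full 300 seconds: fencing is allowed again.
        System.out.println(guard.mayFence(HostStatus.REBOOT, rebootStart,
                rebootStart.plusSeconds(300)));
    }
}
```

The alternative mentioned in the comment, running sleepOnReboot synchronously, would instead keep the start command itself from completing before the timeout.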

Comment 15 Petr Matyáš 2017-08-28 16:45:56 UTC
There is no script, it's just clean RHEL 7.4 host

Comment 16 Eli Mesika 2017-08-29 06:40:25 UTC
(In reply to Petr Matyáš from comment #15)
> There is no script, it's just clean RHEL 7.4 host

Please provide a fresh engine log for the scenario. The attached engine log contains the NPE problem that was resolved in the first QE round; I would like a clean engine log without the NPE in order to investigate further.

Comment 17 Petr Matyáš 2017-08-29 07:42:29 UTC
Since the fix was merged on 9.8.2017 and the retest was executed on 23.8.2017 with 4.1.5-4, which was built on 15.8.2017, I thought the fix was already there. Was it not?

Also, please set the needinfo on me (QA contact) instead of on the reporter.

Comment 18 Martin Perina 2017-09-06 07:42:33 UTC
We are still not able to reproduce the issue, so we are retargeting to 4.1.7 for now. We are continuing to investigate and can release the fix as a 4.1.6 async if it is ready soon.

Comment 22 Petr Matyáš 2017-09-22 13:43:09 UTC
Verified on ovirt-engine-4.1.7.1-0.1.el7.noarch