Bug 1044096

Summary: refreshVdsRunTimeInfo not being called against the SPM results in no change in state after a power outage.
Product: Red Hat Enterprise Virtualization Manager
Reporter: Lee Yarwood <lyarwood>
Component: ovirt-engine
Assignee: Martin Perina <mperina>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Jiri Belka <jbelka>
Severity: high
Docs Contact:
Priority: high
Version: 3.2.0
CC: abisogia, acathrow, agkesos, ahoness, bazulay, emesika, flo_bugzilla, iheim, lpeer, lyarwood, mperina, pstehlik, pzhukov, Rhev-m-bugs, rhodain, sputhenp, tdosek, vkaigoro, yeylon
Target Milestone: ---
Keywords: Reopened
Target Release: 3.3.3
Hardware: x86_64
OS: Linux
Whiteboard: infra
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-04-27 11:48:29 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On:
Bug Blocks: 1044088
Attachments:
  engine logs (flags: none)

Description Lee Yarwood 2013-12-17 19:30:47 UTC
Description of problem:
refreshVdsRunTimeInfo not being called against the SPM results in no change in state after a power outage.

Version-Release number of selected component (if applicable):
rhevm-3.2.3-0.43.el6ev.noarch

How reproducible:
Not clear.

Steps to Reproduce:
1. Ensure refreshVdsRunTimeInfo is not being called against the SPM.
2. Remove all power to an active host, including any fence agents that are configured.

Actual results:
The status of the SPM is not changed and the host is not added back to the vdsManager, which would allow refreshVdsRunTimeInfo to be called against it.

Expected results:
The host is added back to the vdsManager, refreshVdsRunTimeInfo is called, and the correct state is set.

Additional info:

Comment 3 Martin Perina 2014-01-13 14:29:56 UTC
Hi Lee,

I was unable to reproduce the problem using your steps on 3.2.3 or 3.2.5. I tested these scenarios:

I.
 1. Power off SPM host1 by IPMI and at the same time block access to IPMI from other hosts in cluster
 2. host1 status changed to Connecting and after that to Non Responsive, but host1 stayed SPM
 3. After I powered up host1 and unblocked its IPMI, it changed status to Up and the DC became Up again (host1 stayed SPM)

II.
 1. Power off SPM host1 by IPMI and at the same time block access to IPMI from other hosts in cluster
 2. host1 status changed to Connecting and after that to Non Responsive, but host1 stayed SPM
 3. I executed 'Confirm host as rebooted' on host1, and after a new election another host from the cluster became SPM
 4. After I powered up host1, it became Up again

In any case refreshVdsRunTimeInfo started to execute on host1 once it became accessible by network.

Comment 4 Lee Yarwood 2014-01-14 11:02:57 UTC
(In reply to Martin Perina from comment #3)
> Hi Lee,
> 
> I was unable to reproduce the problem using your steps on 3.2.3 or 3.2.5. I
> tested these scenarios:
>
> [..]
>
> In any case refreshVdsRunTimeInfo started to execute on host1 once it became
> accessible by network.

Hey Martin,

As covered in step 1 of my suggested reproducer (I should add that I've not reproduced this yet) and also in private comment #2, AFAICT refreshVdsRunTimeInfo was not being called against the SPM prior to the outage.

If you agree that this was the case and want to manually reproduce this, we could use something like byteman to inject code into the VdsManager class that stops refreshVdsRunTimeInfo from being called against the current SPM.

Lee

Comment 5 Barak 2014-01-19 12:41:21 UTC
(In reply to Lee Yarwood from comment #4)
> (In reply to Martin Perina from comment #3)
> > Hi Lee,
> > 
> > I was unable to reproduce the problem using your steps on 3.2.3 or 3.2.5. I
> > tested these scenarios:
> >
> > [..]
> >
> > In any case refreshVdsRunTimeInfo started to execute on host1 once it became
> > accessible by network.
> 
> Hey Martin,
> 
> As covered in step 1 of my suggested reproducer (I should add that I've not
> reproduced this yet) and also in private comment #2, AFAICT
> refreshVdsRunTimeInfo was not being called against the SPM prior to the
> outage.


So this is the problem and not the $subject.
Can we have a reproducer for that?

> 
> If you agree that this was the case and want to manually reproduce this we
> could use something like byteman to inject code into the vdsmanager class
> that stops refreshVdsRunTimeInfo from being called against the current SPM.
> 
> Lee

Comment 6 Lee Yarwood 2014-01-28 16:42:05 UTC
(In reply to Barak from comment #5)
> (In reply to Lee Yarwood from comment #4)
> > (In reply to Martin Perina from comment #3)
> > > Hi Lee,
> > > 
> > > I was unable to reproduce the problem using your steps on 3.2.3 or 3.2.5. I
> > > tested these scenarios:
> > >
> > > [..]
> > >
> > > In any case refreshVdsRunTimeInfo started to execute on host1 once it became
> > > accessible by network.
> > 
> > Hey Martin,
> > 
> > As covered in step 1 of my suggested reproducer (I should add that I've not
> > reproduced this yet) and also in private comment #2, AFAICT
> > refreshVdsRunTimeInfo was not being called against the SPM prior to the
> > outage.
> 
> 
> So this is the problem and not the $subject.
> Can we have a reproducer for that?

Sorry Barak, but $subject does state that, and again, no, we don't have a complete reproducer as yet.

IMHO there are two issues in this bug:

1. The scheduled refreshVdsRunTimeInfo job not being called against the host.
  
I couldn't find any pointers as to why this happened in engine.log and would be grateful for a review from devel.

2. The engine not monitoring or correcting the fact that this job wasn't being called.

We can artificially reproduce this issue with byteman by killing the scheduled job:

- Ensure the JDK is installed and used by the engine:

root@rhevm # yum install java-1.7.0-openjdk-devel -y
root@rhevm # service ovirt-engine restart

- Change to the ovirt user and configure byteman:

root@rhevm # su ovirt -s /bin/sh
ovirt@rhevm # wget http://downloads.jboss.org/byteman/2.1.2/byteman-download-2.1.2-bin.zip
ovirt@rhevm # unzip byteman-download-2.1.2-bin.zip 
ovirt@rhevm # export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.51.x86_64/
ovirt@rhevm # export BYTEMAN_HOME=/var/lib/ovirt-engine/byteman-download-2.1.2
ovirt@rhevm # export PATH=$JAVA_HOME/bin:$BYTEMAN_HOME/bin:$PATH  

- Attach byteman to the engine:

ovirt@rhevm # jps
16962 Jps
12936 jboss-modules.jar <-- this is the JBoss process to use.
20826 HistoryETL
ovirt@rhevm # bminstall.sh -b -Dorg.jboss.byteman.transform.all 12936

- Define the VDS UUID:

ovirt@rhevm # VDS_UUID="ea50ab33-361c-40d0-85e0-0d0773b96644"

- Create the killjob script:

ovirt@rhevm # cat > killjob.btm << EOF
RULE VdsManager.OnTimer entry
CLASS org.ovirt.engine.core.vdsbroker.VdsManager
METHOD OnTimer
AT ENTRY
BIND vdsManager:org.ovirt.engine.core.vdsbroker.VdsManager = \$this 
IF vdsManager.getVdsId().toString() == "${VDS_UUID}"
DO traceln("job " + vdsManager.onTimerJobId + " killed for VDS " + vdsManager.getVdsId().toString());
   org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl.getInstance().deleteJob(vdsManager.onTimerJobId)
ENDRULE
EOF

- Submit the rule to kill the job and watch the server log:

ovirt@rhevm # bmsubmit.sh killjob.btm ; tailf /var/log/ovirt-engine/server.log
2014-01-28 16:28:46,984 INFO  [stdout] (Thread-81) CFG.carryForward: unexpected! invalid open enters count for block L1232225467 in method org.ovirt.engine.core.vdsbroker.VdsManager.OnTimer()V
2014-01-28 16:28:48,755 INFO  [stdout] (DefaultQuartzScheduler_Worker-95) job org.ovirt.engine.core.vdsbroker.VdsManager.OnTimer#-9223372036854775783 killed for VDS ea50ab33-361c-40d0-85e0-0d0773b96644

- Shut down the host etc. and its state will not change from Up.
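
Issue 2 above — the engine not noticing when a host's scheduled refresh job disappears — can be sketched in plain Java. This is a hypothetical illustration using java.util.concurrent rather than the engine's actual Quartz-based scheduler; the class and method names below are invented stand-ins for VdsManager / refreshVdsRunTimeInfo, not oVirt code:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch (not oVirt code): a watchdog that re-registers the
// periodic host-refresh job if it has been cancelled or has died -- the
// self-healing behaviour the byteman reproducer shows the engine lacks.
public class RefreshWatchdog {
    private final ScheduledExecutorService scheduler =
            Executors.newScheduledThreadPool(1);
    final AtomicInteger refreshCount = new AtomicInteger();
    volatile ScheduledFuture<?> refreshJob;

    // Stand-in for VdsManager.OnTimer / refreshVdsRunTimeInfo.
    void refreshVdsRunTimeInfo() {
        refreshCount.incrementAndGet();
    }

    void scheduleRefresh() {
        refreshJob = scheduler.scheduleAtFixedRate(
                this::refreshVdsRunTimeInfo, 0, 100, TimeUnit.MILLISECONDS);
    }

    // Watchdog check: if the refresh job is gone, re-register it instead of
    // leaving the host permanently stuck in its last known state.
    boolean ensureRefreshScheduled() {
        if (refreshJob == null || refreshJob.isCancelled() || refreshJob.isDone()) {
            scheduleRefresh();
            return true;   // schedule had to be repaired
        }
        return false;      // schedule intact
    }

    void shutdown() {
        scheduler.shutdownNow();
    }

    public static void main(String[] args) {
        RefreshWatchdog w = new RefreshWatchdog();
        w.scheduleRefresh();
        w.refreshJob.cancel(false);  // simulate byteman's deleteJob
        System.out.println("repaired: " + w.ensureRefreshScheduled());
        w.shutdown();
    }
}
// prints "repaired: true"
```

A periodic sweep calling ensureRefreshScheduled() for every host would turn the silent loss of the OnTimer job into a logged, self-corrected event.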

Comment 7 Barak 2014-02-02 13:15:00 UTC
(In reply to Lee Yarwood from comment #6)
> (In reply to Barak from comment #5)
> > (In reply to Lee Yarwood from comment #4)
> > > (In reply to Martin Perina from comment #3)
> > > > Hi Lee,
> > > > 
> > > > I was unable to reproduce the problem using your steps on 3.2.3 or 3.2.5. I
> > > > tested these scenarios:
> > > >
> > > > [..]
> > > >
> > > > In any case refreshVdsRunTimeInfo started to execute on host1 once it became
> > > > accessible by network.
> > > 
> > > Hey Martin,
> > > 
> > > As covered in step 1 of my suggested reproducer (I should add that I've not
> > > reproduced this yet) and also in private comment #2, AFAICT
> > > refreshVdsRunTimeInfo was not being called against the SPM prior to the
> > > outage.
> > 
> > 
> > So this is the problem and not the $subject.
> > Can we have a reproducer for that?
> 
> Sorry Barak but $subject does state that and again no we don't have a
> complete reproducer as yet.
> 
> IMHO there are two issues in this bug :
> 
> 1. The scheduled refreshVdsRunTimeInfo job not being called against the host.
>   
> I couldn't find any pointers as to why this happened in engine.log and would
> be grateful for a review from devel.
> 
> 2. engine not monitoring or correcting the fact that this job wasn't being
> called.
> 
> We can artificially reproduce this issue with byteman by killing the
> scheduled job :
> 
> - Ensure the jdk is installed and used by engine :
> 
> root@rhevm # yum install java-1.7.0-openjdk-devel -y
> root@rhevm # service ovirt-engine restart
> 
> - Change to the ovirt user and configure byteman :
> 
> root@rhevm # su ovirt -s /bin/sh
> ovirt@rhevm # wget
> http://downloads.jboss.org/byteman/2.1.2/byteman-download-2.1.2-bin.zip
> ovirt@rhevm # unzip byteman-download-2.1.2-bin.zip 
> ovirt@rhevm # export
> JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.51.x86_64/
> ovirt@rhevm # export
> BYTEMAN_HOME=/var/lib/ovirt-engine/byteman-download-2.1.2
> ovirt@rhevm # export PATH=$JAVA_HOME/bin:$BYTEMAN_HOME/bin:$PATH  
> 
> - Attach byteman to engine :
> 
> ovirt@rhevm # jps
> 16962 Jps
> 12936 jboss-modules.jar <-- this is the JBoss process to use.
> 20826 HistoryETL
> ovirt@rhevm # bminstall.sh -b -Dorg.jboss.byteman.transform.all 12936
> 
> - Define the VDS UUID :
> 
> ovirt@rhevm # VDS_UUID="ea50ab33-361c-40d0-85e0-0d0773b96644"
> 
> - Create the killjob script :
> 
> ovirt@rhevm # cat > killjob.btm << EOF
> RULE VdsManager.OnTimer entry
> CLASS org.ovirt.engine.core.vdsbroker.VdsManager
> METHOD OnTimer
> AT ENTRY
> BIND vdsManager:org.ovirt.engine.core.vdsbroker.VdsManager = \$this 
> IF vdsManager.getVdsId().toString() == "${VDS_UUID}"
> DO traceln("job " + vdsManager.onTimerJobId + " killed for VDS " +
> vdsManager.getVdsId().toString());
>   
> org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl.getInstance().
> deleteJob(vdsManager.onTimerJobId)
> ENDRULE
> EOF
> 
> - Run the kill the job and stop the job :
> 
> ovirt@rhevm # bmsubmit.sh killjob.btm ; tailf
> /var/log/ovirt-engine/server.log
> 2014-01-28 16:28:46,984 INFO  [stdout] (Thread-81) CFG.carryForward:
> unexpected! invalid open enters count for block L1232225467 in method
> org.ovirt.engine.core.vdsbroker.VdsManager.OnTimer()V
> 2014-01-28 16:28:48,755 INFO  [stdout] (DefaultQuartzScheduler_Worker-95)
> job org.ovirt.engine.core.vdsbroker.VdsManager.OnTimer#-9223372036854775783
> killed for VDS ea50ab33-361c-40d0-85e0-0d0773b96644
> 
> - Shutdown the host etc and the state will not change from UP.

The above is not a valid reproducer.
We need a real one: killing the quartz job = creating the problem described in the $subject of the bug, since the host status change is derived from this quartz job.

So unless we have a clear reproducer, we cannot proceed here.

Comment 8 Jiri Belka 2014-02-03 15:51:29 UTC
Created attachment 858667 [details]
engine logs

I tried to reproduce based on comment #3.

I.

- IPMI power off host1
- connection to IPMI from host2 to host1 was not working
- host1 states: Connecting -> Reboot
- IPMI power on host1
- connection to IPMI from host2 to host1 was allowed
- host1 states: Reboot, Non Responsive, Active

II.

- IPMI power off host1
- connection to IPMI from host2 to host1 was not working
- host1 states: Connecting -> Reboot
- host1 - Confirmation the host has been rebooted in Admin Portal
- host1 states: Reboot (still)
- IPMI power on host1
- connection to IPMI from host2 to host1 was allowed
- host1 states: Reboot, Non Responsive, Active

Comment 9 Lee Yarwood 2014-02-04 14:21:36 UTC
(In reply to Barak from comment #7)
> The above is not a valid reproducer,
> We need a real one, killing the quartz job = creating the problem described
> in the $subject of the bug,  the host status change is derived from this
> quarrtz.
> 
> So unless we have a clear reproducer or a  we can not proceed here ?

I don't have a non-artificial reproducer that results in $subject. Again, can devel at least review the attached log for anything that I might've missed in this regard?

IMHO the artificial reproducer does highlight an issue with the engine's own monitoring (or lack thereof) of the quartz jobs. Shouldn't we at least ensure that this is fixed?

Comment 11 Martin Perina 2014-02-10 10:55:28 UTC
> I don't have a non-artificial reproducer that results in $subject. Again,
> can devel at least review the attached log for anything that I might've
> missed in this regard?

Lee, I investigated the log several times, but I didn't find anything that leads to the cause of the issue.

I did several tests with a 2-host cluster and tried to shut down the SPM host (details in comment 3).

Last week I did more tests with a 2-host cluster: I tried to shut down both hosts at once (or with a minimal interval). Every time, all hosts got to Non Responsive state (it takes a few minutes, and in the meantime the state can also be Up before changing to Connecting, or Reboot before changing to Non Responsive). When I powered the hosts up again, every time all hosts and the DC got back to Up state without any manual intervention.

Comment 27 Martin Perina 2014-06-12 11:29:40 UTC
Clearing needinfo