Bug 1317633

Summary: RHEV 3.6 upgrade fails due to database showing DWH is running
Product: [oVirt] ovirt-engine-dwh
Reporter: Jason Woods <jwoods>
Component: Setup
Assignee: Yedidyah Bar David <didi>
Status: CLOSED CURRENTRELEASE
QA Contact: Pavel Stehlik <pstehlik>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 3.6.3
CC: bugs, didi, jwoods, rhodain, ylavi
Target Milestone: ---
Flags: rule-engine: planning_ack?, rule-engine: devel_ack?, rule-engine: testing_ack?
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-03-15 23:48:29 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Integration
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments:
logcollector from my personal RHEV VM built to test bug (flags: none)

Description Jason Woods 2016-03-14 17:39:48 UTC
Created attachment 1136229
logcollector from my personal RHEV VM built to test bug

Description of problem:
Upgrading a fresh install of RHEV 3.5.8 to RHEV 3.6 failed with an error that DWH was running, when it was not. The upgrade script did shut down DWH, but the RHEV database entry did not reflect this.

Version-Release number of selected component (if applicable):
RHEV 3.6
rhevm-setup-3.6.3.4-0.1.el6.noarch


How reproducible:
Two systems were built fresh. Both exhibited the same problem, and both were fixed the same way, by manually modifying the RHEV DB using:
UPDATE dwh_history_timekeeping SET var_value=0 WHERE var_name ='DwhCurrentlyRunning';
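
A minimal sketch of the same workaround with a read-before-write check, for illustration only. It uses an in-memory SQLite table as a stand-in for the real engine database (which is PostgreSQL). The table, column, and variable names come from the statement above; the helper names and the choice to store the value as the text '1'/'0' are assumptions.

```python
import sqlite3

FLAG = "DwhCurrentlyRunning"


def make_demo_db(running_value):
    """Build an in-memory stand-in for the engine DB (the real one is PostgreSQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dwh_history_timekeeping "
        "(var_name TEXT PRIMARY KEY, var_value TEXT)"
    )
    conn.execute(
        "INSERT INTO dwh_history_timekeeping (var_name, var_value) VALUES (?, ?)",
        (FLAG, running_value),
    )
    conn.commit()
    return conn


def read_flag(conn):
    """Return the stored flag value, or None if the row does not exist."""
    row = conn.execute(
        "SELECT var_value FROM dwh_history_timekeeping WHERE var_name = ?",
        (FLAG,),
    ).fetchone()
    return None if row is None else row[0]


def clear_flag(conn):
    """Apply the manual workaround; return the number of rows changed."""
    cur = conn.execute(
        "UPDATE dwh_history_timekeeping SET var_value = '0' WHERE var_name = ?",
        (FLAG,),
    )
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    conn = make_demo_db("1")          # simulate the stale "running" flag
    print("before:", read_flag(conn))
    print("rows changed:", clear_flag(conn))
    print("after:", read_flag(conn))
```

Reading the flag first makes it visible whether the row was actually stale before it is overwritten.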


Steps to Reproduce:
1. Install RHEL 6.7 OS. After install, attach to Satellite 5.7 repo rhel-x86_64-server-6 (up to date as of 2016-03-02), run update, then reboot:
yum update
shutdown -r now
2. Add repos for RHEV 3.5 (up to date as of 2016-03-10):
rhel-x86_64-server-supplementary-6
rhel-x86_64-server-6-rhevm-3.5
jbappplatform-6-x86_64-server-6-rpm
3. Install RHEV 3.5.8:
yum install rhevm rhevm-reports rhevm-dwh
engine-setup
4. Verify all packages up to date post RHEV 3.5 install, then reboot:
yum update
shutdown -r now
5. Upgrade RHEV by attaching repo rhel-x86_64-server-6-rhevm-3.6 (up to date as of 2016-03-10), then run:
yum update rhevm-setup
engine-setup


Actual results:
Error is reported in RHEV upgrade from engine-setup:
2016-03-10 11:39:41 ERROR otopi.plugins.ovirt_engine_setup.ovirt_engine_dwh.core.single_etl single_etl._transactionBegin:136 dwhd is currently running.


Expected results:
RHEV upgrades without manual intervention being needed to pass the check for whether DWH is running.


Additional info:
Systems were checked after errors to see if DWH was running. Both systems showed DWH was not running via 'service ovirt-engine-dwhd status'. The command 'service ovirt-engine-dwhd stop' was also run to verify DWH was in fact not running.
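
The mismatch described here (service stopped, database still saying "running") can be expressed as a small classification helper. This is only a sketch: the function name and result labels are hypothetical, and it assumes a flag value of "1" means running, inferred from the workaround that sets it to 0. On el6 the service state would typically come from the exit status of 'service ovirt-engine-dwhd status'; here it is passed in as a boolean so the logic stays self-contained.

```python
def diagnose_dwh_state(service_running, db_flag):
    """Compare the observed service state with the DwhCurrentlyRunning flag.

    service_running: True if the dwhd service is reported as running.
    db_flag: value of DwhCurrentlyRunning as read from the DB (assumed "1" = running).
    """
    flag_running = str(db_flag).strip() == "1"
    if service_running and flag_running:
        return "running"
    if not service_running and not flag_running:
        return "stopped"
    if not service_running and flag_running:
        # The situation in this report: the process is gone, the flag is stale.
        return "stale-flag"
    # Service is up but the flag says stopped.
    return "flag-not-set"


if __name__ == "__main__":
    print(diagnose_dwh_state(service_running=False, db_flag="1"))
```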

I was not able to get a logcollector from the customer system, but I was able to get one from my local RHEV manager, which is attached.

Comment 1 Shirly Radco 2016-03-14 21:43:22 UTC
I don't believe "shutdown -r now" is the proper way to stop the dwh service so that it shuts down properly.

Please stop the service:
service ovirt-engine-dwhd stop

and then upgrade and run engine-setup.
Otherwise the process is stopped without updating the engine db.

Comment 2 Jason Woods 2016-03-15 00:51:59 UTC
The shutdown command was not used for stopping any particular service, but as a way to verify all systems had reset after any major updates to the OS or RHEV. During and after updates for engine-setup, the service was stopped with the command you had given.

If DWH is not shutting down correctly with 'shutdown -r now' to reboot a server, then DWH needs to be fixed for that flaw.

The procedure asked to be thorough, so I included all reboots of the system.

Comment 3 Yedidyah Bar David 2016-03-15 07:03:40 UTC
*** Bug 1317630 has been marked as a duplicate of this bug. ***

Comment 4 Yedidyah Bar David 2016-03-15 07:12:57 UTC
Already commented on this at bug 1216125 comment 10.

I think we should still fix the specific flow described in current bug.

Some things to check:

1. Did dwhd of 3.5 correctly start after the reboot (step 4)?
2. start/stop order of dwh and postgresql (in el6) and dependencies (in systemd).

Taking the bug for now. If anyone tests above points, please comment.

Comment 5 Jason Woods 2016-03-15 19:34:14 UTC
I have tested this again with another new OS build. My Satellite 5.7 was up to date as of 2016-03-15 12:15 am EST.

1. create new RHEL 6.7 OS
2. add RHEL OS to Satellite 5.7 channels for RHEV:
rhel-x86_64-server-6
rhel-x86_64-server-supplementary-6
rhel-x86_64-server-6-rhevm-3.5
jbappplatform-6-x86_64-server-6-rpm
3. update OS with current, reboot before RHEV install
4. start then stop DWH, install RHEV 3.5.8 (used all defaults), report status of ovirt-engine-dwhd
5. add RHEV channel 3.6:
rhel-x86_64-server-6-rhevm-3.6
6. upgrade RHEV to 3.6.3

This now works as expected: the setup gets past the check for DWH being stopped. I am not sure if the start/stop of DWH before RHEV engine-setup changed anything, or if any packages are different than before.

I will build another new RHEV and not start/stop DWH before RHEV engine-setup and see if that makes any difference.

Comment 7 Yedidyah Bar David 2016-03-16 10:15:22 UTC
Checked the attached sosreport, and you seem to be affected by bug 1286441.

It wasn't back-ported to 3.5.

You did 'yum update' after setup of 3.5, and this upgraded your postgresql, so restarted it. dwhd lost its connection to it, so could not update that it's stopping when stopped.
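
The failure mode described above can be illustrated with a toy model. This is not dwhd's code: every class and method name below is hypothetical, and the "database" is a plain dictionary. It only shows how a daemon that keeps one long-lived connection can leave DwhCurrentlyRunning stale if the database is restarted before the daemon is stopped.

```python
FLAG = "DwhCurrentlyRunning"


class ToyDatabase:
    """Stand-in for the engine DB; a restart invalidates existing connections."""

    def __init__(self):
        self.rows = {FLAG: "0"}
        self.generation = 0

    def restart(self):
        self.generation += 1

    def connect(self):
        return ToyConnection(self)


class ToyConnection:
    """A connection that only works while the DB has not been restarted."""

    def __init__(self, db):
        self._db = db
        self._generation = db.generation

    def set_var(self, name, value):
        if self._generation != self._db.generation:
            raise ConnectionError("server closed the connection")
        self._db.rows[name] = value


class ToyDwhd:
    """Daemon that opens one connection at start-up and never reconnects."""

    def __init__(self, db):
        self._conn = db.connect()

    def start(self):
        self._conn.set_var(FLAG, "1")

    def stop(self):
        """Try to record the stop; return False if the flag could not be updated."""
        try:
            self._conn.set_var(FLAG, "0")
            return True
        except ConnectionError:
            return False  # the process still exits, but the flag stays "1"


if __name__ == "__main__":
    db = ToyDatabase()
    daemon = ToyDwhd(db)
    daemon.start()
    db.restart()  # e.g. a package update restarting the database
    print("stop recorded:", daemon.stop())
    print("flag in DB:", db.rows[FLAG])
```

In this model the stop itself succeeds from the service's point of view, which matches the report: the service shows as stopped while the database still says it is running.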

Not sure how serious this is. A probably-minimal reproduction flow:

1. install and setup 3.5
2. add 3.6 repos, update setup packages
3. restart postgresql
4. engine-setup

See comment 4 and the link there.

Yaniv - do we want to do something? If not, we probably want to update [1] or the docs.

[1] https://access.redhat.com/labs/rhevupgradehelper/

Comment 8 Yaniv Lavi 2016-03-20 15:33:05 UTC
(In reply to Yedidyah Bar David from comment #7)
> Checked the attached sosreport, and you seem to be affected by bug 1286441.
> 
> It wasn't back-ported to 3.5.
> 
> You did 'yum update' after setup of 3.5, and this upgraded your postgresql,
> so restarted it. dwhd lost its connection to it, so could not update that
> it's stopping when stopped.
> 
> Not sure how serious this is. A probably-minimal reproduction flow:
> 
> 1. install and setup 3.5
> 2. add 3.6 repos, update setup packages
> 3. restart postgresql
> 4. engine-setup
> 
> See comment 4 and the link there.
> 
> Yaniv - do we want to do something? 

No, no action item on 3.5

> If not, we probably want to update [1]
> or the docs.
> 
> [1] https://access.redhat.com/labs/rhevupgradehelper/

Roman, can you update the helper?

Comment 9 Yedidyah Bar David 2016-03-20 15:41:19 UTC
(In reply to Yaniv Dary from comment #8)
> (In reply to Yedidyah Bar David from comment #7)
> > Checked the attached sosreport, and you seem to be affected by bug 1286441.
> > 
> > It wasn't back-ported to 3.5.
> > 
> > You did 'yum update' after setup of 3.5, and this upgraded your postgresql,
> > so restarted it. dwhd lost its connection to it, so could not update that
> > it's stopping when stopped.
> > 
> > Not sure how serious this is. A probably-minimal reproduction flow:
> > 
> > 1. install and setup 3.5
> > 2. add 3.6 repos, update setup packages
> > 3. restart postgresql
> > 4. engine-setup
> > 
> > See comment 4 and the link there.
> > 
> > Yaniv - do we want to do something? 
> 
> No, no action item on 3.5

I didn't mean that. I meant doing something in 3.6 to prevent this or work around it. See comment 4 and the link there. In particular (copying):

2. Decide that if we found and killed a local dwhd, it was the only one, and we do not need to check the flag (this will probably break some less likely flows)
3. Add some opposite check, that if e.g. dwh_history_timekeeping.heartBeat is more than X seconds after we asked it to stop and/or killed it, it actually was stopped, even if the flag says it's running.
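
Option 3 could be sketched roughly as follows. This is one possible reading of the proposal, not an implementation from ovirt-engine-dwh: the function name, its parameters, and the exact rule (wait at least X seconds after the stop request, then treat dwhd as stopped if the last heartbeat is older than X seconds) are assumptions.

```python
from datetime import datetime, timedelta


def dwhd_considered_stopped(flag_running, last_heartbeat, stop_requested_at,
                            now, grace_seconds):
    """Decide whether dwhd can be treated as stopped.

    flag_running: DwhCurrentlyRunning as read from the DB (True = running).
    last_heartbeat: last value of dwh_history_timekeeping.heartBeat.
    stop_requested_at: when setup asked dwhd to stop (or killed it).
    grace_seconds: the "X seconds" from the proposal.
    """
    if not flag_running:
        return True  # the flag already says stopped; nothing more to check

    grace = timedelta(seconds=grace_seconds)
    waited_long_enough = now - stop_requested_at >= grace
    heartbeat_is_stale = now - last_heartbeat > grace
    # Override a "running" flag only after waiting, and only if no fresh
    # heartbeat has been seen in the meantime.
    return waited_long_enough and heartbeat_is_stale


if __name__ == "__main__":
    t0 = datetime(2016, 3, 10, 11, 39, 0)
    print(dwhd_considered_stopped(
        flag_running=True,
        last_heartbeat=t0 - timedelta(seconds=10),
        stop_requested_at=t0,
        now=t0 + timedelta(seconds=120),
        grace_seconds=60,
    ))
```

A fresh heartbeat after the stop request keeps the check failing, which would cover the case of a remote dwhd that is still alive.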

> 
> > If not, we probably want to update [1]
> > or the docs.
> > 
> > [1] https://access.redhat.com/labs/rhevupgradehelper/
> 
> Roman, can you update the helper?