Bug 1878742 - It is possible to run two dwhd on the same engine database
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: ovirt-engine-dwh
Classification: oVirt
Component: ETL
Version: 4.4.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Shirly Radco
QA Contact: Lucie Leistnerova
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-09-14 12:43 UTC by Yedidyah Bar David
Modified: 2021-02-04 09:01 UTC (History)

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-17 15:37:40 UTC
oVirt Team: Metrics
Embargoed:
sbonazzo: planning_ack?
sbonazzo: devel_ack+
sbonazzo: testing_ack?



Description Yedidyah Bar David 2020-09-14 12:43:00 UTC
Description of problem:

dwhd does not check if DwhCurrentlyRunning in dwh_history_timekeeping is already set to 1, and starts.

This can cause it to run in parallel on two separate machines.

It happened to me by accident when verifying [1][2], which are related, but do not cause (or solve) this problem, with this flow:

1. Set up engine+dwh on machine A.
2. Install dwh on machine B.
3. Run engine-setup on B. Agree to configure dwh.
4. When asked whether to disconnect dwh on machine A, stop dwh on A, and reply yes.
5. After engine-setup on machine B says:
[INFO] Creating/refreshing DWH database schema
quickly start dwhd on machine A.

dwhd will successfully start on A, if you started it quickly enough, before this line appears in the setup log:

   Stage misc METHOD otopi.plugins.ovirt_engine_setup.ovirt_engine_dwh.core.single_etl.Plugin._misc

and will also start successfully on machine B by engine-setup.

To reproduce this without relying on timing, you can just let engine-setup finish on B, then connect to the engine db and execute this, replacing MACHINE_A_UUID with the dwh uuid of machine A:

update dwh_history_timekeeping set var_value='MACHINE_A_UUID' where var_name='dwhUuid';

You can find the uuid in /etc/ovirt-engine-dwh/ovirt-engine-dwhd.conf.d/10-setup-uuid.conf .
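
To see what the statement above changes, it can help to look at the two relevant rows before and after running it. The following is only a sketch: it uses an in-memory sqlite3 database as a stand-in for the real engine database (which is PostgreSQL), the two-column table is a simplification, and the MACHINE_B_UUID value is a placeholder. Only the table name, the column names var_name/var_value and the two variable names are taken from this report.

```python
import sqlite3

# Stand-in for the engine db (assumption: the real one is PostgreSQL and the
# real table may have more columns than the two modelled here).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table dwh_history_timekeeping "
    "(var_name text primary key, var_value text)"
)
conn.executemany(
    "insert into dwh_history_timekeeping values (?, ?)",
    [("dwhUuid", "MACHINE_B_UUID"), ("DwhCurrentlyRunning", "1")],
)


def timekeeping_state(conn):
    """Return the dwhUuid and DwhCurrentlyRunning rows as a dict."""
    rows = conn.execute(
        "select var_name, var_value from dwh_history_timekeeping "
        "where var_name in ('dwhUuid', 'DwhCurrentlyRunning')"
    )
    return dict(rows.fetchall())


before = timekeeping_state(conn)

# Same update as in the report, with the uuid passed as a parameter.
conn.execute(
    "update dwh_history_timekeeping set var_value=? where var_name='dwhUuid'",
    ("MACHINE_A_UUID",),
)
after = timekeeping_state(conn)

print("before:", before)
print("after: ", after)
```

After the update, dwhUuid points at machine A while DwhCurrentlyRunning is still 1 because of the dwhd running on B, which is the state in which dwhd on A starts anyway.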

[1] https://gerrit.ovirt.org/111200
[2] https://gerrit.ovirt.org/111201

Version-Release number of selected component (if applicable):
Current 4.4; probably present for many versions.

How reproducible:
Always

Steps to Reproduce:
See above

Actual results:
Both dwhd instances start.

Expected results:
The one started by engine-setup on machine B should fail to start, emitting in the log something like:

 dwhd is already running, perhaps on machine A. Aborting.

Additional info:
If we do fix the current bug, the fix will very likely also make dwhd fail to start in "innocent" cases, such as after a machine hard-reset or after dwhd was killed with SIGKILL, by OOM (out-of-memory), etc., because DwhCurrentlyRunning would be left at 1.

An alternative fix, better but more complex, is probably to use a PostgreSQL lock or something similar.
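
One possible shape for such a lock is a session-level PostgreSQL advisory lock, taken with pg_try_advisory_lock on a connection that stays open for as long as the daemon runs. This is only a sketch of the idea, not how dwhd works today: the function name and the lock key are hypothetical, and the code assumes a DB-API connection from a driver that uses %s placeholders (such as psycopg2). dwhd itself is not written in Python.

```python
def try_single_instance_lock(conn, lock_key=1878742):
    """Try to take a session-level advisory lock on the engine db.

    Returns True if this session now holds the lock, False if another
    session (for example a dwhd on another machine) already holds it.
    lock_key is an arbitrary, hypothetical number that all dwhd
    instances would have to agree on.
    """
    cur = conn.cursor()
    try:
        # pg_try_advisory_lock returns immediately with true/false
        # instead of waiting for the lock.
        cur.execute("select pg_try_advisory_lock(%s)", (lock_key,))
        return bool(cur.fetchone()[0])
    finally:
        cur.close()
```

Because PostgreSQL releases session-level advisory locks when the session ends, a hard-reset machine or a SIGKILLed dwhd would not leave a stale lock behind once the server notices the dropped connection, which is the "innocent" case that a plain flag check cannot handle.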

Comment 1 Shirly Radco 2020-09-17 14:55:46 UTC
I believe that there is a very small chance for this corner case to happen for a customer.

Comment 2 Sandro Bonazzola 2020-09-17 15:37:40 UTC
(In reply to Shirly Radco from comment #1)
> I believe that there is a very small chance for this corner case to happen
> for a customer.

Deferring, as this is unlikely to happen.

Comment 3 Yedidyah Bar David 2020-11-09 08:41:08 UTC
This now happened to me again while working on bug 1894420, and I now understand that this isn't a timing issue, so is somewhat more likely to happen.

Flow was:

1. Install and setup engine+dwh on machine A

2. Set up dwh on machine B. It asks whether to disconnect dwh on machine A, I replied Yes, it stopped it on A successfully, and set up and started dwh on machine B.

3. on A: systemctl start ovirt-engine-dwhd (can happen also due to a reboot). This fails as expected, with the expected 'This installation is invalid' in the log, _But_: It also sets DwhCurrentlyRunning to 0. dwh on B is still running.

4. on A: engine-setup. It asks whether to disconnect dwh on B, I replied Yes. It didn't try to stop it on B (as DwhCurrentlyRunning was 0), successfully finished setup and started dwh on A. dwh on B was still running.

At this point, we have dwh running on both machines. If this is something which should not happen, we should reopen this bug.

The fix should be:

If dwh starts, and sees that DwhCurrentlyRunning is 1, it should stop as it does now with the error in the log, but should _not_ set DwhCurrentlyRunning to 0.
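
The flow above, and the effect of the proposed fix, can be sketched with a toy model. Everything in it is an assumption made for illustration: the class, its attributes and the way engine-setup "stops" a remote dwhd are simplifications of the behaviour described in this comment, not the actual dwhd or engine-setup code.

```python
class EngineDb:
    """Toy model of dwhUuid, DwhCurrentlyRunning and the running daemons."""

    def __init__(self, reset_flag_on_invalid):
        self.dwh_uuid = None
        self.currently_running = "0"
        self.running_on = set()
        # True models the behaviour described in step 3, False the proposed fix.
        self.reset_flag_on_invalid = reset_flag_on_invalid

    def start_dwhd(self, machine):
        if self.dwh_uuid != machine:
            # "This installation is invalid": dwhd exits without running.
            if self.reset_flag_on_invalid:
                self.currently_running = "0"
            return False
        self.currently_running = "1"
        self.running_on.add(machine)
        return True

    def engine_setup(self, machine):
        # Setup only gets a remote dwhd stopped if the flag says one is running.
        if self.currently_running == "1":
            self.running_on.clear()
            self.currently_running = "0"
        self.dwh_uuid = machine
        self.start_dwhd(machine)


def run_flow(reset_flag_on_invalid):
    db = EngineDb(reset_flag_on_invalid)
    db.engine_setup("A")  # step 1
    db.engine_setup("B")  # step 2
    db.start_dwhd("A")    # step 3: fails with "This installation is invalid"
    db.engine_setup("A")  # step 4
    return sorted(db.running_on)


print("flag reset on invalid start:", run_flow(True))
print("flag left untouched:        ", run_flow(False))
```

In this model the first run ends with dwhd running on both A and B, and the second run ends with it running on A only, because engine-setup in step 4 still sees the flag set to 1 and gets B stopped.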

Shirly - I think this is a rather simple fix, please reconsider.

Comment 4 Yedidyah Bar David 2020-11-09 08:46:58 UTC
Re the solution and concern in my original request:

(In reply to Yedidyah Bar David from comment #0)
> Description of problem:
> 
> dwhd does not check if DwhCurrentlyRunning in dwh_history_timekeeping is
> already set to 1, and starts.
[snip]
> Additional info:
> If we do fix current bug, it will very likely cause it to fail also on
> "innocent" cases, such as a machine hard-reset or dwhd killed with SIGKILL,
> OOM (out-of-memory), etc.

I still think it makes sense to make dwh refuse to start if DwhCurrentlyRunning is
already 1, but it indeed will cause the above concern (hard-killing/reboot will
require manually setting it to 0). If we do not want to "pay this price", we
can still do the fix in comment 3 - do not set it to 0 when we exit with the
error "This installation is invalid".

Comment 5 Yedidyah Bar David 2020-11-09 08:48:19 UTC
BTW: That said, I admit I do not know what the actual risk is in having two DWHDs running in parallel. I only know we spent quite some effort in the past on preventing this, but perhaps it's not a big risk.

Comment 6 Shirly Radco 2021-01-18 09:43:26 UTC
Having 2 dwh run at the same time will corrupt the data in dwh, and it is critical that this does not happen.

Comment 7 Shirly Radco 2021-01-18 09:48:42 UTC
The question is how likely it is to reach the described scenario.

Comment 8 Yedidyah Bar David 2021-01-18 09:58:40 UTC
(In reply to Shirly Radco from comment #7)
> The question is how likely it is to reach the described scenario.

Not very likely.

But I think it should be easy to fix, so perhaps worth it.

There are two changes discussed here:

1. Make dwhd refuse to start if DwhCurrentlyRunning is 1.

2. Make dwhd not set DwhCurrentlyRunning to 0, if it's 1, in the flow where we exit with "This installation is invalid".

I see why (1.) might be considered risky, but (2.) IMO is not. So I vote for doing (2.).
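
The difference between the two changes can be sketched as a small decision function. This is a hypothetical model, not dwhd code: the function name and both keyword arguments are made up for illustration, with refuse_if_running standing for change (1.) and keep_flag_on_invalid for change (2.).

```python
def startup_decision(uuid_matches, flag,
                     refuse_if_running=False, keep_flag_on_invalid=False):
    """Return (dwhd_starts, new value of DwhCurrentlyRunning)."""
    if not uuid_matches:
        # "This installation is invalid": dwhd never starts in this flow.
        # Change (2.) only decides whether the flag is left alone.
        return False, (flag if keep_flag_on_invalid else "0")
    if refuse_if_running and flag == "1":
        # Change (1.): refuse to start while the flag says "running".
        return False, flag
    return True, "1"


# Restart after a hard kill: the uuid is valid, but the flag is stale at "1".
print(startup_decision(True, "1"))                          # behaviour described in this bug
print(startup_decision(True, "1", refuse_if_running=True))  # with change (1.)

# Disconnected machine while another dwhd is running (flag is "1").
print(startup_decision(False, "1"))                             # behaviour described in this bug
print(startup_decision(False, "1", keep_flag_on_invalid=True))  # with change (2.)
```

In this model change (1.) blocks the restart after a hard kill until the flag is reset manually, while change (2.) affects only the invalid-installation flow and leaves normal restarts as they are.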

Comment 9 Shirly Radco 2021-02-04 09:01:01 UTC
Since it is unlikely to hit this use case, I'll keep it as closed.
Please reopen if needed.

