Description of problem:

dwhd does not check whether DwhCurrentlyRunning in dwh_history_timekeeping is already set to 1 before starting. This can cause it to run in parallel on two separate machines. It happened to me by accident while verifying [1][2], which are related but do not cause (or solve) this problem, with this flow:

1. Set up engine+dwh on machine A.
2. Install dwh on machine B.
3. Run engine-setup on B. Accept to configure dwh.
4. When asked whether to disconnect dwh on machine A, stop dwh on A, and reply yes.
5. After engine-setup on machine B says:

  [INFO] Creating/refreshing DWH database schema

quickly start dwhd on machine A.

dwhd will successfully start on A, if you started it quickly enough - before this line appears in the setup log:

  Stage misc METHOD otopi.plugins.ovirt_engine_setup.ovirt_engine_dwh.core.single_etl.Plugin._misc

and will also be started successfully on machine B by engine-setup.

To reproduce this without relying on timing, you can just let engine-setup finish on B, then connect to the engine db and execute this, replacing MACHINE_A_UUID with machine A's uuid:

  update dwh_history_timekeeping set var_value='MACHINE_A_UUID' where var_name='dwhUuid';

You can find the uuid in /etc/ovirt-engine-dwh/ovirt-engine-dwhd.conf.d/10-setup-uuid.conf .

[1] https://gerrit.ovirt.org/111200
[2] https://gerrit.ovirt.org/111201

Version-Release number of selected component (if applicable):
Current 4.4, probably present for many versions

How reproducible:
Always

Steps to Reproduce:
See above

Actual results:
Both dwhd instances start

Expected results:
The one started by engine-setup on machine B should fail to start, emitting in the log something like: "dwhd is already running, perhaps on machine A. Aborting."

Additional info:
If we do fix the current bug, it will very likely also cause failures in "innocent" cases, such as a machine hard-reset, dwhd killed with SIGKILL, OOM (out-of-memory), etc. An alternative, better but more complex, fix is probably to use some postgresql lock or something similar.
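For the record, a minimal sketch of what the advisory-lock alternative could look like on dwhd's db connection at startup (the lock key 4242 is an arbitrary value chosen for illustration; dwhd does not do anything like this today):

  -- Try to take a session-level advisory lock; returns false ('f')
  -- if another session, e.g. dwhd on another machine, already holds it.
  SELECT pg_try_advisory_lock(4242);
  -- If it returns 'f', abort startup instead of running.
  -- The lock is released automatically when the session ends, so a
  -- hard reset, SIGKILL or OOM kill does not leave stale state behind,
  -- which is exactly the concern with the flag-based fix above.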
I believe that there is a very small chance for this corner case to happen for a customer.
(In reply to Shirly Radco from comment #1)
> I believe that there is a very small chance for this corner case to happen
> for a customer.

Deferring, as this is unlikely to happen.
This now happened to me again while working on bug 1894420, and I now understand that this isn't a timing issue, so it is somewhat more likely to happen. The flow was:

1. Install and set up engine+dwh on machine A.
2. Set up dwh on machine B. It asks whether to stop and disconnect dwh on machine A; I replied Yes. It stopped it on A successfully, then set up and started dwh on machine B.
3. On A: systemctl start ovirt-engine-dwhd (can happen also due to a reboot). This fails as expected, with the expected 'This installation is invalid' in the log. _But_: it also sets DwhCurrentlyRunning to 0, while dwh on B is still running (you can check this directly with the query below).
4. On A: engine-setup. It asks whether to disconnect dwh on B; I replied Yes. It did not try to stop it on B (as DwhCurrentlyRunning was 0), successfully finished setup, and started dwh on A. dwh on B was still running.

At this point, we have dwh running on both machines. If this is something which should not happen, we should reopen this bug.

The fix should be: if dwhd starts and sees that DwhCurrentlyRunning is 1, it should stop as it does now, with the error in the log, but _not_ set DwhCurrentlyRunning to 0.

Shirly - I think this is a rather simple fix, please reconsider.
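For completeness, the query referenced in step 3 (a sketch; both variable names already appear earlier in this bug):

  -- After step 3 above, this shows DwhCurrentlyRunning = 0 even though
  -- dwh on machine B is still running, which is what misleads
  -- engine-setup in step 4.
  select var_name, var_value
    from dwh_history_timekeeping
   where var_name in ('DwhCurrentlyRunning', 'dwhUuid');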
Re the solution and concern in my original request:

(In reply to Yedidyah Bar David from comment #0)
> Description of problem:
>
> dwhd does not check whether DwhCurrentlyRunning in dwh_history_timekeeping
> is already set to 1 before starting.

[snip]

> Additional info:
> If we do fix the current bug, it will very likely also cause failures in
> "innocent" cases, such as a machine hard-reset, dwhd killed with SIGKILL,
> OOM (out-of-memory), etc.

I still think it makes sense to make dwh refuse to start if DwhCurrentlyRunning is already 1, but it will indeed raise the above concern (a hard reset or kill will require manually setting it back to 0). If we do not want to "pay this price", we can still do the fix in comment 3: do not set it to 0 when we exit with the error "This installation is invalid".
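A minimal sketch of what the refuse-to-start check could look like, if done as a single atomic update on the existing table (illustration only, not the current dwhd code):

  -- Atomically claim the 'running' flag: updates a row only if the
  -- flag is currently 0.
  update dwh_history_timekeeping
     set var_value = '1'
   where var_name = 'DwhCurrentlyRunning'
     and var_value = '0';
  -- If 0 rows were updated, another dwhd already claimed the flag:
  -- log "dwhd is already running, perhaps on another machine" and abort.

Doing it as one conditional update avoids a read-then-write race between two dwhd instances starting at the same time.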
BTW: That said, I admit I do not know what the actual risk is in having two DWHDs running in parallel. I only know we spent quite some effort in the past on preventing this, but perhaps it's not a big risk.
Having two dwh instances running at the same time will corrupt the data in dwh, and it is critical that this does not happen.
The question is how likely it is to reach the described scenario.
(In reply to Shirly Radco from comment #7)
> The question is how likely it is to reach the described scenario.

Not very likely. But I think it should be easy to fix, so perhaps it is worth it.

There are two changes discussed here:

1. Make dwhd refuse to start if DwhCurrentlyRunning is 1.
2. Make dwhd not set DwhCurrentlyRunning to 0, if it is 1, in the flow where we exit with "This installation is invalid".

I see why (1.) might be considered risky, but (2.) IMO is not. So I vote for doing (2.), sketched below.
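A sketch of what (2.) could look like on the db side; THIS_MACHINE_UUID is a placeholder for the uuid from 10-setup-uuid.conf, and the real check would of course live in dwhd's exit path, not in SQL:

  -- Clear the 'running' flag only if this machine is still the
  -- registered dwh installation. In the "This installation is invalid"
  -- flow the uuid does not match, so the flag is left untouched and
  -- engine-setup on another machine can still see that dwh is running.
  update dwh_history_timekeeping
     set var_value = '0'
   where var_name = 'DwhCurrentlyRunning'
     and exists (select 1 from dwh_history_timekeeping
                 where var_name = 'dwhUuid'
                   and var_value = 'THIS_MACHINE_UUID');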
Since this use case is unlikely to be hit, I'll keep this bug closed. Please reopen if needed.