Bug 1878742
| Summary: | It is possible to run two dwhd on the same engine database | ||
|---|---|---|---|
| Product: | [oVirt] ovirt-engine-dwh | Reporter: | Yedidyah Bar David <didi> |
| Component: | ETL | Assignee: | Shirly Radco <sradco> |
| Status: | CLOSED DEFERRED | QA Contact: | Lucie Leistnerova <lleistne> |
| Severity: | medium | Docs Contact: | |
| Priority: | low | ||
| Version: | 4.4.0 | CC: | bugs, sradco |
| Target Milestone: | --- | Flags: | sbonazzo: planning_ack? sbonazzo: devel_ack+ sbonazzo: testing_ack? |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2020-09-17 15:37:40 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | Metrics | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Yedidyah Bar David
2020-09-14 12:43:00 UTC
I believe that there is a very small chance for this corner case to happen for a customer.

(In reply to Shirly Radco from comment #1)
> I believe that there is a very small chance for this corner case to happen
> for a customer.

Deferring, as this is unlikely to happen.

This now happened to me again while working on bug 1894420, and I now understand that this isn't a timing issue, so it is somewhat more likely to happen.

Flow was:

1. Install and set up engine+dwh on machine A.
2. Set up dwh on machine B. It asks whether to disconnect dwh on machine A, I replied Yes, it stopped it on A successfully and set up and started dwh on machine B.
3. On A: systemctl start ovirt-engine-dwhd (can also happen due to a reboot). This fails as expected, with the expected 'This installation is invalid' in the log, _But_: It also sets DwhCurrentlyRunning to 0. dwh on B is still running.
4. On A: engine-setup. It asks whether to disconnect dwh on B, I replied Yes. It didn't try to stop it on B (as DwhCurrentlyRunning was 0), successfully finished setup and started dwh on A. dwh on B was still running.

At this point, we have dwh running on both machines. If this is something which should not happen, we should reopen this bug.

The fix should be: If dwh starts, and sees that DwhCurrentlyRunning is 1, it should stop as it does now with the error in the log, but do _not_ set DwhCurrentlyRunning to 0.

Shirly - I think this is a rather simple fix, please reconsider.

Re the solution and concern in my original request:

(In reply to Yedidyah Bar David from comment #0)
> Description of problem:
>
> dwhd does not check if DwhCurrentlyRunning in dwh_history_timekeeping is
> already set to 1, and starts.

[snip]

> Additional info:
> If we do fix current bug, it will very likely cause it to fail also on
> "innocent" cases, such as a machine hard-reset or dwhd killed with SIGKILL,
> OOM (out-of-memory), etc.

I still think it makes sense to make dwh refuse to start if DwhCurrentlyRunning is already 1, but it will indeed cause the above concern (hard-killing/reboot will require manually setting it to 0). If we do not want to "pay this price", we can still do the fix in comment 3 - do not set it to 0 when we exit with the error "This installation is invalid".

BTW: That said, I admit I do not know what the actual risk is in having two DWHDs running in parallel. I only know we spent quite an effort in the past on preventing this, but perhaps it's not a big risk.

Having 2 dwh run at the same time will corrupt the data in dwh and it is critical that this does not happen. The question is how likely it is to reach the described scenario.

(In reply to Shirly Radco from comment #7)
> The question is how likely it is to reach the described scenario.

Not very likely. But I think it should be easy to fix, so perhaps worth it.

There are two changes discussed here:

1. Make dwhd refuse to start if DwhCurrentlyRunning is 1.
2. Make dwhd not set DwhCurrentlyRunning to 0, if it's 1, in the flow where we exit with "This installation is invalid".

I see why (1.) might be considered risky, but (2.) IMO is not. So I vote for doing (2.).

Since it is unlikely to hit this use case, I'll keep it as closed. Please reopen if needed.
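For reference, the four-step flow and the effect of change (2.) can be modelled in a few lines. This is only a rough sketch in Python, not the dwhd or engine-setup code: `EngineDb`, `registered_host`, `start_dwhd`, `engine_setup` and `reproduce` are all hypothetical names, and the way the "This installation is invalid" check and the remote stop are decided here is an assumption made purely to mirror the behaviour reported above.

```python
from dataclasses import dataclass


@dataclass
class EngineDb:
    """Toy stand-in for the relevant state in dwh_history_timekeeping."""
    dwh_currently_running: int = 0
    # Hypothetical field: which machine's dwhd the engine DB currently accepts.
    registered_host: str = ""


def start_dwhd(db, host, keep_flag_on_invalid):
    """Model of dwhd startup. Returns True if the daemon stays up."""
    if db.registered_host != host:
        # Models the exit with 'This installation is invalid'.
        if not keep_flag_on_invalid:
            # Behaviour reported in step 3: the flag is reset on exit.
            db.dwh_currently_running = 0
        return False
    db.dwh_currently_running = 1
    return True


def engine_setup(db, host, running):
    """Model of setup on `host`; `running` is the set of hosts running dwhd."""
    # Assumption: setup only tries to stop the other dwhd if the flag is 1.
    if db.dwh_currently_running == 1:
        running.discard(db.registered_host)
        db.dwh_currently_running = 0
    db.registered_host = host
    if start_dwhd(db, host, keep_flag_on_invalid=True):
        running.add(host)


def reproduce(keep_flag_on_invalid):
    """Replay steps 1-4 and return the hosts where dwhd ends up running."""
    db, running = EngineDb(), set()
    engine_setup(db, "A", running)            # step 1
    engine_setup(db, "B", running)            # step 2: A is stopped, B starts
    start_dwhd(db, "A", keep_flag_on_invalid)  # step 3: fails as expected
    engine_setup(db, "A", running)            # step 4
    return sorted(running)


if __name__ == "__main__":
    print("current behaviour:", reproduce(keep_flag_on_invalid=False))
    print("with change (2.): ", reproduce(keep_flag_on_invalid=True))
```

In this model, resetting the flag in step 3 is what lets step 4 skip stopping B, so both daemons end up running; leaving the flag untouched on the invalid-installation exit makes step 4 stop B first.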