Bug 1879032 - If there is no master storage domain, the engine should elect one
Summary: If there is no master storage domain, the engine should elect one
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.4.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ovirt-4.4.6
Target Release: 4.4.6.4
Assignee: shani
QA Contact: Amit Sharir
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-09-15 09:08 UTC by Yedidyah Bar David
Modified: 2022-08-11 01:59 UTC
CC: 7 users

Fixed In Version: ovirt-engine-4.4.6.4
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-05 05:36:36 UTC
oVirt Team: Storage
Embargoed:
pm-rhel: ovirt-4.4+
mavital: testing_plan_complete-


Attachments: none


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 113842 0 master MERGED core: elect a new master storage domain if there's no master 2021-04-07 11:18:49 UTC
oVirt gerrit 113860 0 master ABANDONED core: elect master domain if no master indicated while running 2021-03-16 13:11:48 UTC

Description Yedidyah Bar David 2020-09-15 09:08:20 UTC
Description of problem:

I am opening this bug as a duplicate of a subset of bug 1576923 comment 7.

The flow is:

1. Deploy hosted-engine in a way that causes its hosted_storage to be the master storage domain. Not sure how - I think this is (still) the default, as it's the first domain added to the engine.

2. Take a backup with engine-backup.

3. Restore on a new/clean machine with:
engine-backup --he-remove-storage-vm --mode=restore --provision-all-databases

and then engine-setup as usual.

Now, when the engine comes up, there is no master storage domain, as '--he-remove-storage-vm' removes it from the database.
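
For illustration, the missing master can be confirmed in the restored database with a query like the sketch below. The database name ("engine"), the storage_domain_static table, and the convention that storage_domain_type = 0 means the Master role are assumptions based on the ovirt-engine schema rather than a documented interface, so verify them against your version first; after a restore with '--he-remove-storage-vm', no row with type 0 is expected.

    # Hedged sketch: list domains and their role in the restored engine DB.
    # Assumes a local PostgreSQL with a database named "engine" and that
    # storage_domain_type = 0 denotes the Master role (schema assumption).
    sudo -u postgres psql engine -c "
        SELECT id, storage_name, storage_domain_type
        FROM storage_domain_static
        ORDER BY storage_domain_type;"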

This bug is about making the engine select some other storage domain as master, in this flow.

The request in bug 1576923 is more general and is generally harder/riskier, so it has been postponed for quite some time now. I hope the current subset is easier/safer to do - either in the engine itself, or perhaps in engine-backup - if the latter, please advise us (the integration team) on what exactly to do inside the database.

Version-Release number of selected component (if applicable):
Current master, I think always

How reproducible:
Not sure, I think always, reported once in bug 1576923 comment 7.

Steps to Reproduce:
1. See above.

Actual results:
No master storage domain.

Expected results:
Engine selects some other storage domain as master.

Additional info:

Comment 1 Nir Soffer 2020-09-15 09:30:30 UTC
It makes sense that the engine selects a new master. I wonder why it does not work
with the current code; probably this is a result of a bad database change.

Why did you use:

    '--he-remove-storage-vm'

And why did it remove the master from the database? I don't think this is a valid
operation. It sounds like a bad database change that the system is not ready
to handle yet.

Comment 2 Yedidyah Bar David 2020-09-15 10:06:59 UTC
(In reply to Nir Soffer from comment #1)
> It makes sense that the engine selects a new master. I wonder why it does not
> work with the current code; probably this is a result of a bad database
> change.
> 
> Why did you use:
> 
>     '--he-remove-storage-vm'
> 
> And why did it remove the master from the database?

It removes the engine VM and the hosted_storage domain.

It was added [1] for bug 1240466.

If you think this code is broken, please advise on what to do to fix it.

> I don't think this is a valid
> operation. It sounds like a bad database change that the system is not ready
> to handle yet.

What do you suggest as an alternative?

The specific flow this was used for, in the report that led to opening this bug, is migrating from a hosted-engine setup to a standalone/bare-metal one. Do you see a risk in this, for this flow?

Please remember that when doing this db manipulation, the engine is dead - the old engine should not be used anymore (or may not even exist anymore), and all we have is the db, which was restored from backup.

[1] https://gerrit.ovirt.org/#/q/Id61ae0b05a75018ded532d7a0c38c15b4b885803,n,z

Comment 3 Nir Soffer 2020-09-15 10:29:39 UTC
(In reply to Yedidyah Bar David from comment #2)
> (In reply to Nir Soffer from comment #1)

I don't think we support a system without a master domain. We have a way
to create a new master when the current master is not accessible, but the
engine is probably not ready to handle a state where there is no master
domain in the db.

I'm not sure how it can be done on the engine side; maybe Benny or Eyal can
recommend a correct way to remove the hosted storage domain.

With bug 1576923 we should have an easy way to select a new master
domain, which could then be used here.

Comment 4 shani 2020-09-16 09:17:06 UTC
I think this one was covered by Bella's fix for this bug: https://bugzilla.redhat.com/1836034.
With that fix, the hosted_storage domain can become the master.
It seems to be applicable from ovirt-engine-4.4.3.

The fix is available here: https://gerrit.ovirt.org/#/c/110718/

What do you think?

Comment 5 Nir Soffer 2020-09-16 09:22:12 UTC
(In reply to shani from comment #4)
> I think this one was covered by Bella's fix for this bug:
> https://bugzilla.redhat.com/1836034.
> With that fix, the hosted_storage domain can become the master.
> It seems to be applicable from ovirt-engine-4.4.3.

This may be the reason why this fails now.

If the hosted-engine storage domain cannot be the master, we can safely delete it
from the database.

Once this domain can be the master, we cannot delete it from the db without
setting another domain as master and updating the other domain's role
on storage. This is tricky since it cannot be done with a running SPM.

Comment 6 Yedidyah Bar David 2020-09-16 10:03:43 UTC
Is bug 1836034 about changing a non-master hosted_storage to master? Or is it also related to it being the master originally, when created by a new HE deployment? IIUC only the former - it has been the master, IIUC, since we moved to node-zero HE deployment, where it is created by the engine (and not by HE code calling vdsm code directly).

If I got that right, then bug 1836034 is not very relevant to the current bug, and the current bug is applicable since node-zero.

Comment 7 Shir Fishbain 2021-01-19 09:45:59 UTC
QE doesn't have enough capacity to verify this bug on the 4.4.5 release.

Comment 8 Amit Sharir 2021-04-22 14:16:06 UTC
The bug was verified on environment hosted-engine-11.

[root@hosted-engine-11 ~]# rpm -q ovirt-engine
ovirt-engine-4.4.6.3-0.8.el8ev.noarch

[root@oncilla05 ~]# rpm -q vdsm
vdsm-4.40.60.3-1.el8ev.x86_64

[root@hosted-engine-11 ~]# rpm -qa | grep release
rhv-release-4.4.6-4-001.noarch
redhat-release-8.4-0.6.el8.x86_64


Full procedure used in the bug verification flow (approved by Yedidyah Bar David); a condensed shell sketch follows the steps.


1. Deploy hosted-engine in a way that causes its hosted_storage to be the master storage domain.

2. On the engine, take a backup with engine-backup.

3. On the host, run: hosted-engine --set-maintenance --mode=global

4. On the engine, run: engine-backup --he-remove-storage-vm --mode=restore --provision-all-databases --file=/var/lib/ovirt-engine-backup/ovirt-engine-backup-20210422144955.backup

5. On the engine, run: engine-cleanup

6. On the engine, run: engine-setup (yes to all options)

7. In order to check the new status in the UI, run on the engine: hosted-engine --vm-start (before entering the UI, check that it is up and running using: hosted-engine --vm-status)
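
For reference, here is the same procedure condensed into a shell sketch. It mirrors the steps as written above (the backup file name is the one from step 4); it is not an officially documented flow, and engine-backup may require additional options such as --log depending on the version.

    # On the engine: take a backup (step 2); the file path is illustrative.
    engine-backup --mode=backup --file=/var/lib/ovirt-engine-backup/ovirt-engine-backup-20210422144955.backup

    # On the host: enter global maintenance (step 3).
    hosted-engine --set-maintenance --mode=global

    # On the engine: restore without the HE storage domain/VM, then clean up
    # and re-run setup (steps 4-6).
    engine-backup --he-remove-storage-vm --mode=restore --provision-all-databases \
        --file=/var/lib/ovirt-engine-backup/ovirt-engine-backup-20210422144955.backup
    engine-cleanup
    engine-setup

    # Check that the engine VM is up before opening the UI (step 7).
    hosted-engine --vm-start
    hosted-engine --vm-status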



Verification Summary and conclusions - 

After completing the flow mentioned above, the UI showed the correct status - the engine selected another storage domain as master (approved by Yedidyah Bar David).
The expected and actual outputs matched and were correct.
Bug verified.

Comment 9 Sandro Bonazzola 2021-05-05 05:36:36 UTC
This bugzilla is included in the oVirt 4.4.6 release, published on May 4th 2021.

Since the problem described in this bug report should be resolved in oVirt 4.4.6 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

Comment 10 meital avital 2022-07-20 09:41:24 UTC
Due to QE capacity constraints, we are not going to cover this issue in our automation.

