Bug 1662657 - Restore SHE environment on iscsi fails - KeyError: 'available'
Summary: Restore SHE environment on iscsi fails - KeyError: 'available'
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-hosted-engine-setup
Version: 4.4.3
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ovirt-4.4.7
Target Release: ---
Assignee: Yedidyah Bar David
QA Contact: Petr Matyáš
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-12-31 10:06 UTC by Liran Rotenberg
Modified: 2021-07-22 15:09 UTC (History)
CC: 17 users

Fixed In Version: ovirt-hosted-engine-setup-2.5.1-1.el8ev
Doc Type: Bug Fix
Doc Text:
One of the steps during hosted-engine deployment is "Activate Storage Domain". Normally, this step returns the amount of available space in the domain. Under certain conditions, this information is missing. In previous releases, if the available space was missing, deployment failed. With this release, deployment will provide an error message and allow you to provide the details needed for creating a storage domain. This issue appears to affect deployments using '--restore-from-file', when the existing setup has problematic block storage (iSCSI or Fiber Channel). If this happens, it is recommended that you connect to the Administration Portal and clear all storage-related issues before continuing.
Clone Of:
Environment:
Last Closed: 2021-07-22 15:08:31 UTC
oVirt Team: Integration
Target Upstream Version:


Attachments
sosreport-2nd try (16.15 MB, application/x-xz)
2019-01-09 09:35 UTC, Liran Rotenberg
sosreport-different storages (10.57 MB, application/x-xz)
2019-01-14 07:21 UTC, Liran Rotenberg


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 5895721 0 None None None 2021-03-22 01:57:23 UTC
Red Hat Product Errata RHBA-2021:2864 0 None None None 2021-07-22 15:09:30 UTC
oVirt gerrit 114424 0 master MERGED Consider missing 'available' as broken storage 2021-04-27 07:07:43 UTC

Description Liran Rotenberg 2018-12-31 10:06:36 UTC
Description of problem:
Restoring a backup of a SHE environment on iSCSI failed with:
KeyError: 'available'

2018-12-29 20:39:59,954+0200 DEBUG otopi.context context._executeMethod:143 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line 133, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/gr-he-ansiblesetup/core/storage_domain.py", line 770, in _closeup
    ] = int(storage_domain['available'])/1024/1024/1024
KeyError: 'available'
2018-12-29 20:39:59,984+0200 ERROR otopi.context context._executeMethod:152 Failed to execute stage 'Closing up': 'available'
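The traceback shows the failing line indexing `storage_domain['available']` directly and dividing the byte count down to GiB, so a missing key raises KeyError. A minimal sketch of a defensive version of that computation (names are illustrative, not the actual ovirt-hosted-engine-setup code):

```python
def available_gib(storage_domain):
    """Return the domain's available space in GiB, or None if the
    engine omitted the 'available' field (the failure in this bug)."""
    available = storage_domain.get('available')  # dict.get avoids KeyError
    if available is None:
        return None
    return int(available) // (1024 ** 3)
```

With the value seen later in the logs (133143986176 bytes), this yields 124 GiB; with the dict from the failing run, it returns None instead of crashing.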


Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.2.19-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.32-1.el7ev.noarch
rhvm-appliance-4.2-20181212.0.el7.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
python-ovirt-engine-sdk4-4.2.9-1.el7ev.x86_64
ansible-2.7.5-1.el7ae.noarch
otopi-1.7.8-1.el7ev.noarch

How reproducible:


Steps to Reproduce:
1. Redeploy on iSCSI from a vintage environment, where the host is a non-SPM host with spm_id != 1, power management is configured, and some hosts are unreachable

Actual results:
Restore fails.

Expected results:
Restore should succeed.

Additional info:
Normal deployment on the iSCSI storage works.
Didi pointed out it might be a problem in ansible/ovirt-sdk and not in ovirt-hosted-engine-setup.
The available space was there when the storage domain was created.

Comment 3 Yedidyah Bar David 2019-01-07 06:45:26 UTC
Liran let me look at the machine, and this is what I found:

We create the storage domain, in create_storage_domain.yml, 'name: Add iSCSI storage domain', and register the result in otopi_storage_domain_details_iscsi.

Later in the same file, we 'name: Activate storage domain', and register the result in otopi_storage_domain_details.

The logs showed that otopi_storage_domain_details_iscsi did have a key 'available', but otopi_storage_domain_details did not. It is not clear why, who populates it, or when. It might be due to a bug specific to iSCSI, in ansible, the engine, or the SDK, or it might be a timing issue.

I compared this to a different system that went through a somewhat different flow and used NFS. There, both otopi_storage_domain_details_nfs and otopi_storage_domain_details included 'available'.

As I told Liran in private, if we fail to find and fix the root cause, we can use the value from otopi_storage_domain_details_iscsi, but IMO this is a hack, and might hide a more important bug that we do want to find and fix.
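The fallback described above (reusing the creation-time value if the activation result lacks it) could look roughly like this; a hypothetical sketch, not the actual playbook/otopi code:

```python
def merge_details(activation_details, creation_details):
    """If the post-activation storage domain dict lacks 'available',
    fall back to the value captured when the domain was created.
    As noted above, this is a hack that may hide the root cause."""
    if 'available' in activation_details:
        return activation_details
    merged = dict(activation_details)
    merged['available'] = creation_details['available']
    return merged
```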

Comment 4 Sandro Bonazzola 2019-01-07 10:04:57 UTC
Moving to ovirt-engine under the storage team. According to the REST API guide the 'available' field is mandatory, and here it is missing.
Looks like a bug in the SDK answer from the engine side.

Comment 5 Simone Tiraboschi 2019-01-07 10:16:39 UTC
(In reply to Sandro Bonazzola from comment #4)
> Moving to ovirt-engine under storage team. According to REst API guide the
> available field is mandatory and here it's missing.
> Looks like a bug in the SDK answer from engine side.

Yes, I agree.

In the logs of the failed attempt we have:

2018-12-29 20:39:58,456+0200 DEBUG var changed: host "localhost" var "otopi_storage_domain_details" type "<type 'dict'>" value: "{
    "changed": true,
    "deprecations": [
        {
            "msg": "The 'ovirt_storage_domains' module is being renamed 'ovirt_storage_domain'",
            "version": 2.8
        }
    ],
    "diff": {
        "after": {},
        "before": {}
    },
    "failed": false,
    "id": "1339b4f1-4d8b-4d2c-9aa7-8ad6e3e256c8",
    "storagedomain": {
        "backup": false,
        "comment": "",
        "committed": 0,
        "critical_space_action_blocker": 5,
        "data_center": {
            "href": "/ovirt-engine/api/datacenters/9c5a9e9a-0841-11e9-b3b7-001a4a16109f",
            "id": "9c5a9e9a-0841-11e9-b3b7-001a4a16109f"
        },
        "data_centers": [
            {
                "href": "/ovirt-engine/api/datacenters/9c5a9e9a-0841-11e9-b3b7-001a4a16109f",
                "id": "9c5a9e9a-0841-11e9-b3b7-001a4a16109f"
            }
        ],
        "description": "",
        "discard_after_delete": true,
        "disks": [],
        "external_status": "ok",
        "href": "/ovirt-engine/api/datacenters/9c5a9e9a-0841-11e9-b3b7-001a4a16109f/storagedomains/1339b4f1-4d8b-4d2c-9aa7-8ad6e3e256c8",
        "id": "1339b4f1-4d8b-4d2c-9aa7-8ad6e3e256c8",
        "master": false,
        "name": "hosted_storage",
        "status": "active",
        "storage": {
            "type": "iscsi",
            "volume_group": {
                "id": "7cX4gk-zIyH-wtY1-SKJA-O0tf-CGpV-fByIJ8",
                "logical_units": [
                    {
                        "address": "10.35.146.1",
                        "discard_max_size": 16777216,
                        "discard_zeroes_data": false,
                        "id": "360002ac000000000000000bd00021f6b",
                        "lun_mapping": 1,
                        "paths": 0,
                        "port": 3260,
                        "portal": "10.35.146.1:3260,21",
                        "product_id": "VV",
                        "serial": "S3PARdataVV_CZ3836C3RB",
                        "size": 139586437120,
                        "storage_domain_id": "1339b4f1-4d8b-4d2c-9aa7-8ad6e3e256c8",
                        "target": "iqn.2000-05.com.3pardata:20210002ac021f6b",
                        "vendor_id": "3PARdata",
                        "volume_group_id": "7cX4gk-zIyH-wtY1-SKJA-O0tf-CGpV-fByIJ8"
                    }
                ]
            }
        },
        "storage_format": "v4",
        "supports_discard": true,
        "supports_discard_zeroes_data": false,
        "type": "data",
        "warning_low_space_indicator": 10,
        "wipe_after_delete": false
    }
}"

and indeed in this case 'available' is not present in 'storagedomain' dict.

According to the ansible module doc:
https://github.com/ansible/ansible/blob/devel/lib/ansible/modules/cloud/ovirt/ovirt_storage_domain.py#L288

the storagedomain dict in the returned value should contain the attributes documented in
http://ovirt.github.io/ovirt-engine-api-model/master/#types/storage_domain
which include 'available', but, at least on that specific occasion, it was missing.

Comment 6 Simone Tiraboschi 2019-01-07 10:17:37 UTC
Ondra, can you please take a look as well?

Comment 7 Simone Tiraboschi 2019-01-07 11:40:04 UTC
The 'used' value was also missing, while it was expected.

Comment 8 Eyal Shenitzky 2019-01-07 12:05:51 UTC
When running a simple GET request, the 'available' field does exist, both for a single storage domain and when fetching all the storage domains:

- http://localhost:8080/ovirt-engine/api/storagedomains/456
- http://localhost:8080/ovirt-engine/api/storagedomains

The 'available' field also exists for a data-center storage domain:

- http://localhost:8080/ovirt-engine/api/datacenters/123/storagedomains
- http://localhost:8080/ovirt-engine/api/datacenters/123/storagedomains/456

So I believe the problem is not in the engine REST-API but somewhere else.

Comment 9 Simone Tiraboschi 2019-01-07 12:10:25 UTC
(In reply to Eyal Shenitzky from comment #8)
> The 'available' field also exists for a data-center storage domain:

Eyal, the issue is that we normally get that value, but on that specific execution (I'm not sure it's systematically reproducible) we got all the expected values except 'available' and 'used'.

Comment 10 Eyal Shenitzky 2019-01-07 12:35:13 UTC
Also in the same failing log (ovirt-hosted-engine-setup-ansible-create_storage_domain-20181229203827-ej5liy.log),
we can see a call to get the storage domain, and the storage domain object does contain
the missing fields ('used' and 'available'):

2018-12-29 20:39:05,863+0200 DEBUG var changed: host "localhost" var "ovirt_storage_domains" type "<type 'list'>" value: "[
    {
        "available": 133143986176, 
        "backup": false, 
        "comment": "", 
        "committed": 0, 
        "critical_space_action_blocker": 5, 
        "description": "", 
        "discard_after_delete": true, 
        "disk_profiles": [], 
        "disk_snapshots": [], 
        "disks": [], 
        "external_status": "ok", 
        "href": "/ovirt-engine/api/storagedomains/1339b4f1-4d8b-4d2c-9aa7-8ad6e3e256c8", 
        "id": "1339b4f1-4d8b-4d2c-9aa7-8ad6e3e256c8", 
        "master": false, 
        "name": "hosted_storage", 
        "permissions": [], 
        "status": "unattached", 
        "storage": {
            "type": "iscsi", 
            "volume_group": {
                "id": "7cX4gk-zIyH-wtY1-SKJA-O0tf-CGpV-fByIJ8", 
                "logical_units": [
                    {
                        "address": "10.35.146.1", 
                        "discard_max_size": 16777216, 
                        "discard_zeroes_data": false, 
                        "id": "360002ac000000000000000bd00021f6b", 
                        "lun_mapping": 1, 
                        "paths": 0, 
                        "port": 3260, 
                        "portal": "10.35.146.1:3260,21", 
                        "product_id": "VV", 
                        "serial": "S3PARdataVV_CZ3836C3RB", 
                        "size": 139586437120, 
                        "storage_domain_id": "1339b4f1-4d8b-4d2c-9aa7-8ad6e3e256c8", 
                        "target": "iqn.2000-05.com.3pardata:20210002ac021f6b", 
                        "vendor_id": "3PARdata", 
                        "volume_group_id": "7cX4gk-zIyH-wtY1-SKJA-O0tf-CGpV-fByIJ8"
                    }
                ]
            }
        }, 
        "storage_connections": [], 
        "storage_format": "v4", 
        "supports_discard": true, 
        "supports_discard_zeroes_data": false, 
        "templates": [], 
        "type": "data", 
        "used": 5368709120, 
        "vms": [], 
        "warning_low_space_indicator": 10, 
        "wipe_after_delete": false
    }
]"
2018-12-29 20:39:05,863+0200 INFO ansible ok {'status': 'OK', 'ansible_task': u'Get storage domain details', 'ansible_host': u'localhost', 'ansible_playbook': u'/usr/share/ovirt-hosted-engine-setup/ansible/create_storage_domain.yml', 'ansible_type': 'task'}


We can also see from the log that the playbook tries to remove this storage domain because of insufficient size:


2018-12-29 20:39:12,472+0200 INFO ansible ok {'status': 'OK', 'ansible_task': u'Get required size', 'ansible_host': u'localhost', 'ansible_playbook': u'/usr/share/ovirt-hosted-engine-setup/ansible/create_storage_domain.yml', 'ansible_type': 'task'}

2018-12-29 20:39:14,256+0200 INFO ansible task start {'status': 'OK', 'ansible_task': u'Remove unsuitable storage domain', 'ansible_playbook': u'/usr/share/ovirt-hosted-engine-setup/ansible/create_storage_domain.yml', 'ansible_type': 'task'}
2018-12-29 20:39:14,256+0200 DEBUG ansible on_any args TASK: Remove unsuitable storage domain kwargs is_conditional:False 
2018-12-29 20:39:14,823+0200 DEBUG var changed: host "localhost" var "remove_storage_domain_details" type "<type 'dict'>" value: "{
    "changed": false, 
    "skip_reason": "Conditional result was False", 
    "skipped": true
}"
2018-12-29 20:39:14,823+0200 INFO ansible skipped {'status': 'SKIPPED', 'ansible_task': u'Remove unsuitable storage domain', 'ansible_host': u'localhost', 'ansible_playbook': u'/usr/share/ovirt-hosted-engine-setup/ansible/create_storage_domain.yml', 'ansible_type': 'task'}

The removal of the domain looks like it failed.

I think this might be a good place to look.

Comment 11 Simone Tiraboschi 2019-01-07 13:09:51 UTC
(In reply to Eyal Shenitzky from comment #10)
> Also in the same failing log
> (ovirt-hosted-engine-setup-ansible-create_storage_domain-20181229203827-
> ej5liy.log),
> we can see a call to get the storage domain, the storage domain object does
> contain 
> the missing fields (used and available):

The flow is:
1. create a detached storage domain
2. check its properties
3. eventually remove if not suitable
4. attach it to the selected datacenter

> 
> 2018-12-29 20:39:05,863+0200 DEBUG var changed: host "localhost" var
> "ovirt_storage_domains" type "<type 'list'>" value: "[
>     {
>         "available": 133143986176, 

After step 1 we have available and used (at least in this reproduction).

> We can also see from the log that the playbook tries to remove this storage
> domain because of insufficient size:

No, we skipped this step since the SD was fine for us, indeed:


> "remove_storage_domain_details" type "<type 'dict'>" value: "{
>     "changed": false, 
>     "skip_reason": "Conditional result was False", 
>     "skipped": true
> }"
> 2018-12-29 20:39:14,823+0200 INFO ansible skipped {'status': 'SKIPPED',

and so we moved to the next step which is attaching it to the selected data-center.

So it called the ovirt_storage_domains ansible module another time to attach it to the relevant data-center.
In the reported case, on this specific step, we got a storagedomain dict that was missing the 'available' value, and so ovirt-hosted-engine-setup failed when accessing it.

Comment 12 Liran Rotenberg 2019-01-09 09:35:59 UTC
Created attachment 1519394 [details]
sosreport-2nd try

I tried to do the same scenario.
The restore did succeed this time, but it took ~2 hours.
This is much longer than expected.
The long wait was on the task: copy local vm disk to shared storage.
A sosreport of the host used is attached.

An additional note: the vintage deployment is on an iSCSI target, and the restore is to the same target on a new LUN (the old one is still there, untouched).

Vintage deployment (LUN1), added NFS (master storage).
Restore on LUN2 (new LUN).

Comment 13 Liran Rotenberg 2019-01-14 07:21:56 UTC
Created attachment 1520464 [details]
sosreport-different storages

I tried once more, this time using two different storages: one for the deployment, the other for the restore. This time, the restore passed as expected.

Comment 15 Tal Nisan 2019-03-01 03:40:17 UTC
According to comments 12 & 13, it seems the issue is fixed; please reopen if you encounter it again

Comment 16 Germano Veit Michel 2021-03-22 01:53:19 UTC
Re-opening, this was hit on 4.4, during upgrade from 4.3.

vdsm-4.40.40-1.el8ev.x86_64
rhvm-appliance-4.4-20201117.0.el8ev.x86_64

It seems the engine does not return the 'available' space for the SD after the deployment attaches it to the pool. It did return 'available' on SD creation, but after attaching to the pool it did not. This may happen when there are many hosts in the environment and the deploy host is not the SPM; it seems to take a while for the SD to really come up. I'll add details below.

Comment 20 Eyal Shenitzky 2021-03-22 06:58:16 UTC
Martin, it looks like an issue with otopi.
Can you please have a look?

Comment 21 Yedidyah Bar David 2021-04-04 12:59:43 UTC
Based on the information provided by Germano (thanks!), I agree it looks like a timing issue, where his specific case is due to temporary problems. For this specific case, I think a minimal reproducer is something like:

1. Deploy HE on some host+storage, take a backup
2. Create new hosted storage
3. Prevent access from the host to the new storage
4. Install a new host and do let it access the new storage
5. Start Deploy+Restore HE on the new host + new storage
6. Only after 'Activate storage domain' finishes, let the old host access the new storage.

The problem might have been that 'Activate storage domain' succeeded even though some hosts (the SPM?) still didn't manage to access the new storage (and thus report 'available'?).

Eyal: Should 'Activate storage domain' have failed in this case (or at least waited until success or some timeout)?

If Storage team decides this is "working as designed" (meaning, 'Activate storage domain' should not fail), we have to consider how to continue. Some thoughts:

1. We can run 'Activate storage domain' in a loop until the results include 'available'

2. Perhaps less intrusive, we can add another task that runs ovirt_storage_domain_info in a loop until the results include 'available', and export that to otopi as otopi_storage_domain_details (instead of the result of 'Activate storage domain').

3. In principle we can also just export to otopi the existing 'storage_domain_details.ovirt_storage_domains[0]' (as 'otopi_storage_domain_details') - we already know for sure that it had 'available' at that point, because we condition on this in 'Check storage domain free space'. This will make the flow in Germano's case finish this part of the code faster (not waiting for the engine), but might make things fail later on, if there indeed is/was a real (even if temporary) problem.
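Option 2 above amounts to polling until 'available' appears. A minimal sketch in Python (in practice this would be an Ansible task using ovirt_storage_domain_info with retries/until; fetch_domain here is a hypothetical callable returning the storage-domain dict):

```python
import time

def wait_for_available(fetch_domain, timeout=300.0, interval=10.0):
    """Poll fetch_domain() until its result includes 'available',
    or raise TimeoutError after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        domain = fetch_domain()
        if 'available' in domain:
            return domain
        if time.monotonic() >= deadline:
            raise TimeoutError("storage domain never reported 'available'")
        time.sleep(interval)
```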

Comment 22 Eyal Shenitzky 2021-04-08 05:54:16 UTC
(In reply to Yedidyah Bar David from comment #21)
> Based on the information provided by Germano (thanks!), I agree it looks
> like a timing issue, where his specific case is due to temporary problems.
> For this specific case, I think a minimal reproducer is something like:
> 
> 1. Deploy HE on some host+storage, take a backup
> 2. Create new hosted storage
> 3. Prevent access from the host to the new storage
> 4. Install a new host and do let it access the new storage
> 5. Start Deploy+Restore HE on the new host + new storage
> 6. Only after 'Activate storage domain' finishes, let the old host access
> the new storage.
> 
> The problem might have been that 'Activate storage domain' succeeded even
> though some hosts (the SPM?) still didn't manage to access the new storage
> (and thus report 'available'?).
> 
> Eyal: Should have 'Activate storage domain' failed, in this case (or at
> least wait until success or some timeout)?

'Activate storage domain' will connect all the hosts in UP state in the pool to the storage domain server;
if it fails to do so, the command is expected to fail and abort the storage domain activation.

> 
> If Storage team decides this is "working as designed" (meaning, 'Activate
> storage domain' should not fail), we have to consider how to continue. Some
> thoughts:
> 
> 1. We can run 'Activate storage domain' in a loop until the results include
> 'available'
> 
> 2. Perhaps less intrusive, we can add another task that runs
> ovirt_storage_domain_info in a loop until the results includes 'available',
> and export that to otopi as otopi_storage_domain_details (instead of the
> result of 'Activate storage domain').
> 
> 3. In principle we can also just export to otopi the existing
> 'storage_domain_details.ovirt_storage_domains[0]' (as
> 'otopi_storage_domain_details') - we already know for sure that it had
> 'available' at that point, because we condition on this in 'Check storage
> domain free space'. This will make the flow in Germano's case finish this
> part of the code faster (not waiting for the engine), but might make things
> fail later on, if there indeed is/was a real (even if temporary) problem.

Comment 23 Yedidyah Bar David 2021-04-08 10:11:51 UTC
(In reply to Eyal Shenitzky from comment #22)
> (In reply to Yedidyah Bar David from comment #21)
> > Based on the information provided by Germano (thanks!), I agree it looks
> > like a timing issue, where his specific case is due to temporary problems.
> > For this specific case, I think a minimal reproducer is something like:
> > 
> > 1. Deploy HE on some host+storage, take a backup
> > 2. Create new hosted storage
> > 3. Prevent access from the host to the new storage
> > 4. Install a new host and do let it access the new storage
> > 5. Start Deploy+Restore HE on the new host + new storage
> > 6. Only after 'Activate storage domain' finishes, let the old host access
> > the new storage.
> > 
> > The problem might have been that 'Activate storage domain' succeeded even
> > though some hosts (the SPM?) still didn't manage to access the new storage
> > (and thus report 'available'?).
> > 
> > Eyal: Should have 'Activate storage domain' failed, in this case (or at
> > least wait until success or some timeout)?
> 
> Activate storage domain will connect all the hosts in UP state in the pool
> to the storage domain server,
> if it failed to do so, the command expected to fail and abort the storage
> domain activation.

Understood, but in this case, this isn't what happened. Activation did succeed,
although one UP host (the old one) didn't have access to it. Is this a bug?

Comment 25 Eyal Shenitzky 2021-04-08 10:42:11 UTC
> Understood, but in this case, this isn't what happened. Activation did
> succeed,
> although one UP host (the old one) didn't have access to it. Is this a bug?

No, the engine will connect only the hosts that are in UP state.

Comment 26 Yedidyah Bar David 2021-04-08 13:49:30 UTC
Eyal, please let me clarify: The question isn't what hosts should be connected, or what the engine should do in this case.

The question is:

The engine did Activate the storage domain, but did not report the available space, in the result of the "Activate storage domain" task [1]. Is that ok? Or is this a bug?

The doc [1] for this module says that the return value should include "storage_domain", which is a "Dictionary of all the storage domain attributes", and links to [2] for the list of attributes. This list includes 'available', and does not say it is conditional, or might be missing, or anything like that. So from my POV, if "Activate Storage Domain" finished successfully but didn't return 'available', that's a bug. So: If you agree that it's a bug (in the engine, sdk/api/ansible, whatever), let's fix it. If you claim it's not, let's consider this a doc bug and update [2].

So, what do you think? Is it ok that it didn't return 'available'?

All the rest of the discussion was about *why* it didn't return 'available' (storage issues, whatever), but that's unrelated to above question (only to the fix, if it's a bug, perhaps).

[1] https://docs.ansible.com/ansible/latest/collections/ovirt/ovirt/ovirt_storage_domain_module.html#return-storage_domain

[2] http://ovirt.github.io/ovirt-engine-api-model/master/#types/storage_domain

Comment 27 Eyal Shenitzky 2021-04-11 06:52:36 UTC
(In reply to Yedidyah Bar David from comment #26)
> Eyal, please let me clarify: The question isn't what hosts should be
> connected, or what the engine should do in this case.
> 
> The question is:
> 
> The engine did Activate the storage domain, but did not report the available
> space, in the result of the "Activate storage domain" task [1]. Is that ok?
> Or is this a bug?
> 
> The doc [1] for this module says that the return value should include
> "storage_domain" which is "Dictionary of all the storage domain attributes"
> and links at [2] for the list of attributes. This list includes 'available',
> and does not say it's conditional, or might be missing, or anything like
> that. So from my POV, if "Activate Storage Domain" finished successfully but
> didn't return 'available', that's a bug. So: If you agree that it's a bug
> (in the engine, sdk/api/ansible, whatever), let's fix it. If you claim it's
> not, let's consider this a doc bug and update [2].
> 
> So, what do you think? Is it ok that it didn't return 'available'?


A storage domain should return the available space when it is in UP state.
There is no logic that should prevent that info under some conditions, but if the information returned from the host doesn't contain that field, it will not be included.

Is this reproduced only on block-based storage domains?
Maybe waiting for another cycle of the engine fetching the storage status would solve it, and the information will be there.

Comment 28 Yedidyah Bar David 2021-04-11 07:15:02 UTC
(In reply to Eyal Shenitzky from comment #27)
> 
> 
> Storage domain should return the available space when it is in UP state.
> There is no logic that should prevent that info in some conditions, but if
> the information that returned from the
> host doesn't contain that field it will not be included.

I think that's exactly what happened. So I understand this isn't considered a bug.

> 
> Is this reproduced only in block-based storage domains?

No idea, I haven't yet tried to reproduce it myself at all (neither block nor file).

> Maybe if we will wait for a retry of another cycle of fetching the storage
> status by the engine will solve it and the information
> will be there.

Yes, that's my plan, see the latter part of comment 21. You are also welcome to comment about that one. I think (2.) there makes the most sense.

Comment 29 Yedidyah Bar David 2021-04-11 07:21:01 UTC
Germano, what do you think should Severity/Priority/Target version be?

I do not think that the flow is very likely (and indeed we haven't received many reports so far), and what actually happens is not that bad (no corruption or anything), but it does also apply to restores, which are always critical and sensitive.

If you have clear reproduction steps, I think it can be included in 4.4.6.

Comment 30 Eyal Shenitzky 2021-04-11 11:25:12 UTC
> Yes, that's my plan, see the latter part of comment 21. You are also welcome
> to comment about that one. I think (2.) there makes most sense.

Agreed, option number 2 sounds OK.

Comment 31 Germano Veit Michel 2021-04-11 22:24:06 UTC
Hi Didi and Eyal,

(In reply to Yedidyah Bar David from comment #29)
> Germano, what do you think should Severity/Priority/Target version be?

I think Sev3/Medium is fine, as we know how to work around this: an ugly
playbook edit on the deploy host.

For target, that's more of a question to engineering depending on resources.

> I do not think that the flow is very reasonable (and the fact is, we didn't
> receive many reports so far), and that what actually happens is not that bad
> (no corruption or anything), but it does apply also to restores, which are
> always critical and sensitive.
> 
> If you have clear reproduction steps, I think it can be included in 4.4.6.

No, no clear reproduction steps. My understanding is that the activation
command returns some SD info, but that SD info is sometimes missing the space fields.
If a new query is done shortly after, the space is there. The activation command
completed and returned incomplete info with it; maybe it is being collected too
early in the activation, and not after the activation is complete.

This means I agree with option 2:

> 2. Perhaps less intrusive, we can add another task that runs ovirt_storage_domain_info in a loop until the results includes 'available', and export that to otopi as otopi_storage_domain_details (instead of the result of 'Activate storage domain').

I don't even think you have to loop; as long as we get the info from a GET on
the SD, it should be there. The problem is using the info returned from activation.

It feels like a small engine/sdk bug to me, but option 2 solves the problem,
and tracking this down in the engine would probably be hard to verify.

Thanks!

Comment 33 Yedidyah Bar David 2021-04-22 14:22:44 UTC
I spent quite some time trying to reproduce, without success so far, and decided to give up.

The linked patch just adds "and we got 'available'" to an existing 'if', without adding any extra code.
If we did not get 'available', it will behave just like other cases handled in this 'if', and the user will simply be prompted with:

    Please specify the storage you would like to use (glusterfs, iscsi, fc, nfs)[nfs]:

If the deploy process so far didn't emit any other error (which seems to be the case, based on the attached logs), this will be a bit confusing, but hopefully better than crashing and requiring starting the deploy from scratch.
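The patched condition can be sketched as follows (a paraphrase of the linked patch's idea, not its literal code; names are illustrative):

```python
def storage_looks_usable(domain_details):
    """Treat a missing 'available' key like any other broken-storage
    condition, so setup re-prompts for the storage type instead of
    crashing with KeyError (the gist of patch 114424)."""
    storagedomain = domain_details.get('storagedomain')
    return bool(storagedomain) and 'available' in storagedomain
```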

If this is not enough, we need more detailed reproduction details.

Comment 34 Yedidyah Bar David 2021-04-22 14:24:28 UTC
I don't mind adding another generic text just so that the user will understand why we ask again. "There was some problem with the storage, please try again:".

Comment 35 Yedidyah Bar David 2021-04-27 06:03:05 UTC
Current status of this bug:

- The pending patch 114424 now also emits:
[ERROR] There was some problem with the storage domain, please try again

This might be the only [ERROR], in cases like the currently-reported one.
It then prompts again for the storage credentials, just as it already
did if the playbook failed.

- I failed to reproduce, after spending several days on this, and have now
given up. I verified the (very simple) patch by adding another patch on top of it
that removes 'available' from the playbook's result.

- I'd still consider this a bug in vdsm, because as far as I can tell
from the logs, it is due to vdsm checking the status of several VGs in a single
command and marking as Stale one for which the 'vgs' command didn't report an
error. A potential fix is to run a loop of 'vgs' commands, one per VG. I am not sure
this would not introduce other issues (other than being slower). But I am not
opening a bug right now because we can't reproduce.

- In any case, if anyone runs into the specific issue of current bug, which is
that otopi_storage_domain_details does not include 'available' although
otopi_storage_domain_details_fc or otopi_storage_domain_details_iscsi do include
it, please file a new bug, and please attach logs from all relevant machines -
including the engine, the host you deploy/restore on, the SPM, and probably from
all the hosts mentioned in the engine log around the relevant time.

- QE: I suggest just trying to deploy with broken storage and/or active
hosts doing storage operations, e.g. some variation on comment 21.

Comment 38 Petr Matyáš 2021-06-18 13:10:16 UTC
Verified on ovirt-hosted-engine-setup-2.5.1-1.el8ev.noarch

When the new iSCSI HE domain is, upon restore, accessible only from the restoring host, the setup fails with a meaningful message, a retry on SD selection is presented, and retrying with added mappings on the same domain then succeeds.

Comment 45 errata-xmlrpc 2021-07-22 15:08:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV RHEL Host (ovirt-host) [ovirt-4.4.7]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2864

