Bug 1485883 - hosted engine agent is not able to refresh hosted engine status when iso domain is not available after network outage [NEEDINFO]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-hosted-engine-ha
Version: 4.0.7
Hardware: Unspecified
OS: Unspecified
urgent
high
Target Milestone: ovirt-4.2.0
Assignee: Martin Sivák
QA Contact: Artyom
URL:
Whiteboard:
Duplicates: 1443156 1538639
Depends On:
Blocks: 1516203
 
Reported: 2017-08-28 11:04 UTC by Marian Jankular
Modified: 2021-03-11 17:13 UTC
CC: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1516203
Environment:
Last Closed: 2018-05-15 17:32:29 UTC
oVirt Team: SLA
Target Upstream Version:
bgraveno: needinfo? (msivak)


Attachments
agent, broker and vdsm logs from both hosts (2.24 MB, application/zip)
2017-10-16 16:33 UTC, Artyom
agent, broker and vdsm logs from both hosts(DEBUG) (4.13 MB, application/zip)
2017-10-18 15:22 UTC, Artyom


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3337941 0 None None None 2018-02-13 16:35:40 UTC
Red Hat Product Errata RHBA-2018:1472 0 None None None 2018-05-15 17:33:55 UTC
oVirt gerrit 83516 0 master ABANDONED storage: fixing volume path detection 2020-05-03 06:43:54 UTC
oVirt gerrit 83523 0 master MERGED Use /run/vdsm to lookup volume paths 2020-05-03 06:43:54 UTC
oVirt gerrit 83689 0 master MERGED Prepare symlinks for OVF store and cache the path 2020-05-03 06:43:54 UTC

Description Marian Jankular 2017-08-28 11:04:08 UTC
Description of problem:
hosted engine agent is not able to refresh hosted engine status when iso domain is not available after network outage

Version-Release number of selected component (if applicable):
rhevm-4.0.7.4-0.1.el7ev.noarch


How reproducible:
every time

Steps to Reproduce:
1. install hosted engine
2. add iso storage domain
3. power off hosted engine vm
4. make iso storage domain unavailable
5. start the hosted engine vm


Actual results:
hosted engine agent is not able to retrieve data about hosted engine status

--== Host 3 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : False
Hostname                           : hosted_engine3
Host ID                            : 3
Engine status                      : unknown stale-data
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : abdcbb9b
local_conf_timestamp               : 956778
Host timestamp                     : 956762
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=956762 (Tue Aug 22 13:05:51 2017)
        host-id=3
        score=3400
        vm_conf_refresh_time=956778 (Tue Aug 22 13:06:07 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineStop
        stopped=False
        timeout=Mon Jan 12 01:48:56 1970

Expected results:
agent will be able to determine engine status; it should ignore iso domain status

Additional info:

SPM is not able to get the status of the unreachable iso sd

Comment 1 Martin Sivák 2017-08-28 11:06:01 UTC
Where is the ISO domain located? Is it a separate storage server or is it placed directly on the engine VM like it used to be possible in the past?

Comment 2 Martin Sivák 2017-08-28 12:10:33 UTC
Nir, can an unresponsive ISO domain cause something like that on the vdsm side? This is not the first time we saw something like this.

Comment 3 Michal Skrivanek 2017-08-29 05:02:21 UTC
Yes it can. Why do you have ISO attached to HE VM?

Comment 4 Marian Jankular 2017-08-29 07:32:02 UTC
Hello Michal,

I mean an ISO storage domain hosted on the HE VM, not an ISO image attached to the HE VM.

Comment 7 Martin Sivák 2017-09-13 12:33:39 UTC
We are currently trying to reproduce this issue in our test environments to be able to find out what the root cause might be. We will try with ISO domain inside the engine VM itself, on a separate server and with standard storage just to be sure we cover all possible paths.

Comment 8 Artyom 2017-09-14 13:23:20 UTC
To be honest, I do not really understand the reproduction steps:

1) Configure HE environment
2) Configure ISO domain on the HE VM
3) I assume that you also have a master storage domain configured on the engine
4) Add ISO domain to the engine
5) Power off the hosted engine VM (it will make the ISO domain unavailable as well, since it is placed on the HE VM)
6) I do not understand the step "make iso storage domain unavailable" - how do you make it unavailable?
7) Also the step "start the hosted engine VM" is not clear to me; ovirt-ha-agent must start it by itself on another host that is in state UP, without any interaction from the user side, so why do you start it?

Comment 9 Marian Jankular 2017-09-16 09:59:44 UTC
Hi Artyom,

1-4 correct
5. before powering off, run "firewall-cmd --permanent --remove-service=nfs"
6. power off the VM
7. power on the VM

I am sorry, the initial steps were supposed to be:


Steps to Reproduce:
1. install hosted engine
2. add iso storage domain
3. make iso storage domain unavailable
4. power off hosted engine vm
5. start the hosted engine vm

Comment 10 Artyom 2017-09-17 06:48:13 UTC
Thanks for the clarification.

Comment 11 Artyom 2017-09-17 07:08:24 UTC
And, I hope, the last question: do you block the ISO domain from the engine or from the host where the HE VM runs?

Comment 12 Marian Jankular 2017-09-18 09:22:37 UTC
Hi,

I remove the "allow" rules on the HE VM so the host cannot access it.

Marian

Comment 13 Martin Sivák 2017-10-12 11:39:12 UTC
Any update about the test results?

Comment 14 Artyom 2017-10-16 16:32:38 UTC
Checked on:
ovirt-hosted-engine-setup-2.2.0-0.0.master.20171009203744.gitd01cc03.el7.centos.noarch
ovirt-hosted-engine-ha-2.2.0-0.0.master.20171013115034.20171013115031.gitc8edb37.el7.centos.noarch
ovirt-engine-appliance-4.2-20171016.1.el7.centos.noarch
=====================================================================
Steps:
1) Deploy HE environment with two hosts
2) Configure NFS storage on the HE VM
3) Add ISO domain from the HE VM to the engine 
4) Remove NFS firewall rule from the HE VM
# firewall-cmd --permanent --remove-service=nfs
# firewall-cmd --reload
5) Poweroff HE VM
# hosted-engine --vm-poweroff
6) Wait until the agent starts the HE VM - FAILED

For some reason, the HE status command also shows the following:
--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : False
Hostname                           : cyan-vdsf.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : unknown stale-data
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 710c118a
local_conf_timestamp               : 12487
Host timestamp                     : 12487
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=12487 (Mon Oct 16 19:25:16 2017)
        host-id=1
        score=3400
        vm_conf_refresh_time=12487 (Mon Oct 16 19:25:16 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUp
        stopped=False


--== Host 2 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : False
Hostname                           : rose05.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : unknown stale-data
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : a8512249
local_conf_timestamp               : 3307341
Host timestamp                     : 3307341
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=3307341 (Mon Oct 16 19:25:32 2017)
        host-id=2
        score=3400
        vm_conf_refresh_time=3307341 (Mon Oct 16 19:25:32 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineDown
        stopped=False

Comment 15 Artyom 2017-10-16 16:33:27 UTC
Created attachment 1339376 [details]
agent, broker and vdsm logs from both hosts

Comment 16 Artyom 2017-10-17 07:17:25 UTC
Manually starting the HE VM via "# hosted-engine --vm-start" brought the whole environment back to the normal state.

Comment 17 Martin Sivák 2017-10-18 12:43:41 UTC
Both agents seem to be stuck on OVF extraction. Would it be possible to reproduce this with DEBUG log enabled?

Comment 18 Artyom 2017-10-18 15:22:40 UTC
Created attachment 1340257 [details]
agent, broker and vdsm logs from both hosts(DEBUG)

Comment 19 Martin Sivák 2017-11-02 10:55:17 UTC
So I think I found the culprit here:

The code here https://gerrit.ovirt.org/gitweb?p=ovirt-hosted-engine-ha.git;a=blob;f=ovirt_hosted_engine_ha/lib/heconflib.py;h=9e1996b9b0355cf3e5c9560e6f59679790ec7e8f;hb=5985fc70c4d5198d2ae3d8a3682fb85cdc3a2d35#l362 uses glob on top of /rhev/data-center/mnt:

    volume_path = os.path.join(
        volume_path,
        '*',
        sd_uuid,
        'images',
        img_uuid,
        vol_uuid,
    )
    volumes = glob.glob(volume_path)

Notice the asterisk position: it basically scans the directories of all mounted storage domains, and if some of the domains are NFS based and unavailable, we get stuck here.

I wonder if we can avoid the glob call.
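
To illustrate, here is a minimal sketch (hypothetical placeholder UUIDs, not our actual agent code) of what the wildcard expansion ends up doing:

    import glob
    import os

    MNT_ROOT = '/rhev/data-center/mnt'  # parent directory of every mounted storage domain

    # Hypothetical placeholder UUIDs, not values taken from this bug.
    sd_uuid = '00000000-0000-0000-0000-000000000001'
    img_uuid = '00000000-0000-0000-0000-000000000002'
    vol_uuid = '00000000-0000-0000-0000-000000000003'

    pattern = os.path.join(MNT_ROOT, '*', sd_uuid, 'images', img_uuid, vol_uuid)

    # Expanding '*' lists MNT_ROOT and then stat()s the rest of the pattern
    # inside every mount found there, including the unreachable ISO domain
    # mount; a stat() on a dead NFS mount blocks, and that is where the agent
    # gets stuck.
    volumes = glob.glob(pattern)
    print(volumes)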

Comment 20 Nir Soffer 2017-11-02 11:47:17 UTC
(In reply to Martin Sivák from comment #19)
> I wonder if we can avoid the glob call.

hosted engine should not look inside /rhev/data-center/mnt. It should use

    /run/vdsm/storage/sd-id/img-id/vol-id

See this example - I have a vm with one block-based disk:
/dev/6ffbc483-0031-403a-819b-3bb2f0f8de0a/58adc0fb-c658-4ed1-a1b2-924b320477cb

And one file based disk:
/rhev/data-center/mnt/dumbo.tlv.redhat.com:_voodoo_40/d6e4a622-bd31-4d8f-904d-1e26b7286757/images/a6f96cf8-ffd9-4b14-ac7a-5f1fa8e80bb7

# tree /run/vdsm/storage/
/run/vdsm/storage/
├── 373e8c55-283f-41d4-8433-95c1ef1bbd1a
├── 6ffbc483-0031-403a-819b-3bb2f0f8de0a
│   └── e54681ee-01d7-46a9-848f-2da2a38b8f1e
│       ├── 58adc0fb-c658-4ed1-a1b2-924b320477cb -> /dev/6ffbc483-0031-403a-819b-3bb2f0f8de0a/58adc0fb-c658-4ed1-a1b2-924b320477cb
│       └── 93331705-46be-4cb8-9dc2-c1559843fd4a -> /dev/6ffbc483-0031-403a-819b-3bb2f0f8de0a/93331705-46be-4cb8-9dc2-c1559843fd4a
└── d6e4a622-bd31-4d8f-904d-1e26b7286757
    └── a6f96cf8-ffd9-4b14-ac7a-5f1fa8e80bb7 -> /rhev/data-center/mnt/dumbo.tlv.redhat.com:_voodoo_40/d6e4a622-bd31-4d8f-904d-1e26b7286757/images/a6f96cf8-ffd9-4b14-ac7a-5f1fa8e80bb7

But it is best to use a vdsm API instead of duplicating the knowledge about the
file system layout in hosted engine.
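
A minimal sketch of that lookup, assuming the /run/vdsm/storage layout shown above (the helper name and error handling are illustrative only, not an actual vdsm or hosted-engine API):

    import os

    RUN_VDSM_STORAGE = '/run/vdsm/storage'

    def lookup_volume_path(sd_uuid, img_uuid, vol_uuid):
        """Return the /run/vdsm/storage symlink for the given volume.

        /run is a tmpfs, so checking the link itself never touches the
        (possibly unreachable) NFS mounts; only following it would.
        """
        path = os.path.join(RUN_VDSM_STORAGE, sd_uuid, img_uuid, vol_uuid)
        # lexists() does not follow the symlink, so it cannot hang on a dead mount.
        if not os.path.lexists(path):
            raise LookupError("volume %s is not prepared on this host" % vol_uuid)
        return path

As noted in the following comments, these symlinks only exist for images that have already been prepared, so the volume still has to be prepared first.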

Comment 21 Martin Sivák 2017-11-02 12:35:52 UTC
Thanks Nir, we were wondering about those symlinks.

Are those created for all present volumes/images or do we need to call prepareImage to get them? I am asking because we are interested in the OVF store for example and we do not mount that one.

We are considering using the API as well now, as this is pretty old code. We might not have had the necessary APIs when it was written.

Comment 22 Nir Soffer 2017-11-02 12:55:08 UTC
(In reply to Martin Sivák from comment #21)
> Thanks Nir, we were wondering about those symlinks.
> 
> Are those created for all present volumes/images or do we need to call
> prepareImage to get them? I am asking because we are interested in the OVF
> store for example and we do not mount that one.

These are created when preparing an image, so this is not a way to locate
volumes you don't know about.

> We are considering using the API as well now as this is pretty old code. We
> might not have had the necessary APIs when it was written.

We don't have an API for locating OVF_STORE volumes; these are a private implementation
detail managed by the engine. I think the right solution would be to register the
OVF_STORE disks in the domain metadata, and provide an API to fetch the disk
UUIDs.

Comment 23 Martin Sivák 2017-11-02 15:28:05 UTC
Right, but we know how to get the right UUIDs, so that might be a way. We just have to call prepareImages with the right IDs and then access the /run structure or use some reasonable API that would give us the path (any hint?).

Comment 24 Nir Soffer 2017-11-02 20:59:16 UTC
(In reply to Martin Sivák from comment #23)
> Right, but we know how to get the right UUIDs, so that might be a way. We
> just have to call prepareImages with the right IDs and then access the /run
> structure or use some reasonable API that would give us the path (any hint?).

If you know the uuid of the image, prepare it and get the path to the
volume from the response, see
https://github.com/oVirt/vdsm/blob/master/lib/vdsm/api/vdsm-api.yml#L2922
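
A hedged sketch of that suggestion using the vdsm jsonrpc client; the verb (Image.prepare), its parameter names, and the 'path' key in the response are meant to follow the schema linked above, but treat them as assumptions rather than verified signatures:

    from vdsm import client

    def prepare_and_get_path(sp_uuid, sd_uuid, img_uuid, vol_uuid,
                             host='localhost', port=54321):
        cli = client.connect(host, port)
        try:
            # Preparing the image also creates the /run/vdsm/storage symlinks
            # discussed in comment 20.
            result = cli.Image.prepare(
                storagepoolID=sp_uuid,
                storagedomainID=sd_uuid,
                imageID=img_uuid,
                volumeID=vol_uuid,
            )
            # The response is expected to carry the resolved volume path.
            return result['path']
        finally:
            cli.close()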

Comment 27 Artyom 2017-12-03 08:33:09 UTC
Verified on ovirt-hosted-engine-ha-2.2.0-1.el7ev.noarch

Comment 28 Simone Tiraboschi 2018-02-13 16:35:40 UTC
*** Bug 1538639 has been marked as a duplicate of this bug. ***

Comment 32 errata-xmlrpc 2018-05-15 17:32:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1472

Comment 33 Tal Nisan 2018-07-02 08:53:25 UTC
*** Bug 1443156 has been marked as a duplicate of this bug. ***

Comment 34 Franta Kust 2019-05-16 13:03:14 UTC
BZ<2>Jira Resync

