Bug 1485883

Summary: hosted engine agent is not able to refresh hosted engine status when iso domain is not available after network outage
Product: Red Hat Enterprise Virtualization Manager
Reporter: Marian Jankular <mjankula>
Component: ovirt-hosted-engine-ha
Assignee: Martin Sivák <msivak>
Status: CLOSED ERRATA
QA Contact: Artyom <alukiano>
Severity: high
Docs Contact:
Priority: urgent
Version: 4.0.7
CC: alukiano, bgraveno, jbelka, lsurette, mavital, mgoldboi, michal.skrivanek, mjankula, msivak, nsoffer, stirabos, ykaul, ylavi
Target Milestone: ovirt-4.2.0
Keywords: Triaged, ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1516203 (view as bug list)
Environment:
Last Closed: 2018-05-15 17:32:29 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: SLA
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1516203
Attachments:
  agent, broker and vdsm logs from both hosts (flags: none)
  agent, broker and vdsm logs from both hosts (DEBUG) (flags: none)

Description Marian Jankular 2017-08-28 11:04:08 UTC
Description of problem:
hosted engine agent is not able to refresh hosted engine status when iso domain is not available after network outage

Version-Release number of selected component (if applicable):
rhevm-4.0.7.4-0.1.el7ev.noarch


How reproducible:
every time

Steps to Reproduce:
1. install hosted engine
2. add iso storage domain
3. power off hosted engine vm
4. make iso storage domain unavailable
5. start the hosted engine vm


Actual results:
hosted engine agent is not able to retrieve data about hosted engine status

--== Host 3 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : False
Hostname                           : hosted_engine3
Host ID                            : 3
Engine status                      : unknown stale-data
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : abdcbb9b
local_conf_timestamp               : 956778
Host timestamp                     : 956762
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=956762 (Tue Aug 22 13:05:51 2017)
        host-id=3
        score=3400
        vm_conf_refresh_time=956778 (Tue Aug 22 13:06:07 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineStop
        stopped=False
        timeout=Mon Jan 12 01:48:56 1970

Expected results:
The agent will be able to determine the engine status; it should ignore the ISO domain status.

Additional info:

The SPM is not able to get the status of the unreachable ISO storage domain.

Comment 1 Martin Sivák 2017-08-28 11:06:01 UTC
Where is the ISO domain located? Is it a separate storage server, or is it placed directly on the engine VM as used to be possible in the past?

Comment 2 Martin Sivák 2017-08-28 12:10:33 UTC
Nir, can an unresponsive ISO domain cause something like that on the vdsm side? This is not the first time we saw something like this.

Comment 3 Michal Skrivanek 2017-08-29 05:02:21 UTC
Yes it can. Why do you have ISO attached to HE VM?

Comment 4 Marian Jankular 2017-08-29 07:32:02 UTC
Hello Michal,

I mean an ISO storage domain hosted on the HE VM, not an ISO image attached to the HE VM.

Comment 7 Martin Sivák 2017-09-13 12:33:39 UTC
We are currently trying to reproduce this issue in our test environments to be able to find out what the root cause might be. We will try with ISO domain inside the engine VM itself, on a separate server and with standard storage just to be sure we cover all possible paths.

Comment 8 Artyom 2017-09-14 13:23:20 UTC
To be honest, I do not really understand the reproduction steps:

1) Configure HE environment
2) Configure ISO domain on the HE VM
3) I assume that you have also master storage domain configured on the engine
4) Add ISO domain to the engine
5) Power off the hosted engine VM (this will make the ISO domain unavailable as well, since it is placed on the HE VM)
6) I do not understand the step "make iso storage domain unavailable"; how do you make it unavailable?
7) Also the step "start the hosted engine VM" is not clear to me; ovirt-ha-agent should start it by itself on another host that is in state UP, without any interaction from the user side, so why do you start it manually?

Comment 9 Marian Jankular 2017-09-16 09:59:44 UTC
Hi Artyom,

1-4 correct 
5. before powering off, run "firewall-cmd --permanent --remove-service=nfs"
6. power off the VM
7. power on the VM

I am sorry for the initial steps; they were supposed to be:


Steps to Reproduce:
1. install hosted engine
2. add iso storage domain
3. make iso storage domain unavailable
4. power off hosted engine vm
5. start the hosted engine vm

Comment 10 Artyom 2017-09-17 06:48:13 UTC
Thanks for the clarification.

Comment 11 Artyom 2017-09-17 07:08:24 UTC
And, I hope, one last question: do you block the ISO domain from the engine or from the host where the HE VM runs?

Comment 12 Marian Jankular 2017-09-18 09:22:37 UTC
Hi,

I remove the "allow rules" on the HE VM so the host cannot access it.

Marian

Comment 13 Martin Sivák 2017-10-12 11:39:12 UTC
Any update about the test results?

Comment 14 Artyom 2017-10-16 16:32:38 UTC
Checked on:
ovirt-hosted-engine-setup-2.2.0-0.0.master.20171009203744.gitd01cc03.el7.centos.noarch
ovirt-hosted-engine-ha-2.2.0-0.0.master.20171013115034.20171013115031.gitc8edb37.el7.centos.noarch
ovirt-engine-appliance-4.2-20171016.1.el7.centos.noarch
=====================================================================
Steps:
1) Deploy HE environment with two hosts
2) Configure NFS storage on the HE VM
3) Add ISO domain from the HE VM to the engine 
4) Remove NFS firewall rule from the HE VM
# firewall-cmd --permanent --remove-service=nfs
# firewall-cmd --reload
5) Poweroff HE VM
# hosted-engine --vm-poweroff
6) Wait for the agent to start the HE VM - FAILED

For some reason, the HE status command also shows:
--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : False
Hostname                           : cyan-vdsf.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : unknown stale-data
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 710c118a
local_conf_timestamp               : 12487
Host timestamp                     : 12487
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=12487 (Mon Oct 16 19:25:16 2017)
        host-id=1
        score=3400
        vm_conf_refresh_time=12487 (Mon Oct 16 19:25:16 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUp
        stopped=False


--== Host 2 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : False
Hostname                           : rose05.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : unknown stale-data
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : a8512249
local_conf_timestamp               : 3307341
Host timestamp                     : 3307341
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=3307341 (Mon Oct 16 19:25:32 2017)
        host-id=2
        score=3400
        vm_conf_refresh_time=3307341 (Mon Oct 16 19:25:32 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineDown
        stopped=False

Comment 15 Artyom 2017-10-16 16:33:27 UTC
Created attachment 1339376 [details]
agent, broker and vdsm logs from both hosts

Comment 16 Artyom 2017-10-17 07:17:25 UTC
Manually starting the HE VM via "# hosted-engine --vm-start" brought the whole environment back to a normal state.

Comment 17 Martin Sivák 2017-10-18 12:43:41 UTC
Both agents seem to be stuck on OVF extraction. Would it be possible to reproduce this with DEBUG log enabled?

Comment 18 Artyom 2017-10-18 15:22:40 UTC
Created attachment 1340257 [details]
agent, broker and vdsm logs from both hosts(DEBUG)

Comment 19 Martin Sivák 2017-11-02 10:55:17 UTC
So I think I found the culprit here:

The code here https://gerrit.ovirt.org/gitweb?p=ovirt-hosted-engine-ha.git;a=blob;f=ovirt_hosted_engine_ha/lib/heconflib.py;h=9e1996b9b0355cf3e5c9560e6f59679790ec7e8f;hb=5985fc70c4d5198d2ae3d8a3682fb85cdc3a2d35#l362 uses glob on top of /rhev/data-center/mnt:

    volume_path = os.path.join(
        volume_path,
        '*',
        sd_uuid,
        'images',
        img_uuid,
        vol_uuid,
    )
    volumes = glob.glob(volume_path)

Notice the asterisk position: it basically scans the directories of all mounted storage domains, and if some of those domains are NFS-based and unavailable, we get stuck there.

I wonder if we can avoid the glob call.
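
For illustration, a minimal sketch of a lookup that avoids the wildcard, assuming the agent already knows the mount directory of its own storage domain; the function and the connection_dir argument are hypothetical, and this is not the actual fix:

    import os

    def resolve_volume_path(mnt_dir, connection_dir, sd_uuid, img_uuid, vol_uuid):
        # Build the path directly from the known mount point instead of
        # globbing '*' under /rhev/data-center/mnt, so the lookup never
        # touches unrelated (and possibly unreachable) NFS mounts.
        path = os.path.join(
            mnt_dir,         # e.g. /rhev/data-center/mnt
            connection_dir,  # e.g. 'server.example.com:_export_he' (hypothetical)
            sd_uuid,
            'images',
            img_uuid,
            vol_uuid,
        )
        return path if os.path.exists(path) else None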

Comment 20 Nir Soffer 2017-11-02 11:47:17 UTC
(In reply to Martin Sivák from comment #19)
> I wonder if we can avoid the glob call.

hosted engine should not look inside /rhev/data-center/mnt. It should use

    /run/vdsm/storage/sd-id/img-id/vol-id

See this example - I have a vm with one block-based disk:
/dev/6ffbc483-0031-403a-819b-3bb2f0f8de0a/58adc0fb-c658-4ed1-a1b2-924b320477cb

And one file based disk:
/rhev/data-center/mnt/dumbo.tlv.redhat.com:_voodoo_40/d6e4a622-bd31-4d8f-904d-1e26b7286757/images/a6f96cf8-ffd9-4b14-ac7a-5f1fa8e80bb7

# tree /run/vdsm/storage/
/run/vdsm/storage/
├── 373e8c55-283f-41d4-8433-95c1ef1bbd1a
├── 6ffbc483-0031-403a-819b-3bb2f0f8de0a
│   └── e54681ee-01d7-46a9-848f-2da2a38b8f1e
│       ├── 58adc0fb-c658-4ed1-a1b2-924b320477cb -> /dev/6ffbc483-0031-403a-819b-3bb2f0f8de0a/58adc0fb-c658-4ed1-a1b2-924b320477cb
│       └── 93331705-46be-4cb8-9dc2-c1559843fd4a -> /dev/6ffbc483-0031-403a-819b-3bb2f0f8de0a/93331705-46be-4cb8-9dc2-c1559843fd4a
└── d6e4a622-bd31-4d8f-904d-1e26b7286757
    └── a6f96cf8-ffd9-4b14-ac7a-5f1fa8e80bb7 -> /rhev/data-center/mnt/dumbo.tlv.redhat.com:_voodoo_40/d6e4a622-bd31-4d8f-904d-1e26b7286757/images/a6f96cf8-ffd9-4b14-ac7a-5f1fa8e80bb7

But best use a vdsm api instead of duplicating the knowledge about file system
layout in hosted engine.
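
For illustration, a minimal sketch of resolving a volume through the /run/vdsm/storage layout shown above, assuming the volume has already been prepared so the symlink exists; the constant and helper names are hypothetical:

    import os

    VDSM_RUN_STORAGE = '/run/vdsm/storage'

    def runtime_volume_path(sd_uuid, img_uuid, vol_uuid):
        # /run/vdsm/storage/<sd_uuid>/<img_uuid>/<vol_uuid> is a symlink to
        # the real location (block device or file on the mounted domain),
        # so no other mount point is ever touched.
        link = os.path.join(VDSM_RUN_STORAGE, sd_uuid, img_uuid, vol_uuid)
        return os.path.realpath(link) if os.path.lexists(link) else None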

Comment 21 Martin Sivák 2017-11-02 12:35:52 UTC
Thanks Nir, we were wondering about those symlinks.

Are those created for all present volumes/images or do we need to call prepareImage to get them? I am asking because we are interested in the OVF store for example and we do not mount that one.

We are considering using the API as well now as this is pretty old code. We might not have had the necessary APIs when it was written.

Comment 22 Nir Soffer 2017-11-02 12:55:08 UTC
(In reply to Martin Sivák from comment #21)
> Thanks Nir, we were wondering about those symlinks.
> 
> Are those created for all present volumes/images or do we need to call
> prepareImage to get them? I am asking because we are interested in the OVF
> store for example and we do not mount that one.

These are created when preparing an image, so this is not a way to locate
volumes you don't know about.

> We are considering using the API as well now as this is pretty old code. We
> might not have had the necessary APIs when it was written.

We don't have an API for locating OVF_STORE volumes; these are a private
implementation detail managed by the engine. I think the right solution would
be to register the OVF_STORE disks in the domain metadata and provide an API
to fetch the disk uuids.

Comment 23 Martin Sivák 2017-11-02 15:28:05 UTC
Right, but we know how to get the right UUIDs, so that might be a way. We just have to call prepareImages with the right IDs and then access the /run structure or use some reasonable API that would give us the path (any hint?).

Comment 24 Nir Soffer 2017-11-02 20:59:16 UTC
(In reply to Martin Sivák from comment #23)
> Right, but we know how to get the right UUIDs, so that might be a way. We
> just have to call prepareImages with the right IDs and then access the /run
> structure or use some reasonable API that would give us the path (any hint?).

If you know the uuid of the image, prepare it and get the path to the
volume from the response, see
https://github.com/oVirt/vdsm/blob/master/lib/vdsm/api/vdsm-api.yml#L2922
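
A hedged sketch of that flow with the vdsm jsonrpc client; the parameter names and the exact shape of the response are assumptions that should be checked against the vdsm-api.yml reference above:

    from vdsm import client

    def prepare_and_get_path(sp_uuid, sd_uuid, img_uuid, vol_uuid):
        # Connect to the local vdsm jsonrpc endpoint (TLS settings depend
        # on the host configuration).
        cli = client.connect('localhost', 54321)
        # Image.prepare activates the volume; the response is expected to
        # carry the volume path, per the schema linked above.
        res = cli.Image.prepare(
            storagepoolID=sp_uuid,
            storagedomainID=sd_uuid,
            imageID=img_uuid,
            volumeID=vol_uuid,
        )
        return res['path']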

Comment 27 Artyom 2017-12-03 08:33:09 UTC
Verified on ovirt-hosted-engine-ha-2.2.0-1.el7ev.noarch

Comment 28 Simone Tiraboschi 2018-02-13 16:35:40 UTC
*** Bug 1538639 has been marked as a duplicate of this bug. ***

Comment 32 errata-xmlrpc 2018-05-15 17:32:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1472

Comment 33 Tal Nisan 2018-07-02 08:53:25 UTC
*** Bug 1443156 has been marked as a duplicate of this bug. ***

Comment 34 Franta Kust 2019-05-16 13:03:14 UTC
BZ<2>Jira Resync