Description of problem:
The hosted engine agent is not able to refresh the hosted engine status when the ISO domain is not available after a network outage.

Version-Release number of selected component (if applicable):
rhevm-4.0.7.4-0.1.el7ev.noarch

How reproducible:
Every time

Steps to Reproduce:
1. Install hosted engine
2. Add an ISO storage domain
3. Power off the hosted engine VM
4. Make the ISO storage domain unavailable
5. Start the hosted engine VM

Actual results:
The hosted engine agent is not able to retrieve data about the hosted engine status:

--== Host 3 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : False
Hostname                           : hosted_engine3
Host ID                            : 3
Engine status                      : unknown stale-data
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : abdcbb9b
local_conf_timestamp               : 956778
Host timestamp                     : 956762
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=956762 (Tue Aug 22 13:05:51 2017)
	host-id=3
	score=3400
	vm_conf_refresh_time=956778 (Tue Aug 22 13:06:07 2017)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineStop
	stopped=False
	timeout=Mon Jan 12 01:48:56 1970

Expected results:
The agent is able to determine the engine status; it should ignore the ISO domain status.

Additional info:
The SPM is not able to get the status of the unreachable ISO SD.
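For reference, the status block above is the output of the standard status query (assuming the usual hosted-engine CLI):

    # hosted-engine --vm-status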
Where is the ISO domain located? Is it a separate storage server, or is it placed directly on the engine VM, as used to be possible in the past?
Nir, can an unresponsive ISO domain cause something like that on the vdsm side? This is not the first time we have seen something like this.
Yes, it can. Why do you have an ISO attached to the HE VM?
Hello Michal, I mean the ISO storage domain hosted on the HE VM, not an ISO image attached to the HE VM.
We are currently trying to reproduce this issue in our test environments to find out what the root cause might be. We will try with the ISO domain inside the engine VM itself, on a separate server, and with standard storage, just to be sure we cover all possible paths.
To be honest, I do not really understand the reproduction steps:

1) Configure the HE environment
2) Configure the ISO domain on the HE VM
3) I assume that you also have a master storage domain configured on the engine
4) Add the ISO domain to the engine
5) Power off the hosted engine VM (this will make the ISO domain unavailable as well, since it is placed on the HE VM)
6) I do not understand the step "make iso storage domain unavailable" - how do you make it unavailable?
7) The step "start the hosted engine VM" is also not clear to me; ovirt-ha-agent must start it by itself on another host that is in state UP, without any interaction from the user side, so why do you start it?
Hi Artyom,

1-4 correct
5. Before powering off, run "firewall-cmd --permanent --remove-service=nfs"
6. Power off the VM
7. Power on the VM

I am sorry, the initial steps were supposed to be:

Steps to Reproduce:
1. Install hosted engine
2. Add an ISO storage domain
3. Make the ISO storage domain unavailable
4. Power off the hosted engine VM
5. Start the hosted engine VM
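In other words, roughly the following (assuming the ISO domain is an NFS export served from the HE VM; the same commands appear in the verification further down):

    # on the HE VM: block NFS so the hosts lose access to the ISO domain
    firewall-cmd --permanent --remove-service=nfs
    firewall-cmd --reload

    # on a host: power the HE VM off and back on
    hosted-engine --vm-poweroff
    hosted-engine --vm-start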
Thanks for the clarification.
And, hopefully the last question: do you block the ISO domain from the engine or from the host where the HE VM runs?
Hi, I remove the "allow" rules on the HE VM so the host cannot access it. Marian
Any update about the test results?
Checked on:
ovirt-hosted-engine-setup-2.2.0-0.0.master.20171009203744.gitd01cc03.el7.centos.noarch
ovirt-hosted-engine-ha-2.2.0-0.0.master.20171013115034.20171013115031.gitc8edb37.el7.centos.noarch
ovirt-engine-appliance-4.2-20171016.1.el7.centos.noarch

=====================================================================

Steps:
1) Deploy the HE environment with two hosts
2) Configure NFS storage on the HE VM
3) Add the ISO domain from the HE VM to the engine
4) Remove the NFS firewall rule from the HE VM
   # firewall-cmd --permanent --remove-service=nfs
   # firewall-cmd --reload
5) Power off the HE VM
   # hosted-engine --vm-poweroff
6) Wait until the agent starts the HE VM - FAILED

For some reason the HE status command also shows stale data:

--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : False
Hostname                           : cyan-vdsf.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : unknown stale-data
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 710c118a
local_conf_timestamp               : 12487
Host timestamp                     : 12487
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=12487 (Mon Oct 16 19:25:16 2017)
	host-id=1
	score=3400
	vm_conf_refresh_time=12487 (Mon Oct 16 19:25:16 2017)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineUp
	stopped=False

--== Host 2 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : False
Hostname                           : rose05.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : unknown stale-data
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : a8512249
local_conf_timestamp               : 3307341
Host timestamp                     : 3307341
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=3307341 (Mon Oct 16 19:25:32 2017)
	host-id=2
	score=3400
	vm_conf_refresh_time=3307341 (Mon Oct 16 19:25:32 2017)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineDown
	stopped=False
Created attachment 1339376 [details] agent, broker and vdsm logs from both hosts
Manually starting the HE VM via "# hosted-engine --vm-start" brought the whole environment back to its normal state.
Both agents seem to be stuck on OVF extraction. Would it be possible to reproduce this with DEBUG logging enabled?
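(If it helps, the agent and broker log levels can usually be raised through their logger configuration plus a service restart; the file location below is an assumption based on a default ovirt-hosted-engine-ha install, so adjust it if your build differs:)

    # /etc/ovirt-hosted-engine-ha/agent-log.conf  (broker-log.conf analogously)
    # in the root logger section, change the level and keep the rest as shipped
    [logger_root]
    level=DEBUG

    # then restart the services so the new level takes effect
    systemctl restart ovirt-ha-agent ovirt-ha-broker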
Created attachment 1340257 [details] agent, broker and vdsm logs from both hosts(DEBUG)
So I think I found the culprit here:

The code at https://gerrit.ovirt.org/gitweb?p=ovirt-hosted-engine-ha.git;a=blob;f=ovirt_hosted_engine_ha/lib/heconflib.py;h=9e1996b9b0355cf3e5c9560e6f59679790ec7e8f;hb=5985fc70c4d5198d2ae3d8a3682fb85cdc3a2d35#l362 uses glob on top of /rhev/data-center/mnt:

    volume_path = os.path.join(
        volume_path,
        '*',
        sd_uuid,
        'images',
        img_uuid,
        vol_uuid,
    )
    volumes = glob.glob(volume_path)

Notice the asterisk position: it basically scans the directories of all mounted storage domains, and if some of the domains are NFS based and unavailable, we get stuck here. I wonder if we can avoid the glob call.
(In reply to Martin Sivák from comment #19)
> I wonder if we can avoid the glob call.

hosted engine should not look inside /rhev/data-center/mnt. It should use /run/vdsm/storage/sd-id/img-id/vol-id

See this example - I have a VM with one block-based disk:

    /dev/6ffbc483-0031-403a-819b-3bb2f0f8de0a/58adc0fb-c658-4ed1-a1b2-924b320477cb

And one file-based disk:

    /rhev/data-center/mnt/dumbo.tlv.redhat.com:_voodoo_40/d6e4a622-bd31-4d8f-904d-1e26b7286757/images/a6f96cf8-ffd9-4b14-ac7a-5f1fa8e80bb7

    # tree /run/vdsm/storage/
    /run/vdsm/storage/
    ├── 373e8c55-283f-41d4-8433-95c1ef1bbd1a
    ├── 6ffbc483-0031-403a-819b-3bb2f0f8de0a
    │   └── e54681ee-01d7-46a9-848f-2da2a38b8f1e
    │       ├── 58adc0fb-c658-4ed1-a1b2-924b320477cb -> /dev/6ffbc483-0031-403a-819b-3bb2f0f8de0a/58adc0fb-c658-4ed1-a1b2-924b320477cb
    │       └── 93331705-46be-4cb8-9dc2-c1559843fd4a -> /dev/6ffbc483-0031-403a-819b-3bb2f0f8de0a/93331705-46be-4cb8-9dc2-c1559843fd4a
    └── d6e4a622-bd31-4d8f-904d-1e26b7286757
        └── a6f96cf8-ffd9-4b14-ac7a-5f1fa8e80bb7 -> /rhev/data-center/mnt/dumbo.tlv.redhat.com:_voodoo_40/d6e4a622-bd31-4d8f-904d-1e26b7286757/images/a6f96cf8-ffd9-4b14-ac7a-5f1fa8e80bb7

But it is best to use a vdsm API instead of duplicating the knowledge about the file system layout in hosted engine.
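For illustration, a minimal sketch of that direct lookup, assuming the volume has already been prepared so the /run/vdsm/storage symlinks exist (the helper name is made up; the layout follows the tree above):

    import os

    RUN_VDSM_STORAGE = '/run/vdsm/storage'

    def prepared_volume_path(sd_uuid, img_uuid, vol_uuid):
        # Resolve the volume through the per-domain tree that vdsm maintains
        # under /run/vdsm/storage instead of globbing every mount under
        # /rhev/data-center/mnt, which blocks on an unreachable NFS domain.
        path = os.path.join(RUN_VDSM_STORAGE, sd_uuid, img_uuid, vol_uuid)
        if not os.path.lexists(path):
            raise RuntimeError("volume %s is not prepared" % vol_uuid)
        return path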
Thanks Nir, we were wondering about those symlinks. Are those created for all present volumes/images or do we need to call prepareImage to get them? I am asking because we are interested in the OVF store for example and we do not mount that one. We are considering using the API as well now as this is pretty old code. We might not have had the necessary APIs when it was written.
(In reply to Martin Sivák from comment #21)
> Thanks Nir, we were wondering about those symlinks.
>
> Are those created for all present volumes/images or do we need to call
> prepareImage to get them? I am asking because we are interested in the OVF
> store for example and we do not mount that one.

These are created when preparing an image, so this is not a way to locate volumes you don't know about.

> We are considering using the API as well now as this is pretty old code. We
> might not have had the necessary APIs when it was written.

We don't have an API for locating OVF_STORE volumes; these are a private implementation detail managed by the engine.

I think the right solution would be to register the OVF_STORE disks in the domain metadata and provide an API to fetch the disk UUIDs.
Right, but we know how to get the right UUIDs, so that might be a way. We just have to call prepareImage with the right IDs and then access the /run structure, or use some reasonable API that would give us the path (any hint?).
(In reply to Martin Sivák from comment #23)
> Right, but we know how to get the right UUIDs, so that might be a way. We
> just have to call prepareImage with the right IDs and then access the /run
> structure, or use some reasonable API that would give us the path (any hint?).

If you know the UUID of the image, prepare it and get the path to the volume from the response, see
https://github.com/oVirt/vdsm/blob/master/lib/vdsm/api/vdsm-api.yml#L2922
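Roughly something like the following; this is only a sketch, and the client module, verb, and parameter names are assumptions based on the vdsm-client package, so verify them against vdsm-api.yml before relying on it:

    from vdsm import client

    def prepare_and_get_path(sp_uuid, sd_uuid, img_uuid, vol_uuid):
        # Connect to the local vdsm (default jsonrpc port).
        cli = client.connect('localhost', 54321)
        # Preparing the image activates the volume and creates the
        # /run/vdsm/storage/<sd>/<img>/<vol> symlink.
        res = cli.Image.prepare(
            storagepoolID=sp_uuid,
            storagedomainID=sd_uuid,
            imageID=img_uuid,
            volumeID=vol_uuid,
        )
        # The response is expected to contain the path of the prepared volume.
        return res['path']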
Verified on ovirt-hosted-engine-ha-2.2.0-1.el7ev.noarch
*** Bug 1538639 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1472
*** Bug 1443156 has been marked as a duplicate of this bug. ***