Bug 1351213
Summary: | [Docs][SHE] Add warning/prerequisite that self-hosted engine can't use the same ISCSI target for the master domain and the hosted_storage domain | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Martin Tessun <mtessun> | ||||
Component: | Documentation | Assignee: | Tahlia Richardson <trichard> | ||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | Byron Gravenorst <bgraveno> | ||||
Severity: | urgent | Docs Contact: | |||||
Priority: | urgent | ||||||
Version: | 4.0.0 | CC: | acanan, amureini, dfediuck, dmoessne, ebenahar, green, laravot, lbopf, lsurette, mavital, mgoldboi, msivak, mtessun, nsoffer, rbalakri, rgolan, rhev-docs, Rhev-m-bugs, sbonazzo, srevivo, trichard, ykaul, ylavi | ||||
Target Milestone: | ovirt-4.0.7 | Keywords: | Documentation | ||||
Target Release: | --- | Flags: | amureini: needinfo+ | ||||
Hardware: | x86_64 | ||||||
OS: | Linux | ||||||
Whiteboard: | |||||||
Fixed In Version: | Doc Type: | Known Issue | |||||
Doc Text: |
In bootstrap mode, before the self-hosted engine storage domain has been imported, the Manager tells VDSM to disconnect the iSCSI target prior to attaching the master domain. The Manager does not know that hosted_storage is using the same target, because that domain is not in the database yet.
As a result, VDSM disconnects the target, the Manager virtual machine loses connectivity to its disk, sanlock times out, and the Manager virtual machine is restarted.
To avoid this, do not use the same iSCSI target for the master domain and the hosted_storage domain.
|
Story Points: | --- | ||||
Clone Of: | Environment: | ||||||
Last Closed: | 2017-03-01 04:00:39 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | Docs | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | 1386337 | ||||||
Bug Blocks: | 1338732, 1393902 | ||||||
Attachments: |
|
Description
Martin Tessun
2016-06-29 13:28:15 UTC
The iSCSI connection details, if needed:

[root@ovirt1 ~]# iscsiadm -m session -P 3
iSCSI Transport Class version 2.0-870
version 6.2.0.873-30
Target: iqn.2003-01.org.linux-iscsi.kirk.x8664:sn.21dc789db84d (non-flash)
    Current Portal: 192.168.100.1:3260,1
    Persistent Portal: 192.168.100.1:3260,1
        **********
        Interface:
        **********
        Iface Name: default
        Iface Transport: tcp
        Iface Initiatorname: iqn.2015-08.de.die-tessuns:ovirt-1
        Iface IPaddress: 192.168.100.21
        Iface HWaddress: <empty>
        Iface Netdev: <empty>
        SID: 3
        iSCSI Connection State: LOGGED IN
        iSCSI Session State: LOGGED_IN
        Internal iscsid Session State: NO CHANGE
        *********
        Timeouts:
        *********
        Recovery Timeout: 5
        Target Reset Timeout: 30
        LUN Reset Timeout: 30
        Abort Timeout: 15
        *****
        CHAP:
        *****
        username: <empty>
        password: ********
        username_in: <empty>
        password_in: ********
        ************************
        Negotiated iSCSI params:
        ************************
        HeaderDigest: None
        DataDigest: None
        MaxRecvDataSegmentLength: 262144
        MaxXmitDataSegmentLength: 262144
        FirstBurstLength: 65536
        MaxBurstLength: 262144
        ImmediateData: Yes
        InitialR2T: Yes
        MaxOutstandingR2T: 1
        ************************
        Attached SCSI devices:
        ************************
        Host Number: 11 State: running
        scsi11 Channel 00 Id 0 Lun: 2
            Attached scsi disk sdd State: running
        scsi11 Channel 00 Id 0 Lun: 3
            Attached scsi disk sdc State: running
        scsi11 Channel 00 Id 0 Lun: 4
            Attached scsi disk sdb State: running
        scsi11 Channel 00 Id 0 Lun: 5
            Attached scsi disk sda State: running
[root@ovirt1 ~]#

Martin, can you attach the logs please?

Nir, Martin Sivak - who's taking point on this? It's currently on Martin but on the Storage team, which doesn't make too much sense.

(In reply to Allon Mureinik from comment #3)
> Nir, Martin Sivak - who's taking point on this? It's currently on Martin but
> on the Storage team, which doesn't make too much sense.

Looks like the issue is that engine does not know anything about hosted engine storage domain at this point.

When the engine creates a storage domain, we connect to and disconnect from the target. Disconnecting from the target will cause hosted engine storage domain to be disconnected and the engine will pause. The only way to prevent this is to import the hosted engine storage domain into engine database *before* adding another storage domain on same target.

I'm afraid that we cannot fix this with the current solution, which imports the hosted engine storage domain *after* creating another storage domain.

I think the only solution for now is to document this limitation, and install the hosted engine storage domain on its own target.

Roy, can you confirm my theory on this problem?

Sorry for the late update on this, but Nir's comment in C#4 looks reasonable to me.

I will try grabbing the logs today, but I want to redo an installation, as I did a quite longish demo with my setup, showing lots of "unnecessary" things as well.

Just let me know if logs aren't needed anymore and problem is clear.

Kind regards,
Martin

(In reply to Martin Tessun from comment #6)
> Sorry for the late update on this, but Nir's comment in C#4 looks reasonable
> to me.
>
> I will try grabbing the logs today, but I want to redo an installation, as I
> did a quite longish demo with my setup, showing lots of "unnecessary" things
> as well.

Martin, I believe this issue also exists in previous versions (e.g. 3.6). If you try to create a storage domain on the same target as the hosted engine storage domain, the target will be disconnected after creating the storage domain, and the hosted engine vm will pause.

Once you import the hosted engine storage domain, engine knows about the target and will not disconnect it when creating a new storage domain on same target.
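The behaviour described in the comments above can be illustrated with a small sketch. This is a simplified model written for this discussion, not the engine's actual code: `StorageDomain`, `should_disconnect`, and the example IQN are all hypothetical names. It only shows why a target looks unused while hosted_storage is missing from the database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageDomain:
    """Hypothetical stand-in for a storage domain record in the engine database."""
    name: str
    target_iqn: str


def should_disconnect(target_iqn, new_domain, known_domains):
    """Return True if no *other* domain known to the engine uses the target.

    The engine can only consult domains it knows about, so a domain that is
    connected on the host but not yet imported is invisible to this check.
    """
    return not any(
        d.target_iqn == target_iqn for d in known_domains if d != new_domain
    )


TARGET = "iqn.2003-01.org.example:shared"  # hypothetical IQN
hosted = StorageDomain("hosted_storage", TARGET)
data = StorageDomain("data1", TARGET)

# Bootstrap: hosted_storage is not in the database yet, so the target looks unused.
bootstrap_result = should_disconnect(TARGET, data, known_domains=[data])

# After import: hosted_storage is known, so the disconnect is skipped.
imported_result = should_disconnect(TARGET, data, known_domains=[data, hosted])

print(bootstrap_result, imported_result)  # True False
```

In the bootstrap case the check returns `True`, which corresponds to the disconnect that pauses the hosted engine VM. Once hosted_storage is among the known domains, the same check returns `False`.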
Created attachment 1176114 [details]
messages from RHEV-H (vdsm, messages, ovirt* libvirt)

(In reply to Nir Soffer from comment #7)
> (In reply to Martin Tessun from comment #6)
> > Sorry for the late update on this, but Nir's comment in C#4 looks reasonable
> > to me.
> >
> > I will try grabbing the logs today, but I want to redo an installation, as I
> > did a quite longish demo with my setup, showing lots of "unnecessary" things
> > as well.
>
> Martin, I believe this issue also exists in previous versions (e.g. 3.6). If you
> try to create a storage domain on the same target as the hosted engine storage
> domain, the target will be disconnected after creating the storage domain, and
> the hosted engine vm will pause.
>

Well, I used the same setup for RHEV 3.6 HE setups and this never happened. Maybe it has been a timing issue.

> Once you import the hosted engine storage domain, engine knows about the target
> and will not disconnect it when creating a new storage domain on same target.

Ack.

I will attach the logs now anyways; I also have gathered sosreport, if that is needed.

Additionally I will try using a different iSCSI target for the Hosted Engine, to verify this, but to me it looks like you are exactly right (and I was just "lucky" not to hit this in 3.6; and I did set this up at least 3 times.)

So just let me know, if you need additional tests/logs (even with RHEV 3.6)

Cheers,
Martin

P.S.: Some lsscsi output during the initialisation:

[root@ovirt1 ~]# lsscsi
[3:0:0:0] cd/dvd QEMU QEMU DVD-ROM 2.4. /dev/sr0
[root@ovirt1 ~]# lsscsi
[3:0:0:0] cd/dvd QEMU QEMU DVD-ROM 2.4. /dev/sr0
[root@ovirt1 ~]# lsscsi
[3:0:0:0] cd/dvd QEMU QEMU DVD-ROM 2.4. /dev/sr0
[10:0:0:2] disk LIO-ORG rhevh-01 4.0 /dev/sdd
[10:0:0:3] disk LIO-ORG test-01 4.0 /dev/sdc
[10:0:0:4] disk LIO-ORG ovirt-01 4.0 /dev/sdb
[10:0:0:5] disk LIO-ORG temp-01 4.0 /dev/sda

[root@ovirt1 ~]# hosted-engine --vm-status

--== Host 1 status ==--

Status up-to-date : True
Hostname : ovirt1.satellite.local
Host ID : 1
Engine status : {"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "paused"}
Score : 3000
stopped : False
Local maintenance : False
crc32 : d12f5f8e
Host timestamp : 8505
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=8505 (Mon Jul 4 16:19:52 2016)
    host-id=1
    score=3000
    maintenance=False
    state=EngineStarting
    stopped=False
[root@ovirt1 ~]#

Just an update to the tests using a different target for the HE Domain. This indeed solves the issue with the paused HE VM in RHV 4.0.

(In reply to Nir Soffer from comment #5)
> Roy, can you confirm my theory on this problem?

Moving to SLA, as there's nothing we can currently do with this.
Also, IIUC, once the HE domain is properly registered in the engine, we should be OK, no?

(In reply to Allon Mureinik from comment #10)
> (In reply to Nir Soffer from comment #5)
> Also, IIUC, once the HE domain is properly registered in the engine, we
> should be OK, no?

Elad, can you confirm that after hosted engine storage domain is imported, we can safely create and remove storage domains using luns from the same target used by hosted engine storage domain?

(In reply to Nir Soffer from comment #11)
> Elad, can you confirm that after hosted engine storage domain is imported,
> we can safely create and remove storage domains using luns from the same
> target used by hosted engine storage domain?

No, tested on latest 4.0, the HE VM gets paused on storage domain creation when using a LUN from the same target as the HE storage uses.
Note that during hosted-engine deployment, the connection to the storage server can be done via only a single target, while the storage servers we use expose multiple targets (4 in ours). This means that after the deployment is done, when initiating an iSCSI storage domain creation, the available target to use is the one that vdsm already connected to, which is the one that was used during deployment for HE storage.
If connecting to a different target (even of the same storage server) and creating an iSCSI domain over an unused LUN, storage domain creation will succeed.

[root@blond-vdsh ~]# hosted-engine --vm-status

--== Host 1 status ==--

Status up-to-date : True
Hostname : blond-vdsh.qa.lab.tlv.redhat.com
Host ID : 1
Engine status : {"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "paused"}
Score : 3400
stopped : False
Local maintenance : False
crc32 : 09c1acac
Host timestamp : 1133265
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=1133265 (Sun Jul 10 16:38:32 2016)
    host-id=1
    score=3400
    maintenance=False
    state=EngineStop
    stopped=False
    timeout=Wed Jan 14 04:49:22 1970

rhevm-4.0.0.6-0.1.el7ev.noarch
ovirt-hosted-engine-ha-2.0.0-1.el7ev.noarch
ovirt-hosted-engine-setup-2.0.0.2-1.el7ev.noarch
vdsm-4.18.4-2.el7ev.x86_64

Summing this up:
1. Using the same target as the HE domain will cause the VM to lose its storage and pause.
2. Using a different target to add a domain overcomes that.

So we need to keep the hosted engine target from being used for other domains.

- Production wise, how common a practice is it to use the same target for domains?
- Do we have a supported way today to make a target exclusive and not reusable? (does that make sense?)

(In reply to Roy Golan from comment #13)
> Summing this up:
> 1. Using the same target as the HE domain will cause the VM to lose its storage
> and pause.
> 2. Using a different target to add a domain overcomes that.
>
> So we need to keep the hosted engine target from being used for other
> domains.
>
> - Production wise, how common a practice is it to use the same target for domains?

Quite common, but:

>
> - Do we have a supported way today to make a target exclusive and not
> reusable? (does that make sense?)

It's basically a configuration item.
I think that documentation that HE storage should be exposed via a different target than the other storage domains is a reasonable limitation.

I am moving this to documentation as this has really nothing to do with hosted-engine-ha component and the last decision was to document this properly.

(In reply to Elad from comment #12)
> (In reply to Nir Soffer from comment #11)
>
> > Elad, can you confirm that after hosted engine storage domain is imported,
> > we can safely create and remove storage domains using luns from the same
> > target used by hosted engine storage domain?
>
> No, tested on latest 4.0, the HE VM gets paused on storage domain creation
> when using a LUN from the same target as the HE storage uses.

Did you finish the import of the storage domain before you tried to create a new domain using the same target?

> Note that during hosted-engine deployment, the connection to the storage
> server can be done via only a single target, while the storage servers we
> use expose multiple targets (4 in ours). This means that after the
> deployment is done, when initiating an iSCSI storage domain creation, the
> available target to use is the one that vdsm already connected to, which is
> the one that was used during deployment for HE storage.

I don't see why this should be a problem to create other domains on same target.

Engine should know the connection used to connect to the hosted engine imported domain. When you add new connections (more paths) to the same target, it should not affect the existing connection.
After creating a new storage domain, engine should not disconnect all connections to a target, since it knows one of them is used for an existing storage domain (the imported hosted engine domain).

If it really does not work, this is engine connection management issue, or maybe import issue (not registering the connection correctly).

(In reply to Nir Soffer from comment #16)
> If it really does not work, this is engine connection management issue, or
> maybe import issue (not registering the connection correctly).

Most probably an import issue. This doesn't happen with any other domain creation.

(In reply to Nir Soffer from comment #16)
> Did you finish the import of the storage domain before you tried to
> create a new domain using the same target?

No, I used this target to create the first domain in the setup, so it's before

Before the engine disconnects from the storage server of an iSCSI domain, it checks whether the domain connection is "used" by anything else (LUN disk/iSCSI domain) and if it is, it skips the disconnection.
In the described scenario, the HE engine domain wasn't imported to the setup and therefore the engine wasn't "aware" of its existence (if it had been added, the engine wouldn't disconnect from it at least in the common case) - so first importing the domain either automatically or manually should solve the issue.

Elad, can you confirm this by testing again? Please import the HE domain first.

thanks,
Liron.

The creation of an iSCSI storage domain using the same target as the one used by the hosted_storage SD, after hosted_storage SD import (that takes place automatically right after first storage domain creation), does not cause disconnection from the target and the HE VM doesn't get paused.

Tested with the following:
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
ovirt-setup-lib-1.0.2-1.el7ev.noarch
libgovirt-0.3.3-1.el7_2.4.x86_64
ovirt-imageio-common-0.3.0-0.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-hosted-engine-ha-2.0.2-1.el7ev.noarch
ovirt-hosted-engine-setup-2.0.1.4-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.7.0-1.el7ev.noarch
ovirt-host-deploy-1.5.1-1.el7ev.noarch
ovirt-imageio-daemon-0.3.0-0.el7ev.noarch
vdsm-hook-vmfex-dev-4.18.11-1.el7ev.noarch
vdsm-python-4.18.11-1.el7ev.noarch
vdsm-cli-4.18.11-1.el7ev.noarch
vdsm-infra-4.18.11-1.el7ev.noarch
vdsm-jsonrpc-4.18.11-1.el7ev.noarch
vdsm-api-4.18.11-1.el7ev.noarch
vdsm-4.18.11-1.el7ev.x86_64
vdsm-xmlrpc-4.18.11-1.el7ev.noarch
vdsm-yajsonrpc-4.18.11-1.el7ev.noarch
rhevm-appliance-20160731.0-1.el7ev.noarch
rhevm-4.0.2.2-0.1.el7ev.noarch

(In reply to Liron Aravot from comment #20)
> Before the engine disconnects from the storage server of an iSCSI domain, it
> checks whether the domain connection is "used" by anything else (LUN
> disk/iSCSI domain) and if it is, it skips the disconnection.
> In the described scenario, the HE engine domain wasn't imported to the setup
> and therefore the engine wasn't "aware" of its existence (if it had
> been added, the engine wouldn't disconnect from it at least
> in the common case) - so first importing the domain either automatically or
> manually should solve the issue.
>
> Elad, can you confirm this by testing again? Please import the HE domain
> first.
>
> thanks,
> Liron.

Liron and Elad, while bootstrapping the HE env you already have the hosted-engine domain connected, but the engine isn't active yet. After the engine is installed you add your first DATA domain, which **mustn't** be the hosted_storage. And if that domain is on the same target, it will disconnect because the engine still doesn't know about the hosted_engine. VDSM does, engine doesn't.

So first we need to prevent that. You may consider querying VDSM for that, albeit again, we don't have an SPM at that stage yet.

Roy, the comment came out somewhat unclear and I want to make sure I got your meaning, care to explain again?

Doc text filled, needs doc review

Moving to 'NEW' to be retriaged as resources allow.

Assigning to Tahlia for review.

Tahlia, can you please add a brief warning to the SHE Guide, and edit the doc text in this bug, so we can pull it into the Release Notes?

Moving back to 4.0.6 temporarily in order to pull the Known Issue into the Release Notes.

Now published at https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1-beta/html-single/self-hosted_engine_guide/#chap-Deploying_Self-Hosted_Engine and https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.0/html-single/self-hosted_engine_guide/#chap-Deploying_Self-Hosted_Engine
|
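As a practical aid for the limitation documented in this bug, the following minimal sketch checks whether a target planned for the first data domain already has an active session on the host. It assumes the text comes from `iscsiadm -m session -P 3`, as shown in the description, and relies only on the `Target: <iqn>` lines visible there. The function names and `OTHER_TARGET` are hypothetical, and this is not part of any oVirt or RHV tooling.

```python
import re

# Matches the "Target: <iqn>" lines of `iscsiadm -m session -P 3` output.
# "Target Reset Timeout:" does not match because no colon follows "Target".
TARGET_RE = re.compile(r"\bTarget:\s+(\S+)")


def logged_in_targets(session_output):
    """Return the set of target IQNs that appear in the session listing."""
    return set(TARGET_RE.findall(session_output))


def shares_target_with_existing_session(session_output, planned_target):
    """True if the host already has a session to the planned target."""
    return planned_target in logged_in_targets(session_output)


# Excerpt of the session listing from the bug description.
SAMPLE = """\
Target: iqn.2003-01.org.linux-iscsi.kirk.x8664:sn.21dc789db84d (non-flash)
    Current Portal: 192.168.100.1:3260,1
    Target Reset Timeout: 30
"""

HE_TARGET = "iqn.2003-01.org.linux-iscsi.kirk.x8664:sn.21dc789db84d"
OTHER_TARGET = "iqn.2003-01.org.example:data"  # hypothetical second target

print(shares_target_with_existing_session(SAMPLE, HE_TARGET))     # True
print(shares_target_with_existing_session(SAMPLE, OTHER_TARGET))  # False
```

During bootstrap, a `True` result for the planned target suggests that it is the one hosted_storage is using, so a different target should be chosen for the first data domain.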