Bug 1474904 - iSCSI Multipath IO issues: vdsm tries to connect to unreachable paths. [NEEDINFO]
Summary: iSCSI Multipath IO issues: vdsm tries to connect to unreachable paths.
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: vdsm
Classification: oVirt
Component: Services
Version: 4.19.20
Hardware: x86_64
OS: All
Priority: unspecified
Severity: high (1 vote)
Target Milestone: ovirt-4.1.6
Assignee: Maor
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-07-25 15:24 UTC by Vinícius Ferrão
Modified: 2021-06-29 15:27 UTC (History)
13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-09-03 13:04:44 UTC
oVirt Team: Storage
ai90iv: needinfo? (tnisan)
nsoffer: needinfo? (j.keetRHELbugzilla)
rule-engine: ovirt-4.1+




Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 115465 0 master NEW iscsi session login fix. No more trashing iscsi multipathing. 2021-06-28 23:03:42 UTC

Description Vinícius Ferrão 2017-07-25 15:24:24 UTC
Description of problem:
Hello,

iSCSI Multipath simply does not work on oVirt/RHV. As detected by some guys over the oVirt mailing list: "the OVirt implementation of iSCSI-Bonding assumes that all network interfaces in the bond can connect/reach all targets, including those in the other net(s). The fact that you use separate, isolated networks means that this is not the case in your setup (and not in mine). I am not sure if this is a bug, a design flaw or a feature, but as a result of this OVirt's iSCSI-Bonding does not work".

Since the log files are too big, I've uploaded them to a web server; here's the link: http://www.if.ufrj.br/~ferrao/ovirt

Version-Release number of selected component (if applicable):
[root@ovirt3 ~]# imgbase w
2017-07-25 12:18:09,402 [INFO] You are on rhvh-4.1-0.20170706.0+1

[root@ovirt3 ~]# rpm -qa | grep -i vdsm
vdsm-xmlrpc-4.19.20-1.el7ev.noarch
vdsm-hook-vmfex-dev-4.19.20-1.el7ev.noarch
vdsm-client-4.19.20-1.el7ev.noarch
vdsm-hook-openstacknet-4.19.20-1.el7ev.noarch
vdsm-yajsonrpc-4.19.20-1.el7ev.noarch
vdsm-python-4.19.20-1.el7ev.noarch
vdsm-cli-4.19.20-1.el7ev.noarch
vdsm-hook-vhostmd-4.19.20-1.el7ev.noarch
vdsm-4.19.20-1.el7ev.x86_64
vdsm-gluster-4.19.20-1.el7ev.noarch
vdsm-hook-fcoe-4.19.20-1.el7ev.noarch
vdsm-jsonrpc-4.19.20-1.el7ev.noarch
vdsm-api-4.19.20-1.el7ev.noarch
vdsm-hook-ethtool-options-4.19.20-1.el7ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. Installed oVirt Node 4.1.3 with the following network settings:

eno1 and eno2 on a 802.3ad (LACP) Bond, creating a bond0 interface.
eno3 with 9216 MTU.
eno4 with 9216 MTU.
vlan11 on eno3 with 9216 MTU and fixed IP addresses.
vlan12 on eno4 with 9216 MTU and fixed IP addresses.

eno3 and eno4 are my iSCSI MPIO interfaces, completely segregated, on different switches.

2. Started the Self-hosted Engine installation, after three hours of waiting caused by: https://bugzilla.redhat.com/show_bug.cgi?id=1454536

3. Selected iSCSI as default interface for Hosted Engine. Everything was fine.

4. On the Hosted Engine I’ve done the following:

a. System > Data Centers > Default > Networks
. Created iSCSI1 with VLAN 11 and MTU 9216, removed VM Network option.
. Created iSCSI2 with VLAN 12 and MTU 9216, removed VM Network option.

b. System > Data Centers > Default > Clusters > Default > Hosts > ovirt3.cc.if.ufrj.br (my machine)

Selected Setup Host Networks and moved iSCSI1 to eno3 and iSCSI2 to eno4. Both icons gone green, indicating an “up” state.

c. System > Data Centers > Default > Clusters

Selected Logical Networks and then Manage Network. Removed the Required checkbox from both iSCSI networks.

d. System > Data Centers > Default > Storage

Added an iSCSI share with two initiators. Both show up correctly.

e. System > Data Centers

Now the iSCSI Multipath tab is visible. Selected it and added an iSCSI Bond:
. iSCSI1 and iSCSI2 selected on Logical Networks.
. Two iqn’s selected on Storage Targets.

5. oVirt just goes down. VDSM goes crazy and everything “crashes”. iSCSI is still alive, since we can still talk to the Self Hosted Engine, but **NOTHING** works. If the iSCSI Bond is removed, everything returns to a usable state.
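For contrast, a per-network iface binding done by hand with iscsiadm looks like the sketch below. It is a dry run that only prints the commands (the `run` wrapper is a hypothetical stand-in); the target IQN and portal addresses are the ones that appear elsewhere in this bug, and the iface names mirror the VLAN interfaces above.

```shell
# Dry-run sketch: bind each storage VLAN interface to its own iscsiadm iface
# and log in only through the portal reachable from that network.
# 'run' just echoes the command instead of executing it against a live SAN.
run() { echo "iscsiadm $*"; }

tgt="iqn.2017-07.br.ufrj.if.cc.storage.ctl:ovirt-mpio"
run -m iface -I eno3.11 --op new
run -m iface -I eno3.11 --op update -n iface.net_ifacename -v eno3.11
run -m iface -I eno4.12 --op new
run -m iface -I eno4.12 --op update -n iface.net_ifacename -v eno4.12
run -m node -T "$tgt" -p 192.168.11.14:3260 -I eno3.11 --login
run -m node -T "$tgt" -p 192.168.12.14:3260 -I eno4.12 --login
```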

Actual results:
Broken iSCSI connectivity and everything going down on oVirt HE

Expected results:
MPIO on iSCSI paths.

Additional info:
I've got a trial of RHV just to confirm that the bug exists in RHV too. That's why "imgbase w" shows RHV-H. Just in case, my storage system is FreeNAS on common x86_64 hardware. It's tested and working as expected, and confirmed to be working with MPIO in other hypervisor solutions.

Comment 1 Maor 2017-07-25 20:48:49 UTC
Hi Vinícius,

I first want to clarify the issue:
You have two interfaces, eno3.11 and eno4.12:
iface eno3.11 can only log in to 192.168.11.14
iface eno4.12 can only log in to 192.168.12.14
The host becomes non-operational since eno3.11 fails to log in to 192.168.12.14 and eno4.12 fails to log in to 192.168.11.14.
What you are suggesting is that oVirt should be able to group a network with a target and make those groups work with the iSCSI multipath.
Is this correct?

Another question (if the above is indeed the case): what will happen if you configure 2 iSCSI bonds, each one with its specific network and target?
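The matrix above can be modeled with a one-line subnet comparison (a toy illustration, not vdsm code; the host-side .3 addresses are taken from the iscsiadm output later in this bug):

```shell
# Toy model of the login matrix: with two isolated L2 networks and no
# routing, an iface reaches a portal only when both sit in the same subnet.
# Compares the first three octets, which is enough for these two /28s.
same_subnet() { test "${1%.*}" = "${2%.*}"; }

same_subnet 192.168.11.3 192.168.11.14 && echo "eno3.11 -> 192.168.11.14: login ok"
same_subnet 192.168.11.3 192.168.12.14 || echo "eno3.11 -> 192.168.12.14: unreachable"
same_subnet 192.168.12.3 192.168.12.14 && echo "eno4.12 -> 192.168.12.14: login ok"
same_subnet 192.168.12.3 192.168.11.14 || echo "eno4.12 -> 192.168.11.14: unreachable"
```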

Comment 2 Vinícius Ferrão 2017-07-27 04:58:23 UTC
That's exactly the topology. Just to be extremely precise, both networks are /28, so hosts are addressable from 1 to 14.

Here's a drawing of the topology:

          +---------------+               
          |    FreeNAS    |
          +---------------+
             |         |
             |         | 10GbE Links
             |         |
+---------------+   +---------------+
| Nexus 3048 #1 |   | Nexus 3048 #2 |
+---------------+   +---------------+
             |         |
             |         | 1GbE Links
             |         |
          +---------------+
       5x |   oVirt/RHV   | 
          +---------------+

VLAN11: 192.168.11.0/28
VLAN12: 192.168.12.0/28
oVirt Servers from 1 to 5, Storage on 14.

So as you can see, this is a classic MPIO iSCSI topology: a plain layer-2 domain without any routing. VLAN11 only exists on the first Nexus and VLAN12 only exists on the second Nexus.
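As a side note, the /28 addressing works out as follows (a self-contained arithmetic check, no network access needed):

```shell
# Quick check of the /28 claim: 2^(32-28) = 16 addresses per subnet, minus
# network (.0) and broadcast (.15) leaves hosts .1 through .14, so
# "Storage on 14" is exactly the last usable address.
prefix=28
total=$(( 1 << (32 - prefix) ))
usable=$(( total - 2 ))
echo "addresses=$total usable=$usable"   # addresses=16 usable=14
```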

Changing the configuration as you requested brings up the storage domains once again, but I don't know whether it's working with MPIO or not. It does not make sense to configure it this way. At this moment I have two iSCSI multipath bonds. Here are the screenshots:

http://www.if.ufrj.br/~ferrao/ovirt/separate-mpio0.png
http://www.if.ufrj.br/~ferrao/ovirt/separate-mpio1.png
http://www.if.ufrj.br/~ferrao/ovirt/separate-mpio2.png

multipath -ll reports two paths but only one active:
[root@ovirt3 ~]# multipath -ll
36589cfc0000003400be2813b01be08c3 dm-24 FreeNAS ,iSCSI Disk      
size=10T features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=1 status=active
| `- 9:0:0:0  sdc 8:32 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  `- 10:0:0:0 sdd 8:48 active ready running
36589cfc0000007e77a9df827b65176f2 dm-12 FreeNAS ,iSCSI Disk      
size=200G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  `- 8:0:0:0  sdb 8:16 active ready running

If more information is needed, please let me know.

Comment 3 Maor 2017-07-27 06:59:37 UTC
(In reply to Vinícius Ferrão from comment #2)
> That's the exactly topology, just to be extremely precise, both networks are
> /28 so it's addressable from 1 to 14.
> 
> Here's a drawing of the topology:
> 
>           +---------------+               
>           |    FreeNAS    |
>           +---------------+
>              |         |
>              |         | 10GbE Links
>              |         |
> +---------------+   +---------------+
> | Nexus 3048 #1 |   | Nexus 3048 #2 |
> +---------------+   +---------------+
>              |         |
>              |         | 1GbE Links
>              |         |
>           +---------------+
>        5x |   oVirt/RHV   | 
>           +---------------+
> 
> VLAN11: 192.168.11.0/28
> VLAN12: 192.168.12.0/28
> oVirt Servers from 1 to 5, Storage on 14.
> 
> So as you can see this is classic MPIO iSCSI topology. Plain layer2 domain
> without any routing. VLAN11 only exists on the first Nexus and VLAN12 only
> exists on the second Nexus. 
> 
> Changing the configuration as you requested brings up the storages once
> again. But I don't know if it's working with MPIO or not. It does not makes
> sense to configure this way. At this moment I have two iSCSI Multipaths.
> Here's the photos:
> 
> http://www.if.ufrj.br/~ferrao/ovirt/separate-mpio0.png
> http://www.if.ufrj.br/~ferrao/ovirt/separate-mpio1.png
> http://www.if.ufrj.br/~ferrao/ovirt/separate-mpio2.png

It doesn't mean that you have two different iSCSI multipath processes on your host.
This is only the oVirt configuration; eventually it is translated to iscsiadm login commands, and multipath on your host is still one process that should be connected through its two NICs. With this configuration, though, you can achieve the isolation level you want.

> 
> Multipath -ll report two paths but only one active:
> [root@ovirt3 ~]# multipath -ll
> 36589cfc0000003400be2813b01be08c3 dm-24 FreeNAS ,iSCSI Disk      
> size=10T features='0' hwhandler='0' wp=rw
> |-+- policy='service-time 0' prio=1 status=active
> | `- 9:0:0:0  sdc 8:32 active ready running
> `-+- policy='service-time 0' prio=1 status=enabled
>   `- 10:0:0:0 sdd 8:48 active ready running
> 36589cfc0000007e77a9df827b65176f2 dm-12 FreeNAS ,iSCSI Disk      
> size=200G features='0' hwhandler='0' wp=rw
> `-+- policy='service-time 0' prio=1 status=active
>   `- 8:0:0:0  sdb 8:16 active ready running
> 
> If more information is needed, please let me know.

Can you please also share the following output:
   iscsiadm -m session -P 3

Comment 4 Vinícius Ferrão 2017-07-27 07:28:26 UTC
Maor, there is:

[root@ovirt3 ~]# iscsiadm -m session -P 3
iSCSI Transport Class version 2.0-870
version 6.2.0.873-35
Target: iqn.2017-07.br.ufrj.if.cc.storage.ctl:ovirt-he (non-flash)
	Current Portal: 192.168.12.14:3260,1
	Persistent Portal: 192.168.12.14:3260,1
		**********
		Interface:
		**********
		Iface Name: default
		Iface Transport: tcp
		Iface Initiatorname: iqn.1994-05.com.redhat:89c40b169bd9
		Iface IPaddress: 192.168.12.3
		Iface HWaddress: <empty>
		Iface Netdev: <empty>
		SID: 1
		iSCSI Connection State: LOGGED IN
		iSCSI Session State: LOGGED_IN
		Internal iscsid Session State: NO CHANGE
		*********
		Timeouts:
		*********
		Recovery Timeout: 5
		Target Reset Timeout: 30
		LUN Reset Timeout: 30
		Abort Timeout: 15
		*****
		CHAP:
		*****
		username: <empty>
		password: ********
		username_in: <empty>
		password_in: ********
		************************
		Negotiated iSCSI params:
		************************
		HeaderDigest: None
		DataDigest: None
		MaxRecvDataSegmentLength: 262144
		MaxXmitDataSegmentLength: 131072
		FirstBurstLength: 131072
		MaxBurstLength: 16776192
		ImmediateData: Yes
		InitialR2T: Yes
		MaxOutstandingR2T: 1
		************************
		Attached SCSI devices:
		************************
		Host Number: 8	State: running
		scsi8 Channel 00 Id 0 Lun: 0
			Attached scsi disk sdb		State: running
Target: iqn.2017-07.br.ufrj.if.cc.storage.ctl:ovirt-mpio (non-flash)
	Current Portal: 192.168.11.14:3260,1
	Persistent Portal: 192.168.11.14:3260,1
		**********
		Interface:
		**********
		Iface Name: eno3.11
		Iface Transport: tcp
		Iface Initiatorname: iqn.1994-05.com.redhat:89c40b169bd9
		Iface IPaddress: 192.168.11.3
		Iface HWaddress: <empty>
		Iface Netdev: eno3.11
		SID: 2
		iSCSI Connection State: LOGGED IN
		iSCSI Session State: LOGGED_IN
		Internal iscsid Session State: NO CHANGE
		*********
		Timeouts:
		*********
		Recovery Timeout: 5
		Target Reset Timeout: 30
		LUN Reset Timeout: 30
		Abort Timeout: 15
		*****
		CHAP:
		*****
		username: <empty>
		password: ********
		username_in: <empty>
		password_in: ********
		************************
		Negotiated iSCSI params:
		************************
		HeaderDigest: None
		DataDigest: None
		MaxRecvDataSegmentLength: 262144
		MaxXmitDataSegmentLength: 131072
		FirstBurstLength: 131072
		MaxBurstLength: 16776192
		ImmediateData: Yes
		InitialR2T: Yes
		MaxOutstandingR2T: 1
		************************
		Attached SCSI devices:
		************************
		Host Number: 9	State: running
		scsi9 Channel 00 Id 0 Lun: 0
			Attached scsi disk sdc		State: running
	Current Portal: 192.168.12.14:3260,1
	Persistent Portal: 192.168.12.14:3260,1
		**********
		Interface:
		**********
		Iface Name: eno4.12
		Iface Transport: tcp
		Iface Initiatorname: iqn.1994-05.com.redhat:89c40b169bd9
		Iface IPaddress: 192.168.12.3
		Iface HWaddress: <empty>
		Iface Netdev: eno4.12
		SID: 3
		iSCSI Connection State: LOGGED IN
		iSCSI Session State: LOGGED_IN
		Internal iscsid Session State: NO CHANGE
		*********
		Timeouts:
		*********
		Recovery Timeout: 5
		Target Reset Timeout: 30
		LUN Reset Timeout: 30
		Abort Timeout: 15
		*****
		CHAP:
		*****
		username: <empty>
		password: ********
		username_in: <empty>
		password_in: ********
		************************
		Negotiated iSCSI params:
		************************
		HeaderDigest: None
		DataDigest: None
		MaxRecvDataSegmentLength: 262144
		MaxXmitDataSegmentLength: 131072
		FirstBurstLength: 131072
		MaxBurstLength: 16776192
		ImmediateData: Yes
		InitialR2T: Yes
		MaxOutstandingR2T: 1
		************************
		Attached SCSI devices:
		************************
		Host Number: 10	State: running
		scsi10 Channel 00 Id 0 Lun: 0
			Attached scsi disk sdd		State: running

Comment 5 Maor 2017-07-27 09:16:23 UTC
That seems to be OK.
Target iqn.2017-07.br.ufrj.if.cc.storage.ctl:ovirt-mpio is connected through the following interfaces:
  iface eno4.12 with portal 192.168.12.14:3260,1 
  iface eno3.11 with portal 192.168.11.14:3260,1

Regarding what you asked before about two paths with only one active: that seems to be the normal iSCSI multipath behavior.
'active' means the path group is currently receiving I/O requests, and 'enabled' means a path group to try if the active path group has no paths in the ready state.

Can you test it and check whether that answers your requirement?
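The active/enabled layout can also be read off mechanically; the snippet below greps a verbatim copy of the 10T LUN's path groups from the `multipath -ll` output in comment 2 (self-contained, no multipath binary required):

```shell
# Extract the path-group states from a verbatim copy of the transcript.
# One group 'active' plus one 'enabled' is the normal dm-multipath
# failover layout, not a broken path.
mp=$(cat <<'EOF'
|-+- policy='service-time 0' prio=1 status=active
| `- 9:0:0:0  sdc 8:32 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  `- 10:0:0:0 sdd 8:48 active ready running
EOF
)
printf '%s\n' "$mp" | grep -o 'status=[a-z]*'   # status=active, status=enabled
```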

Comment 6 Vinícius Ferrão 2017-07-27 19:40:38 UTC
Maor, this works. I'm not sure if the multipath is in fact load balancing and whether it will fail over in case of a link failure. I should pull the cable to see if a VM keeps running. I can do this in the coming days, since I don't have physical access to the datacenter at the moment.

But there's a problem: what's the point of the iSCSI Multipath tab on the hosted engine? If I remove everything on this tab, the output of "multipath -ll" and "iscsiadm -m session -P3" is exactly the same.

So I really don't get it. What am I missing? Why does the iSCSI Multipath tab exist if the results are the same?

[root@ovirt3 ~]# iscsiadm -m session
tcp: [1] 192.168.12.14:3260,1 iqn.2017-07.br.ufrj.if.cc.storage.ctl:ovirt-he (non-flash)
tcp: [2] 192.168.12.14:3260,1 iqn.2017-07.br.ufrj.if.cc.storage.ctl:ovirt-mpio (non-flash)
tcp: [3] 192.168.11.14:3260,1 iqn.2017-07.br.ufrj.if.cc.storage.ctl:ovirt-mpio (non-flash)

[root@ovirt3 ~]# multipath -ll
36589cfc0000003400be2813b01be08c3 dm-26 FreeNAS ,iSCSI Disk      
size=10T features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=1 status=active
| `- 9:0:0:0  sdc 8:32 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  `- 10:0:0:0 sdd 8:48 active ready running
36589cfc0000007e77a9df827b65176f2 dm-12 FreeNAS ,iSCSI Disk      
size=200G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  `- 8:0:0:0  sdb 8:16 active ready running

Thanks,
V.

Comment 7 Maor 2017-07-30 08:08:20 UTC
(In reply to Vinícius Ferrão from comment #6)
> Maor, this works. I'm not sure if the multipath is in fact load balancing
> and if it will failover in case of a link failure. I should pull the cable
> to see if a VM keeps running. I can do this on the following days, since I
> don't have physical access to the datacenter this moment.

That is basically what Linux multipath should support (regardless of the oVirt configuration).

Can you please share an update regarding the test results?

> 
> But there's a problem. What's the point of iSCSI Multipath tab on the hosted
> engine?
> If I remove everything on this tab the output of "multipath -ll" and
> "iscsiadm -m session -P3" are exactly the same.

Hosted engine with an iSCSI bond is not officially supported yet; you can track the open bug here: https://bugzilla.redhat.com/1193961

The current design of the iSCSI multipath is that the engine does not disconnect from network interfaces while the storage domain is active. (see https://bugzilla.redhat.com/show_bug.cgi?id=1094144#c2)

It should disconnect the network interfaces when the storage domain/Host is moving to maintenance (Though, there is an issue which I encountered while verifying your scenario, see https://bugzilla.redhat.com/show_bug.cgi?id=1476030)

> 
> So I really don't get it. What I'm missing? Why the iSCSI Multipath tab
> exists if the results are the same?

See my previous answer, the behavior should be to deactivate the storage domain/Host again, although as I mentioned before, there is an open issue on that (see BZ1476030)

> 
> [root@ovirt3 ~]# iscsiadm -m session
> tcp: [1] 192.168.12.14:3260,1 iqn.2017-07.br.ufrj.if.cc.storage.ctl:ovirt-he
> (non-flash)
> tcp: [2] 192.168.12.14:3260,1
> iqn.2017-07.br.ufrj.if.cc.storage.ctl:ovirt-mpio (non-flash)
> tcp: [3] 192.168.11.14:3260,1
> iqn.2017-07.br.ufrj.if.cc.storage.ctl:ovirt-mpio (non-flash)
> 
> [root@ovirt3 ~]# multipath -ll
> 36589cfc0000003400be2813b01be08c3 dm-26 FreeNAS ,iSCSI Disk      
> size=10T features='0' hwhandler='0' wp=rw
> |-+- policy='service-time 0' prio=1 status=active
> | `- 9:0:0:0  sdc 8:32 active ready running
> `-+- policy='service-time 0' prio=1 status=enabled
>   `- 10:0:0:0 sdd 8:48 active ready running
> 36589cfc0000007e77a9df827b65176f2 dm-12 FreeNAS ,iSCSI Disk      
> size=200G features='0' hwhandler='0' wp=rw
> `-+- policy='service-time 0' prio=1 status=active
>   `- 8:0:0:0  sdb 8:16 active ready running
> 
> Thanks,
> V.

Comment 8 Uwe Laverenz 2017-08-01 09:01:42 UTC
Hi Maor,

I guess there is some kind of difference or misunderstanding between what we (the users/admins) expect from "iSCSI Bonding" and what you (the developers) intended.

If you add an iSCSI storage to OVirt (2 networks, 2 targets), you already get both paths enabled (active/passive) even without "iSCSI Bonding".

What I expected from creating an iSCSI Bond was to be able to control what kind of load balancing or failover policy (fixed, least recently used, round robin) should be used for the storage domain.

What actually seems to happen is that an iSCSI Bond is something vdsm uses to monitor the storage paths?! While creating an iSCSI Bond, the system tries to change the network and/or multipathd configuration in a way that leads to a system failure, at least when you use separated storage networks.

However, the question is whether we don't understand the concept of iSCSI Bonding in OVirt or whether the described behaviour is a bug. :)

thanks,
Uwe

Comment 9 Maor 2017-08-01 12:21:14 UTC
(In reply to Uwe Laverenz from comment #8)
> Hi Maor,
> 
> I guess there is some kind of difference or misunderstanding of what we (the
> users/admins) expect from "iSCSI Bonding" and what you (the developers)
> intended.
> 
> If you add an iSCSI storage to OVirt (2 networks, 2 targets), you already
> get both paths enabled (active/passive) even without "iSCSI Bonding".


Are you referring to 2 network interfaces on the storage server, or 2 network interfaces on the host?
If you do not declare an iSCSI bond in oVirt, the engine will not pass VDSM the non-required network interfaces to connect to, and iscsiadm will not connect through the new network interface of the host.

> 
> What I expected from creating an iSCSI Bond was to be able to control what
> kind of load balancing or failover policy (fixed, least recently used, round
> robin) should be used for the storage domain.


That should be done before you create the iSCSI bond in the engine, through the iSCSI multipath configuration on the Linux host (see https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html-single/DM_Multipath/index.html#mpio_configfile )

> 
> What actually seems to happen is that an iSCSI Bond is something the vdsm
> uses to monitor the storage paths?! While creating an iSCSI Bond the system
> tries to change the network and/or multipathd configuration in a way that
> leads to a system failure, at least when you use separated storage networks.

An iSCSI Bond in the engine configures the non-required network interfaces that should be connected to the iSCSI storage domain.
You can have 10 non-required networks, but if you configure only 2 of them in the iSCSI bond, then iscsiadm will only connect through those 2.

VDSM does monitor those networks, and if all of them fail to connect with the iSCSI storage domain, it should go to a non-active state.

The system should not change the multipathd configuration; it should only use the connect command on those network interfaces. multipathd should run on the host regardless of the iSCSI bond.

> 
> However, the question is whether we don't understand the concept of iSCSI
> Bonding in OVirt or the described behaviour is a bug. :)
> 
> thanks,
> Uwe

Please let me know if anything is still unclear

Comment 10 Vinícius Ferrão 2017-08-04 20:17:22 UTC
Hello Maor, I've answered through email but the messages weren't attached to the issue. Re-answering:

About the tests: I was able to test the architecture today. It failed over successfully. I'm not sure if MPIO is increasing the bandwidth; I was not able to simulate enough traffic to see if it would exceed gigabit speeds over two paths. But this is almost good.

On the issue with the hosted engine: perhaps I explained the question in a confusing way. Sorry.

I’m not running the Self Hosted Engine with multiparty because as you said it’s unsupported. So you can see that I’ve three iSCSI connection. One is the Self Hosted Engine and the remaining two are the multipaths to a generic LUN just for testing.

What I’m talking about is the necessity of configuring the iSCSI Multipath Tab on the web interface of oVirt. I’ve removed everything from this tab and the behavior was the same as the configuration that you’ve asked me to do, with the iSCSI1 network only selected for path 1 and the iSCSI2 network only for the second path.

So the question again is: what is the purpose of this tab? The documentation clearly says this should be configured for multipath to work, but that does not appear to be 100% accurate, since the result was the same whether I configured it (in the way you said) or not at all.

Thanks,
V.

Comment 11 Uwe Laverenz 2017-08-07 13:21:12 UTC
Hello Maor,

(In reply to Maor from comment #9)
> Are you referring to 2 network interfaces on the storage server? or 2
> network interfaces on the host?

I use both: 2 separate network interfaces on the storage server and 2 separate network interfaces on the host. The interfaces are connected to 2 separate networks/VLANs, one network card per network. These networks are dedicated storage networks and therefore aren't routed or otherwise reachable from other networks.

When I connect my host to the iSCSI storage server, I first connect to the portal on the first network and then to the portal on the second network. The host connects to both portals and uses both network interfaces. It automatically detects that there are 2 paths for each LUN, so you already have "multipath" at this point, configured as active/passive. The only missing detail is setting the round-robin policy, which can easily be done via multipath.conf.
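The connect sequence just described can be written down as plain iscsiadm calls (a dry run that only prints the commands; the portals are the ones from this bug, and the `run` wrapper is illustrative):

```shell
# Dry run of the two-portal connect flow: discover targets on each storage
# network, then log in to everything found. Commands are printed, not executed.
run() { echo "iscsiadm $*"; }

run -m discovery -t sendtargets -p 192.168.11.14:3260
run -m discovery -t sendtargets -p 192.168.12.14:3260
run -m node --login    # opens one session per discovered portal;
                       # dm-multipath then assembles them into one map
```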


> If you do not declare an iSCSI bond in oVirt, the engine will not pass VDSM
> the non-required network interfaces to connect to it, and iscsiadm will not
> be connected with the new network interface of the host.

With the setup described above, all network interfaces for iSCSI are already connected and iscsiadm uses them; there aren't any unconnected "non-required" interfaces left.

> That should be done before you create the iSCSI bond in the engine, thorough
> the iSCSI multipath in the linux host (see
> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/
> html-single/DM_Multipath/index.html#mpio_configfile )

This is configured in the "defaults" section of "/etc/multipath.conf". The problem is that VDSM overwrites this file unless you put "# VDSM PRIVATE" into it. So you can't get the "round robin" policy without keeping VDSM out.
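A minimal sketch of such a protected multipath.conf, assuming the "# VDSM PRIVATE" tag convention described here and RHEL7-era option names (the first comment line is a placeholder; the file is written to a temp path so it can be inspected safely):

```shell
# Write an example multipath.conf that forces round robin and carries the
# '# VDSM PRIVATE' tag so vdsm-tool leaves it alone (per this discussion).
conf=$(mktemp)
cat > "$conf" <<'EOF'
# RHEV REVISION 1.1
# VDSM PRIVATE
defaults {
    path_selector        "round-robin 0"
    path_grouping_policy multibus
}
EOF
grep -c 'VDSM PRIVATE' "$conf"   # 1 -> tag present
```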


> iSCSI Bond in the engine configure the non-required network interfaces that
> should be connected to the iSCSI storage domain.
> You can have 10 non-required networks, but if you will configure only 2 of
> them in the iSCSI bond then the iscsiadm will only connect with those 2
> which you configured.

What would be the correct order of configuration? In order to use iSCSI Bonding, would I connect only the interface on the first network, declare the second network interface "non-required", and, instead of connecting it in the storage domain dialog, use it in an iSCSI bond?

> VDSM does monitor those networks and if all of those networks fail to
> connect with the iSCSI storage domain, then it should go to non-active state.

The standard setup without an iSCSI Bond is not monitored by VDSM as far as network connectivity is concerned? But I guess the availability of the storage device is monitored (I/O errors)?

> The system should not change the multipathd configuration, it should only
> use the connect command to connect to those network interfaces, multipathd
> should run on the host regardless of iSCSI bond.

VDSM overwrites the multipathd config file unless you tell it not to. So it does change the multipathd configuration.

> Please let me know if anything is still unclear

Sorry, I still don't understand what problem iSCSI Bonding would solve in my setup. The only thing might be the monitoring of network connectivity, but IMHO control of the failover policy is the job of multipathd.

IMHO a settings dialog for VDSM's creation of the multipath configuration file would be more useful than iSCSI Bonding.

Thank you,
Uwe

Comment 12 Maor 2017-08-10 09:10:45 UTC
(In reply to Uwe Laverenz from comment #11)
> Hello Maor,
> 
> (In reply to Maor from comment #9)
> > Are you referring to 2 network interfaces on the storage server? or 2
> > network interfaces on the host?
> 
> I use both: 2 separate network interfaces on the storage server and 2
> separate network interfaces on the host. The interfaces are connected to 2
> separate networks/VLANs, one network card per network. These networks are
> dedicated storage networks and therefor aren't routed or otherwise reachable
> from other networks.
> 
> When I connect my host to the iSCSI storage server I first connect to the
> portal on the first network and then to the portal on the second network.
> The host connects to both portals and uses both network interfaces. The host
> detects automatically that there are 2 paths for each LUN. So you already
> have "multipath" at this point, configured as active/passive. The only
> missing detail is the setting of round robin policy which can be easily be
> done via multipath.conf.
> 
> 
> > If you do not declare an iSCSI bond in oVirt, the engine will not pass VDSM
> > the non-required network interfaces to connect to it, and iscsiadm will not
> > be connected with the new network interface of the host.
> 
> With the described setup above, all network interfaces for iSCSI are
> connected already and iscsiadm uses them, there aren't any unconnected
> "non-required" interfaces left.

Can you please try this with a new DC, after the host has logged out from all the networks using iscsiadm?
It could be that those networks were still connected (BZ1476030).

> 
> > That should be done before you create the iSCSI bond in the engine, thorough
> > the iSCSI multipath in the linux host (see
> > https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/
> > html-single/DM_Multipath/index.html#mpio_configfile )
> 
> This is configured in the "defaults" section in "/etc/multipath.conf". The
> problem is, that VDSM overwrites this file unless you put a "# VDSM PRIVATE"
> into it. So you can't get "round robin" policy without keeping VDSM out.

Can you please open an RFE for that issue.

> 
> 
> > iSCSI Bond in the engine configure the non-required network interfaces that
> > should be connected to the iSCSI storage domain.
> > You can have 10 non-required networks, but if you will configure only 2 of
> > them in the iSCSI bond then the iscsiadm will only connect with those 2
> > which you configured.
> 
> What would be the correct order of configuration? In order to use iSCSI
> Bonding I would only connect the interface in the first network, declare the
> second network interface "non-required" and instead of connecting it in the
> storage domain dialog I use it in an iSCSI bond?

Basically yes, non-required networks should be declared in the iSCSI bond.
Here is an example of how to do that:
1. Add an iSCSI storage domain in maintenance mode
2. Configure an iSCSI bond with the non-required networks
3. Activate the iSCSI storage domain

> 
> > VDSM does monitor those networks and if all of those networks fail to
> > connect with the iSCSI storage domain, then it should go to non-active state.
> 
> The standard setup without iSCSI Bond is not being monitored by VDSM as far
> as network connectivity is concerned? But I guess the availability of the
> storage device ist monitored (I/O errors)?

The standard setup monitors the default network's connectivity with the storage domains; if you also want to use non-required networks for an iSCSI storage domain, you should declare those as part of the iSCSI bond.


> 
> > The system should not change the multipathd configuration, it should only
> > use the connect command to connect to those network interfaces, multipathd
> > should run on the host regardless of iSCSI bond.
> 
> VDSM overwrites the multipathd config file unless you tell him not to. So it
> does change the multipathd configuration.
> 
> > Please let me know if anything is still unclear
> 
> Sorry, I still don't understand what problem iSCSI Bonding would solve in my
> setup. The only thing might be the monitoring of the network connectivity
> but IMHO the control of failover policy is the job of multipathd.
> 
> IMHO a settings dialog for VDSM's creation of the multipath configuration
> file would be more useful than iSCSI Bonding.
> 
> Thank you,
> Uwe

Comment 13 Maor 2017-08-28 13:23:03 UTC
Hi,

Is there anything else which is still unclear?

Comment 14 Vinícius Ferrão 2017-08-28 16:24:42 UTC
Hello Maor, I received the message about the phone call, but I was traveling and just forgot to answer it. Sorry!

The points Uwe explained still persist, about the design decisions of multipath handling and how it would/should behave.

I'm not sure if Uwe contacted you for more details or not.

Comment 15 Maor 2017-08-29 10:26:52 UTC
You could say that the iSCSI bond is mainly for monitoring: it sets the storage domain as non-operational once the host cannot connect to the storage using the non-required networks.
Regarding multipath.conf, IIRC there is a comment you can add to the file so that VDSM will not overwrite your configuration.

Comment 16 Maor 2017-08-29 10:44:25 UTC
(In reply to Maor from comment #15)
> You can say that the iSCSI bond is mainly for monitoring, and for setting
> the storage domain as non-operational once the host cannot connect to the
> storage using the non-required networks.
> Regarding multipathd.conf, IIRC there is a comment you can add to the file
> so that VDSM will not overwrite your configuration.

See this comment in multipath.py on VDSM:
# The second line of multipath.conf may contain PRIVATE_TAG. This means
# vdsm-tool should never change the conf file even when using the --force flag.

IIUC, that means that if you add this private tag (# VDSM PRIVATE) at the top of your configuration, VDSM will not overwrite it.
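For illustration, a multipath.conf protected this way would start like the following. The revision line and the option values here are only examples; the `# VDSM PRIVATE` tag on the second line is what matters, per the vdsm comment quoted above:

```
# VDSM REVISION 1.3
# VDSM PRIVATE

defaults {
    polling_interval    5
    no_path_retry       4
}
```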

Comment 17 Maor 2017-09-03 13:04:44 UTC
It seems that the configuration of the iSCSI bond in the engine was resolved. There are still unclear issues from the user's point of view, but those can be discussed on the users mailing list or in a phone meeting.

Closing the bug for now; let's move the discussion of the other unclear issues back to the mailing list.

Comment 18 Vinícius Ferrão 2020-02-15 17:17:49 UTC
Maor, I'm once again asking you to reconsider this. It's now almost 1000 days since this bug was opened, and nothing has changed.

I'm commenting here today because, since this was reported, a lot of people keep asking the same things, at least of me:

Why is iSCSI multipath broken in oVirt?
How do I enable multipath in oVirt?
My storage domain keeps going down when I enable iSCSI multipath.
Don't use iSCSI in oVirt; use NFS with LACP, since iSCSI is broken.
iSCSI does not work at all in oVirt.

The same question came to me today, and since it's Saturday I have a little spare time to post here again.

All those questions keep appearing in a Brazilian Telegram group about oVirt that I help maintain. Today one member said he doesn't understand how iSCSI Multipath works, because he did everything right and the storage domain keeps going down. He doesn't even consider that something may be wrong in oVirt: since oVirt is Red Hat sponsored, they must have gotten it right, not us.

But this is definitely not the case here.

So please, if you are not willing to change the behaviour of the iSCSI Multipath tab, which is fine and we can understand, please remove it completely or rename it, because at best it is misleading.

I've done my homework, gone back to the books and the basics to really try to understand what happened here and how things got to this point. We have a community that does not understand the purpose of the iSCSI Multipath option inside oVirt. It's common sense that the iSCSI tab would enable things like load balancing and, as you said, monitoring of link health. But that's not the case, since the same issue happens: unreachable paths, because physically separated paths try to talk to each other. This would never happen in a properly segregated multipath scheme.

Wikipedia, which is the most basic thing we can get on the web, has a diagram of a Multipath I/O topology: https://en.wikipedia.org/wiki/Multipath_I/O

Just take a look at FC or SAS multipath: they are the same concept as iSCSI multipath, just in another context with different protocols, and they behave the same way: separate paths that work together in a failover and/or load-balancing scenario.

I've been running the basic iSCSI setup, with nothing configured in the iSCSI multipath tab of the Engine, and it's fine. It fails over correctly, but it does not support any load balancing. Some things went out of control, like redundant failover connections to the same target, where there should be only two, but that is another bug; I'm just showing it here for the sake of completeness:

[root@ovirt3 ~]# multipath -ll
3600605b00805e0401c4c11ae0aff0af8 dm-0 IBM     ,ServeRAID M1115 
size=278G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  `- 0:2:0:0  sda 8:0   active ready running
36589cfc0000006f6c96763988802912b dm-18 FreeNAS ,iSCSI Disk      
size=10T features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=1 status=active
| `- 12:0:0:1 sdn 8:208 active ready running
|-+- policy='service-time 0' prio=1 status=enabled
| `- 13:0:0:1 sdo 8:224 active ready running
|-+- policy='service-time 0' prio=1 status=enabled
| `- 11:0:0:1 sdk 8:160 active ready running
|-+- policy='service-time 0' prio=1 status=enabled
| `- 7:0:0:1  sdf 8:80  active ready running
|-+- policy='service-time 0' prio=1 status=enabled
| `- 9:0:0:1  sdg 8:96  active ready running
|-+- policy='service-time 0' prio=1 status=enabled
| `- 10:0:0:1 sdh 8:112 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  `- 8:0:0:1  sdi 8:128 active ready running
36589cfc000000a1b985d3908c07e41ad dm-17 FreeNAS ,iSCSI Disk      
size=200G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=1 status=active
| `- 12:0:0:0 sdl 8:176 active ready running
|-+- policy='service-time 0' prio=1 status=enabled
| `- 13:0:0:0 sdm 8:192 active ready running
|-+- policy='service-time 0' prio=1 status=enabled
| `- 11:0:0:0 sdj 8:144 active ready running
|-+- policy='service-time 0' prio=1 status=enabled
| `- 7:0:0:0  sdb 8:16  active ready running
|-+- policy='service-time 0' prio=1 status=enabled
| `- 9:0:0:0  sdc 8:32  active ready running
|-+- policy='service-time 0' prio=1 status=enabled
| `- 8:0:0:0  sde 8:64  active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  `- 10:0:0:0 sdd 8:48  active ready running
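Note that in the output above, each path sits in its own priority group with policy='service-time 0', i.e. pure failover. For comparison, grouping all paths of these FreeNAS LUNs into a single round-robin group would take a multipath.conf device section roughly like this (a sketch using standard multipath options, with vendor/product taken from the output above; not a configuration verified against this SAN):

```
devices {
    device {
        vendor                "FreeNAS"
        product               "iSCSI Disk"
        path_grouping_policy  "multibus"
        path_selector         "round-robin 0"
    }
}
```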

Thanks for listening again, and please reconsider this.

PS: I'm also answering Tal Nisan's needinfo request; sorry for the extremely long delay, but I gave up on this issue in 2017.

Comment 19 Ai90iV 2020-08-20 20:21:11 UTC
Dear Vinícius Ferrão,

I have the same situation: redundant failover connections to the same target, not matching the expected count, same as you. With iscsiadm -m session -P3 I figured out that I have connections to the same target from two physical NICs, which I clearly identified from the Iface Name output, plus additional connections to the same target originating from Iface Name: default.

Based on the output you provided, I clearly see that you have sessions to the same target from the unknown interface default and from interface eno4.12. From Iface IPaddress: 192.168.12.3 it is clearly visible that both sessions belong to interface eno4.12.

Target: iqn.2017-07.br.ufrj.if.cc.storage.ctl:ovirt-he (non-flash)
	Current Portal: 192.168.12.14:3260,1
	Persistent Portal: 192.168.12.14:3260,1
		**********
		Interface:
		**********
		Iface Name: default
		Iface Transport: tcp
		Iface Initiatorname: iqn.1994-05.com.redhat:89c40b169bd9
		Iface IPaddress: 192.168.12.3

	Persistent Portal: 192.168.12.14:3260,1
		**********
		Interface:
		**********
		Iface Name: eno4.12
		Iface Transport: tcp
		Iface Initiatorname: iqn.1994-05.com.redhat:89c40b169bd9
		Iface IPaddress: 192.168.12.3

		*********

To be honest I do not understand why the interface default appears in the communication chain at all; maybe the oVirt or Red Hat team can explain the purpose of this default interface. Can we simply kill the sessions originating from Iface Name: default without impact on oVirt and multipath?

I also noticed interesting behaviour: if I shut down the primary NIC, the session with Iface Name: default jumps over to the secondary NIC.

 Thank you in advance!

Comment 20 Vinícius Ferrão 2021-03-09 16:35:13 UTC
Ai90iV, I have already lost hope on this issue. Now, with the news that RHV will be decommissioned in favor of OpenShift Virtualization, this ticket will definitely never see the light of day.

For the record, the issue still persists on oVirt/RHV 4.4.4; iSCSI is simply not handled correctly by oVirt/RHV.

Comment 21 Jeroen 2021-06-10 15:04:27 UTC
Like Vinícius Ferrão and the rest of the world, I also ran into big trouble with this, but it seems I've found a workaround that at least makes it work for now, and that may also inspire a bug fix.
(The story is a little long, but the "fix"/method/workaround is small. To understand the implications, please read on; it's worth the effort, and it's not too technical.)

To be short: Vinícius is absolutely right, and all SAN network engineers working with multipath based on iSCSI (IP/layer 3) + layer-3 round-robin + Ethernet know it.
It's also kernel+routing "logic" at the initiator and target level: try things like "all in the same subnet", and/or, in oVirt's case, all in one iSCSI "bond", and you run into big trouble now or some day.

I built and used XenServer CE 6.5 and 7.0 for many years, with exactly the same paradigms, and they worked like a charm.
There, too, it was all iSCSI at the OS level plus multipath+round-robin at the OS level, and, like oVirt, it had an abstract "Enable Multipath" button in the high-level cluster management interface.
That button monitors the traffic at the Xen level, so that only if no paths are left anymore do the storage domains get a handled failure state.
If some, but not all, iSCSI channels were down, an admin could just see that they were, with no harm to anything else.
Retry/reconnect, etc., was all done at the OS level, perhaps with Xen triggering it, but Xen didn't tamper with multipath settings/algorithms like round-robin from the multipath daemon.

oVirt, to my surprise, still works the same way, while the Red Hat enterprise backing, the year 2021, and the beauty of its interface and possibilities mislead in a lot of ways, given a lurking fatal bug and a badly named iSCSI-bond "feature".

A lot of the symptoms have already been written about; see everything above, and I will not repeat them.

Inspired by src<>dest IP+NIC behaviour and Xen's "enable multipath but do nothing else than monitoring", I decided to let the IPs live in their own subnets, forcing the kernels to connect src to dest as they should.
I also needed to create these oVirt DC iSCSI bonds, but I mapped the logical networks 1-to-1 to the target IP that belongs to each subnet.
(To be clear: my setup consists of 2 oVirt 4.4.6 nodes with 4 SAN NICs per host, a standalone engine on CentOS 8 Stream, two SAN switches that are not interconnected, and one central Synology SAN with 4 NIC interfaces. Of the LUNs on it I connected two in the DC as storage domains, for research, when I encountered (to my horror) these bug problems.)
The iSCSI "bonds" must be there, it seems, because like in Xen they make the DC multipath-aware: the DC knows when it needs to trigger all kinds of OS-level mechanisms to handle iSCSI connections, *and* in the end to conclude whether some storage domain should still be kept alive cluster-wide by administration via the Engine.

But somehow, if you've performed all these setups, and a lot of people have already tried, when iSCSI goes berserk after some failure, your hosts run into trouble.
That is: when you create separate iSCSI bonds per Ethernet channel + IP subnet, 1-to-1, that layer works, and it also seems to force dispatching source IP packets from the correct interface (it also seems to do that when you define all your NIC IPs in the same subnet; I tested that too, but you shouldn't do it, for a dozen reasons). But still, when you disconnect channels on a host or on the SAN, trouble begins.
A nice detail of "trouble begins" is that when you ARE running in a healthy situation (all hosts Active, SAN store running, all IP/iSCSI channels up, and multipath -ll showing the correct number of connections AND round-robin), and you then disconnect something, everything still behaves well: storage domains keep running, hosts stay active, and at the OS level multipath nicely shows a few paths in faulty state, exactly the ones that should be. All the other paths to the same iSCSI target are still there, and they actually work.
Test some VM, create a new vdisk: all works, no problem.
Also, reconnecting things at any point: iSCSI reconnects, and multipath shows everything running.
Oh wow, just like 8-year-old Xen 6.5.
But, and that's the horror, when you create a situation where there are NO initial iSCSI channels running, that's where oVirt collapses.
Like putting a host in maintenance and re-activating it, or rebooting it.
You get into an extremely long-running try-to-activate loop for the hosts, and in the end everything fails and retries forever; with all the hosts down, the cluster goes down, so your stores look administratively down. Yes, you get some greens in between in the loop, but they are fake: technically the paths are there, OS-level ping of the still-running targets works, etc., and you get lost.
I've been testing these things for three full stressful days now, and I almost went mad.
But at least I've boiled it down to this point: "everything can be healthy, good news, but when connections are lost AND you reboot/re-activate etc., only then do you get these symptoms".
And in case of a power failure and power-up with one switch broken, you can go to the office and stress: try to fix the switch, or, between the horrifically long delays, try to demolish the iSCSI channel registrations on the storage domains, which seems almost impossible in this state of looping with blocks.

So, what happens when that happens? I forced my cluster into this loop and checked the Linux processes with ps axjf, and discovered that the vdsm main process (written in Python) invokes the OS-level iscsiadm, which tries and retries the lost IP targets one by one, sequentially iterating over the oVirt-administered targets.
OK, that's where the long, long loop delays come from. And indeed, when this connect attempt reaches a target that is reachable, it comes up as an iSCSI channel and is automatically bound into multipath.
BUT in the end this one (or more) successfully established channel isn't enough for oVirt to say "hey, it's OK" and enable the storage domains. No, it just evaluates the entire loop (at least it looks like that) and says "no": it thinks there are no iSCSI targets alive and smashes the host down again, only to start all over a few moments later.

I traveled through the Python infrastructure and found the function that handles this OS-level iscsiadm connect case. The file it lives in is called, of course, iscsiadm.py.
In there, a caller passes parameters such as the registered target IPs, and then it try/excepts the connect.
If the OS call (iscsiadm) returns an error state, the function catches this exception and in turn raises its own exceptions.
That's it. No intelligence like "first check by ping or TCP that the remote IP is there before trying", which would speed things up enormously; and the calling infrastructure doesn't seem to distinguish the different failure states either. So if, for instance, a login call fails, at least record which ones succeeded, so you *know* that iSCSI *can* be used and oVirt can use the storage domains, instead of concluding "error" and throwing the host down to retry the same stupid method.
(OK, you also need to record that at least one of the channels *failed*, so in the background you need to keep trying to reconnect, while not blocking/downing oVirt, hosts, or storage domains.)
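The pre-check suggested here ("check first by ping or TCP") could be sketched as a small helper. This is my own illustration, not vdsm code; a TCP check on the portal's own port also avoids the caveat of targets that firewall ICMP:

```python
import socket

def portal_reachable(portal, timeout=3.0):
    """Return True if a TCP connection to the iSCSI portal, given as a
    "host:port" string such as "192.168.12.14:3260", succeeds within
    `timeout` seconds; False otherwise."""
    host, _, port = portal.rpartition(":")
    try:
        # create_connection() performs the full TCP handshake; the socket
        # is closed immediately, and no iSCSI traffic is ever sent.
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except (OSError, ValueError):
        return False
```

A login loop could then skip portals whose pre-check fails and retry them in the background, instead of blocking host activation for the full iscsiadm timeout per dead target.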

OK, here is what I tried:
In this Python function I just commented everything out and replaced it with a "pass", letting the calling infrastructure do its job; the function becomes more of a stub.
Then, see what happens if I activate a host while in maintenance, so all iSCSI is lost.
Well, activation fails, but after a shorter time, and of course it fails, because it doesn't log in to the iSCSI targets. So it is indeed impossible, but oVirt didn't go berserk; it just gently failed.
OK, then from an SSH shell I activated the iSCSI channels (which I now knew were responding), by entering the entire command as fetched from ps axjf in the old situation, replacing the parameters of course; -I should be "default" instead of forcing the source iface, but that's no problem, because of the isolated subnets.
OK, iscsiadm -m session shows they are there, and multipath -ll shows they are picked up and grouped round-robin. And the number of sessions is correct, no stale or duplicate sessions.
(Again: also check your iSCSI bonds, where you should force the 1-to-1 mapping.)

After that, activate the host again, and indeed it skips the login Python function very fast (it is called but does nothing), and then it naturally sees the LUNs are there, activates the host, and it runs. Wow, party time!

Well, what's the result? With some manual intervention you can start a cluster that failed remotely due to power issues or whatever, and once everything runs, it keeps multipathing and round-robin; the bandwidth aggregation plus path-failure redundancy seems to work as it should. As long as there are no re-activation or reboot issues, you seem to be covered.
It's not nice. It's very amateurish, and not RHEL-worthy, but perhaps they focus on Fibre Channel and single-channel 10 Gbps only; perhaps that's very "enterprise", although 99% of the companies using oVirt will be smaller than the FC boys.

My point is: perhaps the developers can be inspired by this, recognize the correct multipath iSCSI paradigms, and translate them into the vdsm Python management infrastructure. Again, XenServer used exactly the same philosophy, but it didn't have these misleading iSCSI-bond rules (only an enable button); better still, it also didn't try to reconnect iSCSI and evaluate the result in the overly simplistic way oVirt does. There it just worked, for many years.
I used it in a multinational with great success.
Also, this one login function in iscsiadm.py is only called when oVirt needs to connect and check; it's not used for rescanning etc., so it seems to be called only when the system boots or a host (re)activates.
After that it isn't needed anymore, because OS-level iscsiadm and multipath handle disconnects/reconnects and the round-robin grouping, and oVirt indeed doesn't seem to call this connect/login function afterwards, because even for oVirt it makes no sense then(?).
So this call should be triggered more cleverly, and when triggered, it should be cleverer about pre-checking "is the target IP responsive?" before connecting; and somehow oVirt should treat the targets that DO connect as a reason to bring things up, rather than throwing everything down.
Also, IF not all targets connected successfully, retry in the background until they are all there. At the OS level this already works; now it's needed at the oVirt Python level.

I guess I'll use the .py function stubbing myself as a "solution", perhaps with a boot-up script that connects iSCSI, so that after a power failure at night the hosts still come up.
The rest already works, *with* the rule that the dreaded iSCSI bond is mapped 1-to-1 per channel.
This doesn't survive updates, I guess, so every time the .py file gets overwritten you need to edit it again, as long as it's still not fixed upstream.
The same perhaps goes for my own boot-up script.

Besides the "fix advice" described, the oVirt team should also re-evaluate this iSCSI-bond "thing".
EITHER make it take over multipathing completely (with algorithm selection such as round-robin), so the "bond" is more what it looks like, OR go the Xen way and only let it monitor all your (already present by nature) multipaths, with just an "enable MP" button.
As also suggested by others.

The bug has two faces: the misleading iSCSI bond, which you seem to be able to work around with the 1-to-1 mapping, emulating Xen behaviour; and the iSCSI login/connect issues, which are rather clear to see, as is how oVirt reacts to them. Adapt it to your needs and indeed everything works. (I need to test further, but it looks good, and you know when a puzzle seems complete.)
When either of these faces errors, always the same thing happens, caused by face 2 (iSCSI) combined with what oVirt concludes from the face-2 results. What happens, of course: endless, horrifically long delayed re-activations of hosts, and in the end nothing but administratively lost storage domains and hosts, very hard to repair or bring up via the oVirt high-level interface.

I hope this helped a bit, so that my days of stress were not completely useless.
Enjoy; oVirt seems very, very great, think about it, but this multipath iSCSI layer has been buggy for way too long, and yet it also seems they are almost there.
With a bit of shame compared to Xen CE, they could perhaps make a sprint and fix at least these iSCSI connect/login things.
Then at least it works, or seems to work.
IMHO.

I'm no oVirt developer and not related to the oVirt project; sorry, but I came across this, like a lot of others, when last Monday the stress level rose to boiling point. I promised the boss to build a fantastic RHEL-based cluster, having worked for years with Xen CE until it disappeared behind bars. oVirt looked so damn good: QoS at all layers, KVM-based hypervisors, a REST API so all kinds of software, backups, or your own tooling can steer it; man, it looked professional. And at home, with a single-node local-storage DC, it has already been running for some time with a great experience.
But what a bummer to encounter this Jurassic, truly unbelievable bug that cripples everything to a halt. Bad advertisement for RHV as well.

I hope the developers fix it in no time, so perhaps 2021 can be the year of oVirt's production-ready iSCSI MP cluster!
Thank you, guys!

-Jeroen

@Maor: What do you think about this? Kind regards.

Comment 22 Jeroen Keet 2021-06-17 10:35:37 UTC
O.K., "closed, because not a bug". Well, I've fixed it at its minimum (but it works great), and indeed it is/was a bug, by just adding a small piece of Python code.
It pings the iterated target beforehand and, if that succeeds, performs the iscsiadm login to that target, which then succeeds.
(The path automatically comes up again and gets grouped into the multipath -ll round-robin.)
Also, the initial build-up of iSCSI-based storage domains is fluid now.
The code addition only uses oVirt's internal Python modules and some calls from oVirt's own modules.
So no strange actions, just calling OS commands like the entire oVirt framework seems to do all the time. And it fits transparently into oVirt, at this moment/version, that is.

Before this fix, oVirt was completely useless regarding iSCSI in combination with multipathing and its round-robin.
That led to a heap of problems. For instance, when some iSCSI path failed and a host needed re-activation or booted up while the path was failing, the host simply kept trying to activate; after a very long time it failed, again and again, even though there were iSCSI paths available and connectable.

Still, oVirt's orchestration framework isn't entirely correct regarding iSCSI.
It works with this fix, but you also need the iSCSI bonding, with this rule:
make a *separate* bond, mapping each logical network subnet 1-to-1 to the destination iSCSI target IP.
Repeat this for every iSCSI connection (every connection in its own IP subnet, which is the only good setup and seems usable now, with this fix).
So these in fact aren't bonds at all, and the name doesn't make sense; multipath -ll still shows the real "bonding".
It only makes the oVirt orchestration multipath-aware, just like XenServer always did.

Also make sure that your SAN storage (iSCSI target) is configured the correct way per oVirt-node host, so it indeed performs round-robin, and use the "parent config options override" method to keep your own SAN storage settings safe from tampering by oVirt orchestration/vdsm.

Still, in my tests, after I destroyed a target LUN, things got worse when, for instance, you create an extra new LUN and want to instantiate a storage domain from it using the GUI, discovery wizard, etc.
Long story short, the login fix just works, but other oVirt parts got trashed in the end: the new LUN was not discovered, the stale one was not disposed of, and vdsmd (systemctl status vdsmd) reported udev return errors.
Both my hosts/nodes got trashed this way, and while vdsmd was running, or even after a restart of vdsmd, they couldn't activate, for causes other than what this bug is about. And so the oVirt admin thinks (again) that hosts and SDs are down.
VMs kept running.
The former iSCSI connections were up, and multipath worked.
After a VM shutdown (from within the VM) and reboots of the hosts/nodes, everything was fine again.
Stale entries disappeared, and new LUNs could be connected as before.
All iSCSI channels were up again, including multipathing.

I expect that in practice, with the iscsiadm login Python script fix, things work well.
But during these extreme tests, well, it shows that oVirt still isn't the king of iSCSI, and that's at least surprising, IMHO.

Also remember that when you apply this fix, it can be overwritten by some update, until someone on the oVirt team applies a fix upstream, mine or another.


I use the oVirt standalone Engine and oVirt Node 4.4.6, and in my case the file:

/lib/python3.6/site-packages/vdsm/storage/iscsiadm.py    (on an oVirt Node host)

contains the iSCSI login function that's called when a connection needs to be made. It is not called when, for instance, a reconnect should be made after a channel fails and comes back while everything else is still running (in multipath -ll this just shows "inactive/failed" per path, and the path reconnects again when it comes back up, but that reconnect has nothing to do with oVirt calling this Python function).

I just added code, more wasn't needed here, to at least fix at its root this years-old dreaded cause of useless disaster. (Use the correct number of space indentations where needed. It's Python, remember, or it crashes in the background and no effect, or worse, can be seen.)

The function definition is now:

def node_login(iface, portal, targetName):
    # JCK - ping pre-test Q&D fix.
    #       If the ping fails, just skip the login.
    #       It sends 3 pings, evaluates the overall result, and waits no
    #       longer than 3 seconds for a response (OK or unreachable, etc.)
    #       per fired ping.
    #       Remark: The target iSCSI storage shouldn't firewall-filter ICMP
    #               ping. It's IPv4.
    #               After implementing this, place the host in Maintenance,
    #               then stop oVirt's vdsmd via systemctl.
    #               iscsiadm -m session should be empty then.
    #               Now start vdsmd again, and re-Activate the host.
    # We need try/except because commands.run() returns a bytes object with
    # all stdout in it (when the OS command itself exits with 0, i.e. OK),
    # and nothing else; a non-zero exit raises cmdutils.Error instead.
    try:
        commands.run(["/usr/bin/ping", "-c", "3", "-W", "3",
                      str(portal).split(":")[0]], sudo=False)
        retval = 0
    except cmdutils.Error as e:
        retval = e.rc
    if retval == 0:
        try:
            run_cmd(["-m", "node", "-T", targetName, "-I", iface,
                     "-p", portal, "-l"])
        except cmdutils.Error as e:
            if not iface_exists(iface):
                raise IscsiInterfaceDoesNotExistError(iface)

            if e.rc == ISCSI_ERR_LOGIN_AUTH_FAILED:
                raise IscsiAuthenticationError(e.rc, e.out, e.err)

            raise IscsiNodeError(e.rc, e.out, e.err)
    else:
        pass

As you can see, in the end it's rather easy. Only the ping part and the retval evaluation were added by me.

After that, you can in fact place hosts in Maintenance and stop/start vdsmd via systemctl.
Activate the hosts.
(The .py script is part of vdsmd, it seems. If you don't restart it, code changes do not take effect.)
OK, test the things you want: disconnect iSCSI channels, reboot hosts, etc. The hosts should NOT fail if only a subset of the channels fails, and it also shouldn't take ages to complete, but in fact only 3+ seconds per iSCSI channel iteration.
Also, when checking your iSCSI and multipath connections, there is only one session per target IP, and no crosswise Cartesian-mesh-like src<>dest connections. (IPs in separate subnets (and VLANs if you like), remember?)

I hope someone can benefit from this fix. I do.

-Regards, Jeroen

Comment 23 Benny Zlotnik 2021-06-17 13:33:38 UTC
(In reply to Jeroen Keet from comment #22)
> O.K. "closed, because not a bug". Well, I've fixed it at it's minimum (but
> it works great), (and indeed it is/was a bug,) by just a small adding of
> python code.
> It pings the iterated target beforehand, and if successful, it iscsiadm
> logins to that target and succeeds.
> (It automatically again emerges and gets grouped in multipath -ll
> roundrobin.)
> Also, the initial buildup of iscsi based Storage Domains are liquid now.
> The code addition just uses oVirt's internal python modules, and some call's
> from oVirt's own modules.
> So no strange actions, and indeed just calling o.s. commands like the entire
> oVirt frameworks seems
> to do all the time. And it fits transparently in oVirt, at this
> moment/version that is.
> 
> Before this fix oVirt was completely useless regarding iscsi in combination
> with multipathing and it's roundrobin.
> It led to a chuck of problems, f.i. when some iscsi path failed, and a host
> needed re-activation or
> booted up while the path fails, the host simply kept trying to activate, and
> after a very long time it
> failed, again and again, while there just are iscsi paths available, and
> connectable.
> 
> Still oVirt's orchestration framework isn't extremely correct regarding
> iscsi.
> It works, with this fix, but you also need the iscsi bonding, but with this
> rule:
> Make a *separate* bond, 1-to-1 mapped logical network subnet 1 -> dest. ip.
> iscsi target.
> Repeat this for every iscsi connection (every connection in it's own ip
> subnet, as IS the only good, and seems usable now, with this fix)
> So this in fact aren't bonds anyway, and it doesn't make sense anyway.
> Multipath -ll stil shows the real "bonding".
> It only makes the oVirt orchestration multipath aware, just like XenServer
> always did.
> 
> Also make sure that your SAN storage (iscsi target) is configured the
> correct way per oVirt-node host,
> so it indeed performs roundrobin, and use the "parent config options
> override" method to keep your own
> SAN storage settings safe from tampering by oVirt orchestration/vdsm.
> 
> Still in my tests when some target-LUN was destroyed by me, things got worse
> when, f.i., you create
> an extra new LUN, and wanted to instantiate a SD from it, by using the gui,
> discovery wizard, etc.
> Long story short, the logins fix just works, but other oVirt parts got
> trashed in the end, while
> not discovering new Lun, not disposing the stale, and vdsmd (systemctl
> status vdsmd) reporting
> udev return errors.
> Both my host/nodes got trashed this way, and while vdsmd was running, or
> even a restart of vdsmd,
> they couldn't activate, by other causes than what this bug was all about.
> And so oVirt admin thinks (again),
> Hosts and SD's are down.
> VM's kept running.
> Former isci connects where up, and multipath works.
> After a VM shutdown (from within VM), and reboots of the hosts/nodes, all
> was very OK.
> Stales disappeared, and new LUN's could be connected as it used to.
> All iscsi channels where up again, incl. multipathing.
> 
> I expect that practically with the iscsiadm login python script fix, things
> work well.
> But during these extreme tests, well, it shows that oVirt still isn't the
> king of iscsi,
> and that's at least surprising i.m.h.o.
> 
> Also remind that when you apply this fix, it can be overwritten by some
> update, until
> someone of oVirt team applies fixes upstream. Mine, or other.
> 
> 
> I use oVirt std.alone Engine and oVirt node 4.4.6 , and in my case the file:
> 
> /lib/python3.6/site-packages/vdsm/storage/iscsiadm.py    (on a ovirt-node
> host)
> 
> contains the iscsi login function, that's called when a connect needed to be
> made, so not
> when f.i. a re-connect should be made if some channel fails, and gets back,
> when all other things still
> runs (in multipath -ll this just shows "inactive/failed" per path, and
> reconnects again when path gets up again,
> but this reconnect has nothing to do with ovirt call's of this python
> function.)
> 
> I just added code, more wasn't needed here, to at least fix in it's basic
> this years-old dreaded cause
> of useless disaster. (Use correct nr. of space indentions where needed. It's
> python, remember that, or it crashes
> in background, and no effect, or worse, can be seen.)
> 
> The function definition is now:
> 
> def node_login(iface, portal, targetName):
>     # JCK - ping pre-test Q&D fix.
>     #       If the ping fails, just skip the login.
>     #       It sends 3 pings, evaluates the result of the last ping, and
>     #       waits no longer than 3 seconds for a response (OK, unreachable,
>     #       etc.) per fired ping.
>     #       Remark: The target iSCSI storage shouldn't firewall-filter
>     #               ICMP ping. It's IPv4.
>     #               After implementing this, place the host in Maintenance,
>     #               then stop oVirt's vdsmd via systemctl.
>     #               "iscsiadm -m session" should be empty then.
>     #               Now start vdsmd again, and re-Activate the host.
>     # We need try/except, because commands.run() returns a bytes object
>     # with all stdout in it (when the O.S. command itself exits 0, i.e.
>     # OK) and nothing else.
>     try:
>         commands.run(["/usr/bin/ping", "-c", "3", "-W", "3",
>                       str(portal).split(":")[0]], sudo=False)
>         retval = 0
>     except cmdutils.Error as e:
>         retval = e.rc
>     if retval == 0:
>         try:
>             run_cmd(["-m", "node", "-T", targetName, "-I", iface,
>                      "-p", portal, "-l"])
>         except cmdutils.Error as e:
>             if not iface_exists(iface):
>                 raise IscsiInterfaceDoesNotExistError(iface)
> 
>             if e.rc == ISCSI_ERR_LOGIN_AUTH_FAILED:
>                 raise IscsiAuthenticationError(e.rc, e.out, e.err)
> 
>             raise IscsiNodeError(e.rc, e.out, e.err)
>     else:
>         pass
> 
> As you can see, in the end it's rather easy. Only the ping part and the
> retval evaluation have been added by me.
> 
> After that you can in fact place hosts in Maintenance, stop/start vdsmd
> via systemctl, and activate the hosts.
> (The .py script seems to be part of vdsmd; if you don't restart it, code
> changes take no effect.)
> Then test whatever you want: disconnect iSCSI channels, reboot hosts, etc.
> The hosts should NOT fail if only some of the channels fail, and it also
> shouldn't take ages to complete; in fact it only takes 3+ seconds per
> iSCSI channel iteration.
> Also, when checking your iSCSI and multipath connections, there is only
> one session per target IP, and no crosswise Cartesian-mesh-like
> source-to-destination connects (the IPs are in separate subnets, and
> VLANs if you like, remember?).
> 
> I hope someone can benefit from this fix. I do.
> 
> -Regards, Jeroen
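One fragile spot in the patch quoted above: `str(portal).split(":")[0]` drops everything after the first colon, which works for `ip:port` IPv4 portals but would mangle a bracketed IPv6 portal such as `[fd00::1]:3260`. A minimal sketch of more defensive parsing; the `portal_host` helper is hypothetical and not part of vdsm:

```python
def portal_host(portal, default_port=3260):
    """Split an iscsiadm portal string into (host, port).

    Handles "10.0.0.1:3260", bare "10.0.0.1", bracketed IPv6 portals
    like "[fd00::1]:3260", and unbracketed IPv6 literals (treated as
    host-only, since the colons are ambiguous there).
    """
    portal = str(portal)
    if portal.startswith("["):                    # bracketed IPv6 literal
        host, _, rest = portal[1:].partition("]")
        port = rest.lstrip(":") or default_port
        return host, int(port)
    host, sep, port = portal.rpartition(":")
    if sep and port.isdigit() and ":" not in host:
        return host, int(port)                    # host:port form
    return portal, default_port                   # no (usable) port present
```

A `node_login` variant could then ping (or otherwise probe) `portal_host(portal)[0]` instead of the raw `split(":")[0]` result.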

Hi Jeroen, 
I appreciate you taking the time to look into this and propose a fix. oVirt is an open-source community project, so we are more than happy to get contributions from the community.
I suggest you submit a patch to vdsm[1] and send an email to devel@ovirt.org, so the proposed solution reaches a broader audience than just the people CC'd on this bug and gets input and reviews.



[1] https://www.ovirt.org/develop/dev-process/working-with-gerrit.html

Comment 24 Jeroen Keet 2021-06-22 18:09:56 UTC
(In reply to Benny Zlotnik from comment #23)
> (Comment #23, which itself quotes comment #22 in full, is quoted here; see above.)

Hello Benny,
Thank you for the tips and feedback.
I'll try to find some time in the coming days to propose a patch to the vdsm branch via Gerrit.
I'm not a community developer and not involved with oVirt, but I'll certainly give it a try.
All the more because I have been running my humble patch for a few days now, and it still works like a charm.

Regards,
Jeroen

Comment 25 Vinícius Ferrão 2021-06-22 18:15:51 UTC
Hi Jeroen, what you've made is just awesome. Congratulations.

I really wanted to test the patch, but unfortunately I don't have a cluster setup to test it right now.

There's one point that I wasn't able to understand: was it fixed or not? At the start of the discussion there was an issue where VDSM tries to connect to unreachable paths, since it issues an ICMP request from a path that does not communicate with the other one, breaking basic MPIO premises.

How is this handled in the patch, or isn't it touched at all? I see that round-robin is enabled, but I can't see how VDSM will not crash everything when an iSCSI bond (misleading name) is defined on the web interface with multiple, properly segregated iSCSI paths.

Thank you again for taking the time to patch and propose a fix for this 4-year-old Bugzilla!

Comment 26 Jeroen Keet 2021-06-22 19:18:52 UTC
(In reply to Vinícius Ferrão from comment #25)
> (Comment #25 is quoted here in full; see above.)

To be honest, I took a different approach, because it seemed rather clear that
the discussion had somehow gone down the wrong track.
I worked for many years with my XenServer CE clusters, and the learning the
hard way already "took place" there: learning the "philosophy" of its MP enable option/button,
and from there letting the OS iSCSI and multipathd layers take care of the rest.
(Assuming you also understand the multipath config file, so your iSCSI SAN actually
works. I created a config for FreeNAS and a bigger Synology at the time, and it works. Round-robin.)

With that in mind I was very stressed to discover "last-minute" that at least this
type of configuration also didn't work on oVirt.
On top of that, there is this iSCSI bonding that promised the world but in fact was a riddle
as to what it does under the hood (and in fact it still is a riddle).

So I forgot about all the discussions and their details, because no suggestion ever worked; and
when one did, it forced you to create iSCSI subnets that, of course, lead to mesh-like iSCSI connections/sessions,
and you also cannot isolate by, for instance, VLAN.
You already know everything about this, so I won't explain ;-)  And I agreed with you 100%.

I was a little lost, so I took the reverse approach of just looking at which processes
were running at the moment when, for instance, the activation of hypervisors dramatically failed in an endless, slow loop;
a situation that occurs when, for example, an MP path fails and you reboot the hypervisor, or just place it in maintenance and reactivate it,
and in a LOT of other combinations of causes and results.
And I was lucky: it was *very* clear that Python calls took place from the vdsmd daemon which in
turn just invoke shell commands / OS calls; in this case iscsiadm.
It was clear that it was just *logging in* when this happens.
And these login calls only occur at certain moments, namely when some earlier process cleared the
sessions, for instance when placing the hypervisor/node/host in maintenance.
In most other situations oVirt/vdsmd just seems to rely on what "is there", like
Xen did. So multipathd & friends do their job as they should, and no virt or
vdsm layer seems to interfere.

So I traveled down to the Python function definition where this took place, and indeed it was just this small
function; too little, imho.
The problem is that when it tries to log in, it iterates through all iscsiadm "nodes" that were registered
during discovery and SD instantiation.
And it just does that; so when some target isn't there, it fails after a huge timeout, and it takes time until
the whole iteration is done. All these kinds of failures, "normal" iSCSI failures but also "target not there" timeouts,
raise error exceptions, and the framework above handles them simply as "node doesn't activate", while
in reality some iSCSI sessions *can* log in. After that failure it removes all sessions, clears everything, and tries
the entire loop again after some time. You know the issue, I believe.
The function doesn't return anything else. If everything succeeds, it just doesn't raise an exception. So OK states
aren't handled, and a seemingly "false" OK does no harm. In this case a ping that exits with a failure, because the
IP doesn't respond, can simply be skipped: the function ends without effect and without raising,
and the caller iterates to the next target IP to try.
In the end, for instance, one session may connect, the host gets activated and works, multipath shows good things,
and from then on all is well.
(If an iSCSI session login fails because of true errors, other than "does not respond", the function still
raises the failure as intended by the oVirt framework, and one can assume it gets handled as before.)

1. If all session logins fail, I saw that the node still doesn't get activated, by other means, and
    it does so rather quickly; no minute-long timeouts. I think it perhaps also checks whether SDs failed to "emerge", among other things.
2. If iSCSI target IPs/paths come up again afterwards, i.e. *after* the login moment (e.g. reboot or maintenance/activate),
    they do not appear by themselves. No login call takes place, but in fact that's good, because we want multipathd and iscsiadm
    to take care of it, not oVirt (for now, because oVirt fails at these things, as we've seen).
    In that case you can add them manually by replaying the command that is normally called from that function, using
    parameters straight from iscsiadm (e.g. "iscsiadm -m node" to list its registered nodes);
    OR by rebooting or re-activating the node/host, if you can.
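The manual replay described in point 2 could be scripted. A sketch, assuming `iscsiadm -m node` output lines of the form `portal,tpgt target` and the same login command quoted earlier in this bug; the helper names are hypothetical, and the login call is only a best-effort replay:

```python
import subprocess

def parse_node_records(output):
    """Parse `iscsiadm -m node` output lines of the form
    "10.0.0.1:3260,1 iqn.2005-10.org.example:target0"
    into (portal, target) tuples; the ",tpgt" suffix is dropped."""
    records = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        addr, _, target = line.partition(" ")
        portal = addr.split(",")[0]        # drop the target portal group tag
        records.append((portal, target))
    return records

def relogin_all():
    """Replay a login for every registered node (requires iscsiadm)."""
    out = subprocess.run(["iscsiadm", "-m", "node"],
                         capture_output=True, text=True, check=True).stdout
    for portal, target in parse_node_records(out):
        # Same command shape the vdsm function issues; nodes that are
        # already logged in simply fail the replay, which is harmless.
        subprocess.run(["iscsiadm", "-m", "node", "-T", target,
                        "-p", portal, "-l"], check=False)
```

Recovered sessions then regroup under multipath round-robin on their own, as described above.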

By fixing this, all these fundamentally wrong problems just disappear. I've tested all kinds of scenarios, am
actually already migrating systems while keeping a close eye on things, and in my situation it works well, and as intended imho.
You can also go back to "every iSCSI IP in its own subnet and VLAN", because all the other things that were written
down do not seem to point in the right direction.
I reverted to that setup too, and it works: 4 EtherChannels (in my case), 4 VLANs, 4 subnets, and thus in multipath -ll, if all sessions
logged in, 4 channels per round-robin bond.

Only the iSCSI bond in the oVirt GUI remains a mystery. With Xen I could conclude, in the old days, that it just makes Xen MP-aware,
so it doesn't panic when it checks iSCSI and sees something fail (??).
(I can imagine checks in situations where iSCSI is used single-channel or without redundancy, for instance, and your orchestration needs
to take care of failures, and from there perhaps call OS tools like iscsiadm & friends.)
So, I thought, this MP awareness should also exist in oVirt, and perhaps this bonding provides it. But "bonding"? That
also suggests handling some redundancy and/or bandwidth, like multipathd, and indeed it tampers with multipath's main config,
but it doesn't seem to do anything truly useful (and imho it can't, this easy way, because round-robin is configured per SAN device section).
So the bonding doesn't work, and if it does, it's so minor that you cannot rely on it for anything.
But I *assume* that, for awareness, it *should* somehow be there, so oVirt at least knows who its "iSCSI MP users" are.
So my test was simply to make 4 "bonds", mapped 1-to-1 so that every iSCSI subnet is in its own bond, resulting in 4 bonds.
It can't hurt, and even if this bonding gets better with some future update, it still doesn't hurt to do it
your own way and rely on iscsiadm and multipathd, just as with Xen in the old days.

More I cannot say about all of this ;-)
So why there was initially consensus about vdsm ICMP testing or not, at which moment and at which layer, and why
that would perhaps require all IPs in the same subnet (which is very, very bad here), and the other discussions, I don't know.
But what I do know is what I discovered, and that was clearly not correct (it was handled too simply), and it was exactly
what happened when all these symptoms occurred.
I can also imagine that this bug is hard to understand. It hits a lot of layers that work together, with
invisible architecture in mind as well. Not every pro coder is a SAN networker, so I don't blame anyone.
I'm happy to have made at least myself happy ;-)

But now it's the oVirt pros' turn, I think. Make this iSCSI bond thing a simple "one-button" enabler, or
build it out into the multipath configurator as perhaps intended (?)

I have been working and testing with the fix for days now, and it works as supposed; but also read my comments written above:
oVirt isn't the king of iSCSI. Still, if you're able to manage the shell, it's very easy to analyse and fix sessions
in extreme loss situations, I think. My tests showed good results.
I hope it stays that way ;-)  (but why not?)

I hope you are a bit more informed now ;-)

Regards,
Jeroen

Comment 27 Jeroen Keet 2021-06-22 19:55:33 UTC
> There's one point that I wasn't able to understand if it was fixed or not.
> On the start of the discussion there was an issue that VDSM tries to connect
> to unreachable paths, since it tries to issue an ICMP request from a path
> that does not communicate with the other one. Breaking basic MPIO premises.

To be clearer, instead of the story above: I don't think that story
is correct(?). Perhaps it is, but after fixing, making things make "sense" again,
and using and testing it, everything works.
No, it doesn't ICMP-test in a Cartesian fashion or otherwise; to test what, and then do what?
(In fact you could almost say that my fix *does* exactly that, while before the fix it just didn't (??).
But my fix prevents useless timeouts, and *also* prevents errors being raised when they shouldn't be.)

Perhaps that story relates to re-login of iSCSI sessions that weren't there
when the host activated; that it somehow polls registered sessions to log in
again when they become available?
In that case, it doesn't work; I never saw any sign of it trying.
You can actually log in the sessions from the shell, as written above, and those
sessions will automatically group again into the multipath round-robin. But that is by their nature.
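As an aside on the ICMP discussion: the quoted fix requires that the target not firewall-filter ping, and is IPv4-only. Probing the iSCSI TCP port itself would sidestep both caveats. A minimal sketch, assuming the default iSCSI port 3260; the `tcp_probe` helper is illustrative and not part of the submitted patch:

```python
import socket

def tcp_probe(host, port=3260, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within
    `timeout` seconds. Unlike ICMP ping, this needs no firewall
    exception for ping and works for IPv6 targets as well."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A `node_login` variant could call `tcp_probe(...)` where the quoted patch shells out to `/usr/bin/ping`, with the same skip-on-failure behaviour.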

I also wanted to say: in my tests I tried all kinds of failures, like failing
3 of 4 ether channels (at the hypervisor level, switch level, and/or SAN storage level), reconnecting, but also failing during reboot and re-activation,
and it all works. But as said, when a path isn't there at login time, the login skips quickly and doesn't
trash vdsm; the channel then needs to be connected manually once the failure is solved,
either by iscsiadm or by a reboot/re-activation. The first can be done while the hypervisor and VMs are running,
and the sessions get grouped again in multipath -ll.
If iSCSI sessions fail while they *were* OK, and the disruption is solved again, everything recovers
automatically. All of it can be seen, e.g., with multipath -ll.

All quick and fluid, and no more mystery, no irritatingly long host non-activations, and no useless reboots.

For the record ;-)

Comment 28 Jeroen Keet 2021-06-28 23:18:32 UTC
Dear all,

I've pushed a patch to Gerrit.
I followed the guides, and after setting up the Change-Id git hook it went through, in some fashion.

I hope it went well, and that the reviewer(s) don't get too bored...

https://gerrit.ovirt.org/c/vdsm/+/115465/1/lib/vdsm/storage/iscsiadm.py

Thank you all !

Jeroen

Comment 29 Nir Soffer 2021-06-29 15:27:39 UTC
Jeroen, this bug is closed, and it is too old and contains too much history to add more data now.
Let's file a new bug with a clear description of the issue and the proposed solution
(if you have one).

What I want to understand is:

- Why is logging in to inaccessible targets wrong?

- If we don't log in to the target when connecting to storage, when
  are we going to log in?

- How do we detect that the target becomes available?

- Which component in the system will trigger the login when the target
  becomes available?

