Bug 1728255 - [Cinderlib] - Starting VM with 3PAR-ISCSI MBS fails on "timeout which can be caused by communication issues"
Keywords:
Status: CLOSED DUPLICATE of bug 1684889
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: ---
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Tal Nisan
QA Contact: Avihai
URL:
Whiteboard:
Duplicates: 1730335
Depends On:
Blocks: 1673035
 
Reported: 2019-07-09 12:56 UTC by Shir Fishbain
Modified: 2019-07-22 14:21 UTC
CC List: 3 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2019-07-22 14:17:05 UTC
oVirt Team: Storage
Embargoed:


Attachments
new_bug_3PAR (2.01 MB, application/zip)
2019-07-09 12:57 UTC, Shir Fishbain
no flags

Description Shir Fishbain 2019-07-09 12:56:53 UTC
Description of problem:
Starting a VM with a 3PAR-iSCSI MBS driver disk fails with the message:
VDSNetworkException: Message timeout which can be caused by communication issues (Failed with error VDS_NETWORK_ERROR and code 5022)

Version-Release number of selected component (if applicable):
ovirt-engine-4.3.5.3-0.1.el7.noarch
vdsm-4.30.23-2.el7ev.x86_64
Cinderlib version: 0.9.0

How reproducible:
100%

Steps to Reproduce:
1. Create a managed block storage domain (3PAR driver)
2. Create a VM
3. Create a disk on the storage domain created in step 1
4. Attach the disk to the VM
5. Start the VM (see the host-side note below)
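
For reference, a minimal sketch of how one might watch the attach from the host side while reproducing step 5 (the log paths assume a default vdsm installation and are not taken from this setup):

# Follow the vdsm and supervdsm logs on the host that runs the VM while starting it;
# the managed block storage attach (os-brick connect) is expected to show up there:
$ tail -f /var/log/vdsm/vdsm.log /var/log/vdsm/supervdsm.log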

Actual results:
Starting the VM fails for the 3PAR-iSCSI driver.

From engine log:
2019-07-09 14:08:15,529+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (EE-ManagedThreadFactory-engine-Thread-71712) [6fc42e37] FINISH, ConnectStoragePoolVDSCommand, return: , 
log id: 5464cc1
2019-07-09 14:08:15,530+03 ERROR [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (EE-ManagedThreadFactory-engine-Thread-71712) [6fc42e37] Could not connect host 'host_mixed_2' to pool 'golden_env_mixed': Message 
timeout which can be caused by communication issues
2019-07-09 14:08:15,622+03 INFO  [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-54) [6fcac365] Running command: SetNonOperationalVdsCommand internal: true
. Entities affected :  ID: d5d281ab-6956-40f1-b7ba-d4e3c7a58f54 Type: VDS
2019-07-09 14:08:15,629+03 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-54) [6fcac365] START, SetVdsStatusVDSCommand(HostName = host_mixed_2, Set
VdsStatusVDSCommandParameters:{hostId='d5d281ab-6956-40f1-b7ba-d4e3c7a58f54', status='NonOperational', nonOperationalReason='STORAGE_DOMAIN_UNREACHABLE', stopSpmFailureLogged='false', maintenanceReason='null'}),
 log id: 13a7b368
2019-07-09 14:08:15,639+03 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-54) [6fcac365] FINISH, SetVdsStatusVDSCommand, return: , log id: 13a7b368
2019-07-09 14:08:15,770+03 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-54) [6fcac365] EVENT_ID: VDS_SET_NONOPERATIONAL_DOMAIN(522)
, Host host_mixed_2 cannot access the Storage Domain(s) <UNKNOWN> attached to the Data Center golden_env_mixed. Setting Host state to Non-Operational.
2019-07-09 14:08:15,781+03 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-54) [6fcac365] EVENT_ID: VDS_ALERT_FENCE_IS_NOT_CONFIGURED(9,000), Failed to verify Power Management configuration for Host host_mixed_2.
2019-07-09 14:08:15,786+03 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-54) [6fcac365] EVENT_ID: CONNECT_STORAGE_POOL_FAILED(995), Failed to connect Host host_mixed_2 to Storage Pool golden_env_mixed
2019-07-09 14:08:15,827+03 INFO  [org.ovirt.engine.core.bll.HandleVdsVersionCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-54) [42c649f4] Running command: HandleVdsVersionCommand internal: true. Entities affected :  ID: d5d281ab-6956-40f1-b7ba-d4e3c7a58f54 Type: VDS
2019-07-09 14:08:15,832+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-54) [42c649f4] Host 'host_mixed_2'(d5d281ab-6956-40f1-b7ba-d4e3c7a58f54) is already in NonOperational status for reason 'STORAGE_DOMAIN_UNREACHABLE'. SetNonOperationalVds command is skipped.
2019-07-09 14:08:50,060+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-71713) [30c9b343] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_1 command AttachManagedBlockStorageVolumeVDS failed: Message timeout which can be caused by communication issues
2019-07-09 14:08:50,062+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.AttachManagedBlockStorageVolumeVDSCommand] (EE-ManagedThreadFactory-engine-Thread-71713) [30c9b343] Command 'AttachManagedBlockStorageVolumeVDSCommand(HostName = host_mixed_1, AttachManagedBlockStorageVolumeVDSCommandParameters:{hostId='7af1b63e-721e-4644-b3fa-4cb392c51c4a', vds='Host[host_mixed_1,7af1b63e-721e-4644-b3fa-4cb392c51c4a]'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2019-07-09 14:08:50,062+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.AttachManagedBlockStorageVolumeVDSCommand] (EE-ManagedThreadFactory-engine-Thread-71713) [30c9b343] FINISH, AttachManagedBlockStorageVolumeVDSCommand, return: , log id: 1083394b
2019-07-09 14:08:50,063+03 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedThreadFactory-engine-Thread-71731) [30c9b343] Host 'host_mixed_1' is not responding.
2019-07-09 14:08:50,077+03 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-71731) [30c9b343] EVENT_ID: VDS_HOST_NOT_RESPONDING(9,027), Host host_mixed_1 is not responding. Host cannot be fenced automatically because power management for the host is disabled.
2019-07-09 14:08:50,085+03 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (EE-ManagedThreadFactory-engine-Thread-71713) [30c9b343] Lock freed to object 'EngineLock:{exclusiveLocks='[3b302f30-357d-4d55-bcb0-5c2f1d3569ee=VM]', sharedLocks=''}'
2019-07-09 14:08:50,085+03 ERROR [org.ovirt.engine.core.bll.RunVmCommand] (EE-ManagedThreadFactory-engine-Thread-71713) [30c9b343] Command 'org.ovirt.engine.core.bll.RunVmCommand' failed: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues (Failed with error VDS_NETWORK_ERROR and code 5022)


Expected results:
Starting the VM should work with the 3PAR-iSCSI driver.

Additional info:
1. I removed all the LUNs that were already attached in 3PAR (https://10.35.84.14:8443/) and it is still impossible to run the VM.
2. I moved the regular iSCSI domains to Unexport status in 3PAR.

Comment 1 Shir Fishbain 2019-07-09 12:57:48 UTC
Created attachment 1588710 [details]
new_bug_3PAR

Comment 2 Benny Zlotnik 2019-07-09 17:43:37 UTC
This is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1684889
Did you verify that the hosts are not connected to the 3PAR portals?
You can use the following command to make sure:
$ iscsiadm --mode node --logoutall=all

Otherwise, this setup is not supported.
os-brick and vdsm cannot use the same portals (especially not when one uses an FQDN and the other an IP address).
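
For reference, a minimal sketch of how one might check this on a host (assuming iscsiadm is installed; nothing here is taken from this particular setup):

# List the iSCSI sessions the host currently has open; the 3PAR portals
# should not appear here if the hosts are really disconnected:
$ iscsiadm --mode session

# List the saved node records (portal + target) the host still knows about:
$ iscsiadm --mode node

# Log out of every saved node, as suggested above:
$ iscsiadm --mode node --logoutall=all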

Comment 3 Fred Rolland 2019-07-15 14:43:06 UTC
Shir,
Can you test the workaround?

Comment 4 Shir Fishbain 2019-07-16 13:33:34 UTC
I opened a new bug about the two workarounds that are needed to start a VM with a 3PAR-iSCSI disk:
https://bugzilla.redhat.com/show_bug.cgi?id=1730335

Comment 5 Avihai 2019-07-21 08:20:39 UTC
(In reply to Benny Zlotnik from comment #2)
> This is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1684889
> Did you verify that the hosts are not connected to the 3par portals?
> you can use the following command to make sure:
> $ iscsiadm --mode node --logoutall=all
> 
> otherwise this setup is not supported.
I strongly disagree that it's OK for this not to be supported, and I think this is a big blocker issue for customers.

As I discussed with Fred at the last bug scrub, this must be supported, because currently:
1) You cannot work with MBS iSCSI SDs and data iSCSI SDs on the same DC -> not acceptable.

Example:

In your DC/cluster you have 2 storage domains:
- A is an iSCSI data domain
- B is an MBS storage domain

You have VMs working with A, and now you add a new MBS SD B and expect it to work with a new VM, as a new SD should.

Well, new VMs in SD B will not start - they are unusable until you disconnect/log out/unmap all hosts from data SD A.
But if you disconnect/log out the iSCSI connections from all hosts, or unmap all hosts from the storage of data SD A, you cannot work with VMs on SD A anymore!


So basically this bug is really about the inability to work with both MBS and data SDs over iSCSI - this is bad and should be fixed.
I am sure the customer does not expect to have to choose between a working iSCSI data domain and an iSCSI MBS domain; he expects both to work.
 

> os-brick and vdsm cannot be using the same portals (especially not when one
> uses fqdn and the other an ip address)
Can't we sync/work around this in code?

How about FCP? Does it also have the same issue?

Comment 6 Fred Rolland 2019-07-22 07:20:29 UTC
The issue is not that you cannot have MBS iSCSI SDs and data iSCSI SDs on the same DC.
The issue is that you are using the same target/portal for both of them, combined with the use of an FQDN in the QE setup.

I don't think this would be a real setup in customer environments.

FC should work with MBS and data domains together, as there are no discovery/connect/disconnect flows in FC.
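
For illustration only (the portal address, FQDN and IQN below are placeholders, not from this environment), the kind of overlap described above would look roughly like this on a host: the data SD is logged in through the portal's IP, while the MBS/os-brick side records the same portal by FQDN, so the two paths end up managing the same physical portal through different records:

# Session opened for the regular iSCSI data domain (portal recorded by IP):
$ iscsiadm --mode session
tcp: [1] 192.0.2.10:3260,1 iqn.2000-05.com.3pardata:example-target (non-flash)

# Node record used for the MBS domain (same portal, recorded by FQDN):
$ iscsiadm --mode node
3par-iscsi.example.com:3260,1 iqn.2000-05.com.3pardata:example-target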

Comment 7 Tal Nisan 2019-07-22 14:17:05 UTC

*** This bug has been marked as a duplicate of bug 1684889 ***

Comment 8 Tal Nisan 2019-07-22 14:21:13 UTC
*** Bug 1730335 has been marked as a duplicate of this bug. ***

