Bug 1570362 - Deleting a LUN while the SD is up and reactivating all hosts causes hosts to become non-operational
Summary: Deleting a LUN while the SD is up and reactivating all hosts causes hosts to become non-operational
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.2.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.3.0
: ---
Assignee: Eyal Shenitzky
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-04-22 10:36 UTC by Avihai
Modified: 2018-04-23 14:22 UTC
6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-23 09:59:31 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.3+


Attachments (Terms of Use)
relevant engine , vdsm logs (2.99 MB, application/x-gzip)
2018-04-22 10:36 UTC, Avihai
relevant engine , vdsm log (256.72 KB, application/x-gzip)
2018-04-22 14:55 UTC, Avihai

Description Avihai 2018-04-22 10:36:36 UTC
Created attachment 1425292 [details]
relevant engine , vdsm logs

Description of problem:
Create a storage domain with a dedicated iSCSI LUN on the storage server, then, from the storage server, delete/unmap the LUN while the SD is up and reactivate (maintenance -> activate) all hosts
-> this causes the hosts to become non-operational.

From Engine:
2018-04-22 13:27:59,448+03 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-2489) [6a526d5b] Domain 'ec44d375-b43c-46dd-b9af-c766a851fd3f:sd_test222' was reported with error code '2003'
2018-04-22 13:27:59,449+03 ERROR [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (EE-ManagedThreadFactory-engine-Thread-2489) [6a526d5b] Storage Domain 'sd_test222' of pool 'golden_env_mixed' is in problem in host 'host_mixed_1'
2018-04-22 13:27:59,456+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-2489) [6a526d5b] EVENT_ID: VDS_STORAGE_VDS_STATS_FAILED(189), Host host_mixed_1 reports about one of the Active Storage Domains as Problematic.
2018-04-22 13:27:59,511+03 INFO  [org.ovirt.engine.core.bll.storage.pool.ConnectHostToStoragePoolServersCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-81) [79f2d665] Running command: ConnectHostToStoragePoolServersCommand internal: true. Entities affected :  ID: 5f741a3c-905e-4695-aee9-34a24410b83f Type: StoragePool
2018-04-22 13:27:59,573+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-81) [79f2d665] START, ConnectStorageServerVDSCommand(HostName = host_mixed_2, StorageServerConnectionManagementVDSParameters:{hostId='14ddbea5-dd3b-44ae-9f39-c137b8d03344', storagePoolId='5f741a3c-905e-4695-aee9-34a24410b83f', storageType='GLUSTERFS', connectionList='[StorageServerConnections:{id='2d42ad55-5bc2-45d1-8208-46feb882942d', connection='gluster01.scl.lab.tlv.redhat.com:/storage_local_ge8_volume_0', iqn='null', vfsType='glusterfs', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='dc08c8e9-1e0b-4d41-9f59-4ceee86481ad', connection='gluster01.scl.lab.tlv.redhat.com:/storage_local_ge8_volume_1', iqn='null', vfsType='glusterfs', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='4e4bf30f-5738-4596-b63d-eb864e0c88cb', connection='gluster01.scl.lab.tlv.redhat.com:/storage_local_ge8_volume_2', iqn='null', vfsType='glusterfs', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]'}), log id: 3d0bb757
2018-04-22 13:27:59,580+03 INFO  [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-15) [5b4961d4] Running command: SetNonOperationalVdsCommand internal: true. Entities affected :  ID: ede6a614-2f97-4d39-97f2-04eafb0a6abb Type: VDS
2018-04-22 13:27:59,592+03 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-15) [5b4961d4] START, SetVdsStatusVDSCommand(HostName = host_mixed_1, SetVdsStatusVDSCommandParameters:{hostId='ede6a614-2f97-4d39-97f2-04eafb0a6abb', status='NonOperational', nonOperationalReason='STORAGE_DOMAIN_UNREACHABLE', stopSpmFailureLogged='false', maintenanceReason='null'}), log id: 6b1c4b24
2018-04-22 13:27:59,620+03 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-15) [5b4961d4] FINISH, SetVdsStatusVDSCommand, log id: 6b1c4b24
2018-04-22 13:27:59,669+03 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-15) [5b4961d4] EVENT_ID: VDS_SET_NONOPERATIONAL_DOMAIN(522), Host host_mixed_1 cannot access the Storage Domain(s) sd_test222 attached to the Data Center golden_env_mixed. Setting Host state to Non-Operational.

VDSM host1 (SPM):
2018-04-22 13:29:50,469+0300 ERROR (monitor/ec44d37) [storage.Monitor] Setting up monitor for ec44d375-b43c-46dd-b9af-c766a851fd3f failed (monitor:329)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line 326, in _setupLoop
    self._setupMonitor()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line 348, in _setupMonitor
    self._produceDomain()
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 158, in wrapper
    value = meth(self, *a, **kw)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line 366, in _produceDomain
    self.domain = sdCache.produce(self.sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 110, in produce
    domain.getRealDomain()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 134, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 151, in _findDomain
    return findMethod(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1637, in findDomain
    return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1577, in findDomainPath
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('ec44d375-b43c-46dd-b9af-c766a851fd3f',)
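
For context, the monitor's setup loop keeps retrying until the domain can be produced, so this failure repeats for as long as the LUN is missing. A rough, simplified sketch of that retry pattern (illustrative only, not the actual vdsm code; the retry interval is an assumption, and 'stopped' would be a threading.Event controlled by the monitor):

import logging

SETUP_RETRY_INTERVAL = 10  # seconds; illustrative value, not the real vdsm interval


class StorageDomainDoesNotExist(Exception):
    """Stand-in for vdsm.storage.exception.StorageDomainDoesNotExist."""


def produce_domain(sd_uuid):
    # Stand-in for sdCache.produce(); it keeps raising while the LUN backing
    # the domain is gone, as in the traceback above.
    raise StorageDomainDoesNotExist(sd_uuid)


def setup_loop(sd_uuid, stopped):
    # The monitor retries setup until it succeeds or is stopped, which is
    # why "Setting up monitor ... failed" appears repeatedly in vdsm.log.
    while not stopped.is_set():
        try:
            return produce_domain(sd_uuid)
        except StorageDomainDoesNotExist:
            logging.error("Setting up monitor for %s failed", sd_uuid)
            stopped.wait(SETUP_RETRY_INTERVAL)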



Version-Release number of selected component (if applicable):
4.2.3-0.1

How reproducible:
100%

Steps to Reproduce:

0) A DC/cluster with 2 hosts: host_mixed_1 (SPM) and host_mixed_2
1) On the storage server, create a new LUN (lun1) and map it to the hosts
2) Create a new SD with the new LUN and attach it to the DC
3) On the storage server, unmap/delete lun1
4) Move the hosts to maintenance and re-activate them (see the SDK sketch below)
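
A minimal sketch of step 4 using the oVirt Python SDK (ovirtsdk4); the engine URL, credentials, and polling interval are placeholders, and the host names are taken from the setup above:

import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder engine URL and credentials for the test environment.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    insecure=True,
)

hosts_service = connection.system_service().hosts_service()

for name in ('host_mixed_1', 'host_mixed_2'):
    host = hosts_service.list(search='name=%s' % name)[0]
    host_service = hosts_service.host_service(host.id)

    # Move the host to maintenance and wait until it gets there...
    host_service.deactivate()
    while host_service.get().status != types.HostStatus.MAINTENANCE:
        time.sleep(5)

    # ...then re-activate it (step 4 above).
    host_service.activate()

connection.close()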

Actual results:
The HSM host host_mixed_2 becomes non-responsive.

Expected results:


Additional info:

Comment 1 Avihai 2018-04-22 14:54:25 UTC
Hi Nir,

We had an automation run in which we saw similar issues, with the hosts going to a NON-OPERATIONAL state after a storage domain LUN was removed from an active SD and the hosts were moved to maintenance and reactivated.

I attached the relevant VDSM and engine logs. Just to be on the safe side, can you go over them and see whether this is the same root cause or a different/new issue?

Comment 2 Avihai 2018-04-22 14:55:19 UTC
Created attachment 1425357 [details]
relevant engine , vdsm log

Comment 3 Yaniv Kaul 2018-04-23 06:24:22 UTC
Avihai, what is the expected result? I fail to see the bug here - hosts cannot connect to storage, they should become non-op.

Comment 4 Avihai 2018-04-23 09:01:15 UTC
(In reply to Yaniv Kaul from comment #3)
> Avihai, what is the expected result? I fail to see the bug here - hosts
> cannot connect to storage, they should become non-op.

The issue here came from an automation Tier2 scenario (starting with the test scenario above) in which we saw all hosts toggle between active -> non-operational for ~3 hours! (See the attached log, which contains only those entries.)

I would expect that once the SD is unavailable, hosts should move to non-operational ONCE and not keep toggling between the active and non-operational states.

cat Hosts_NONOPERATIONAL.log:
2018-04-14 18:07:05,947+03 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-38) [3ca03d92] EVENT_ID: VDS_SET_NONOPERATIONAL_DOMAIN(522), Host host_mixed_2 cannot access the Storage Domain(s) sd_TestCase10140_1417574196 attached to the Data Center golden_env_mixed. Setting Host state to Non-Operational.

..............................................................................
..............................................................................
..............................................................................

2018-04-14 21:58:08,057+03 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-79) [2db87384] EVENT_ID: VDS_SET_NONOPERATIONAL_DOMAIN(522), Host host_mixed_2 cannot access the Storage Domain(s) <UNKNOWN> attached to the Data Center golden_env_mixed. Setting Host state to Non-Operational.
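
(For reference, a small sketch of how such a filtered log can be produced from engine.log; the log path is a placeholder:)

import re

# Collect the non-operational transitions from the engine log.
pattern = re.compile(r'EVENT_ID: VDS_SET_NONOPERATIONAL_DOMAIN\(522\)')

with open('engine.log') as log:
    events = [line.rstrip() for line in log if pattern.search(line)]

print('%d non-operational transitions' % len(events))
for event in events:
    print(event)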

Comment 5 Yaniv Kaul 2018-04-23 09:13:41 UTC
(In reply to Avihai from comment #4)
> (In reply to Yaniv Kaul from comment #3)
> > Avihai, what is the expected result? I fail to see the bug here - hosts
> > cannot connect to storage, they should become non-op.
> 
> The issue here came from an automation Tier2 scenario (Started with Test
> scenario above) which we see that all hosts toggled between active ->
> non-operational for ~3 hours! (see an attached log with only the entries).
> 
> I would expect that after the SD is not available hosts should move to
> non-operational ONCE and not toggle between active-> non-operational states.

I *think* we retry to get them up, since it may be a temporary issue (connectivity, mapping/zoning, etc.). What's wrong with this?

Comment 6 Avihai 2018-04-23 09:38:08 UTC
(In reply to Yaniv Kaul from comment #5)
> (In reply to Avihai from comment #4)
> > (In reply to Yaniv Kaul from comment #3)
> > > Avihai, what is the expected result? I fail to see the bug here - hosts
> > > cannot connect to storage, they should become non-op.
> > 
> > The issue here came from an automation Tier2 scenario (Started with Test
> > scenario above) which we see that all hosts toggled between active ->
> > non-operational for ~3 hours! (see an attached log with only the entries).
> > 
> > I would expect that after the SD is not available hosts should move to
> > non-operational ONCE and not toggle between active-> non-operational states.
> 
> I *think* we retry to get them up, since it it may be a temporary issue
> (connectivity, mapping/zoning, etc.). What's wrong with this?

I think that after retrying for a short time / a small number of retries we should:

Plan A:
Stop retrying and take more drastic action: forcefully remove/destroy the problematic SD instead of leaving hosts cycling between up and non-operational for hours.

I know that currently we try to deactivate a problematic SD if no host sees it, but in this case, because an OVF update was in progress, deactivating the SD failed.

What I suggest is to extend that behavior:
If removing/destroying the problematic SD fails because of an OVF update or for another reason, keep retrying afterward as many times as necessary until the problematic SD is removed (see the sketch below) -> WDYT?
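
(Illustrative only: a rough sketch of the proposed policy, not engine code; the retry bound, interval, and helper names are made up:)

import time


def retry_then_escalate(check_ok, escalate, max_checks=3, interval=60):
    """Proposed policy sketch: after a bounded number of failed checks,
    stop retrying and escalate (e.g. force-deactivate the problematic SD),
    retrying the escalation itself until it succeeds."""
    for _ in range(max_checks):
        if check_ok():
            return  # the domain is reachable again; nothing else to do
        time.sleep(interval)

    while not escalate():
        # Escalation may fail temporarily (e.g. an OVF update in progress);
        # keep retrying instead of flipping hosts up and down for hours.
        time.sleep(interval)


# Hypothetical wiring (helper names are made up):
# retry_then_escalate(
#     check_ok=lambda: any_host_sees_domain(sd_uuid),
#     escalate=lambda: force_deactivate_domain(sd_uuid),
# )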

Plan B (a bad plan):
Move to a stable state, perhaps retrying at a longer interval.

About "(connectivity, mapping/zoning, etc.) temporary issue":
In this case, the only disruptive thing that was done was to remove the LUN of a specific storage domain.

This is not temporary but a persistent state that can occur, and we should be able to deal with it with minimal impact on the hosts and on other, unrelated storage domains that were not affected by this disruption.

Also, in this case the storage server and network connectivity (zoning/mapping) were not affected, apart from that specific LUN.

How can I be sure?
Well, during the same test we also had 3 other storage domains (named 'iscsi_0/1/2') connected to that same server with different LUNs (which were not deleted); they remained active and the hosts did not complain about them.

Comment 7 Yaniv Kaul 2018-04-23 09:40:09 UTC
Avihai,
All the above is a good RFE to discuss, but not a bug. Can we close this one?

I suggest you talk to Yaniv L. and Tal about the proposal above before opening an RFE.

Comment 8 Avihai 2018-04-23 09:54:45 UTC
(In reply to Yaniv Kaul from comment #7)
> Aviahi,
> All the above is a good RFE to discuss, but not a bug. Can we close this one?
> 
> I suggest you talk to Yaniv L. and Tal about the proposal above before
> opening a RFE.

Yaniv,
Think of yourself as a customer who has to deal with 3 hours of all the hosts going non-operational and back up again because one SD is disrupting your whole environment.

If this is still an RFE in your eyes, then I trust your judgment.

Comment 9 Yaniv Kaul 2018-04-23 09:59:31 UTC
(In reply to Avihai from comment #8)
> (In reply to Yaniv Kaul from comment #7)
> > Aviahi,
> > All the above is a good RFE to discuss, but not a bug. Can we close this one?
> > 
> > I suggest you talk to Yaniv L. and Tal about the proposal above before
> > opening a RFE.
> 
> Yaniv,
> Think of yourself as a customer that have to 3 hours of all the hosts going
> to non-operational & up again due to one SD disrupting your whole
> environment.
> 
> If still, this is an RFE in your eyes than I trust your judgment.

It is - it's not a regression, it's a request to change existing behavior (and it's debatable whether the current behavior is OK or not - imagine a temporary issue in the middle of the night: why not let it continue and retry?). If you wish to propose a better behavior - that is exactly the definition of an RFE. I'm closing this one - please talk to Yaniv L. before opening a relevant RFE.

Comment 10 Elad 2018-04-23 11:46:38 UTC
Yaniv, correct me if I'm wrong - the current behavior is that when all hosts cannot access a storage domain, none of the hosts should move to non-operational.
This is exactly what Avihai is reporting, and it is not something new; therefore, it should not be an RFE.

Comment 11 Yaniv Kaul 2018-04-23 11:53:28 UTC
(In reply to Elad from comment #10)
> Yaniv, correct me if I'm wrong - the current behavior is when all hosts
> cannot access a storage domain, non of the hosts should move to non
> operational.
> This is exactly what Avihai is reporting about and this is not something
> new, therefore, should not be an RFE.

1. Is it a regression?
2. All hosts should not move to non-op if they were previously active - but not if they were only activated after the storage went kaboom?

Comment 12 Elad 2018-04-23 12:38:07 UTC
> 1. Is it a regression?
Will have to check on previous builds
> 2. All hosts should not move to non-op - if they were previously active. Not
> if they were not activated before the storage went kaboom?
The hosts were active before the storage went down.

Comment 13 Elad 2018-04-23 12:57:44 UTC
After an F2F with Yaniv, we've agreed that we'll check the scenario of removing a LUN without deactivating and activating the hosts. Avihai, please do. Thanks.

Comment 14 Avihai 2018-04-23 14:22:47 UTC
(In reply to Elad from comment #13)
> After a F2F with Yaniv, we've agreed that we'll check the scenario of
> removing a LUN without deactivating and activating the hosts. Avihai, please
> do. Thanks

I ran the requested scenario (removing a LUN without deactivating and activating the hosts), and the hosts did not go into a non-operational state.
What happened is that ~5 minutes after the LUN was removed, the SD was deactivated.

