Bug 1459532

Summary: Deactivation of NFS SD fails after (1+ minute) vdsm kill was done
Product: [oVirt] ovirt-engine
Reporter: Avihai <aefrat>
Component: BLL.Storage
Assignee: Maor <mlipchuk>
Status: CLOSED WORKSFORME
QA Contact: Raz Tamir <ratamir>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.1.3
CC: aefrat, amureini, bugs
Target Milestone: ovirt-4.1.4
Flags: rule-engine: ovirt-4.1+
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-06-20 07:50:37 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Attachments:
engine & vdsm logs (flags: none)

Description Avihai 2017-06-07 12:08:01 UTC
Created attachment 1285776 [details]
engine & vdsm logs

Description of problem:
Deactivation of an NFS SD fails after a vdsm kill was done (1+ minute earlier) and the DC was already up again.

Version-Release number of selected component (if applicable):
Engine:
ovirt-engine-4.1.3.1-0.1.el7.noarch

VDSM:
4.19.17-1

How reproducible:
Hard to reproduce.
Only reproduced once in an automation tier3 run; locally in my environment I ran it 10 times and it did not occur.


Steps to Reproduce:
1. Create DC + cluster on v3
2. Upgrade the Cluster to ver 4 and kill VDSM on the SPM host during the operation (see the kill sketch after these steps)
3. Upgrade the Cluster again
4. Deactivate storage domain 
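
For step 2, the automation finds the VDSM PID on the SPM host and kills it (see the pgrep/kill lines in the timeline below). A minimal illustrative sketch of that step, not the actual ART test code, assuming passwordless root SSH to the SPM host (host address taken from the automation log):

import subprocess

SPM_HOST = "10.35.66.4"  # SPM host used in this run (from the automation log)

def ssh(cmd):
    """Run a command on the SPM host over SSH and return its stdout."""
    result = subprocess.run(
        ["ssh", "root@" + SPM_HOST, cmd],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

vdsm_pid = ssh("pgrep vdsm").splitlines()[0]  # first matching pid, as a simplification
ssh("kill -9 " + vdsm_pid)                    # simulate the VDSM crash mid-upgrade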


Actual results:
SD deactivation failed.


Expected results:
SD should be deactivated successfully


Additional info:
The automation output shows that the remove failed, but the actual cause was that deactivation of SD upgrade_4_0_to_4_1nfs0 did not occur; see the engine & VDSM logs:

From the engine log you can see the deactivation failed:
2017-06-05 02:13:30,757+03 ERROR [org.ovirt.engine.core.bll.storage.domain.DeactivateStorageDomainCommand] (DefaultQuartzScheduler3) [998764a] Aborting execution due to failure to stop SPM
2017-06-05 02:13:30,769+03 INFO  [org.ovirt.engine.core.bll.storage.domain.DeactivateStorageDomainCommand] (DefaultQuartzScheduler3) [998764a] Command [id=98559627-f344-438d-b479-b1b4b9845966]: Compensating CHANGED_STATUS_ONLY of org.ovirt.engine.core.common.businessentities.StoragePool; snapshot: EntityStatusSnapshot:{id='4a4485b7-181b-4ca3-a582-b330fa886500', status='Up'}.
2017-06-05 02:13:30,771+03 INFO  [org.ovirt.engine.core.bll.storage.domain.DeactivateStorageDomainCommand] (DefaultQuartzScheduler3) [998764a] Command [id=98559627-f344-438d-b479-b1b4b9845966]: Compensating CHANGED_STATUS_ONLY of org.ovirt.engine.core.common.businessentities.StoragePoolIsoMap; snapshot: EntityStatusSnapshot:{id='StoragePoolIsoMapId:{storagePoolId='4a4485b7-181b-4ca3-a582-b330fa886500', storageId='34d2bbea-f998-4e1d-b132-66cfca9d0b2e'}', status='Unknown'}.
2017-06-05 02:13:30,778+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler3) [998764a] EVENT_ID: USER_DEACTIVATE_STORAGE_DOMAIN_FAILED(969), Correlation ID: 998764a, Job ID: 0e1a8f21-d056-454f-a5d1-857766285ad5, Call Stack: null, Custom Event ID: -1, Message: Failed to deactivate Storage Domain upgrade_4_0_to_4_1nfs0 (Data Center dc_upgrade_4_0_to_4_1).
2017-06-05 02:13:30,782+03 INFO  [org.ovirt.engine.core.bll.storage.domain.DeactivateStorageDomainWithOvfUpdateCommand] (DefaultQuartzScheduler3) [998764a] Lock freed to object 'EngineLock:{exclusiveLocks='[34d2bbea-f998-4e1d-b132-66cfca9d0b2e=<STORAGE, ACTION_TYPE_FAILED_OBJECT_LOCKED>, 4a4485b7-181b-4ca3-a582-b330fa886500=<POOL, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'

VDSM log:

2017-06-05 02:13:06,069+0300 ERROR (jsonrpc/0) [storage.TaskManager.Task] (Task='cbe4ee58-90af-4e26-98d7-311f5318cfa0') Unexpected error (task:870)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 877, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in getAllTasksStatuses
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
    ret = func(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2166, in getAllTasksStatuses
    allTasksStatus = self._pool.getAllTasksStatuses()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 77, in wrapper
    raise SecureError("Secured object is not in safe state")
SecureError: Secured object is not in safe state
2017-06-05 02:13:06,069+0300 INFO  (jsonrpc/0) [storage.TaskManager.Task] (Task='cbe4ee58-90af-4e26-98d7-311f5318cfa0') aborting: Task is aborted: u'Secured object is not in safe state' - code 100 (task:1175)
2017-06-05 02:13:06,069+0300 ERROR (jsonrpc/0) [storage.Dispatcher] FINISH getAllTasksStatuses error=Secured object is not in safe state (dispatcher:82)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 72, in wrapper
    result = ctask.prepare(func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 105, in wrapper
    return m(self, *a, **kw)
  File "/usr/share/vdsm/storage/task.py", line 1183, in prepare
    raise self.error
SecureError: Secured object is not in safe state
2017-06-05 02:13:06,070+0300 INFO  (jsonrpc/0) [jsonrpc.JsonRpc

2017-06-05 02:13:29,904+0300 INFO  (jsonrpc/3) [vdsm.api] START deactivateStorageDomain(sdUUID='34d2bbea-f998-4e1d-b132-66cfca9d0b2e', spUUID='4a4485b7-181b-4ca3-a582-b330fa886500', msdUUID='00000000-0000-0000-0000-000000000000', masterVersion=1, options=None) from=::ffff:10.35.69.107,39538, flow_id=998764a (api:46)
2017-06-05 02:13:29,906+0300 INFO  (jsonrpc/3) [storage.StoragePool] sdUUID=34d2bbea-f998-4e1d-b132-66cfca9d0b2e spUUID=4a4485b7-181b-4ca3-a582-b330fa886500 newMsdUUID=00000000-0000-0000-0000-000000000000 (sp:1154)
2017-06-05 02:13:29,906+0300 INFO  (jsonrpc/3) [storage.StoragePool] Silently ignoring the deactivation request for the master domain 34d2bbea-f998-4e1d-b132-66cfca9d0b2e (sp:1177)
2017-06-05 02:13:29,906+0300 INFO  (jsonrpc/3) [vdsm.api] FINISH deactivateStorageDomain return=None from=::ffff:10.35.69.107,39538, flow_id=998764a (api:52)


automation Full timeline :
2017-06-05 02:10:49,583 - MainThread - art.logging - DEBUG - Skipping duplicate log-messages...
2017-06-05 02:10:49,584 - MainThread - art.logging - INFO - Test class description: 1. Create DC + cluster on v3
2017-06-05 02:10:49,584 - MainThread - art.logging - INFO - Test class description: 2. Upgrade the Cluster to ver 4 and kill the SPM during the operation
2017-06-05 02:10:49,584 - MainThread - art.logging - INFO - Test class description: 3. Upgrade the Cluster again
2017-06-05 02:10:49,584 - MainThread - art.logging - INFO - SETUP <TestCaseFunction 'test_verify_new_domain_version'>
2017-06-05 02:10:49,584 - MainThread - art.logging - INFO - PACKAGE SETUP: Entry(/var/lib/jenkins/workspace/rhv-4.1v4-ge-runner-tier3/automation/ART/art/tests/rhevmtests/storage/storage_qcow2_v3)
2017-06-05 02:10:49,692 - MainThread - rhevmtests.helpers - INFO - Retrieve all Storage domains
2017-06-05 02:10:49,824 - MainThread - rhevmtests.helpers - INFO - The storage domains names in engine: ['export_domain', 'iscsi_0', 'iscsi_1', 'iscsi_2', 'ISO_DOMAIN', 'netappISO', 'nfs_0', 'nfs_1', 'nfs_2', 'rhevm-qe-infra-glance', 'test_gluster_0', 'test_gluster_1', 'test_gluster_2']
2017-06-05 02:10:49,825 - MainThread - rhevmtests.helpers - INFO - The GE storage domains names: ['rhevm-qe-infra-glance', 'ovirt-image-repository', 'nfs_0', 'nfs_1', 'nfs_2', 'iscsi_0', 'iscsi_1', 'iscsi_2', 'test_gluster_0', 'test_gluster_1', 'test_gluster_2', 'rhevm-qe-infra-cinder', 'export_domain', 'netappISO', 'ISO_DOMAIN']
2017-06-05 02:10:50,617 - MainThread - test_utils - INFO - All tasks are gone
2017-06-05 02:10:50,617 - MainThread - pytest.art.matrix - INFO - The storage type switched to nfs
2017-06-05 02:10:50,623 - MainThread - root - INFO - Get SPM host from the list of hosts host_mixed_1,host_mixed_2,host_mixed_3
2017-06-05 02:10:50,627 - MainThread - root - INFO - Checking Storage Pool Manager (SPM) status of the host host_mixed_1.
2017-06-05 02:10:50,631 - MainThread - root - INFO - Get host object by host_name host_mixed_1. with {'attribute': 'name'}
2017-06-05 02:10:50,714 - MainThread - art.ll_lib.hosts - INFO - check_host_spm_status - SPM Status of host host_mixed_1 is: none
2017-06-05 02:10:50,715 - MainThread - root - ERROR - Failed to checking storage pool manager (spm) status of the host host_mixed_1.
2017-06-05 02:10:50,718 - MainThread - root - INFO - Checking Storage Pool Manager (SPM) status of the host host_mixed_2.
2017-06-05 02:10:50,722 - MainThread - root - INFO - Get host object by host_name host_mixed_2. with {'attribute': 'name'}
2017-06-05 02:10:50,806 - MainThread - art.ll_lib.hosts - INFO - check_host_spm_status - SPM Status of host host_mixed_2 is: spm
2017-06-05 02:10:50,811 - MainThread - root - INFO - Get any not SPM host from the list of hosts host_mixed_1,host_mixed_2,host_mixed_3 in the expected state with {'cluster_name': 'golden_env_mixed_1'}
2017-06-05 02:10:50,815 - MainThread - root - INFO - Get host object by host_name host_mixed_1. with {'attribute': 'name'}
2017-06-05 02:10:50,906 - MainThread - root - INFO - Get host object by host_name host_mixed_2. with {'attribute': 'name'}
2017-06-05 02:10:50,996 - MainThread - root - INFO - Get host object by host_name host_mixed_3. with {'attribute': 'name'}
2017-06-05 02:10:51,149 - MainThread - root - INFO - Checking Storage Pool Manager (SPM) status of the host host_mixed_1.
2017-06-05 02:10:51,153 - MainThread - root - INFO - Get host object by host_name host_mixed_1. with {'attribute': 'name'}
2017-06-05 02:10:51,230 - MainThread - art.ll_lib.hosts - INFO - check_host_spm_status - SPM Status of host host_mixed_1 is: none
2017-06-05 02:10:51,230 - MainThread - root - ERROR - Failed to checking storage pool manager (spm) status of the host host_mixed_1.
2017-06-05 02:10:51,230 - MainThread - art.ll_lib.dcs - INFO - Add datacenter dc_upgrade_4_0_to_4_1 with {'version': '4.0', 'name': 'dc_upgrade_4_0_to_4_1'}
2017-06-05 02:10:51,231 - MainThread - datacenters - INFO - Using Correlation-Id: datacenters_create_75a6bf29-041f-41b6
2017-06-05 02:10:51,428 - MainThread - datacenters - INFO - New entity was added
2017-06-05 02:10:51,432 - MainThread - datacenters - INFO - Property 'data_center->name' has correct value: dc_upgrade_4_0_to_4_1
2017-06-05 02:10:51,432 - MainThread - datacenters - INFO - Property 'data_center->local' has correct value: false
2017-06-05 02:10:51,432 - MainThread - datacenters - INFO - Property 'data_center->version->major' has correct value: 4
2017-06-05 02:10:51,432 - MainThread - datacenters - INFO - Property 'data_center->version->minor' has correct value: 0
2017-06-05 02:10:51,513 - MainThread - datacenters - INFO - Validated supported versions of data center dc_upgrade_4_0_to_4_1
2017-06-05 02:10:51,513 - MainThread - art.rhevm_api.tests_lib.low_level.clusters - INFO - Create cluster cluster_upgrade_4_0_to_4_1 with {'data_center': 'dc_upgrade_4_0_to_4_1', 'name': 'cluster_upgrade_4_0_to_4_1', 'version': '4.0', 'cpu': 'Intel Conroe Family'}
2017-06-05 02:10:51,602 - MainThread - clusters - INFO - Using Correlation-Id: clusters_create_7f02a10c-635f-404b
2017-06-05 02:10:51,947 - MainThread - clusters - INFO - New entity was added
2017-06-05 02:10:51,952 - MainThread - clusters - INFO - Property 'cluster->data_center->id' has correct value: 4a4485b7-181b-4ca3-a582-b330fa886500
2017-06-05 02:10:51,952 - MainThread - clusters - INFO - Property 'cluster->cpu->type_' has correct value: Intel Conroe Family
2017-06-05 02:10:51,952 - MainThread - clusters - INFO - Property 'cluster->version->major' has correct value: 4
2017-06-05 02:10:51,952 - MainThread - clusters - INFO - Property 'cluster->version->minor' has correct value: 0
2017-06-05 02:10:51,953 - MainThread - clusters - INFO - Property 'cluster->name' has correct value: cluster_upgrade_4_0_to_4_1
2017-06-05 02:10:51,964 - MainThread - root - INFO - Switch host host_mixed_1 to different cluster cluster_upgrade_4_0_to_4_1 with {'activate': True}
2017-06-05 02:10:51,968 - MainThread - root - INFO - Checks if host host_mixed_1 is in maintenance state
2017-06-05 02:10:51,972 - MainThread - root - INFO - Returns host host_mixed_1 status
2017-06-05 02:10:51,976 - MainThread - root - INFO - Get host object by host_name host_mixed_1. with {'attribute': 'name'}
2017-06-05 02:10:52,074 - MainThread - root - INFO - Deactivate the host host_mixed_1. with {'expected_status': 'maintenance'}
2017-06-05 02:10:52,077 - MainThread - root - INFO - Get host object by host_name host_mixed_1. with {'attribute': 'name'}
2017-06-05 02:10:52,167 - MainThread - root - INFO - Get host hosted-engine object with {'host_name': 'host_mixed_1'}
2017-06-05 02:10:52,244 - MainThread - root - ERROR - Failed to get host hosted-engine object with {'host_name': 'host_mixed_1'}
2017-06-05 02:10:52,245 - MainThread - hosts - INFO - Using Correlation-Id: hosts_syncAction_cf113b81-3b77-4103
2017-06-05 02:11:02,706 - MainThread - hosts - INFO - host status is 'maintenance'
2017-06-05 02:11:02,711 - MainThread - root - INFO - Update properties of host host_mixed_1 (provided in parameters) with {'cluster': 'cluster_upgrade_4_0_to_4_1'}
2017-06-05 02:11:02,716 - MainThread - root - INFO - Get host object by host_name host_mixed_1. with {'attribute': 'name'}
2017-06-05 02:11:02,900 - MainThread - hosts - INFO - Using Correlation-Id: hosts_update_0574ea9e-99f0-41a1
2017-06-05 02:11:08,425 - MainThread - hosts - INFO - host was updated
2017-06-05 02:11:08,431 - MainThread - hosts - INFO - Property 'host->cluster->id' has correct value: 696dc642-3da5-4b8f-9662-10eee447af66
2017-06-05 02:11:08,441 - MainThread - root - INFO - Activate host host_mixed_1 (set status to UP) with {'wait': True}
2017-06-05 02:11:08,445 - MainThread - root - INFO - Get host object by host_name host_mixed_1. with {'attribute': 'name'}
2017-06-05 02:11:08,527 - MainThread - hosts - INFO - Using Correlation-Id: hosts_syncAction_26f9a20b-a7aa-4bba
2017-06-05 02:11:18,765 - MainThread - hosts - INFO - host status is 'up'
2017-06-05 02:11:18,769 - MainThread - root - INFO - Get SPM host from the list of hosts host_mixed_1,host_mixed_2,host_mixed_3
2017-06-05 02:11:18,773 - MainThread - root - INFO - Checking Storage Pool Manager (SPM) status of the host host_mixed_1.
2017-06-05 02:11:18,776 - MainThread - root - INFO - Get host object by host_name host_mixed_1. with {'attribute': 'name'}
2017-06-05 02:11:18,854 - MainThread - art.ll_lib.hosts - INFO - check_host_spm_status - SPM Status of host host_mixed_1 is: none
2017-06-05 02:11:18,855 - MainThread - root - ERROR - Failed to checking storage pool manager (spm) status of the host host_mixed_1.
2017-06-05 02:11:18,858 - MainThread - root - INFO - Checking Storage Pool Manager (SPM) status of the host host_mixed_2.
2017-06-05 02:11:18,862 - MainThread - root - INFO - Get host object by host_name host_mixed_2. with {'attribute': 'name'}
2017-06-05 02:11:18,939 - MainThread - art.ll_lib.hosts - INFO - check_host_spm_status - SPM Status of host host_mixed_2 is: spm
2017-06-05 02:11:18,939 - MainThread - art.hl_lib.sds - INFO - Add NFS storage domain upgrade_4_0_to_4_1nfs0 to datacenter dc_upgrade_4_0_to_4_1
2017-06-05 02:11:19,029 - MainThread - root - INFO - Get Host host_mixed_2 compatibility version
2017-06-05 02:11:19,043 - MainThread - root - INFO - Get host object by host_name host_mixed_2. with {'attribute': 'name'}
2017-06-05 02:11:19,197 - MainThread - root - INFO - Get Cluster golden_env_mixed_1 compatibility version
2017-06-05 02:11:19,261 - MainThread - storagedomains - INFO - Using Correlation-Id: storagedomains_create_c5bdad2f-f5fa-4249
2017-06-05 02:11:24,089 - MainThread - storagedomains - INFO - New entity was added
2017-06-05 02:11:24,091 - MainThread - storagedomains - INFO - Property 'storage_domain->storage->address' has correct value: vserver-nas01.qa.lab.tlv.redhat.com
2017-06-05 02:11:24,091 - MainThread - storagedomains - INFO - Property 'storage_domain->storage->path' has correct value: /nas01/ge_4_unused_nfs_0
2017-06-05 02:11:24,091 - MainThread - storagedomains - INFO - Property 'storage_domain->storage->type_' has correct value: nfs
2017-06-05 02:11:24,091 - MainThread - storagedomains - INFO - Property 'storage_domain->storage_format' has correct value: v3
2017-06-05 02:11:24,092 - MainThread - storagedomains - INFO - Property 'storage_domain->name' has correct value: upgrade_4_0_to_4_1nfs0
2017-06-05 02:11:24,092 - MainThread - storagedomains - INFO - Attribute: import___ changed to: import
2017-06-05 02:11:24,092 - MainThread - storagedomains - INFO - Property 'storage_domain->type_' has correct value: data
2017-06-05 02:11:24,313 - MainThread - storagedomains - INFO - Using Correlation-Id: storagedomains_create_b5f9c16c-84a9-4441
2017-06-05 02:12:20,614 - MainThread - storagedomains - INFO - New entity was added
2017-06-05 02:12:20,616 - MainThread - storagedomains - INFO - Property 'storage_domain->id' has correct value: 34d2bbea-f998-4e1d-b132-66cfca9d0b2e
2017-06-05 02:12:20,616 - MainThread - storagedomains - INFO - Attribute: import___ changed to: import
2017-06-05 02:12:20,728 - MainThread - datacenters - INFO - data_center status is 'up'
2017-06-05 02:12:20,953 - MainThread - storagedomains - WARNING - Storage domain upgrade_4_0_to_4_1nfs0 already activated
2017-06-05 02:12:20,954 - MainThread - art.ll_lib.jobs - INFO - Waiting for jobs ['Adding .* Storage Domain .*', 'Activating Storage Domain .* on Data Center .*']
2017-06-05 02:12:21,037 - MainThread - art.ll_lib.jobs - INFO - Active jobs: []
2017-06-05 02:12:21,108 - MainThread - art.ll_lib.jobs - INFO - JOB 'Adding NFS Storage Domain upgrade_4_0_to_4_1nfs0 to Data Center <UNKNOWN>' TOOK 2.442 seconds
2017-06-05 02:12:21,183 - MainThread - art.ll_lib.jobs - INFO - All jobs are gone
2017-06-05 02:12:22,126 - MainThread - test_utils - INFO - All tasks are gone
2017-06-05 02:12:22,127 - MainThread - art.logging - INFO - ================================================================================
2017-06-05 02:12:22,128 - MainThread - art.logging - INFO - Test Name: rhevmtests.storage.storage_qcow2_v3.test_qcow2_v3.TestCase18307_REST_NFS.test_verify_new_domain_version
2017-06-05 02:12:22,128 - MainThread - art.logging - INFO - Iteration number: 264
2017-06-05 02:12:22,132 - MainThread - root - INFO - Update cluster cluster_upgrade_4_0_to_4_1 with {'version': '4.1'}
2017-06-05 02:12:22,192 - MainThread - clusters - INFO - Using Correlation-Id: clusters_update_0d44bc90-8d06-4417
2017-06-05 02:12:22,400 - MainThread - clusters - INFO - cluster was updated
2017-06-05 02:12:22,402 - MainThread - clusters - INFO - Property 'cluster->version->major' has correct value: 4
2017-06-05 02:12:22,402 - MainThread - clusters - INFO - Property 'cluster->version->minor' has correct value: 1
2017-06-05 02:12:22,460 - MainThread - art.ll_lib.dcs - INFO - Update datacenter dc_upgrade_4_0_to_4_1 with {'version': '4.1'}
2017-06-05 02:12:22,461 - MainThread - datacenters - INFO - Using Correlation-Id: datacenters_update_6d70dd15-64a6-4de1
2017-06-05 02:12:25,024 - MainThread - datacenters - INFO - data_center was updated
2017-06-05 02:12:25,025 - MainThread - datacenters - INFO - Property 'data_center->version->major' has correct value: 4
2017-06-05 02:12:25,026 - MainThread - datacenters - INFO - Property 'data_center->version->minor' has correct value: 1
2017-06-05 02:12:25,030 - MainThread - root - INFO - Get host resource by host_name host_mixed_1
2017-06-05 02:12:25,030 - MainThread - VDS - INFO - [10.35.66.4] Executing command pgrep vdsm
2017-06-05 02:12:25,624 - MainThread - VDS - INFO - [10.35.66.4] Executing command pgrep super
2017-06-05 02:12:26,204 - MainThread - art.ll_lib.hosts - INFO - kill VDSM pid: 10813 on host 10.35.66.4
2017-06-05 02:12:26,205 - MainThread - VDS - INFO - [10.35.66.4] Executing command kill -9 10813
2017-06-05 02:12:26,751 - MainThread - root - INFO - Get host object by host_name host_mixed_1. with {'attribute': 'name'}
2017-06-05 02:12:26,914 - MainThread - art.ll_lib.hosts - INFO - Check if host host_mixed_1 has state connecting
2017-06-05 02:12:37,004 - MainThread - art.ll_lib.hosts - DEBUG - Skipping duplicate log-messages...
2017-06-05 02:12:37,004 - MainThread - art.ll_lib.hosts - INFO - Host host_mixed_1 has state connecting
2017-06-05 02:12:37,067 - MainThread - datacenters - INFO - Waiting for query `name=dc_upgrade_4_0_to_4_1 and status=up` and event_id None up to 180 seconds,sampling every 10 second.
2017-06-05 02:13:17,382 - MainThread - root - INFO - Get host object by host_name host_mixed_1. with {'attribute': 'name'}
2017-06-05 02:13:17,535 - MainThread - art.ll_lib.hosts - INFO - Check if host host_mixed_1 has state up
2017-06-05 02:13:17,535 - MainThread - art.ll_lib.hosts - INFO - Host host_mixed_1 has state up
2017-06-05 02:13:17,539 - MainThread - root - INFO - Update cluster cluster_upgrade_4_0_to_4_1 with {'version': '4.1'}
2017-06-05 02:13:17,592 - MainThread - clusters - INFO - Using Correlation-Id: clusters_update_fecde6da-ede0-4ebc
2017-06-05 02:13:17,690 - MainThread - clusters - INFO - cluster was updated
2017-06-05 02:13:17,691 - MainThread - clusters - INFO - Property 'cluster->version->major' has correct value: 4
2017-06-05 02:13:17,692 - MainThread - clusters - INFO - Property 'cluster->version->minor' has correct value: 1
2017-06-05 02:13:17,776 - MainThread - art.ll_lib.dcs - INFO - Update datacenter dc_upgrade_4_0_to_4_1 with {'version': '4.1'}
2017-06-05 02:13:17,777 - MainThread - datacenters - INFO - Using Correlation-Id: datacenters_update_fa1b6ec0-5608-4a27
2017-06-05 02:13:17,866 - MainThread - datacenters - INFO - data_center was updated
2017-06-05 02:13:17,867 - MainThread - datacenters - INFO - Property 'data_center->version->major' has correct value: 4
2017-06-05 02:13:17,867 - MainThread - datacenters - INFO - Property 'data_center->version->minor' has correct value: 1
2017-06-05 02:13:18,111 - MainThread - rhevmtests.storage.storage_qcow2_v3.test_qcow2_v3 - INFO - Checking that upgrade_4_0_to_4_1nfs0 was upgraded: True
2017-06-05 02:13:18,112 - MainThread - art.logging - INFO - ================================================================================
2017-06-05 02:13:18,113 - MainThread - art.logging - INFO - TEARDOWN <TestCaseFunction 'test_verify_new_domain_version'>
2017-06-05 02:13:18,113 - MainThread - pytest.art.matrix - INFO - The storage type reset to None
2017-06-05 02:13:18,490 - MainThread - rhevmtests.storage.helpers - INFO - Checking if domain upgrade_4_0_to_4_1nfs0 is active in dc dc_upgrade_4_0_to_4_1
2017-06-05 02:13:18,490 - MainThread - storagedomains - DEBUG - Skipping duplicate log-messages...
2017-06-05 02:13:18,604 - MainThread - storagedomains - INFO - Domain upgrade_4_0_to_4_1nfs0 in dc dc_upgrade_4_0_to_4_1 is active
2017-06-05 02:13:18,604 - MainThread - storagedomains - INFO - Checking if domain upgrade_4_0_to_4_1nfs0 is active in dc dc_upgrade_4_0_to_4_1
2017-06-05 02:13:18,725 - MainThread - storagedomains - INFO - Domain upgrade_4_0_to_4_1nfs0 in dc dc_upgrade_4_0_to_4_1 is active
2017-06-05 02:13:19,533 - MainThread - test_utils - INFO - All tasks are gone
2017-06-05 02:13:19,658 - MainThread - storagedomains - INFO - Deactivating domain upgrade_4_0_to_4_1nfs0 in data center dc_upgrade_4_0_to_4_1
2017-06-05 02:13:19,659 - MainThread - storagedomains - INFO - Using Correlation-Id: storagedomains_syncAction_84fe63ed-a1b9-4378
2017-06-05 02:18:24,398 - MainThread - storagedomains - INFO - Removing storage domain upgrade_4_0_to_4_1nfs0
2017-06-05 02:18:24,523 - MainThread - storagedomains - INFO - Using Correlation-Id: storagedomains_delete_72f4c775-6cf0-442d
2017-06-05 02:18:25,920 - MainThread - api_utils - ERROR - Failed to delete element NOT as expected:
	Status: 409
	Reason: Conflict
	Detail: [Cannot remove Storage.
 The Storage Domain metadata indicates it is already attached to a Data Center hence cannot be formatted.
 To remove the Storage Domain one should either remove it without the format option or attach it to an existing Data Center, detach it, and try again.]

2017-06-05 02:18:25,920 - MainThread - storagedomains - ERROR - Response code is not valid, expected is: [200, 202, 204], actual is: 409


Automation Test:
rhevmtests.storage.storage_qcow2_v3.test_qcow2_v3.TestCase18307_REST_NFS.test_verify_new_domain_version

Comment 1 Maor 2017-06-08 12:23:11 UTC
It seems that the storage domain deactivation failed because the host was not SPM yet:

From VDSM log:

1) spmStatus is free:
2017-06-05 02:13:05,062+0300 INFO  (jsonrpc/7) [vdsm.api] FINISH getSpmStatus return={'spm_st': {'spmId': -1, 'spmStatus': 'Free', 'spmLver': -1}} from=::ffff:10.35.69.107,39536, flow_id=1320da4b (api:52)

2017-06-05 02:13:05,062+0300 INFO  (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call StoragePool.getSpmStatus succeeded in 0.01 seconds (__init__:539)

2) getAllTasksStatuses fails:
2017-06-05 02:13:06,068+0300 INFO  (jsonrpc/0) [vdsm.api] START getAllTasksStatuses(spUUID=None, options=None) from=::ffff:10.35.69.107,39536, flow_id=1320da4b (api:46)
2017-06-05 02:13:06,068+0300 INFO  (jsonrpc/0) [vdsm.api] FINISH getAllTasksStatuses error=Secured object is not in safe state from=::ffff:10.35.69.107,39536, flow_id=1320da4b (api:50)
2017-06-05 02:13:06,069+0300 ERROR (jsonrpc/0) [storage.TaskManager.Task] (Task='cbe4ee58-90af-4e26-98d7-311f5318cfa0') Unexpected error (task:870)
Traceback (most recent call last):


Is there a check in the test whether the SPM is up after the restart, before trying to deactivate?
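
For context, the SecureError in the traceback above comes from VDSM's securable wrapper: storage-pool methods are guarded so they refuse to run until the pool object is marked as secured, which only happens once the host has (re)acquired the SPM role after the restart. A minimal illustrative sketch of that pattern, assuming securable.py behaves along these lines (not VDSM's actual implementation):

class SecureError(Exception):
    pass


def secured(func):
    """Refuse to run pool methods until the pool has been secured (host is SPM)."""
    def wrapper(self, *args, **kwargs):
        if not getattr(self, "_secured", False):
            raise SecureError("Secured object is not in safe state")
        return func(self, *args, **kwargs)
    return wrapper


class StoragePool(object):
    def __init__(self):
        self._secured = False   # flipped to True only after the SPM start completes

    @secured
    def getAllTasksStatuses(self):
        return {}               # the real method queries the task manager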

Comment 2 Avihai 2017-06-08 13:03:46 UTC
(In reply to Maor from comment #1)
> It seems like the deactivate storage domain failed since the host was not
> SPM yet:

I check that the DC is active before deactivating the SD, and after the upgrade the DC was active, so there is no way the only host in that DC was not SPM at that time (2017-06-05 02:13:18).
2017-06-05 02:13:18,604 - MainThread - storagedomains - INFO - Domain upgrade_4_0_to_4_1nfs0 in dc dc_upgrade_4_0_to_4_1 is active

> From VDSM log:
> 
> 1) spmStatus is free:
> 2017-06-05 02:13:05,062+0300 INFO  (jsonrpc/7) [vdsm.api] FINISH
> getSpmStatus return={'spm_st': {'spmId': -1, 'spmStatus': 'Free', 'spmLver':
> -1}} from=::ffff:10.35.69.107,39536, flow_id=1320da4b (api:52)
> 
> 2017-06-05 02:13:05,062+0300 INFO  (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC
> call StoragePool.getSpmStatus succeeded in 0.01 seconds (__init__:539)

I see spmStatus = 'SPM' both before and after the deactivation time, which is 2017-06-05 02:13:19.

From VDSM log excerpts : 
2017-06-05 02:13:08,539+0300 INFO  (jsonrpc/6) [vdsm.api] FINISH getSpmStatus return={'spm_st': {'spmId': 1, 'spmStatus': 'SPM', 'spmLver': 3L}} from=::ffff:10.35.69.107,39536, flow_id=1320da4b (api:52)
2017-06-05 02:13:21,635+0300 INFO  (jsonrpc/7) [vdsm.api] FINISH getSpmStatus return={'spm_st': {'spmId': 1, 'spmStatus': 'SPM', 'spmLver': 3L}} from=::ffff:10.35.69.107,39536, flow_id=3571470c (api:52)


> 2) getAllTasksStatuses fails:
> 2017-06-05 02:13:06,068+0300 INFO  (jsonrpc/0) [vdsm.api] START
> getAllTasksStatuses(spUUID=None, options=None)
> from=::ffff:10.35.69.107,39536, flow_id=1320da4b (api:46)
> 2017-06-05 02:13:06,068+0300 INFO  (jsonrpc/0) [vdsm.api] FINISH
> getAllTasksStatuses error=Secured object is not in safe state
> from=::ffff:10.35.69.107,39536, flow_id=1320da4b (api:50)
> 2017-06-05 02:13:06,069+0300 ERROR (jsonrpc/0) [storage.TaskManager.Task]
> (Task='cbe4ee58-90af-4e26-98d7-311f5318cfa0') Unexpected error (task:870)
> Traceback (most recent call last):

Also, getAllTasksStatuses succeeded 8 seconds before the deactivation (2017-06-05 02:13:19); from the VDSM log:

2017-06-05 02:13:11,627+0300 INFO  (jsonrpc/3) [vdsm.api] START getAllTasksStatuses(spUUID=None, options=None) from=::ffff:10.35.69.107,39536, flow_id=35ccc7b9 (api:46)
2017-06-05 02:13:11,628+0300 INFO  (jsonrpc/3) [vdsm.api] FINISH getAllTasksStatuses return={'allTasksStatus': {'e9e60f40-06d1-4a2b-b5ec-aaf095d0c8cf': {'code': 0, 'message': u'1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': 'e9e60f40-06d1-4a2b-b5ec-aaf095d0c8cf'}, '648956e3-4150-40b7-881c-d3624389d0dd': {'code': 0, 'message': u'1 jobs completed successfully', 'taskState': 'finished', 'taskResult': 'success', 'taskID': '648956e3-4150-40b7-881c-d3624389d0dd'}}} from=::ffff:10.35.69.107,39536, flow_id=35ccc7b9 (api:52)
2017-06-05 02:13:11,628+0300 INFO  (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call Host.getAllTasksStatuses succeeded in 0.00 seconds (__init__:539)

> Is there a check in the test whether the SPM is up after the restart, before
> trying to deactivate?

As above, I check that the DC is active before deactivating the SD, and after the upgrade the DC was active, so there is no way the only host in that DC was not SPM at that time (2017-06-05 02:13:18); see the SPM log excerpts above showing the host was SPM at that time.

Comment 3 Maor 2017-06-08 14:46:35 UTC
Thanks for the detailed response, Avihai.
I think I found the issue: OvfDataUpdater is being called every minute, so the config value OvfUpdateIntervalInMinutes was probably changed from 60 minutes to 1 minute.

That is why the storage domain failed to deactivate: OVF_STORE update tasks were still running while the deactivation took place.

See the following logs:

2017-06-05 02:13:30,757+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (DefaultQuartzScheduler3) [998764a] SpmStopVDSCommand::Not stopping SPM on vds 'host_mixed_1', pool id '4a4485b7-181b-4ca3-a582-b330fa886500' as there are uncleared tasks

2017-06-05 02:13:30,757+03 ERROR [org.ovirt.engine.core.bll.storage.domain.DeactivateStorageDomainCommand] (DefaultQuartzScheduler3) [998764a] Aborting execution due to failure to stop SPM

Comment 4 Maor 2017-06-08 14:48:20 UTC
Can you please check the value of OvfUpdateIntervalInMinutes? If it is not 60 minutes, please set it back to the default (60 minutes) and see if the error reproduces.
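
For reference, a minimal sketch of checking and resetting the value via engine-config (assumes engine-config is on the PATH of the engine machine, and that ovirt-engine needs a restart for the new value to take effect):

import subprocess

def engine_config(*args):
    """Run engine-config with the given arguments and return its output."""
    return subprocess.run(
        ["engine-config", *args], capture_output=True, text=True, check=True,
    ).stdout

print(engine_config("--get", "OvfUpdateIntervalInMinutes"))   # current value
engine_config("--set", "OvfUpdateIntervalInMinutes=60")       # back to the default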

Comment 5 Avihai 2017-06-08 15:07:42 UTC
(In reply to Maor from comment #4)
> Can you please check what is the value of OvfUpdateIntervalInMinutes

The default was and still is 60 minutes.

engine-config --get OvfUpdateIntervalInMinutes
OvfUpdateIntervalInMinutes: 60 version: general

>if it is not 60 minutes, please try to use the default config value (60 minutes)
> and see if that error reprduces

As we discussed, I tried 10 times with the default and the issue did not reproduce.
That does not mean this is not an issue.

Anyway, even if an update is running every 1 minute (how did you see that?),
that does not mean deactivation of the SD should fail. Shouldn't deactivation either work in every case (OVF update running or not), or wait until the update is done?

What about the unexplained VDSM restart we saw at the time of the deactivation?

Comment 6 Maor 2017-06-08 17:07:07 UTC
(In reply to Avihai from comment #5)
> (In reply to Maor from comment #4)
> > Can you please check what is the value of OvfUpdateIntervalInMinutes
> 
> Default was & still is 60 minutes .
> 
> engine-config --get OvfUpdateIntervalInMinutes
> OvfUpdateIntervalInMinutes: 60 version: general
> 
> >if it is not 60 minutes, please try to use the default config value (60 minutes)
> > and see if that error reprduces
> 
> As we discussed I tried 10 times with the default & the issue did not
> reproduce.
> This does not mean this is not an issue. 

Weird, I do see it running every minute in the logs. Can you try to run this test once more and check when OvfDataUpdater is being called?

> 
> Anyway, even if an update is running every 1min ( how did you see that ? ) 

.....
2017-06-05 02:14:26,354+03 INFO  [org.ovirt.engine.core.bll.storage.ovfstore.OvfDataUpdater] (DefaultQuartzScheduler6) [6c58e4e8] Attempting to update VMs/Templates Ovf.
....
2017-06-05 02:15:26,391+03 INFO  [org.ovirt.engine.core.bll.storage.ovfstore.OvfDataUpdater] (DefaultQuartzScheduler2) [cc9b24e] Attempting to update VMs/Templates Ovf.
....
2017-06-05 02:16:26,435+03 INFO  [org.ovirt.engine.core.bll.storage.ovfstore.OvfDataUpdater] (DefaultQuartzScheduler8) [7c00331a] Attempting to update VMs/Templates Ovf.
....
2017-06-05 02:17:26,463+03 INFO  [org.ovirt.engine.core.bll.storage.ovfstore.OvfDataUpdater] (DefaultQuartzScheduler1) [75df751] Attempting to update VMs/Templates Ovf.
...
2017-06-05 02:18:26,491+03 INFO  [org.ovirt.engine.core.bll.storage.ovfstore.OvfDataUpdater] (DefaultQuartzScheduler9) [2db328be] Attempting to update VMs/Templates Ovf.
...
and so on and so on

> it does not mean that deactivation of SD should fail .. shouldn't
> deactivation work in each case. ( update or not OVF update ) or wait till
> the update is done? 

The engine should validate whether there are any running tasks (the OVF update); if there are, it should fail with a CDA message.
The reason I asked is that if this is a configuration issue, we can probably target this bug to 4.2 for now.

> 
> What about the unexplained VDSM restart we saw in the time of the
> deactivation?

I'm not sure I know what you mean.

Comment 7 Maor 2017-06-08 17:12:01 UTC
(In reply to Maor from comment #6)
> (In reply to Avihai from comment #5)
> > (In reply to Maor from comment #4)
> > > Can you please check what is the value of OvfUpdateIntervalInMinutes
> > 
> > Default was & still is 60 minutes .
> > 
> > engine-config --get OvfUpdateIntervalInMinutes
> > OvfUpdateIntervalInMinutes: 60 version: general
> > 
> > >if it is not 60 minutes, please try to use the default config value (60 minutes)
> > > and see if that error reprduces
> > 
> > As we discussed I tried 10 times with the default & the issue did not
> > reproduce.
> > This does not mean this is not an issue. 
> 
> Weird, I do see it running every minute in the logs, can you try to run this
> test once more and check when OvfDataUpdater is being called?
> 
> > 
> > Anyway, even if an update is running every 1min ( how did you see that ? ) 
> 
> .....
> 2017-06-05 02:14:26,354+03 INFO 
> [org.ovirt.engine.core.bll.storage.ovfstore.OvfDataUpdater]
> (DefaultQuartzScheduler6) [6c58e4e8] Attempting to update VMs/Templates Ovf.
> ....
> 2017-06-05 02:15:26,391+03 INFO 
> [org.ovirt.engine.core.bll.storage.ovfstore.OvfDataUpdater]
> (DefaultQuartzScheduler2) [cc9b24e] Attempting to update VMs/Templates Ovf.
> ....
> 2017-06-05 02:16:26,435+03 INFO 
> [org.ovirt.engine.core.bll.storage.ovfstore.OvfDataUpdater]
> (DefaultQuartzScheduler8) [7c00331a] Attempting to update VMs/Templates Ovf.
> ....
> 2017-06-05 02:17:26,463+03 INFO 
> [org.ovirt.engine.core.bll.storage.ovfstore.OvfDataUpdater]
> (DefaultQuartzScheduler1) [75df751] Attempting to update VMs/Templates Ovf.
> ...
> 2017-06-05 02:18:26,491+03 INFO 
> [org.ovirt.engine.core.bll.storage.ovfstore.OvfDataUpdater]
> (DefaultQuartzScheduler9) [2db328be] Attempting to update VMs/Templates Ovf.
> ...
> and so on and so on
> 
> > it does not mean that deactivation of SD should fail .. shouldn't
> > deactivation work in each case. ( update or not OVF update ) or wait till
> > the update is done? 
> 
> There should be a validation of engine whether there are any running tasks
> (The OVF update), if there are then it should fail with CDA message.
> The reason I asked is that if that is a configuration issue that probably we
> can target this bug to 4.2 for now.
> 
> > 
> > What about the unexplained VDSM restart we saw in the time of the
> > deactivation?
> 
> Not sure that I know what you mean?

Could you please try to run this test one more time? I would like to check when and how many times OvfDataUpdater is being called.
It seems it is called every minute, although you say that the config value is 60 minutes.
Is it possible that the config value was changed at some point?
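
If re-running is not possible, the actual interval can also be measured from the engine.log of that run; a small illustrative sketch (the log path is an assumption):

import datetime

timestamps = []
with open("/var/log/ovirt-engine/engine.log") as log:   # path is an assumption
    for line in log:
        if "OvfDataUpdater" in line and "Attempting to update" in line:
            # lines start with e.g. "2017-06-05 02:14:26,354+03 INFO ..."
            stamp = " ".join(line.split()[:2]).split(",")[0]
            timestamps.append(datetime.datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"))

for earlier, later in zip(timestamps, timestamps[1:]):
    print(later, "- interval since previous run:", later - earlier)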

Comment 8 Maor 2017-06-08 17:17:39 UTC
> > 
> > > it does not mean that deactivation of SD should fail .. shouldn't
> > > deactivation work in each case. ( update or not OVF update ) or wait till
> > > the update is done? 
> > 
> > There should be a validation of engine whether there are any running tasks
> > (The OVF update), if there are then it should fail with CDA message.
> > The reason I asked is that if that is a configuration issue that probably we
> > can target this bug to 4.2 for now.
> > 

BTW, regarding the validation in the engine: it could be that the OVF update process was called after the deactivate command's validation and right before stopSPM was called.

Comment 9 Avihai 2017-06-09 13:49:43 UTC
(In reply to Maor from comment #7)
> (In reply to Maor from comment #6)
> > (In reply to Avihai from comment #5)
> > > (In reply to Maor from comment #4)
> > > > Can you please check what is the value of OvfUpdateIntervalInMinutes
> > > 
> > > Default was & still is 60 minutes .
> > > 
> > > engine-config --get OvfUpdateIntervalInMinutes
> > > OvfUpdateIntervalInMinutes: 60 version: general
> > > 
> > > >if it is not 60 minutes, please try to use the default config value (60 minutes)
> > > > and see if that error reprduces
> > > 
> > > As we discussed I tried 10 times with the default & the issue did not
> > > reproduce.
> > > This does not mean this is not an issue. 
> > 
> > Weird, I do see it running every minute in the logs, can you try to run this
> > test once more and check when OvfDataUpdater is being called?
> > 
> > > 
> > > Anyway, even if an update is running every 1min ( how did you see that ? ) 
> > 
> > .....
> > 2017-06-05 02:14:26,354+03 INFO 
> > [org.ovirt.engine.core.bll.storage.ovfstore.OvfDataUpdater]
> > (DefaultQuartzScheduler6) [6c58e4e8] Attempting to update VMs/Templates Ovf.
> > ....
> > 2017-06-05 02:15:26,391+03 INFO 
> > [org.ovirt.engine.core.bll.storage.ovfstore.OvfDataUpdater]
> > (DefaultQuartzScheduler2) [cc9b24e] Attempting to update VMs/Templates Ovf.
> > ....
> > 2017-06-05 02:16:26,435+03 INFO 
> > [org.ovirt.engine.core.bll.storage.ovfstore.OvfDataUpdater]
> > (DefaultQuartzScheduler8) [7c00331a] Attempting to update VMs/Templates Ovf.
> > ....
> > 2017-06-05 02:17:26,463+03 INFO 
> > [org.ovirt.engine.core.bll.storage.ovfstore.OvfDataUpdater]
> > (DefaultQuartzScheduler1) [75df751] Attempting to update VMs/Templates Ovf.
> > ...
> > 2017-06-05 02:18:26,491+03 INFO 
> > [org.ovirt.engine.core.bll.storage.ovfstore.OvfDataUpdater]
> > (DefaultQuartzScheduler9) [2db328be] Attempting to update VMs/Templates Ovf.
> > ...
> > and so on and so on
> > 
> > > it does not mean that deactivation of SD should fail .. shouldn't
> > > deactivation work in each case. ( update or not OVF update ) or wait till
> > > the update is done? 
> > 
> > There should be a validation of engine whether there are any running tasks
> > (The OVF update), if there are then it should fail with CDA message.
> > The reason I asked is that if that is a configuration issue that probably we
> > can target this bug to 4.2 for now.

This was not run on my local environment but on Nelly's tierX automation environment, which is now already running other builds and configuration.

I'll see if I can dig up some of the configuration from that specific run.
> > > 
> > > What about the unexplained VDSM restart we saw in the time of the
> > > deactivation?
> > 
> > Not sure that I know what you mean?

I mean VDSM was killed ONLY ONCE, at 2017-06-05 02:12:26, while in the VDSM log 3 VDSM restarts appear:

VDSM log:
Expected after kill :
1) 2017-06-05 02:12:33,104+0300 INFO  (MainThread) [vds] (PID: 61977) I am the actual vdsm 4.19.17-1.el7ev lynx12.qa.lab.tlv.redhat.com (3.10.0-514.21.1.el7.x86_64) (vdsm:145)

Not expected :
2) 2017-06-05 02:18:41,361+0300 INFO  (MainThread) [vds] (PID: 62794) I am the actual vdsm 4.19.17-1.el7ev lynx12.qa.lab.tlv.redhat.com (3.10.0-514.21.1.el7.x86_64) (vdsm:145)

Not expected:
3) 2017-06-05 02:21:01,646+0300 INFO  (MainThread) [vds] (PID: 63539) I am the actual vdsm 4.19.17-1.el7ev lynx12.qa.lab.tlv.redhat.com (3.10.0-514.21.1.el7.x86_64) (vdsm:145)

Automation log:
2017-06-05 02:12:26,204 - MainThread - art.ll_lib.hosts - INFO - kill VDSM pid: 10813 on host 10.35.66.4
2017-06-05 02:12:26,205 - MainThread - VDS - INFO - [10.35.66.4] Executing command kill -9 10813
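
For reference, the restarts can be listed straight from vdsm.log; an illustrative sketch (the log path is an assumption):

import re

# Every VDSM startup logs a banner like:
# "... INFO  (MainThread) [vds] (PID: 61977) I am the actual vdsm 4.19.17-1.el7ev ..."
BANNER = re.compile(r"^(\S+ \S+) INFO\s+\(MainThread\) \[vds\] \(PID: (\d+)\) I am the actual vdsm")

with open("/var/log/vdsm/vdsm.log") as log:   # path is an assumption
    for line in log:
        match = BANNER.search(line)
        if match:
            print("vdsm started at", match.group(1), "with pid", match.group(2))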


> 
> could you please try to run this test one more time, I would like to check
> when and how many time OvfDataUpdater is being called?
> It seems it is called every minute, although you say that the config value
> is 60 minutes.
> Is it possible that the config value was changed at some point?

I checked the engine_db.sql file from that run and indeed it looks like it was changed to 1 minute (engine_db.sql file attached):

engine_db.sql
230	OvfUpdateIntervalInMinutes	1	general

Again, this was not run on my local environment but on Nelly's tierX automation environment, which is now already running other builds and configuration.
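
For reference, the value can be confirmed straight from the attached dump; a trivial sketch (assumes the dump keeps the tab-separated rows in the format shown above):

# Find the OvfUpdateIntervalInMinutes row in the engine DB dump.
with open("engine_db.sql") as dump:
    for line in dump:
        if "OvfUpdateIntervalInMinutes" in line:
            print(line.rstrip())   # expect e.g. "230    OvfUpdateIntervalInMinutes    1    general"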

Comment 10 Maor 2017-06-19 18:46:35 UTC
....
> > > > 
> > > > What about the unexplained VDSM restart we saw in the time of the
> > > > deactivation?
> > > 
> > > Not sure that I know what you mean?
> 
> I mean vdsm was killed ONLY ONCE at 2017-06-05 02:12:26 , while in vdsm log
> 3 vdsm restarts appear:
> 
> VDSM log:
> Expected after kill :
> 1) 2017-06-05 02:12:33,104+0300 INFO  (MainThread) [vds] (PID: 61977) I am
> the actual vdsm 4.19.17-1.el7ev lynx12.qa.lab.tlv.redhat.com
> (3.10.0-514.21.1.el7.x86_64) (vdsm:145)
> 
> Not expected :
> 2) 2017-06-05 02:18:41,361+0300 INFO  (MainThread) [vds] (PID: 62794) I am
> the actual vdsm 4.19.17-1.el7ev lynx12.qa.lab.tlv.redhat.com
> (3.10.0-514.21.1.el7.x86_64) (vdsm:145)
> 
> Not expected:
> 3) 2017-06-05 02:21:01,646+0300 INFO  (MainThread) [vds] (PID: 63539) I am
> the actual vdsm 4.19.17-1.el7ev lynx12.qa.lab.tlv.redhat.com
> (3.10.0-514.21.1.el7.x86_64) (vdsm:145)
> 

It might be sanlock that caused this restart.
Does it reproduce consistently on your env?
If so, I suggest opening a separate bug for that.

> 
> I checked in the engine_db.sql file from that run and indeed it looks it was
> changed to 1 minute (attached engine_db.sql file ):
> 
> engine_db.sql
> 230	OvfUpdateIntervalInMinutes	1	general
> 
> This was not run on my locally stands but on Nelly's tierX automation stands
> which now is already running other builds & configuration.

Thanks for the clarification. Basically the validation seems valid, since when there are running tasks the storage domain should fail to be deactivated.
Please let me know if it still reproduces when running with the default value.
Once we know the tests pass, I think this bug can be closed.

Comment 11 Avihai 2017-06-20 05:34:02 UTC
(In reply to Maor from comment #10)
> ....
> > > > > 
> > > > > What about the unexplained VDSM restart we saw in the time of the
> > > > > deactivation?
> > > > 
> > > > Not sure that I know what you mean?
> > 
> > I mean vdsm was killed ONLY ONCE at 2017-06-05 02:12:26 , while in vdsm log
> > 3 vdsm restarts appear:
> > 
> > VDSM log:
> > Expected after kill :
> > 1) 2017-06-05 02:12:33,104+0300 INFO  (MainThread) [vds] (PID: 61977) I am
> > the actual vdsm 4.19.17-1.el7ev lynx12.qa.lab.tlv.redhat.com
> > (3.10.0-514.21.1.el7.x86_64) (vdsm:145)
> > 
> > Not expected :
> > 2) 2017-06-05 02:18:41,361+0300 INFO  (MainThread) [vds] (PID: 62794) I am
> > the actual vdsm 4.19.17-1.el7ev lynx12.qa.lab.tlv.redhat.com
> > (3.10.0-514.21.1.el7.x86_64) (vdsm:145)
> > 
> > Not expected:
> > 3) 2017-06-05 02:21:01,646+0300 INFO  (MainThread) [vds] (PID: 63539) I am
> > the actual vdsm 4.19.17-1.el7ev lynx12.qa.lab.tlv.redhat.com
> > (3.10.0-514.21.1.el7.x86_64) (vdsm:145)
> > 
> 
> It might be sanlock that caused this restart.
> Does it reproduces consistently on your env?
> If so, I suggest to open a separate bug on that.

No, it does not; so far it has only occurred once, but it is still there.
I opened Bug 1463075 on this issue. Please try to work with the current logs; if they are not enough, I will try to reproduce locally with an OVF update interval of 1 minute.

> > I checked in the engine_db.sql file from that run and indeed it looks it was
> > changed to 1 minute (attached engine_db.sql file ):
> > 
> > engine_db.sql
> > 230	OvfUpdateIntervalInMinutes	1	general
> > 
> > This was not run on my locally stands but on Nelly's tierX automation stands
> > which now is already running other builds & configuration.
> 
> Thanks for the clarification, basically the validation seems valid, since
> when there are running tasks the storage domain should fail to be
> deactivated.
> please let me know if that still reproduces when running with the default
> value.
> Once we know the tests pass I think this bug can be closed.

So far this does not reproduce with the default value of 60 minutes, so you can close this bug for now.

Comment 12 Maor 2017-06-20 07:50:37 UTC
Closing this bug based on earlier comments

Comment 13 Maor 2017-06-20 07:58:56 UTC
(In reply to Avihai from comment #11)
> (In reply to Maor from comment #10)
> > ....
> > > > > > 
> > > > > > What about the unexplained VDSM restart we saw in the time of the
> > > > > > deactivation?
> > > > > 
> > > > > Not sure that I know what you mean?
> > > 
> > > I mean vdsm was killed ONLY ONCE at 2017-06-05 02:12:26 , while in vdsm log
> > > 3 vdsm restarts appear:
> > > 
> > > VDSM log:
> > > Expected after kill :
> > > 1) 2017-06-05 02:12:33,104+0300 INFO  (MainThread) [vds] (PID: 61977) I am
> > > the actual vdsm 4.19.17-1.el7ev lynx12.qa.lab.tlv.redhat.com
> > > (3.10.0-514.21.1.el7.x86_64) (vdsm:145)
> > > 
> > > Not expected :
> > > 2) 2017-06-05 02:18:41,361+0300 INFO  (MainThread) [vds] (PID: 62794) I am
> > > the actual vdsm 4.19.17-1.el7ev lynx12.qa.lab.tlv.redhat.com
> > > (3.10.0-514.21.1.el7.x86_64) (vdsm:145)
> > > 
> > > Not expected:
> > > 3) 2017-06-05 02:21:01,646+0300 INFO  (MainThread) [vds] (PID: 63539) I am
> > > the actual vdsm 4.19.17-1.el7ev lynx12.qa.lab.tlv.redhat.com
> > > (3.10.0-514.21.1.el7.x86_64) (vdsm:145)
> > > 
> > 
> > It might be sanlock that caused this restart.
> > Does it reproduces consistently on your env?
> > If so, I suggest to open a separate bug on that.
> 
> No its does not, so far only occurred once but still its there.
> I opened Bug 1463075 on this issue , please try to work with current logs ,
> if this is not enough I will try to reproduce locally with OVF update of
> 1min .
> 

It is probably better to try to reproduce this again to see whether it is consistent.
If it only happened once under the current circumstances, it sounds like a corner case which I'm not sure will be worth the effort.