Bug 844438 - ovirt-engine-backend: Auto-Recovery should check whether getVdsStats returns 'lastCheck<60' before it proclaims host as up
Summary: ovirt-engine-backend: Auto-Recovery should check whether getVdsStats returns ...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.1.0
Hardware: x86_64
OS: Linux
high
high
Target Milestone: ---
: ---
Assignee: mkublin
QA Contact: Omri Hochman
URL:
Whiteboard: infra
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2012-07-30 15:49 UTC by Omri Hochman
Modified: 2016-02-10 19:36 UTC (History)
7 users (show)

Fixed In Version: si19
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-12-04 20:06:52 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
engine.log (51.28 KB, application/octet-stream)
2012-07-30 15:50 UTC, Omri Hochman
no flags Details
new_engine_log (278.44 KB, application/x-zip-compressed)
2012-11-04 16:19 UTC, Omri Hochman
no flags Details

Description Omri Hochman 2012-07-30 15:49:16 UTC
ovirt-engine-backend: Auto-Recovery should check whether getVdsStats returns 'lastCheck<60' before it starts to recover the host.

Description:
***************
(50 iSCSI  SD's environment)  rhevm-engine sends Host to 'Non-Operational' when getVdsStats returns value for 'lastCheck > 60' for one of the SD's. Auto-Recovery that runs every 5 minutes does not check that same criteria (lastCheck>60) and just sends ActivateVdsCommand, which succeeds. 

Results: 
*********
Since the criteria that caused engine to set host to 'Non-Operational' is different from  the criteria that caused auto-recovery to set the host back to 'Up' --> the host will switch to 'Up' (for few minutes)  and switched back to 'Non-Operational'   

vdsClient -s 0 getVdsStats (lastCheck': '72279.2):
***************************************************
st'0.0259990692139', 'lastCheck': '11.9', 'code': 0, 'valid': True}, '4c84bb36-96fa-4b73-89e1-90f81c394483': {'delay': '0.0061411857605', 'lastCheck': '72496.9', 'code': 0, 'valid': True}, '768b1fcd-e79a-42d5-a85e-4f794a20d67f': {'delay': '0.168473005295', 'lastCheck': '8.7', 'code': 0, 'valid': True}, 'd23aa631-5a64-43d2-9dbe-ff83dc16be39': {'delay': '0.791082143784', 'lastCheck': '72279.2', 'code': 0, 'valid': True},


Engine.log:
***********
2012-07-30 16:35:55,250 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-51) [65234c09] Domain f02a705c-aaaa-43cc-9b0c-3730358d2
8c8 has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.
2012-07-30 16:40:00,001 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-2) [35b9bc67] Checking autorecoverable hosts
2012-07-30 16:40:00,018 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-2) [35b9bc67] Autorecovering 3 hosts
2012-07-30 16:40:00,019 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-2) [35b9bc67] Autorecovering hosts: 8a9ab47c-d97e-11e1-ad7a-43fef5f1
b388
2012-07-30 16:40:00,029 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (QuartzScheduler_Worker-2) [38500a41] Lock Acquired to object EngineLock [exclusiveLocks= key
: 8a9ab47c-d97e-11e1-ad7a-43fef5f1b388 value: VDS
, sharedLocks= ]
2012-07-30 16:40:00,029 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (QuartzScheduler_Worker-2) [38500a41] Running command: ActivateVdsCommand internal: true. Ent
ities affected :  ID: 8a9ab47c-d97e-11e1-ad7a-43fef5f1b388 Type: VDS
2012-07-30 16:40:00,037 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-2) [38500a41] START, SetVdsStatusVDSCommand(vdsId = 8a9ab47
c-d97e-11e1-ad7a-43fef5f1b388, status=Unassigned, nonOperationalReason=NONE), log id: 6f619968
2012-07-30 16:40:00,058 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-2) [38500a41] FINISH, SetVdsStatusVDSCommand, log id: 6f619
968
2012-07-30 16:40:00,076 INFO  [org.ovirt.engine.core.vdsbroker.ActivateVdsVDSCommand] (QuartzScheduler_Worker-2) [38500a41] START, ActivateVdsVDSCommand(vdsId = 8a9ab47c-
d97e-11e1-ad7a-43fef5f1b388), log id: 274e8c8e
2012-07-30 16:40:00,238 INFO  [org.ovirt.engine.core.bll.HandleVdsCpuFlagsOrClusterChangedCommand] (QuartzScheduler_Worker-2) [46911aa8] Running command: HandleVdsCpuFlag
sOrClusterChangedCommand internal: true. Entities affected :  ID: 8a9ab47c-d97e-11e1-ad7a-43fef5f1b388 Type: VDS
2012-07-30 16:40:00,249 INFO  [org.ovirt.engine.core.bll.HandleVdsVersionCommand] (QuartzScheduler_Worker-2) [6ba4f799] Running command: HandleVdsVersionCommand internal:
 true. Entities affected :  ID: 8a9ab47c-d97e-11e1-ad7a-43fef5f1b388 Type: VDS
2012-07-30 16:40:00,252 INFO  [org.ovirt.engine.core.vdsbroker.ActivateVdsVDSCommand] (QuartzScheduler_Worker-2) [6ba4f799] FINISH, ActivateVdsVDSCommand, log id: 274e8c8
e
2012-07-30 16:40:00,309 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (QuartzScheduler_Worker-2) [6ba4f799] Lock freed to object EngineLock [exclusiveLocks= key: 8
a9ab47c-d97e-11e1-ad7a-43fef5f1b388 value: VDS
, sharedLocks= ]
2012-07-30 16:40:00,311 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-2) [6ba4f799] Autorecovering hosts: 8d11dd3e-d97e-11e1-a92e-831f3c8a
7b9d
2012-07-30 16:40:00,320 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (QuartzScheduler_Worker-2) [3fdba36b] Lock Acquired to object EngineLock [exclusiveLocks= key
: 8d11dd3e-d97e-11e1-a92e-831f3c8a7b9d value: VDS
, sharedLocks= ]
2012-07-30 16:40:00,320 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (QuartzScheduler_Worker-2) [3fdba36b] Running command: ActivateVdsCommand internal: true. Entities affected :  ID: 8d11dd3e-d97e-11e1-a92e-831f3c8a7b9d Type: VDS
2012-07-30 16:40:00,326 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-2) [3fdba36b] START, SetVdsStatusVDSCommand(vdsId = 8d11dd3e-d97e-11e1-a92e-831f3c8a7b9d, status=Unassigned, nonOperationalReason=NONE), log id: 6ef236b5
2012-07-30 16:40:00,347 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-2) [3fdba36b] FINISH, SetVdsStatusVDSCommand, log id: 6ef236b5
2012-07-30 16:40:00,359 INFO  [org.ovirt.engine.core.vdsbroker.ActivateVdsVDSCommand] (QuartzScheduler_Worker-2) [3fdba36b] START, ActivateVdsVDSCommand(vdsId = 8d11dd3e-d97e-11e1-a92e-831f3c8a7b9d), log id: d4593cc
2012-07-30 16:40:00,537 INFO  [org.ovirt.engine.core.bll.HandleVdsCpuFlagsOrClusterChangedCommand] (QuartzScheduler_Worker-2) [2f4b15c9] Running command: HandleVdsCpuFlagsOrClusterChangedCommand internal: true. Entities affected :  ID: 8d11dd3e-d97e-11e1-a92e-831f3c8a7b9d Type: VDS
2012-07-30 16:40:00,549 INFO  [org.ovirt.engine.core.bll.HandleVdsVersionCommand] (QuartzScheduler_Worker-2) [6fe99d5e] Running command: HandleVdsVersionCommand internal: true. Entities affected :  ID: 8d11dd3e-d97e-11e1-a92e-831f3c8a7b9d Type: VDS
2012-07-30 16:40:00,552 INFO  [org.ovirt.engine.core.vdsbroker.ActivateVdsVDSCommand] (QuartzScheduler_Worker-2) [6fe99d5e] FINISH, ActivateVdsVDSCommand, log id: d4593cc
2012-07-30 16:40:00,590 INFO  [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (QuartzScheduler_Worker-90) [294fa788] Running command: InitVdsOnUpCommand internal: true.
2012-07-30 16:40:00,610 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (QuartzScheduler_Worker-2) [6fe99d5e] Lock freed to object EngineLock [exclusiveLocks= key: 8d11dd3e-d97e-11e1-a92e-831f3c8a7b9d value: VDS
, sharedLocks= ]
2012-07-30 16:40:00,611 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-2) [6fe99d5e] Autorecovering hosts: 8dc369be-d97e-11e1-8f21-cffe7252f68f
2012-07-30 16:40:00,614 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (QuartzScheduler_Worker-2) [114af736] Lock Acquired to object EngineLock [exclusiveLocks= key: 8dc369be-d97e-11e1-8f21-cffe7252f68f value: VDS
, sharedLocks= ]
2012-07-30 16:40:00,614 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (QuartzScheduler_Worker-2) [114af736] Running command: ActivateVdsCommand internal: true. Entities affected :  ID: 8dc369be-d97e-11e1-8f21-cffe7252f68f Type: VDS
2012-07-30 16:40:00,621 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-2) [114af736] START, SetVdsStatusVDSCommand(vdsId = 8dc369be-d97e-11e1-8f21-cffe7252f68f, status=Unassigned, nonOperationalReason=NONE), log id: 4152639a
2012-07-30 16:40:00,628 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-2) [114af736] FINISH, SetVdsStatusVDSCommand, log id: 4152639a
2012-07-30 16:40:00,635 INFO  [org.ovirt.engine.core.vdsbroker.ActivateVdsVDSCommand] (QuartzScheduler_Worker-2) [114af736] START, ActivateVdsVDSCommand(vdsId = 8dc369be-d97e-11e1-8f21-cffe7252f68f), log id: 4ddb72a2
2012-07-30 16:40:00,807 INFO  [org.ovirt.engine.core.bll.HandleVdsCpuFlagsOrClusterChangedCommand] (QuartzScheduler_Worker-2) [4a4eb9b7] Running command: HandleVdsCpuFlagsOrClusterChangedCommand internal: true. Entities affected :  ID: 8dc369be-d97e-11e1-8f21-cffe7252f68f Type: VDS
2012-07-30 16:40:00,812 INFO  [org.ovirt.engine.core.bll.HandleVdsVersionCommand] (QuartzScheduler_Worker-2) [64452010] Running command: HandleVdsVersionCommand internal: true. Entities affected :  ID: 8dc369be-d97e-11e1-8f21-cffe7252f68f Type: VDS
2012-07-30 16:40:00,815 INFO  [org.ovirt.engine.core.vdsbroker.ActivateVdsVDSCommand] (QuartzScheduler_Worker-2) [64452010] FINISH, ActivateVdsVDSCommand, log id: 4ddb72a2
2012-07-30 16:40:00,843 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ValidateStorageServerConnectionVDSCommand] (QuartzScheduler_Worker-90) [47fb7571] START, ValidateStorageServerConnectionVDSCommand(vdsId = 8a9ab47c-d97e-11e1-ad7a-43fef5f1b388, storagePoolId = 4f5ab747-38da-4693-ac0f-5531ce70945d, storageType = ISCSI, connectionList = [{ id: 235bd7e9-1d28-4dd7-b298-b40c2afff190, connection: 10.35.160.107 };]), log id: 4446cf6f
2012-07-30 16:40:00,858 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ValidateStorageServerConnectionVDSCommand] (QuartzScheduler_Worker-90) [47fb7571] FINISH, ValidateStorageServerConnectionVDSCommand, return: {235bd7e9-1d28-4dd7-b298-b40c2afff190=0}, log id: 4446cf6f
2012-07-30 16:40:00,858 INFO  [org.ovirt.engine.core.bll.storage.ConnectHostToStoragePoolServersCommand] (QuartzScheduler_Worker-90) [47fb7571] Running command: ConnectHostToStoragePoolServersCommand internal: true. Entities affected :  ID: 4f5ab747-38da-4693-ac0f-5531ce70945d Type: StoragePool
2012-07-30 16:40:00,868 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (QuartzScheduler_Worker-90) [47fb7571] START, ConnectStorageServerVDSCommand(vdsId = 8a9ab47c-d97e-11e1-ad7a-43fef5f1b388, storagePoolId = 4f5ab747-38da-4693-ac0f-5531ce70945d, storageType = ISCSI, connectionList = [{ id: 235bd7e9-1d28-4dd7-b298-b40c2afff190, connection: 10.35.160.107 };]), log id: 1ed2d5cb
2012-07-30 16:40:00,869 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (QuartzScheduler_Worker-2) [64452010] Lock freed to object EngineLock [exclusiveLocks= key: 8dc369be-d97e-11e1-8f21-cffe7252f68f value: VDS
, sharedLocks= ]
2012-07-30 16:40:00,870 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-2) [64452010] Checking autorecoverable hosts done
2012-07-30 16:40:00,870 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-2) [64452010] Checking autorecoverable storage domains
2012-07-30 16:40:00,871 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-2) [64452010] Autorecovering 0 storage domains
2012-07-30 16:40:00,871 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-2) [64452010] Checking autorecoverable storage domains done
2012-07-30 16:40:02,347 INFO  [org.ovirt.engine.core.bll.DbUserCacheManager] (QuartzScheduler_Worker-87) [28a20b91] DbUserCacheManager::refreshAllUserData() - entered
2012-07-30 16:40:07,813 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (QuartzScheduler_Worker-90) [47fb7571] FINISH, ConnectStorageServerVDSCommand, return: {235bd7e9-1d28-4dd7-b298-b40c2afff190=0}, log id: 1ed2d5cb
2012-07-30 16:40:07,813 INFO  [org.ovirt.engine.core.bll.storage.ConnectHostToStoragePoolServersCommand] (QuartzScheduler_Worker-90) [47fb7571] Host puma05 storage connection was succeeded 
2012-07-30 16:40:07,894 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (QuartzScheduler_Worker-90) [47fb7571] START, ConnectStoragePoolVDSCommand(vdsId = 8a9ab47c-d97e-11e1-ad7a-43fef5f1b388, storagePoolId = 4f5ab747-38da-4693-ac0f-5531ce70945d, vds_spm_id = 2, masterDomainId = a4b36945-3f41-4f2a-871b-dd70d7ca7ead, masterVersion = 1), log id: 748f7fbf
2012-07-30 16:40:10,683 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-98) [7529e477] starting ProcessDomainRecovery for domain b7491d9b-a0fb-4323-b787-b782e4695872
2012-07-30 16:40:10,713 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-98) [7529e477] vds puma05 reported domain b7491d9b-a0-4323-b787-b782e4695872:3600601601282300032f94e5b78cbe111 as in problem, moving the vds to status NonOperational
2012-07-30 16:40:10,730 WARN  [org.ovirt.engine.core.dal.job.ExecutionMessageDirector] (QuartzScheduler_Worker-98) [d7f9295] The message key SetNonOperationalVds is missing from bundles/ExecutionMessages
2012-07-30 16:40:10,769 INFO  [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (QuartzScheduler_Worker-98) [d7f9295] Running command: SetNonOperationalVdsCommand internal: true. Entities affected :  ID: 8a9ab47c-d97e-11e1-ad7a-43fef5f1b388 Type: VDS
2012-07-30 16:40:10,769 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-98) [d7f9295] START, SetVdsStatusVDSCommand(vdsId = 8a9ab47c-d97e-11e1-ad7a-43fef5f1b388, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 7621f900

Comment 1 Omri Hochman 2012-07-30 15:50:27 UTC
Created attachment 601303 [details]
engine.log

Comment 2 Yair Zaslavsky 2012-08-30 08:46:06 UTC
I talked with Michael Kublin about this,
Seems like at current time frame, solving this on existing auto recovery code is problematic.
Suggesting this to rhevm-future.

Comment 3 mkublin 2012-09-11 08:55:11 UTC
I can not decide during initVdsOnUp if host is ok or not or problem is in storage.
At case described  in bug the problem was in vdsm so it looks easy, but if a problem in storage - I just can not distinct between two cases.

Comment 4 mkublin 2012-09-19 13:54:03 UTC
After discussion with Ayal and Barak, the following solution will be implement:
I will add getVdsStats to InitVdsOnUp and will not move vdsm to Up if it has one of the domain in problem.

Problem: if we have some domain in problem which is actually in problem but meanwhile was not moved to status InActive we can not move a vdsm to status UP.
Holpfully will be moved to status up by auto recovery when domain will switch it status to InActive

Comment 5 mkublin 2012-09-23 10:22:01 UTC
http://gerrit.ovirt.org/#/c/8131/

Comment 6 mkublin 2012-09-24 11:42:37 UTC
https://gerrit.eng.lab.tlv.redhat.com/#/c/2262
Downstream

Comment 7 Omri Hochman 2012-11-04 16:17:01 UTC
Still reproduces - RHEVM Build SI23 :  

Scenario: 
********** 
Environment: multiple Hosts against one Storage machine.
 
A) block connection between one HSM host to Storage Machine :  'iptables -A INPUT -s <Storage Machine IP> -j DROP'

B) watch -d 'vdsClient -s 0 getVdsStats |grep 'lastCheck'' 
storageDomains = {'2376bde4-0b67-4e65-9acf-dc9c19a32d13': {'delay': '0', 'lastCheck': '215.3', 'code': 358, 'valid': False}, '4d9f133d-fdd7-4839-9bb6-e766db74f517
': {'delay': '0', 'lastCheck': '216.0', '

Results:  
*********
1) After about 6 minutes lastCheck = '350' the host will switch to non-operational. 
2) about 5 minutes later hostAutoRecovery will attemt to activate the host 
althogh it not connected to the storage .
3) Event message will display 
  a)"Detected new Host puma35. Host state was set to Up."
  b)"Failed to connect Host puma35 to Storage Pool iscsi_dc"
  c)"Host puma35 cannot access one of the Storage Domains attached to the Data Center iscsi_dc. Setting Host state to Non-Operational."
4) Host status will be change to 'Unasigned' then switch back to 'Non-Operational'.
5) Steps will repeat every 5 minutes. 	
	

Engine.log (attaching full engine.log):
************************************* 
2012-11-04 15:30:00,000 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-21) [3fb08a50] Checking autorecoverable hosts
2012-11-04 15:30:00,013 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-21) [3fb08a50] Autorecovering 1 hosts
2012-11-04 15:30:00,013 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-21) [3fb08a50] Autorecovering hosts: 8107e470-2663-11e2-9efd-441ea17
336ee
2012-11-04 15:30:00,027 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (QuartzScheduler_Worker-21) [3c804998] Lock Acquired to object EngineLock [exclusiveLocks= ke
y: 8107e470-2663-11e2-9efd-441ea17336ee value: VDS
, sharedLocks= ]
2012-11-04 15:30:00,028 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (QuartzScheduler_Worker-21) [3c804998] Running command: ActivateVdsCommand internal: true. En
tities affected :  ID: 8107e470-2663-11e2-9efd-441ea17336ee Type: VDS
2012-11-04 15:30:00,043 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-21) [3c804998] START, SetVdsStatusVDSCommand(HostName = pum
a35, HostId = 8107e470-2663-11e2-9efd-441ea17336ee, status=Unassigned, nonOperationalReason=NONE), log id: 2dfee627
2012-11-04 15:30:00,057 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-21) [3c804998] FINISH, SetVdsStatusVDSCommand, log id: 2dfe
e627
2012-11-04 15:30:00,086 INFO  [org.ovirt.engine.core.vdsbroker.ActivateVdsVDSCommand] (QuartzScheduler_Worker-21) [3c804998] START, ActivateVdsVDSCommand(HostName = puma3
5, HostId = 8107e470-2663-11e2-9efd-441ea17336ee), log id: 7bf823c9
2012-11-04 15:30:00,337 INFO  [org.ovirt.engine.core.bll.HandleVdsCpuFlagsOrClusterChangedCommand] (QuartzScheduler_Worker-21) [42811bd4] Running command: HandleVdsCpuFla
gsOrClusterChangedCommand internal: true. Entities affected :  ID: 8107e470-2663-11e2-9efd-441ea17336ee Type: VDS
2012-11-04 15:30:00,351 INFO  [org.ovirt.engine.core.bll.HandleVdsVersionCommand] (QuartzScheduler_Worker-21) [54f5c5c5] Running command: HandleVdsVersionCommand internal
: true. Entities affected :  ID: 8107e470-2663-11e2-9efd-441ea17336ee Type: VDS
2012-11-04 15:30:00,353 INFO  [org.ovirt.engine.core.vdsbroker.ActivateVdsVDSCommand] (QuartzScheduler_Worker-21) [54f5c5c5] FINISH, ActivateVdsVDSCommand, log id: 7bf823
c9
2012-11-04 15:30:00,890 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (QuartzScheduler_Worker-21) [54f5c5c5] Lock freed to object EngineLock [exclusiveLocks= key: 
8107e470-2663-11e2-9efd-441ea17336ee value: VDS
, sharedLocks= ]
2012-11-04 15:30:00,891 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-21) [54f5c5c5] Checking autorecoverable hosts done
2012-11-04 15:30:00,892 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-21) [54f5c5c5] Autorecovering storage domains is disabled, skipping
2012-11-04 15:30:01,166 INFO  [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (QuartzScheduler_Worker-76) [5b35229d] Running command: InitVdsOnUpCommand internal: true.
2012-11-04 15:30:01,503 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ValidateStorageServerConnectionVDSCommand] (QuartzScheduler_Worker-76) [53942145] START, ValidateStorageServerConnectionVDSCommand(HostName = puma35, HostId = 8107e470-2663-11e2-9efd-441ea17336ee, storagePoolId = 020d9b34-265d-11e2-8865-441ea17336ee, storageType = ISCSI, connectionList = [{ id: 93e1cadd-7f7d-4a7e-9216-baa62053352e, connection: 10.35.160.107, iqn: iqn.1992-04.com.emc:cx.ckm00121000438.b7, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };]), log id: 415b13cd
2012-11-04 15:30:01,524 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ValidateStorageServerConnectionVDSCommand] (QuartzScheduler_Worker-76) [53942145] FINISH, ValidateStorageServerConnectionVDSCommand, return: {93e1cadd-7f7d-4a7e-9216-baa62053352e=0}, log id: 415b13cd
2012-11-04 15:30:01,524 INFO  [org.ovirt.engine.core.bll.storage.ConnectHostToStoragePoolServersCommand] (QuartzScheduler_Worker-76) [53942145] Running command: ConnectHostToStoragePoolServersCommand internal: true. Entities affected :  ID: 020d9b34-265d-11e2-8865-441ea17336ee Type: StoragePool
2012-11-04 15:30:01,537 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (QuartzScheduler_Worker-76) [53942145] START, ConnectStorageServerVDSCommand(HostName = puma35, HostId = 8107e470-2663-11e2-9efd-441ea17336ee, storagePoolId = 020d9b34-265d-11e2-8865-441ea17336ee, storageType = ISCSI, connectionList = [{ id: 93e1cadd-7f7d-4a7e-9216-baa62053352e, connection: 10.35.160.107, iqn: iqn.1992-04.com.emc:cx.ckm00121000438.b7, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };]), log id: 5de88663
2012-11-04 15:30:01,596 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (QuartzScheduler_Worker-76) [53942145] FINISH, ConnectStorageServerVDSCommand, return: {93e1cadd-7f7d-4a7e-9216-baa62053352e=0}, log id: 5de88663
2012-11-04 15:30:01,596 INFO  [org.ovirt.engine.core.bll.storage.ConnectHostToStoragePoolServersCommand] (QuartzScheduler_Worker-76) [53942145] Host puma35 storage connection was succeeded 
2012-11-04 15:30:01,632 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (QuartzScheduler_Worker-76) [53942145] START, ConnectStoragePoolVDSCommand(HostName = puma35, HostId = 8107e470-2663-11e2-9efd-441ea17336ee, storagePoolId = 020d9b34-265d-11e2-8865-441ea17336ee, vds_spm_id = 32, masterDomainId = cc015740-7710-4985-ab63-08f8b90c054a, masterVersion = 1), log id: 47311fdb
2012-11-04 15:30:02,638 ERROR [org.ovirt.engine.core.engineencryptutils.EncryptionUtils] (QuartzScheduler_Worker-98) Failed to decrypt Data must not be longer than 256 bytes
2012-11-04 15:30:02,638 ERROR [org.ovirt.engine.core.engineencryptutils.EncryptionUtils] (QuartzScheduler_Worker-84) Failed to decrypt Data must not be longer than 256 bytes
2012-11-04 15:30:06,689 ERROR [org.ovirt.engine.core.engineencryptutils.EncryptionUtils] (QuartzScheduler_Worker-18) Failed to decrypt Data must not be longer than 256 bytes
2012-11-04 15:30:06,689 ERROR [org.ovirt.engine.core.engineencryptutils.EncryptionUtils] (QuartzScheduler_Worker-26) Failed to decrypt Data must not be longer than 256 bytes
2012-11-04 15:30:16,851 ERROR [org.ovirt.engine.core.engineencryptutils.EncryptionUtils] (QuartzScheduler_Worker-46) Failed to decrypt Data must not be longer than 256 bytes
2012-11-04 15:30:16,851 ERROR [org.ovirt.engine.core.engineencryptutils.EncryptionUtils] (QuartzScheduler_Worker-97) Failed to decrypt Data must not be longer than 256 bytes
2012-11-04 15:30:45,491 ERROR [org.ovirt.engine.core.engineencryptutils.EncryptionUtils] (QuartzScheduler_Worker-45) Failed to decrypt Data must not be longer than 256 bytes
2012-11-04 15:30:45,491 ERROR [org.ovirt.engine.core.engineencryptutils.EncryptionUtils] (QuartzScheduler_Worker-27) Failed to decrypt Data must not be longer than 256 bytes
2012-11-04 15:30:54,393 ERROR [org.ovirt.engine.core.engineencryptutils.EncryptionUtils] (QuartzScheduler_Worker-82) Failed to decrypt Data must not be longer than 256 bytes
2012-11-04 15:30:54,404 ERROR [org.ovirt.engine.core.engineencryptutils.EncryptionUtils] (QuartzScheduler_Worker-7) Failed to decrypt Data must start with zero
2012-11-04 15:31:55,878 ERROR [org.ovirt.engine.core.engineencryptutils.EncryptionUtils] (QuartzScheduler_Worker-71) Failed to decrypt Data must not be longer than 256 bytes
2012-11-04 15:31:55,891 ERROR [org.ovirt.engine.core.engineencryptutils.EncryptionUtils] (QuartzScheduler_Worker-40) Failed to decrypt Data must start with zero
2012-11-04 15:32:28,785 ERROR [org.ovirt.engine.core.engineencryptutils.EncryptionUtils] (QuartzScheduler_Worker-85) Failed to decrypt Data must not be longer than 256 bytes
2012-11-04 15:32:28,796 ERROR [org.ovirt.engine.core.engineencryptutils.EncryptionUtils] (QuartzScheduler_Worker-96) Failed to decrypt Data must start with zero
2012-11-04 15:32:36,401 ERROR [org.ovirt.engine.core.engineencryptutils.EncryptionUtils] (QuartzScheduler_Worker-28) Failed to decrypt Data must not be longer than 256 bytes
2012-11-04 15:32:48,333 ERROR [org.ovirt.engine.core.engineencryptutils.EncryptionUtils] (QuartzScheduler_Worker-80) Failed to decrypt Data must not be longer than 256 bytes
2012-11-04 15:32:48,333 ERROR [org.ovirt.engine.core.engineencryptutils.EncryptionUtils] (QuartzScheduler_Worker-21) Failed to decrypt Data must not be longer than 256 bytes
2012-11-04 15:33:01,633 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand] (QuartzScheduler_Worker-76) [53942145] XML RPC error in command ConnectStoragePoolVDS ( HostName = puma35 ), the error was: java.util.concurrent.TimeoutException, TimeoutException:  
2012-11-04 15:33:01,633 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (QuartzScheduler_Worker-76) [53942145] FINISH, ConnectStoragePoolVDSCommand, log id: 47311fdb
2012-11-04 15:33:01,633 ERROR [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (QuartzScheduler_Worker-76) [53942145] Could not connect host puma35 to pool iscsi_dc
2012-11-04 15:33:01,676 INFO  [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (QuartzScheduler_Worker-76) [278152df] Running command: SetNonOperationalVdsCommand internal: true. Entities affected :  ID: 8107e470-2663-11e2-9efd-441ea17336ee Type: VDS
2012-11-04 15:33:01,689 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-76) [278152df] START, SetVdsStatusVDSCommand(HostName = puma35, HostId = 8107e470-2663-11e2-9efd-441ea17336ee, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 5e7b01cd
2012-11-04 15:33:01,702 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-76) [278152df] FINISH, SetVdsStatusVDSCommand, log id: 5e7b01cd
2012-11-04 15:33:01,714 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (QuartzScheduler_Worker-76) [278152df] Try to add duplicate values with same name. Type: VDS_SET_NONOPERATIONAL_DOMAIN. Value: storagepoolname
2012-11-04 15:33:01,761 INFO  [org.ovirt.engine.core.bll.MultipleActionsRunner] (pool-4-thread-578) [278152df] MultipleActionsRunner of type MigrateVmToServer invoked with no actions
2012-11-04 15:33:01,787 INFO  [org.ovirt.engine.core.bll.HandleVdsCpuFlagsOrClusterChangedCommand] (QuartzScheduler_Worker-76) [19ca30fb] Running command: HandleVdsCpuFlagsOrClusterChangedCommand internal: true. Entities affected :  ID: 8107e470-2663-11e2-9efd-441ea17336ee Type: VDS
2012-11-04 15:33:01,806 INFO  [org.ovirt.engine.core.bll.HandleVdsVersionCommand] (QuartzScheduler_Worker-76) [532e3704] Running command: HandleVdsVersionCommand internal: true. Entities affected :  ID: 8107e470-2663-11e2-9efd-441ea17336ee Type: VDS
2012-11-04 15:33:01,808 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-76) [532e3704] Host 8107e470-2663-11e2-9efd-441ea17336ee : puma35 is already in NonOperational status. SetNonOperationalVds command is skipped.
2012-11-04 15:33:19,101 ERROR [org.ovirt.engine.core.engineencryptutils.EncryptionUtils] (QuartzScheduler_Worker-70) Failed to decrypt Data must start with zero
2012-11-04 15:33:19,101 ERROR [org.ovirt.engine.core.engineencryptutils.EncryptionUtils] (QuartzScheduler_Worker-90) Failed to decrypt Data must start with zero

Comment 8 Omri Hochman 2012-11-04 16:19:57 UTC
Created attachment 638062 [details]
new_engine_log

Comment 9 mkublin 2012-11-05 07:32:33 UTC
These is works as design:

4) Host status will be change to 'Unasigned' then switch back to 'Non-Operational'.
5) Steps will repeat every 5 minutes. 	

During activation the first status of host is Unasigned, if we failed to Activate host it will be moved to Non-Operational. Agreed that these annoying, but these is flow. Host was not moved to Up, it means no one can use it or run VM.
Moving back to QA.

Comment 10 Omri Hochman 2012-11-05 08:53:50 UTC
Verified- RHEVM Build SI23: 

In case host having a storage issue the following behavior will take place: 

1) After about 5-6 minutes the Host will be switch to Non-Operational.
2) After 5 more minutes HostAutoRecovery will attempt to activate the host. 
    Event message will display:
      a)"Detected new Host puma35. Host state was set to Up."
      b) Host status will be change to 'Unasigned'. 
3 )After 2-3 minutes host will switch back to Non-Operational.
    Event message will display: 
      a)"Failed to connect Host puma35 to Storage Pool iscsi_dc"
      b)"Host puma35 cannot access one of the Storage Domains attached to the Data" 
4) Steps (1-3) repeat every 5 minutes (untill storage issue is solved).

Comment 11 Barak 2012-11-05 14:45:12 UTC
Moving back to ON_QA for verification.

Omri, per comment #10 it looks like it was already VERIFIED, please move the BZ to it's write status.


Note You need to log in before you can comment on or make changes to this bug.