Bug 848439

Summary: [engine-core][PosixFS only] host moves to non-operational due to stale cache regarding storage status (host is connected to both storage and pool successfully)
Product: Red Hat Enterprise Virtualization Manager
Reporter: vvyazmin <vvyazmin>
Component: ovirt-engine
Assignee: Maor <mlipchuk>
Status: CLOSED DUPLICATE
QA Contact: Haim <hateya>
Severity: urgent
Priority: unspecified
Version: 3.1.0
CC: amureini, dyasny, hateya, iheim, lpeer, mkenneth, mlipchuk, Rhev-m-bugs, sgrinber, yeylon, ykaul
Target Release: 3.1.0
Hardware: x86_64
OS: Linux
Whiteboard: storage
Doc Type: Bug Fix
Type: Bug
oVirt Team: Storage
Last Closed: 2012-09-02 15:13:25 UTC
Attachments:
- ## Logs vdsm, rhevm, screen-shots (flags: none)
- ## Logs vdsm, rhevm (flags: none)

Description vvyazmin@redhat.com 2012-08-15 15:31:12 UTC
Created attachment 604628 [details]
## Logs vdsm, rhevm, screen-shots

Description of problem: Host moves to non-operational due to stale cache regarding storage status (host is connected to both storage and pool successfully)


Version-Release number of selected component (if applicable):
Verified on RHEVM 3.1 - SI13.3

RHEVM: rhevm-3.1.0-11.el6ev.noarch
VDSM: vdsm-4.9.6-27.0.el6_3.x86_64
LIBVIRT: libvirt-0.9.10-21.el6.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.298.el6_3.x86_64
SANLOCK: sanlock-2.3-3.el6_3.x86_64

How reproducible:


Steps to Reproduce:
1. Create a DC with “POSIX compliant FS” type
2. Attach 2 hosts (in my case Cougar05 [SPM] & Cougar08)
3. Attach one storage domain (“POSIX compliant FS” type)
4. Verify that the whole configuration works properly
5. Via iptables on both servers, block incoming & outgoing connections to the PosixFS server
6. After 10 seconds, remove the restriction from iptables on both servers, and verify that both servers have a connection to the PosixFS server
7. Activate both servers
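Steps 5 and 6 above can be sketched as a small script. This is a dry-run sketch only: the server address is taken from the script in comment 3, the commands are printed rather than executed, and to run it for real you would drop the `echo` in `run()` and execute as root on both hosts.

```shell
#!/bin/sh
# Dry-run sketch of repro steps 5-6. Address from the script in comment 3.
POSIXFS_SERVER=10.35.97.44

# Print each command instead of executing it; remove "echo" to run for real.
run() { echo "$@"; }

# Step 5: block incoming & outgoing connections to the PosixFS server
run iptables -A OUTPUT -d "$POSIXFS_SERVER" -j DROP
run iptables -A INPUT  -s "$POSIXFS_SERVER" -j DROP

# Step 6: after ~10 seconds, remove the restriction again
run sleep 10
run iptables -F
```

The one-liner in comment 3 does the same thing in a single command, blocking two addresses and sleeping 60 seconds instead of 10.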

  
Actual results:
After 5-10 minutes the HSM host (Cougar08) enters the Non-Operational state.
The state is reproducible after activating the host again!

Expected results:
All hosts continue to work properly after the network disconnection

Additional info:

*** rhevm log ***
2012-08-15 16:22:14,999 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (pool-4-thread-50) [707619c4] Running command: ActivateVdsCommand internal: false. Entities affected :  ID: 0047d3ee-e62c-11e1-a449-001a4a169738 Type: VDS
2012-08-15 16:22:15,001 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (pool-4-thread-50) [707619c4] START, SetVdsStatusVDSCommand(vdsId = 0047d3ee-e62c-11e1-a449-001a4a169738, status=Unassigned, nonOperationalReason=NONE), log id: 58796879
2012-08-15 16:22:15,004 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (pool-4-thread-50) [707619c4] FINISH, SetVdsStatusVDSCommand, log id: 58796879
2012-08-15 16:22:15,008 INFO  [org.ovirt.engine.core.vdsbroker.ActivateVdsVDSCommand] (pool-4-thread-50) [707619c4] START, ActivateVdsVDSCommand(vdsId = 0047d3ee-e62c-11e1-a449-001a4a169738), log id: 433783b8
2012-08-15 16:22:15,243 INFO  [org.ovirt.engine.core.bll.HandleVdsCpuFlagsOrClusterChangedCommand] (pool-4-thread-50) [14d74cac] Running command: HandleVdsCpuFlagsOrClusterChangedCommand internal: true. Entities affected :  ID: 0047d3ee-e62c-11e1-a449-001a4a169738 Type: VDS
2012-08-15 16:22:15,247 INFO  [org.ovirt.engine.core.bll.HandleVdsVersionCommand] (pool-4-thread-50) [1df147b3] Running command: HandleVdsVersionCommand internal: true. Entities affected :  ID: 0047d3ee-e62c-11e1-a449-001a4a169738 Type: VDS
2012-08-15 16:22:15,249 INFO  [org.ovirt.engine.core.vdsbroker.ActivateVdsVDSCommand] (pool-4-thread-50) [1df147b3] FINISH, ActivateVdsVDSCommand, log id: 433783b8
2012-08-15 16:22:15,279 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (pool-4-thread-50) [1df147b3] Lock freed to object EngineLock [exclusiveLocks= key: 0047d3ee-e62c-11e1-a449-001a4a169738 value: VDS
, sharedLocks= ]
2012-08-15 16:22:16,342 INFO  [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (QuartzScheduler_Worker-33) [15c1232f] Running command: InitVdsOnUpCommand internal: true.
2012-08-15 16:22:16,403 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ValidateStorageServerConnectionVDSCommand] (QuartzScheduler_Worker-33) [161595a] START, ValidateStorageServerConnectionVDSCommand(vdsId = 0047d3ee-e62c-11e1-a449-001a4a169738, storagePoolId = 5c069c70-6424-4125-b4bd-c0da62dd3c29, storageType = POSIXFS, connectionList = [{ id: fc0805b5-b9e7-4003-9db8-ccd27ea6a5ba, connection: 10.35.97.44:/vol1/ };]), log id: 1fbb89e
2012-08-15 16:22:16,442 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ValidateStorageServerConnectionVDSCommand] (QuartzScheduler_Worker-33) [161595a] FINISH, ValidateStorageServerConnectionVDSCommand, return: {fc0805b5-b9e7-4003-9db8-ccd27ea6a5ba=0}, log id: 1fbb89e
2012-08-15 16:22:16,442 INFO  [org.ovirt.engine.core.bll.storage.ConnectHostToStoragePoolServersCommand] (QuartzScheduler_Worker-33) [161595a] Running command: ConnectHostToStoragePoolServersCommand internal: true. Entities affected :  ID: 5c069c70-6424-4125-b4bd-c0da62dd3c29 Type: StoragePool
2012-08-15 16:22:16,445 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (QuartzScheduler_Worker-33) [161595a] START, ConnectStorageServerVDSCommand(vdsId = 0047d3ee-e62c-11e1-a449-001a4a169738, storagePoolId = 5c069c70-6424-4125-b4bd-c0da62dd3c29, storageType = POSIXFS, connectionList = [{ id: fc0805b5-b9e7-4003-9db8-ccd27ea6a5ba, connection: 10.35.97.44:/vol1/ };]), log id: f193800
2012-08-15 16:22:16,478 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (QuartzScheduler_Worker-33) [161595a] FINISH, ConnectStorageServerVDSCommand, return: {fc0805b5-b9e7-4003-9db8-ccd27ea6a5ba=0}, log id: f193800
2012-08-15 16:22:16,478 INFO  [org.ovirt.engine.core.bll.storage.ConnectHostToStoragePoolServersCommand] (QuartzScheduler_Worker-33) [161595a] Host Cougar08 storage connection was succeeded 
2012-08-15 16:22:16,491 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (QuartzScheduler_Worker-33) [161595a] START, ConnectStoragePoolVDSCommand(vdsId = 0047d3ee-e62c-11e1-a449-001a4a169738, storagePoolId = 5c069c70-6424-4125-b4bd-c0da62dd3c29, vds_spm_id = 1, masterDomainId = cf0e1006-b506-4700-9be2-e69a5001e259, masterVersion = 1), log id: 8027953
2012-08-15 16:22:17,400 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (QuartzScheduler_Worker-33) [161595a] FINISH, ConnectStoragePoolVDSCommand, log id: 8027953
2012-08-15 16:22:17,408 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (QuartzScheduler_Worker-33) [161595a] No string for UNASSIGNED type. Use default Log
2012-08-15 16:22:17,413 INFO  [org.ovirt.engine.core.bll.MultipleActionsRunner] (pool-4-thread-50) [161595a] MultipleActionsRunner of type MigrateVmToServer invoked with no actions
2012-08-15 16:22:17,437 INFO  [org.ovirt.engine.core.bll.HandleVdsCpuFlagsOrClusterChangedCommand] (QuartzScheduler_Worker-33) [171a0ccc] Running command: HandleVdsCpuFlagsOrClusterChangedCommand internal: true. Entities affected :  ID: 0047d3ee-e62c-11e1-a449-001a4a169738 Type: VDS
2012-08-15 16:22:17,444 INFO  [org.ovirt.engine.core.bll.HandleVdsVersionCommand] (QuartzScheduler_Worker-33) [2873b66f] Running command: HandleVdsVersionCommand internal: true. Entities affected :  ID: 0047d3ee-e62c-11e1-a449-001a4a169738 Type: VDS
2012-08-15 16:22:27,643 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-67) domain cf0e1006-b506-4700-9be2-e69a5001e259 in problem. vds: Cougar08
2012-08-15 16:25:00,001 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-79) Autorecovering hosts is disabled, skipping
2012-08-15 16:25:00,001 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-79) Autorecovering storage domains is disabled, skipping
2012-08-15 16:27:27,645 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-83) [6eccfefc] starting ProcessDomainRecovery for domain cf0e1006-b506-4700-9be2-e69a5001e259
2012-08-15 16:27:27,650 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-83) [6eccfefc] vds Cougar08 reported domain cf0e1006-b506-4700-9be2-e69a5001e259:sDomain-PosixFS-03 as in problem, moving the vds to status NonOperational
2012-08-15 16:27:27,657 WARN  [org.ovirt.engine.core.dal.job.ExecutionMessageDirector] (QuartzScheduler_Worker-83) [1096d25e] The message key SetNonOperationalVds is missing from bundles/ExecutionMessages
2012-08-15 16:27:27,700 INFO  [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (QuartzScheduler_Worker-83) [1096d25e] Running command: SetNonOperationalVdsCommand internal: true. Entities affected :  ID: 0047d3ee-e62c-11e1-a449-001a4a169738 Type: VDS
2012-08-15 16:27:27,701 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-83) [1096d25e] START, SetVdsStatusVDSCommand(vdsId = 0047d3ee-e62c-11e1-a449-001a4a169738, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 53b7ee2c
2012-08-15 16:27:27,709 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-83) [1096d25e] FINISH, SetVdsStatusVDSCommand, log id: 53b7ee2c
2012-08-15 16:27:27,737 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-83) [1096d25e] Clearing cache of pool: 5c069c70-6424-4125-b4bd-c0da62dd3c29 for problematic entities of VDS: Cougar08.
2012-08-15 16:27:27,737 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-83) [1096d25e] Domain cf0e1006-b506-4700-9be2-e69a5001e259 recovered from problem. vds: Cougar08
2012-08-15 16:27:27,737 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-83) [1096d25e] Domain cf0e1006-b506-4700-9be2-e69a5001e259 has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.

Worked with Maor Lipchuk on this issue

Comment 1 Yaniv Kaul 2012-08-15 17:16:59 UTC
Does it happen with non-Posix (NFS) as well? In other words, is that a PosixFS-specific behavior?

Comment 2 vvyazmin@redhat.com 2012-08-16 12:12:22 UTC
Created attachment 604920 [details]
## Logs vdsm, rhevm

Comment 3 vvyazmin@redhat.com 2012-08-16 12:14:19 UTC
This behavior is specific to PosixFS.

Not reproduced on an NFS storage domain.
The problem reproduced in RHEVM 3.1 - SI14.

New logs are attached.

I ran a script to reproduce this behavior:
iptables -A OUTPUT -d 10.35.97.44 -j DROP && iptables -A OUTPUT -d 10.35.97.54 -j DROP && sleep 60 && iptables -F

Comment 5 vvyazmin@redhat.com 2012-08-20 10:43:03 UTC
Not reproduced on iSCSI storage domain.

Comment 7 vvyazmin@redhat.com 2012-08-22 10:36:00 UTC
There is no workaround; the server stays in the Non-Operational state.

Comment 8 Maor 2012-08-22 16:05:42 UTC
From what I saw in the logs,
I suspect there is a race in the engine,
since the logs from VDSM indicate that the storage domain is valid.
The problem is that this specific module in the engine is not informative enough, so it is hard to spot the race.
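The rhevm log above is consistent with that suspicion: the host reconnects successfully at 16:22:16, the domain is recorded as "in problem" at 16:22:27, and five minutes later ProcessDomainRecovery moves the host to NonOperational and only then clears the cache, immediately after which the same domain is reported as recovered. A purely speculative shell sketch of that sequence follows; none of the names are real engine identifiers.

```shell
#!/bin/sh
# Speculative illustration of the suspected stale-cache race (comment 8).
# The recovery check consults a cached domain report instead of the live
# VDSM report, which by the time it fires already says the domain is valid.

CACHED_REPORT="in_problem"   # stale: recorded just after host reactivation
LIVE_REPORT="valid"          # what VDSM actually reports ~5 minutes later

process_domain_recovery() {
    if [ "$CACHED_REPORT" = "in_problem" ]; then
        echo "moving the vds to status NonOperational"   # what the log shows
    else
        echo "host stays Up"                             # expected behavior
    fi
    # Cache is cleared only after the host has already been taken down
    CACHED_REPORT="$LIVE_REPORT"
}

process_domain_recovery
```

Once the cache is cleared, the very next check passes, which matches the "recovered from problem" lines that directly follow the NonOperational transition in the log.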

Comment 9 Ayal Baron 2012-08-22 21:41:32 UTC
(In reply to comment #8)
> From what I saw in the logs,
> I suspect there is a race in the engine,
> since the logs from VDSM indicate that the storage domain is valid.
> The problem is that this specific module in the engine is not informative
> enough, so it is hard to spot the race.

Maor sent Haim a patch with which to reproduce, for more info.
Waiting on QA.

Comment 12 vvyazmin@redhat.com 2012-08-27 11:17:40 UTC
Ran the same scenario on RHEVM 3.1 - SI15 and got the same result.

RHEVM: rhevm-3.1.0-13.el6ev.noarch
VDSM: vdsm-4.9.6-29.0.el6_3.x86_64
LIBVIRT: libvirt-0.9.10-21.el6.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.298.el6_3.x86_64
SANLOCK: sanlock-2.3-3.el6_3.x86_64

Comment 15 vvyazmin@redhat.com 2012-09-02 15:13:25 UTC

*** This bug has been marked as a duplicate of bug 852728 ***