Created attachment 1239103 [details]
logs from engine and hypervisor

Description of problem:
In a storage pool with 3 active domains (2 NFS data domains, one of them the master, residing on one storage server, and 1 iSCSI data domain), I blocked connectivity to the NFS server where the master domain resides. Reconstruct master did not succeed and the DC remains non-responsive.

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.0-0.4.master.20170109222652.git53fd6cb.el7.centos.noarch
vdsm-4.20.0-92.gita812c81.el7.centos.x86_64
sanlock-3.4.0-1.el7.x86_64

How reproducible:
1/1

Steps to Reproduce:
In a storage pool with 1 active host and 3 active domains (2 NFS data domains, one of them the master, residing on one storage server, and 1 iSCSI data domain):
1. Block connectivity from the SPM to the storage server where the master domain resides:
   'iptables -A OUTPUT -d <storage_server_add> -j DROP'

Actual results:
Reconstruct is called and blocked in CDA:

2017-01-10 14:08:25,538+02 WARN [org.ovirt.engine.core.bll.storage.pool.ReconstructMasterDomainCommand] (org.ovirt.thread.pool-6-thread-32) [13f21fbc] Validation of action 'ReconstructMasterDomain' failed for user SYSTEM. Reasons: VAR__ACTION__RECONSTRUCT_MASTER,VAR__TYPE__STORAGE__DOMAIN,ACTION_TYPE_FAILED_NO_VDS_IN_POOL

Then nothing happens with the pool; the DC remains non-responsive. I waited for ~45 minutes.

Expected results:
Reconstruct master should succeed in case there is a domain that can take over as master.

Additional info:
logs from engine and hypervisor
The issue here is that after vdsm is restarted by sanlock, the engine fails to reestablish the connection to it. So although the vdsm service is running (according to Elad), the engine still fails to connect to it and marks the host as non-responsive. From the reconstruct flow's perspective there is no active host that can be used to perform the reconstruct, so the behavior on that front is as expected. Moving to infra for further investigation of why the engine fails to reconnect to the host.
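For clarity, the validation failure above (ACTION_TYPE_FAILED_NO_VDS_IN_POOL) boils down to a CanDoAction check that requires at least one Up host in the pool. A minimal Python sketch of that check follows; it is illustrative only, and the names `StoragePool` and `validate_reconstruct` are hypothetical, not the actual engine code:

```python
# Hypothetical sketch of the CDA check behind ACTION_TYPE_FAILED_NO_VDS_IN_POOL.
# Not the real ReconstructMasterDomainCommand validation code.

class StoragePool:
    def __init__(self, hosts):
        # hosts: list of (name, status) tuples, e.g. ("host1", "Up")
        self.hosts = hosts

    def active_hosts(self):
        # Only hosts in status 'Up' can perform the reconstruct;
        # a host marked NonResponsive is excluded.
        return [name for name, status in self.hosts if status == "Up"]

def validate_reconstruct(pool):
    """Return None on success, or a failure reason string."""
    if not pool.active_hosts():
        return "ACTION_TYPE_FAILED_NO_VDS_IN_POOL"
    return None

# With the pool's only host non-responsive, validation fails as in the log:
pool = StoragePool([("host1", "NonResponsive")])
print(validate_reconstruct(pool))  # -> ACTION_TYPE_FAILED_NO_VDS_IN_POOL
```

This is why the reconstruct itself behaves "as expected" here: with the single host non-responsive, there is simply no candidate host to run it on.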
I do not see an attempt by the engine to connect after the restart; the engine did not trigger any communication. I can see that vdsm was up again at:

2017-01-10 14:04:59,238 INFO (MainThread) [vds] (PID: 2246) I am the actual vdsm 4.20.0-92.gita812c81.el7.centos storage-ge1-vdsm4.qa.lab.tlv.redhat.com (3.10.0-514.el7.x86_64) (vdsm:145)

and I can see only local connections from mom. The only connection attempt that I see is at:

2017-01-10 13:56:59,028+02 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [41f4de89] Connecting to /10.35.65.72

I see that the irs connection timed out at:

2017-01-10 14:09:01,499+02 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler1) [39f9056d] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
        at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:188) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.irsbroker.GetStoragePoolInfoVDSCommand.executeIrsBrokerCommand(GetStoragePoolInfoVDSCommand.java:32) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand.lambda$executeVDSCommand$0(IrsBrokerCommand.java:95) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy.runInControlledConcurrency(IrsProxy.java:262) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand.executeVDSCommand(IrsBrokerCommand.java:92) [vdsbroker.jar:]

but reconstruct master was run prior to that:

2017-01-10 14:08:25,376+02 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (org.ovirt.thread.pool-6-thread-32) [] Domain '5570ebc3-4507-4808-b767-2c3407b96739:nfs_1' was reported by all hosts in status UP as problematic. Not moving the domain to NonOperational because it is being reconstructed now.
2017-01-10 14:08:25,538+02 WARN [org.ovirt.engine.core.bll.storage.pool.ReconstructMasterDomainCommand] (org.ovirt.thread.pool-6-thread-32) [13f21fbc] Validation of action 'ReconstructMasterDomain' failed for user SYSTEM. Reasons: VAR__ACTION__RECONSTRUCT_MASTER,VAR__TYPE__STORAGE__DOMAIN,ACTION_TYPE_FAILED_NO_VDS_IN_POOL

and the first higher-level timeout occurred at:

2017-01-10 14:06:00,318+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler1) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM host1 command failed: Message timeout which can be caused by communication issues
2017-01-10 14:06:00,321+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler1) [] Command 'SpmStatusVDSCommand(HostName = host1, SpmStatusVDSCommandParameters:{runAsync='true', hostId='446dd414-436e-417b-9cab-401c37fff436', storagePoolId='00000001-0001-0001-0001-000000000311'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues

My question is: why was there no attempt to reconnect? Was it caused by GetStoragePoolInfoVDSCommand holding the lock?
Just to be sure whether we are looking at a communication issue or a master recovery issue, please provide engine logs in DEBUG.
Created attachment 1239125 [details]
logs-engine in DEBUG

Reproduced again, this time engine.log is in DEBUG.

2017-01-10 16:42:42,678+02 WARN [org.ovirt.engine.core.bll.storage.pool.ReconstructMasterDomainCommand] (org.ovirt.thread.pool-6-thread-27) [15aa4028] Validation of action 'ReconstructMasterDomain' failed for user SYSTEM. Reasons: VAR__ACTION__RECONSTRUCT_MASTER,VAR__TYPE__STORAGE__DOMAIN,ACTION_TYPE_FAILED_NO_VDS_IN_POOL
Restoring needinfo on Liron.
Only part of the log is in debug. Please update both places:

    <logger category="org.ovirt" use-parent-handlers="false">
        <level name="DEBUG"/>
        <handlers>

and

    <root-logger>
        <level name="DEBUG"/>
        <handlers>

I see that the only connection attempt was at:

2017-01-10 16:33:59,427+02 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to /10.35.65.72

which implies that the engine never attempted to reconnect. Please provide the logs with the above settings.
Only part of the log is in DEBUG because I changed the configuration to DEBUG only ~30 minutes ago. Now all is in DEBUG:

    <logger category="org.ovirt.engine.ui.frontend.server.gwt.OvirtRemoteLoggingService" use-parent-handlers="false">
        <level name="DEBUG"/>
        <handlers>
            <handler name="UI"/>
        </handlers>
    </logger>
    <logger category="org.ovirt.engine.core.bll">
        <level name="DEBUG"/>
    </logger>
    <logger category="org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresJdbcTemplate">
        <level name="WARN"/>
    </logger>
    <root-logger>
        <level name="DEBUG"/>
        <handlers>
            <handler name="ovirt-logger"/>
            <handler name="SERVER"/>
        </handlers>
    </root-logger>
</subsystem>
You are missing debug for category="org.ovirt". Restoring need info.
> My question is why there was no attempt to reconnect was it caused by
> GetStoragePoolInfoVDSCommand holding the lock?

Hi Piotr, can you elaborate? I didn't understand the question completely. The engine may attempt to execute vds commands and fail. When the vdsm service is active again the connection should be restored.
Liron,

You stated in comment #1 that "the engine fails to reestablish the connection", and I see in the logs that there was no attempt to do so. We log an attempt to talk to vdsm like this:

Connecting to /10.35.65.72

After the domain was blocked, the engine never attempted to reconnect, which means that no commands were triggered. My question is why we haven't called anything.
Piotr,

There are multiple attempts to run vds commands on that host; HostMonitoring executes getStats every few seconds and fails (example in [1]).

[1]
2017-01-10 14:06:00,372+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler4) [] Command 'GetStatsVDSCommand(HostName = host1, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='446dd414-436e-417b-9cab-401c37fff436', vds='Host[host1,446dd414-436e-417b-9cab-401c37fff436]'})' execution failed: VDSGenericException: VDSNetworkException: Vds timeout occured
2017-01-10 14:06:00,401+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler4) [] Failed getting vds stats, host='host1'(446dd414-436e-417b-9cab-401c37fff436): org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Vds timeout occured
2017-01-10 14:06:00,401+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler4) [] Failure to refresh host 'host1' runtime info: VDSGenericException: VDSNetworkException: Vds timeout occured
2017-01-10 14:06:00,401+02 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler4) [] Failed to refresh VDS, network error, continuing, vds='host1'(446dd414-436e-417b-9cab-401c37fff436): VDSGenericException: VDSNetworkException: Vds timeout occured
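The monitoring behavior described here (getStats retried every few seconds, with each failure logged and the loop continuing) can be sketched roughly as follows. This is a simplified Python illustration under my own assumptions, not the engine's actual HostMonitoring/VdsManager implementation:

```python
# Illustrative sketch of a host-monitoring poll loop: each cycle attempts
# GetStats and records the failure instead of raising, mirroring the
# "Failed to refresh VDS, network error, continuing" log line above.
# Hypothetical names; not the engine's real code.

class VDSNetworkException(Exception):
    pass

def monitoring_cycles(get_stats, cycles):
    """Run `cycles` monitoring iterations; return a list of outcomes."""
    outcomes = []
    for _ in range(cycles):
        try:
            outcomes.append(("ok", get_stats()))
        except VDSNetworkException as e:
            # Log and continue; the next cycle retries regardless.
            outcomes.append(("failed", str(e)))
    return outcomes

def broken_get_stats():
    # Simulates the failing host ("occured" spelled as in the actual log).
    raise VDSNetworkException("Vds timeout occured")

results = monitoring_cycles(broken_get_stats, 3)
print(results)
```

The point of contention in the following comments is whether each such failed cycle actually puts bytes on the wire, or merely reports a timeout on an already-dead channel.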
Liron,

The command that you mentioned timed out, and it was run prior to vdsm being restarted. In this part of the log you just see the timeout occurring. In other words, the engine sent a message to vdsm, but vdsm was restarted so the response never arrived. By default we have a 300-second timeout, which fired at 14:06:00. This does not mean that there was any interaction with vdsm. When the timeout occurs we close the channel (if not closed already) and will attempt to reconnect once any command is triggered. In this flow we see that there was no attempt to send anything to vdsm.
Piotr,

The VdsManager is responsible for monitoring the host and changing its status accordingly. The VdsManager attempts to execute GetStats/GetCaps every few seconds on the host. If the vdsm service is running on the host, as Elad says, the connection should be restored and GetStats should succeed. We can see later that those executions still fail:

2017-01-10 16:40:18,366+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler1) [30277379] Command 'GetStatsVDSCommand(HostName = host1, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='446dd414-436e-417b-9cab-401c37fff436', vds='Host[host1,446dd414-436e-417b-9cab-401c37fff436]'})' execution failed: VDSGenericException: VDSNetworkException: Vds timeout occured
2017-01-10 16:40:18,366+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler1) [30277379] Failed getting vds stats, host='host1'(446dd414-436e-417b-9cab-401c37fff436): org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Vds timeout occured
Liron,

It is not an invocation of a command. It is a timeout that occurred for a command which was executed around 300 seconds before.
Piotr -

> When the timeout occurs we close the channel (if not closed already) and
> will attempt to reconnect once any command is triggered. In this flow we see
> that there was no attempt to send anything to vdsm.

As you can see in my last comment, commands (GetStats/GetCaps) still fail hours later with network exceptions, even though the vdsm service was restarted successfully on the host. We need to find out why the connection between the engine and vdsm isn't restored although the service is back up. I understand that you look for the "Connecting..." message in the logs; the fact that it's not there is the bug here, IMO. It needs to be investigated why the engine doesn't attempt, or doesn't succeed, to restore the connection; the engine definitely attempts to execute commands on that host.
I agree that we need to understand why the connection is not recovered. I honestly do not see any attempts to execute commands, only timeouts which are passed on as network exceptions. Let's wait for the debug logs to confirm; the last ones provided were missing one category.
Created attachment 1239340 [details]
engine.log in DEBUG (1)

engine.log in DEBUG:

2017-01-11 01:09:53,330+02 WARN [org.ovirt.engine.core.bll.storage.pool.ReconstructMasterDomainCommand] (org.ovirt.thread.pool-6-thread-30) [6e861bf2] Validation of action 'ReconstructMasterDomain' failed for user SYSTEM. Reasons: VAR__ACTION__RECONSTRUCT_MASTER,VAR__TYPE__STORAGE__DOMAIN,ACTION_TYPE_FAILED_NO_VDS_IN_POOL
Elad - what's the jsonrpc version you're using?
Created attachment 1239397 [details]
engine.log in DEBUG (2)

Oved,

vdsm:
python-anyjson-0.3.3-3.el7ost.noarch
rubygem-json-1.7.7-29.el7.x86_64
python-simplejson-3.5.3-5.el7.x86_64
python-jsonpatch-1.2-2.el7.noarch
python-jsonpointer-1.9-2.el7.noarch
python-jsonschema-2.3.0-1.el7ost.noarch
vdsm-jsonrpc-4.20.0-92.gita812c81.el7.centos.noarch
json-c-0.11-4.el7_0.x86_64
vdsm-yajsonrpc-4.20.0-92.gita812c81.el7.centos.noarch

engine:
json-c-0.11-4.el7_0.x86_64
rubygem-json-1.7.7-29.el7.x86_64
vdsm-jsonrpc-java-1.4.0-1.20170109154425.git44d7998.el7.centos.noarch

Piotr, logs attached.

2017-01-11 11:28:16,038+02 WARN [org.ovirt.engine.core.bll.storage.pool.ReconstructMasterDomainCommand] (org.ovirt.thread.pool-6-thread-40) [359b5bd8] Validation of action 'ReconstructMasterDomain' failed for user SYSTEM. Reasons: VAR__ACTION__RECONSTRUCT_MASTER,VAR__TYPE__STORAGE__DOMAIN,ACTION_TYPE_FAILED_NO_VDS_IN_POOL
Elad - can you downgrade vdsm-jsonrpc-java to 1.3.6 (stop the engine first, of course)? Please report back whether it helped. 1.4.0 is the "master" version and 1.3.6 is the 4.1 one; I'm interested in knowing whether this also happens on 1.3.6.
Downgrade engine only?
Downgrade the vdsm-jsonrpc-java package on the engine machine from 1.4.0 to 1.3.6.
Created attachment 1239416 [details]
downgraded vdsm-json-java

Using vdsm-jsonrpc-java-1.3.6-1.20170111073901.gita5e600c.el7.centos.noarch, communication between the engine and vdsm is restored successfully after the vdsm service starts. Reconstruct master takes place and succeeds:

2017-01-11 14:12:42,495+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-1) [32cba7d9] Correlation ID: 32cba7d9, Job ID: ce2ee1c2-7217-431c-a823-077d2956b496, Call Stack: null, Custom Event ID: -1, Message: Reconstruct Master Domain for Data Center dc completed.
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
ok, ovirt-engine-4.1.0.2-0.2.el7.noarch & vdsm-jsonrpc-java-1.3.7-1.el7ev.noarch:

2017-01-25 12:44:24,300+01 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-50) [70f7ffc0] EVENT_ID: RECONSTRUCT_MASTER_DONE(984), Correlation ID: 70f7ffc0, Job ID: 1ec303d0-7e4d-4302-b67d-ae154a6e07f5, Call Stack: null, Custom Event ID: -1, Message: Reconstruct Master Domain for Data Center Default completed.