1411777 – [engine-backend] Engine-vdsm connection isn't restored after vdsm service restart while the engine ia up

Bug 1411777 - [engine-backend] Engine-vdsm connection isn't restored after vdsm service restart while the engine ia up

Summary: [engine-backend] Engine-vdsm connection isn't restored after vdsm service res...

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	ovirt-engine
Classification:	oVirt
Component:	BLL.Infra
Sub Component:
Version:	future
Hardware:	x86_64
OS:	Unspecified
Priority:	high
Severity:	urgent
Target Milestone:	ovirt-4.1.0-rc
Target Release:	4.1.0
Assignee:	Piotr Kliczewski
QA Contact:	Jiri Belka
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2017-01-10 12:54 UTC by Elad
Modified:	2017-02-01 14:34 UTC (History)
CC List:	6 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2017-02-01 14:34:51 UTC
oVirt Team:	Infra
Embargoed:
Dependent Products:
Flags:	rule-engine: ovirt-4.1+ rule-engine: blocker+

Attachments	(Terms of Use)
logs from engine and hypervisor (4.68 MB, application/x-gzip) 2017-01-10 12:54 UTC, Elad	no flags	Details
logs-engine in DEBUG (602.53 KB, application/x-gzip) 2017-01-10 14:47 UTC, Elad	no flags	Details
engine.log in DEBUG (1) (1.74 MB, application/x-gzip) 2017-01-11 06:59 UTC, Elad	no flags	Details
engine.log in DEBUG (2) (2.22 MB, application/x-gzip) 2017-01-11 10:00 UTC, Elad	no flags	Details
downgraded vdsm-json-java (3.04 MB, application/x-gzip) 2017-01-11 12:20 UTC, Elad	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
oVirt gerrit	69993	0	master	MERGED	handle ssl closed status	2020-05-21 11:21:47 UTC
oVirt gerrit	70385	0	ovirt-4.1	MERGED	handle ssl closed status	2020-05-21 11:21:47 UTC

Description Elad 2017-01-10 12:54:51 UTC

Created attachment 1239103 [details]
logs from engine and hypervisor

Description of problem:
In a storage pool with 3 active domains: 2 data NFS (one of them was the master) reside in one storage server and 1 iSCSI data domain; I blocked connectivity to the NFS server where the master domain resides. Reconstruct master did not succeed, DC remains non-responsive. 

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.0-0.4.master.20170109222652.git53fd6cb.el7.centos.noarch
vdsm-4.20.0-92.gita812c81.el7.centos.x86_64
sanlock-3.4.0-1.el7.x86_64

How reproducible:
1/1

Steps to Reproduce:
In a storage pool with 1 active host and 3 active domains: 2 data NFS (one of them was the master) reside in one storage server and 1 iSCSI data domain:
1. Block connectivity from SPM to the storage server where the master domain resides:
'iptables -A OUTPUT -d <storage_server_add> -j DROP' 



Actual results:
Reconstruct is called and blocked in CDA:

2017-01-10 14:08:25,538+02 WARN  [org.ovirt.engine.core.bll.storage.pool.ReconstructMasterDomainCommand] (org.ovirt.thread.pool-6-thread-32) [13f21fbc] Validation of action 'ReconstructMaste
rDomain' failed for user SYSTEM. Reasons: VAR__ACTION__RECONSTRUCT_MASTER,VAR__TYPE__STORAGE__DOMAIN,ACTION_TYPE_FAILED_NO_VDS_IN_POOL

Then, nothing happens with the pool, the DC remains non-responsive. I waited for ~45 minutes.

Expected results:
Reconstruct master should succeed in case there is a domain that can take master.

Additional info:
logs from engine and hypervisor

Comment 1 Liron Aravot 2017-01-10 13:08:18 UTC

The issue here is that after vdsm is restarted by sanlock the engine fails to reestablish the connection to it, so altough the vdsm service is running according to Elad - the engine still fails to connect to it and marks the host as non responsive. 
From the reconstruct flow perspective there is no active host to be used for performing the reconstruct - so the behavior on that aspect is as expected.

Moving to infra for further investigation of why the engine fails to reconnect to the host.

Comment 2 Piotr Kliczewski 2017-01-10 14:22:03 UTC

I do not see attempt to connect to the engine after the restart. The engine did not trigger any communication.

I see the vdsm was up again at:

2017-01-10 14:04:59,238 INFO  (MainThread) [vds] (PID: 2246) I am the actual vdsm 4.20.0-92.gita812c81.el7.centos storage-ge1-vdsm4.qa.lab.tlv.redhat.com (3.10.0-514.el7.x86_64) (vdsm:145)

and I can see only local connections from mom.

The only attempt that I see is at:

2017-01-10 13:56:59,028+02 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [41f4de89] Connecting to /10.35.65.72

I see that irs connection timeout at:

2017-01-10 14:09:01,499+02 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler1) [39f9056d] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
	at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:188) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.irsbroker.GetStoragePoolInfoVDSCommand.executeIrsBrokerCommand(GetStoragePoolInfoVDSCommand.java:32) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand.lambda$executeVDSCommand$0(IrsBrokerCommand.java:95) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy.runInControlledConcurrency(IrsProxy.java:262) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand.executeVDSCommand(IrsBrokerCommand.java:92) [vdsbroker.jar:]

but reconstruction master was run prior to that:

2017-01-10 14:08:25,376+02 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (org.ovirt.thread.pool-6-thread-32) [] Domain '5570ebc3-4507-4808-b767-2c3407b96739:nfs_1' was reported by all hosts in status UP as problematic. Not moving the domain to NonOperational because it is being reconstructed now.
2017-01-10 14:08:25,538+02 WARN  [org.ovirt.engine.core.bll.storage.pool.ReconstructMasterDomainCommand] (org.ovirt.thread.pool-6-thread-32) [13f21fbc] Validation of action 'ReconstructMasterDomain' failed for user SYSTEM. Reasons: VAR__ACTION__RECONSTRUCT_MASTER,VAR__TYPE__STORAGE__DOMAIN,ACTION_TYPE_FAILED_NO_VDS_IN_POOL

and the first higher level timeout occurred at:

2017-01-10 14:06:00,318+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler1) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM host1 command failed: Message timeout which can be caused by communication issues
2017-01-10 14:06:00,321+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler1) [] Command 'SpmStatusVDSCommand(HostName = host1, SpmStatusVDSCommandParameters:{runAsync='true', hostId='446dd414-436e-417b-9cab-401c37fff436', storagePoolId='00000001-0001-0001-0001-000000000311'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues

My question is why there was no attempt to reconnect was it caused by GetStoragePoolInfoVDSCommand holding the lock?

Comment 3 Piotr Kliczewski 2017-01-10 14:22:59 UTC

Just to be sure that we are looking at communication issue or master recovery issue please provide engine logs in debug.

Comment 4 Elad 2017-01-10 14:47:46 UTC

Created attachment 1239125 [details]
logs-engine in DEBUG

Reproduced again, this time engine.log is in DEBUG.


2017-01-10 16:42:42,678+02 WARN  [org.ovirt.engine.core.bll.storage.pool.ReconstructMasterDomainCommand] (org.ovirt.thread.pool-6-thread-27) [15aa4028] Validation of action 'ReconstructMaste
rDomain' failed for user SYSTEM. Reasons: VAR__ACTION__RECONSTRUCT_MASTER,VAR__TYPE__STORAGE__DOMAIN,ACTION_TYPE_FAILED_NO_VDS_IN_POOL

Comment 5 Oved Ourfali 2017-01-10 14:50:38 UTC

Restoring needinfo on Liron.

Comment 6 Piotr Kliczewski 2017-01-10 14:55:36 UTC

Only part of the log is in debug. Please update both places:

      <logger category="org.ovirt" use-parent-handlers="false">
        <level name="DEBUG"/>
        <handlers>

and

      <root-logger>
        <level name="DEBUG"/>
        <handlers>

I see that there was a connection attempt at:

2017-01-10 16:33:59,427+02 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to /10.35.65.72

which implies that the engine never attempted to reconnect.

Please provide the logs whit above settings

Comment 7 Elad 2017-01-10 15:00:53 UTC

Only part of the log is in DEBUG because I configured it to DEBUG 1 ~30 minutes ago.
All in DEBUG:

      <logger category="org.ovirt.engine.ui.frontend.server.gwt.OvirtRemoteLoggingService" use-parent-handlers="false">
        <level name="DEBUG"/>
        <handlers>
          <handler name="UI"/>
        </handlers>
      </logger>
       <logger category="org.ovirt.engine.core.bll">
        <level name="DEBUG"/>
      </logger>
      <logger category="org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresJdbcTemplate">
        <level name="WARN"/>
      </logger>

      <root-logger>
        <level name="DEBUG"/>
        <handlers>
          <handler name="ovirt-logger"/>
          <handler name="SERVER"/>
        </handlers>
      </root-logger>
    </subsystem>

Comment 8 Piotr Kliczewski 2017-01-10 15:44:17 UTC

You are missing debug for category="org.ovirt".

Restoring need info.

Comment 9 Liron Aravot 2017-01-10 16:10:55 UTC

> My question is why there was no attempt to reconnect was it caused by
> GetStoragePoolInfoVDSCommand holding the lock?

Hi Piotr,
can you elaborate? I didn't understand the question completely. The engine may attempt to execute vds commands and fail. When the vdsm service is active again the connection should be restored.

Comment 10 Piotr Kliczewski 2017-01-10 16:20:04 UTC

Liron,

You stated in comment #1 that "the engine fails to reestablish the connection" and I see in the logs that there was no attempt to do so. We log attempt to talk to vdsm like this: Connecting to /10.35.65.72

When the domain was blocked the engine never attempted to reconnect which means that no commands where triggered. My question is why we haven't called anything.

Comment 11 Liron Aravot 2017-01-10 16:37:54 UTC

Piotr,
There are multiple attempts to run vds commands on that host, HostMonitoring executes getStats every few seconds and fail (example on [1] ).


[1]
2017-01-10 14:06:00,372+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler4) [] Command 'GetStatsVDSCommand(HostName = host1, Vd
sIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='446dd414-436e-417b-9cab-401c37fff436', vds='Host[host1,446dd414-436e-417b-9cab-401c37fff436]'})' execution failed
: VDSGenericException: VDSNetworkException: Vds timeout occured
2017-01-10 14:06:00,401+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler4) [] Failed getting vds stats, host='host1'(446dd414-436
e-417b-9cab-401c37fff436): org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Vds timeout occured
2017-01-10 14:06:00,401+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler4) [] Failure to refresh host 'host1' runtime info: VDSGe
nericException: VDSNetworkException: Vds timeout occured
2017-01-10 14:06:00,401+02 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler4) [] Failed to refresh VDS, network error, continuing, vds='host1'(446d
d414-436e-417b-9cab-401c37fff436): VDSGenericException: VDSNetworkException: Vds timeout occured

Comment 12 Piotr Kliczewski 2017-01-10 18:11:58 UTC

Liron,

The command that you mentioned timeout and it was run prior to vdsm being restarted. In this part of the log you just see timeout occurring. In other words engine send a message to vdsm but it was restarted so the response never arrived. By default we have 300 secs timeout which occurred at 14:06:00. This doesn't mean that there was any interaction with vdsm.

When the timeout occurs we close the channel (if not closed already) and will attempt to reconnect once any command is triggered. In this flow we see that there was no attempt to send anything to vdsm.

Comment 13 Liron Aravot 2017-01-10 18:18:58 UTC

Piotr,

The VdsManager is responsible monitor the host and change it's status accordingly.
The VdsManager attempts to execute GetStats/GetCaps every few seconds on the host.
If the vdsm service is running on the host as Elad says, the connection should be restored and GetStats should succeed.

We can see later that those executions still fail-

2017-01-10 16:40:18,366+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler1) [30277379] Command 'GetStatsVDSCommand(HostName = h
ost1, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='446dd414-436e-417b-9cab-401c37fff436', vds='Host[host1,446dd414-436e-417b-9cab-401c37fff436]'})' executio
n failed: VDSGenericException: VDSNetworkException: Vds timeout occured
2017-01-10 16:40:18,366+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler1) [30277379] Failed getting vds stats, host='host1'(446d
d414-436e-417b-9cab-401c37fff436): org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Vds timeout occured

Comment 14 Piotr Kliczewski 2017-01-10 20:23:14 UTC

Liron,

It is not invocation of a command. It is a timeout that iccurred for command which was executed around 300 seconds before.

Comment 15 Liron Aravot 2017-01-10 20:55:32 UTC

Piotr - 

> When the timeout occurs we close the channel (if not closed already) and
> will attempt to reconnect once any command is triggered. In this flow we see
> that there was no attempt to send anything to vdsm.


As you see in my last comment, commands (GetStats/GetCaps) fail to be executed hours later with network exceptions even though the vdsm service was restarted successfully on the host - we need to find out why the connection between the engine and vdsm isn't restored although the service is back up.
I understand that you look for the "Connecting.." message in the logs, the fact that it's not there is the bug here imo, It's need to be investigated why the engine doesn't attemp/succeed to restore the connection - the engine definitely attempts to execute commands on that host.

Comment 16 Piotr Kliczewski 2017-01-10 21:49:19 UTC

I agree that we need to understand why the connection is not recovered. I honestly do not see any attempts to execute commands and only timeouts which are past as network exceptions.

Let's wait for debug logs to confirm. The last provided were missing one category.

Comment 17 Elad 2017-01-11 06:59:21 UTC

Created attachment 1239340 [details]
engine.log in DEBUG (1)

engine.log in DEBUG

2017-01-11 01:09:53,330+02 WARN  [org.ovirt.engine.core.bll.storage.pool.ReconstructMasterDomainCommand] (org.ovirt.thread.pool-6-thread-30) [6e861bf2] Validation of action 'ReconstructMaste
rDomain' failed for user SYSTEM. Reasons: VAR__ACTION__RECONSTRUCT_MASTER,VAR__TYPE__STORAGE__DOMAIN,ACTION_TYPE_FAILED_NO_VDS_IN_POOL

Comment 18 Oved Ourfali 2017-01-11 09:42:49 UTC

Elad - what's the jsonrpc version you're using?

Comment 19 Elad 2017-01-11 10:00:09 UTC

Created attachment 1239397 [details]
engine.log in DEBUG (2)

Oved,

vdsm:
python-anyjson-0.3.3-3.el7ost.noarch
rubygem-json-1.7.7-29.el7.x86_64
python-simplejson-3.5.3-5.el7.x86_64
python-jsonpatch-1.2-2.el7.noarch
python-jsonpointer-1.9-2.el7.noarch
python-jsonschema-2.3.0-1.el7ost.noarch
vdsm-jsonrpc-4.20.0-92.gita812c81.el7.centos.noarch
json-c-0.11-4.el7_0.x86_64
vdsm-yajsonrpc-4.20.0-92.gita812c81.el7.centos.noarch

engine:
json-c-0.11-4.el7_0.x86_64
rubygem-json-1.7.7-29.el7.x86_64
vdsm-jsonrpc-java-1.4.0-1.20170109154425.git44d7998.el7.centos.noarch


Piotr, logs attached.

2017-01-11 11:28:16,038+02 WARN  [org.ovirt.engine.core.bll.storage.pool.ReconstructMasterDomainCommand] (org.ovirt.thread.pool-6-thread-40) [359b5bd8] Validation of action 'ReconstructMaste
rDomain' failed for user SYSTEM. Reasons: VAR__ACTION__RECONSTRUCT_MASTER,VAR__TYPE__STORAGE__DOMAIN,ACTION_TYPE_FAILED_NO_VDS_IN_POOL

Comment 20 Oved Ourfali 2017-01-11 11:05:40 UTC

Elad - can you downgrade the vdsm-jsonrpc-java to 1.3.6 (stop the engine first, of course).
Please report back if it helped.
1.4.0 is the "master" version, and 1.3.6 is the 4.1. I'm interested in knowing whether it happens also on 1.3.6.

Comment 21 Elad 2017-01-11 11:17:56 UTC

Downgrade engine only?

Comment 22 Oved Ourfali 2017-01-11 11:20:51 UTC

Downgrade the vdsm-jsonrpc-java package on the engine machine from 1.4.0 to 1.3.6.

Comment 23 Elad 2017-01-11 12:20:59 UTC

Created attachment 1239416 [details]
downgraded vdsm-json-java

Using vdsm-jsonrpc-java-1.3.6-1.20170111073901.gita5e600c.el7.centos.noarch communication between engine and vdsm gets restored successfully after vdsm service starts. Reconstruct master takes place and succeeds:



2017-01-11 14:12:42,495+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-1) [32cba7d9] Correlation ID: 32cba7d9, Job ID: ce2ee1
c2-7217-431c-a823-077d2956b496, Call Stack: null, Custom Event ID: -1, Message: Reconstruct Master Domain for Data Center dc completed.

Comment 24 Red Hat Bugzilla Rules Engine 2017-01-11 13:04:14 UTC

This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 25 Jiri Belka 2017-01-25 11:47:15 UTC

ok,  ovirt-engine-4.1.0.2-0.2.el7.noarch & vdsm-jsonrpc-java-1.3.7-1.el7ev.noarch

2017-01-25 12:44:24,300+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-50) [70f7ffc0] EVENT_ID: RECONSTRUCT_MASTER_DONE(984), Correlation ID: 70f7ffc0, Job ID: 1ec303d0-7e4d-4302-b67d-ae154a6e07f5, Call Stack: null, Custom Event ID: -1, Message: Reconstruct Master Domain for Data Center Default completed.

Note You need to log in before you can comment on or make changes to this bug.