Bug 1326009

Summary: SPM host becomes Non-Responsive when the connection to the Master Storage Domain is blocked
Product: [oVirt] ovirt-engine
Reporter: Eyal Shenitzky <eshenitz>
Component: BLL.Infra
Assignee: Eli Mesika <emesika>
Status: CLOSED NOTABUG
QA Contact: Pavel Stehlik <pstehlik>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.6.5.1
CC: amureini, bugs, eshenitz, laravot, mperina, nsoffer, pkliczew, ratamir, rhodain
Target Milestone: ---
Keywords: Regression, Reopened
Target Release: ---
Flags: eshenitz: planning_ack?, eshenitz: devel_ack?, eshenitz: testing_ack?
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-06-13 08:59:58 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments:
- engine and vdsm logs
- block connection to non-master domain - engine and vdsm log
- 05/05/16 - engine and vdsm logs
- Reproducing logs
- journalctl from dev-21
- reattach engine and vdsm logs

Description Eyal Shenitzky 2016-04-11 15:06:57 UTC
Created attachment 1146014 [details]
engine and vdsm logs

Description of problem:

In an environment with more than one host and several storage domains,
when blocking the connection between the SPM host and the Master Storage Domain via iptables,
the engine fails to connect to the host and the host becomes Non-Responsive, instead of Non-Operational as expected.
 

Version-Release number of selected component (if applicable):

engine version - 3.6.5.1-0.1.el6
vdsm version - 4.17.25-0.el7ev.noarch


How reproducible:
100%

Steps to Reproduce:
1. Block the connection between the SPM host and the Master Storage Domain via iptables
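A sketch of the kind of iptables rule such a reproduction typically uses. The storage server address is a placeholder (taken from an NFS path in the attached logs), not a prescription, and the rule is only printed here, since applying it requires root on the SPM host:

```shell
# Hypothetical reproduction helper: build (but do not apply) the DROP rule
# that blocks the SPM host's traffic to the storage server.
STORAGE_IP="10.35.65.25"                                  # placeholder address
BLOCK_RULE="iptables -A OUTPUT -d ${STORAGE_IP} -j DROP"  # drop traffic to storage
UNDO_RULE="iptables -D OUTPUT -d ${STORAGE_IP} -j DROP"   # remove the rule afterwards
echo "block: ${BLOCK_RULE}"
echo "undo:  ${UNDO_RULE}"
```

Running the printed `block` command as root on the SPM host cuts only the host-to-storage path, leaving engine <-> vdsm traffic untouched, which is the scenario under dispute in this bug.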

Actual results:
Host becomes Non-Responsive

Expected results:
Host should become Non-Operational and release the SPM role

Additional info:
The following exception was raised (vdsm and engine logs attached):
2016-04-11 17:36:48,240 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-46) [53ba98a5] Failure to refresh Vds runtime info: org.ovirt.vdsm.jsonrpc.client.ClientConnectionException: Connection timeout
2016-04-11 17:36:48,240 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-46) [53ba98a5] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: org.ovirt.vdsm.jsonrpc.client.ClientConnectionException: Connection timeout
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.createNetworkException(VdsBrokerCommand.java:157) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:120) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:65) [vdsbroker.jar:]
        at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:]
        at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:467) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsManager.refreshCapabilities(VdsManager.java:652) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:119) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.HostMonitoring.refresh(HostMonitoring.java:84) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:227) [vdsbroker.jar:]
        at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source) [:1.7.0_99]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_99]
        at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_99]
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]
Caused by: org.ovirt.vdsm.jsonrpc.client.ClientConnectionException: Connection timeout
        at org.ovirt.vdsm.jsonrpc.client.reactors.stomp.SSLStompClient.waitForConnect(SSLStompClient.java:107) [vdsm-jsonrpc-java-client.jar:]
        at org.ovirt.vdsm.jsonrpc.client.reactors.stomp.SSLStompClient.sendMessage(SSLStompClient.java:77) [vdsm-jsonrpc-java-client.jar:]
        at org.ovirt.vdsm.jsonrpc.client.JsonRpcClient.call(JsonRpcClient.java:73) [vdsm-jsonrpc-java-client.jar:]
        at org.ovirt.engine.core.vdsbroker.jsonrpc.FutureMap.<init>(FutureMap.java:68) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.jsonrpc.JsonRpcVdsServer.getCapabilities(JsonRpcVdsServer.java:268) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand.executeVdsBrokerCommand(GetCapabilitiesVDSCommand.java:15) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:110) [vdsbroker.jar:]

Comment 1 Allon Mureinik 2016-04-11 15:55:00 UTC
From the logs, it seems as though the connection between the HOST and ENGINE is timing out, which would explain the Non-Responsive status. I don't see, offhand, any evidence this has anything to do with blocking the master domain connection.

Oved/Dan - any insight? Is this related to the recent issues we've seen with networking timing out when adding a host (I know the two operations aren't related - but perhaps there's a common root cause)?

Comment 2 Oved Ourfali 2016-04-12 07:15:21 UTC
Piotr - can you take a look?

Comment 3 Piotr Kliczewski 2016-04-12 07:46:17 UTC
It looks like we could be saturating the network, because the engine disconnected due to:

2016-04-11 17:11:47,193 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-37) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM host_mixed_1 command failed: Heartbeat exeeded

and on vdsm side I can see:

d/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000 (cwd None)
mailbox.SPMMonitor::DEBUG::2016-04-11 17:11:42,082::storage_mailbox::735::Storage.Misc.excCmd::(_checkForMail) SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.0222868 s, 45.9 MB/s\n'; <rc> = 0
Thread-37::DEBUG::2016-04-11 17:11:42,136::fileSD::173::Storage.Misc.excCmd::(getReadDelay) /usr/bin/taskset --cpu-list 0-1 /usr/bin/dd if=/rhev/data-center/mnt/glusterSD/10.35.65.25:_storage__local__ge3__volume__0/b5c69862-2087-43df-a162-502ce5e331e4/dom_md/metadata iflag=direct of=/dev/null bs=4096 count=1 (cwd None)
Thread-37::DEBUG::2016-04-11 17:11:42,143::fileSD::173::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n357 bytes (357 B) copied, 0.00012556 s, 2.8 MB/s\n'; <rc> = 0
Thread-62::DEBUG::2016-04-11 17:11:42,947::blockSD::422::Storage.Misc.excCmd::(getReadDelay) /usr/bin/taskset --cpu-list 0-1 /usr/bin/dd if=/dev/cb7ccb5b-5684-465e-98b2-31b0c4c3924c/metadata iflag=direct of=/dev/null bs=4096 count=1 (cwd None)
Thread-62::DEBUG::2016-04-11 17:11:42,954::blockSD::422::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '1+0 records in\n1+0 records out\n4096 bytes (4.1 kB) copied, 0.000662718 s, 6.2 MB/s\n'; <rc> = 0
mailbox.SPMMonitor::DEBUG::2016-04-11 17:11:44,095::storage_mailbox::735::Storage.Misc.excCmd::(_checkForMail) /usr/bin/taskset --cpu-list 0-1 dd if=/rhev/data-center/48028c56-65ab-4bd4-87c8-2003444251b7/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000 (cwd None)
mailbox.SPMMonitor::DEBUG::2016-04-11 17:11:44,121::storage_mailbox::735::Storage.Misc.excCmd::(_checkForMail) SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.0182608 s, 56.1 MB/s\n'; <rc> = 0
Thread-64::DEBUG::2016-04-11 17:11:44,807::fileSD::173::Storage.Misc.excCmd::(getReadDelay) /usr/bin/taskset --cpu-list 0-1 /usr/bin/dd if=/rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Storage_export__domain__storage__local__ge3/cfded165-c52f-456e-8108-3dd3a91f27b1/dom_md/metadata iflag=direct of=/dev/null bs=4096 count=1 (cwd None)
Thread-64::DEBUG::2016-04-11 17:11:44,815::fileSD::173::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n374 bytes (374 B) copied, 0.000586225 s, 638 kB/s\n'; <rc> = 0

and since 

JsonRpc (StompReactor)::ERROR::2016-04-11 17:11:47,090::betterAsyncore::124::vds.dispatcher::(recv) SSL error during reading data: unexpected eof

we were unable to connect to vdsm, and there were no more connection attempts in the vdsm log. From the engine's perspective, we simply fail to connect.

Comment 4 Oved Ourfali 2016-04-12 07:48:04 UTC
Eyal - are you sure that you only blocked this and not also the communication with the host?
Can you reproduce it and give us access to examine this?

Comment 5 Liron Aravot 2016-04-12 07:50:34 UTC
1. How come there are no more connection attempts?
2. Is there some connection between blocking the master domain and the error? It seems the error only appeared when the requested operation was expected to fail.
3. Eyal, are you sure that you didn't block the engine/host connection? Although it doesn't seem so from the logs.
4. Eyal, did this issue happen only once, or does it reproduce consistently?

Comment 6 Piotr Kliczewski 2016-04-12 07:52:41 UTC
(In reply to Liron Aravot from comment #5)
> 1. How come there are no more connection attempts?

From the engine perspective there are numerous attempts, but none of them reach vdsm. All of them end with org.ovirt.vdsm.jsonrpc.client.ClientConnectionException: Connection failed.

> 2. Is there some weird connection between blocking the master domain and the
> error? as it seems the error only appeared when there was supposed to be an
> error in performing the requested operation.
> 3. Eyal, are you sure that you didn't block the engine/host? although it
> doesn't seem so from the logs
> 3. Eyal, does this issue happened only once or does it reproduce constantly?

Comment 7 Piotr Kliczewski 2016-04-12 07:53:09 UTC
Restoring needinfo.

Comment 8 Eyal Shenitzky 2016-04-13 07:50:36 UTC
(In reply to Liron Aravot from comment #5)
> 1. How come there are no more connection attempts?
> 2. Is there some weird connection between blocking the master domain and the
> error? as it seems the error only appeared when there was supposed to be an
> error in performing the requested operation.
> 3. Eyal, are you sure that you didn't block the engine/host? although it
> doesn't seem so from the logs
> 3. Eyal, does this issue happened only once or does it reproduce constantly?

Yes, I only blocked the connection between the SPM and the master storage domain.
And yes, this issue reproduces consistently, as written above.

Comment 9 Eyal Shenitzky 2016-04-13 07:53:06 UTC
The same behavior also occurs when blocking the connection between the SPM and a non-master domain.

Comment 10 Eyal Shenitzky 2016-04-13 07:58:09 UTC
Created attachment 1146732 [details]
block connection to non-master domain - engine and vdsm log

Comment 11 Piotr Kliczewski 2016-04-13 09:12:00 UTC
From the old logs I can see that the engine lost its connection with vdsm at some point and was never able to talk to it again.

How did you block the connection?

Comment 12 Eyal Shenitzky 2016-04-13 10:20:23 UTC
(In reply to Piotr Kliczewski from comment #11)
> From the old logs I can see that engine lost connection with vdsm at some
> point and it never was able to talk to it again.
> 
> How did you block the connection?

Via iptables

Comment 13 Piotr Kliczewski 2016-04-13 11:37:07 UTC
Please make sure that you do not remove the rule which is added during vdsm installation, and retest.

If that still makes the engine <-> vdsm connection go away, please post the iptables config here, specifying which rules are for which host, without providing any IPs.
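A minimal sketch of one way to post the rules without IPs, as requested: pipe the rule listing through sed to mask dotted-quad addresses. The here-doc stands in for real output (listing live rules with `iptables -S` needs root), and the sample rule is hypothetical:

```shell
# Mask every IPv4 address in a rule listing so the config can be shared
# publicly. The here-doc below is a canned sample; on a real host, feed
# the function from `iptables -S` instead.
mask_ips() {
    sed -E 's/([0-9]{1,3}\.){3}[0-9]{1,3}/x.x.x.x/g'
}
mask_ips <<'EOF'
-A OUTPUT -d 10.35.65.25/32 -j DROP
EOF
```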

Comment 14 Piotr Kliczewski 2016-04-13 13:04:11 UTC
Nir,

I wonder whether storage code could make the vdsm process get stuck. I talked to Eyal offline, and it seems that we do not block engine <-> vdsm communication, but vdsm still does not send any data, and the missed heartbeat causes the engine to drop the connection.

Comment 15 Oved Ourfali 2016-04-18 11:32:29 UTC
Nir - can you answer?

Comment 16 Nir Soffer 2016-04-18 19:22:51 UTC
(In reply to Piotr Kliczewski from comment #14)
> Nir,
> 
> I wonder whether storage code could make vdsm process to stuck. I talked to
> Eyal offline and it seems that we do not block engine <-> vdsm communication
> but still vdsm do not sends any data and heartbeat causes engine to drop the
> connection.

No, I don't think any storage operation can cause vdsm to stop sending heartbeats.

I would add detailed logs to understand what the reactor thread is doing: when
it wakes up, and how much time each operation takes (e.g. writing a message).

Also, can we reproduce it with standard ssl or only with m2crypto?

Comment 17 Piotr Kliczewski 2016-04-19 06:42:17 UTC
It looks like the missing heartbeats and the blocked storage domain are related.

Eyal, please reproduce the issue and let us access it.

Comment 18 Martin Perina 2016-05-03 08:13:43 UTC
Any progress reproducing the issue? Thanks

Comment 19 Eyal Shenitzky 2016-05-05 07:27:47 UTC
Yes, I reproduced the bug again and attached the relevant logs (vdsm and engine)

Comment 20 Eyal Shenitzky 2016-05-05 07:28:48 UTC
Created attachment 1154112 [details]
05/05/16-engine vdsm logs

Comment 21 Oved Ourfali 2016-05-05 07:29:21 UTC
Can we access, as Piotr requested?

Comment 22 Eyal Shenitzky 2016-05-05 08:25:34 UTC
(In reply to Oved Ourfali from comment #21)
> Can we access, as Piotr requested?

What do you mean by access?

Comment 23 Oved Ourfali 2016-05-05 08:27:48 UTC
(In reply to Eyal Shenitzky from comment #22)
> (In reply to Oved Ourfali from comment #21)
> > Can we access, as Piotr requested?
> 
> What do you mean by access?

Access the environment.

Comment 24 Piotr Kliczewski 2016-05-05 09:32:59 UTC
Checking the logs I see that vdsm was stopped:

MainThread::DEBUG::2016-05-05 10:14:47,395::vdsm::72::vds::(sigtermHandler) Received signal 15

and as a result engine lost connection:

2016-05-05 10:15:13,423 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler_Worker-52) [] Command 'GetAllVmStatsVDSCommand(HostName = host_mixed_3, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='da92e664-deeb-4111-9d42-98a56c7a594b', vds='Host[host_mixed_3,da92e664-deeb-4111-9d42-98a56c7a594b]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded

In the logs I can see that vdsm was stopped 3 times. Communication was restored when vdsm was up again. Why was vdsm restarted?

Comment 25 Eyal Shenitzky 2016-05-05 11:57:19 UTC
(In reply to Piotr Kliczewski from comment #24)
> Checking the logs I see that vdsm was stopped:
> 
> MainThread::DEBUG::2016-05-05 10:14:47,395::vdsm::72::vds::(sigtermHandler)
> Received signal 15
> 
> and as a result engine lost connection:
> 
> 2016-05-05 10:15:13,423 ERROR
> [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand]
> (DefaultQuartzScheduler_Worker-52) [] Command
> 'GetAllVmStatsVDSCommand(HostName = host_mixed_3,
> VdsIdAndVdsVDSCommandParametersBase:{runAsync='true',
> hostId='da92e664-deeb-4111-9d42-98a56c7a594b',
> vds='Host[host_mixed_3,da92e664-deeb-4111-9d42-98a56c7a594b]'})' execution
> failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded
> 
> In the logs I can see that vdsm was stopped 3 times. Communication was
> restored when vdsm was Up again. Why we restarted vdsm

vdsm restarted 3 times before the engine set the host as Non-Responsive

Comment 26 Piotr Kliczewski 2016-05-05 12:51:02 UTC
Please see the time when it occurred:

Vdsm Restart:             2016-05-05 10:14:47,395
Engine lost connectivity: 2016-05-05 10:15:13,423

The restarts were the direct cause of the issue, at least according to the logs.

Comment 27 Piotr Kliczewski 2016-05-05 13:30:32 UTC
Please retest without restarting vdsm

Comment 28 Eyal Shenitzky 2016-05-05 13:33:58 UTC
 (In reply to Piotr Kliczewski from comment #27)
> Please retest without restarting vdsm

I'm not restarting vdsm; I only block the connection between the SPM host and
the Master Storage Domain.
The restart of vdsm is caused by the engine, not done manually by me.

Comment 29 Martin Perina 2016-05-05 16:08:11 UTC
Before deep diving into the logs, are you sure that the connection between the engine and VDSM on the SPM host was not affected by blocking the connection between VDSM and the master storage domain? Because from the description, that seems to me like the most probable cause for the SPM host to become NonResponsive ...

Comment 30 Eyal Shenitzky 2016-05-06 08:45:45 UTC
(In reply to Martin Perina from comment #29)
> Before deep diving into the logs, are you sure that connection between
> engine and VDSM on SPM host was not affected by blocking connection between
> VDSM and master storage domain? Because from the description it seems to me
> like most probable cause for the SPM host to became NonResponsive ...

It's the other way around; that is the bug.
What I said in my response is that I didn't block or interfere with the connection
between the host and the engine; I blocked only the connection between the SPM and the master storage domain.

Comment 32 Martin Perina 2016-05-06 12:38:58 UTC
Created attachment 1154614 [details]
Reproducing logs

Comment 33 Piotr Kliczewski 2016-05-06 12:50:54 UTC
This is strange because I see:

Vdsm restart:
MainThread::DEBUG::2016-05-06 12:31:14,597::vdsm::72::vds::(sigtermHandler) Received signal 15

Engine lost connection:
2016-05-06 12:31:39,857 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-79) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM command failed: Heartbeat exeeded

and the connection was recovered at:

2016-05-06 12:32:30,903 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to /10.34.63.152

and later I can see:

2016-05-06 12:34:57,774 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-10) [7fda609e] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Status of host dev-21 was set to Up.

The question is why vdsm was restarted.

Comment 34 Martin Perina 2016-05-06 12:58:10 UTC
So I was able to reproduce it (logs attached) on the latest 3.6.6 (including changes using vdsm-jsonrpc-java 1.1.12) using 2 hosts (dev-18, dev-21) in a cluster with shared NFS storage on a 3rd server (str-02).

Flow:

1. Make dev-21 SPM and block access to str-02 only (engine can connect to the dev-21 host without any issues)

2. After a while, the following message appeared in the log (IMO there's no issue so far):

    2016-05-06 12:30:38,548 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-34) [] domain '4b204a94-88ae-4bae-80ba-32e5213023ff:nfs01' in problem. vds: 'dev-21'

3. Unfortunately, circa 1 minute later a heartbeat exceeded error message was raised during GetStatsVDSCommand:

    2016-05-06 12:31:39,891 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-45) [] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded


4. Around the same time, another heartbeat exceeded error message was raised during GetStoragePoolInfoVDSCommand:

5. After these 2 errors, we changed the dev-21 host status to Connecting

6. Without any intervention (access to storage domains is still blocked), host dev-21 became Up again 3 minutes later:

    2016-05-06 12:34:57,774 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-10) [7fda609e] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Status of host dev-21 was set to Up.

So it seems that blocking access to the master storage domain is somehow connected to VDSM stopping responding to communication from the engine.
Piotr, could you please take a look?

Comment 35 Piotr Kliczewski 2016-05-06 13:07:03 UTC
We need to understand the cause of vdsm service restart at 2016-05-06 12:31:14,597.

Comment 36 Martin Perina 2016-05-06 13:20:50 UTC
Created attachment 1154647 [details]
journalctl from dev-21

So from journalctl content it seems that VDSM was killed by systemd:

May 06 12:31:54 dev-21.rhev.lab.eng.brq.redhat.com systemd[1]: vdsmd.service: main process exited, code=killed, status=9/KILL

Comment 37 Piotr Kliczewski 2016-05-06 13:28:31 UTC
The signal was received at 12:31:14; systemd could just be logging the process being down.

Here is an interesting entry that I found in journalctl:

May 06 12:31:14 dev-21.rhev.lab.eng.brq.redhat.com sanlock[19820]: 2016-05-06 12:31:14+0200 5630 [19820]: s2 check_our_lease failed 80
May 06 12:31:14 dev-21.rhev.lab.eng.brq.redhat.com sanlock[19820]: 2016-05-06 12:31:14+0200 5630 [19820]: s2 kill 20012 sig 15 count 1
May 06 12:31:14 dev-21.rhev.lab.eng.brq.redhat.com sanlock[19820]: 2016-05-06 12:31:14+0200 5631 [19820]: s2 kill 20012 sig 15 count 2

Nir, is there any functionality in sanlock which could send the signal?

Comment 38 Nir Soffer 2016-05-06 14:01:29 UTC
(In reply to Piotr Kliczewski from comment #37)
> Signal was received at 12:31:14 and systemd could just log process being
> down.
> 
> Here is interesting entry that i found in journalctl:
> 
> May 06 12:31:14 dev-21.rhev.lab.eng.brq.redhat.com sanlock[19820]:
> 2016-05-06 12:31:14+0200 5630 [19820]: s2 check_our_lease failed 80
> May 06 12:31:14 dev-21.rhev.lab.eng.brq.redhat.com sanlock[19820]:
> 2016-05-06 12:31:14+0200 5630 [19820]: s2 kill 20012 sig 15 count 1
> May 06 12:31:14 dev-21.rhev.lab.eng.brq.redhat.com sanlock[19820]:
> 2016-05-06 12:31:14+0200 5631 [19820]: s2 kill 20012 sig 15 count 2
> 
> Nir, is there any functionality in sanlock which could send the signal?

Sure, if you block the master storage domain on the spm host, sanlock cannot
renew the host lease. If the lease cannot be renewed for 80 seconds, sanlock
will kill vdsm, since vdsm is the owner of the spm resource.

Becoming unresponsive is expected when blocking the master domain on the spm
host.

When vdsm is restarted, the host should become up again, but it will not be
able to become the spm again since the master domain is blocked.

I think the correct test would be to block access to some storage domain on the
other host (non-spm), which should make this host non-operational.
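The timing behind the "check_our_lease failed 80" journal entry above can be sketched from sanlock's documented defaults. A rough sketch only; the constants below are sanlock's default values, not numbers read from this bug's logs:

```shell
# sanlock continuously renews the host lease against the storage. With the
# default 10-second i/o timeout, the lease is treated as failed after 8
# renewal intervals, at which point sanlock sends SIGTERM to the process
# owning resources on the lockspace (vdsm, as the SPM resource owner).
IO_TIMEOUT=10        # seconds (sanlock default i/o timeout)
FAIL_INTERVALS=8     # renewal intervals before the lease is declared failed
LEASE_FAIL=$((IO_TIMEOUT * FAIL_INTERVALS))
echo "lease considered failed after ${LEASE_FAIL}s"
```

This matches the observed timeline: access blocked around 12:30, SIGTERM to vdsm (pid 20012) at 12:31:14, roughly 80 seconds later.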

Comment 39 Martin Perina 2016-05-06 14:27:00 UTC
I just confirmed behaviour Nir described in Comment 38: 

1. If you block access to storage domain on non-SPM host, then the host is moved to NonOperational after some time (without any communication issues raised).

2. If you block access to a storage domain for the SPM host, then VDSM on the host is killed by sanlock (which causes communication issues on the engine side, and the host may change status to Connecting/NonResponsive), but afterwards the host is marked as NonOperational and a different host is selected as SPM

I'm going to close it as NOTABUG.

Comment 40 Liron Aravot 2016-05-08 13:50:23 UTC
Martin,
the problem here, IIUC, is that the host stays unresponsive although vdsm is restarted (and therefore the engine-host communication should be restored); the host should become non-operational and not remain non-responsive.

I might have a lead on this issue: on my setup (engine master) I restarted vdsm during the synchronous part of an operation's execution. In that scenario the communication wasn't restored either, until I restarted the engine. If sanlock fences vdsm during the synchronous part of an operation in this bug, it might be the same.

Comment 41 Liron Aravot 2016-05-08 14:03:07 UTC
Ignore the second part of my comment; a different problem led to that in that case.

Comment 42 Piotr Kliczewski 2016-05-08 19:16:18 UTC
Liron,
As far as I understand, Martin's reproduction of the issue proved exactly what you stated in the first part of your comment.

Comment 43 Martin Perina 2016-05-09 08:48:04 UTC
> Martin,
> the problem here iiuc is that the host stays unresponsive although vdsm is 
> restarted (and therefore the engine-host communication should be restored), 
> therefore the host should become non-operational and not remain non responsive.

Exactly as Piotr mentioned above: when VDSM was killed by sanlock, we lost the connection to the host, so the host status was changed to Connecting. When VDSM was restarted, the host status was changed to Up, InitVdsOnUp was called, and during it we found an issue connecting to the storage domain, so we changed the status to NonOperational.

If the VDSM restart takes more time, the engine can change the host status from Connecting to NotResponding and even start fencing the host, but that's how it is when we lose the connection to VDSM.

Comment 44 Liron Aravot 2016-05-09 08:59:16 UTC
Thanks Piotr/Martin-
this bug is about the host remaining permanently in Non-Responsive status.
If you and Eyal failed to reproduce it and the log doesn't provide any more info, I guess it can be closed as WORKSFORME.

Comment 45 Eyal Shenitzky 2016-05-09 11:24:41 UTC
(In reply to Martin Perina from comment #39)
> I just confirmed behaviour Nir described in Comment 38: 
> 
> 1. If you block access to storage domain on non-SPM host, then the host is
> moved to NonOperational after some time (without any communication issues
> raised).
> 
> 2. If you block access to storage domain for SPM host, then VDSM on the host
> is killed by sanlock (which causes communication issues on engine side and
> host may change status to Connecting/NonResponsive), but afterwards host is
> marked as NonOperational and different host is selected as SPM
> 
> I'm going to close it as NOTABUG.

 (In reply to Liron Aravot from comment #44)
> Thanks Piotr/Martin-
> this bug is about the host remaining permanently in Non Responsive status.
> If you and Eyal failed to reproduce and the log doesn't provide any more
> info - i guess it can be closed as WORKSFORME.


The bug reproduces every time. The question here is whether the SPM host should become
Non-Responsive when the connection to the Master Storage Domain is blocked, or move to Non-Operational, which I think is the expected result.

Comment 46 Liron Aravot 2016-05-09 11:56:36 UTC
The host should end up in Non-Operational status; if at some phase it is non-responsive that's fine, but if it *remains* non-responsive there is an issue.

Comment 47 Eyal Shenitzky 2016-05-09 14:51:24 UTC
According to all of the above, this is definitely a bug.
It reproduces every time.
I'm not sure why this issue was closed as NOTABUG.

Comment 49 Martin Perina 2016-06-01 08:47:06 UTC
I'm unable to reproduce your error. In my environment with 2 hosts and NFS storage, if you block access from the SPM (host1) to the NFS storage, then after a while host1 always ends up in the NonOperational state and host2 is selected as SPM.

Eyal, could you please provide access to your environment where you can reproduce it? If so, Eli, could you please investigate?

Thanks

Comment 50 Eli Mesika 2016-06-01 09:01:05 UTC
Hi,

After talking with Eyal: the host is moved to non-responding for 20 minutes, then it moves to non-operational, and then the SPM election occurs and moves the SPM role to the other host.

So what we have to investigate here is the long (20 min) time in the non-responding state.

Comment 51 Eyal Shenitzky 2016-06-01 10:45:33 UTC
Created attachment 1163576 [details]
reattach engine and vdsm logs

Comment 52 Eli Mesika 2016-06-02 13:28:29 UTC
As far as I can see from the log, the reason this took 20 minutes is that the host is switching back and forth between the non-responding and Connecting states while the engine tries to stop the host from being SPM, but when the state is 'Connecting' this operation is ignored.
There are no fencing scenarios here, since the host's power management option is disabled according to the logs.

Comment 53 Eli Mesika 2016-06-02 13:58:46 UTC
Following comment 50 and comment 52, this seems to be NOTABUG.
Please let us know if you want to reproduce the scenario with power management configured correctly on the host.
Otherwise, this bug will be closed as NOTABUG.

Comment 54 Allon Mureinik 2016-06-06 11:48:38 UTC
*** Bug 1343011 has been marked as a duplicate of this bug. ***

Comment 55 Eyal Shenitzky 2016-06-13 08:38:31 UTC
Fine by me, close as NOTABUG

Comment 56 Martin Perina 2016-06-13 08:59:58 UTC
If power management is correctly configured on the hosts, then the host (which is the SPM and whose connection to the master storage domain is blocked) is properly fenced, and after that a new SPM is selected as expected.