Bug 1136368 - [hosted-engine-setup] [iSCSI support] Host fenced during deployment using iSCSI
Summary: [hosted-engine-setup] [iSCSI support] Host fenced during deployment using iSCSI
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: oVirt
Classification: Retired
Component: vdsm
Version: 3.5
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.5.1
Assignee: Federico Simoncelli
QA Contact: Gil Klein
URL:
Whiteboard: storage
Depends On:
Blocks: 1193195
 
Reported: 2014-09-02 12:47 UTC by Elad
Modified: 2016-02-10 16:37 UTC
14 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-11-11 09:42:30 UTC
oVirt Team: Storage


Attachments (Terms of Use)
logs from the host (2.90 MB, application/octet-stream)
2014-09-02 12:47 UTC, Elad

Description Elad 2014-09-02 12:47:06 UTC
Created attachment 933757 [details]
logs from the host

Description of problem:
I tried to deploy hosted-engine using iSCSI: I set up the engine VM with the OS and the ovirt-engine installation, then continued the HE deployment. After ~30 minutes the host rebooted.
When the host came back up, the VM wasn't brought up.



(1) Continue setup - engine installation is complete                                                                  
          (2) Power off and restart the VM                                                                              
          (3) Abort setup                                                                
          (1, 2, 3)[1]: 1                                                                                          
[ INFO  ] Engine replied: DB Up!Welcome to Health Status!                                                                         
          Enter the name of the cluster to which you want to add the host (Default) [Default]:                                    
[ INFO  ] Waiting for the host to become operational in the engine. This may take several minutes...       
[ INFO  ] Still waiting for VDSM host to become operational...                                                                  
[ INFO  ] Still waiting for VDSM host to become operational...


Version-Release number of selected component (if applicable):
ovirt-3.5 RC1.1
ovirt-hosted-engine-setup-1.2.0-0.1.master.20140820130713.gitd832f86.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Execute hosted-engine --deploy. Choose iSCSI. 
2. Install OS and ovirt-engine on the VM
3. Continue the HE installation

Actual results:

Continued the deployment:

[ INFO  ] Engine replied: DB Up!Welcome to Health Status!                                                                         
          Enter the name of the cluster to which you want to add the host (Default) [Default]:                                    
[ INFO  ] Waiting for the host to become operational in the engine. This may take several minutes...       
[ INFO  ] Still waiting for VDSM host to become operational...                                                                  
[ INFO  ] Still waiting for VDSM host to become operational...


Sanlock exception in vdsm.log:

Thread-88::DEBUG::2014-09-02 13:52:30,904::domainMonitor::201::Storage.DomainMonitorThread::(_monitorLoop) Unable to release the host id 1 for domain fa98e800-83e4-480c-b769-c48504aa61fd
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 198, in _monitorLoop
    self.domain.releaseHostId(self.hostId, unused=True)
  File "/usr/share/vdsm/storage/sd.py", line 480, in releaseHostId
    self._clusterLock.releaseHostId(hostId, async, unused)
  File "/usr/share/vdsm/storage/clusterlock.py", line 252, in releaseHostId
    raise se.ReleaseHostIdFailure(self._sdUUID, e)
ReleaseHostIdFailure: Cannot release host id: ('fa98e800-83e4-480c-b769-c48504aa61fd', SanlockException(16, 'Sanlock lockspace remove failure', 'Device or resource busy'))
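The errno 16 (EBUSY) in the exception above means sanlock refuses to remove a lockspace that still has active resources. The pattern can be sketched as a retry loop; this is a hypothetical illustration, not vdsm's actual code, and both `release_host_id_with_retry` and the stand-in exception class are invented for the example:

```python
import errno
import time


class SanlockException(Exception):
    """Stand-in for sanlock's exception; carries an errno like the real one."""

    def __init__(self, num, msg, detail):
        super().__init__(num, msg, detail)
        self.errno = num


def release_host_id_with_retry(release_fn, attempts=3, delay=0.0):
    """Retry a lockspace release while sanlock reports EBUSY.

    release_fn performs the actual release and raises SanlockException on
    failure. Returns True on success, False if the lockspace is still busy
    after all attempts; any non-EBUSY error is propagated.
    """
    for _ in range(attempts):
        try:
            release_fn()
            return True
        except SanlockException as e:
            if e.errno != errno.EBUSY:
                raise
            time.sleep(delay)  # lockspace still in use; wait and retry
    return False
```

In the scenario above the lockspace stays busy because the engine VM is still holding resources on it, so no amount of retrying would succeed until the VM releases them.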



After ~30 minutes, the host rebooted.


The VM doesn't start when the host comes back up. I checked the HA agent and broker services; they are inactive.


Expected results:
Host shouldn't be fenced during deployment

Additional info:
logs from the host

Comment 1 Sandro Bonazzola 2014-09-03 12:20:50 UTC
From the attached logs I see that the first execution of the hosted-engine --deploy command was aborted by signal 2 (keyboard interrupt / Ctrl+C) at 2014-09-02 11:43:52.

In the vdsm log at the same time I see:

 Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 198, in _monitorLoop
    self.domain.releaseHostId(self.hostId, unused=True)
  File "/usr/share/vdsm/storage/sd.py", line 480, in releaseHostId
    self._clusterLock.releaseHostId(hostId, async, unused)
  File "/usr/share/vdsm/storage/clusterlock.py", line 252, in releaseHostId
    raise se.ReleaseHostIdFailure(self._sdUUID, e)
 ReleaseHostIdFailure: Cannot release host id: ('1e1212ce-33c4-43b0-ab48-8843794176ad', SanlockException(16, 'Sanlock lockspace remove failure', 'Device or resource busy'))

and that makes sense: the storage is busy because the VM was still running when the setup was killed.

About the second execution: I see garbage in the log file right after 2014-09-02 14:22:34, corresponding to the events I see in message.log:

libvirtError: Requested operation is not valid: cgroup CPU controller is not mounted
 Sep  2 14:22:17 localhost iscsid: cannot make a connection to 10.35.146.129:3260 (-1,101)
 Sep  2 14:22:17 localhost multipathd: 3514f0c5462601612: sdb - path offline
 Sep  2 14:22:17 localhost multipathd: 3514f0c5462601614: sdc - path offline
 Sep  2 14:22:20 localhost iscsid: cannot make a connection to 10.35.146.129:3260 (-1,101)
 Sep  2 14:22:22 localhost multipathd: 3514f0c5462601612: sdb - path offline
 Sep  2 14:22:22 localhost multipathd: 3514f0c5462601614: sdc - path offline
 Sep  2 14:22:23 localhost iscsid: cannot make a connection to 10.35.146.129:3260 (-1,101)
 Sep  2 14:22:26 localhost iscsid: cannot make a connection to 10.35.146.129:3260 (-1,101)
 Sep  2 14:22:27 localhost multipathd: 3514f0c5462601612: sdb - path offline
 Sep  2 14:22:27 localhost multipathd: 3514f0c5462601614: sdc - path offline
 Sep  2 14:22:29 localhost iscsid: cannot make a connection to 10.35.146.129:3260 (-1,101)
 Sep  2 14:22:30 localhost journal: vdsm vm.Vm ERROR vmId=`cec64f28-7e21-42f4-a79d-499fdd0a99ca`::Stats function failed: <AdvancedStatsFunction _sampleCpu at 0x1414510>
 Traceback (most recent call last):
  File "/usr/share/vdsm/virt/sampling.py", line 471, in collect
    statsFunction()
  File "/usr/share/vdsm/virt/sampling.py", line 346, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/virt/vm.py", line 308, in _sampleCpu
    cpuStats = self._vm._dom.getCPUStats(True, 0)
  File "/usr/share/vdsm/virt/vm.py", line 662, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/vdsm/libvirtconnection.py", line 111, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 2016, in getCPUStats
    if ret is None: raise libvirtError ('virDomainGetCPUStats() failed', dom=self)
 libvirtError: Requested operation is not valid: cgroup CPUACCT controller is not mounted
 Sep  2 14:22:30 localhost journal: vdsm vm.Vm ERROR vmId=`cec64f28-7e21-42f4-a79d-499fdd0a99ca`::Stats function failed: <AdvancedStatsFunction _sampleCpuTune at 0x1410650>
 Traceback (most recent call last):
  File "/usr/share/vdsm/virt/sampling.py", line 471, in collect
    statsFunction()
  File "/usr/share/vdsm/virt/sampling.py", line 346, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/virt/vm.py", line 354, in _sampleCpuTune
    infos = self._vm._dom.schedulerParameters()
  File "/usr/share/vdsm/virt/vm.py", line 662, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/vdsm/libvirtconnection.py", line 111, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 2134, in schedulerParameters
    if ret is None: raise libvirtError ('virDomainGetSchedulerParameters() failed', dom=self)
 libvirtError: Requested operation is not valid: cgroup CPU controller is not mounted
 Sep  2 14:22:32 localhost iscsid: cannot make a connection to 10.35.146.129:3260 (-1,101)
 Sep  2 14:22:32 localhost multipathd: 3514f0c5462601612: sdb - path offline
 Sep  2 14:22:32 localhost multipathd: 3514f0c5462601614: sdc - path offline
 Sep  2 14:22:35 localhost iscsid: cannot make a connection to 10.35.146.129:3260 (-1,101)
 Sep  2 14:22:37 localhost multipathd: 3514f0c5462601612: sdb - path offline
 Sep  2 14:22:37 localhost multipathd: 3514f0c5462601614: sdc - path offline
 Sep  2 14:22:38 localhost systemd: Starting Cleanup of Temporary Directories...
 Sep  2 14:22:38 localhost systemd: Started Cleanup of Temporary Directories.
 Sep  2 14:22:38 localhost iscsid: cannot make a connection to 10.35.146.129:3260 (-1,101)

So the fencing doesn't seem related to the hosted-engine setup, but to something that happened at lower levels. Moving this to the storage team.

Comment 2 Federico Simoncelli 2014-09-12 15:20:28 UTC
The logs are full of:

 iscsid: cannot make a connection to 10.35.146.129:3260 (-1,101)
 multipathd: 3514f0c5462601612: sdb - path offline
 multipathd: 3514f0c5462601614: sdc - path offline
 iscsid: cannot make a connection to 10.35.146.129:3260

Sandro, is the setup changing the iptables rules?
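The repeated `(-1,101)` in the iscsid lines is errno 101 (ENETUNREACH): the host lost its route to the portal, which a firewall rewrite would explain. A quick stdlib check of portal reachability, independent of iscsid (a diagnostic sketch; the portal address comes from the logs above):

```python
import socket


def portal_reachable(host, port=3260, timeout=5.0):
    """Return True if a TCP connection to the iSCSI portal succeeds.

    A refused or unreachable connection here reproduces the iscsid
    symptom above without involving iscsid itself.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return False
    sock.close()
    return True


# e.g. portal_reachable("10.35.146.129")  # the portal seen in the logs
```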

Comment 3 Sandro Bonazzola 2014-09-18 08:17:44 UTC
Yes, it changes the iptables rules: when adding the host to the engine through the SDK, the setup uses ovirt-host-deploy, which rewrites the iptables rules on the host according to the rules stored in the engine DB, just as for any other host.
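If the rewrite replaces the host firewall wholesale while an iSCSI session is live, the new INPUT chain must still accept established connections or the session drops. A hypothetical sketch of that check; the rule strings are modeled on a typical oVirt firewall template, not taken from this host:

```python
def keeps_established_sessions(rules):
    """Return True if an iptables-save style INPUT ruleset accepts packets
    belonging to already-established connections (e.g. a live iSCSI session
    on tcp/3260)."""
    return any("-A INPUT" in r and "ESTABLISHED" in r and "ACCEPT" in r
               for r in rules)


# Modeled on a typical oVirt host firewall template (illustrative only)
example_rules = [
    "-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT",
    "-A INPUT -p tcp -m tcp --dport 54321 -j ACCEPT",  # vdsm
    "-A INPUT -j REJECT --reject-with icmp-host-prohibited",
]
```

Without the first rule, the REJECT at the end would cut the live iSCSI connection the moment the new ruleset is loaded.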

Comment 4 Sandro Bonazzola 2014-10-01 09:09:24 UTC
I can't reproduce it with the following configuration:

1) An EL7 VM running tgtd, exposing 1 LUN, at 192.168.1.108
2) An ADSL modem/router as default GW, 192.168.1.1
3) A bare metal host running EL 6.5

The same configuration, but with the bare metal host running EL 7 with SELinux disabled due to bug 1142709.

All hosts / VMs were freshly installed. I'm not using multipath; could this be related to that?

Comment 6 Federico Simoncelli 2014-11-11 09:27:53 UTC
(In reply to Sandro Bonazzola from comment #4)
> All hosts / VMs were freshly installed. I'm not using multipath;
> could this be related to that?

I don't think it is (since we always use multipath, even when only one path is available), but you can easily find out by adding another path.

Comment 7 Sandro Bonazzola 2014-11-11 09:42:30 UTC
Closing since I don't have enough information to reproduce the issue.
Please reopen if you're able to reproduce it, and provide detailed reproduction steps.

