Created attachment 933757 [details]
logs from the host

Description of problem:
I tried to deploy hosted-engine using iSCSI, and set up the engine VM with the OS and ovirt-engine installation. After that, I continued the HE deployment, and after ~30 minutes the host rebooted. When the host started, the VM wasn't brought up.

          (1) Continue setup - engine installation is complete
          (2) Power off and restart the VM
          (3) Abort setup
          (1, 2, 3)[1]: 1
[ INFO ] Engine replied: DB Up!Welcome to Health Status!
         Enter the name of the cluster to which you want to add the host (Default) [Default]:
[ INFO ] Waiting for the host to become operational in the engine. This may take several minutes...
[ INFO ] Still waiting for VDSM host to become operational...
[ INFO ] Still waiting for VDSM host to become operational...

Version-Release number of selected component (if applicable):
ovirt-3.5 RC1.1
ovirt-hosted-engine-setup-1.2.0-0.1.master.20140820130713.gitd832f86.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Execute hosted-engine --deploy. Choose iSCSI.
2. Install the OS and ovirt-engine on the VM
3. Continue the HE installation

Actual results:
Continued the deployment:

[ INFO ] Engine replied: DB Up!Welcome to Health Status!
         Enter the name of the cluster to which you want to add the host (Default) [Default]:
[ INFO ] Waiting for the host to become operational in the engine. This may take several minutes...
[ INFO ] Still waiting for VDSM host to become operational...
[ INFO ] Still waiting for VDSM host to become operational...

Sanlock exception in vdsm.log:

Thread-88::DEBUG::2014-09-02 13:52:30,904::domainMonitor::201::Storage.DomainMonitorThread::(_monitorLoop) Unable to release the host id 1 for domain fa98e800-83e4-480c-b769-c48504aa61fd
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 198, in _monitorLoop
    self.domain.releaseHostId(self.hostId, unused=True)
  File "/usr/share/vdsm/storage/sd.py", line 480, in releaseHostId
    self._clusterLock.releaseHostId(hostId, async, unused)
  File "/usr/share/vdsm/storage/clusterlock.py", line 252, in releaseHostId
    raise se.ReleaseHostIdFailure(self._sdUUID, e)
ReleaseHostIdFailure: Cannot release host id: ('fa98e800-83e4-480c-b769-c48504aa61fd', SanlockException(16, 'Sanlock lockspace remove failure', 'Device or resource busy'))

After ~30 minutes the host is rebooted. The VM doesn't start when the host is brought up.
I checked the HA agent and broker services; they are inactive.

Expected results:
The host shouldn't be fenced during deployment.

Additional info:
logs from the host
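For reference, a rough way to look at the sanlock state behind the ReleaseHostIdFailure above (a diagnostic sketch, not taken from the attached logs):

  # Show the lockspaces and resources sanlock currently holds; if the
  # lockspace named after the storage domain UUID (fa98e800-...) is still
  # listed, the host id cannot be released yet (errno 16, EBUSY).
  sanlock client status

  # sanlock's own log records lease renewal failures, which are what
  # ultimately trigger the watchdog reboot:
  tail -n 50 /var/log/sanlock.log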
From the attached logs I see that the first execution of the hosted-engine --deploy command was aborted by receiving signal 2 (keyboard interrupt / ctrl+c) at 2014-09-02 11:43:52. In the vdsm log, at the same time, I see:

Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 198, in _monitorLoop
    self.domain.releaseHostId(self.hostId, unused=True)
  File "/usr/share/vdsm/storage/sd.py", line 480, in releaseHostId
    self._clusterLock.releaseHostId(hostId, async, unused)
  File "/usr/share/vdsm/storage/clusterlock.py", line 252, in releaseHostId
    raise se.ReleaseHostIdFailure(self._sdUUID, e)
ReleaseHostIdFailure: Cannot release host id: ('1e1212ce-33c4-43b0-ab48-8843794176ad', SanlockException(16, 'Sanlock lockspace remove failure', 'Device or resource busy'))

That makes sense, since the storage is busy: the VM was still running when the setup was killed.

About the second execution: I see garbage in the log file right after 2014-09-02 14:22:34, corresponding to this event I see in the messages log:

libvirtError: Requested operation is not valid: cgroup CPU controller is not mounted

Sep 2 14:22:17 localhost iscsid: cannot make a connection to 10.35.146.129:3260 (-1,101)
Sep 2 14:22:17 localhost multipathd: 3514f0c5462601612: sdb - path offline
Sep 2 14:22:17 localhost multipathd: 3514f0c5462601614: sdc - path offline
Sep 2 14:22:20 localhost iscsid: cannot make a connection to 10.35.146.129:3260 (-1,101)
Sep 2 14:22:22 localhost multipathd: 3514f0c5462601612: sdb - path offline
Sep 2 14:22:22 localhost multipathd: 3514f0c5462601614: sdc - path offline
Sep 2 14:22:23 localhost iscsid: cannot make a connection to 10.35.146.129:3260 (-1,101)
Sep 2 14:22:26 localhost iscsid: cannot make a connection to 10.35.146.129:3260 (-1,101)
Sep 2 14:22:27 localhost multipathd: 3514f0c5462601612: sdb - path offline
Sep 2 14:22:27 localhost multipathd: 3514f0c5462601614: sdc - path offline
Sep 2 14:22:29 localhost iscsid: cannot make a connection to 10.35.146.129:3260 (-1,101)
Sep 2 14:22:30 localhost journal: vdsm vm.Vm ERROR vmId=`cec64f28-7e21-42f4-a79d-499fdd0a99ca`::Stats function failed: <AdvancedStatsFunction _sampleCpu at 0x1414510>
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/sampling.py", line 471, in collect
    statsFunction()
  File "/usr/share/vdsm/virt/sampling.py", line 346, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/virt/vm.py", line 308, in _sampleCpu
    cpuStats = self._vm._dom.getCPUStats(True, 0)
  File "/usr/share/vdsm/virt/vm.py", line 662, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/vdsm/libvirtconnection.py", line 111, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 2016, in getCPUStats
    if ret is None: raise libvirtError ('virDomainGetCPUStats() failed', dom=self)
libvirtError: Requested operation is not valid: cgroup CPUACCT controller is not mounted
Sep 2 14:22:30 localhost journal: vdsm vm.Vm ERROR vmId=`cec64f28-7e21-42f4-a79d-499fdd0a99ca`::Stats function failed: <AdvancedStatsFunction _sampleCpuTune at 0x1410650>
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/sampling.py", line 471, in collect
    statsFunction()
  File "/usr/share/vdsm/virt/sampling.py", line 346, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/virt/vm.py", line 354, in _sampleCpuTune
    infos = self._vm._dom.schedulerParameters()
  File "/usr/share/vdsm/virt/vm.py", line 662, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/vdsm/libvirtconnection.py", line 111, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 2134, in schedulerParameters
    if ret is None: raise libvirtError ('virDomainGetSchedulerParameters() failed', dom=self)
libvirtError: Requested operation is not valid: cgroup CPU controller is not mounted
Sep 2 14:22:32 localhost iscsid: cannot make a connection to 10.35.146.129:3260 (-1,101)
Sep 2 14:22:32 localhost multipathd: 3514f0c5462601612: sdb - path offline
Sep 2 14:22:32 localhost multipathd: 3514f0c5462601614: sdc - path offline
Sep 2 14:22:35 localhost iscsid: cannot make a connection to 10.35.146.129:3260 (-1,101)
Sep 2 14:22:37 localhost multipathd: 3514f0c5462601612: sdb - path offline
Sep 2 14:22:37 localhost multipathd: 3514f0c5462601614: sdc - path offline
Sep 2 14:22:38 localhost systemd: Starting Cleanup of Temporary Directories...
Sep 2 14:22:38 localhost systemd: Started Cleanup of Temporary Directories.
Sep 2 14:22:38 localhost iscsid: cannot make a connection to 10.35.146.129:3260 (-1,101)

So the fencing doesn't seem related to the hosted-engine setup but to something that happened at lower levels. Moving this to the storage team.
The logs are full of:

iscsid: cannot make a connection to 10.35.146.129:3260 (-1,101)
multipathd: 3514f0c5462601612: sdb - path offline
multipathd: 3514f0c5462601614: sdc - path offline
iscsid: cannot make a connection to 10.35.146.129:3260

Sandro, is the setup changing the iptables rules?
Yes, it changes the iptables rules: when adding the host to the engine through the SDK, setup uses ovirt-host-deploy, which changes the iptables rules on the host according to the iptables rules stored in the DB, just as it would for any other host.
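If that rewrite of the rules is suspected of cutting the iSCSI traffic, it can be verified on the host right after host-deploy runs (a sketch; note the initiator connects out to tcp/3260, so OUTPUT rules or a flush of established connections during the rewrite would be the things to look for):

  # Dump the full ruleset installed by ovirt-host-deploy:
  iptables -S

  # Look for anything touching the iSCSI port:
  iptables -L -n -v | grep 3260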
I can't reproduce it with the following configuration:
1) EL7 VMs running tgtd, exposing 1 LUN, 192.168.1.108
2) ADSL modem/router, default GW, 192.168.1.1
3) Bare metal host, running EL 6.5

Same config but with the bare metal host running EL7, with SELinux disabled due to bug 1142709.

All hosts / VMs freshly installed. I'm not using multipath.
So may this be something related to that?
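For completeness, a single-LUN tgtd target like the one in the setup above can be exposed roughly like this (the IQN and backing device are illustrative, not the ones actually used):

  tgtadm --lld iscsi --mode target --op new --tid 1 \
         --targetname iqn.2014-09.com.example:lun1
  tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 1 \
         --backing-store /dev/vg0/lun1
  tgtadm --lld iscsi --mode target --op bind --tid 1 --initiator-address ALL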
(In reply to Sandro Bonazzola from comment #4)
> All hosts / VMs freshly installed. I'm not using multipath.
> So may this be something related to that?

I don't think it is (since we always use multipath, even when only one path is available), but you can easily find out by adding another path.
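A sketch of that test, assuming the target can be reached through a second portal address (the second IP is illustrative):

  # Discover the target through both portals and log in to all nodes:
  iscsiadm -m discovery -t sendtargets -p 192.168.1.108:3260
  iscsiadm -m discovery -t sendtargets -p 192.168.2.108:3260
  iscsiadm -m node -l

  # multipath should then show two paths grouped under a single WWID:
  multipath -ll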
Closing since I don't have enough info to reproduce the issue. Please reopen if you're able to reproduce it and can provide detailed steps.