Created attachment 938391 [details]
Logs

Description of problem:
Trying to deploy the hosted-engine via iSCSI doesn't work. It results in a working engine that pauses whenever the first node (i.e. the first machine the deploy process is run on) is added to the engine.

Version-Release number of selected component (if applicable):

How reproducible:
Very (tried four times, failing at the same exact moment)

Steps to Reproduce:
1.) Install a fresh copy of CentOS 7
2.) Install the nightly snapshot repos for oVirt
3.) Install ovirt-hosted-engine-setup
4.) Run vdsm-tool configure --force (because of the known cert issue)
5.) Run hosted-engine --deploy
6.) Install the VM with ovirt-engine and deploy it (successful)
7.) Let the hosted-engine --deploy process add the node to the engine

Actual results:
The host starts to be added (becomes visible) in the engine, but then the engine VM pauses, resulting in a broken install on the node side.

Expected results:
The host should be added (become visible) in the engine, the VM shouldn't stop, and the host should come up and run.

Additional info:
Logs from both the engine and the node are provided as an attachment.
On vmnode4 the hosted-engine setup log ovirt-hosted-engine-setup-20140917090611-vs45gs.log shows:

2014-09-17 09:57:01 DEBUG otopi.plugins.otopi.dialog.human dialog.__logString:215 DIALOG:SEND Hosted Engine successfully set up

Before reporting that the setup was successful, it waited for the VM to be down and then destroyed it:

2014-09-17 09:57:00 DEBUG otopi.ovirt_hosted_engine_setup.tasks tasks.wait:97 Done
2014-09-17 09:57:00 DEBUG otopi.plugins.ovirt_hosted_engine_setup.ha.ha_services plugin.executeRaw:785 execute: ['/bin/vdsClient', '-s', 'localhost', 'destroy', '8411bfbb-f35c-45bf-8266-b0a6c69b3101'], executable='None', cwd='None', env=None

Then it started the HA daemons:

2014-09-17 09:57:00 DEBUG otopi.plugins.otopi.services.systemd plugin.executeRaw:785 execute: ('/bin/systemctl', 'enable', u'ovirt-ha-agent.service'), executable='None', cwd='None', env=None
2014-09-17 09:57:00 DEBUG otopi.plugins.otopi.services.systemd plugin.executeRaw:785 execute: ('/bin/systemctl', 'start', 'ovirt-ha-agent.service'), executable='None', cwd='None', env=None
2014-09-17 09:57:01 DEBUG otopi.plugins.otopi.services.systemd plugin.executeRaw:785 execute: ('/bin/systemctl', 'enable', u'ovirt-ha-broker.service'), executable='None', cwd='None', env=None
2014-09-17 09:57:01 DEBUG otopi.plugins.otopi.services.systemd plugin.executeRaw:785 execute: ('/bin/systemctl', 'start', 'ovirt-ha-broker.service'), executable='None', cwd='None', env=None

So from a setup point of view, everything went well.
The hosted-engine HA daemon successfully started the VM:

MainThread::INFO::2014-09-17 09:58:30,694::hosted_engine::877::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Starting vm using `/usr/sbin/hosted-engine --vm-start`
MainThread::INFO::2014-09-17 09:58:30,968::hosted_engine::883::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) stdout: 8411bfbb-f35c-45bf-8266-b0a6c69b3101
    Status = WaitForLaunch
    nicModel = rtl8139,pv
    emulatedMachine = pc
    pid = 0
    devices = [{'index': '2', 'iface': 'ide', 'specParams': {}, 'readonly': 'true', 'deviceId': '8f72938c-4831-4af6-9d6d-2ec507ce5a68', 'address': {'bus': '1', 'controller': '0', 'type': 'drive', 'target': '0', 'unit': '0'}, 'device': 'cdrom', 'shared': 'false', 'path': '/Fedora-20-x86_64-netinst.iso', 'type': 'disk'}, {'index': '0', 'iface': 'virtio', 'reqsize': '0', 'format': 'raw', 'bootOrder': '1', 'address': {'slot': '0x06', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'volumeID': 'd2dbd275-1ba8-4018-a242-75ce5f2e5314', 'imageID': '64acee54-cffd-461e-b27e-4efa9afb350b', 'specParams': {}, 'readonly': 'false', 'domainID': 'c316df3b-2309-4196-8d43-6b54e446daed', 'deviceId': '64acee54-cffd-461e-b27e-4efa9afb350b', 'type': 'disk', 'poolID': '00000000-0000-0000-0000-000000000000', 'device': 'disk', 'shared': 'exclusive', 'propagateErrors': 'off', 'optional': 'false'}, {'device': 'scsi', 'model': 'virtio-scsi', 'type': 'controller'}, {'nicModel': 'pv', 'macAddr': '22:33:aa:bb:33:22', 'linkActive': 'true', 'network': 'ovirtmgmt', 'filter': 'vdsm-no-mac-spoofing', 'specParams': {}, 'deviceId': 'a7627cfc-60ed-4582-8cb7-dc250ec6dea9', 'address': {'slot': '0x03', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'device': 'bridge', 'type': 'interface'}, {'device': 'console', 'specParams': {}, 'type': 'console', 'deviceId': 'd4015181-8d3d-4ff9-8425-44b058c7ea72', 'alias': 'console0'}]
    smp = 4
    vmType = kvm
    displayIp = 0
    displaySecurePort = -1
    memSize = 8192
    displayPort = -1
    cpuType = SandyBridge
    spiceSecureChannels = smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir
    clientIp =
    vmName = HostedEngine
    display = vnc

And reported a good state:

MainThread::INFO::2014-09-17 10:17:37,868::state_machine::168::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 1): {'engine-health': {'health': 'good', 'vm': 'up', 'detail': 'up'}, 'bridge': True, 'mem-free': 61853.0, 'maintenance': False, 'cpu-load': 0.0167, 'gateway': True}

I'm not a VDSM expert, but it seems to me that the following line says the VM is up and running:

Thread-1679::DEBUG::2014-09-17 10:18:57,669::BindingXMLRPC::1139::vds::(wrapper) return vmGetStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'displayInfo': [{'tlsPort': '-1', 'ipAddress': '0', 'type': 'vnc', 'port': u'5900'}], 'memUsage': '0', 'acpiEnable': 'true', 'guestFQDN': '', 'pid': '22214', 'session': 'Unknown', 'vmType': 'kvm', 'timeOffset': '0', 'balloonInfo': {'balloon_max': '8388608', 'balloon_target': '0', 'balloon_cur': '8388608', 'balloon_min': '0'}, 'pauseCode': 'NOERR', 'network': {u'vnet0': {'macAddr': '22:33:aa:bb:33:22', 'rxDropped': '0', 'rxErrors': '0', 'txDropped': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'unknown', 'speed': '1000', 'name': u'vnet0'}}, 'vmJobs': {}, 'cpuUser': '1.67', 'elapsedTime': '1226', 'displayType': 'vnc', 'cpuSys': '0.33', 'appsList': [], 'vcpuCount': '4', 'clientIp': '', 'hash': '-6500834245358214803', 'vmId': '8411bfbb-f35c-45bf-8266-b0a6c69b3101', 'displayIp': '0', 'vcpuPeriod': 100000L, 'displayPort': u'5900', 'vcpuQuota': '-1', 'kvmEnable': 'true', 'disks': {u'vda': {'readLatency': '2161970', 'apparentsize': '26843545600', 'writeLatency': '3532596', 'imageID': '64acee54-cffd-461e-b27e-4efa9afb350b', 'flushLatency': '55917', 'readRate': '300793.44', 'truesize': '26843545600', 'writeRate': '154931.46'}, u'hdc': {'readLatency': '0', 'apparentsize': '0', 'writeLatency': '0', 'flushLatency': '0', 'readRate': '0.00', 'truesize': '0', 'writeRate': '0.00'}}, 'monitorResponse': '0', 'statsAge': '0.35', 'username': 'Unknown', 'status': 'Up', 'guestCPUCount': -1, 'guestIPs': '', 'displaySecurePort': '-1'}]}

So on the hosted-engine and VDSM side everything looks well. On the other end, the deployment seems to fail:

2014-09-17 09:42:59,360 INFO [org.ovirt.engine.core.bll.InstallerMessages] (VdsDeploy) Installation 192.168.1.5: Stage: Initializing
2014-09-17 09:42:59,369 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) Correlation ID: 1bbe00d9, Call Stack: null, Custom Event ID: -1, Message: Installing Host xxx-vmnode4. Stage: Initializing.
2014-09-17 09:42:59,385 INFO [org.ovirt.engine.core.bll.InstallerMessages] (VdsDeploy) Installation 192.168.1.5: Stage: Environment setup
2014-09-17 09:42:59,390 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) Correlation ID: 1bbe00d9, Call Stack: null, Custom Event ID: -1, Message: Installing Host xxx-vmnode4. Stage: Environment setup.
2014-09-17 09:42:59,400 INFO [org.ovirt.engine.core.bll.InstallerMessages] (VdsDeploy) Installation 192.168.1.5: Stage: Environment packages setup
2014-09-17 09:42:59,404 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) Correlation ID: 1bbe00d9, Call Stack: null, Custom Event ID: -1, Message: Installing Host xxx-vmnode4. Stage: Environment packages setup.
2014-09-17 09:43:01,298 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-24) Initializing Host: xxx-vmnode4
2014-09-17 09:43:13,070 INFO [org.ovirt.engine.core.bll.InstallerMessages] (VdsDeploy) Installation 192.168.1.5: Stage: Programs detection
2014-09-17 09:43:13,083 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) Correlation ID: 1bbe00d9, Call Stack: null, Custom Event ID: -1, Message: Installing Host xxx-vmnode4. Stage: Programs detection.
2014-09-17 09:43:13,097 INFO [org.ovirt.engine.core.bll.InstallerMessages] (VdsDeploy) Installation 192.168.1.5: Stage: Environment customization
2014-09-17 09:43:13,106 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) Correlation ID: 1bbe00d9, Call Stack: null, Custom Event ID: -1, Message: Installing Host xxx-vmnode4. Stage: Environment customization.
2014-09-17 09:43:17,496 INFO [org.ovirt.engine.core.bll.InstallerMessages] (VdsDeploy) Installation 192.168.1.5: Kdump supported
2014-09-17 09:43:17,503 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) Correlation ID: 1bbe00d9, Call Stack: null, Custom Event ID: -1, Message: Installing Host xxx-vmnode4. Kdump supported.
2014-09-17 09:43:17,512 INFO [org.ovirt.engine.core.bll.InstallerMessages] (VdsDeploy) Installation 192.168.1.5: Logs at host located at: '/tmp/ovirt-host-deploy-20140917094259-b7qbdo.log'
2014-09-17 09:43:17,518 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) Correlation ID: 1bbe00d9, Call Stack: null, Custom Event ID: -1, Message: Installing Host xxx-vmnode4. Logs at host located at: '/tmp/ovirt-host-deploy-20140917094259-b7qbdo.log'.
2014-09-17 09:43:17,584 INFO [org.ovirt.engine.core.bll.VdsDeploy] (VdsDeploy) Host 192.168.1.5 reports unique id 03000200-0400-0500-0006-000700080009
2014-09-17 09:43:17,605 INFO [org.ovirt.engine.core.bll.VdsDeploy] (VdsDeploy) Assigning unique id 03000200-0400-0500-0006-000700080009 to Host 192.168.1.5
2014-09-17 09:43:17,668 INFO [org.ovirt.engine.core.bll.InstallerMessages] (VdsDeploy) Installation 192.168.1.5: Disabling Kdump integration
2014-09-17 09:43:17,674 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) Correlation ID: 1bbe00d9, Call Stack: null, Custom Event ID: -1, Message: Installing Host xxx-vmnode4. Disabling Kdump integration.
2014-09-17 09:43:17,686 INFO [org.ovirt.engine.core.bll.InstallerMessages] (VdsDeploy) Installation 192.168.1.5: Stage: Setup validation
2014-09-17 09:43:17,690 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) Correlation ID: 1bbe00d9, Call Stack: null, Custom Event ID: -1, Message: Installing Host xxx-vmnode4. Stage: Setup validation.
2014-09-17 09:43:17,743 INFO [org.ovirt.engine.core.bll.InstallerMessages] (VdsDeploy) Installation 192.168.1.5: Hardware supports virtualization
2014-09-17 09:43:17,747 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) Correlation ID: 1bbe00d9, Call Stack: null, Custom Event ID: -1, Message: Installing Host xxx-vmnode4. Hardware supports virtualization.
2014-09-17 09:43:17,894 INFO [org.ovirt.engine.core.bll.InstallerMessages] (VdsDeploy) Installation 192.168.1.5: Stage: Transaction setup
2014-09-17 09:43:17,900 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) Correlation ID: 1bbe00d9, Call Stack: null, Custom Event ID: -1, Message: Installing Host xxx-vmnode4. Stage: Transaction setup.
2014-09-17 09:43:17,942 INFO [org.ovirt.engine.core.bll.InstallerMessages] (VdsDeploy) Installation 192.168.1.5: Stage: Misc configuration
2014-09-17 09:43:17,946 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) Correlation ID: 1bbe00d9, Call Stack: null, Custom Event ID: -1, Message: Installing Host xxx-vmnode4. Stage: Misc configuration.
2014-09-17 09:43:17,947 INFO [org.ovirt.engine.core.bll.InstallerMessages] (VdsDeploy) Installation 192.168.1.5: Stage: Package installation
2014-09-17 09:43:17,951 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) Correlation ID: 1bbe00d9, Call Stack: null, Custom Event ID: -1, Message: Installing Host xxx-vmnode4. Stage: Package installation.

...and then garbage in the log file.
Looks like something really bad happened, but I can't see it as a hosted-engine issue. Whatever happened, it happened around 2014-09-17 09:43:17. Looking at the logs around that timestamp.
Around that time I can see in the vdsm log file:

Thread-65::DEBUG::2014-09-17 09:43:14,514::blockSD::596::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '1+0 records in\n1+0 records out\n4096 bytes (4.1 kB) copied, 0.00544066 s, 753 kB/s\n'; <rc> = 0
GuestMonitor-HostedEngine::DEBUG::2014-09-17 09:43:18,069::vm::486::vm.Vm::(_getUserCpuTuneInfo) vmId=`8411bfbb-f35c-45bf-8266-b0a6c69b3101`::Domain Metadata is not set
GuestMonitor-HostedEngine::DEBUG::2014-09-17 09:43:18,070::vm::486::vm.Vm::(_getUserCpuTuneInfo) vmId=`8411bfbb-f35c-45bf-8266-b0a6c69b3101`::Domain Metadata is not set
GuestMonitor-HostedEngine::DEBUG::2014-09-17 09:43:18,071::vm::486::vm.Vm::(_getUserCpuTuneInfo) vmId=`8411bfbb-f35c-45bf-8266-b0a6c69b3101`::Domain Metadata is not set
MainThread::DEBUG::2014-09-17 09:43:22,119::vdsm::58::vds::(sigtermHandler) Received signal 15
MainThread::DEBUG::2014-09-17 09:43:22,120::protocoldetector::133::vds.MultiProtocolAcceptor::(stop) Stopping Acceptor
Thread-30::ERROR::2014-09-17 09:43:22,120::__init__::317::IOProcessClient::(_communicate) IOProcess failure
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 272, in _communicate
    raise Exception("FD closed")
Exception: FD closed
MainThread::INFO::2014-09-17 09:43:22,121::__init__::563::jsonrpc.JsonRpcServer::(stop) Stopping JsonRPC Server
Detector thread::DEBUG::2014-09-17 09:43:22,121::protocoldetector::104::vds.MultiProtocolAcceptor::(_cleanup) Cleaning Acceptor
Thread-30::DEBUG::2014-09-17 09:43:22,122::__init__::159::IOProcessClient::(_run) Starting IOProcess...
MainThread::INFO::2014-09-17 09:43:22,123::vmchannels::188::vds::(stop) VM channels listener was stopped.
MainThread::INFO::2014-09-17 09:43:22,129::momIF::91::MOM::(stop) Shutting down MOM

Moving to ioprocess since I can't see anything else failing around that time.
I don't think it's related to ioprocess. Its FD was closed right after:

MainThread::DEBUG::2014-09-17 09:43:22,119::vdsm::58::vds::(sigtermHandler) Received signal 15

We need to understand what stopped/restarted vdsm. Could it be hosted-engine-setup itself?

Something else that is worrying is the repeated calls to hbaScan() every 10 seconds.
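One way to hunt for the culprit is to filter the node's syslog around the SIGTERM timestamp for service stop/start events. A sketch of that filter (my suggestion, not from the thread), run here against a hypothetical sample standing in for /var/log/messages:

```shell
# Hypothetical /var/log/messages excerpt standing in for the real file;
# the grep below is the filter to run on the node's actual syslog.
cat > /tmp/messages.sample <<'EOF'
Sep 17 09:43:22 vmnode4 systemd: Stopping Virtual Desktop Server Manager...
Sep 17 09:43:25 vmnode4 systemd: Starting Virtualization daemon...
EOF
# Show any unit being stopped or started around the failure window.
grep -nE 'Stopping|Starting' /tmp/messages.sample
```

Whichever unit shows a "Stopping" line just before 09:43:22 is the best candidate for what sent vdsmd the SIGTERM.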
Do you have hints of vdsmd restart on /var/log/vdsm/spm-lock.log or /var/log/messages?
(In reply to Dan Kenigsberg from comment #4)
> Do you have hints of vdsmd restart on /var/log/vdsm/spm-lock.log or
> /var/log/messages?

Do you need those from the node or the engine VM? I am currently not able to see those files, but I will review and post them ASAP. Thanks for your help and interest.
(In reply to Thomas Keppler from comment #5)
> Do you need those from the node or the engine VM?

From the node which had vdsmd restarted - we need to guess what caused that.
Created attachment 938756 [details]
/var/log/messages of xxx-vmnode4
(In reply to Dan Kenigsberg from comment #6)
> (In reply to Thomas Keppler from comment #5)
> > Do you need those from the node or the engine VM?
>
> From the node which had vdsmd restarted - we need to guess what caused that.

I don't have the /var/log/vdsm/spm-lock.log you requested; otherwise it would have been in the file I've already uploaded. Instead I'll provide the /var/log/messages file and the output of journalctl -o verbose (still uploading as of now). I hope you can find the information you need in there and that it helps hunt the problem down!

Best regards
Thomas
Created attachment 938757 [details]
journalctl -o verbose of xxx-vmnode4
I've reproduced the issue following http://pastebin.com/raw.php?i=sH9cfvDx so it seems reproducible. I think it may be somehow related to bug #1136368. The engine VM got paused:

# vdsClient -s 0 getVmStats b692df43-4c12-498e-b6d4-aff7e928bd85
b692df43-4c12-498e-b6d4-aff7e928bd85
    Status = Paused
    displayInfo = [{'tlsPort': '-1', 'ipAddress': '0', 'port': '5900', 'type': 'vnc'}]
    memUsage = 0
    acpiEnable = true
    guestFQDN =
    session = Unknown
    displaySecurePort = -1
    timeOffset = 0
    balloonInfo = {'balloon_max': '8388608', 'balloon_min': '0', 'balloon_target': '0', 'balloon_cur': '8388608'}
    pauseCode = NOERR
    network = {'vnet0': {'macAddr': '22:33:aa:bb:33:22', 'rxDropped': '0', 'txDropped': '0', 'rxErrors': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'unknown', 'speed': '1000', 'name': 'vnet0'}}
    vmType = kvm
    cpuUser = 0.00
    elapsedTime = 1918
    vmJobs = {}
    cpuSys = 0.00
    appsList = []
    displayType = vnc
    hash = -7643996875850563108
    pid = 4676
    displayIp = 0
    displayPort = 5900
    guestIPs =
    kvmEnable = true
    disks = {'vda': {'readLatency': '0', 'apparentsize': '21474836480', 'writeLatency': '0', 'imageID': '94f46485-ab0f-4ac8-995f-aaea87724284', 'flushLatency': '0', 'truesize': '21474836480'}, 'hdc': {'flushLatency': '0', 'readLatency': '0', 'truesize': '0', 'apparentsize': '0', 'writeLatency': '0'}}
    monitorResponse = 0
    statsAge = 0.16
    username = Unknown
    guestCPUCount = -1
    clientIp =

# vdsClient -s 0 getDeviceList
[{'GUID': '360000000000000000e00000000010001', 'capacity': '26843545600', 'devtype': 'iSCSI', 'fwrev': '0001', 'logicalblocksize': '512', 'pathlist': [{'connection': '192.168.1.108', 'initiatorname': 'default', 'iqn': 'iqn.2009-02.com.example:for.all', 'port': '3260', 'portal': '1'}], 'pathstatus': [{'lun': '1', 'physdev': 'sdb', 'state': 'active', 'type': 'iSCSI'}], 'physicalblocksize': '4096', 'productID': 'VIRTUAL-DISK', 'pvUUID': '3rQ820-EGtx-gdXL-SiYM-SgGY-ghv2-TYa18Z', 'serial': 'SIET_VIRTUAL-DISK', 'status': 'used', 'vendorID': 'IET', 'vgUUID': 'OgZx4n-phrw-3zoV-p6wv-ZsqT-2Imh-Fe5crv'}]

# vdsClient -s 0 getDevicesVisibility 360000000000000000e00000000010001
360000000000000000e00000000010001 = True

# iptables-save
# Generated by iptables-save v1.4.21 on Thu Sep 18 10:23:29 2014
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [19483:1557514]
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -p tcp -m tcp --dport 54321 -j ACCEPT
-A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
-A INPUT -p udp -m udp --dport 161 -j ACCEPT
-A INPUT -p tcp -m tcp --dport 16514 -j ACCEPT
-A INPUT -p tcp -m multiport --dports 5900:6923 -j ACCEPT
-A INPUT -p tcp -m multiport --dports 49152:49216 -j ACCEPT
-A INPUT -j REJECT --reject-with icmp-host-prohibited
-A FORWARD -m physdev ! --physdev-is-bridged -j REJECT --reject-with icmp-host-prohibited
COMMIT
# Completed on Thu Sep 18 10:23:29 2014

# vdsClient -s 0 getStorageDomainsList
be03d311-751b-4623-82a0-f3113f542043

# vdsClient -s 0 getStorageDomainStats be03d311-751b-4623-82a0-f3113f542043
    disktotal = 26440892416 (24.0GB)
    diskfree = 536870912 (512.0MB)

# vdsClient -s 0 getStorageDomainInfo be03d311-751b-4623-82a0-f3113f542043
    uuid = be03d311-751b-4623-82a0-f3113f542043
    vguuid = OgZx4n-phrw-3zoV-p6wv-ZsqT-2Imh-Fe5crv
    state = OK
    version = 3
    role = Master
    type = ISCSI
    class = Data
    pool = ['ffcdbc2f-0fa2-4991-8b6d-94050a1954e7']
    name = hosted_storage

# vdsClient -s 0 getVolumesList be03d311-751b-4623-82a0-f3113f542043 0
31ccf8c8-0547-4b98-8e8c-4109f3c345f8 : Hosted Engine Image.
d602fc34-e4b4-498a-b2df-84de80e68ec5 : hosted-engine.lockspace.
a1b1c513-9367-4092-a5c3-3744395a279b : hosted-engine.metadata.
# vdsClient -s 0 getVolumeInfo be03d311-751b-4623-82a0-f3113f542043 0 0 31ccf8c8-0547-4b98-8e8c-4109f3c345f8
    status = OK
    domain = be03d311-751b-4623-82a0-f3113f542043
    capacity = 21474836480
    voltype = LEAF
    description = Hosted Engine Image
    parent = 00000000-0000-0000-0000-000000000000
    format = RAW
    image = 94f46485-ab0f-4ac8-995f-aaea87724284
    uuid = 31ccf8c8-0547-4b98-8e8c-4109f3c345f8
    disktype = 2
    legality = LEGAL
    mtime = 0
    apparentsize = 21474836480
    truesize = 21474836480
    type = PREALLOCATED
    children = []
    pool =
    ctime = 1411024926
The vdsm log shows:

Thread-15::ERROR::2014-09-18 10:29:17,084::sampling::475::vm.Vm::(collect) vmId=`b692df43-4c12-498e-b6d4-aff7e928bd85`::Stats function failed: <AdvancedStatsFunction _sampleCpu at 0x3134750>
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/sampling.py", line 471, in collect
    statsFunction()
  File "/usr/share/vdsm/virt/sampling.py", line 346, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/virt/vm.py", line 313, in _sampleCpu
    cpuStats = self._vm._dom.getCPUStats(True, 0)
  File "/usr/share/vdsm/virt/vm.py", line 670, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/vdsm/libvirtconnection.py", line 111, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 2016, in getCPUStats
    if ret is None: raise libvirtError ('virDomainGetCPUStats() failed', dom=self)
libvirtError: Requested operation is not valid: cgroup CPUACCT controller is not mounted

Thread-15::DEBUG::2014-09-18 10:29:17,088::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 55 edom: 10 level: 2 message: Requested operation is not valid: cgroup CPU controller is not mounted
Thread-15::ERROR::2014-09-18 10:29:17,088::sampling::475::vm.Vm::(collect) vmId=`b692df43-4c12-498e-b6d4-aff7e928bd85`::Stats function failed: <AdvancedStatsFunction _sampleCpuTune at 0x312bd18>
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/sampling.py", line 471, in collect
    statsFunction()
  File "/usr/share/vdsm/virt/sampling.py", line 346, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/virt/vm.py", line 359, in _sampleCpuTune
    infos = self._vm._dom.schedulerParameters()
  File "/usr/share/vdsm/virt/vm.py", line 670, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/vdsm/libvirtconnection.py", line 111, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 2134, in schedulerParameters
    if ret is None: raise libvirtError ('virDomainGetSchedulerParameters() failed', dom=self)
libvirtError: Requested operation is not valid: cgroup CPU controller is not mounted
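The error means libvirt could not find a mounted cpuacct cgroup controller when sampling the domain. A quick, hedged check (my sketch, assuming the cgroup v1 layout CentOS 7 uses) is to ask the kernel which controllers exist and whether any cpuacct hierarchy is mounted:

```shell
# List the cpu-related controllers the kernel knows about; on a healthy
# cgroup v1 host each enabled one is also mounted under /sys/fs/cgroup.
awk 'NR == 1 || $1 ~ /^cpu/' /proc/cgroups 2>/dev/null || echo "no /proc/cgroups"
# Check whether any cgroup filesystem carrying the cpuacct option is mounted.
grep cpuacct /proc/mounts || echo "cpuacct controller not mounted"
```

If the second command prints the fallback message on the failing host, that matches the libvirtError exactly.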
Need to clear up the cgroups issue first anyway, then re-determine if there's anything else to be done. Moving to SLA to handle the above.
Created attachment 938809 [details] English description of my steps until the failure happened Added my steps to failure for future reference.
Not sure if it helps, but:

# mount|grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,seclabel,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
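Since the mount table looks sane, the cpu,cpuacct hierarchy is mounted system-wide; the next hedged check (my suggestion, not from the thread) is whether the qemu process itself still sits in a cpuacct hierarchy after the libvirtd restart. $$ below is a stand-in PID for illustration only:

```shell
# Show which cgroup hierarchies a process belongs to. For the HostedEngine VM,
# replace $$ (this shell, used as a runnable stand-in) with the qemu-kvm PID,
# e.g. the one obtained from `pgrep -f qemu-kvm`.
grep -E 'cpuacct|::' /proc/$$/cgroup || echo "no cpuacct entry for this process"
```

A process whose /proc/PID/cgroup lacks a cpu,cpuacct line despite the mount above would explain libvirt's "controller is not mounted" complaint for that domain.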
Tried to reproduce using a CentOS 6.5 host and a CentOS 6.5 VM; the issue is not reproducible there. Now going to try a Fedora 20 host and a Fedora 20 VM; I fear it's a CentOS 7-only issue.
I reproduced with a Fedora 20 host and a Fedora 20 VM. For a reason I don't understand yet, on Fedora 20 I can see in the manager UI that libvirt has been stopped by host-deploy:

2014-Sep-18, 17:05 Installing Host hosted_engine_1. Stopping libvirtd.
2014-Sep-18, 17:05 Installing Host hosted_engine_1. Stage: Closing up.
2014-Sep-18, 17:05 Installing Host hosted_engine_1. Stage: Transaction commit.
2014-Sep-18, 17:05 Installing Host hosted_engine_1. Enrolling certificate.
2014-Sep-18, 17:05 Installing Host hosted_engine_1. Setting time.
2014-Sep-18, 17:05 Installing Host hosted_engine_1. Stage: Misc configuration.
2014-Sep-18, 17:05 Installing Host hosted_engine_1. Yum Verify: 10/10: python-setuptools.noarch 0:1.4.2-1.fc20 - u.
...
...
2014-Sep-18, 17:05 Installing Host hosted_engine_1. Hardware supports virtualization.
2014-Sep-18, 17:05 Host hosted_engine_1 installation in progress . Cannot validate host name settings, reason: cannot resolve own name 'mididell.home'.
2014-Sep-18, 17:05 Installing Host hosted_engine_1. Stage: Setup validation.
2014-Sep-18, 17:05 Installing Host hosted_engine_1. Disabling Kdump integration.
2014-Sep-18, 17:05 Installing Host hosted_engine_1. Logs at host located at: '/tmp/ovirt-host-deploy-20140918170506-bmvcub.log'.
2014-Sep-18, 17:05 Installing Host hosted_engine_1. Kdump unsupported.
...
2014-Sep-18, 17:05 Installing Host hosted_engine_1. Stage: Initializing.
2014-Sep-18, 17:05 Installing Host hosted_engine_1. Connected to host 192.168.1.106 with SSH key fingerprint: 97:dc:00:97:de:42:d3:d4:e8:e3:b8:01:f5:04:d8:9b.

This may explain why everything stops working later on. I'm sure this didn't happen in 3.4 and in the 3.5 beta.
For Martin, here is the xml dump:

<domain type='kvm' id='3'>
  <name>HostedEngine</name>
  <uuid>7f9cd40e-252e-4bed-8247-5e2221fe9807</uuid>
  <memory unit='KiB'>8388608</memory>
  <currentMemory unit='KiB'>8388608</currentMemory>
  <vcpu placement='static'>4</vcpu>
  <cputune>
    <shares>1020</shares>
  </cputune>
  <resource>
    <partition>/machine</partition>
  </resource>
  <sysinfo type='smbios'>
    <system>
      <entry name='manufacturer'>oVirt</entry>
      <entry name='product'>oVirt Node</entry>
      <entry name='version'>20-3</entry>
      <entry name='serial'>4C4C4544-0039-4610-8039-B4C04F563032_f8:b1:56:c8:96:1f</entry>
      <entry name='uuid'>7f9cd40e-252e-4bed-8247-5e2221fe9807</entry>
    </system>
  </sysinfo>
  <os>
    <type arch='x86_64' machine='pc-i440fx-1.6'>hvm</type>
    <smbios mode='sysinfo'/>
  </os>
  <features>
    <acpi/>
  </features>
  <cpu mode='custom' match='exact'>
    <model fallback='allow'>Haswell</model>
  </cpu>
  <clock offset='variable' adjustment='0' basis='utc'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>destroy</on_reboot>
  <on_crash>destroy</on_crash>
  <devices>
    <emulator>/usr/bin/qemu-kvm</emulator>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source file='/var/tmp/Fedora-20-x86_64-netinst.iso' startupPolicy='optional'/>
      <target dev='hdc' bus='ide'/>
      <readonly/>
      <serial></serial>
      <alias name='ide0-1-0'/>
      <address type='drive' controller='0' bus='1' target='0' unit='0'/>
    </disk>
    <disk type='block' device='disk' snapshot='no'>
      <driver name='qemu' type='raw' cache='none' error_policy='stop' io='native'/>
      <source dev='/var/run/vdsm/storage/442b0aca-7a4c-45bd-9d54-f41c05571bb1/a65bb620-5324-47b1-8832-44866f8fa22d/5e3f5725-156d-4da1-a0f7-f00846c93d58'/>
      <target dev='vda' bus='virtio'/>
      <serial>a65bb620-5324-47b1-8832-44866f8fa22d</serial>
      <boot order='1'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </disk>
    <controller type='scsi' index='0' model='virtio-scsi'>
      <alias name='scsi0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </controller>
    <controller type='usb' index='0'>
      <alias name='usb0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'>
      <alias name='pci.0'/>
    </controller>
    <controller type='ide' index='0'>
      <alias name='ide0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <alias name='virtio-serial0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </controller>
    <lease>
      <lockspace>442b0aca-7a4c-45bd-9d54-f41c05571bb1</lockspace>
      <key>5e3f5725-156d-4da1-a0f7-f00846c93d58</key>
      <target path='/dev/442b0aca-7a4c-45bd-9d54-f41c05571bb1/leases' offset='111149056'/>
    </lease>
    <interface type='bridge'>
      <mac address='00:16:3e:72:62:3f'/>
      <source bridge='ovirtmgmt'/>
      <target dev='vnet0'/>
      <model type='virtio'/>
      <filterref filter='vdsm-no-mac-spoofing'/>
      <link state='up'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <console type='pty' tty='/dev/pts/2'>
      <source path='/dev/pts/2'/>
      <target type='virtio' port='0'/>
      <alias name='console0'/>
    </console>
    <channel type='unix'>
      <source mode='bind' path='/var/lib/libvirt/qemu/channels/7f9cd40e-252e-4bed-8247-5e2221fe9807.com.redhat.rhevm.vdsm'/>
      <target type='virtio' name='com.redhat.rhevm.vdsm'/>
      <alias name='channel0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <channel type='unix'>
      <source mode='bind' path='/var/lib/libvirt/qemu/channels/7f9cd40e-252e-4bed-8247-5e2221fe9807.org.qemu.guest_agent.0'/>
      <target type='virtio' name='org.qemu.guest_agent.0'/>
      <alias name='channel1'/>
      <address type='virtio-serial' controller='0' bus='0' port='2'/>
    </channel>
    <input type='mouse' bus='ps2'/>
    <graphics type='vnc' port='5900' autoport='yes' listen='0' passwdValidTo='2014-09-18T17:34:15'>
      <listen type='address' address='0'/>
    </graphics>
    <video>
      <model type='cirrus' vram='9216' heads='1'/>
      <alias name='video0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <memballoon model='none'>
      <alias name='balloon0'/>
    </memballoon>
  </devices>
  <seclabel type='dynamic' model='selinux' relabel='yes'>
    <label>system_u:system_r:svirt_t:s0:c379,c386</label>
    <imagelabel>system_u:object_r:svirt_image_t:s0:c379,c386</imagelabel>
  </seclabel>
</domain>
How:
"""
This may explain why everything stops working later on.
I'm sure this didn't happen in 3.4 and in the 3.5 beta.
"""

What explains? What is the explanation?
(In reply to Alon Bar-Lev from comment #17)
> What explains? What is the explanation?

Restarting libvirt causes the iSCSI device to become unreachable, vdsm to receive SIGTERM, and the hosted-engine VM to be paused. The host-deploy is left incomplete since the VM running the engine is paused due to the above cascade of events.
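If that cascade is right, a snapshot of the iSCSI sessions taken before and after host-deploy restarts libvirtd should show the session dropping. A sketch (my suggestion; iscsiadm comes from iscsi-initiator-utils and may not be installed everywhere):

```shell
# Record active iSCSI sessions; comparing snapshots taken before and after
# the libvirtd/vdsmd restart shows whether the session to the LUN dropped.
if command -v iscsiadm >/dev/null 2>&1; then
    iscsiadm -m session 2>&1
else
    echo "iscsiadm not installed on this machine"
fi
```

A session to 192.168.1.108:3260 present in the "before" snapshot and missing in the "after" one would confirm the first step of the cascade.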
host-deploy stops vdsmd, stops libvirtd, then starts libvirtd (if no systemd) and starts vdsmd. vdsm is down during the libvirtd restart.

please explain again how it is related to host-deploy. if libvirt does not survive a restart due to invalid configuration, it is something else.
I tried it again with firewalld removed, both on the host machine and in the engine VM. I'll provide some logs later, but for now it seems to be OK. I will also try to add another host to this configuration later and see what I can do...

A (really bad) workaround for this issue would be to be really quick: forcefully power the VM down (hosted-engine --vm-poweroff) and start it back up (hosted-engine --vm-start). After that (if you are lucky) the host is "successfully" added; the hosted-engine --deploy script doesn't notice that, though, so you have to wait for the timeout in order to go on with the deploy process. All in all, it sort of works, but this is far from stable or good.
(In reply to Alon Bar-Lev from comment #19)
> host-deploy stops vdsmd, stops libvirtd then start libvirtd (if no systemd)
> and start vdsm.
>
> vdsm is down during libvirtd restart.
>
> please explain again how it is related to host-deploy.
>
> if libvirt does not survive restart due to invalid configuration it is
> something else.

Alon, this is hosted engine. libvirt and vdsmd are configured and running because they're needed for creating the VM where the engine runs. host-deploy is executed when the host running the hosted engine is added to the engine itself.

Is host-deploy restarting libvirt because both firewalld and iptables are installed on the host? Looking at comment #20 it seems that removing firewalld before deploying the host solves the issue. If so, I can add a Conflicts to the hosted-engine spec file.

BTW, I don't think that stopping vdsm / libvirt on a host already running any VM managed by VDSM (not only the hosted-engine one) is the right thing to do. Also, looking at the sequence in comment #16, vdsm is not stopped before stopping libvirt: stopping libvirt causes a SIGTERM in vdsmd, and that is also something I don't think host-deploy should do.

Reducing severity from urgent to high since a workaround (removing firewalld on the host) seems to be enough for having the system running.
I'm sorry if I didn't make that clear before, but no, it *doesn't help*. Same result.
It is not new that vdsm and libvirt are restarted. Please reassign to yourself until you figure out what's wrong.
Alon, I'm reassigning this to myself, but nothing changed in hosted-engine setup between 3.4 and 3.5 that could explain this issue, and I still think it's wrong to restart vdsm and libvirt on a system where a VM relying on those two daemons is already running. Anyway, any help finding the root cause from the 20 people in CC on this bug is welcome.
It may be wrong, but this is what ensures that after an update of these components the latest version is used. Also, if you look at their spec files, you can see that they restart themselves post-upgrade anyway. Nothing in host-deploy was modified in this regard since 3.2; you can see the git log. The entire solution of hosted-engine in this regard is faulty anyway... not that we did not know. Restarting libvirt should not affect running VMs as far as I know, so I still think the root cause has not been found. ovirt-host-deploy behaves naturally given the default component behavior above, which we do not control.
(In reply to Alon Bar-Lev from comment #25)
> the entire solution of hosted-engine in this regard is faulty anyway... not
> that we did not know.

Please open a separate bug describing why it is faulty and how it should work. We can consider using a different implementation in a future release.
Update: using NFS seems to work on CentOS 7/FC20! ...but this doesn't fix the problem.
Moving to integration for further investigation. Please needinfo me in case it turns out to be an infra issue.
I can confirm sbonazzo's result that with a CentOS 6.5 host + CentOS 6.5 engine (nothing else changed), this setup works perfectly.
Please check audit.log for any SELinux warnings. Can you try deploying with SELinux in permissive mode (run setenforce 0 before you start the deploy)?
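A concrete way to do that check: grep the audit log for AVC denial records. The AVC line in the sample file below is made up for illustration, not taken from this bug's attachments; on the real host you would grep /var/log/audit/audit.log directly:

```shell
#!/bin/sh
# Write a sample audit log; this AVC record is illustrative only, not
# taken from this bug's attachments.
cat > /tmp/audit-sample.log <<'EOF'
type=AVC msg=audit(1410940620.123:456): avc:  denied  { read } for  pid=1234 comm="qemu-kvm" path="/dev/dm-7" scontext=system_u:system_r:svirt_t:s0 tcontext=system_u:object_r:fixed_disk_device_t:s0 tclass=blk_file
type=SYSCALL msg=audit(1410940620.123:456): arch=c000003e syscall=0
EOF

# Look for SELinux denials (on the real host: /var/log/audit/audit.log).
grep 'avc:  denied' /tmp/audit-sample.log

# To deploy in permissive mode, run (as root) before starting the deploy:
#   setenforce 0
```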
I have already reinstalled CentOS 6.5 on those servers and currently don't have the time to debug this fault. If you need any files, talk to Sandro Bonazzola, as he will have the same issues on his testing setup. I will, of course, provide those files if I can in the future, but I'm redirecting this to Mr. Bonazzola for the reason stated above.

(In reply to Jiri Moskovcak from comment #30)
> Please check the audit.log if there any selinux warning. Can you try to
> deploy it with selinux in permissive mode (running setenforce 0 before you
> start the deploy)
Dropping the Regression keyword, since the NFS setup available in previous releases is not affected. Jiri, I'll try to reproduce again and will check SELinux. But if you're thinking about the issue you had with tgtd, this is not the same issue.
Created attachment 941013 [details] audit log
Created attachment 941018 [details] ovirt-host-deploy logs
(In reply to Sandro Bonazzola from comment #34)
> Created attachment 941018 [details]
> ovirt-host-deploy logs

Job for vdsmd.service failed. See 'systemctl status vdsmd.service' and 'journalctl -xn' for details.

?
I've been able to perform the whole setup on EL7 with SELinux in permissive mode. As Jiri suspected, it is an SELinux issue. Now trying to work out the needed rules. The workaround for now is to set SELinux to permissive mode.
Dan, since this seems to be something that may impact vdsm in general, and since you already have SELinux dependencies in vdsm, maybe it's better to bump the SELinux requirements in vdsm once a new release fixing the issue is available. Moving this to the vdsm component. I can take care of bumping the requirements if you don't want to do it yourself.
Nir, does this look related to bug 1127460 ?
VDSM provided a workaround for lvm selinux rules, currently under verification.
(In reply to Sandro Bonazzola from comment #39)
> VDSM provided a workaround for lvm selinux rules, currently under
> verification.

Sandro, do you have any reason to believe that this issue is related to the SELinux label on lvm devices? That issue is relevant when you have a thin-provisioned disk and the disk becomes full: vdsm detects this, extends the disk and refreshes the lv. Refreshing the lv causes it to lose the SELinux label (a systemd bug). Here I think you are working with a preallocated disk, so this does not seem to be the issue.

On xxx-vmnode4, we can see 3 refreshes during bootstrap - these refresh only the special lvs, which are not accessed by libvirt/qemu and do not have the libvirt image SELinux label:

storageRefresh::INFO::2014-09-17 09:01:36,503::lvm::668::Storage.LVM::(bootstrap) Refreshing lvs: vg=c316df3b-2309-4196-8d43-6b54e446daed lvs=['ids', 'master', 'metadata', 'outbox', 'inbox', 'leases']
storageRefresh::DEBUG::2014-09-17 09:01:36,505::lvm::288::Storage.Misc.excCmd::(cmd) /usr/bin/sudo -n /usr/sbin/lvm lvchange --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 obtain_device_list_from_udev=0 filter = [ '\''a|/dev/mapper/36090a058f04424900541b5000000904b|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' --refresh c316df3b-2309-4196-8d43-6b54e446daed/ids c316df3b-2309-4196-8d43-6b54e446daed/master c316df3b-2309-4196-8d43-6b54e446daed/metadata c316df3b-2309-4196-8d43-6b54e446daed/outbox c316df3b-2309-4196-8d43-6b54e446daed/inbox c316df3b-2309-4196-8d43-6b54e446daed/leases (cwd None)
storageRefresh::INFO::2014-09-17 09:07:00,218::lvm::668::Storage.LVM::(bootstrap) Refreshing lvs: vg=c316df3b-2309-4196-8d43-6b54e446daed lvs=['ids', 'master', 'metadata', 'outbox', 'inbox', 'leases']
storageRefresh::DEBUG::2014-09-17 09:07:00,219::lvm::288::Storage.Misc.excCmd::(cmd) /usr/bin/sudo -n /usr/sbin/lvm lvchange --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 obtain_device_list_from_udev=0 filter = [ '\''a|/dev/mapper/36090a058f04424900541b5000000904b|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' --refresh c316df3b-2309-4196-8d43-6b54e446daed/ids c316df3b-2309-4196-8d43-6b54e446daed/master c316df3b-2309-4196-8d43-6b54e446daed/metadata c316df3b-2309-4196-8d43-6b54e446daed/outbox c316df3b-2309-4196-8d43-6b54e446daed/inbox c316df3b-2309-4196-8d43-6b54e446daed/leases (cwd None)
storageRefresh::INFO::2014-09-17 09:43:41,176::lvm::668::Storage.LVM::(bootstrap) Refreshing lvs: vg=c316df3b-2309-4196-8d43-6b54e446daed lvs=['outbox', 'master', 'inbox', 'metadata', 'ids', 'leases']
storageRefresh::DEBUG::2014-09-17 09:43:41,178::lvm::288::Storage.Misc.excCmd::(cmd) /usr/bin/sudo -n /usr/sbin/lvm lvchange --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 obtain_device_list_from_udev=0 filter = [ '\''a|/dev/mapper/36090a058f04424900541b5000000904b|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' --refresh c316df3b-2309-4196-8d43-6b54e446daed/outbox c316df3b-2309-4196-8d43-6b54e446daed/master c316df3b-2309-4196-8d43-6b54e446daed/inbox c316df3b-2309-4196-8d43-6b54e446daed/metadata c316df3b-2309-4196-8d43-6b54e446daed/ids c316df3b-2309-4196-8d43-6b54e446daed/leases (cwd None)

I don't see any other vdsm log that shows a related error, so I don't see any relation to bug 1127460.
Nir, are you sure those are unrelated? I've just verified on EL7 that using http://gerrit.ovirt.org/#/c/33492/4/vdsm/storage/vdsm-lvm.rules.in as suggested yesterday by Dan allows completing the deploy without reproducing this bug.
(In reply to Sandro Bonazzola from comment #41)
> Nir are you sure those are unrelated?

I'm sure there is no evidence in this bug that they are. Can you verify that with the same version but without the patch, the deployment fails?

To verify that this was indeed the issue: when you run without the patch and the VM pauses, find the block device backing the hosted-engine lv and check its SELinux label. You can tell the device using lvs, then find the block device using:

ls -l /dev/vgname/lvname
/dev/vgname/lvname -> /dev/dm-xxx

And check the SELinux label with:

ls -Z /dev/dm-xxx

It should be svirt_image_t:s0,cxxx,cyyy. If you see fixed_disk_device_t:s0, then you indeed have the same issue.
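The check above can be sketched as a small script. Only the two type names (svirt_image_t, fixed_disk_device_t) come from this comment; the classify_label helper and the sample contexts are illustrative:

```shell
#!/bin/sh
# Classify the SELinux context on the device backing the hosted-engine lv:
# the libvirt image label means all is well, the default disk label means
# the label was lost (the suspected issue).
classify_label() {
    case "$1" in
        *svirt_image_t*)       echo "ok: libvirt image label" ;;
        *fixed_disk_device_t*) echo "bug: default disk label, same issue" ;;
        *)                     echo "unexpected label: $1" ;;
    esac
}

# On the real host you would do something like:
#   dev=$(readlink -f /dev/vgname/lvname)   # resolves to /dev/dm-xxx
#   classify_label "$(ls -Z "$dev")"
# Illustrative sample contexts:
classify_label "system_u:object_r:svirt_image_t:s0:c12,c34"
classify_label "system_u:object_r:fixed_disk_device_t:s0"
```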
I tried to reproduce by taking the latest tag and removing your patches from the code, but I can't reproduce. So maybe you're right that it's not the same issue. On the other hand, it seems to be fixed anyway in the latest vdsm. Pushing a new hosted-engine-setup package requiring the latest vdsm.
oVirt 3.5 has been released and should include the fix for this issue.