Created attachment 1505179 [details]
Error logs from journalctl and /var/log/messages

Description of problem:
I'm using oVirt Node 4.3.0 master. I can't start vdsmd.service (and so can't deploy a self-hosted engine); it stays in the 'activating (start-pre)' state. The main error messages found are:
'sysctl: cannot stat /proc/sys/ssl: No such file or directory'
'vdsmd.service: control process exited, code=exited status=1'
'Failed to start Virtual Desktop Server Manager.'
'vdsm: stopped during execute tune_system task (task returned with error code 255).'

Version-Release number of selected component (if applicable):
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
VARIANT="oVirt Node 4.3.0_master"
VARIANT_ID="ovirt-node"
PRETTY_NAME="oVirt Node 4.3.0_master"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.ovirt.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

CentOS Linux release 7.5.1804 (Core)

Installed Packages
Name    : vdsm
Arch    : x86_64
Version : 4.30.1
Release : 11.gitfcaaba8.el7

How reproducible:
It happens every time, every day, for the past month.

Steps to Reproduce:
1. Configure vdsm with 'vdsm-tool configure --force'
2. Confirm it's correctly configured with 'vdsm-tool is-configured'
3. Start vdsmd with 'systemctl start vdsmd'

Actual results:
vdsmd.service fails to start and exits with status 1. vdsm stops during tune_system execution with error code 255. vdsmd.service enters an endless loop of retries.

Expected results:
vdsmd should start without errors.

Additional info:
I have 3 servers with oVirt Node 4.3.0 master, configured in the same way as recommended by the docs; the node is healthy, as confirmed by 'nodectl check'; I've configured firewalld so that all the servers trust each other (and they do, as they can connect to each other with passwordless key-based ssh); I've created a gluster pool and it works (I've tested it manually). As far as I know, I've never succeeded in starting vdsmd. Minimal hardware requirements are met.

I've attached an xz archive containing 2 files:
1) vdsmd_journalctl.2.txt
2) vdsm_messages_err.3.txt
1) Contains the result of 'systemctl status vdsmd.service' and 'journalctl -xe -u vdsmd'
2) Contains the result of 'grep -i vdsm /var/log/messages | tail -1000'
Hi, can you please share an sosreport? tune_system just sets the sysctl parameters from /etc/sysctl.d/vdsm.conf, and there's nothing with ssl there.
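A quick way to check this locally (a minimal sketch, assuming the default drop-in location used by vdsm):

cat /etc/sysctl.d/vdsm.conf          # look for any key that doesn't belong there
sysctl -p /etc/sysctl.d/vdsm.conf    # apply only this file; sysctl names the key it cannot stat

If a stray 'ssl' line was added there by hand, the second command should fail with the same 'cannot stat /proc/sys/ssl' error that vdsmd reports.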
(In reply to Yuval Turgeman from comment #1)
> Hi, can you please share an sosreport ? tune_system just sets the sysctl
> parameters from /etc/sysctl.d/vdsm.conf, and there's nothing with ssl there.

Thank you for your input. I've checked /etc/sysctl.d/vdsm.conf and found a line about ssl which shouldn't have been there (it had been added manually). I've removed it and now vdsmd starts. I still can't deploy the hosted-engine, but I'd say that's a different issue, so I'd close this report and ask for help on the mailing list. What do you say?
PS: Hi Yuval, I'm sorry for the previous message; it was meant to be flagged as "Need additional information from assignee", but I forgot to switch the flag and left the default one about new details on the bug. Is there anything I can do to remove the previous message in order to keep the thread clean?
No good way to remove it, unfortunately, but many of us end up seeing replies even if NEEDINFO isn't set. What's happening with hosted engine now?
Hi Giovanni, thanks for the input, glad to see you got past that problem; no need to remove the previous message. If you could attach the hosted engine logs, we'll try to figure out what's going on next :)
Created attachment 1513682 [details]
hosted-engine deployment file log

The attachment is an xz-compressed .log text file, originally located in:
/var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20181210111254-zuat48.log
As suggested by the oVirt docs, I've tried deploying via Cockpit various times without success, so I went with 'hosted-engine --deploy'; I've followed these instructions:

https://www.ovirt.org/blog/2018/02/up-and-running-with-ovirt-4-2-and-gluster-storage/
https://www.ovirt.org/documentation/how-to/hosted-engine/#restarting-from-a-partially-deployed-system
https://www.ovirt.org/documentation/self-hosted/chap-Deploying_Self-Hosted_Engine/
https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.0/html-single/self-hosted_engine_guide/index

I'm using oVirt 4.3 and couldn't find more up-to-date docs for it, so I left undocumented questions blank, like these two:
"Please enter the name of the datacenter where you want to deploy this hosted-engine host. [Default]:"
"Please enter the name of the cluster where you want to deploy this hosted-engine host. [Default]:"

I've attached the log /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20181210111254-zuat48.log

Here are what I think are the relevant lines:

2018-12-10 11:28:17,906+0100 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:94 {u'_ansible_parsed': True, u'stderr_lines': [u'error: failed to connect to the hypervisor', u'error: authentication failed: Failed to start SASL negotiation: -1 (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (SPNEGO cannot find mechanisms to negotiate))'], u'cmd': u"virsh -r net-dhcp-leases default | grep -i 00:16:3e:0d:79:b5 | awk '{ print $5 }' | cut -f1 -d'/'", u'end': u'2018-12-10 11:28:17.736325', u'_ansible_no_log': False, u'stdout': u'', u'changed': True, u'invocation': {u'module_args': {u'warn': True, u'executable': None, u'_uses_shell': True, u'_raw_params': u"virsh -r net-dhcp-leases default | grep -i 00:16:3e:0d:79:b5 | awk '{ print $5 }' | cut -f1 -d'/'", u'removes': None, u'argv': None, u'creates': None, u'chdir': None, u'stdin': None}}, u'start': u'2018-12-10 11:28:17.582534', u'attempts': 50, u'stderr': u'error: failed to connect to the hypervisor\nerror: authentication failed: Failed to start SASL negotiation: -1 (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (SPNEGO cannot find mechanisms to negotiate))', u'rc': 0, u'delta': u'0:00:00.153791', u'stdout_lines': []}

2018-12-10 11:28:18,007+0100 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:98 fatal: [localhost]: FAILED! => {"attempts": 50, "changed": true, "cmd": "virsh -r net-dhcp-leases default | grep -i 00:16:3e:0d:79:b5 | awk '{ print $5 }' | cut -f1 -d'/'", "delta": "0:00:00.153791", "end": "2018-12-10 11:28:17.736325", "rc": 0, "start": "2018-12-10 11:28:17.582534", "stderr": "error: failed to connect to the hypervisor\nerror: authentication failed: Failed to start SASL negotiation: -1 (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (SPNEGO cannot find mechanisms to negotiate))", "stderr_lines": ["error: failed to connect to the hypervisor", "error: authentication failed: Failed to start SASL negotiation: -1 (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (SPNEGO cannot find mechanisms to negotiate))"], "stdout": "", "stdout_lines": []}

2018-12-10 11:28:55,305+0100 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:94 {u'msg': u"The task includes an option with an undefined variable. The error was: 'local_vm_disk_path' is undefined\n\nThe error appears to have been in '/usr/share/ovirt-hosted-engine-setup/ansible/fetch_engine_logs.yml': line 16, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n seconds: 10\n- name: Copy engine logs\n ^ here\n", u'_ansible_no_log': False}

2018-12-10 11:28:55,406+0100 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:94 ignored: [localhost]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'local_vm_disk_path' is undefined\n\nThe error appears to have been in '/usr/share/ovirt-hosted-engine-setup/ansible/fetch_engine_logs.yml': line 16, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n seconds: 10\n- name: Copy engine logs\n ^ here\n"}
Please try ovirt-hosted-engine-cleanup and redeploy
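In practice, something like this (a minimal sketch; adjust the deploy options to your setup):

ovirt-hosted-engine-cleanup    # remove the leftovers of the partial deployment on the host
hosted-engine --deploy         # then start a fresh deployment (or restart it from Cockpit)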
Created attachment 1514313 [details]
hosted-engine cleanup and deployment logs

Contains:
- cleanup log "hosted-engine-cleanup_20181214.txt"
- deployment log "ovirt-hosted-engine-setup-20181214091554-o7t4h1.log"
I've launched ovirt-hosted-engine-cleanup and deployed again without success. I've attached a new .xz containing 'hosted-engine-cleanup_20181214.txt', which is the output of 'ovirt-hosted-engine-cleanup', and 'ovirt-hosted-engine-setup-20181214091554-o7t4h1.log', which is the log of the deployment done after cleanup.

Here are the last lines of the failed deployment output:

# hosted-engine --deploy
-- skipped deploy config lines --
-- skipped various '[ INFO ] TASK/ok' lines --
[ INFO ] TASK [Remove temporary entry in /etc/hosts for the local VM]
[ INFO ] ok: [localhost]
[ INFO ] TASK [Notify the user about a failure]
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The system may not be provisioned according to the playbook results: please check the logs for the issue, fix accordingly or re-deploy from scratch.\n"}
[ ERROR ] Failed to execute stage 'Closing up': Failed executing ansible-playbook
[ INFO ] Stage: Clean up
[ INFO ] Cleaning temporary resources
[ INFO ] TASK [Gathering Facts]
[ INFO ] ok: [localhost]
[ INFO ] TASK [Fetch logs from the engine VM]
[ INFO ] ok: [localhost]
[ INFO ] TASK [Set destination directory path]
[ INFO ] ok: [localhost]
[ INFO ] TASK [Create destination directory]
[ INFO ] changed: [localhost]
[ INFO ] TASK [include_tasks]
[ INFO ] ok: [localhost]
[ INFO ] TASK [Find the local appliance image]
[ INFO ] ok: [localhost]
[ INFO ] TASK [Set local_vm_disk_path]
[ INFO ] skipping: [localhost]
[ INFO ] TASK [Give the vm time to flush dirty buffers]
[ INFO ] ok: [localhost]
[ INFO ] TASK [Copy engine logs]
[ INFO ] TASK [include_tasks]
[ INFO ] ok: [localhost]
[ INFO ] TASK [Remove local vm dir]
[ INFO ] ok: [localhost]
[ INFO ] TASK [Remove temporary entry in /etc/hosts for the local VM]
[ INFO ] ok: [localhost]
[ INFO ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20181214093205.conf'
[ INFO ] Stage: Pre-termination
[ INFO ] Stage: Termination
[ ERROR ] Hosted Engine deployment failed: please check the logs for the issue, fix accordingly or re-deploy from scratch.
Log file is located at /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20181214091554-o7t4h1.log
Simone, can you take a look, please?
Please notice that manually configuring and starting vdsm before executing hosted-engine --deploy from CLI or from cockpit is not required at all.
This, in my opinion, is due to some leftover in the libvirt configuration manually introduced by the user by mistake:

2018-12-14 09:22:34,068+0100 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:100 TASK [Get local VM IP]
2018-12-14 09:31:20,789+0100 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:94 {u'_ansible_parsed': True, u'stderr_lines': [u'error: failed to connect to the hypervisor', u'error: authentication failed: Failed to start SASL negotiation: -1 (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (SPNEGO cannot find mechanisms to negotiate))'], u'cmd': u"virsh -r net-dhcp-leases default | grep -i 00:16:3e:67:b9:3b | awk '{ print $5 }' | cut -f1 -d'/'", u'end': u'2018-12-14 09:31:20.594366', u'_ansible_no_log': False, u'stdout': u'', u'changed': True, u'invocation': {u'module_args': {u'warn': True, u'executable': None, u'_uses_shell': True, u'_raw_params': u"virsh -r net-dhcp-leases default | grep -i 00:16:3e:67:b9:3b | awk '{ print $5 }' | cut -f1 -d'/'", u'removes': None, u'argv': None, u'creates': None, u'chdir': None, u'stdin': None}}, u'start': u'2018-12-14 09:31:20.439674', u'attempts': 50, u'stderr': u'error: failed to connect to the hypervisor\nerror: authentication failed: Failed to start SASL negotiation: -1 (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (SPNEGO cannot find mechanisms to negotiate))', u'rc': 0, u'delta': u'0:00:00.154692', u'stdout_lines': []}

Giovanni, can you please try on a clean host?
Re-targeting to 4.3.1 since this BZ has not been proposed as a blocker for 4.3.0. If you think this bug should block 4.3.0, please re-target and set the blocker flag.
I came back to work only today; I'll install and configure a new machine from scratch as soon as possible. Do you want me to try with 4.3.0 or with 4.3.1?
Created attachment 1527096 [details]
20190205 ovirt config + deployment fail log

This .xz contains:
- "sysconfig_20190205.txt"
  Configurations/customisations made to the new ovirt node, installed on a new bare metal machine.
- "ovirt-hosted-engine-setup-20190205094353-ri8go0.log"
  Log of the failed deployment
I configured a new bare metal machine, formatted the drive and installed oVirt Node via "ovirt-node-ng-installer-master-el7-2018082007.iso". I know it's outdated, but I wanted to install the same version as the previous machines; should I upgrade and retry?

All I did on the freshly installed system is listed in the attached "sysconfig_20190205.txt". I did not set up any other filesystem apart from the ones for the ovirt installation, as I was hoping to just use a glusterfs setup on 3 other bare metal machines (replica 3, arbiter 1).

hosted-engine --deploy fails and the log is "ovirt-hosted-engine-setup-20190205094353-ri8go0.log", attached as well.

These two files (.txt and .log) are contained in the "20190205 ovirt config + deployment fail log" submission, attachment "new-ovirt-system_fail-log_20190205.xz".
From the log, it looks like you need to adjust the way your host resolves. Simone, is that right?

2019-02-05 09:52:11,157+0000 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:98 fatal: [localhost]: FAILED! => {"changed": false, "msg": "hostname 'ovirtfour' doesn't uniquely match the interface 'enp4s0' selected for the management bridge; it matches also interface with IP [u'192.168.124.1']. Please make sure that the hostname got from the interface for the management network resolves only there.\n"}
(In reply to Yuval Turgeman from comment #18)
> From the log, looks like you need to adjust the way your host resolves,
> Simone, is that right ?

Yes, right. Please ensure, by properly configuring your DNS or, if needed, by setting /etc/hosts, that 'ovirtfour' will be resolved only as the address assigned to the interface chosen for the management bridge.
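A quick way to verify this on the host (a sketch; substitute the host's real management IP for the placeholder):

getent ahosts ovirtfour                          # must return only the address of the management NIC
echo '<management-ip> ovirtfour' >> /etc/hosts   # or fix the DNS A/PTR records instead

After that, getent should no longer list the libvirt 192.168.124.1 address for the hostname.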
Moving to 4.3.2 since this has not been identified as a blocker for 4.3.1.
Giovanni, can you please check comment #19?
(In reply to Sandro Bonazzola from comment #21)
> Giovanni can you please check comment #19?

/etc/hosts was correctly configured; the conflicting interface match probably came from some misconfiguration in the hosted-engine setup for deployment.

Anyway, libvirtd stopped running on test machine #1. I've tried for days to restart it, without success, despite copying the config file from machines #2 and #3 (which work). So I'm reinstalling all 3 machines from scratch (I'm leaving #4 as it was, without gluster configured, to test hosted-engine with gluster deployment).

I'm using "ovirt-node-ng-installer-4.3.0-2019022810.el7.iso" from
https://resources.ovirt.org/pub/ovirt-4.3/iso/ovirt-node-ng-installer/4.3.0-2019022810/el7/

I'll try again on Monday 2019/03/03.
Giovanni, any update?
Thank you for your quick reply, Sandro. I've failed again to deploy, using oVirt Node 4.3.1 as reported before. This time I'm using a nearly-definitive setup/config, to reduce clutter and temporary configurations. I've used the Cockpit interface (as it seems that command line deployment is deprecated) with provisioned storage (as I already have a working gluster replica 3 arbiter 1).

This time, differently from previous ovirt-node versions, there was no specified log file, just a generic "look at the log".

e.g. from the previous version:
"[ ERROR ] Hosted Engine deployment failed: please check the logs for the issue, fix accordingly or re-deploy from scratch. Log file is located at /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20181214091554-o7t4h1.log"

Error message from 4.3.1:
"[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The system may not be provisioned according to the playbook results: please check the logs for the issue, fix accordingly or re-deploy from scratch.\n"}"

I think the relevant logs are still located in the same place:

# ls /var/log/ovirt-hosted-engine-setup/
ovirt-hosted-engine-setup-ansible-bootstrap_local_vm-201924145239-btepaa.log
ovirt-hosted-engine-setup-ansible-get_network_interfaces-201924144228-7hyxr5.log
ovirt-hosted-engine-setup-ansible-initial_clean-201924144731-cg540o.log
ovirt-hosted-engine-setup-ansible-validate_hostnames-201924144228-tuf9o0.log

I've searched for ERROR lines but understood nothing:

# grep -R ERROR /var/log/ovirt-hosted-engine-setup/
ovirt-hosted-engine-setup-ansible-bootstrap_local_vm-201924145239-btepaa.log:2019-03-04 15:07:54,206+0100 ERROR ansible failed {'status': 'FAILED', 'ansible_type': 'task', 'ansible_task': u'Get local VM IP', 'ansible_result': u'type: <type \'dict\'>\nstr: {\'_ansible_parsed\': True, \'stderr_lines\': [u\'Usage: grep [OPTION]... PATTERN [FILE]...\', u"Try \'grep --help\' for more information."], u\'changed\': True, u\'end\': u\'2019-03-04 15:07:53.432081\', \'_ansible_no_log\': False, u\'stdout\': u\'\', u\'cmd\': u"virsh -r net-dhcp-leases default | grep -i | awk \'{ prin', 'ansible_host': u'localhost', 'ansible_playbook': u'/usr/share/ovirt-hosted-engine-setup/ansible/trigger_role.yml'}
ovirt-hosted-engine-setup-ansible-bootstrap_local_vm-201924145239-btepaa.log:2019-03-04 15:08:03,107+0100 ERROR ansible failed {'status': 'FAILED', 'ansible_type': 'task', 'ansible_task': u'Notify the user about a failure', 'ansible_result': u"type: <type 'dict'>\nstr: {'msg': u'The system may not be provisioned according to the playbook results: please check the logs for the issue, fix accordingly or re-deploy from scratch.\\n', 'changed': False, '_ansible_no_log': False}", 'ansible_host': u'localhost', 'ansible_playbook': u'/usr/share/ovirt-hosted-engine-setup/ansible/trigger_role.yml'}

I've attached the failed deployment output and all those 4 log files under '/var/log/ovirt-hosted-engine-setup/'; let me know if I should add something else.

NOTE: when I write a lengthy post like this and provide attachments as well, should I make them a single post (using the Comment section in the attachment interface) or should I continue like this and make 2 separate posts?
Perhaps Ido
Created attachment 1540879 [details]
ovirt node 4.3.1: failed deployment message/logs

'ovirt-4.3.1_failed-deploy_20190305.xz' contains:
- 'ovirt-he_failed-deploy_20190305.txt', which is a copy-paste from the cockpit failed deployment output
- 4 'ovirt-hosted-engine-setup-*.log' files, which are the log files under '/var/log/ovirt-hosted-engine-setup/'
At the end of the log file ovirt-4.3.1_failed-deploy_20190305 there is the deployment summary. In the future you can use it to see all the tasks that were executed and the ones that failed (like in this case).

The error I see is:

[ INFO ] TASK [ovirt.hosted_engine_setup : Get local VM IP]
[ ERROR ] fatal: [localhost]: FAILED! => {"attempts": 50, "changed": true, "cmd": "virsh -r net-dhcp-leases default | grep -i | awk '{ print $5 }' | cut -f1 -d'/'", "delta": "0:00:00.145013", "end": "2019-03-04 15:07:53.432081", "rc": 0, "start": "2019-03-04 15:07:53.287068", "stderr": "Usage: grep [OPTION]... PATTERN [FILE]...\nTry 'grep --help' for more information.", "stderr_lines": ["Usage: grep [OPTION]... PATTERN [FILE]...", "Try 'grep --help' for more information."], "stdout": "", "stdout_lines": []}

The error is on this line:

- name: Get local VM IP
  shell: virsh -r net-dhcp-leases default | grep -i {{ he_vm_mac_addr }} | awk '{ print $5 }' | cut -f1 -d'/'

In the code, if the user doesn't define something specific for the 'he_vm_mac_addr' variable, we generate something for him.

Giovanni, did you set the 'he_vm_mac_addr' variable in the variables file, maybe?
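For illustration, this is the difference between the failing invocation and a working one (the MAC address below is just the example one from the earlier log):

virsh -r net-dhcp-leases default | grep -i | awk '{ print $5 }' | cut -f1 -d'/'
# grep gets no PATTERN at all, prints its usage message, and the pipeline returns nothing;
# this is exactly what the log above shows when he_vm_mac_addr expands to an empty string

virsh -r net-dhcp-leases default | grep -i 00:16:3e:0d:79:b5 | awk '{ print $5 }' | cut -f1 -d'/'
# with a MAC filled in, the pipeline extracts the IP leased to the local bootstrap VM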
(In reply to Ido Rosenzwig from comment #27)
> [...]
>
> The error is on this line:
>     - name: Get local VM IP
>       shell: virsh -r net-dhcp-leases default | grep -i {{ he_vm_mac_addr }} | awk '{ print $5 }' | cut -f1 -d'/'
>
> In the code, if the user doesn't define something specific for
> 'he_vm_mac_addr' variable, we generate something for him.
>
> Giovanni, did you set 'he_vm_mac_addr' variable in the variables file maybe ?

I don't think I set any 'he_vm_mac_addr'; I haven't set any file, since I'm using the cockpit web interface to deploy, and I left the MAC Address field as it was, filled with what I think is a randomly generated one, as it changes every time.

I've also tried to deploy after deleting it, hoping it would force a random MAC for sure, but the deployment still failed. So I tried deploying again without removing the MAC and noticed it was different from the previous time, so it should be a successfully generated random one.

The only things I set are:
Engine VM FQDN - Which is validated successfully and registered on my DNS
Network Configuration - Static
VM IP Address - Registered on my DNS
DNS Servers - They exist and I'm regularly using them from other machines
Root password
Host FQDN - Which is validated successfully and registered on my DNS

This is what I'm getting:
[...]
[ ERROR ] fatal: [localhost]: FAILED! => {"ansible_facts": {"ovirt_hosts": []}, "attempts": 120, "changed": false}
[...]
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The system may not be provisioned according to the playbook results: please check the logs for the issue, fix accordingly or re-deploy from scratch.\n"}

I'll attach 2 full logs.
Please attach the logs from your latest tries.
Created attachment 1548579 [details]
ovirt node 4.3.1: failed deployment message/logs (20190327)

'ovirt-4.3.1_failed-deploy_20190327.xz' content:

This is the copy-paste of the cockpit failed deployment output:
"ovirt-he_failed-deploy_20190327.txt"

This is the folder of logs automatically created under '/var/log/ovirt-hosted-engine-setup':
"engine-logs-2019-03-27T09:26:57Z"

These are the logs in '/var/log/ovirt-hosted-engine-setup':
"ovirt-hosted-engine-setup-ansible-bootstrap_local_vm-2019227102422-gwc3uh.log"
"ovirt-hosted-engine-setup-ansible-get_network_interfaces-2019227101655-z211oc.log"
"ovirt-hosted-engine-setup-ansible-initial_clean-2019227102055-rxuzqo.log"
"ovirt-hosted-engine-setup-ansible-validate_hostnames-2019227101655-bsdk4t.log"
"ovirt-hosted-engine-setup-ansible-validate_hostnames-2019227101950-onq2vd.log"
Giovanni, are you using a static IP configuration or DHCP?
I'm using a static configuration and I have a text file with all the deployment settings ready to be copy-pasted. The static address is already registered on my DNS and dig gives back the right answer, both forward and reverse. I've also tried temporarily stopping the firewall, and Cockpit crashes immediately after starting the deployment, due to a JavaScript error.
The issue comes from here:

2019-03-27 10:46:54,339+01 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-1) [49de530f] EVENT_ID: USER_VDC_LOGIN(30), User admin@internal-authz connecting from '192.168.124.1' using session 'hZ74u5X9bqqmwEx/DaUhQGUhP7xvLRvUEcUk0CAqiINh9unihAfVS/DQ+tl91S+83flilIVAXGJMrqE5/GV7/g==' logged in.
2019-03-27 10:47:14,593+01 ERROR [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-2) [306c6975-72be-4b09-886d-e3cf3dcc7bdf] Failed to establish session with host 'ovn1.ifac.cnr.it': Failed to get the session.
2019-03-27 10:47:14,595+01 WARN [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-2) [306c6975-72be-4b09-886d-e3cf3dcc7bdf] Validation of action 'AddVds' failed for user admin@internal-authz. Reasons: VAR__ACTION__ADD,VAR__TYPE__HOST,$server ovn1.ifac.cnr.it,VDS_CANNOT_CONNECT_TO_SERVER
2019-03-27 10:47:14,642+01 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-2) [] Operation Failed: [Cannot add Host. Connecting to host via SSH has failed, verify that the host is reachable (IP address, routable address etc.) You may refer to the engine.log file for further details.]

Giovanni, do you have something custom in /etc/hosts.allow and /etc/hosts.deny?
(In reply to Simone Tiraboschi from comment #33)
> The issue comes from here:
>
> [...]
>
> Giovanni, do you have something custom in /etc/hosts.allow and
> /etc/hosts.deny?

I've checked and they're empty.

'cat /etc/hosts.allow'
#
# hosts.allow   This file contains access rules which are used to
#               allow or deny connections to network services that
#               either use the tcp_wrappers library or that have been
#               started through a tcp_wrappers-enabled xinetd.
#
#               See 'man 5 hosts_options' and 'man 5 hosts_access'
#               for information on rule syntax.
#               See 'man tcpd' for information on tcp_wrappers
#

'cat /etc/hosts.deny'
#
# hosts.deny    This file contains access rules which are used to
#               deny connections to network services that either use
#               the tcp_wrappers library or that have been
#               started through a tcp_wrappers-enabled xinetd.
#
#               The rules in this file can also be set up in
#               /etc/hosts.allow with a 'deny' option instead.
#
#               See 'man 5 hosts_options' and 'man 5 hosts_access'
#               for information on rule syntax.
#               See 'man tcpd' for information on tcp_wrappers
#

I've reset firewalld to the defaults by deleting everything but "firewalld.conf" in "/etc/firewalld/". I've created 2 custom xml-based services to add all the ports reported here:
https://www.ovirt.org/documentation/install-guide/chap-System_Requirements.html
You'll find these services in "custom_firewall-1649268.txt", which is inside "ovirt-4.3.1_failed-deploy_20190402.xz".

I've added those services via (of course I've used the correct path and the actual names):
firewall-offline-cmd --new-service-from-file=/path/service.xml --name=custom-service
firewall-cmd --reload
firewall-cmd --zone=public --add-service custom-service
firewall-cmd --runtime-to-permanent

Both services appear under the public zone. 'ss -tulpn' shows various open ports but not all of them, e.g. 161/udp is missing, I'd say because there's no active process using it. Port 22 is open.
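For the record, a quick sanity check of what firewalld actually ended up with (a sketch, assuming the services were added to the public zone as above):

firewall-cmd --zone=public --list-services     # the two custom services should be listed here
firewall-cmd --info-service=custom-service     # shows the ports each custom service opens
firewall-cmd --zone=public --list-ports        # any ports added directly to the zone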
Created attachment 1550925 [details]
ovirt node 4.3.1: failed deployment message/logs (20190402)

"ovirt-4.3.1_failed-deploy_20190402.xz" content:

This is the copy-paste of the cockpit failed deployment output:
"ovirt-he_failed-deploy_20190402.txt"

This is the folder of logs automatically created under "/var/log/ovirt-hosted-engine-setup":
"engine-logs-2019-04-02T07.41.59Z"

These are the logs in '/var/log/ovirt-hosted-engine-setup':
"ovirt-hosted-engine-setup-ansible-bootstrap_local_vm-20193293828-heg3xy.log"
"ovirt-hosted-engine-setup-ansible-get_network_interfaces-2019329338-mhv5o0.log"
"ovirt-hosted-engine-setup-ansible-initial_clean-2019329357-qvowsu.log"
"ovirt-hosted-engine-setup-ansible-validate_hostnames-2019329338-20j2ew.log"

This contains the xml I've created:
"custom_firewal-1649268l.txt"
Giovanni, can you please attach also /var/log/vdsm/vdsm.log and /var/log/vdsm/supervdsm.log from your host?
Created attachment 1550950 [details]
ovirt node 4.3.1: failed deployment message/logs (20190402)

"ovirt-4.3.1_failed-deploy_20190402.xz" content:

This is the copy-paste of the cockpit failed deployment output:
"ovirt-he_failed-deploy_20190402.txt"

This is the folder of logs automatically created under "/var/log/ovirt-hosted-engine-setup":
"engine-logs-2019-04-02T07.41.59Z"

These are the logs in '/var/log/ovirt-hosted-engine-setup' related to the last deployment attempt:
"ovirt-hosted-engine-setup-ansible-initial_clean-201932122532-0tsrch.log"

This contains the 2 xmls I've created:
"custom_firewal-1649268l.txt"
Giovanni,
can you please also attach /var/log/vdsm/vdsm.log and /var/log/vdsm/supervdsm.log from your host?

The engine log now stops at:

2019-04-02 10:06:57,765+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HostSetupNetworksVDSCommand] (EE-ManagedThreadFactory-engine-Thread-1) [5d5ed4ab] START, HostSetupNetworksVDSCommand(HostName = ovn1.ifac.cnr.it, HostSetupNetworksVdsCommandParameters:{hostId='1a5c640c-e932-4570-9d06-25f54c6faa6c', vds='Host[ovn1.ifac.cnr.it,1a5c640c-e932-4570-9d06-25f54c6faa6c]', rollbackOnFailure='true', commitOnSuccess='false', connectivityTimeout='120', networks='[HostNetwork:{defaultRoute='true', bonding='false', networkName='ovirtmgmt', vdsmName='ovirtmgmt', nicName='eno1', vlan='null', vmNetwork='true', stp='false', properties='null', ipv4BootProtocol='STATIC_IP', ipv4Address='149.139.32.240', ipv4Netmask='255.255.252.0', ipv4Gateway='149.139.32.1', ipv6BootProtocol='AUTOCONF', ipv6Address='null', ipv6Prefix='null', ipv6Gateway='null', nameServers='null'}]', removedNetworks='[]', bonds='[]', removedBonds='[]', clusterSwitchType='LEGACY', managementNetworkChanged='true'}), log id: 671869c9
2019-04-02 10:06:57,772+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HostSetupNetworksVDSCommand] (EE-ManagedThreadFactory-engine-Thread-1) [5d5ed4ab] FINISH, HostSetupNetworksVDSCommand, return: , log id: 671869c9

So the engine successfully managed to communicate with the host over ssh, but I fear that something bad happened when it tried to create the management bridge.
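A couple of quick checks on the host could also help here (a sketch, assuming the standard vdsm log locations):

ip link show ovirtmgmt                                       # did the management bridge get created at all?
grep -i 'setupNetworks' /var/log/vdsm/supervdsm.log | tail   # the network setup steps and anything that went wrong around them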
(In reply to Simone Tiraboschi from comment #38)
> Giovanni,
> can you please attach also /var/log/vdsm/vdsm.log and
> /var/log/vdsm/supervdsm.log from your host?
>
> engine logs now stops on:
> 2019-04-02 10:06:57,765+02 INFO
> [org.ovirt.engine.core.vdsbroker.vdsbroker.HostSetupNetworksVDSCommand]
> (EE-ManagedThreadFactory-engine-Thread-1) [5d5ed4ab] START,
> HostSetupNetworksVDSCommand(HostName = ovn1.ifac.cnr.it, ...)
> 2019-04-02 10:06:57,772+02 INFO
> [org.ovirt.engine.core.vdsbroker.vdsbroker.HostSetupNetworksVDSCommand]
> (EE-ManagedThreadFactory-engine-Thread-1) [5d5ed4ab] FINISH,
> HostSetupNetworksVDSCommand, return: , log id: 671869c9
>
> so the engine successfully managed to communicate with the host over ssh,
> but I fear that something bad happened when it tried to create the
> management bridge.

I've attached vdsm-logs.xz as requested.

After the deployment failed with ssh connection issues, I'm now again getting address resolution issues:
"[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The resolved address doesn't resolve on the selected interface\n"}"

I don't understand what "on the selected interface" means. Querying via 'dig ovn1.ifac.cnr.it', 'dig -x 149.139.32.240', 'dig ovirt-engine.ifac.cnr.it' and 'dig -x 149.139.32.70' I always receive answers and they seem to be correct. I've also tried using other third-party DNS services.

'ip link show' shows that ovirtmgmt is associated to the eno1 interface:
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovirtmgmt state UP mode DEFAULT group default qlen 1000
Created attachment 1550993 [details]
vdsm.log + supervdsm.log

This .xz archive contains:
"/var/log/vdsm/vdsm.log"
"/var/log/vdsm/supervdsm.log"
(In reply to Giovanni from comment #39)
> I don't understand what does "on the selected interface" mean.

From your logs I see that on your env ovn1 got resolved as
fe80::fc16:3eff:fe12:9007 STREAM ovn1
(try 'getent ahosts ovn1 | grep ovn1')

while on ovirtmgmt you have fe80::222:19ff:fe50:55e1

and so that error, because fe80::fc16:3eff:fe12:9007 != fe80::222:19ff:fe50:55e1

I'd suggest fixing IPv6 name resolution or passing the --4 option to force the setup to ignore IPv6.
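To see the mismatch directly on the host, something like this should do (a sketch):

getent ahosts ovn1 | grep ovn1     # the addresses the hostname resolves to, IPv6 included
ip -6 addr show dev ovirtmgmt      # the IPv6 addresses actually configured on the management bridge
hosted-engine --deploy --4         # or simply redeploy forcing IPv4 only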
(In reply to Simone Tiraboschi from comment #41)
> (In reply to Giovanni from comment #39)
> > I don't understand what does "on the selected interface" mean.
>
> From your logs I see that on your env ovn1 got resolved as
> fe80::fc16:3eff:fe12:9007 STREAM ovn1
>
> (try 'getent ahosts ovn1 | grep ovn1')
>
> while on ovirtmgmt you have fe80::222:19ff:fe50:55e1
>
> and so that error because fe80::fc16:3eff:fe12:9007 !=
> fe80::222:19ff:fe50:55e1
>
> I'd suggest to fix IPv6 name resolution or passing --4 option to force the
> setup to ignore IPv6.

I've run 'hosted-engine --deploy --4' and it failed again; I've attached both failure logs.

The first time it failed with:
"2019-04-03 10:03:00,859+0200 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:107 AuthError: Error during SSO authentication access_denied : Cannot authenticate user 'None@N/A': No valid profile found in credentials..
2019-04-03 10:03:00,960+0200 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:107 fatal: [localhost]: FAILED! => {"changed": false, "msg": "Error during SSO authentication access_denied : Cannot authenticate user 'None@N/A': No valid profile found in credentials.."}"

So I tried again specifying the ssh RSA public key and got this error:
"2019-04-03 10:29:04,453+0200 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:107 fatal: [localhost]: FAILED! => {"changed": false, "msg": "hostname 'ovn1' doesn't uniquely match the interface 'eno1' selected for the management bridge; it matches also interface with IP [u'192.168.124.1']. Please make sure that the hostname got from the interface for the management network resolves only there.\n"}
2019-04-03 10:29:05,356+0200 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:103 PLAY RECAP [localhost] : ok: 82 changed: 17 unreachable: 0 skipped: 25 failed: 1"

I really don't understand why that error about the commonly matched interface comes out only after specifying the ssh RSA key, and why it comes out at all. I've checked via 'ip addr' and 192.168.124.1 is associated to virbr0; is it just the default virtual bridge provided by libvirt? Should I just disable/nuke it?

24: virbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 52:54:00:d4:f9:e3 brd ff:ff:ff:ff:ff:ff
    inet 192.168.124.1/24 brd 192.168.124.255 scope global virbr0
       valid_lft forever preferred_lft forever

Also to note: using the command line, I've found no way to specify the host FQDN and I get this warning:

[ INFO ] Stage: Setup validation
[WARNING] Host name ovn1 has no domain suffix
[WARNING] Failed to resolve ovn1 using DNS, it can be resolved only locally
[ INFO ] Stage: Transaction setup
[ INFO ] Stage: Misc configuration (early)

Is there any way to either:
1) Specify the host FQDN in the command line.
2) Force IPv4 (--4) in the cockpit interface.
?
Created attachment 1551292 [details]
ovirt node 4.3.1: failed deployment logs (20190403)

"ovirt-4.3.1_he_failed-deploy_20190403.xz" content:

This is the first command line failed deployment log, about SSO authentication:
"ovirt-hosted-engine-setup-20190403093234-jcwo0h.log"

This is the second command line failed deployment log, about the commonly matched interface:
"ovirt-hosted-engine-setup-20190403101921-bleucc.log"
I forgot: what if I disable IPv6 entirely in /etc/sysctl.conf? Would that stop the ovirt deployment from complaining about different addresses for the host NIC and the bridge?
(In reply to Giovanni from comment #42)
> Is there any way to either:
> 1) Specify host FQDN in command line.

We fixed that with https://bugzilla.redhat.com/1692460, adding an interactive question.

(In reply to Giovanni from comment #44)
> I forgot: what if I disable IPv6 entirely in /etc/sysctl.conf ? Would that
> stop ovirt deployment complaining about different addresses for host NIC and
> bridge?

2019-04-03 10:28:59,842+0200 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:103 TASK [ovirt.hosted_engine_setup : debug]
2019-04-03 10:29:00,744+0200 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:103 hostname_res_count_output: {'stderr_lines': [], u'changed': True, u'end': u'2019-04-03 10:28:58.745997', u'stdout': u'149.139.32.240\n192.168.124.1', u'cmd': u"getent ahostsv4 ovn1 | cut -d' ' -f1 | uniq", 'failed': False, u'delta': u'0:00:00.009427', u'stderr': u'', u'rc': 0, 'stdout_lines': [u'149.139.32.240', u'192.168.124.1'], u'start': u'2019-04-03 10:28:58.736570'}

On your environment 'ovn1' resolves to 149.139.32.240 and to 192.168.124.1, while we require it to resolve to only one interface.

The quickest workaround is adding a line to /etc/hosts on your host with:
149.139.32.240 ovn1
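That is, roughly (using the same check the setup runs internally):

echo '149.139.32.240 ovn1' >> /etc/hosts
getent ahostsv4 ovn1 | cut -d' ' -f1 | uniq    # should now print only 149.139.32.240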
(In reply to Simone Tiraboschi from comment #45)
> [...]
> The quickest workaround is about adding a line on /etc/hosts on your host
> with
> 149.139.32.240 ovn1

This indeed helped: this time it lasted 1 hour before failing, and with another issue:

"2019-04-03 13:19:44,425+0200 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:107 fatal: [localhost]: FAILED! => {"changed": false, "msg": "The host has been set in non_operational status, please check engine logs, fix accordingly and re-deploy.\n"}"

I've attached the relevant log and messages again.
Created attachment 1551338 [details]
ovirt node 4.3.1: failed deployment logs (20190403T10)

"ovirt-4.3.1_he_failed-deploy_20190403.xz" content:

This is the command line failed deployment log:
"ovirt-hosted-engine-setup-20190403093234-jcwo0h.log"

This contains the folder under "/var/log/ovirt-hosted-engine-setup/engine-logs-2019-04-03T10:13:02Z/":
"engine-logs-2019-04-03T10.13.02Z"
The issue now is here:

2019-04-03 13:19:31,985+02 INFO [org.ovirt.engine.core.bll.HandleVdsCpuFlagsOrClusterChangedCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-54) [e983d3e] Running command: HandleVdsCpuFlagsOrClusterChangedCommand internal: true. Entities affected : ID: 9667fe3a-763f-4dda-9b55-7c43d0ec9396 Type: VDS
2019-04-03 13:19:31,991+02 ERROR [org.ovirt.engine.core.bll.HandleVdsCpuFlagsOrClusterChangedCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-54) [e983d3e] Could not find server cpu for server 'ovn1' (9667fe3a-763f-4dda-9b55-7c43d0ec9396), flags: 'fpu,vme,de,pse,tsc,msr,pae,mce,cx8,apic,sep,mtrr,pge,mca,cmov,pat,pse36,clflush,dts,acpi,mmx,fxsr,sse,sse2,ss,ht,tm,pbe,syscall,nx,lm,constant_tsc,arch_perfmon,pebs,bts,rep_good,nopl,aperfmperf,eagerfpu,pni,dtes64,monitor,ds_cpl,vmx,est,tm2,ssse3,cx16,xtpr,pdcm,dca,sse4_1,xsave,lahf_lm,tpr_shadow,vnmi,flexpriority,dtherm,model_Opteron_G2,model_kvm32,model_coreduo,model_Conroe,model_Opteron_G1,model_core2duo,model_qemu32,model_Penryn,model_pentium2,model_pentium3,model_qemu64,model_kvm64,model_pentium,model_486'
2019-04-03 13:19:32,068+02 INFO [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-54) [7d446d75] Running command: SetNonOperationalVdsCommand internal: true. Entities affected : ID: 9667fe3a-763f-4dda-9b55-7c43d0ec9396 Type: VDS
2019-04-03 13:19:32,075+02 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-54) [7d446d75] START, SetVdsStatusVDSCommand(HostName = ovn1, SetVdsStatusVDSCommandParameters:{hostId='9667fe3a-763f-4dda-9b55-7c43d0ec9396', status='NonOperational', nonOperationalReason='CPU_TYPE_INCOMPATIBLE_WITH_CLUSTER', stopSpmFailureLogged='false', maintenanceReason='null'}), log id: 1545a674

Giovanni, what kind of CPU are you using on that host?
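For reference, the CPU model and flags the engine is checking against can be read directly on the host with something like:

grep -m1 'model name' /proc/cpuinfo     # human-readable CPU model
lscpu | grep -E 'Model name|Flags'      # same information plus the full flags list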
OK, it's an Intel(R) Xeon(R) CPU E5410, which is based on the Penryn micro-architecture.

Please notice that oVirt 4.3 removed support for Conroe and Penryn CPUs, as per:
https://bugzilla.redhat.com/1540921

There is no way to deploy oVirt 4.3 on that hardware.
Ok, I thought it was still supported:
https://www.ovirt.org/documentation/install-guide/chap-System_Requirements.html

Hypervisor Requirements
CPU Requirements
All CPUs must have support for the Intel® 64 or AMD64 CPU extensions, and the AMD-V™ or Intel VT® hardware virtualization extensions enabled. Support for the No eXecute flag (NX) is also required.
The following CPU models are supported:
AMD
[...]
Intel
[...]
Penryn
[...]

Where can I find the CPU support list for 4.2.8 (or whichever is the latest 4.2 version)?
(In reply to Giovanni from comment #50)
> Where can I find CPU support list for 4.2.8 (or whichever is the latest 4.2
> version)?

Yes, that document is a bit outdated and the list still refers to 4.2.z. So it will work with 4.2.z, but please note that you will never be able to upgrade to 4.3 on that HW.
Ok, no problem, these machines are just old hardware lying around that I use to test things on before planning the real thing. Thank you very much for the support.
I have a newer CPU and the same errors in the logs.

"hosted-engine --deploy --restore-from-file=/..." on freshly installed hosts cannot complete (v4.2.8).

ERROR ansible failed {'status': 'FAILED', 'ansible_type': 'task', 'ansible_task': u'Copy engine logs', 'ansible_result': u'type: <type \'dict\'>\nstr: {\'msg\': u"The task includes an option with an undefined variable. The error was: \'local_vm_disk_path\' is undefined\\n\\nThe error appears to have been in \'/usr/share/ovirt-hosted-engine-setup/ansible/fetch_engine_logs.yml\': line 16, column 3, but may\\nbe elsewhere in the file depending on the exact sy', 'ansible_host': u'localhost', 'ansible_playbook': u'/usr/share/ovirt-hosted-engine-setup/ansible/final_clean.yml'}

I'm trying to use "noansible", but this option is only for a new install, not a restore.

#lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                12
On-line CPU(s) list:   0-11
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             12
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 94
Model name:            Intel Core Processor (Skylake)
Stepping:              3
CPU MHz:               2294.608
BogoMIPS:              4589.21
Virtualization:        VT-x
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
L3 cache:              16384K
NUMA node0 CPU(s):     0-11
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch tpr_shadow vnmi flexpriority ept vpid fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap xsaveopt xsavec xgetbv1 arat
(In reply to dearfriend from comment #53)
> I have newer CPU and same errors in logs
>
> "hosted-engine --deploy --restore-from-file=/..." on fresh installed hosts
> cannot complete. v.4.2.8
>
> ERROR ansible failed {'status': 'FAILED', 'ansible_type': 'task',
> 'ansible_task': u'Copy engine logs', 'ansible_result': u'type: <type
> \'dict\'>\nstr: {\'msg\': u"The task includes an option with an undefined
> variable. The error was: \'local_vm_disk_path\' is undefined\\n\\nThe error
> appears to have been in
> \'/usr/share/ovirt-hosted-engine-setup/ansible/fetch_engine_logs.yml\': line
> 16, column 3, but may\\nbe elsewhere in the file depending on the exact sy',
> 'ansible_host': u'localhost', 'ansible_playbook':
> u'/usr/share/ovirt-hosted-engine-setup/ansible/final_clean.yml'}

This is just something in the cleanup procedure; the real error is a few lines above it.
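A quick way to jump to the first real failure in the setup log (a sketch, assuming the default log location):

grep -n 'FAILED!' /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-*.log | head
# the first fatal task listed there is usually the actual cause; the 'Copy engine logs' failure is only cleanup noise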
ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:98 fatal: [localhost]: FAILED! => {"ansible_facts": {"ovirt_hosts": []}, "attempts": 120, "changed": false}

I can see the engine is running on 192.168.122.181, but not on an external IP on the ovirtmgmt network with the associated dns name.
(In reply to dearfriend from comment #55)
> ERROR otopi.ovirt_hosted_engine_setup.ansible_utils
> ansible_utils._process_output:98 fatal: [localhost]: FAILED! =>
> {"ansible_facts": {"ovirt_hosts": []}, "attempts": 120, "changed": false}

This means that the engine failed to deploy the host: I'd suggest connecting to the bootstrap engine VM and checking engine.log and the host-deploy logs there.

> I can see engine is running on 192.168.122.181. But not external IP on
> ovirtmgmt network with associated dns name.

This is absolutely fine at that stage.
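For example, roughly (the VM address is the one from your comment; the log paths are the usual oVirt engine locations, adjust if yours differ):

virsh -r net-dhcp-leases default        # confirm the bootstrap VM lease / address
ssh root@192.168.122.181                # log in to the bootstrap engine VM
less /var/log/ovirt-engine/engine.log   # look for the host-deploy / AddVds errors
ls /var/log/ovirt-engine/host-deploy/   # per-host deployment logs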