Bug 1649268 - vdsmd.service stuck in state: activating
Summary: vdsmd.service stuck in state: activating
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: vdsm
Classification: oVirt
Component: Services
Version: 4.30.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Ido Rosenzwig
QA Contact: meital avital
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-11-13 09:46 UTC by Giovanni
Modified: 2019-06-11 15:52 UTC (History)
CC: 8 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2019-04-03 14:54:46 UTC
oVirt Team: Integration
Embargoed:


Attachments
Error logs from journalctl and /var/log/messages (2.69 KB, application/x-xz)
2018-11-13 09:46 UTC, Giovanni
hosted-engine deployment file log (22.62 KB, application/x-xz)
2018-12-12 14:02 UTC, Giovanni
hosted-engine cleanup and deployment logs (23.38 KB, application/x-xz)
2018-12-14 09:03 UTC, Giovanni
20190205 ovirt config + deployment fail log (21.10 KB, application/x-xz)
2019-02-05 10:30 UTC, Giovanni
ovirt node 4.3.1: failed deployment message/logs (68.76 KB, application/x-xz)
2019-03-05 09:50 UTC, Giovanni
ovirt node 4.3.1: failed deployment message/logs (20190327) (393.24 KB, application/x-xz)
2019-03-27 13:27 UTC, Giovanni
ovirt node 4.3.1: failed deployment message/logs (20190402) (422.23 KB, application/x-xz)
2019-04-02 09:19 UTC, Giovanni
ovirt node 4.3.1: failed deployment message/logs (20190402) (398.13 KB, application/x-xz)
2019-04-02 10:46 UTC, Giovanni
vdsm.log + supervdsm.log (82.21 KB, application/x-xz)
2019-04-02 11:51 UTC, Giovanni
ovirt node 4.3.1: failed deployment logs (20190403) (40.34 KB, application/x-xz)
2019-04-03 09:20 UTC, Giovanni
ovirt node 4.3.1: failed deployment logs (20190403T10) (407.77 KB, application/x-xz)
2019-04-03 11:41 UTC, Giovanni

Description Giovanni 2018-11-13 09:46:39 UTC
Created attachment 1505179 [details]
Error logs from journalctl and /var/log/messages

Description of problem:
I'm using oVirt Node 4.3.0 master. I can't start vdsmd.service (and therefore can't deploy a self-hosted engine); it stays in the 'activating (start-pre)' state.
The main error messages found are:
'sysctl: cannot stat /proc/sys/ssl: No such file or directory'
'vdsmd.service: control process exited, code=exited status=1'
'Failed to start Virtual Desktop Server Manager.'
'vdsm: stopped during execute tune_system task (task returned with error code 255).'

Version-Release number of selected component (if applicable):
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
VARIANT="oVirt Node 4.3.0_master"
VARIANT_ID="ovirt-node"
PRETTY_NAME="oVirt Node 4.3.0_master"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.ovirt.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"


CentOS Linux release 7.5.1804 (Core)

Installed Packages
Name        : vdsm
Arch        : x86_64
Version     : 4.30.1
Release     : 11.gitfcaaba8.el7

How reproducible:
It happens every time; it has been failing daily for a month.

Steps to Reproduce:
1. Configure vdsm with 'vdsm-tool configure --force'
2. Confirm it's correctly configured with 'vdsm-tool is-configured'
3. Start vdsmd with 'systemctl start vdsmd'

Actual results:
vdsmd.service fails to start and exits with status 1.
vdsm stops during tune_system execution with error code 255.
vdsmd.service enters an endless loop of retries.

Expected results:
vdsmd should start without errors.

Additional info:
I have 3 servers with oVirt Node 4.3.0 master, configured in the same way as recommended by the docs. The node is healthy, as confirmed by 'nodectl check'. I've configured firewalld so that all the servers trust each other (and they do, as they can connect to each other with passwordless key-based SSH). I've created a gluster pool and it works (I've tested it manually). As far as I know, I've never succeeded in starting vdsmd.

Minimal hardware requirements are met.

I've attached an xz archive containing 2 files:
1) vdsmd_journalctl.2.txt
2) vdsm_messages_err.3.txt

1) Contains the result of 'systemctl status vdsmd.service' and 'journalctl -xe -u vdsmd'
2) Contains the result of 'grep -i vdsm /var/log/messages | tail -1000'

Comment 1 Yuval Turgeman 2018-11-28 10:59:10 UTC
Hi, can you please share an sosreport? tune_system just sets the sysctl parameters from /etc/sysctl.d/vdsm.conf, and there's nothing about ssl there.
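
For reference, a quick way to reproduce what tune_system does outside the service (assuming the default layout, where vdsm ships its sysctl snippet as /etc/sysctl.d/vdsm.conf):

# show the parameters vdsm will apply
cat /etc/sysctl.d/vdsm.conf
# apply the file manually; a bogus key such as 'ssl' fails here too
sysctl -p /etc/sysctl.d/vdsm.conf

Any line in that file that does not name a real key under /proc/sys produces the same "cannot stat" error reported above.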

Comment 2 Giovanni 2018-12-11 14:43:46 UTC
(In reply to Yuval Turgeman from comment #1)
> Hi, can you please share an sosreport ?  tune_system just sets the sysctl
> parameters from /etc/sysctl.d/vdsm.conf, and there's nothing with ssl there.


Thank you for your input. I've checked /etc/sysctl.d/vdsm.conf and found a line about ssl which shouldn't have been there (it had been added manually). I've removed it and now vdsmd starts.
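
A minimal sketch of the cleanup that worked here (assuming the stray entry literally starts with 'ssl'; adjust the pattern to whatever the extra line actually is, and keep a backup):

cp /etc/sysctl.d/vdsm.conf /etc/sysctl.d/vdsm.conf.bak
# drop the stray line, re-apply the file, then restart the service
sed -i '/^ssl/d' /etc/sysctl.d/vdsm.conf
sysctl -p /etc/sysctl.d/vdsm.conf
systemctl restart vdsmd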

I still can't deploy hosted-engine, but I'd say that's a different issue, so I'd close this report and ask for help on the mailing list.

What do you say?

Comment 3 Giovanni 2018-12-11 14:46:11 UTC
PS: Hi Yuval, I'm sorry for the previous message; it was meant to be flagged as "Need additional information from assignee", but I forgot to switch the flag and left the default one about new details on the bug. Is there anything I can do to remove the previous message in order to keep the thread clean?

Comment 4 Ryan Barry 2018-12-11 15:22:45 UTC
No good way to remove it, unfortunately, but many of us end up seeing replies even if NEEDINFO isn't set.

What's happening with hosted engine now?

Comment 5 Yuval Turgeman 2018-12-11 15:26:09 UTC
Hi Giovanni, thanks for the input; glad to see you got past that problem, and there's no need to remove the previous message.
If you could attach the hosted engine logs, we'll try to figure out what's going on next :)

Comment 6 Giovanni 2018-12-12 14:02:36 UTC
Created attachment 1513682 [details]
hosted-engine deployment file log

The attachment is an xz-compressed .log text file, originally located at:
/var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20181210111254-zuat48.log

Comment 7 Giovanni 2018-12-12 14:03:34 UTC
As suggested by the oVirt docs, I've tried deploying via Cockpit several times without success, so I went with 'hosted-engine --deploy'. I've followed these instructions:
https://www.ovirt.org/blog/2018/02/up-and-running-with-ovirt-4-2-and-gluster-storage/
https://www.ovirt.org/documentation/how-to/hosted-engine/#restarting-from-a-partially-deployed-system
https://www.ovirt.org/documentation/self-hosted/chap-Deploying_Self-Hosted_Engine/
https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.0/html-single/self-hosted_engine_guide/index

I'm using oVirt 4.3 and couldn't find more up-to-date documentation for it, so I left undocumented questions blank, like these two:
"Please enter the name of the datacenter where you want to deploy this hosted-engine host. [Default]:"
"Please enter the name of the cluster where you want to deploy this hosted-engine host. [Default]:"

I've attached the log /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20181210111254-zuat48.log

Here's what I think are the relevant lines:
2018-12-10 11:28:17,906+0100 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:94 {u'_ansible_parsed': True, u'stderr_lines': [u'error: failed to connect to the hypervisor', u'error: authentication failed: Failed to start SASL negotiation: -1 (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (SPNEGO cannot find mechanisms to negotiate))'], u'cmd': u"virsh -r net-dhcp-leases default | grep -i 00:16:3e:0d:79:b5 | awk '{ print $5 }' | cut -f1 -d'/'", u'end': u'2018-12-10 11:28:17.736325', u'_ansible_no_log': False, u'stdout': u'', u'changed': True, u'invocation': {u'module_args': {u'warn': True, u'executable': None, u'_uses_shell': True, u'_raw_params': u"virsh -r net-dhcp-leases default | grep -i 00:16:3e:0d:79:b5 | awk '{ print $5 }' | cut -f1 -d'/'", u'removes': None, u'argv': None, u'creates': None, u'chdir': None, u'stdin': None}}, u'start': u'2018-12-10 11:28:17.582534', u'attempts': 50, u'stderr': u'error: failed to connect to the hypervisor\nerror: authentication failed: Failed to start SASL negotiation: -1 (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (SPNEGO cannot find mechanisms to negotiate))', u'rc': 0, u'delta': u'0:00:00.153791', u'stdout_lines': []}

2018-12-10 11:28:18,007+0100 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:98 fatal: [localhost]: FAILED! => {"attempts": 50, "changed": true, "cmd": "virsh -r net-dhcp-leases default | grep -i 00:16:3e:0d:79:b5 | awk '{ print $5 }' | cut -f1 -d'/'", "delta": "0:00:00.153791", "end": "2018-12-10 11:28:17.736325", "rc": 0, "start": "2018-12-10 11:28:17.582534", "stderr": "error: failed to connect to the hypervisor\nerror: authentication failed: Failed to start SASL negotiation: -1 (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (SPNEGO cannot find mechanisms to negotiate))", "stderr_lines": ["error: failed to connect to the hypervisor", "error: authentication failed: Failed to start SASL negotiation: -1 (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (SPNEGO cannot find mechanisms to negotiate))"], "stdout": "", "stdout_lines": []}

2018-12-10 11:28:55,305+0100 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:94 {u'msg': u"The task includes an option with an undefined variable. The error was: 'local_vm_disk_path' is undefined\n\nThe error appears to have been in '/usr/share/ovirt-hosted-engine-setup/ansible/fetch_engine_logs.yml': line 16, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n    seconds: 10\n- name: Copy engine logs\n  ^ here\n", u'_ansible_no_log': False}
2018-12-10 11:28:55,406+0100 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:94 ignored: [localhost]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'local_vm_disk_path' is undefined\n\nThe error appears to have been in '/usr/share/ovirt-hosted-engine-setup/ansible/fetch_engine_logs.yml': line 16, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n    seconds: 10\n- name: Copy engine logs\n  ^ here\n"}

Comment 8 Ryan Barry 2018-12-12 14:12:00 UTC
Please try ovirt-hosted-engine-cleanup and redeploy.
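
In practice that means something like the following (a sketch; the log file name and path are only an example):

# wipe the partial hosted-engine configuration from the host
ovirt-hosted-engine-cleanup
# start a fresh deployment and keep a copy of the console output
hosted-engine --deploy 2>&1 | tee /root/he-deploy-$(date +%Y%m%d).log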

Comment 9 Giovanni 2018-12-14 09:03:44 UTC
Created attachment 1514313 [details]
hosted-engine cleanup and deployment logs

Contains:
- cleanup log "hosted-engine-cleanup_20181214.txt"
- deployment log "ovirt-hosted-engine-setup-20181214091554-o7t4h1.log"

Comment 10 Giovanni 2018-12-14 09:04:19 UTC
I've launched ovirt-hosted-engine-cleanup and deployed again without success.

I've attached a new .xz containing 'hosted-engine-cleanup_20181214.txt', which is the output of 'ovirt-hosted-engine-cleanup', and 'ovirt-hosted-engine-setup-20181214091554-o7t4h1.log', which is the log of the deployment done after the cleanup.

Here are the last lines of the failed deployment output:
# hosted-engine --deploy
-- skipped deploy config lines --
-- skipped various '[ INFO  ] TASK/ok' lines --
[ INFO  ] TASK [Remove temporary entry in /etc/hosts for the local VM]
[ INFO  ] ok: [localhost]
[ INFO  ] TASK [Notify the user about a failure]
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The system may not be provisioned according to the playbook results: please check the logs for the issue, fix accordingly or re-deploy from scratch.\n"}
[ ERROR ] Failed to execute stage 'Closing up': Failed executing ansible-playbook
[ INFO  ] Stage: Clean up
[ INFO  ] Cleaning temporary resources
[ INFO  ] TASK [Gathering Facts]
[ INFO  ] ok: [localhost]
[ INFO  ] TASK [Fetch logs from the engine VM]
[ INFO  ] ok: [localhost]
[ INFO  ] TASK [Set destination directory path]
[ INFO  ] ok: [localhost]
[ INFO  ] TASK [Create destination directory]
[ INFO  ] changed: [localhost]
[ INFO  ] TASK [include_tasks]
[ INFO  ] ok: [localhost]
[ INFO  ] TASK [Find the local appliance image]
[ INFO  ] ok: [localhost]
[ INFO  ] TASK [Set local_vm_disk_path]
[ INFO  ] skipping: [localhost]
[ INFO  ] TASK [Give the vm time to flush dirty buffers]
[ INFO  ] ok: [localhost]
[ INFO  ] TASK [Copy engine logs]
[ INFO  ] TASK [include_tasks]
[ INFO  ] ok: [localhost]
[ INFO  ] TASK [Remove local vm dir]
[ INFO  ] ok: [localhost]
[ INFO  ] TASK [Remove temporary entry in /etc/hosts for the local VM]
[ INFO  ] ok: [localhost]
[ INFO  ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20181214093205.conf'
[ INFO  ] Stage: Pre-termination
[ INFO  ] Stage: Termination
[ ERROR ] Hosted Engine deployment failed: please check the logs for the issue, fix accordingly or re-deploy from scratch.
          Log file is located at /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20181214091554-o7t4h1.log

Comment 11 Yuval Turgeman 2018-12-16 16:07:25 UTC
Simone, can you take a look, please ?

Comment 12 Simone Tiraboschi 2019-01-07 08:49:00 UTC
Please notice that manually configuring and starting vdsm before executing hosted-engine --deploy from CLI or from cockpit is not required at all.

Comment 13 Simone Tiraboschi 2019-01-07 08:52:27 UTC
In my opinion this is due to some leftover in the libvirt configuration manually introduced by the user by mistake:

2018-12-14 09:22:34,068+0100 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:100 TASK [Get local VM IP]
2018-12-14 09:31:20,789+0100 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:94 {u'_ansible_parsed': True, u'stderr_lines': [u'error: failed to connect to the hypervisor', u'error: authentication failed: Failed to start SASL negotiation: -1 (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (SPNEGO cannot find mechanisms to negotiate))'], u'cmd': u"virsh -r net-dhcp-leases default | grep -i 00:16:3e:67:b9:3b | awk '{ print $5 }' | cut -f1 -d'/'", u'end': u'2018-12-14 09:31:20.594366', u'_ansible_no_log': False, u'stdout': u'', u'changed': True, u'invocation': {u'module_args': {u'warn': True, u'executable': None, u'_uses_shell': True, u'_raw_params': u"virsh -r net-dhcp-leases default | grep -i 00:16:3e:67:b9:3b | awk '{ print $5 }' | cut -f1 -d'/'", u'removes': None, u'argv': None, u'creates': None, u'chdir': None, u'stdin': None}}, u'start': u'2018-12-14 09:31:20.439674', u'attempts': 50, u'stderr': u'error: failed to connect to the hypervisor\nerror: authentication failed: Failed to start SASL negotiation: -1 (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (SPNEGO cannot find mechanisms to negotiate))', u'rc': 0, u'delta': u'0:00:00.154692', u'stdout_lines': []}

Giovanni, can you please try on a clean host?
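
A few hedged checks for leftover libvirt/SASL configuration of this kind (paths assumed from a default vdsm setup on el7; adjust as needed):

# vdsm normally configures SASL auth for read-write libvirt connections
grep -E '^(auth_unix_rw|auth_unix_ro)' /etc/libvirt/libvirtd.conf
# list the SASL users registered for libvirt (vdsm@ovirt is expected on a vdsm host)
sasldblistusers2 -f /etc/libvirt/passwd.db
# reconfigure libvirt for vdsm from scratch if the above looks wrong
vdsm-tool configure --module libvirt --force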

Comment 14 Sandro Bonazzola 2019-01-21 08:31:16 UTC
re-targeting to 4.3.1 since this BZ has not been proposed as blocker for 4.3.0.
If you think this bug should block 4.3.0 please re-target and set blocker flag.

Comment 15 Giovanni 2019-01-21 11:27:56 UTC
I only came back to work today; I'll install and configure a new machine from scratch as soon as possible. Do you want me to try with 4.3.0 or with 4.3.1?

Comment 16 Giovanni 2019-02-05 10:30:34 UTC
Created attachment 1527096 [details]
20190205 ovirt config + deployment fail log

This .xz contains:
- "sysconfig_20190205.txt"
Configuration/customisations made to the new oVirt node, installed on a new bare-metal machine.
- "ovirt-hosted-engine-setup-20190205094353-ri8go0.log"
Log of the failed deployment

Comment 17 Giovanni 2019-02-05 10:34:40 UTC
Configured a new bare-metal machine, formatted the drive and installed oVirt Node via "ovirt-node-ng-installer-master-el7-2018082007.iso". I know it's outdated, but I wanted to install the same version as the previous machines; should I upgrade and retry?

All I did on the freshly installed system is listed in the attached "sysconfig_20190205.txt".
I did not set up any other filesystem apart from the ones for the oVirt installation, as I was hoping to just use a glusterfs setup on 3 other bare-metal machines (replica 3, arbiter 1).

hosted-engine --deploy fails and the log is "ovirt-hosted-engine-setup-20190205094353-ri8go0.log", attached as well.

These two files (.txt and .log) are contained in "20190205 ovirt config + deployment fail log" submission, attachment "new-ovirt-system_fail-log_20190205.xz".

Comment 18 Yuval Turgeman 2019-02-07 13:10:39 UTC
From the log, it looks like you need to adjust the way your host resolves. Simone, is that right?

2019-02-05 09:52:11,157+0000 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:98 fatal: [localhost]: FAILED! => {"changed": false, "msg": "hostname 'ovirtfour' doesn't uniquely match the interface 'enp4s0' selected for the management bridge; it matches also interface with IP [u'192.168.124.1']. Please make sure that the hostname got from the interface for the management network resolves only there.\n"}

Comment 19 Simone Tiraboschi 2019-02-07 13:24:42 UTC
(In reply to Yuval Turgeman from comment #18)
> From the log, looks like you need to adjust the way your host resolves,
> Simone, is that right ?

Yes, right.
Please ensure, by properly configuring your DNS or by setting /etc/hosts, that 'ovirtfour' resolves only to the address assigned to the interface chosen for the management bridge.
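
A quick way to verify this; the address below is only a placeholder:

# every address printed here must belong to the NIC chosen for the management bridge
getent ahosts ovirtfour
hostname -f
# if DNS cannot be fixed, pin the name in /etc/hosts (example address only)
echo '203.0.113.10 ovirtfour' >> /etc/hosts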

Comment 20 Sandro Bonazzola 2019-02-18 07:57:58 UTC
Moving to 4.3.2 since this has not been identified as a blocker for 4.3.1.

Comment 21 Sandro Bonazzola 2019-02-26 09:27:17 UTC
Giovanni can you please check comment #19?

Comment 22 Giovanni 2019-03-01 15:18:26 UTC
(In reply to Sandro Bonazzola from comment #21)
> Giovanni can you please check comment #19?

/etc/hosts was correctly configured; the conflicting interface match probably came from some misconfiguration in the hosted-engine deployment setup.

Anyway, libvirtd stopped running on test machine #1. I tried for days to restart it without success, despite copying the config file from machines #2 and #3 (which are working). So I'm reinstalling all 3 machines from scratch (I'm leaving #4 as it was, without gluster configured, to test hosted-engine deployment with gluster). I'm using "ovirt-node-ng-installer-4.3.0-2019022810.el7.iso" from https://resources.ovirt.org/pub/ovirt-4.3/iso/ovirt-node-ng-installer/4.3.0-2019022810/el7/

I'll try again on Monday 2019/03/03.

Comment 23 Sandro Bonazzola 2019-03-05 09:16:15 UTC
Giovanni, any update?

Comment 24 Giovanni 2019-03-05 09:45:34 UTC
Thank you for your quick reply Sandro.

I've failed again to deploy, using oVirt Node 4.3.1 as reported before. This time I'm using a nearly definitive setup/config, to reduce clutter and temporary configurations.

I've used the Cockpit interface (as it seems that command-line deployment is deprecated) with provisioned storage (as I already have a working gluster replica 3, arbiter 1). This time, unlike previous ovirt-node versions, no specific log file was named, just a generic "look at the log".
e.g., from the previous version:
"[ ERROR ] Hosted Engine deployment failed: please check the logs for the issue, fix accordingly or re-deploy from scratch.
          Log file is located at /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20181214091554-o7t4h1.log"

Error message from 4.3.1:
"[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The system may not be provisioned according to the playbook results: please check the logs for the issue, fix accordingly or re-deploy from scratch.\n"}"

I think the relevant logs are still located in the same place:
# ls /var/log/ovirt-hosted-engine-setup/
ovirt-hosted-engine-setup-ansible-bootstrap_local_vm-201924145239-btepaa.log
ovirt-hosted-engine-setup-ansible-get_network_interfaces-201924144228-7hyxr5.log
ovirt-hosted-engine-setup-ansible-initial_clean-201924144731-cg540o.log
ovirt-hosted-engine-setup-ansible-validate_hostnames-201924144228-tuf9o0.log

I've searched for ERROR lines but couldn't make sense of them:
# grep -R ERROR /var/log/ovirt-hosted-engine-setup/
ovirt-hosted-engine-setup-ansible-bootstrap_local_vm-201924145239-btepaa.log:2019-03-04 15:07:54,206+0100 ERROR ansible failed {'status': 'FAILED', 'ansible_type': 'task', 'ansible_task': u'Get local VM IP', 'ansible_result': u'type: <type \'dict\'>\nstr: {\'_ansible_parsed\': True, \'stderr_lines\': [u\'Usage: grep [OPTION]... PATTERN [FILE]...\', u"Try \'grep --help\' for more information."], u\'changed\': True, u\'end\': u\'2019-03-04 15:07:53.432081\', \'_ansible_no_log\': False, u\'stdout\': u\'\', u\'cmd\': u"virsh -r net-dhcp-leases default | grep -i  | awk \'{ prin', 'ansible_host': u'localhost', 'ansible_playbook': u'/usr/share/ovirt-hosted-engine-setup/ansible/trigger_role.yml'}
ovirt-hosted-engine-setup-ansible-bootstrap_local_vm-201924145239-btepaa.log:2019-03-04 15:08:03,107+0100 ERROR ansible failed {'status': 'FAILED', 'ansible_type': 'task', 'ansible_task': u'Notify the user about a failure', 'ansible_result': u"type: <type 'dict'>\nstr: {'msg': u'The system may not be provisioned according to the playbook results: please check the logs for the issue, fix accordingly or re-deploy from scratch.\\n', 'changed': False, '_ansible_no_log': False}", 'ansible_host': u'localhost', 'ansible_playbook': u'/usr/share/ovirt-hosted-engine-setup/ansible/trigger_role.yml'}

I've attached the failed deployment output and all 4 log files under '/var/log/ovirt-hosted-engine-setup/'; let me know if I should add anything else.

NOTE: when I write a lengthy post like this and provide attachments as well, should I make them a single post (using Comment section in attachment interface) or should I continue like this and make 2 separate posts?

Comment 25 Yuval Turgeman 2019-03-05 09:50:15 UTC
Perhaps Ido

Comment 26 Giovanni 2019-03-05 09:50:53 UTC
Created attachment 1540879 [details]
ovirt node 4.3.1: failed deployment message/logs

'ovirt-4.3.1_failed-deploy_20190305.xz' contains:
- 'ovirt-he_failed-deploy_20190305.txt' which is a copy-paste from cockpit failed deployment output
- 4 'ovirt-hosted-engine-setup-*.log' which are the log files under '/var/log/ovirt-hosted-engine-setup/'

Comment 27 Ido Rosenzwig 2019-03-24 10:34:06 UTC
At the end of the log file ovirt-4.3.1_failed-deploy_20190305 there is the deployment summary.
In the future you can use it to see all the tasks that were executed and the ones that failed (as in this case).

The error I see is:
[ INFO ] TASK [ovirt.hosted_engine_setup : Get local VM IP]
[ ERROR ] fatal: [localhost]: FAILED! => {"attempts": 50, "changed": true, "cmd": "virsh -r net-dhcp-leases default | grep -i | awk '{ print $5 }' | cut -f1 -d'/'", "delta": "0:00:00.145013", "end": "2019-03-04 15:07:53.432081", "rc": 0, "start": "2019-03-04 15:07:53.287068", "stderr": "Usage: grep [OPTION]... PATTERN [FILE]...\nTry 'grep --help' for more information.", "stderr_lines": ["Usage: grep [OPTION]... PATTERN [FILE]...", "Try 'grep --help' for more information."], "stdout": "", "stdout_lines": []}


The error is on this line:
- name: Get local VM IP
        shell: virsh -r net-dhcp-leases default | grep -i {{ he_vm_mac_addr }} | awk '{ print $5 }' | cut -f1 -d'/'


In the code, if the user doesn't define something specific for the 'he_vm_mac_addr' variable, we generate one for them.

Giovanni, did you perhaps set the 'he_vm_mac_addr' variable in the variables file?
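
For reference, the same check can be run by hand; the MAC below is the one from the log excerpt in comment 7 and is only an example:

# list the DHCP leases handed out on libvirt's 'default' network
virsh -r net-dhcp-leases default
# or filter for the MAC the setup generated for the local engine VM
virsh -r net-dhcp-leases default | grep -i 00:16:3e:0d:79:b5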

Comment 28 Giovanni 2019-03-27 12:24:16 UTC
(In reply to Ido Rosenzwig from comment #27)
> [...]
> 
> The error is on this line:
> - name: Get local VM IP
>         shell: virsh -r net-dhcp-leases default | grep -i {{ he_vm_mac_addr
> }} | awk '{ print $5 }' | cut -f1 -d'/'
> 
> 
> In the code, if the user doesn't define something specific for
> 'he_vm_mac_addr' variable, we generate something for him.
> 
> Giovanni, did you set 'he_vm_mac_addr' variable in the variables file maybe ?

I don't think I've set any 'he_vm_mac_addr'; I haven't set up any variables file, since I'm using the Cockpit web interface to deploy, and I left the MAC Address field as it was, filled with what I think is a randomly generated value, as it changes every time.
I've also tried to deploy after deleting it, hoping that would force a random MAC for sure, but the deployment still failed. Then I tried the deployment again without removing the MAC and noticed it was different from the previous time, so it does appear to be successfully randomly generated.

The only things I set are:
Engine VM FQDN - Which is validated successfully and registered on my DNS
Network Configuration - Static
VM IP Address - Registered on my DNS
DNS Servers - They exist and I'm regularly using them from other machines
Root password
Host FQDN - Which is validated successfully and registered on my DNS

This is what I'm getting:
[...]
[ ERROR ] fatal: [localhost]: FAILED! => {"ansible_facts": {"ovirt_hosts": []}, "attempts": 120, "changed": false}
[...]
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The system may not be provisioned according to the playbook results: please check the logs for the issue, fix accordingly or re-deploy from scratch.\n"}

I'll attach 2 full logs.

Comment 29 Ido Rosenzwig 2019-03-27 13:18:40 UTC
Please attach the logs from your latest tries.

Comment 30 Giovanni 2019-03-27 13:27:40 UTC
Created attachment 1548579 [details]
ovirt node 4.3.1: failed deployment message/logs (20190327)

'ovirt-4.3.1_failed-deploy_20190327.xz' content:

This is the copy-paste of cockpit failed deployment output:
"ovirt-he_failed-deploy_20190327.txt"

This is the folder of logs automatically created under '/var/log/ovirt-hosted-engine-setup':
"engine-logs-2019-03-27T09:26:57Z"

These are the logs in '/var/log/ovirt-hosted-engine-setup'
"ovirt-hosted-engine-setup-ansible-bootstrap_local_vm-2019227102422-gwc3uh.log"
"ovirt-hosted-engine-setup-ansible-get_network_interfaces-2019227101655-z211oc.log"
"ovirt-hosted-engine-setup-ansible-initial_clean-2019227102055-rxuzqo.log"
"ovirt-hosted-engine-setup-ansible-validate_hostnames-2019227101655-bsdk4t.log"
"ovirt-hosted-engine-setup-ansible-validate_hostnames-2019227101950-onq2vd.log"

Comment 31 Ido Rosenzwig 2019-03-28 14:43:32 UTC
Giovanni, are you using a static IP configuration or DHCP?

Comment 32 Giovanni 2019-03-29 11:09:05 UTC
I'm using a static configuration and I have a text file with all the deployment settings ready to be copy-pasted. The static address is already registered on my DNS, and dig gives back the right answer, both forward and reverse.

I've also tried temporarily stopping the firewall, and Cockpit crashes immediately after starting the deployment, due to a JavaScript error.

Comment 33 Simone Tiraboschi 2019-04-01 12:08:15 UTC
The issue comes from here:

2019-03-27 10:46:54,339+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-1) [49de530f] EVENT_ID: USER_VDC_LOGIN(30), User admin@internal-authz connecting from '192.168.124.1' using session 'hZ74u5X9bqqmwEx/DaUhQGUhP7xvLRvUEcUk0CAqiINh9unihAfVS/DQ+tl91S+83flilIVAXGJMrqE5/GV7/g==' logged in.
2019-03-27 10:47:14,593+01 ERROR [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-2) [306c6975-72be-4b09-886d-e3cf3dcc7bdf] Failed to establish session with host 'ovn1.ifac.cnr.it': Failed to get the session.
2019-03-27 10:47:14,595+01 WARN  [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-2) [306c6975-72be-4b09-886d-e3cf3dcc7bdf] Validation of action 'AddVds' failed for user admin@internal-authz. Reasons: VAR__ACTION__ADD,VAR__TYPE__HOST,$server ovn1.ifac.cnr.it,VDS_CANNOT_CONNECT_TO_SERVER
2019-03-27 10:47:14,642+01 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-2) [] Operation Failed: [Cannot add Host. Connecting to host via SSH has failed, verify that the host is reachable (IP address, routable address etc.) You may refer to the engine.log file for further details.]

Giovanni, do you have something custom in /etc/hosts.allow and /etc/hosts.deny?

Comment 34 Giovanni 2019-04-02 09:12:26 UTC
(In reply to Simone Tiraboschi from comment #33)
> The issue comes from here:
> 
>[...]
> 
> Giovanni, do you have something custom in /etc/hosts.allow and
> /etc/hosts.deny?

I've checked and they're empty:

'cat /etc/hosts.allow'
#
# hosts.allow	This file contains access rules which are used to
#		allow or deny connections to network services that
#		either use the tcp_wrappers library or that have been
#		started through a tcp_wrappers-enabled xinetd.
#
#		See 'man 5 hosts_options' and 'man 5 hosts_access'
#		for information on rule syntax.
#		See 'man tcpd' for information on tcp_wrappers
#

'cat /etc/hosts.deny'
#
# hosts.deny	This file contains access rules which are used to
#		deny connections to network services that either use
#		the tcp_wrappers library or that have been
#		started through a tcp_wrappers-enabled xinetd.
#
#		The rules in this file can also be set up in
#		/etc/hosts.allow with a 'deny' option instead.
#
#		See 'man 5 hosts_options' and 'man 5 hosts_access'
#		for information on rule syntax.
#		See 'man tcpd' for information on tcp_wrappers
#

I've reset firewalld to defaults by deleting everything but "firewalld.conf" in "/etc/firewalld/". I've created 2 custom XML-based services to add all the ports reported here: https://www.ovirt.org/documentation/install-guide/chap-System_Requirements.html
You'll find these services in "custom_firewall-1649268.txt", which is inside "ovirt-4.3.1_failed-deploy_20190402.xz".

I've added those services via (of course I used the correct paths and actual names):
firewall-offline-cmd --new-service-from-file=/path/service.xml --name=custom-service
firewall-cmd --reload
firewall-cmd --zone=public --add-service custom-service
firewall-cmd --runtime-to-permanent

Both services appear under the public zone.
'ss -tulpn' shows various open ports but not all of them; e.g. 161/udp is missing, I'd say because there's no active process using it.

Port 22 is open.
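
As an illustration, a custom firewalld service of the kind described above can be created from an XML snippet like the following; the ports shown are only a small, hypothetical subset of the list in the linked requirements page, and the real definitions are in the attached custom firewall .txt file:

cat > /tmp/ovirt-custom.xml <<'EOF'
<?xml version="1.0" encoding="utf-8"?>
<service>
  <short>ovirt-custom</short>
  <description>Example subset of oVirt host ports</description>
  <port protocol="tcp" port="54321"/> <!-- vdsm -->
  <port protocol="tcp" port="16514"/> <!-- libvirt TLS -->
  <port protocol="tcp" port="2223"/>  <!-- engine serial console -->
</service>
EOF
firewall-offline-cmd --new-service-from-file=/tmp/ovirt-custom.xml --name=ovirt-custom
firewall-cmd --reload
firewall-cmd --zone=public --add-service=ovirt-custom
firewall-cmd --runtime-to-permanent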

Comment 35 Giovanni 2019-04-02 09:19:29 UTC
Created attachment 1550925 [details]
ovirt node 4.3.1: failed deployment message/logs (20190402)

"ovirt-4.3.1_failed-deploy_20190402.xz" content:

This is the copy-paste of cockpit failed deployment output:
"ovirt-he_failed-deploy_20190402.txt"


This is the folder of logs automatically created under "/var/log/ovirt-hosted-engine-setup":
"engine-logs-2019-04-02T07.41.59Z"

These are the logs in '/var/log/ovirt-hosted-engine-setup'
"ovirt-hosted-engine-setup-ansible-bootstrap_local_vm-20193293828-heg3xy.log"
"ovirt-hosted-engine-setup-ansible-get_network_interfaces-2019329338-mhv5o0.log"
"ovirt-hosted-engine-setup-ansible-initial_clean-2019329357-qvowsu.log"
"ovirt-hosted-engine-setup-ansible-validate_hostnames-2019329338-20j2ew.log"

This contains the xml I've created:
"custom_firewal-1649268l.txt"

Comment 36 Simone Tiraboschi 2019-04-02 09:38:40 UTC
Giovanni,
can you please attach also /var/log/vdsm/vdsm.log and /var/log/vdsm/supervdsm.log from your host?

Comment 37 Giovanni 2019-04-02 10:46:49 UTC
Created attachment 1550950 [details]
ovirt node 4.3.1: failed deployment message/logs (20190402)

"ovirt-4.3.1_failed-deploy_20190402.xz" content:

This is the copy-paste of cockpit failed deployment output:
"ovirt-he_failed-deploy_20190402.txt"

This is the folder of logs automatically created under "/var/log/ovirt-hosted-engine-setup":
"engine-logs-2019-04-02T07.41.59Z"

These are the logs in '/var/log/ovirt-hosted-engine-setup' related to last deployment attempt:
"ovirt-hosted-engine-setup-ansible-initial_clean-201932122532-0tsrch.log"

This contains the 2 xmls I've created:
"custom_firewal-1649268l.txt"

Comment 38 Simone Tiraboschi 2019-04-02 11:37:48 UTC
Giovanni,
can you please attach also /var/log/vdsm/vdsm.log and /var/log/vdsm/supervdsm.log from your host?

The engine log now stops at:
2019-04-02 10:06:57,765+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HostSetupNetworksVDSCommand] (EE-ManagedThreadFactory-engine-Thread-1) [5d5ed4ab] START, HostSetupNetworksVDSCommand(HostName = ovn1.ifac.cnr.it, HostSetupNetworksVdsCommandParameters:{hostId='1a5c640c-e932-4570-9d06-25f54c6faa6c', vds='Host[ovn1.ifac.cnr.it,1a5c640c-e932-4570-9d06-25f54c6faa6c]', rollbackOnFailure='true', commitOnSuccess='false', connectivityTimeout='120', networks='[HostNetwork:{defaultRoute='true', bonding='false', networkName='ovirtmgmt', vdsmName='ovirtmgmt', nicName='eno1', vlan='null', vmNetwork='true', stp='false', properties='null', ipv4BootProtocol='STATIC_IP', ipv4Address='149.139.32.240', ipv4Netmask='255.255.252.0', ipv4Gateway='149.139.32.1', ipv6BootProtocol='AUTOCONF', ipv6Address='null', ipv6Prefix='null', ipv6Gateway='null', nameServers='null'}]', removedNetworks='[]', bonds='[]', removedBonds='[]', clusterSwitchType='LEGACY', managementNetworkChanged='true'}), log id: 671869c9
2019-04-02 10:06:57,772+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HostSetupNetworksVDSCommand] (EE-ManagedThreadFactory-engine-Thread-1) [5d5ed4ab] FINISH, HostSetupNetworksVDSCommand, return: , log id: 671869c9

So the engine successfully managed to communicate with the host over SSH, but I fear that something bad happened when it tried to create the management bridge.

Comment 39 Giovanni 2019-04-02 11:49:59 UTC
(In reply to Simone Tiraboschi from comment #38)
> Giovanni,
> can you please attach also /var/log/vdsm/vdsm.log and
> /var/log/vdsm/supervdsm.log from your host?
> 
> engine logs now stops on:
> 2019-04-02 10:06:57,765+02 INFO 
> [org.ovirt.engine.core.vdsbroker.vdsbroker.HostSetupNetworksVDSCommand]
> (EE-ManagedThreadFactory-engine-Thread-1) [5d5ed4ab] START,
> HostSetupNetworksVDSCommand(HostName = ovn1.ifac.cnr.it,
> HostSetupNetworksVdsCommandParameters:{hostId='1a5c640c-e932-4570-9d06-
> 25f54c6faa6c',
> vds='Host[ovn1.ifac.cnr.it,1a5c640c-e932-4570-9d06-25f54c6faa6c]',
> rollbackOnFailure='true', commitOnSuccess='false',
> connectivityTimeout='120', networks='[HostNetwork:{defaultRoute='true',
> bonding='false', networkName='ovirtmgmt', vdsmName='ovirtmgmt',
> nicName='eno1', vlan='null', vmNetwork='true', stp='false',
> properties='null', ipv4BootProtocol='STATIC_IP',
> ipv4Address='149.139.32.240', ipv4Netmask='255.255.252.0',
> ipv4Gateway='149.139.32.1', ipv6BootProtocol='AUTOCONF', ipv6Address='null',
> ipv6Prefix='null', ipv6Gateway='null', nameServers='null'}]',
> removedNetworks='[]', bonds='[]', removedBonds='[]',
> clusterSwitchType='LEGACY', managementNetworkChanged='true'}), log id:
> 671869c9
> 2019-04-02 10:06:57,772+02 INFO 
> [org.ovirt.engine.core.vdsbroker.vdsbroker.HostSetupNetworksVDSCommand]
> (EE-ManagedThreadFactory-engine-Thread-1) [5d5ed4ab] FINISH,
> HostSetupNetworksVDSCommand, return: , log id: 671869c9
> 
> so the engine successfully managed to communicate with the host over ssh,
> but I fear that something bad happened when it tried to create the
> management bridge.

I've attached vdsm-logs.xz as requested.

After the deployment failed due to SSH connection issues, I'm now again getting address resolution issues: "[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The resolved address doesn't resolve on the selected interface\n"}"

I don't understand what "on the selected interface" means.

Querying via 'dig ovn1.ifac.cnr.it', 'dig -x 149.139.32.240', 'dig ovirt-engine.ifac.cnr.it' and 'dig -x 149.139.32.70', I always receive answers and they seem to be correct. I've also tried using other third-party DNS services.

'ip link show' shows that ovirtmgmt is associated with the eno1 interface:
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovirtmgmt state UP mode DEFAULT group default qlen 1000

Comment 40 Giovanni 2019-04-02 11:51:06 UTC
Created attachment 1550993 [details]
vdsm.log + supervdsm.log

This .xz archive contains:
"/var/log/vdsm/vdsm.log"
"/var/log/vdsm/supervdsm.log"

Comment 41 Simone Tiraboschi 2019-04-02 13:55:26 UTC
(In reply to Giovanni from comment #39)
> I don't understand what does "on the selected interface" mean.

From your logs I see that in your environment ovn1 got resolved as
fe80::fc16:3eff:fe12:9007 STREAM ovn1

(try 'getent ahosts ovn1 | grep ovn1')

while on ovirtmgmt you have fe80::222:19ff:fe50:55e1

hence the error, because fe80::fc16:3eff:fe12:9007 != fe80::222:19ff:fe50:55e1

I'd suggest fixing IPv6 name resolution or passing the --4 option to force the setup to ignore IPv6.
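
A minimal check for the mismatch described above (names and addresses are the ones from this report):

# every address printed here must belong to the ovirtmgmt bridge
getent ahosts ovn1 | grep ovn1
ip -6 addr show dev ovirtmgmt
ip -4 addr show dev ovirtmgmt
# alternatively, restrict the whole deployment to IPv4
hosted-engine --deploy --4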

Comment 42 Giovanni 2019-04-03 09:16:51 UTC
(In reply to Simone Tiraboschi from comment #41)
> (In reply to Giovanni from comment #39)
> > I don't understand what does "on the selected interface" mean.
> 
> From your logs I see that on your env ovn1 got resolved as 
> fe80::fc16:3eff:fe12:9007 STREAM ovn1
> 
> (try 'getent ahosts ovn1 | grep ovn1')
> 
> while on ovirtmgmt you have fe80::222:19ff:fe50:55e1 
> 
> and so that error because  fe80::fc16:3eff:fe12:9007 !=
> fe80::222:19ff:fe50:55e1 
> 
> I'd suggest to fix IPv6 name resolution or passing --4 option to force the
> setup to ignore IPv6.

I've run 'hosted-engine --deploy --4' and it failed again; I've attached both failure logs.

First time with "2019-04-03 10:03:00,859+0200 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:107 AuthError: Error during SSO authentication access_denied : Cannot authenticate user 'None@N/A': No valid profile found in credentials..
2019-04-03 10:03:00,960+0200 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:107 fatal: [localhost]: FAILED! => {"changed": false, "msg": "Error during SSO authentication access_denied : Cannot authenticate user 'None@N/A': No valid profile found in credentials.."}"

So I've tried again specifying ssh RSA public key and got this error: "2019-04-03 10:29:04,453+0200 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:107 fatal: [localhost]: FAILED! => {"changed": false, "msg": "hostname 'ovn1' doesn't uniquely match the interface 'eno1' selected for the management bridge; it matches also interface with IP [u'192.168.124.1']. Please make sure that the hostname got from the interface for the management network resolves only there.\n"}
2019-04-03 10:29:05,356+0200 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:103 PLAY RECAP [localhost] : ok: 82 changed: 17 unreachable: 0 skipped: 25 failed: 1"

I really don't understand why that error about a commonly matched interface comes out only after specifying the SSH RSA key, or why it comes out at all. I've checked via 'ip addr' and 192.168.124.1 is associated with virbr0; is that just the default virtual bridge provided by libvirt? Should I just disable/remove it?
24: virbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 52:54:00:d4:f9:e3 brd ff:ff:ff:ff:ff:ff
    inet 192.168.124.1/24 brd 192.168.124.255 scope global virbr0
       valid_lft forever preferred_lft forever

Also of note: using the command line, I've found no way to specify the host FQDN, and I get this warning:
[ INFO  ] Stage: Setup validation
[WARNING] Host name ovn1 has no domain suffix
[WARNING] Failed to resolve ovn1 using DNS, it can be resolved only locally
[ INFO  ] Stage: Transaction setup
[ INFO  ] Stage: Misc configuration (early)

Is there any way to either:
1) specify the host FQDN on the command line, or
2) force IPv4 (--4) in the Cockpit interface?

Comment 43 Giovanni 2019-04-03 09:20:56 UTC
Created attachment 1551292 [details]
ovirt node 4.3.1: failed deployment logs (20190403)

"ovirt-4.3.1_he_failed-deploy_20190403.xz" content:

This is the first command line failed deployment log, about SSO authentication:
"ovirt-hosted-engine-setup-20190403093234-jcwo0h.log"


This is the second command line failed deployment log, about commonly matched interface:
"ovirt-hosted-engine-setup-20190403101921-bleucc.log"

Comment 44 Giovanni 2019-04-03 09:32:03 UTC
I forgot: what if I disable IPv6 entirely in /etc/sysctl.conf? Would that stop the oVirt deployment from complaining about different addresses for the host NIC and the bridge?

Comment 45 Simone Tiraboschi 2019-04-03 09:53:50 UTC
(In reply to Giovanni from comment #42)
> Is there any way to either:
> 1) Specify host FQDN in command line.

We fixed this as part of https://bugzilla.redhat.com/1692460 by adding an interactive question.

(In reply to Giovanni from comment #44)
> I forgot: what if I disable IPv6 entirely in /etc/sysctl.conf ? Would that
> stop ovirt deployment complaining about different addresses for host NIC and
> bridge?

2019-04-03 10:28:59,842+0200 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:103 TASK [ovirt.hosted_engine_setup : debug]
2019-04-03 10:29:00,744+0200 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:103 hostname_res_count_output: {'stderr_lines': [], u'changed': True, u'end': u'2019-04-03 10:28:58.745997', u'stdout': u'149.139.32.240\n192.168.124.1', u'cmd': u"getent ahostsv4 ovn1 | cut -d' ' -f1 | uniq", 'failed': False, u'delta': u'0:00:00.009427', u'stderr': u'', u'rc': 0, 'stdout_lines': [u'149.139.32.240', u'192.168.124.1'], u'start': u'2019-04-03 10:28:58.736570'}

In your environment 'ovn1' resolves to 149.139.32.240 and to 192.168.124.1, while we require it to resolve to only one interface.
The quickest workaround is to add a line to /etc/hosts on your host with
149.139.32.240 ovn1
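
In other words (using the address and hostname from the log above):

echo '149.139.32.240 ovn1' >> /etc/hosts
# this should now print a single address, on the management interface
getent ahostsv4 ovn1 | cut -d' ' -f1 | uniq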

Comment 46 Giovanni 2019-04-03 11:36:36 UTC
(In reply to Simone Tiraboschi from comment #45)
> [...]
> The quickest workaround is about adding a line on /etc/hosts on your host
> with
> 149.139.32.240 ovn1

This indeed helped; this time it lasted 1 hour before failing, and for a different issue:
"2019-04-03 13:19:44,425+0200 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:107 fatal: [localhost]: FAILED! => {"changed": false, "msg": "The host has been set in non_operational status, please check engine logs, fix accordingly and re-deploy.\n"}"

I've attached the relevant log and messages again.

Comment 47 Giovanni 2019-04-03 11:41:48 UTC
Created attachment 1551338 [details]
ovirt node 4.3.1: failed deployment logs (20190403T10)

Created attachment 1551292 [details]
ovirt node 4.3.1: failed deployment logs (20190403T10)

"ovirt-4.3.1_he_failed-deploy_20190403.xz" content:

This is the command line failed deployment log:
"ovirt-hosted-engine-setup-20190403093234-jcwo0h.log"


This contains the folder under "/var/log/ovirt-hosted-engine-setup/engine-logs-2019-04-03T10:13:02Z/"
"engine-logs-2019-04-03T10.13.02Z"

Comment 48 Simone Tiraboschi 2019-04-03 11:46:30 UTC
The issue now is here:
2019-04-03 13:19:31,985+02 INFO  [org.ovirt.engine.core.bll.HandleVdsCpuFlagsOrClusterChangedCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-54) [e983d3e] Running command: HandleVdsCpuFlagsOrClusterChangedCommand internal: true. Entities affected :  ID: 9667fe3a-763f-4dda-9b55-7c43d0ec9396 Type: VDS
2019-04-03 13:19:31,991+02 ERROR [org.ovirt.engine.core.bll.HandleVdsCpuFlagsOrClusterChangedCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-54) [e983d3e] Could not find server cpu for server 'ovn1' (9667fe3a-763f-4dda-9b55-7c43d0ec9396), flags: 'fpu,vme,de,pse,tsc,msr,pae,mce,cx8,apic,sep,mtrr,pge,mca,cmov,pat,pse36,clflush,dts,acpi,mmx,fxsr,sse,sse2,ss,ht,tm,pbe,syscall,nx,lm,constant_tsc,arch_perfmon,pebs,bts,rep_good,nopl,aperfmperf,eagerfpu,pni,dtes64,monitor,ds_cpl,vmx,est,tm2,ssse3,cx16,xtpr,pdcm,dca,sse4_1,xsave,lahf_lm,tpr_shadow,vnmi,flexpriority,dtherm,model_Opteron_G2,model_kvm32,model_coreduo,model_Conroe,model_Opteron_G1,model_core2duo,model_qemu32,model_Penryn,model_pentium2,model_pentium3,model_qemu64,model_kvm64,model_pentium,model_486'
2019-04-03 13:19:32,068+02 INFO  [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-54) [7d446d75] Running command: SetNonOperationalVdsCommand internal: true. Entities affected :  ID: 9667fe3a-763f-4dda-9b55-7c43d0ec9396 Type: VDS
2019-04-03 13:19:32,075+02 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-54) [7d446d75] START, SetVdsStatusVDSCommand(HostName = ovn1, SetVdsStatusVDSCommandParameters:{hostId='9667fe3a-763f-4dda-9b55-7c43d0ec9396', status='NonOperational', nonOperationalReason='CPU_TYPE_INCOMPATIBLE_WITH_CLUSTER', stopSpmFailureLogged='false', maintenanceReason='null'}), log id: 1545a674

Giovanni, what kind of CPU are you using on that host?

Comment 49 Simone Tiraboschi 2019-04-03 14:54:46 UTC
OK, it's an Intel(R) Xeon(R) CPU E5410, which is based on the Penryn microarchitecture.
Please note that oVirt 4.3 removed support for Conroe and Penryn CPUs; see https://bugzilla.redhat.com/1540921

No way to deploy oVirt 4.3 on that hardware.
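
A hedged way to check the detected CPU model before attempting a deployment (read-only virsh access is what the setup itself uses, so it should work on a vdsm host):

lscpu | grep 'Model name'
# the first <model> element in the capabilities XML is the host CPU model libvirt detects
virsh -r capabilities | grep -m1 '<model>'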

Comment 50 Giovanni 2019-04-05 07:01:38 UTC
OK, I thought it was still supported: https://www.ovirt.org/documentation/install-guide/chap-System_Requirements.html

Hypervisor Requirements
CPU Requirements

All CPUs must have support for the Intel® 64 or AMD64 CPU extensions, and the AMD-V™ or Intel VT® hardware virtualization extensions enabled. Support for the No eXecute flag (NX) is also required.

The following CPU models are supported:

    AMD
        [...]
    Intel
        [...]
        Penryn
        [...]

Where can I find the CPU support list for 4.2.8 (or whichever is the latest 4.2 version)?

Comment 51 Simone Tiraboschi 2019-04-05 07:32:29 UTC
(In reply to Giovanni from comment #50)
> Where can I find CPU support list for 4.2.8 (or whichever is the latest 4.2
> version)?

Yes, that document is a bit outdated and the list still refers to 4.2.z.
So it will work with 4.2.z, but please note that you will never be able to upgrade to 4.3 on that hardware.

Comment 52 Giovanni 2019-04-05 07:40:48 UTC
OK, no problem; these machines are just old hardware lying around that I use for testing before planning the real deployment. Thank you very much for the support.

Comment 53 dearfriend 2019-04-05 08:02:25 UTC
I have a newer CPU and the same errors in the logs.

"hosted-engine --deploy --restore-from-file=/..." on fresh installed hosts cannot complete.  v.4.2.8

ERROR ansible failed {'status': 'FAILED', 'ansible_type': 'task', 'ansible_task': u'Copy engine logs', 'ansible_result': u'type: <type \'dict\'>\nstr: {\'msg\': u"The task includes an option with an undefined variable. The error was: \'local_vm_disk_path\' is undefined\\n\\nThe error appears to have been in \'/usr/share/ovirt-hosted-engine-setup/ansible/fetch_engine_logs.yml\': line 16, column 3, but may\\nbe elsewhere in the file depending on the exact sy', 'ansible_host': u'localhost', 'ansible_playbook': u'/usr/share/ovirt-hosted-engine-setup/ansible/final_clean.yml'}


I'm trying to use "noansible", but this option is only for a new install, not a restore.

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                12
On-line CPU(s) list:   0-11
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             12
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 94
Model name:            Intel Core Processor (Skylake)
Stepping:              3
CPU MHz:               2294.608
BogoMIPS:              4589.21
Virtualization:        VT-x
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
L3 cache:              16384K
NUMA node0 CPU(s):     0-11
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch tpr_shadow vnmi flexpriority ept vpid fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap xsaveopt xsavec xgetbv1 arat

Comment 54 Simone Tiraboschi 2019-04-05 08:10:53 UTC
(In reply to dearfriend from comment #53)
> I have newer CPU and same errors in logs
> 
> "hosted-engine --deploy --restore-from-file=/..." on fresh installed hosts
> cannot complete.  v.4.2.8
> 
> ERROR ansible failed {'status': 'FAILED', 'ansible_type': 'task',
> 'ansible_task': u'Copy engine logs', 'ansible_result': u'type: <type
> \'dict\'>\nstr: {\'msg\': u"The task includes an option with an undefined
> variable. The error was: \'local_vm_disk_path\' is undefined\\n\\nThe error
> appears to have been in
> \'/usr/share/ovirt-hosted-engine-setup/ansible/fetch_engine_logs.yml\': line
> 16, column 3, but may\\nbe elsewhere in the file depending on the exact sy',
> 'ansible_host': u'localhost', 'ansible_playbook':
> u'/usr/share/ovirt-hosted-engine-setup/ansible/final_clean.yml'}

This is just something in the cleanup procedure; the real error is a few lines above that.

Comment 55 dearfriend 2019-04-05 08:18:07 UTC
ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:98 fatal: [localhost]: FAILED! => {"ansible_facts": {"ovirt_hosts": []}, "attempts": 120, "changed": false}

I can see the engine is running on 192.168.122.181, but not on the external IP on the ovirtmgmt network with the associated DNS name.

Comment 56 Simone Tiraboschi 2019-04-05 08:34:35 UTC
(In reply to dearfriend from comment #55)
> ERROR otopi.ovirt_hosted_engine_setup.ansible_utils
> ansible_utils._process_output:98 fatal: [localhost]: FAILED! =>
> {"ansible_facts": {"ovirt_hosts": []}, "attempts": 120, "changed": false}

This means that the engine failed to deploy the host: I'd suggest connecting to the bootstrap engine VM and checking engine.log and the host-deploy logs there.

> I can see engine is running on 192.168.122.181. But not external IP on
> ovirtmgmt network with associated dns name.

This is absolutely fine at that stage.
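
A sketch of that check, with the address from this comment standing in as an example:

# find the bootstrap engine VM's address on libvirt's 'default' network
virsh -r net-dhcp-leases default
# log in to the bootstrap VM (the root password is the one given during deploy)
ssh root@192.168.122.181
# on the engine VM, these are the logs worth checking
less /var/log/ovirt-engine/engine.log
ls /var/log/ovirt-engine/host-deploy/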

