Bug 1779527 - During hosted engine deploy, vdsm log has: "Failed to connect to guest agent channel"
Summary: During hosted engine deploy, vdsm log has: "Failed to connect to guest agent channel"
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: 4.40.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ovirt-4.4.2
Target Release: 4.40.24
Assignee: Liran Rotenberg
QA Contact: Wei Wang
URL:
Whiteboard:
Depends On: 1785272
Blocks:
 
Reported: 2019-12-04 07:00 UTC by Wei Wang
Modified: 2020-09-18 07:13 UTC (History)
16 users

Fixed In Version: vdsm-4.40.24
Clone Of:
Environment:
Last Closed: 2020-09-18 07:13:27 UTC
oVirt Team: Virt
Embargoed:
mtessun: ovirt-4.4+
mtessun: planning_ack+
ahadas: devel_ack+
weiwang: testing_ack+


Attachments
log files (2.26 MB, application/gzip)
2019-12-04 07:00 UTC, Wei Wang
no flags Details
/var/log files (1.98 MB, application/gzip)
2019-12-26 04:26 UTC, Wei Wang
no flags Details
journalctl log (633.63 KB, text/plain)
2019-12-26 04:27 UTC, Wei Wang
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 108069 0 master ABANDONED svdsm: channel prepare: missing oga is fine on el8 2020-12-17 12:33:18 UTC
oVirt gerrit 110184 0 master MERGED virt: do not error when ovirt-guest-agent channel is not configured 2020-12-17 12:33:18 UTC

Description Wei Wang 2019-12-04 07:00:27 UTC
Created attachment 1641912 [details]
log files

Description of problem:
Hosted engine deploy failed with "Failed to connect to guest agent channel" when creating the target VM.

ovirt-hosted-engine-setup-ansible-create_target_vm.log
2019-12-04 14:22:27,176+0800 ERROR ansible failed {'status': 'FAILED', 'ansible_type': 'task', 'ansible_playbook': '/usr/share/ovirt-hosted-engine-setup/ansible/trigger_role.yml', 'ansible_host': 'localhost', 'ansible_task': 'Check OVF_STORE volume status', 'ansible_result': 'type: <class \'dict\'>\nstr: {\'results\': [{\'cmd\': [\'vdsm-client\', \'Volume\', \'getInfo\', \'storagepoolID=3d184f44-165b-11ea-8453-5254003404b0\', \'storagedomainID=f3489f19-12d7-4215-8025-22c714ced863\', \'imageID=ccf5fc3b-83b8-4b11-b8ef-4bdfb25d943e\', \'volumeID=ea4c4547-2368-4f4a-aca5-8db9c002b787\'], \'stdout\': \'{\\n    "apparentsize": ', 'task_duration': 261}
2019-12-04 14:22:27,176+0800 DEBUG ansible on_any args <ansible.executor.task_result.TaskResult object at 0x7f171838eb70> kwargs ignore_errors:None
2019-12-04 14:22:27,178+0800 INFO ansible stats {
    "ansible_playbook": "/usr/share/ovirt-hosted-engine-setup/ansible/trigger_role.yml",
    "ansible_playbook_duration": "07:40 Minutes",
    "ansible_result": "type: <class 'dict'>\nstr: {'localhost': {'ok': 76, 'failures': 1, 'unreachable': 0, 'changed': 21, 'skipped': 10, 'rescued': 0, 'ignored': 0}}",
    "ansible_type": "finish",
    "status": "FAILED"
}
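
For reference, the failing 'Check OVF_STORE volume status' task shells out to vdsm-client, so the same query can be re-run by hand on the host. A minimal diagnostic sketch, wrapping the exact command and UUIDs captured in the failed task above:

import subprocess

# Re-run the OVF_STORE volume check the playbook performed; the UUIDs are
# the ones captured in the failed 'Check OVF_STORE volume status' task.
cmd = [
    "vdsm-client", "Volume", "getInfo",
    "storagepoolID=3d184f44-165b-11ea-8453-5254003404b0",
    "storagedomainID=f3489f19-12d7-4215-8025-22c714ced863",
    "imageID=ccf5fc3b-83b8-4b11-b8ef-4bdfb25d943e",
    "volumeID=ea4c4547-2368-4f4a-aca5-8db9c002b787",
]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout or result.stderr)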

vdsm.log
2019-12-04 14:04:38,489+0800 ERROR (vm/0c11a23b) [virt.vm] (vmId='0c11a23b-bc3a-4b97-9368-296c234e6611') Failed to connect to guest agent channel (vm:2252)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 2250, in _vmDependentInit
    self.guestAgent.start()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/guestagent.py", line 257, in start
    self._prepare_socket()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/guestagent.py", line 299, in _prepare_socket
    supervdsm.getProxy().prepareVmChannel(self._socketName)
  File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py", line 56, in __call__
    return callMethod()
  File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py", line 54, in <lambda>
    **kwargs)
  File "<string>", line 2, in prepareVmChannel
  File "/usr/lib64/python3.6/multiprocessing/managers.py", line 772, in _callmethod
    raise convert_to_error(kind, result)
FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/libvirt/qemu/channels/0c11a23b-bc3a-4b97-9368-296c234e6611.com.redhat.rhevm.vdsm'
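
The error above just means the per-VM channel socket was absent when supervdsm tried to stat it. A quick hand check on the host (a hedged sketch; the vmId is the one from the traceback):

import os

# Check whether the guest agent channel socket exists for the VM from the
# traceback above; a missing file reproduces the FileNotFoundError condition.
vm_id = "0c11a23b-bc3a-4b97-9368-296c234e6611"
sock = "/var/lib/libvirt/qemu/channels/%s.com.redhat.rhevm.vdsm" % vm_id
print(sock, "->", "exists" if os.path.exists(sock) else "missing")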



Version-Release number of selected component (if applicable):
RHVH-4.4-20191201.7-RHVH-x86_64-dvd1.iso
cockpit-packagekit-197.3-1.el8.noarch
cockpit-196.3-1.el8.x86_64
cockpit-system-196.3-1.el8.noarch
cockpit-dashboard-197.3-1.el8.noarch
cockpit-storaged-197.3-1.el8.noarch
cockpit-bridge-196.3-1.el8.x86_64
subscription-manager-cockpit-1.25.17-1.el8.noarch
cockpit-ws-196.3-1.el8.x86_64
cockpit-ovirt-dashboard-0.14.0-1.el8ev.noarch
ovirt-hosted-engine-setup-2.4.0-1.el8ev.noarch
ovirt-hosted-engine-ha-2.4.0-1.el8ev.noarch
rhvm-appliance-4.4-20191202.1.el8ev.x86_64


How reproducible:
100%

Steps to Reproduce:
1. Deploy hosted engine via cockpit UI
2.
3.

Actual results:
Hosted engine deploy failed with "Failed to connect to guest agent channel" when creating the target VM.

Expected results:
Hosted engine deploys successfully.

Additional info:

Comment 1 Sandro Bonazzola 2019-12-05 13:25:03 UTC
Note that the appliance has no ovirt-guest-agent, only qemu-guest-agent.
The test here seems to come from vdsm; if that's the case, this needs to move to vdsm.

Comment 2 Yedidyah Bar David 2019-12-18 08:34:39 UTC
(In reply to Wei Wang from comment #0)
> Created attachment 1641912 [details]
> log files
> 
> Description of problem:
> Hosted engine deploys failed as "Failed to connect to guest agent channel"
> when creating target vm.
> 
> ovirt-hosted-engine-setup-ansible-create_target_vm.log
> 2019-12-04 14:22:27,176+0800 ERROR ansible failed {'status': 'FAILED',
> 'ansible_type': 'task', 'ansible_playbook':
> '/usr/share/ovirt-hosted-engine-setup/ansible/trigger_role.yml',
> 'ansible_host': 'localhost', 'ansible_task': 'Check OVF_STORE volume
> status', 'ansible_result': 'type: <class \'dict\'>\nstr: {\'results\':
> [{\'cmd\': [\'vdsm-client\', \'Volume\', \'getInfo\',
> \'storagepoolID=3d184f44-165b-11ea-8453-5254003404b0\',
> \'storagedomainID=f3489f19-12d7-4215-8025-22c714ced863\',
> \'imageID=ccf5fc3b-83b8-4b11-b8ef-4bdfb25d943e\',
> \'volumeID=ea4c4547-2368-4f4a-aca5-8db9c002b787\'], \'stdout\': \'{\\n   
> "apparentsize": ', 'task_duration': 261}
> 2019-12-04 14:22:27,176+0800 DEBUG ansible on_any args
> <ansible.executor.task_result.TaskResult object at 0x7f171838eb70> kwargs
> ignore_errors:None
> 2019-12-04 14:22:27,178+0800 INFO ansible stats {
>     "ansible_playbook":
> "/usr/share/ovirt-hosted-engine-setup/ansible/trigger_role.yml",
>     "ansible_playbook_duration": "07:40 Minutes",
>     "ansible_result": "type: <class 'dict'>\nstr: {'localhost': {'ok': 76,
> 'failures': 1, 'unreachable': 0, 'changed': 21, 'skipped': 10, 'rescued': 0,
> 'ignored': 0}}",
>     "ansible_type": "finish",
>     "status": "FAILED"
> }

I am pretty certain that this was caused by failure to update OVF store, bug 1779085.
Please check engine.log - I have there errors like:

2019-12-18 08:50:44,855+02 ERROR [org.ovirt.engine.core.bll.storage.ovfstore.UploadStreamCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-79) [46e81c97] Command 'org.ovirt.engine.core.bll.storage.ovfstore.UploadStreamCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: response was missing the following header: Task-Id (Failed with error GeneralException and code 100)

> 
> vdsm.log
> 2019-12-04 14:04:38,489+0800 ERROR (vm/0c11a23b) [virt.vm]
> (vmId='0c11a23b-bc3a-4b97-9368-296c234e6611') Failed to connect to guest
> agent channel (vm:2252)
> Traceback (most recent call last):
>   File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 2250, in
> _vmDependentInit
>     self.guestAgent.start()
>   File "/usr/lib/python3.6/site-packages/vdsm/virt/guestagent.py", line 257,
> in start
>     self._prepare_socket()
>   File "/usr/lib/python3.6/site-packages/vdsm/virt/guestagent.py", line 299,
> in _prepare_socket
>     supervdsm.getProxy().prepareVmChannel(self._socketName)
>   File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py", line 56,
> in __call__
>     return callMethod()
>   File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py", line 54,
> in <lambda>
>     **kwargs)
>   File "<string>", line 2, in prepareVmChannel
>   File "/usr/lib64/python3.6/multiprocessing/managers.py", line 772, in
> _callmethod
>     raise convert_to_error(kind, result)
> FileNotFoundError: [Errno 2] No such file or directory:
> '/var/lib/libvirt/qemu/channels/0c11a23b-bc3a-4b97-9368-296c234e6611.com.
> redhat.rhevm.vdsm'

I am pretty certain that this didn't cause the failure, although would be
useful to fix to minimize confusion. Changing the summary line accordingly.

Comment 3 Michal Skrivanek 2019-12-24 15:14:04 UTC
Guest agent presence is unrelated; this is about a missing socket file during the recovery flow. That shouldn't happen. But the logs only contain the recovery run, not the initial one with the domain XML, so I can't tell whether the VM was created wrongly or something else happened.

Comment 4 Yedidyah Bar David 2019-12-25 07:00:15 UTC
Please try to reproduce with current builds. If it still fails, please attach all relevant logs. If unsure, a sosreport should be enough. Thanks.

Comment 5 Wei Wang 2019-12-25 07:21:22 UTC
(In reply to Yedidyah Bar David from comment #4)
> Please try to reproduce with current builds. If it still fails, please
> attach all relevant logs. If unsure, a sosreport should be enough. Thanks.

OK, I will try it after the RHVH 4.3.8 tier2 test.

Comment 6 Wei Wang 2019-12-26 04:25:41 UTC
Reproduced this bug with
RHVH-4.4-20191205.t.1-RHVH-x86_64-dvd1
rhvm-appliance-4.4-20191204.3.el8ev.x86_64

All the relevant logs are attached.

Comment 7 Wei Wang 2019-12-26 04:26:41 UTC
Created attachment 1647655 [details]
/var/log files

Comment 8 Wei Wang 2019-12-26 04:27:22 UTC
Created attachment 1647656 [details]
journalctl log

Comment 10 Yedidyah Bar David 2019-12-26 07:28:14 UTC
(In reply to Wei Wang from comment #6)
> Reproduced this bug with
> RHVH-4.4-20191205.t.1-RHVH-x86_64-dvd1
> rhvm-appliance-4.4-20191204.3.el8ev.x86_64
> 
> All the relevant logs are attached.

That's still too old. The error there is the same as before: failing during 'Check OVF_STORE volume status'.

I didn't try a recent RHV build. I did try a few oVirt ones, and the one that worked for me was from 2019-12-22. So please try a later version.

Comment 13 Wei Wang 2019-12-26 07:52:39 UTC
(In reply to Yedidyah Bar David from comment #10)
> (In reply to Wei Wang from comment #6)
> > Reproduced this bug with
> > RHVH-4.4-20191205.t.1-RHVH-x86_64-dvd1
> > rhvm-appliance-4.4-20191204.3.el8ev.x86_64
> > 
> > All the relevant logs are attached.
> 
> That's still too old. The error there is the same as before: failing during
> 'Check OVF_STORE volume status'.
> 
> I didn't try a recent RHV build. I did try a few oVirt ones, and the one
> that worked for me was from 2019-12-22. So please try a later version.

Let me test with upstream version now, will give the result later.

Comment 14 Wei Wang 2019-12-27 08:22:37 UTC
Test Version
ovirt-node-ng-installer-4.4.0-2019122607.el8.iso
ovirt-engine-appliance-4.4-20191226174442.1.el8.x86_64

Test Result:
Hosted engine deploy failed with:

[ INFO ] TASK [ovirt.hosted_engine_setup : Wait for the host to be up]
[ ERROR ] fatal: [localhost]: FAILED! => {"ansible_facts": {"ovirt_hosts": [{"address": "hp-dl388g9-04.lab.eng.pek2.redhat.com", "affinity_labels": [], "auto_numa_status": "unknown", "certificate": {"organization": "lab.eng.pek2.redhat.com", "subject": "O=lab.eng.pek2.redhat.com,CN=hp-dl388g9-04.lab.eng.pek2.redhat.com"}, "cluster": {"href": "/ovirt-engine/api/clusters/5bce182a-28c0-11ea-a794-5254003404b0", "id": "5bce182a-28c0-11ea-a794-5254003404b0"}, "comment": "", "cpu": {"speed": 0.0, "topology": {}}, "device_passthrough": {"enabled": false}, "devices": [], "external_network_provider_configurations": [], "external_status": "ok", "hardware_information": {"supported_rng_sources": []}, "hooks": [], "href": "/ovirt-engine/api/hosts/9013d38e-71ec-460e-b1af-f510d85a2592", "id": "9013d38e-71ec-460e-b1af-f510d85a2592", "katello_errata": [], "kdump_status": "unknown", "ksm": {"enabled": false}, "max_scheduling_memory": 0, "memory": 0, "name": "hp-dl388g9-04.lab.eng.pek2.redhat.com", "network_attachments": [], "nics": [], "numa_nodes": [], "numa_supported": false, "os": {"custom_kernel_cmdline": ""}, "permissions": [], "port": 54321, "power_management": {"automatic_pm_enabled": true, "enabled": false, "kdump_detection": true, "pm_proxies": []}, "protocol": "stomp", "se_linux": {}, "spm": {"priority": 5, "status": "none"}, "ssh": {"fingerprint": "SHA256:CNEZTFA8dISuv4k96apQsdOPWdOS9YvOltDeVKyEAtU", "port": 22}, "statistics": [], "status": "install_failed", "storage_connection_extensions": [], "summary": {"total": 0}, "tags": [], "transparent_huge_pages": {"enabled": false}, "type": "rhel", "unmanaged_networks": [], "update_available": false, "vgpu_placement": "consolidated"}]}, "attempts": 120, "changed": false, "deprecations": [{"msg": "The 'ovirt_host_facts' module has been renamed to 'ovirt_host_info', and the renamed one no longer returns ansible_facts", "version": "2.13"}]}

didi,
Please check the log in my machine environment(reserved until Jan, 2, 2020). I will send the info via email.

Comment 15 Yedidyah Bar David 2019-12-29 08:35:17 UTC
This is same machine and failure as bug 1770094 comment 11.

I guess we do want to continue tracking the 'Failed to connect to guest agent channel' msg, but it's now clear that this isn't what's failing the deploy. So changing subject accordingly and removing TestBlocker.

Comment 16 Sandro Bonazzola 2020-01-08 08:20:16 UTC
Moving this to MODIFIED since bug #1785272 is in MODIFIED state.
We'll move both to QE at the same time, and we'll reopen this one if it still reproduces once the yum-utils issue is fixed.

Comment 17 Wei Wang 2020-01-14 06:36:43 UTC
QE will verify this bug once we get the new 4.4 build.

Comment 18 Wei Wang 2020-02-06 08:28:13 UTC
According to https://bugzilla.redhat.com/show_bug.cgi?id=1770094#c17, the bug still reproduces; moving the status to "ASSIGNED".

Test Version:
RHVH-4.4-20200205.1-RHVH-x86_64-dvd1.iso
cockpit-ovirt-dashboard-0.14.1-1.el8ev.noarch
cockpit-bridge-211.1-1.el8.x86_64
cockpit-dashboard-211-1.el8.noarch
cockpit-system-211.1-1.el8.noarch
cockpit-ws-211.1-1.el8.x86_64
cockpit-211.1-1.el8.x86_64
cockpit-storaged-211-1.el8.noarch
rhvm-appliance-4.4-20200123.0.el8ev.x86_64

Comment 19 Yedidyah Bar David 2020-02-10 06:40:26 UTC
Why is this a test blocker? What test does it block, other than things related to the guest agent?

Comment 20 Wei Wang 2020-02-10 09:38:10 UTC
(In reply to Yedidyah Bar David from comment #19)
> Why is this a test blocker? What test does it block, other than things
> related to the guest agent?

All hosted engine deployment test cases are blocked by this bug, since hosted engine deploy fails before storage is set up. Refer to https://bugzilla.redhat.com/show_bug.cgi?id=1770094#c17.
Test version:
RHVH-4.4-20200205.1-RHVH-x86_64-dvd1.iso
cockpit-ovirt-dashboard-0.14.1-1.el8ev.noarch
cockpit-bridge-211.1-1.el8.x86_64
cockpit-dashboard-211-1.el8.noarch
cockpit-system-211.1-1.el8.noarch
cockpit-ws-211.1-1.el8.x86_64
cockpit-211.1-1.el8.x86_64
cockpit-storaged-211-1.el8.noarch
rhvm-appliance-4.4-20200123.0.el8ev.x86_64

Test Result:
Hosted engine deploy failed with:

[ INFO ] TASK [ovirt.hosted_engine_setup : Wait for the host to be up]
[ ERROR ] fatal: [localhost]: FAILED! => {"ansible_facts": {"ovirt_hosts": [{"address": "hp-dl388g9-04.lab.eng.pek2.redhat.com", "affinity_labels": [], "auto_numa_status": "unknown", "certificate": {"organization": "lab.eng.pek2.redhat.com", "subject": "O=lab.eng.pek2.redhat.com,CN=hp-dl388g9-04.lab.eng.pek2.redhat.com"}, "cluster": {"href": "/ovirt-engine/api/clusters/3674fe3a-48b5-11ea-a66c-5254003404b0", "id": "3674fe3a-48b5-11ea-a66c-5254003404b0"}, "comment": "", "cpu": {"speed": 0.0, "topology": {}}, "device_passthrough": {"enabled": false}, "devices": [], "external_network_provider_configurations": [], "external_status": "ok", "hardware_information": {"supported_rng_sources": []}, "hooks": [], "href": "/ovirt-engine/api/hosts/54280d80-f3aa-4f68-b39c-1dcbbfcc8b45", "id": "54280d80-f3aa-4f68-b39c-1dcbbfcc8b45", "katello_errata": [], "kdump_status": "unknown", "ksm": {"enabled": false}, "max_scheduling_memory": 0, "memory": 0, "name": "hp-dl388g9-04.lab.eng.pek2.redhat.com", "network_attachments": [], "nics": [], "numa_nodes": [], "numa_supported": false, "os": {"custom_kernel_cmdline": ""}, "permissions": [], "port": 54321, "power_management": {"automatic_pm_enabled": true, "enabled": false, "kdump_detection": true, "pm_proxies": []}, "protocol": "stomp", "se_linux": {}, "spm": {"priority": 5, "status": "none"}, "ssh": {"fingerprint": "SHA256:piuq9fnOwos/lbsFaQKgYl7Mz+0rqlWNo/vqhZ39IPY", "port": 22}, "statistics": [], "status": "install_failed", "storage_connection_extensions": [], "summary": {"total": 0}, "tags": [], "transparent_huge_pages": {"enabled": false}, "type": "rhel", "unmanaged_networks": [], "update_available": false, "vgpu_placement": "consolidated"}]}, "attempts": 120, "changed": false, "deprecations": [{"msg": "The 'ovirt_host_facts' module has been renamed to 'ovirt_host_info', and the renamed one no longer returns ansible_facts", "version": "2.13"}]}

vdsm.log
2020-02-06 16:01:34,627+0800 ERROR (vm/7b97fbc8) [virt.vm] (vmId='7b97fbc8-4d9b-455c-b14a-afad0e136e5a') Failed to connect to guest agent channel (vm:2232)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 2230, in _vmDependentInit
    self.guestAgent.start()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/guestagent.py", line 247, in start
    self._prepare_socket()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/guestagent.py", line 289, in _prepare_socket
    supervdsm.getProxy().prepareVmChannel(self._socketName)
  File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py", line 56, in __call__
    return callMethod()
  File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py", line 54, in <lambda>
    **kwargs)
  File "<string>", line 2, in prepareVmChannel
  File "/usr/lib64/python3.6/multiprocessing/managers.py", line 772, in _callmethod
    raise convert_to_error(kind, result)
FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/libvirt/qemu/channels/7b97fbc8-4d9b-455c-b14a-afad0e136e5a.com.redhat.rhevm.vdsm'


Actually, this bug blocks bug 1770094, which occurs after storage is set up.

Comment 21 Wei Wang 2020-02-14 02:51:21 UTC
The bug is reproduced with RHVH-UNSIGNED-ISO-4.4-20200212.0-RHVH-x86_64-dvd1.iso and rhvm-appliance-4.4-20200123.0.el8ev.rpm.

Comment 22 Yedidyah Bar David 2020-02-17 08:13:58 UTC
Wei, are you sure that hosted-engine deploy fails for you _because_ of current bug? About missing guest agent? I do not think so. I think you should be able to find some other error in the logs that explains why deploy failed.

Current bug is only about its current subject: During hosted engine deploy, vdsm log has: "Failed to connect to guest agent channel".

We might eventually fix it, not even sure about that, it might be harmless.

If you think it's a TestBlocker, please explain why.

If you think it causes deploy to fail, please explain why. As I said, if deploy fails, you should find other errors.

Thanks!

I personally got this error message in vdsm.log also during a successful deploy, so I do not think it's related to a failure to deploy.

Some things we might do about _current_ bug:

1. Actually fix it - install the guest agent and make sure vdsm can contact it, thus not log this error.

2. Talk with vdsm people, and if it's not a real error, change it to a warning.

3. Do nothing and close it (and document here that it's harmless).

Comment 23 Wei Wang 2020-02-18 02:19:18 UTC
(In reply to Yedidyah Bar David from comment #22)
> Wei, are you sure that hosted-engine deploy fails for you _because_ of
> current bug? About missing guest agent? I do not think so. I think you
> should be able to find some other error in the logs that explains why deploy
> failed.
> 
> Current bug is only about its current subject: During hosted engine deploy,
> vdsm log has: "Failed to connect to guest agent channel".
> 
> We might eventually fix it, not even sure about that, it might be harmless.
> 
> If you think it's a TestBlocker, please explain why.
> 
> If you think it causes deploy to fail, please explain why. As I said, if
> deploy fails, you should find other errors.
> 
> Thanks!
> 
> I personally got this error message in vdsm.log also during a successful
> deploy, so I do not think it's related to a failure to deploy.
> 
> Some things we might do about _current_ bug:
> 
> 1. Actually fix it - install the guest agent and make sure vdsm can contact
> it, thus not log this error.
> 
> 2. Talk with vdsm people, and if it's not a real error, change it to a
> warning.
> 
> 3. Do nothing and close it (and document here that it's harmless).

OK, I will check the logs again soon and update later.

Comment 24 Wei Wang 2020-02-18 06:34:09 UTC
Test with RHVH-UNSIGNED-ISO-4.4-20200212.0-RHVH-x86_64-dvd1.iso and rhvm-appliance-4.4-20200123.0.el8ev.rpm again:
Hosted engine deployment actually failed at "Wait for the host to be up" with the latest two builds, and the symptom is the same as in the original bug, hence comments #18, #20, #21.
Besides, all the hosted engine test cases fail or are blocked because of this failure, so I moved it to TestBlocker.

From your comments I can see that the vdsm log's "Failed to connect to guest agent channel" is not the root cause of this hosted engine failure. I've checked all the ERROR entries in the related /var/log files and list them here:

1) ovirt-hosted-engine-setup-ansible-bootstrap_local_vm-2020118131514-z2hsax.log
2020-02-18 13:39:17,030+0800 ERROR ansible failed {
    "ansible_host": "localhost",
    "ansible_playbook": "/usr/share/ovirt-hosted-engine-setup/ansible/trigger_role.yml",
    "ansible_result": {
        "_ansible_no_log": false,
        "ansible_facts": {
            "ovirt_hosts": [
                {
                    "address": "hp-dl388g9-05.lab.eng.pek2.**FILTERED**.com",
                    "affinity_labels": [],
                    "auto_numa_status": "unknown",
                    "certificate": {
                        "organization": "lab.eng.pek2.**FILTERED**.com",
                        "subject": "O=lab.eng.pek2.**FILTERED**.com,CN=hp-dl388g9-05.lab.eng.pek2.**FILTERED**.com"
                    },
                    "cluster": {
                        "href": "/ovirt-engine/api/clusters/c3957396-520e-11ea-889f-5254005d2164",
                        "id": "c3957396-520e-11ea-889f-5254005d2164"
                    },
                    "comment": "",
                    "cpu": {
                        "speed": 0.0,
                        "topology": {}
                    },
                    "device_passthrough": {
                        "enabled": false
                    },
                    "devices": [],
                    "external_network_provider_configurations": [],
                    "external_status": "ok",
                    "hardware_information": {
                        "supported_rng_sources": []
                    },
                    "hooks": [],
                    "href": "/ovirt-engine/api/hosts/71298729-124e-4537-9bc3-af5e501dd484",
                    "id": "71298729-124e-4537-9bc3-af5e501dd484",
                    "katello_errata": [],
                    "kdump_status": "unknown",
                    "ksm": {
                        "enabled": false
                    },

"max_scheduling_memory": 0,
                    "memory": 0,
                    "name": "hp-dl388g9-05.lab.eng.pek2.**FILTERED**.com",
                    "network_attachments": [],
                    "nics": [],
                    "numa_nodes": [],
                    "numa_supported": false,
                    "os": {
                        "custom_kernel_cmdline": ""
                    },
                    "permissions": [],
                    "port": 54321,
                    "power_management": {
                        "automatic_pm_enabled": true,
                        "enabled": false,
                        "kdump_detection": true,
                        "pm_proxies": []
                    },
                    "protocol": "stomp",
                    "se_linux": {},
                    "spm": {
                        "priority": 5,
                        "status": "none"
                    },
                    "ssh": {
                        "fingerprint": "SHA256:GdXFF2XDf24xo5RoahSGbjVHcJkMIGgKmamMP0C3yuc",
                        "port": 22
                    },
                    "statistics": [],
                    "status": "install_failed",
                    "storage_connection_extensions": [],
                    "summary": {
                        "total": 0
                    },
                    "tags": [],
                    "transparent_huge_pages": {

                         "enabled": false
                    },
                    "type": "rhel",
                    "unmanaged_networks": [],
                    "update_available": false,
                    "vgpu_placement": "consolidated"
                }
            ]
        },
        "attempts": 120,
        "changed": false,
        "deprecations": [
            {
                "msg": "The 'ovirt_host_facts' module has been renamed to 'ovirt_host_info', and the renamed one no longer returns ansible_facts",
                "version": "2.13"
            }
        ],
        "invocation": {
            "module_args": {
                "all_content": false,
                "cluster_version": null,
                "fetch_nested": false,
                "nested_attributes": [],
                "pattern": "name=hp-dl388g9-05.lab.eng.pek2.**FILTERED**.com"
            }
        }
    },
    "ansible_task": "Wait for the host to be up",
    "ansible_type": "task",
    "status": "FAILED",
    "task_duration": 664
}

2) engine.log
2020-02-18 13:26:47,869+08 ERROR [org.ovirt.engine.core.bll.CpuFlagsManagerHandler] (ServerService Thread Pool -- 40) [] Error getting info for CPU ' ', not in expected format.
2020-02-18 13:26:47,870+08 ERROR [org.ovirt.engine.core.bll.CpuFlagsManagerHandler] (ServerService Thread Pool -- 40) [] Error getting info for CPU ' ', not in expected format.
2020-02-18 13:36:43,061+08 ERROR [org.ovirt.engine.core.bll.hostdeploy.InstallVdsInternalCommand] (EE-ManagedThreadFactory-engine-Thread-1) [6a42b7c1] Host installation failed for host '71298729-124e-4537-9bc3-af5e501dd484', 'hp-dl388g9-05.lab.eng.pek2.redhat.com': Task Ensure Open vSwitch is started failed to execute:
2020-02-18 13:36:43,066+08 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-engine-Thread-1) [6a42b7c1] START, SetVdsStatusVDSCommand(HostName = hp-dl388g9-05.lab.eng.pek2.redhat.com, SetVdsStatusVDSCommandParameters:{hostId='71298729-124e-4537-9bc3-af5e501dd484', status='InstallFailed', nonOperationalReason='NONE', stopSpmFailureLogged='false', maintenanceReason='null'}), log id: 1e580687
2020-02-18 13:36:43,075+08 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-engine-Thread-1) [6a42b7c1] FINISH, SetVdsStatusVDSCommand, return: , log id: 1e580687
2020-02-18 13:36:43,101+08 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-1) [6a42b7c1] EVENT_ID: VDS_INSTALL_FAILED(505), Host hp-dl388g9-05.lab.eng.pek2.redhat.com installation failed. Task Ensure Open vSwitch is started failed to execute: .
2020-02-18 13:36:43,121+08 INFO  [org.ovirt.engine.core.bll.hostdeploy.InstallVdsInternalCommand] (EE-ManagedThreadFactory-engine-Thread-1) [6a42b7c1] Lock freed to object 'EngineLock:{exclusiveLocks='[71298729-124e-4537-9bc3-af5e501dd484=VDS]', sharedLocks=''}'

3) vdsm.log
2020-02-18 13:35:37,251+0800 ERROR (vm/03110ca5) [virt.vm] (vmId='03110ca5-125b-41ef-b16b-abf13ec7b6f6') Failed to connect to guest agent channel (vm:2238)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 2236, in _vmDependentInit
    self.guestAgent.start()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/guestagent.py", line 247, in start
    self._prepare_socket()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/guestagent.py", line 289, in _prepare_socket
    supervdsm.getProxy().prepareVmChannel(self._socketName)
  File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py", line 56, in __call__
    return callMethod()
  File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py", line 54, in <lambda>
    **kwargs)
  File "<string>", line 2, in prepareVmChannel
  File "/usr/lib64/python3.6/multiprocessing/managers.py", line 772, in _callmethod
    raise convert_to_error(kind, result)
FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/libvirt/qemu/channels/03110ca5-125b-41ef-b16b-abf13ec7b6f6.com.redhat.rhevm.vdsm'

4) supervdsm.log
MainProcess|vm/03110ca5::DEBUG::2020-02-18 13:35:37,250::supervdsm_server::93::SuperVdsm.ServerCallback::(wrapper) call prepareVmChannel with ('/var/lib/libvirt/qemu/channels/03110ca5-125b-41ef-b16b-abf13ec7b6f6.com.redhat.rhevm.vdsm',) {}
MainProcess|vm/03110ca5::ERROR::2020-02-18 13:35:37,250::supervdsm_server::97::SuperVdsm.ServerCallback::(wrapper) Error in prepareVmChannel
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/supervdsm_server.py", line 95, in wrapper
    res = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/supervdsm_api/virt.py", line 39, in prepareVmChannel
    fsinfo = os.stat(socketFile)
FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/libvirt/qemu/channels/03110ca5-125b-41ef-b16b-abf13ec7b6f6.com.redhat.rhevm.vdsm'


Maybe 1) and 2) hold the key to this issue; we can see in the engine that host installation failed. And I suspect this is related to rhvm-appliance: the RHVH build was updated while the rhvm-appliance is still the old one, and I got the same failure. Does DEV have any update of rhvm-appliance for 4.4?
I am not familiar with debugging this, though. Could devel help pin down the true cause of this failure, so that I can report a new bug to track it? Thanks.
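
One quick check suggested by the engine.log excerpt in 2) ("Task Ensure Open vSwitch is started failed to execute"): confirm on the host whether the Open vSwitch services actually came up. A small sketch; the unit names are an assumption for a typical el8 install, not taken from the logs above:

import subprocess

# Query systemd for the usual Open vSwitch units (assumed names).
for unit in ("openvswitch", "ovsdb-server", "ovs-vswitchd"):
    out = subprocess.run(["systemctl", "is-active", unit],
                         capture_output=True, text=True).stdout.strip()
    print(unit, "->", out or "unknown")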

Comment 25 Yedidyah Bar David 2020-02-18 06:47:11 UTC
Wei, thanks a lot for trying again and reporting. I see that you too now understand that the deploy failure is not due to the guest agent but to some other reason.

We (dev) are currently working on moving the engine to el8, and things are a bit unstable. Hopefully we'll finish soon and then you can resume trying to deploy hosted-engine.

Comment 26 Wei Wang 2020-02-18 07:23:48 UTC
(In reply to Yedidyah Bar David from comment #25)
> Wei, thanks a lot for trying again and report. I see that you too now
> understand that the deploy failure is not due to the guest agent but some
> other reason.
> 
> We (dev) are currently working on moving the engine to el8, and things are a
> bit unstable. Hopefully we'll finish soon and then you can resume trying to
> deploy hosted-engine.

OK, is this tracked by a bug? Could you provide the bug ID? If necessary, I need this ID to mark the test case. Thanks.

Comment 27 Yedidyah Bar David 2020-02-18 07:45:29 UTC
(In reply to Wei Wang from comment #26)
> ok, this must be traced by bug? Could you provide the bug ID? If necessary,
> I need this id to mark test case, thanks.

This is tracked in bug 1701491.

Comment 28 Wei Wang 2020-02-18 08:10:34 UTC
Removing TestBlocker according to the above comments.

Comment 29 Gal Zaidman 2020-02-18 08:35:09 UTC
Can you please lower the priority on this, as it is not an urgent bug?

Comment 31 Kenneth Weade 2020-03-30 13:10:30 UTC
How about you change the priority back to urgent and fix the issue.  This is causing installs of RHV oVirt engine to fail following your install instructions.  The only solution is to go back to the previous version and get a clean install without errors.  All of my servers are wanting to update and I have new servers to set up.  It would be nice if I could trust the new versions that were released and not have to vet and test every patch and release that Red Hat publishes.  This needs to be the highest priority.

Comment 32 Gal Zaidman 2020-03-30 14:27:06 UTC
(In reply to Kenneth Weade from comment #31)
> How about you change the priority back to urgent and fix the issue.  This is
> causing installs of RHV oVirt engine to fail following your install
> instructions.  The only solution is to go back to the previous version and
> get a clean install without errors.  All of my servers are wanting to update
> and I have new servers to set up.  It would be nice if I could trust the new
> versions that were released and not have to vet and test every patch and
> release that Red Hat publishes.  This needs to be the highest priority.

Hi Kenneth,

Please look at comment #22: this specific bug is not what is failing the installation.
If your installation fails, you are more than welcome to:
1. open a bug and add the system logs and we will be happy to have a look.
2. start a thread on ovirt-users mail list

Comment 33 Michal Skrivanek 2020-04-15 06:35:47 UTC
Decreasing prio/sev; this is not failing anything...

Comment 34 Sandro Bonazzola 2020-06-09 15:11:13 UTC
From gerrit review:
> well, yeah, why do we create that channel anyway?
> it's a slightly bigger change, so here i would indeed just changed to INFO so it's not too verbose, and open bug on clean removal of ovirt ga socket in engine
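
In that spirit, a minimal sketch of the direction the merged fix (gerrit 110184) takes; this illustrates the idea only and is not the actual patch: treat a missing ovirt-guest-agent channel socket as expected on el8, where the appliance ships only qemu-guest-agent.

import logging
import os

log = logging.getLogger("virt.vm")

def prepare_vm_channel(socket_file):
    # Sketch only: skip quietly when the legacy com.redhat.rhevm.vdsm
    # channel socket was never created for this VM, instead of raising
    # FileNotFoundError as in the tracebacks above.
    if not os.path.exists(socket_file):
        log.info("Guest agent channel %s is not configured; skipping",
                 socket_file)
        return None
    return os.stat(socket_file)  # then adjust ownership/mode as before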

Comment 35 Wei Wang 2020-08-11 09:21:07 UTC
The latest build RHVH-4.4-20200722.1-RHVH-x86_64-dvd1.iso includes vdsm-4.40.22-1.el8ev.x86_64.
QE will wait for a build that includes vdsm-4.40.24 to do the verification.

Comment 36 Wei Wang 2020-08-13 09:52:59 UTC
Tested with RHVH-4.4-20200812.4-RHVH-x86_64-dvd1.iso (includes vdsm-4.40.25-1.el8ev.x86_64).
The bug is fixed; moving the status to "VERIFIED".

Comment 37 Sandro Bonazzola 2020-09-18 07:13:27 UTC
This bugzilla is included in oVirt 4.4.2 release, published on September 17th 2020.

Since the problem described in this bug report should be resolved in oVirt 4.4.2 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

