Bug 1280268 - HE-VM cannot startup automatically after successful configure HE
Status: CLOSED ERRATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-hosted-engine-ha
Version: 3.6.0
Hardware/OS: Unspecified / Unspecified
Priority: urgent  Severity: urgent
Target Milestone: ovirt-3.6.1
Target Release: 3.6.1
Assigned To: Martin Sivák
QA Contact: Artyom
Keywords: Triaged
Duplicates: 1278548
Depends On:
Blocks: 1204470 1245143 RHEV3.6Upgrade
 
Reported: 2015-11-11 06:16 EST by Chaofeng Wu
Modified: 2016-03-09 14:49 EST (History)
CC: 23 users

See Also:
Fixed In Version: ovirt-hosted-engine-ha-1.3.3.5-1.el7ev
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-03-09 14:49:42 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Integration
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
sosreport (5.66 MB, application/x-xz), 2015-11-11 06:16 EST, Chaofeng Wu
sosreport (5.69 MB, application/x-xz), 2015-11-12 02:55 EST, Chaofeng Wu
/var/log (1.05 MB, application/x-gzip), 2015-11-12 02:57 EST, Chaofeng Wu
sosreport from iSCSI deployment (5.72 MB, application/x-xz), 2015-11-16 05:00 EST, Nikolai Sednev
sosreport from NFS deployment (5.54 MB, application/x-xz), 2015-11-16 05:01 EST, Nikolai Sednev
/var/log (10.32 MB, application/x-gzip), 2015-11-24 05:30 EST, Chaofeng Wu
he setup log (430.69 KB, text/plain), 2015-11-26 07:27 EST, Anatoly Litovsky
sosreport with add host failure (6.27 MB, application/x-xz), 2015-11-26 09:21 EST, Anatoly Litovsky
hosted engine VM sosreport (6.82 MB, application/x-xz), 2015-11-26 09:36 EST, Anatoly Litovsky


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 48397 None None None Never
oVirt gerrit 48457 None None None Never
oVirt gerrit 50623 master DRAFT service: Point systemd to the pid files 2015-12-16 10:07 EST
oVirt gerrit 50711 None None None 2015-12-18 11:14 EST
oVirt gerrit 50714 None None None 2015-12-18 11:14 EST

Description Chaofeng Wu 2015-11-11 06:16:14 EST
Created attachment 1092641 [details]
sosreport

Description of problem:
After we install and configure the RHEV-M VM successfully, the RHEV-M VM still does not start up even after 20 minutes. We need to manually restart the ovirt-ha-agent and ovirt-ha-broker services to start the RHEV-M VM.


Version-Release number of selected component (if applicable):
rhev-hypervisor7-7.2-20151104.0.iso
ovirt-node-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
rhevm-appliance-20151110.0-1.x86_64.rhevm.ova

How reproducible:
50%
We tried 4 times; in 2 of them the VM failed to start up automatically.

Steps to Reproduce:
1. PXE install rhev-hypervisor7-7.2-20151104.0.iso
2. Configure HE successfully, power off the RHEV-M VM, and wait for the monitoring service to start the VM.

Actual results:
After step 2, even after waiting more than 20 minutes, the RHEV-M VM still did not start up automatically.

Expected results:
After step 2, the RHEV-M VM should start up within several minutes.

Additional info:
There were no errors during the HE installation and configuration.
If you manually restart the ovirt-ha-agent and ovirt-ha-broker services, the RHEV-M VM starts up properly.
Comment 1 Fabian Deutsch 2015-11-11 08:43:12 EST
Simone, can you tell by the logs what might be happening?
Comment 2 Simone Tiraboschi 2015-11-11 08:45:10 EST
Chaofeng, did you also check the status of those services before restarting?
Comment 3 Simone Tiraboschi 2015-11-11 08:49:09 EST
From the log it seems that ovirt-ha-agent started only once, at:

MainThread::INFO::2015-11-11 07:00:23,528::agent::78::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 1.3.2.1 started

No attempt before
Comment 4 Simone Tiraboschi 2015-11-11 09:08:33 EST
And it was started by ovirt-hosted-engine-setup at that time.

2015-11-11 07:00:19 DEBUG otopi.plugins.otopi.services.systemd plugin.executeRaw:828 execute: ('/bin/systemctl', 'start', 'ovirt-ha-agent.service'), executable='None', cwd='None', env=None
2015-11-11 07:00:23 DEBUG otopi.plugins.otopi.services.systemd plugin.executeRaw:878 execute-result: ('/bin/systemctl', 'start', 'ovirt-ha-agent.service'), rc=0

No attempt before or after in these logs.

The ovirt-ha-broker was started twice in the past, but the last start was:
MainThread::INFO::2015-11-11 07:00:21,496::broker::57::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 1.3.2.1 started

from ovirt-hosted-engine-setup

So I don't see any manual restart here.

ovirt-ha-agent was stopped at 07:00:26; I am not sure why:

Nov 11 07:00:26 localhost systemd-ovirt-ha-agent: Stopping ovirt-ha-agent: [  OK  ]
And nobody restarted it afterwards.


ovirt-hosted-engine-setup seems to be stalled or to have crashed abruptly:

2015-11-11 07:00:23 DEBUG otopi.plugins.ovirt_hosted_engine_setup.core.misc misc._persist_files_start:131 Saving persisting file configuration
Comment 5 Fabian Deutsch 2015-11-11 09:52:35 EST
Please provide the complete /var/log directory.
Comment 6 Fabian Deutsch 2015-11-11 10:11:45 EST
This might be the same as bug 1280310
Comment 7 Douglas Schilling Landgraf 2015-11-11 16:17:24 EST
It looks like I have reproduced the report, using the scenario below:

# cat /etc/redhat-release 
Red Hat Enterprise Virtualization Hypervisor release 7.2 (20151104.0.el7ev)

# rpm -qa | grep -i hosted-engine
ovirt-hosted-engine-setup-1.3.0-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.2.1-1.el7ev.noarch
ovirt-node-plugin-hosted-engine-0.3.0-2.el7ev.noarch


Steps to reproduce:
-------------------------
#1 Configure HE successfully
#2 ssh to the machine, execute the halt command.
#3 Wait for the monitoring service to start the VM (nothing happens)


I have noticed that even with the halt command, the qemu process is still up on the node, and that is probably why HE keeps waiting for the VM to go down.

# ps aux | grep -i qemu
qemu     20676 20.4 28.9 4841600 3696356 ?     Sl   20:18  10:18 /usr/libexec/qemu-kvm -name HostedEngine -S -machine rhel6.5.0,accel=kvm,usb=off -cpu SandyBridge -m 4096 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 9eace8ed-4410-4b60-858a-4ab8291776a5 -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=7.2-20151104.0.el7ev,serial=A2FFC0AA-EC96-4A0A-8BCE-4169E6B72637,uuid=9eace8ed-4410-4b60-858a-4ab8291776a5 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-HostedEngine/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2015-11-11T20:19:00,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-reboot -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive if=none,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/var/run/vdsm/storage/9bdb425a-2260-461f-b9f2-83140de3dddf/fb927e75-a578-4fee-bb4c-c6f1fa0ef20a/75d05ee5-fb0b-442a-a497-be9adc50dbaa,if=none,id=drive-virtio-disk0,format=raw,serial=fb927e75-a578-4fee-bb4c-c6f1fa0ef20a,cache=none,werror=stop,rerror=stop,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=28,id=hostnet0,vhost=on,vhostfd=29 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:16:3e:4a:25:9a,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/9eace8ed-4410-4b60-858a-4ab8291776a5.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/9eace8ed-4410-4b60-858a-4ab8291776a5.org.qemu.guest_agent.0,server,nowait -device 
virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev socket,id=charchannel2,path=/var/lib/libvirt/qemu/channels/9eace8ed-4410-4b60-858a-4ab8291776a5.org.ovirt.hosted-engine-setup.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=org.ovirt.hosted-engine-setup.0 -chardev socket,id=charconsole0,path=/var/run/ovirt-vmconsole-console/9eace8ed-4410-4b60-858a-4ab8291776a5.sock,server,nowait -device virtconsole,chardev=charconsole0,id=console0 -vnc 0:0,password -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -msg timestamp=on
Comment 9 Chaofeng Wu 2015-11-12 02:55 EST
Created attachment 1093036 [details]
sosreport
Comment 10 Chaofeng Wu 2015-11-12 02:57 EST
Created attachment 1093037 [details]
/var/log
Comment 12 Simone Tiraboschi 2015-11-12 03:48:26 EST
hosted-engine-setup waits for the VM to be halted, and it reaches that status:

2015-11-12 07:04:33 DEBUG otopi.ovirt_hosted_engine_setup.tasks tasks.wait:54 Waiting for VM down
2015-11-12 07:04:33 DEBUG otopi.ovirt_hosted_engine_setup.tasks tasks.wait:60 Done
2015-11-12 07:04:33 DEBUG otopi.plugins.ovirt_hosted_engine_setup.ha.ha_services ha_services._closeup:121 {'status': {'message': 'Machine destroyed', 'code': 0}}

Then it configures and enables the HA services and tries to persist the configuration, but it seems that hosted-engine-setup got stuck while persisting its configuration.
Indeed, in /var/log/ovirt-node.log (which wasn't included in the sosreport archive, and that is an issue by itself) we have:

2015-11-12 07:04:45,429      ERROR hosted-engine failed to deploy!
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt/node/setup/hostedengine/hosted_engine_page.py", line 301, in open_console
    self.__persist_configs()
  File "/usr/lib/python2.7/site-packages/ovirt/node/setup/hostedengine/hosted_engine_page.py", line 376, in __persist_configs
    [Config(d).persist(d) for d in dirs]
TypeError: __init__() takes exactly 1 argument (2 given)
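The TypeError above boils down to a constructor/arguments mismatch: `Config`'s `__init__` accepts no arguments beyond `self`, yet the caller passes the directory to it. A minimal sketch that reproduces the failing call pattern and the likely shape of the fix (this `Config` class is illustrative, not the real ovirt.node code):

```python
# Illustrative stand-in for ovirt-node's Config; not the actual class.
class Config(object):
    def __init__(self):          # takes only self, so Config(d) raises
        self.persisted = []

    def persist(self, path):     # the path belongs here, not in __init__
        self.persisted.append(path)
        return path

dirs = ["/etc/ovirt-hosted-engine", "/etc/ovirt-hosted-engine-ha"]

# Buggy call from the traceback: Config(d) passes an argument the
# constructor does not accept, raising TypeError.
try:
    [Config(d).persist(d) for d in dirs]
except TypeError as exc:
    print("reproduced:", exc)

# Likely shape of the fix: construct without arguments and pass the
# path only to persist().
cfg = Config()
print([cfg.persist(d) for d in dirs])
```

This matches the observation in comment 13 that the bug was a one-line call-site fix.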
Comment 13 Fabian Deutsch 2015-11-12 08:51:08 EST
Thanks.

It looks like this bug was just fixed yesterday by Douglas.
Comment 14 Fabian Deutsch 2015-11-16 04:16:45 EST
*** Bug 1278548 has been marked as a duplicate of this bug. ***
Comment 15 Nikolai Sednev 2015-11-16 04:57:47 EST
It still appears in RHEVH (Beta) release 7.2 (20151113.123.el7ev), while deployed over iSCSI/NFS:

# service ovirt-ha-agent status
Redirecting to /bin/systemctl status  ovirt-ha-agent.service
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Sun 2015-11-15 13:13:51 UTC; 47min ago
  Process: 12786 ExecStop=/usr/lib/systemd/systemd-ovirt-ha-agent stop (code=exited, status=0/SUCCESS)
 Main PID: 12782 (code=exited, status=0/SUCCESS)

Nov 15 13:13:44 black-vdsb.qa.lab.tlv.redhat.com systemd[1]: Starting oVirt Hosted Engine High Availability Monitoring Agent...
Nov 15 13:13:48 black-vdsb.qa.lab.tlv.redhat.com systemd-ovirt-ha-agent[12756]: Starting ovirt-ha-agent: [  OK  ]
Nov 15 13:13:48 black-vdsb.qa.lab.tlv.redhat.com systemd[1]: Started oVirt Hosted Engine High Availability Monitoring Agent.
Nov 15 13:13:51 black-vdsb.qa.lab.tlv.redhat.com systemd-ovirt-ha-agent[12786]: Stopping ovirt-ha-agent: [  OK  ]

# service ovirt-ha-broker status
Redirecting to /bin/systemctl status  ovirt-ha-broker.service
● ovirt-ha-broker.service - oVirt Hosted Engine High Availability Communications Broker
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-broker.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2015-11-15 13:13:52 UTC; 47min ago
  Process: 12828 ExecStart=/usr/lib/systemd/systemd-ovirt-ha-broker start (code=exited, status=0/SUCCESS)
 Main PID: 12872 (ovirt-ha-broker)
   CGroup: /system.slice/ovirt-ha-broker.service
           └─12872 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker

Nov 15 13:13:48 black-vdsb.qa.lab.tlv.redhat.com systemd[1]: Starting oVirt Hosted Engine High Availability Communications Broker...
Nov 15 13:13:52 black-vdsb.qa.lab.tlv.redhat.com systemd-ovirt-ha-broker[12828]: Starting ovirt-ha-broker: [  OK  ]
Nov 15 13:13:52 black-vdsb.qa.lab.tlv.redhat.com systemd[1]: Started oVirt Hosted Engine High Availability Communications Broker.
Comment 16 Nikolai Sednev 2015-11-16 05:00 EST
Created attachment 1094795 [details]
sosreport from iSCSI deployment
Comment 17 Nikolai Sednev 2015-11-16 05:01 EST
Created attachment 1094810 [details]
sosreport from NFS deployment
Comment 18 Nikolai Sednev 2015-11-16 05:04:13 EST
Please consider reopening as it appears in RHEVH (Beta) release 7.2 (20151113.123.el7ev).
Comment 19 Nikolai Sednev 2015-11-16 05:05:22 EST
Hi Fabian,
Can you please add exact RHEVH version, in which this bug was fixed?
Comment 20 Ying Cui 2015-11-16 05:06:24 EST
Nikolai, the build RHEVH (Beta) release 7.2 (20151113.123.el7ev) was never given to Virt QE for testing.
The last test build for Virt QE is rhev-hypervisor7-7.2-20151112.1.el7ev; if you still encounter the issue, please move the bug to ASSIGNED. Virt QE is testing this bug as well.
Comment 21 Fabian Deutsch 2015-11-16 07:30:37 EST
Please test this bug with RHEV-H 7.2 rhev-hypervisor7-7.2-20151112.1
Comment 22 Fabian Deutsch 2015-11-16 09:49:51 EST
Nikolai, why is this bug blocking bug 1172511?
Comment 23 Nikolai Sednev 2015-11-16 09:58:40 EST
Removing the dependency, as the final deployment does not actually matter for bug 1172511, since it is cancelled during deployment.
Comment 24 Nikolai Sednev 2015-11-16 10:09:37 EST
(In reply to Ying Cui from comment #20)
> Nikolai, the build RHEVH (Beta) release 7.2 (20151113.123.el7ev) was never
> gave Virt QE for testing.
> And the last test build is rhev-hypervisor7-7.2-20151112.1.el7ev for Virt
> QE, if you still encounter the issue, please make the bug to ASSIGNED. And
> Virt QE are testing this bug as well.

Further to my conversation with Anatoly Litovsky:
It is most likely (99.9%, per Anatoly) that the bug still exists; the version I checked (20151113.123.el7ev) is newer than the one you currently have.
For more information please contact Anatoly Litovsky.
Comment 25 Fabian Deutsch 2015-11-16 10:46:00 EST
Nikolai, please check this bug with the build given in comment 21.
Comment 26 Fabian Deutsch 2015-11-16 10:55:08 EST
Dropping the NEEDINFO for now.

We see that the bug should be fixed in ovirt-node-plugin-hosted-engine-0.3.0-3.

It needs to be clarified whether 20151113.123.el7ev contains this ovirt-node-plugin-hosted-engine build or not.
If RHEV-H 20151113.123.el7ev does not contain ovirt-node-plugin-hosted-engine-0.3.0-3, then the bug will not be fixed there.

However, we know that the beta1 build rhev-hypervisor7-7.2-20151112.1.el7ev does contain this new ovirt-node-plugin-hosted-engine build.

Tolik or Nikolai, can you please check which ovirt-node-plugin-hosted-engine build is contained in RHEV-H 20151113.123.el7ev?
Comment 30 Anatoly Litovsky 2015-11-18 14:01:29 EST
I just went through the whole process with the latest builds, and both the agent and the engine come up fine.
I guess I need to see your setup.

Ping me when you're ready.
Comment 34 Fabian Deutsch 2015-11-24 04:41:27 EST
As a note: There is a chance that the vdsm certs are not persisted correctly (or empty?) which can cause the "vdsm ProtocolDetector.SSLHandshakeDispatcher ERROR Error during handshake: unexpected eof" error
Comment 36 Chaofeng Wu 2015-11-24 05:30 EST
Created attachment 1098103 [details]
/var/log
Comment 39 Anatoly Litovsky 2015-11-25 14:31:55 EST
It took 18 minutes for engine recovery, and all this on local VM machines with no network delay.
I'm afraid this may be longer than 20 minutes on real deployments.
Please set us up with an environment where the bug is happening.
Comment 40 Anatoly Litovsky 2015-11-26 07:27 EST
Created attachment 1099274 [details]
he setup log
Comment 41 Anatoly Litovsky 2015-11-26 07:28:30 EST
It seems that HE setup fails but reports success:

    2015-11-26 10:29:22 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._closeup:653 Adding the host to the cluster
    2015-11-26 10:29:23 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._closeup:683 Cannot add the host to cluster Default
    Traceback (most recent call last):
      File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/ovirt-hosted-engine-setup/engine/add_host.py", line 673, in _closeup
        otopicons.NetEnv.IPTABLES_ENABLE
      File "/usr/lib/python2.7/site-packages/ovirtsdk/infrastructure/brokers.py", line 17454, in add
        headers={"Correlation-Id":correlation_id, "Expect":expect}
      File "/usr/lib/python2.7/site-packages/ovirtsdk/infrastructure/proxy.py", line 75, in add
        return self.request('POST', url, body, headers)
      File "/usr/lib/python2.7/site-packages/ovirtsdk/infrastructure/proxy.py", line 115, in request
        persistent_auth=self.__persistent_auth
      File "/usr/lib/python2.7/site-packages/ovirtsdk/infrastructure/connectionspool.py", line 79, in do_request
        persistent_auth)
      File "/usr/lib/python2.7/site-packages/ovirtsdk/infrastructure/connectionspool.py", line 155, in __do_request
        raise errors.RequestError(response_code, response_reason, response_body)
    RequestError:
    status: 409
    reason: Conflict
    detail: Cannot add Host. Connecting to host via SSH has failed, verify that the host is reachable (IP address, routable address etc.) You may refer to the engine.log file for further details.
    2015-11-26 10:29:23 ERROR otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._closeup:691 Cannot automatically add the host to cluster Default:
    Cannot add Host. Connecting to host via SSH has failed, verify that the host is reachable (IP address, routable address etc.) You may refer to the engine.log file for further details.
     
     
     
    The closeup function
     
    def _closeup(self):
            # shutdown the vm if this is first host.
            if not self.environment[ohostedcons.CoreEnv.IS_ADDITIONAL_HOST]:
     
                if self.environment[
                    ohostedcons.VMEnv.AUTOMATE_VM_SHUTDOWN
                ]:
                    self.logger.info(_('Shutting down the engine VM'))
                    cli = self.environment[ohostedcons.VDSMEnv.VDS_CLI]
                    res = cli.shutdown(self.environment[ohostedcons.VMEnv.VM_UUID])
                    self.logger.debug(res)
                else:
                    self.dialog.note(
                        _(
                            'Please shutdown the VM allowing the system '
                            'to launch it as a monitored service.\n'
                            'The system will wait until the VM is down.'
                        )
                    )
                waiter = tasks.VMDownWaiter(self.environment)
                if not waiter.wait():
                    # The VM is down but not destroyed
                    status = self.environment[
                        ohostedcons.VDSMEnv.VDS_CLI
                    ].destroy(
                        self.environment[ohostedcons.VMEnv.VM_UUID]
                    )
                    self.logger.debug(status)
                    if status['status']['code'] != 0:
                        self.logger.error(
                            _(
                                'Cannot destroy the Hosted Engine VM: ' +
                                status['status']['message']
                            )
                        )
                        raise RuntimeError(status['status']['message'])
            self.logger.info(_('Enabling and starting HA services'))
            for service in (
                ohostedcons.Const.HA_AGENT_SERVICE,
                ohostedcons.Const.HA_BROCKER_SERVICE,
            ):
                self.services.startup(
                    name=service,
                    state=True,
                )
                self.services.state(
                    name=service,
                    state=True,
                )
Comment 43 Anatoly Litovsky 2015-11-26 09:21 EST
Created attachment 1099299 [details]
sosreport with add host failure
Comment 44 Sandro Bonazzola 2015-11-26 09:27:10 EST
We need to check why the exit code is still 0 here.
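One common way an error like the one in comment 41 can still yield exit code 0: the exception is caught, logged as an ERROR, and never re-raised, so nothing propagates up to the setup's exit path. A hedged Python sketch of that pattern (the function name and message are illustrative, not the actual otopi/add_host code):

```python
import logging

logging.basicConfig(level=logging.ERROR)
logger = logging.getLogger("add_host")

def closeup_add_host():
    # Stand-in for the ovirtsdk call that raised RequestError (status 409).
    try:
        raise RuntimeError(
            "Cannot add Host. Connecting to host via SSH has failed")
    except RuntimeError as exc:
        # Logged but swallowed: the caller never sees the failure.
        logger.error(
            "Cannot automatically add the host to cluster Default: %s", exc)
        return None

closeup_add_host()
rc = 0  # nothing propagated, so the overall setup exit code stays 0
print("exit code:", rc)
```

Whether this is exactly what happens here is what comment 44 asks to verify; the sketch only shows why log-and-continue produces a rc=0 "success".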
Comment 45 Anatoly Litovsky 2015-11-26 09:36 EST
Created attachment 1099302 [details]
hosted engine VM sosreport
Comment 46 Simone Tiraboschi 2015-11-30 04:40:23 EST
Adding the host failed because the engine wasn't able to connect to the host.

2015-11-26 05:27:37,405 ERROR [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (ajp-/127.0.0.1:8702-2) [ec6f118] Failed to establish session with host 'hosted_engine_1': null
2015-11-26 05:27:37,405 WARN  [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (ajp-/127.0.0.1:8702-2) [ec6f118] CanDoAction of action 'AddVds' failed for user admin@internal. Reasons: VAR__ACTION__ADD,VAR__TYPE__HOST,$server dhcp-test-11-227.nay.redhat.com,VDS_CANNOT_CONNECT_TO_SERVER
2015-11-26 05:27:37,611 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (ajp-/127.0.0.1:8702-2) [] Operation Failed: [Cannot add Host. Connecting to host via SSH has failed, verify that the host is reachable (IP address, routable address etc.) You may refer to the engine.log file for further details.]

Then ovirt-hosted-engine-setup showed an alert about the issue and a menu letting the user retry, with an option to power off the engine VM, which wasn't working. It was fixed here https://gerrit.ovirt.org/#/c/47504/ as part of https://bugzilla.redhat.com/show_bug.cgi?id=1271511

The fix is available since https://brewweb.devel.redhat.com/buildinfo?buildID=466825
Comment 47 Simone Tiraboschi 2015-11-30 12:28:05 EST
I wasn't able to reproduce with:

[root@rhevh20151130 admin]# cat /etc/redhat-release
Red Hat Enterprise Virtualization Hypervisor (Beta) release 7.2 (20151112.1.el7ev)

[root@rhevh20151130 admin]# rpm -qa | grep -i hosted-engine
ovirt-hosted-engine-setup-1.3.0-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.2.1-1.el7ev.noarch
ovirt-node-plugin-hosted-engine-0.3.0-3.el7ev.noarch
Comment 48 Yaniv Kaul 2015-12-03 09:46:48 EST
(In reply to Simone Tiraboschi from comment #47)
> I wasn't able to reproduce with:

Do you need QE's reproduction? I don't see a NEEDINFO.
Comment 49 Simone Tiraboschi 2015-12-03 10:37:23 EST
Yes, thanks.

Chaofeng, are you able to provide a stable reproducer?
Comment 55 Fabian Deutsch 2015-12-08 03:07:14 EST
The problem described in comment 12 is fixed.
It seems that hosted-engine is getting deployed:

[root@cshaoh1 log]# grep -rni "failed to depl"
[root@cshaoh1 log]# 


The remaining problem seems to be that ovirt-ha-agent is brought down by some entity. It is started by he-setup, but then goes down a few seconds later without any obvious reason:

Dec 08 03:30:15 cshaoh1.redhat.com systemd[1]: Started oVirt Hosted Engine High Availability Monitoring Agent.
-- Subject: Unit ovirt-ha-agent.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit ovirt-ha-agent.service has finished starting up.
-- 
-- The start-up result is done.
Dec 08 03:30:19 cshaoh1.redhat.com systemd-ovirt-ha-agent[28511]: Stopping ovirt-ha-agent: [  OK  ]


I don't see anything particular in /var/log.

Leaving for Simone for further investigations.
Comment 56 Fabian Deutsch 2015-12-08 03:07:59 EST
For reference: in the end I manually started ovirt-ha-agent to see if it would come up, and whether it would go down again.

It came up, and did not go down within the next seconds.
Comment 57 Fabian Deutsch 2015-12-08 03:38:30 EST
I also see that

2015-12-08 03:30:16 DEBUG otopi.plugins.ovirt_hosted_engine_setup.core.misc misc._persist_files_start:131 Saving persisting file configuration

is the last line in the hosted-engine-setup logfile.

Is this the normal last line, or should there be more logs after this?
Comment 58 Fabian Deutsch 2015-12-08 03:41:11 EST
For reference, the builds for the comments since comment 55:
# rpm -qa | grep ovirt
libgovirt-0.3.3-1.el7.x86_64
ovirt-node-plugin-hosted-engine-0.3.0-4.el7ev.noarch
ovirt-node-plugin-snmp-3.6.0-0.23.20151201git5eed7af.el7ev.noarch
ovirt-setup-lib-1.0.0-1.el7ev.noarch
ovirt-host-deploy-1.4.1-1.el7ev.noarch
ovirt-vmconsole-host-1.0.0-1.el7ev.noarch
ovirt-node-lib-legacy-3.6.0-0.23.20151201git5eed7af.el7ev.noarch
ovirt-node-lib-3.6.0-0.23.20151201git5eed7af.el7ev.noarch
ovirt-node-3.6.0-0.23.20151201git5eed7af.el7ev.noarch
ovirt-node-plugin-snmp-logic-3.6.0-0.23.20151201git5eed7af.el7ev.noarch
ovirt-hosted-engine-setup-1.3.1.1-1.el7ev.noarch
ovirt-node-plugin-vdsm-0.6.1-4.el7ev.noarch
ovirt-node-plugin-cim-3.6.0-0.23.20151201git5eed7af.el7ev.noarch
ovirt-vmconsole-1.0.0-1.el7ev.noarch
ovirt-node-lib-config-3.6.0-0.23.20151201git5eed7af.el7ev.noarch
ovirt-node-selinux-3.6.0-0.23.20151201git5eed7af.el7ev.noarch
ovirt-node-plugin-cim-logic-3.6.0-0.23.20151201git5eed7af.el7ev.noarch
ovirt-hosted-engine-ha-1.3.3.1-1.el7ev.noarch
ovirt-host-deploy-offline-1.4.0-1.el7ev.x86_64
ovirt-node-plugin-rhn-3.6.0-0.23.20151201git5eed7af.el7ev.noarch
ovirt-node-branding-rhev-3.6.0-0.23.20151201git5eed7af.el7ev.noarch
Comment 61 Simone Tiraboschi 2015-12-09 08:11:04 EST
(In reply to Fabian Deutsch from comment #57)
> I also see that
> 
> 2015-12-08 03:30:16 DEBUG otopi.plugins.ovirt_hosted_engine_setup.core.misc
> misc._persist_files_start:131 Saving persisting file configuration
> 
> is the last line in the hosted-engine-setup logfile.
> 
> Is this the normal last line? or should there be more logs after this?

No, it's not.
Something weird happened here.
Comment 62 Simone Tiraboschi 2015-12-09 13:15:28 EST
(In reply to Simone Tiraboschi from comment #61)
> (In reply to Fabian Deutsch from comment #57)
> > I also see that
> > 
> > 2015-12-08 03:30:16 DEBUG otopi.plugins.ovirt_hosted_engine_setup.core.misc
> > misc._persist_files_start:131 Saving persisting file configuration
> > 
> > is the last line in the hosted-engine-setup logfile.
> > 
> > Is this the normal last line? or should there be more logs after this?
> 
> No, it's not.
> Something weird happened here.

I tried to reproduce on Red Hat Enterprise Virtualization Hypervisor (Beta) release 7.2 (20151201.2.el7ev), deploying on NFS, and it worked: the agent stayed up and successfully restarted the engine VM.

The hosted-engine-setup logs did indeed terminate with just 'Saving persisting file configuration'.

After Fabian restarted the HA agent on the reported host, everything was fine there as well.

Huijuan, are you able to provide an additional reproducer?
Comment 64 Huijuan Zhao 2015-12-09 22:11:34 EST
Simone, see comment 42 and comment 52: 10.66.11.227 is still available; it is the environment. Is this environment OK for you to investigate?
I will also reproduce it on an additional machine later today.
Comment 65 Simone Tiraboschi 2015-12-10 03:51:58 EST
(In reply to Huijuan Zhao from comment #64)
> Simone,  see comment 42 and comment 52, the 10.66.11.227 is still available,
> is the env. Is this env ok for you to investigate?
> I will also reproduce it in an additional machine later today.

On 10.66.11.227 the HA agent simply gave up because it was not able to connect to the storage server, and so it entered the failed state:

Nov 27 14:57:43 dhcp-test-11-227.nay.redhat.com ovirt-ha-agent[15820]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'Connection to storage server failed' - trying to restart agent
Nov 27 15:06:17 dhcp-test-11-227.nay.redhat.com ovirt-ha-agent[15820]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'Connection to storage server failed' - trying to restart agent
Nov 27 15:06:22 dhcp-test-11-227.nay.redhat.com ovirt-ha-agent[15820]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Too many errors occurred, giving up. Please review the log and consider filing a bug.
Nov 27 15:06:22 dhcp-test-11-227.nay.redhat.com systemd[1]: ovirt-ha-agent.service: main process exited, code=exited, status=157/n/a

And from vdsm logs:
Thread-345::ERROR::2015-11-27 15:06:17,759::hsm::2465::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2462, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 419, in connect
    return self._mountCon.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 232, in connect
    six.reraise(t, v, tb)
  File "/usr/share/vdsm/storage/storageServer.py", line 224, in connect
    self._mount.mount(self.options, self._vfsType, cgroup=self.CGROUP)
  File "/usr/share/vdsm/storage/mount.py", line 225, in mount
    return self._runcmd(cmd, timeout)
  File "/usr/share/vdsm/storage/mount.py", line 241, in _runcmd
    raise MountError(rc, ";".join((out, err)))
MountError: (32, ';mount.nfs: Connection timed out\n')
Thread-345::DEBUG::2015-11-27 15:06:17,759::hsm::2489::Storage.HSM::(connectStorageServer) knownSDs: {}
Thread-345::INFO::2015-11-27 15:06:17,760::logUtils::51::dispatcher::(wrapper) Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 477, 'id': '69bcc4d0-81e3-4a88-9e19-5c2d7523a7ac'}]}
Thread-345::DEBUG::2015-11-27 15:06:17,760::task::1191::Storage.TaskManager.Task::(prepare) Task=`751ab5b9-a410-4da6-8fb5-db2e01f8540d`::finished: {'statuslist': [{'status': 477, 'id': '69bcc4d0-81e3-4a88-9e19-5c2d7523a7ac'}]}


But this is correct and expected if the storage server is not available.

We need to focus just on this issue: the ha-agent not starting, or failing, after a successful run of hosted-engine-setup.
Comment 66 Fabian Deutsch 2015-12-10 04:18:56 EST
Huijuan, how many times did you reproduce this issue out of how many trials, to get an idea of the likelihood of this problem?
Comment 67 Huijuan Zhao 2015-12-10 05:21:25 EST
Simone, I did not complete the HE setup today; I will update after I reproduce this issue.

Fabian, the QE team tested 5 times and reproduced it 2 times.
Comment 70 Simone Tiraboschi 2015-12-15 05:03:39 EST
The ovirt-ha-agent logs in standard mode seem clean.
We need to reproduce with the agent in debug mode, since there is not enough information in the standard log.

To do that, it is enough to modify:
 /etc/ovirt-hosted-engine-ha/agent-log.conf

changing from
 [logger_root]
 level=INFO
to
 [logger_root]
 level=DEBUG

and restart the agent if it was running.
The catch is that currently restarting the agent makes the issue disappear.

Huijuan, can you please try to reproduce from scratch with the agent in debug mode?
Comment 72 Huijuan Zhao 2015-12-16 22:47:36 EST
Simone, I reproduced this issue with the agent in debug mode (according to comment 70). The environment:
10.66.73.132
admin/redhat

You can access the system to debug now.

This environment will be kept for you for 2 days; is that OK?
Comment 73 Simone Tiraboschi 2015-12-18 11:21:26 EST
ha-agent was still using SysV init scripts, and the ha-agent systemd unit was simply calling the SysV init script.
The SysV init script forked the daemon into the background, and since systemd sees the script return quite quickly, it could kill the service, thinking it had finished before the daemon was fully detached.
Being a race condition, the failure was not systematic.

Properly moving to native systemd units should fix it.
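The race described above is the classic forking-service pitfall. A hedged sketch of the fix direction, in line with gerrit 50623 ("service: Point systemd to the pid files"); the unit contents and pid-file path here are illustrative, not the shipped unit:

```ini
# ovirt-ha-agent.service (illustrative sketch, not the actual unit)
[Unit]
Description=oVirt Hosted Engine High Availability Monitoring Agent

[Service]
# With Type=forking and no PIDFile, systemd has to guess the main PID
# once the wrapper script returns; guessing wrong can make it believe
# the service already exited and tear the daemon down.
Type=forking
# Pointing systemd at the daemon's pid file removes the guesswork:
PIDFile=/run/ovirt-hosted-engine-ha/agent.pid
ExecStart=/usr/lib/systemd/systemd-ovirt-ha-agent start

[Install]
WantedBy=multi-user.target
```

An alternative design is Type=simple with the daemon kept in the foreground, which avoids pid files and the double-fork dance entirely.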
Comment 74 Artyom 2015-12-28 05:12:18 EST
Verified on
Red Hat Enterprise Virtualization Hypervisor (Beta) release 7.2 (20151221.1.el7ev)
# rpm -qa | grep ovirt-hosted
ovirt-hosted-engine-setup-1.3.1.3-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.3.6-1.el7ev.noarch
rhevm-appliance-20151216.0-1

I checked the deployment 3 times; all 3 times the deployment succeeded, all services came up, and the engine VM ran successfully.
Comment 76 errata-xmlrpc 2016-03-09 14:49:42 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0422.html
