Bug 1401359 - [Hosted-Engine] 3.5 HE SD upgrade fails if done on initial host
Summary: [Hosted-Engine] 3.5 HE SD upgrade fails if done on initial host
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-hosted-engine-ha
Version: 3.6.9
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.1.1-2
Target Release: ---
Assignee: Simone Tiraboschi
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On:
Blocks: 1400800 1422864 1430513 1444021
Reported: 2016-12-05 01:43 UTC by Germano Veit Michel
Modified: 2020-04-15 14:56 UTC
CC List: 8 users

Fixed In Version: ovirt-hosted-engine-ha-2.1.0.5-2.el7ev
Doc Type: Bug Fix
Doc Text:
Previously, the Red Hat Enterprise Virtualization 3.5 self-hosted engine storage domain upgrade failed on the initial host due to permissions errors. This has been corrected.
Clone Of:
Clones: 1422864
Environment:
Last Closed: 2017-05-03 07:51:23 UTC
oVirt Team: Integration
Target Upstream Version:
Embargoed:


Attachments
sosreport-puma18.scl.lab.tlv.redhat.com-20170412222758.tar.xz (14.45 MB, application/x-xz)
2017-04-12 19:41 UTC, Nikolai Sednev
no flags
sosreport-puma19.scl.lab.tlv.redhat.com-20170412222804.tar.xz (8.74 MB, application/x-xz)
2017-04-12 19:43 UTC, Nikolai Sednev
no flags
engine (7.65 MB, application/x-xz)
2017-04-12 19:44 UTC, Nikolai Sednev
no flags


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Knowledge Base (Solution) | 2792381 | 0 | None | None | None | 2016-12-05 05:07:51 UTC
Red Hat Product Errata | RHEA-2017:1195 | 0 | normal | SHIPPED_LIVE | ovirt-hosted-engine-ha bug fix and enhancement update | 2017-05-03 11:51:13 UTC
oVirt gerrit | 72409 | 0 | None | None | None | 2017-02-16 09:38:14 UTC
oVirt gerrit | 72433 | 0 | None | None | None | 2017-02-16 13:07:42 UTC

Description Germano Veit Michel 2016-12-05 01:43:16 UTC
Description of problem:

In 3.5, /etc/ovirt-hosted-engine/answers.conf permissions are, right after install, as below:

Initial Host:
-rw-rw----. 1 root root 2585 Dec  5 00:08 /etc/ovirt-hosted-engine/answers.conf

Additional Hosts:
-rw-r--r--. 1 root root 2575 Dec  5 00:18 /etc/ovirt-hosted-engine/answers.conf

When upgrading to 3.6, if the host chosen to be upgraded first (to ha 1.3.x) is the initial host used for the HE deployment (-rw-rw----), the HE SD upgrade fails with EACCES when reading answers.conf.
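To see why only the initial host fails, here is a minimal sketch of the POSIX read-permission check applied to the two modes above. It rests on an assumption not stated in this report: that ovirt-ha-agent runs as the non-root user vdsm (uid 36) with group kvm (gid 36). Under that assumption, with -rw-rw---- root:root the agent falls into the "other" class, which has no read bit, so opening the file fails with EACCES (errno 13); with -rw-r--r-- the "other" read bit is set. ACLs and SELinux are ignored here.

```python
import stat

# Assumed IDs for illustration only (not stated in this report).
VDSM_UID = 36
KVM_GID = 36
ROOT_UID = 0
ROOT_GID = 0


def can_read(mode, file_uid, file_gid, proc_uid, proc_gids):
    """Simplified POSIX read check: owner bits, else group bits, else other bits.

    Root is treated as always allowed to read; ACLs and SELinux are ignored.
    """
    if proc_uid == 0:
        return True
    if proc_uid == file_uid:
        return bool(mode & stat.S_IRUSR)
    if file_gid in proc_gids:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)


initial_host = stat.S_IFREG | 0o660      # -rw-rw----. root root
additional_host = stat.S_IFREG | 0o644   # -rw-r--r--. root root

for label, mode in [("initial", initial_host), ("additional", additional_host)]:
    ok = can_read(mode, ROOT_UID, ROOT_GID, VDSM_UID, {KVM_GID})
    print(label, stat.filemode(mode), "readable by agent:", ok)
```

Under the same assumed IDs, the `-rw-rw----. 1 root kvm` state seen after the 4.1 upgrade in comment 19 goes through the group branch instead, so `can_read(initial_host, ROOT_UID, KVM_GID, VDSM_UID, {KVM_GID})` returns True.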

Version-Release number of selected component (if applicable):
Red Hat Enterprise Virtualization Hypervisor release 7.2 (20160920.1.el7ev)
ovirt-hosted-engine-ha-1.3.5.8-1.el7ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy a fresh Hosted Engine on 3.5 using 20160219.0.el7ev
2. Upgrade the initial HE host to 20160920.1.el7ev
3. Trigger the HE SD upgrade (put the host in maintenance, restart ha-agent)

Actual results:
If the initial HE host is chosen for the upgrade, the HE SD is not upgraded and ha-agent keeps restarting. See:
MainThread::INFO::2016-12-05 01:24:38,917::upgrade::1010::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Upgrading to current version
MainThread::INFO::2016-12-05 01:24:39,004::upgrade::736::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_stopMonitoringDomain) Stop monitoring domain
MainThread::INFO::2016-12-05 01:24:39,059::upgrade::151::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Looking for conf volume
MainThread::ERROR::2016-12-05 01:24:39,112::upgrade::207::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Unable to find HE conf volume
MainThread::INFO::2016-12-05 01:24:39,112::upgrade::953::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_move_to_shared_conf) _move_to_shared_conf
MainThread::INFO::2016-12-05 01:24:39,112::upgrade::375::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_get_conffile_content) Reading conf file: fhanswers.conf
MainThread::ERROR::2016-12-05 01:24:39,112::upgrade::399::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_get_conffile_content) Failed to read configuration file '/etc/ovirt-hosted-engine/answers.conf': [Errno 13] Permission denied: '/etc/ovirt-hosted-engine/answers.conf'
MainThread::ERROR::2016-12-05 01:24:39,113::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to read configuration file '/etc/ovirt-hosted-engine/answers.conf': [Errno 13] Permission denied: '/etc/ovirt-hosted-engine/answers.conf'' - trying to restart agent
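The agent log lines above share a `::`-separated layout: thread, level, timestamp, module, source line, logger, then `(function) message`. A minimal sketch for pulling this permission failure out of an agent log, assuming only the layout visible in these excerpts; the log file location is not given in this report, so the functions take lines as input.

```python
import re
from typing import NamedTuple, Optional


class AgentLogRecord(NamedTuple):
    thread: str
    level: str
    timestamp: str
    module: str
    lineno: int
    logger: str
    function: str
    message: str


# "(function) message" tail of each record, as seen in the excerpts above.
_TAIL = re.compile(r"\((?P<func>[^)]*)\) ?(?P<msg>.*)")


def parse_line(line: str) -> Optional[AgentLogRecord]:
    """Split one agent log line into fields; return None if it does not match."""
    parts = line.rstrip("\n").split("::", 6)
    if len(parts) != 7:
        return None
    thread, level, timestamp, module, lineno, logger, tail = parts
    m = _TAIL.match(tail)
    if m is None or not lineno.isdigit():
        return None
    return AgentLogRecord(thread, level, timestamp, module, int(lineno),
                          logger, m.group("func"), m.group("msg"))


def permission_errors(lines):
    """Yield ERROR records whose message reports errno 13 (EACCES)."""
    for line in lines:
        rec = parse_line(line)
        if rec and rec.level == "ERROR" and "[Errno 13]" in rec.message:
            yield rec
```

For example, `for rec in permission_errors(open(path)): print(rec.timestamp, rec.function)`, where `path` is whichever agent log you are inspecting. Applied to the excerpt above it would yield the `_get_conffile_content` and `_run_agent` records.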

Expected results:
HE SD upgraded. 

Additional info:
The permission issue seems to have been present for some time. Earlier hosted-engine-ha builds in 3.6 apparently upgraded the HE SD even when hitting this "Permission denied" error. See: https://bugzilla.redhat.com/show_bug.cgi?id=1292652#c1

Apparently we missed that error in that BZ. The behavior is now slightly different: the agent restarts and the HE SD is NOT upgraded at the step where it should be. This causes trouble for 3.5 to 3.6 upgrades.

Comment 5 Nikolai Sednev 2017-04-12 19:27:52 UTC
1) Deployed a clean environment on hosts puma18 and puma19 over an NFS storage domain, and added two NFS data storage domains.
Components on hosts:
ovirt-hosted-engine-setup-1.2.6.1-1.el7ev.noarch
sanlock-3.2.4-3.el7_2.x86_64
rhevm-sdk-python-3.5.6.0-1.el7ev.noarch
mom-0.4.1-4.el7ev.noarch
vdsm-4.16.38-1.el7ev.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.25.x86_64
ovirt-hosted-engine-ha-1.2.10-1.el7ev.noarch
libvirt-client-1.2.17-13.el7_2.6.x86_64
ovirt-host-deploy-1.3.2-1.el7ev.noarch
Linux version 3.10.0-327.53.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Tue Mar 14 10:49:09 EDT 2017
Linux 3.10.0-327.53.1.el7.x86_64 #1 SMP Tue Mar 14 10:49:09 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

On engine:
rhevm-lib-3.5.8-0.1.el6ev.noarch
rhevm-setup-plugin-ovirt-engine-common-3.5.8-0.1.el6ev.noarch
rhevm-dwh-3.5.5-1.el6ev.noarch
rhevm-setup-plugins-3.5.4-1.el6ev.noarch
rhevm-iso-uploader-3.5.1-1.el6ev.noarch
rhevm-extensions-api-impl-3.5.8-0.1.el6ev.noarch
rhevm-spice-client-x64-msi-3.5-3.el6.noarch
rhevm-backend-3.5.8-0.1.el6ev.noarch
rhevm-sdk-python-3.5.6.0-1.el6ev.noarch
ovirt-host-deploy-1.3.2-1.el6ev.noarch
rhevm-spice-client-x64-cab-3.5-3.el6.noarch
rhevm-webadmin-portal-3.5.8-0.1.el6ev.noarch
rhevm-setup-base-3.5.8-0.1.el6ev.noarch
rhevm-reports-3.5.8-1.el6ev.noarch
rhevm-image-uploader-3.5.0-4.el6ev.noarch
rhevm-setup-plugin-websocket-proxy-3.5.8-0.1.el6ev.noarch
ovirt-host-deploy-java-1.3.2-1.el6ev.noarch
rhevm-spice-client-x86-cab-3.5-3.el6.noarch
rhevm-setup-3.5.8-0.1.el6ev.noarch
rhevm-tools-3.5.8-0.1.el6ev.noarch
rhevm-3.5.8-0.1.el6ev.noarch
rhevm-log-collector-3.5.4-2.el6ev.noarch
rhev-guest-tools-iso-3.5-15.el6ev.noarch
rhevm-doc-3.5.3-1.el6eng.noarch
rhevm-spice-client-x86-msi-3.5-3.el6.noarch
rhevm-guest-agent-common-1.0.10-2.el6ev.noarch
rhevm-dwh-setup-3.5.5-1.el6ev.noarch
rhevm-branding-rhev-3.5.0-4.el6ev.noarch
rhevm-cli-3.5.0.6-1.el6ev.noarch
rhevm-dbscripts-3.5.8-0.1.el6ev.noarch
rhevm-setup-plugin-ovirt-engine-3.5.8-0.1.el6ev.noarch
rhevm-reports-setup-3.5.8-1.el6ev.noarch
rhevm-dependencies-3.5.1-1.el6ev.noarch
rhevm-websocket-proxy-3.5.8-0.1.el6ev.noarch
rhevm-userportal-3.5.8-0.1.el6ev.noarch
rhevm-restapi-3.5.8-0.1.el6ev.noarch
Linux version 2.6.32-573.41.1.el6.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC) ) #1 SMP Thu Mar 2 11:08:17 EST 2017
Linux 2.6.32-573.41.1.el6.x86_64 #1 SMP Thu Mar 2 11:08:17 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 6.7 (Santiago)

puma18 ~]# hosted-engine --vm-status


--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : puma18
Host ID                            : 1
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 21579
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=21579 (Wed Apr 12 19:11:01 2017)
        host-id=1
        score=2400
        maintenance=False
        state=EngineUp


--== Host 2 status ==--

Status up-to-date                  : True
Hostname                           : puma19
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 21438
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=21438 (Wed Apr 12 19:11:02 2017)
        host-id=2
        score=2400
        maintenance=False
        state=EngineDown


2) Permissions right after deployment on both hosts are as shown below:
puma18 ~]# ll -lsha /etc/ovirt-hosted-engine/answers.conf
4.0K -rw-rw----. 1 root root 2.6K Apr 12 17:52 /etc/ovirt-hosted-engine/answers.conf
puma19 ~]# ll -lsha /etc/ovirt-hosted-engine/answers.conf
4.0K -rw-r--r--. 1 root root 2.6K Apr 12 18:01 /etc/ovirt-hosted-engine/answers.conf
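The same mode/owner/group information can be captured on each host with a small `os.stat`-based helper before deciding which host to upgrade first. This is a sketch only; the `others_cannot_read` criterion is an illustrative check that singles out the -rw-rw---- case above, not something the oVirt tools are known to perform.

```python
import grp
import os
import pwd
import stat


def describe(path):
    """Return an `ls -l`-style summary: mode string, owner name, group name."""
    st = os.stat(path)
    try:
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:  # uid with no passwd entry
        owner = str(st.st_uid)
    try:
        group = grp.getgrgid(st.st_gid).gr_name
    except KeyError:  # gid with no group entry
        group = str(st.st_gid)
    return f"{stat.filemode(st.st_mode)} {owner} {group}"


def others_cannot_read(path):
    """True when the 'other' class lacks the read bit (the -rw-rw---- case)."""
    return not (os.stat(path).st_mode & stat.S_IROTH)
```

On a host, `print(describe("/etc/ovirt-hosted-engine/answers.conf"))` would report `-rw-rw---- root root` for the puma18 file listed above.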

3) Created one guest VM running RHEL 7.3.
4) Upgraded HE from 3.5 to 3.6 following https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Virtualization/3.6/html/Self-Hosted_Engine_Guide/Upgrading_the_Self-Hosted_Engine.html
5) I was able to reproduce this issue in my environment:
MainThread::INFO::2017-04-12 22:24:36,343::upgrade::1012::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Upgrading to current version
MainThread::INFO::2017-04-12 22:24:36,368::upgrade::738::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_stopMonitoringDomain) Stop monitoring domain
MainThread::INFO::2017-04-12 22:24:36,383::upgrade::153::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Looking for conf volume
MainThread::ERROR::2017-04-12 22:24:36,399::upgrade::209::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Unable to find HE conf volume
MainThread::INFO::2017-04-12 22:24:36,399::upgrade::955::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_move_to_shared_conf) _move_to_shared_conf
MainThread::INFO::2017-04-12 22:24:36,400::upgrade::377::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_get_conffile_content) Reading conf file: fhanswers.conf
MainThread::ERROR::2017-04-12 22:24:36,400::upgrade::401::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_get_conffile_content) Failed to read configuration file '/etc/ovirt-hosted-engine/answers.conf': [Errno 13] Permission denied: '/etc/ovirt-hosted-engine/answers.conf'
MainThread::ERROR::2017-04-12 22:24:36,400::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to read configuration file '/etc/ovirt-hosted-engine/answers.conf': [Errno 13] Permission denied: '/etc/ovirt-hosted-engine/answers.conf'' - trying to restart agent


Moving back to assigned.

I was not able to move the second host to maintenance because the HE VM was running on it and failed to migrate to puma18 (3.6 on RHEL 7.3).

Comment 6 Nikolai Sednev 2017-04-12 19:41:57 UTC
Created attachment 1271242
sosreport-puma18.scl.lab.tlv.redhat.com-20170412222758.tar.xz

Comment 7 Nikolai Sednev 2017-04-12 19:43:04 UTC
Created attachment 1271243
sosreport-puma19.scl.lab.tlv.redhat.com-20170412222804.tar.xz

Comment 8 Nikolai Sednev 2017-04-12 19:44:01 UTC
Created attachment 1271244
engine

Comment 9 Nikolai Sednev 2017-04-12 19:47:34 UTC
puma18 ~]# ll -lsha /etc/ovirt-hosted-engine/answers.conf
4.0K -rw-rw----. 1 root root 2.6K Apr 12 17:52 /etc/ovirt-hosted-engine/answers.conf
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.5.10-1.el7ev.noarch
ovirt-setup-lib-1.0.1-1.el7ev.noarch
sanlock-3.4.0-1.el7.x86_64
mom-0.5.6-1.el7ev.noarch
ovirt-hosted-engine-setup-1.3.7.4-1.el7ev.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
vdsm-4.17.39-1.el7ev.noarch
ovirt-host-deploy-1.4.1-1.el7ev.noarch
libvirt-client-2.0.0-10.el7_3.5.x86_64
qemu-kvm-rhev-2.6.0-28.el7_3.9.x86_64
rhevm-sdk-python-3.6.9.1-1.el7ev.noarch
Linux version 3.10.0-514.16.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Fri Mar 10 13:12:32 EST 2017
Linux puma18 3.10.0-514.16.1.el7.x86_64 #1 SMP Fri Mar 10 13:12:32 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

Comment 11 Simone Tiraboschi 2017-04-13 11:08:34 UTC
ovirt-hosted-engine-ha-2.1.0.5-1.el7ev didn't include the fix; I built ovirt-hosted-engine-ha-2.1.0.5-2.el7ev with it.

Comment 12 Yaniv Kaul 2017-04-18 07:53:56 UTC
(In reply to Simone Tiraboschi from comment #11)
> ovirt-hosted-engine-ha-2.1.0.5-1.el7ev didn't include the fix; I built
> ovirt-hosted-engine-ha-2.1.0.5-2.el7ev with it.

Was it consumed by QE? Is it in any errata?
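Since 2.1.0.5-1 and 2.1.0.5-2 differ only in release, checking whether a host actually has the rebuilt package means comparing the release too. A simplified sketch that checks an NVR as printed by `rpm -qa` against the Fixed In Version of this bug. It is not rpm's real comparison algorithm (rpmvercmp also handles letters, tildes and epochs); it only compares dot-separated numeric version fields and the leading number of the release, which is enough for the NVRs in this report.

```python
import re


def split_nvr(nvr):
    """Split 'name-version-release[.arch]' on the last two hyphens."""
    name, version, release = nvr.rsplit("-", 2)
    return name, version, release


def _key(version, release):
    # Numeric version fields, then the leading integer of the release
    # ("2.el7ev" or "2.el7ev.noarch" -> 2). Simplified, not rpmvercmp.
    ver = tuple(int(p) for p in version.split("."))
    m = re.match(r"\d+", release)
    rel = int(m.group()) if m else 0
    return ver, rel


def has_fix(installed, fixed="ovirt-hosted-engine-ha-2.1.0.5-2.el7ev"):
    """True if `installed` is the same package at or above the fixed build."""
    n1, v1, r1 = split_nvr(installed)
    n2, v2, r2 = split_nvr(fixed)
    if n1 != n2:
        raise ValueError(f"different packages: {n1} vs {n2}")
    return _key(v1, r1) >= _key(v2, r2)
```

With this sketch, `has_fix("ovirt-hosted-engine-ha-2.1.0.5-2.el7ev.noarch")` is True, while the 2.1.0.5-1 build and the 1.3.5.10 build from comment 9 both return False.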

Comment 19 Nikolai Sednev 2017-04-24 12:14:55 UTC
1) Initial host running 3.5 components with ovirt-hosted-engine-ha-1.2.10-1.el7ev.noarch:
puma18 ~]# ll -lsha /etc/ovirt-hosted-engine/answers.conf
4.0K -rw-rw----. 1 root root 2.6K Apr 18 18:44 /etc/ovirt-hosted-engine/answers.conf

2) After upgrading the initial host to the latest 4.1:
puma18 ~]#  ll -lsha /etc/ovirt-hosted-engine/answers.conf
4.0K -rw-rw----. 1 root kvm 2.6K Apr 18 18:44 /etc/ovirt-hosted-engine/answers.conf
puma18 ~]# rpm -qa | grep ovirt-hosted-engine-ha
ovirt-hosted-engine-ha-2.1.0.5-2.el7ev.noarch

3) The vdsm service was not running due to a known bug opened by Simone:
puma18 ~]# systemctl status vdsmd -l
● vdsmd.service - Virtual Desktop Server Manager
   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled; vendor preset: enabled)
   Active: failed (Result: start-limit) since Mon 2017-04-24 15:05:06 IDT; 25s ago
  Process: 15302 ExecStartPre=/usr/libexec/vdsm/vdsmd_init_common.sh --pre-start (code=exited, status=1/FAILURE)
 Main PID: 3042 (code=exited, status=0/SUCCESS)

Apr 24 15:05:06 puma18.scl.lab.tlv.redhat.com systemd[1]: Failed to start Virtual Desktop Server Manager.
Apr 24 15:05:06 puma18.scl.lab.tlv.redhat.com systemd[1]: Unit vdsmd.service entered failed state.
Apr 24 15:05:06 puma18.scl.lab.tlv.redhat.com systemd[1]: vdsmd.service failed.
Apr 24 15:05:06 puma18.scl.lab.tlv.redhat.com systemd[1]: vdsmd.service holdoff time over, scheduling restart.
Apr 24 15:05:06 puma18.scl.lab.tlv.redhat.com systemd[1]: start request repeated too quickly for vdsmd.service
Apr 24 15:05:06 puma18.scl.lab.tlv.redhat.com systemd[1]: Failed to start Virtual Desktop Server Manager.
Apr 24 15:05:06 puma18.scl.lab.tlv.redhat.com systemd[1]: Unit vdsmd.service entered failed state.
Apr 24 15:05:06 puma18.scl.lab.tlv.redhat.com systemd[1]: vdsmd.service failed.

I fixed it by running "vdsm-tool configure --force" and then "systemctl restart vdsmd && systemctl restart ovirt-ha-broker && systemctl restart ovirt-ha-agent".

puma18 ~]# systemctl status vdsmd -l
● vdsmd.service - Virtual Desktop Server Manager
   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2017-04-24 15:07:44 IDT; 25s ago
  Process: 16228 ExecStartPre=/usr/libexec/vdsm/vdsmd_init_common.sh --pre-start (code=exited, status=0/SUCCESS)
 Main PID: 16309 (vdsm)
   CGroup: /system.slice/vdsmd.service
           ├─16309 /usr/bin/python2 /usr/share/vdsm/vdsm
           ├─16438 /usr/libexec/ioprocess --read-pipe-fd 40 --write-pipe-fd 39 --max-threads 10 --max-queued-requests 10
           ├─16445 /usr/libexec/ioprocess --read-pipe-fd 48 --write-pipe-fd 47 --max-threads 10 --max-queued-requests 10
           └─16457 /usr/libexec/ioprocess --read-pipe-fd 57 --write-pipe-fd 56 --max-threads 10 --max-queued-requests 10

Apr 24 15:07:45 puma18.scl.lab.tlv.redhat.com python2[16309]: DIGEST-MD5 client step 1
Apr 24 15:07:45 puma18.scl.lab.tlv.redhat.com python2[16309]: DIGEST-MD5 ask_user_info()
Apr 24 15:07:45 puma18.scl.lab.tlv.redhat.com python2[16309]: DIGEST-MD5 make_client_response()
Apr 24 15:07:45 puma18.scl.lab.tlv.redhat.com python2[16309]: DIGEST-MD5 client step 2
Apr 24 15:07:45 puma18.scl.lab.tlv.redhat.com python2[16309]: DIGEST-MD5 parse_server_challenge()
Apr 24 15:07:45 puma18.scl.lab.tlv.redhat.com python2[16309]: DIGEST-MD5 ask_user_info()
Apr 24 15:07:45 puma18.scl.lab.tlv.redhat.com python2[16309]: DIGEST-MD5 make_client_response()
Apr 24 15:07:45 puma18.scl.lab.tlv.redhat.com python2[16309]: DIGEST-MD5 client step 3
Apr 24 15:07:45 puma18.scl.lab.tlv.redhat.com vdsm[16309]: vdsm MOM WARN MOM not available.
Apr 24 15:07:45 puma18.scl.lab.tlv.redhat.com vdsm[16309]: vdsm MOM WARN MOM not available, KSM stats will be missing.
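The workaround in step 3 is order-sensitive: vdsm is reconfigured and restarted first, then ovirt-ha-broker, then ovirt-ha-agent, and the `&&` chain stops at the first failing command. A minimal sketch of that sequence using only the commands quoted in step 3; treating a failure of `vdsm-tool configure --force` as a stop condition is an assumption. The `runner` parameter is an illustrative hook (defaulting to `subprocess.run`) so the sequence can be dry-run without touching a host; running it for real requires root on the hypervisor.

```python
import subprocess

# Commands taken from the workaround in step 3, in order.
WORKAROUND = [
    ["vdsm-tool", "configure", "--force"],
    ["systemctl", "restart", "vdsmd"],
    ["systemctl", "restart", "ovirt-ha-broker"],
    ["systemctl", "restart", "ovirt-ha-agent"],
]


def run_workaround(runner=subprocess.run, steps=WORKAROUND):
    """Run each step in order and stop at the first non-zero exit, like `&&`.

    Returns the commands that were executed; if the sequence stopped early,
    the last entry is the command that failed.
    """
    executed = []
    for cmd in steps:
        executed.append(cmd)
        result = runner(cmd)
        if result.returncode != 0:
            break
    return executed
```

Restarting the broker before the agent mirrors the order used above; any runner returning an object with a `returncode` attribute works, which is what `subprocess.run` returns.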

4) After step 3 I checked again:
puma18 ~]# ll -lsha /etc/ovirt-hosted-engine/answers.conf
4.0K -rw-rw----. 1 root kvm 2.6K Apr 18 18:44 /etc/ovirt-hosted-engine/answers.conf
The permissions on the host were not changed, although hosted-storage was successfully upgraded on it:
MainThread::INFO::2017-04-24 15:10:16,090::upgrade::1035::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Successfully upgraded

5) Moving to VERIFIED: apart from the known issues and their workarounds, the hosted-storage upgrade succeeded on the initial host.

Comment 21 errata-xmlrpc 2017-05-03 07:51:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1195

