Bug 1294353 - Broken HE upgrade flow from 3.5 to 3.6(hosts upgraded from RHEL7.1 to RHEL7.2)
Status: CLOSED CURRENTRELEASE
Product: ovirt-hosted-engine-setup
Classification: oVirt
Component: General
Version: 1.2.6.1
Hardware: x86_64 Linux
Priority: unspecified
Severity: urgent
Target Milestone: ovirt-3.6.2
Target Release: 1.3.2.2
Assigned To: Simone Tiraboschi
QA Contact: Artyom
Keywords: TestOnly, Triaged
Depends On: 1282187 1298461
Blocks:
Reported: 2015-12-27 08:46 EST by Artyom
Modified: 2016-06-01 13:46 EDT (History)

See Also:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
Cause: a fix missing from 3.5.z leaves the hosted engine storage domain under VDSM monitoring when upgrading from 3.5 to 3.6 at the same time as from RHEL 7.1 to 7.2.
Consequence: this bug.
Workaround (if any): after stopping ovirt-ha-agent and before running yum update to update the sanlock rpm, run:
source /etc/ovirt-hosted-engine/hosted-engine.conf
vdsClient -s 0 stopMonitoringDomain {$sdUUID}
Result: the upgrade flow works.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-02-18 06:09:31 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Integration
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
sbonazzo: ovirt‑3.6.z?
mavital: blocker?
rule-engine: planning_ack?
sbonazzo: devel_ack+
mavital: testing_ack+


Attachments
sosreport from host (6.53 MB, application/x-xz)
2015-12-27 08:46 EST, Artyom
logs from HE environment with two hosts (2.53 MB, application/zip)
2016-01-07 06:57 EST, Artyom

Description Artyom 2015-12-27 08:46:07 EST
Created attachment 1109818 [details]
sosreport from host

Description of problem:
When I use the general flow to upgrade a single-host HE deployment from 3.5 to 3.6, the host restarts automatically in the middle of the yum update.

Version-Release number of selected component (if applicable):
3.5
==========================
# uname -r
3.10.0-229.26.1.el7.x86_64
# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.1 (Maipo)
# rpm -qa | grep sanlock
libvirt-lock-sanlock-1.2.8-16.el7_1.5.x86_64
sanlock-python-3.2.2-2.el7.x86_64
sanlock-lib-3.2.2-2.el7.x86_64
sanlock-3.2.2-2.el7.x86_64
# rpm -qa | grep vdsm-4
vdsm-4.16.31-1.el7ev.x86_64
# rpm -qa | grep ovirt-hosted
ovirt-hosted-engine-setup-1.2.6.1-1.el7ev.noarch
ovirt-hosted-engine-ha-1.2.8-1.el7ev.noarch

3.6
==========================
# uname -r
3.10.0-327.4.4.el7.x86_64
# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.2 (Maipo)
# rpm -qa | grep sanlock
sanlock-3.2.4-2.el7_2.x86_64
sanlock-python-3.2.4-2.el7_2.x86_64
libvirt-lock-sanlock-1.2.17-13.el7_2.2.x86_64
sanlock-lib-3.2.4-2.el7_2.x86_64
# rpm -qa | grep vdsm-4
vdsm-4.17.14-0.el7ev.noarch
# rpm -qa | grep ovirt-hosted
ovirt-hosted-engine-setup-1.3.2-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.3.6-1.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy HE with single host on 3.5 environment
2. Enable global maintenance
3. Upgrade engine vm from 3.5 to 3.6
4. Power off engine vm
5. Upgrade host from 3.5 to 3.6(add 3.6 repos and run yum update)

Actual results:
For some reason (I think sanlock restarts it) the host rebooted in the middle of yum update, which can cause many problems (for example, I had a kernel panic on the new kernel).

Expected results:
yum update completes and the upgrade from 3.5 to 3.6 succeeds without any errors.

Additional info:
As I said, I think the sanlock release mechanism reboots the host via the watchdog device, so I tried to work around the problem:
1. Deploy HE with single host on 3.5 environment
2. Enable global maintenance
3. Upgrade engine vm from 3.5 to 3.6
4. Power off engine vm
5. Disable ovirt-ha-agent
6. Reboot the host to guarantee that no sanlock locks remain on it
7. Upgrade host from 3.5 to 3.6 (add 3.6 repos and run yum update)
8. Enable and start ovirt-ha-agent
9. Disable global maintenance

The upgrade succeeded (which is why I set the severity to high rather than urgent).

sanlock status before update
daemon 550a494b-d1af-42ae-92f7-dc80c6f81e29.master-vds
p -1 helper
p -1 listener
p 25485 HostedEngine
p -1 status
s hosted-engine:1:/var/run/vdsm/storage/d611aacd-1193-4f53-9e2f-2d8e2ef461ab/e6ca46c8-2274-402b-b72f-bcee0f0cbf93/b825c73d-d99b-4852-8697-00d630569d32:0
s d611aacd-1193-4f53-9e2f-2d8e2ef461ab:1:/rhev/data-center/mnt/10.35.64.11\:_vol_RHEV_Virt_alukiano__HE__upgrade__1/d611aacd-1193-4f53-9e2f-2d8e2ef461ab/dom_md/ids:0
r d611aacd-1193-4f53-9e2f-2d8e2ef461ab:e9afb90e-fabc-4a7f-ac06-fa0577362b4e:/rhev/data-center/mnt/10.35.64.11\:_vol_RHEV_Virt_alukiano__HE__upgrade__1/d611aacd-1193-4f53-9e2f-2d8e2ef461ab/images/f60a9083-1d15-4825-afe7-92adeae48b28/e9afb90e-fabc-4a7f-ac06-fa0577362b4e.lease:0:2 p 25485
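
The status output above can be read mechanically. A minimal sketch, assuming the usual `sanlock client status` layout in which `p` lines list registered processes, `s` lines list joined lockspaces, and `r` lines list held resource leases followed by `p <pid>` of the owner. The helper name `lease_holders` is hypothetical and not part of sanlock; it reports which process owns each resource lease:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `sanlock client status` text on stdin and print
# "<pid> <process name>" for every process that owns a resource ("r") lease.
lease_holders() {
    awk '
        # "p <pid> <name>" lines: remember the name registered for each pid.
        $1 == "p" { name[$2] = $3 }
        # "r <resource> p <pid>" lines: the last field is the owning pid.
        $1 == "r" && $(NF-1) == "p" { holder[$NF] = 1 }
        END {
            for (pid in holder)
                print pid, (name[pid] != "" ? name[pid] : "unknown")
        }
    '
}
```

Run as `sanlock client status | lease_holders`. Fed the output above, it prints `25485 HostedEngine`, i.e. the engine VM process still owns a lease on the hosted-engine volume.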
Comment 1 Simone Tiraboschi 2015-12-27 10:28:39 EST
Please see this one:
https://bugzilla.redhat.com/show_bug.cgi?id=1282187#c31
The issue is just here: you have to manually stop ovirt-ha-agent otherwise it will keep a lock. Upgrading sanlock while it has an active lock can cause a reboot.

Probably we can just document it really well.
Comment 2 Artyom 2015-12-27 11:13:44 EST
Maybe we can just provide some script, like "prepare single HE host to upgrade", because if user will forget to do some step, reboot in the middle of yum update can corrupt whole system.
Comment 3 Yaniv Kaul 2015-12-31 10:33:49 EST
(In reply to Simone Tiraboschi from comment #1)
> Please see this one:
> https://bugzilla.redhat.com/show_bug.cgi?id=1282187#c31
> The issue is just here: you have to manually stop ovirt-ha-agent otherwise
> it will keep a lock. Upgrading sanlock while it has an active lock can cause
> a reboot.
> 
> Probably we can just document it really well.

Is there a documentation bug?
Comment 4 Doron Fediuck 2016-01-03 04:33:21 EST
(In reply to Artyom from comment #2)
> Maybe we can just provide some script, like "prepare single HE host to
> upgrade", because if user will forget to do some step, reboot in the middle
> of yum update can corrupt whole system.

Such an upgrade should be done in /local/ maintenance as well for exactly this
reason. Global maintenance is designed for HE VM maintenance and you're maintaining the host itself.
Comment 5 Artyom 2016-01-03 04:38:45 EST
We are talking about a single-host upgrade, so I cannot put the host into maintenance from the engine. Putting it into local maintenance via the hosted-engine CLI would also leave the HE VM running on the host (because the agent has no better host to run the VM on).
Comment 6 Artyom 2016-01-07 05:22:14 EST
I also encountered this problem on an HE environment with two hosts (iSCSI). It seems to depend on how many packages need to be upgraded: more packages take more time, so sanlock manages to hit its timeout and reboot the host via the watchdog device.
Comment 7 Artyom 2016-01-07 05:50:52 EST
Raising severity because of comment 6. I believe the patch under https://bugzilla.redhat.com/show_bug.cgi?id=1282187 also needs to be merged into the 3.5 z-stream.
Comment 8 Martin Sivák 2016-01-07 05:51:48 EST
Doron, I believe that won't help (it is needed, but not good enough). Sanlock package needs to be updated and that might cause a machine reboot, because sanlock still has an active resource: the engine VM itself.
Comment 9 Artyom 2016-01-07 06:57 EST
Created attachment 1112425 [details]
logs from HE environment with two hosts

host master-vds10.qa.lab.tlv.redhat.com has sanlock:
[root@master-vds10 ~]# sanlock client status
daemon 3609a107-a6cc-429e-b26b-40e930539348.master-vds
p -1 helper
p -1 listener
p -1 status
p 5080 
s hosted-engine:2:/var/run/vdsm/storage/c8739b1f-432b-4d63-9028-746260ed9834/32dfd735-6f65-4c44-86fb-e38d4809aaba/c6f2604f-701a-4e34-a1fb-cb8a59b23a54:0
s c8739b1f-432b-4d63-9028-746260ed9834:2:/dev/c8739b1f-432b-4d63-9028-746260ed9834/ids:0


hosted-engine CLI:
--== Host 2 status ==--

Status up-to-date                  : False
Hostname                           : master-vds10.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : unknown stale-data
Score                              : 0
Local maintenance                  : True
Host timestamp                     : 66493
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=66493 (Thu Jan  7 13:48:12 2016)
        host-id=2
        score=0
        maintenance=True
        state=LocalMaintenance


ovirt-ha-agent service:
[root@master-vds10 ~]# systemctl status ovirt-ha-agent
ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled)
   Active: inactive (dead) since Thu 2016-01-07 13:48:22 IST; 1min 51s ago
  Process: 23177 ExecStop=/usr/lib/systemd/systemd-ovirt-ha-agent stop (code=exited, status=0/SUCCESS)
 Main PID: 5555 (code=exited, status=0/SUCCESS)



from engine side:
<name>hosted_engine_2</name>
<comment />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/storage" rel="storage" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/nics" rel="nics" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/numanodes" rel="numanodes" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/tags" rel="tags" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/permissions" rel="permissions" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/statistics" rel="statistics" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/hooks" rel="hooks" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/fenceagents" rel="fenceagents" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/katelloerrata" rel="katelloerrata" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/devices" rel="devices" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/networkattachments" rel="networkattachments" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/unmanagednetworks" rel="unmanagednetworks" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/storageconnectionextensions" rel="storageconnectionextensions" />
<address>master-vds10.qa.lab.tlv.redhat.com</address>
 <certificate>
<organization>qa.lab.tlv.redhat.com</organization>
<subject>O=qa.lab.tlv.redhat.com,CN=master-vds10.qa.lab.tlv.redhat.com</subject>
 </certificate>
<status>
<state>maintenance</state>
</status>

versions:
[root@master-vds10 ~]# rpm -qa | grep vdsm
vdsm-yajsonrpc-4.16.30-0.el7.centos.noarch
vdsm-xmlrpc-4.16.30-0.el7.centos.noarch
vdsm-jsonrpc-4.16.30-0.el7.centos.noarch
vdsm-python-4.16.30-0.el7.centos.noarch
vdsm-4.16.30-0.el7.centos.x86_64
vdsm-cli-4.16.30-0.el7.centos.noarch
vdsm-python-zombiereaper-4.16.30-0.el7.centos.noarch
[root@master-vds10 ~]# rpm -qa | grep hosted
ovirt-hosted-engine-ha-1.2.8-1.el7.centos.noarch
ovirt-hosted-engine-setup-1.2.6.1-1.el7.centos.noarch
Comment 10 Simone Tiraboschi 2016-01-07 07:01:10 EST
We have two distinct issues here:

1. hosted-engine with just one host:
the engine VM could not migrate anywhere else so the engine VM should be off as we say in the release notes.

2. https://bugzilla.redhat.com/show_bug.cgi?id=1282187 
we fixed it in 3.6 but not in 3.5, and upstream there will be no further 3.5.z release.
The user has to put the host in maintenance, local maintenance and the cluster in global maintenance. The engine should be running somewhere else.
If a lock is still there, the user has to manually remove it before upgrading rpms if sanlock is in the list.
We also have to document this properly.
Comment 11 Red Hat Bugzilla Rules Engine 2016-01-11 02:52:30 EST
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.
Comment 12 Simone Tiraboschi 2016-01-12 13:08:56 EST
(In reply to Yaniv Kaul from comment #3)
> Is there a documentation bug?

https://bugzilla.redhat.com/show_bug.cgi?id=1293971
Comment 13 Simone Tiraboschi 2016-01-13 09:55:11 EST
This issue happens because https://bugzilla.redhat.com/show_bug.cgi?id=1282187 has not been fixed in 3.5.z.

Workaround: after stopping ovirt-ha-agent and before running yum update to update the sanlock rpm, run:

source /etc/ovirt-hosted-engine/hosted-engine.conf
vdsClient -s 0 stopMonitoringDomain {$sdUUID}
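
The workaround above can be sketched as a small script. This is a sketch under assumptions, not a tested procedure: the names `he_pre_upgrade` and `_he_run` and the `DRY_RUN` switch are hypothetical, and the script assumes hosted-engine.conf contains a plain `sdUUID=<uuid>` line. Note that a shell expands `{$sdUUID}` to the value wrapped in literal braces; the sketch assumes the bare UUID was intended and passes `"$sdUUID"`. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Hypothetical runner: print the command unless DRY_RUN=0 is set.
_he_run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Hypothetical wrapper around the workaround steps from this comment.
# Optional argument: path to hosted-engine.conf.
he_pre_upgrade() {
    local conf="${1:-/etc/ovirt-hosted-engine/hosted-engine.conf}"
    local sdUUID

    # Read sdUUID without executing the whole file (assumes key=value lines).
    sdUUID=$(sed -n 's/^sdUUID=//p' "$conf" | head -n 1)
    if [ -z "$sdUUID" ]; then
        echo "sdUUID not found in $conf" >&2
        return 1
    fi

    # 1. Stop the HA agent so it stops holding the hosted-engine lock.
    _he_run systemctl stop ovirt-ha-agent || return 1
    # 2. Ask VDSM to stop monitoring the hosted-engine storage domain.
    _he_run vdsClient -s 0 stopMonitoringDomain "$sdUUID" || return 1
    # 3. Show what sanlock still holds before any rpm is updated.
    _he_run sanlock client status
}
```

Calling `he_pre_upgrade` prints the three commands for review; `DRY_RUN=0 he_pre_upgrade` executes them. The final status listing is there so the operator can confirm no locks remain before starting yum update.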
Comment 14 Sandro Bonazzola 2016-01-14 03:09:07 EST
Moving to QE to test the workaround. It can't be fixed in oVirt 3.5.z since we stopped supporting it. In RHEV, this is tracked by bug #1298461.
Comment 15 Artyom 2016-01-25 05:16:18 EST
Hi Simone,
I followed all the steps in comment 13, but it does not seem to help: sanlock still holds locks on HE.
[root@rose05 yum.repos.d]# systemctl stop ovirt-ha-agent
[root@rose05 yum.repos.d]# source /etc/ovirt-hosted-engine/hosted-engine.conf
[root@rose05 yum.repos.d]# vdsClient -s 0 stopMonitoringDomain {$sdUUID}
OK
[root@rose05 yum.repos.d]# sanlock client status
daemon d930e53e-3c4a-424f-b642-af0c8ea8493c.rose05.qa.
p -1 helper
p -1 listener
p -1 status
p 8932 
s hosted-engine:1:/var/run/vdsm/storage/78626267-83ac-4f89-a971-84b75d46bee1/0cff6ab6-08fe-4357-b27e-2be1a4718dcd/1335757f-b381-436f-92d4-b1b1f096b4c5:0
s 78626267-83ac-4f89-a971-84b75d46bee1:1:/rhev/data-center/mnt/10.35.64.11\:_vol_RHEV_Virt_alukiano__HE__upgrade/78626267-83ac-4f89-a971-84b75d46bee1/dom_md/ids:0
So Can I move it back to ASSIGNED?
Also like workaround "sanlock client shutdown -f 1" works fine
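
Whether the host is still joined to a sanlock lockspace, as in the output above, can be checked before yum update touches the sanlock rpm. A minimal sketch, assuming `s` lines in `sanlock client status` denote joined lockspaces in the form `<name>:<host_id>:<path>:<offset>`. The function name `sanlock_idle` is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical guard: read `sanlock client status` text on stdin.
# Returns 0 when no lockspace ("s") lines are present; otherwise prints the
# lockspace names to stderr and returns 1.
sanlock_idle() {
    local names
    # Field 2 of an "s" line is "<name>:<host_id>:<path>:<offset>";
    # keep only the part before the first colon.
    names=$(awk '$1 == "s" { split($2, part, ":"); print part[1] }')
    if [ -n "$names" ]; then
        echo "sanlock still holds lockspaces:" >&2
        echo "$names" >&2
        return 1
    fi
    return 0
}
```

Used as `sanlock client status | sanlock_idle && yum update`, the update only starts when no lockspace is listed. With the status shown in this comment the guard fails and names `hosted-engine` and `78626267-83ac-4f89-a971-84b75d46bee1`.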
Comment 16 Simone Tiraboschi 2016-01-25 05:19:55 EST
(In reply to Artyom from comment #15)
> So Can I move it back to ASSIGNED?
> Also like workaround "sanlock client shutdown -f 1" works fine

The issue was on 3.5.z, we fixed it here: https://bugzilla.redhat.com/show_bug.cgi?id=1298461

There is not really that much we can do on 3.6.z since the issue happens during the upgrade and not after that.
Comment 17 Artyom 2016-01-25 06:14:11 EST
OK, so I will wait until bug https://bugzilla.redhat.com/show_bug.cgi?id=1298461 is ON_QA and then verify both bugs.
Comment 18 Artyom 2016-01-28 09:30:11 EST
Verified on ovirt-hosted-engine-ha-1.3.3.7-1.el7ev.noarch
