1294353 – Broken HE upgrade flow from 3.5 to 3.6(hosts upgraded from RHEL7.1 to RHEL7.2)

Summary: Broken HE upgrade flow from 3.5 to 3.6(hosts upgraded from RHEL7.1 to RHEL7.2)

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	ovirt-hosted-engine-setup
Classification:	oVirt
Component:	General
Sub Component:
Version:	1.2.6.1
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	urgent
Target Milestone:	ovirt-3.6.2
Target Release:	1.3.2.2
Assignee:	Simone Tiraboschi
QA Contact:	Artyom
Docs Contact:
URL:
Whiteboard:
Depends On:	1282187 1298461
Blocks:

Reported:	2015-12-27 13:46 UTC by Artyom
Modified:	2016-06-01 17:46 UTC
CC List:	10 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2016-02-18 11:09:31 UTC
oVirt Team:	Integration
Embargoed:
Dependent Products:
Flags:	sbonazzo: ovirt-3.6.z? mavital: blocker? rule-engine: planning_ack? sbonazzo: devel_ack+ mavital: testing_ack+

Attachments
sosreport from host (6.53 MB, application/x-xz) 2015-12-27 13:46 UTC, Artyom	no flags
logs from HE environment with two hosts (2.53 MB, application/zip) 2016-01-07 11:57 UTC, Artyom	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1282187	0	urgent	CLOSED	Host under maintenance still have sanlock lockspaces which prevents the upgrade of the sanlock package	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	1293971	0	high	CLOSED	[Docs][SHE] Document the hosted-engine command and common administrator tasks	2021-02-22 00:41:40 UTC

Internal Links: 1282187 1293971

Description Artyom 2015-12-27 13:46:07 UTC

Created attachment 1109818
sosreport from host

Description of problem:
When I use the general flow to upgrade a single-host HE deployment from 3.5 to 3.6, the host is automatically restarted in the middle of the yum update.

Version-Release number of selected component (if applicable):
3.5
==========================
# uname -r
3.10.0-229.26.1.el7.x86_64
# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.1 (Maipo)
# rpm -qa | grep sanlock
libvirt-lock-sanlock-1.2.8-16.el7_1.5.x86_64
sanlock-python-3.2.2-2.el7.x86_64
sanlock-lib-3.2.2-2.el7.x86_64
sanlock-3.2.2-2.el7.x86_64
# rpm -qa | grep vdsm-4
vdsm-4.16.31-1.el7ev.x86_64
# rpm -qa | grep ovirt-hosted
ovirt-hosted-engine-setup-1.2.6.1-1.el7ev.noarch
ovirt-hosted-engine-ha-1.2.8-1.el7ev.noarch

3.6
==========================
# uname -r
3.10.0-327.4.4.el7.x86_64
# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.2 (Maipo)
# rpm -qa | grep sanlock
sanlock-3.2.4-2.el7_2.x86_64
sanlock-python-3.2.4-2.el7_2.x86_64
libvirt-lock-sanlock-1.2.17-13.el7_2.2.x86_64
sanlock-lib-3.2.4-2.el7_2.x86_64
# rpm -qa | grep vdsm-4
vdsm-4.17.14-0.el7ev.noarch
# rpm -qa | grep ovirt-hosted
ovirt-hosted-engine-setup-1.3.2-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.3.6-1.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy HE with single host on 3.5 environment
2. Enable global maintenance
3. Upgrade engine vm from 3.5 to 3.6
4. Power off engine vm
5. Upgrade host from 3.5 to 3.6 (add 3.6 repos and run yum update)

Actual results:
For some reason (I think sanlock restarts it) the host rebooted in the middle of yum update, which can cause many problems (for example, I got a kernel panic on the new kernel).

Expected results:
yum update finishes and the upgrade from 3.5 to 3.6 succeeds without any errors.

Additional info:
As I said, I think the sanlock release mechanism reboots the host via the watchdog device, so I tried to work around the problem:
1. Deploy HE with single host on 3.5 environment
2. Enable global maintenance
3. Upgrade engine vm from 3.5 to 3.6
4. Power off engine vm
5. Disable ovirt-ha-agent
6. Reboot the host to guarantee that no sanlock locks remain on it
7. Upgrade host from 3.5 to 3.6 (add 3.6 repos and run yum update)
8. Enable and start ovirt-ha-agent
9. Disable global maintenance

The upgrade succeeded (which is why I filed the bug with high severity rather than urgent).
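
The workaround steps above can be sketched as a dry-run plan. This is a minimal Python sketch that only builds and prints the host-side commands in order and executes nothing. The report does not give exact commands, so the hosted-engine options and systemctl invocations below are assumptions about how each step would be performed; check them against your installed version before running anything.

```python
# Dry-run sketch of workaround steps 2-9: builds printable lines, runs nothing.
# The command strings are assumptions, not taken from the report.
from typing import List, Optional, Tuple

Step = Tuple[str, Optional[str]]  # (description, host-side command or None)

WORKAROUND_STEPS: List[Step] = [
    ("Enable global maintenance", "hosted-engine --set-maintenance --mode=global"),
    ("Upgrade engine VM from 3.5 to 3.6", None),  # performed inside the engine VM
    ("Power off engine VM", "hosted-engine --vm-poweroff"),
    ("Disable ovirt-ha-agent", "systemctl disable ovirt-ha-agent"),
    ("Reboot host so no sanlock locks remain", "reboot"),
    ("Upgrade host (3.6 repos already added)", "yum update"),
    ("Enable and start ovirt-ha-agent",
     "systemctl enable ovirt-ha-agent && systemctl start ovirt-ha-agent"),
    ("Disable global maintenance", "hosted-engine --set-maintenance --mode=none"),
]


def render_plan(steps: List[Step]) -> List[str]:
    """Return one printable line per step, numbered from 2 as in the report."""
    lines = []
    for number, (description, command) in enumerate(steps, start=2):
        action = command if command is not None else "(manual step)"
        lines.append(f"{number}. {description}: {action}")
    return lines


if __name__ == "__main__":
    for line in render_plan(WORKAROUND_STEPS):
        print(line)
```

Keeping the plan as data makes the ordering explicit: the agent is disabled and the host rebooted before yum update ever touches the sanlock package.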

sanlock status before update
daemon 550a494b-d1af-42ae-92f7-dc80c6f81e29.master-vds
p -1 helper
p -1 listener
p 25485 HostedEngine
p -1 status
s hosted-engine:1:/var/run/vdsm/storage/d611aacd-1193-4f53-9e2f-2d8e2ef461ab/e6ca46c8-2274-402b-b72f-bcee0f0cbf93/b825c73d-d99b-4852-8697-00d630569d32:0
s d611aacd-1193-4f53-9e2f-2d8e2ef461ab:1:/rhev/data-center/mnt/10.35.64.11\:_vol_RHEV_Virt_alukiano__HE__upgrade__1/d611aacd-1193-4f53-9e2f-2d8e2ef461ab/dom_md/ids:0
r d611aacd-1193-4f53-9e2f-2d8e2ef461ab:e9afb90e-fabc-4a7f-ac06-fa0577362b4e:/rhev/data-center/mnt/10.35.64.11\:_vol_RHEV_Virt_alukiano__HE__upgrade__1/d611aacd-1193-4f53-9e2f-2d8e2ef461ab/images/f60a9083-1d15-4825-afe7-92adeae48b28/e9afb90e-fabc-4a7f-ac06-fa0577362b4e.lease:0:2 p 25485
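
To make the output above easier to read: lines starting with "s" are lockspaces the host has joined and lines starting with "r" are resource leases held by a process. A minimal Python sketch, assuming the colon-separated layout seen in this report (name, then host id or resource name, then path), pulls them out of the text. The function name and the returned structure are hypothetical and only for illustration.

```python
# Sketch: extract lockspaces ("s" lines) and resources ("r" lines) from the
# text printed by `sanlock client status`. The field layout is assumed from
# the output in this report, not from a formal specification.
from typing import Dict, List

SAMPLE = r"""daemon 550a494b-d1af-42ae-92f7-dc80c6f81e29.master-vds
p -1 helper
p -1 listener
p 25485 HostedEngine
p -1 status
s hosted-engine:1:/var/run/vdsm/storage/d611aacd-1193-4f53-9e2f-2d8e2ef461ab/e6ca46c8-2274-402b-b72f-bcee0f0cbf93/b825c73d-d99b-4852-8697-00d630569d32:0
s d611aacd-1193-4f53-9e2f-2d8e2ef461ab:1:/rhev/data-center/mnt/10.35.64.11\:_vol_RHEV_Virt_alukiano__HE__upgrade__1/d611aacd-1193-4f53-9e2f-2d8e2ef461ab/dom_md/ids:0
r d611aacd-1193-4f53-9e2f-2d8e2ef461ab:e9afb90e-fabc-4a7f-ac06-fa0577362b4e:/rhev/data-center/mnt/10.35.64.11\:_vol_RHEV_Virt_alukiano__HE__upgrade__1/d611aacd-1193-4f53-9e2f-2d8e2ef461ab/images/f60a9083-1d15-4825-afe7-92adeae48b28/e9afb90e-fabc-4a7f-ac06-fa0577362b4e.lease:0:2 p 25485
"""


def parse_sanlock_status(text: str) -> Dict[str, List[str]]:
    """Collect lockspace and resource names; other line types are ignored."""
    found: Dict[str, List[str]] = {"lockspaces": [], "resources": []}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("s "):
            # s <lockspace name>:<host id>:<path>:<offset>
            name, host_id, _path = line[2:].split(":", 2)
            found["lockspaces"].append(f"{name} (host id {host_id})")
        elif line.startswith("r "):
            # r <lockspace name>:<resource name>:<path>:... p <pid>
            lockspace, resource, _rest = line[2:].split(":", 2)
            found["resources"].append(f"{lockspace}:{resource}")
    return found


status = parse_sanlock_status(SAMPLE)
for kind, names in status.items():
    print(kind, names)
```

Only the first two colon-separated fields are read, because the paths themselves contain escaped colons.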

Comment 1 Simone Tiraboschi 2015-12-27 15:28:39 UTC

Please see this one:
https://bugzilla.redhat.com/show_bug.cgi?id=1282187#c31
The issue is just here: you have to manually stop ovirt-ha-agent otherwise it will keep a lock. Upgrading sanlock while it has an active lock can cause a reboot.

Probably we can just document it really well.

Comment 2 Artyom 2015-12-27 16:13:44 UTC

Maybe we can just provide a script, like "prepare single HE host for upgrade", because if the user forgets a step, a reboot in the middle of yum update can corrupt the whole system.

Comment 3 Yaniv Kaul 2015-12-31 15:33:49 UTC

(In reply to Simone Tiraboschi from comment #1)
> Please see this one:
> https://bugzilla.redhat.com/show_bug.cgi?id=1282187#c31
> The issue is just here: you have to manually stop ovirt-ha-agent otherwise
> it will keep a lock. Upgrading sanlock while it has an active lock can cause
> a reboot.
> 
> Probably we can just document it really well.

Is there a documentation bug?

Comment 4 Doron Fediuck 2016-01-03 09:33:21 UTC

(In reply to Artyom from comment #2)
> Maybe we can just provide a script, like "prepare single HE host for
> upgrade", because if the user forgets a step, a reboot in the middle of
> yum update can corrupt the whole system.

Such an upgrade should be done in /local/ maintenance as well, for exactly this reason. Global maintenance is designed for HE VM maintenance, and you're maintaining the host itself.

Comment 5 Artyom 2016-01-03 09:38:45 UTC

We are talking about a single-host upgrade, so I cannot put the host into maintenance from the engine, and putting it into local maintenance via the hosted-engine CLI will leave the HE VM running on the host (because the agent has no better host to run the VM on).

Comment 6 Artyom 2016-01-07 10:22:14 UTC

I also encountered this problem on an HE environment with two hosts (iSCSI). It looks like it depends on how many packages need to be upgraded: more packages take more time, so sanlock manages to hit its timeout and reboot the host via the watchdog device.

Comment 7 Artyom 2016-01-07 10:50:52 UTC

Raising the severity because of comment 6. I believe we also need to merge the patch from https://bugzilla.redhat.com/show_bug.cgi?id=1282187 into the 3.5 z-stream.

Comment 8 Martin Sivák 2016-01-07 10:51:48 UTC

Doron, I believe that won't help (it is needed, but not good enough). Sanlock package needs to be updated and that might cause a machine reboot, because sanlock still has an active resource: the engine VM itself.

Comment 9 Artyom 2016-01-07 11:57:15 UTC

Created attachment 1112425
logs from HE environment with two hosts

Host master-vds10.qa.lab.tlv.redhat.com still holds sanlock lockspaces:
[root@master-vds10 ~]# sanlock client status
daemon 3609a107-a6cc-429e-b26b-40e930539348.master-vds
p -1 helper
p -1 listener
p -1 status
p 5080 
s hosted-engine:2:/var/run/vdsm/storage/c8739b1f-432b-4d63-9028-746260ed9834/32dfd735-6f65-4c44-86fb-e38d4809aaba/c6f2604f-701a-4e34-a1fb-cb8a59b23a54:0
s c8739b1f-432b-4d63-9028-746260ed9834:2:/dev/c8739b1f-432b-4d63-9028-746260ed9834/ids:0


hosted-engine CLI:
--== Host 2 status ==--

Status up-to-date                  : False
Hostname                           : master-vds10.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : unknown stale-data
Score                              : 0
Local maintenance                  : True
Host timestamp                     : 66493
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=66493 (Thu Jan  7 13:48:12 2016)
        host-id=2
        score=0
        maintenance=True
        state=LocalMaintenance


ovirt-ha-agent service:
[root@master-vds10 ~]# systemctl status ovirt-ha-agent
ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled)
   Active: inactive (dead) since Thu 2016-01-07 13:48:22 IST; 1min 51s ago
  Process: 23177 ExecStop=/usr/lib/systemd/systemd-ovirt-ha-agent stop (code=exited, status=0/SUCCESS)
 Main PID: 5555 (code=exited, status=0/SUCCESS)



from engine side:
<name>hosted_engine_2</name>
<comment />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/storage" rel="storage" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/nics" rel="nics" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/numanodes" rel="numanodes" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/tags" rel="tags" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/permissions" rel="permissions" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/statistics" rel="statistics" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/hooks" rel="hooks" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/fenceagents" rel="fenceagents" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/katelloerrata" rel="katelloerrata" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/devices" rel="devices" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/networkattachments" rel="networkattachments" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/unmanagednetworks" rel="unmanagednetworks" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/storageconnectionextensions" rel="storageconnectionextensions" />
<address>master-vds10.qa.lab.tlv.redhat.com</address>
 <certificate>
<organization>qa.lab.tlv.redhat.com</organization>
<subject>O=qa.lab.tlv.redhat.com,CN=master-vds10.qa.lab.tlv.redhat.com</subject>
 </certificate>
<status>
<state>maintenance</state>
</status>

versions:
[root@master-vds10 ~]# rpm -qa | grep vdsm
vdsm-yajsonrpc-4.16.30-0.el7.centos.noarch
vdsm-xmlrpc-4.16.30-0.el7.centos.noarch
vdsm-jsonrpc-4.16.30-0.el7.centos.noarch
vdsm-python-4.16.30-0.el7.centos.noarch
vdsm-4.16.30-0.el7.centos.x86_64
vdsm-cli-4.16.30-0.el7.centos.noarch
vdsm-python-zombiereaper-4.16.30-0.el7.centos.noarch
[root@master-vds10 ~]# rpm -qa | grep hosted
ovirt-hosted-engine-ha-1.2.8-1.el7.centos.noarch
ovirt-hosted-engine-setup-1.2.6.1-1.el7.centos.noarch

Comment 10 Simone Tiraboschi 2016-01-07 12:01:10 UTC

We have two distinct issues here:

1. hosted-engine with just one host:
the engine VM cannot migrate anywhere else, so the engine VM should be off, as we say in the release notes.

2. https://bugzilla.redhat.com/show_bug.cgi?id=1282187 
we fixed it on 3.6 but not on 3.5, and upstream there will be no further 3.5.z releases.
The user has to put the host in maintenance and local maintenance, and the cluster in global maintenance. The engine should be somewhere else.
If a lock is still there, the user has to manually remove it before upgrading rpms if sanlock is in the list.
We also have to document this properly.
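
The conditions listed in point 2 can be sketched as a pre-upgrade checklist. This is a minimal Python sketch, not part of any oVirt tool: the HostState fields and the function name are hypothetical, and how each value would be collected on a real host is left out on purpose.

```python
# Sketch: evaluate the preconditions from point 2 before upgrading rpms.
# All names here are hypothetical; nothing queries a real host.
from dataclasses import dataclass
from typing import List


@dataclass
class HostState:
    """Hypothetical snapshot of one host before the rpm upgrade."""
    engine_maintenance: bool   # host set to maintenance from the engine
    local_maintenance: bool    # hosted-engine local maintenance on this host
    global_maintenance: bool   # cluster is in global maintenance
    engine_vm_on_host: bool    # engine VM currently runs on this host
    sanlock_lockspaces: int    # lockspaces still listed by sanlock
    sanlock_in_update: bool    # sanlock rpm is among the packages to update


def unmet_preconditions(state: HostState) -> List[str]:
    """Return the conditions that still block the upgrade (empty list = go)."""
    problems = []
    if not state.engine_maintenance:
        problems.append("put the host in maintenance from the engine")
    if not state.local_maintenance:
        problems.append("put the host in local maintenance")
    if not state.global_maintenance:
        problems.append("put the cluster in global maintenance")
    if state.engine_vm_on_host:
        problems.append("move the engine VM to another host")
    if state.sanlock_in_update and state.sanlock_lockspaces > 0:
        problems.append(
            f"remove the remaining sanlock lockspaces ({state.sanlock_lockspaces} found)"
        )
    return problems


# Illustrative values only: everything is in place except leftover lockspaces.
example = HostState(True, True, True, False, 2, True)
for problem in unmet_preconditions(example):
    print("blocked:", problem)
```

The lockspace check only applies when the sanlock rpm is part of the update, which mirrors the "if sanlock is in the list" condition above.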

Comment 11 Red Hat Bugzilla Rules Engine 2016-01-11 07:52:30 UTC

Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.

Comment 12 Simone Tiraboschi 2016-01-12 18:08:56 UTC

(In reply to Yaniv Kaul from comment #3)
> Is there a documentation bug?

https://bugzilla.redhat.com/show_bug.cgi?id=1293971

Comment 13 Simone Tiraboschi 2016-01-13 14:55:11 UTC

This issue happens because https://bugzilla.redhat.com/show_bug.cgi?id=1282187 has not been fixed on 3.5.z.

Workaround: after stopping ovirt-ha-agent and before running yum update to update the sanlock rpm, run:

source /etc/ovirt-hosted-engine/hosted-engine.conf
vdsClient -s 0 stopMonitoringDomain {$sdUUID}
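
For scripting the same step without a shell, here is a minimal Python sketch that reads sdUUID from key=value text in the style of hosted-engine.conf and builds the vdsClient argument list. The sample file contents and the helper names are hypothetical (the UUID is borrowed from the lockspace name in comment 15 purely as an illustration). One shell detail worth double-checking when typing the command by hand: bash expands ${sdUUID} to the bare value, whereas {$sdUUID} keeps literal braces around it.

```python
# Sketch: read sdUUID from hosted-engine.conf-style text and build the
# vdsClient argument list. Nothing is executed; the sample text is made up.
import shlex
from typing import Dict, List

SAMPLE_CONF = """\
# illustrative contents only
host_id=1
sdUUID=78626267-83ac-4f89-a971-84b75d46bee1
"""


def read_conf(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blanks and comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def stop_monitoring_argv(conf: Dict[str, str]) -> List[str]:
    """Argument list for the stopMonitoringDomain call, UUID passed bare."""
    return ["vdsClient", "-s", "0", "stopMonitoringDomain", conf["sdUUID"]]


conf = read_conf(SAMPLE_CONF)
argv = stop_monitoring_argv(conf)
print(" ".join(shlex.quote(arg) for arg in argv))
```

Passing the arguments as a list (for example to subprocess.run) avoids shell expansion entirely, so the UUID reaches vdsClient exactly as stored in the file.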

Comment 14 Sandro Bonazzola 2016-01-14 08:09:07 UTC

Moving to QE for testing the workaround. It can't be fixed in oVirt 3.5.z since we stopped supporting it. In RHEV, this is tracked by bug #1298461.

Comment 15 Artyom 2016-01-25 10:16:18 UTC

Hi Simone,
I did all the steps from comment 13, but it does not really seem to help, because sanlock still holds locks on HE.
[root@rose05 yum.repos.d]# systemctl stop ovirt-ha-agent
[root@rose05 yum.repos.d]# source /etc/ovirt-hosted-engine/hosted-engine.conf
[root@rose05 yum.repos.d]# vdsClient -s 0 stopMonitoringDomain {$sdUUID}
OK
[root@rose05 yum.repos.d]# sanlock client status
daemon d930e53e-3c4a-424f-b642-af0c8ea8493c.rose05.qa.
p -1 helper
p -1 listener
p -1 status
p 8932 
s hosted-engine:1:/var/run/vdsm/storage/78626267-83ac-4f89-a971-84b75d46bee1/0cff6ab6-08fe-4357-b27e-2be1a4718dcd/1335757f-b381-436f-92d4-b1b1f096b4c5:0
s 78626267-83ac-4f89-a971-84b75d46bee1:1:/rhev/data-center/mnt/10.35.64.11\:_vol_RHEV_Virt_alukiano__HE__upgrade/78626267-83ac-4f89-a971-84b75d46bee1/dom_md/ids:0
So can I move it back to ASSIGNED?
Also, as a workaround, "sanlock client shutdown -f 1" works fine.

Comment 16 Simone Tiraboschi 2016-01-25 10:19:55 UTC

(In reply to Artyom from comment #15)
> So can I move it back to ASSIGNED?
> Also, as a workaround, "sanlock client shutdown -f 1" works fine.

The issue was on 3.5.z, we fixed it here: https://bugzilla.redhat.com/show_bug.cgi?id=1298461

There is not really that much we can do on 3.6.z since the issue happens during the upgrade and not after that.

Comment 17 Artyom 2016-01-25 11:14:11 UTC

OK, so I will wait until bug https://bugzilla.redhat.com/show_bug.cgi?id=1298461 is ON_QA and then verify both bugs.

Comment 18 Artyom 2016-01-28 14:30:11 UTC

Verified on ovirt-hosted-engine-ha-1.3.3.7-1.el7ev.noarch
