Created attachment 1109818 [details]
sosreport from host

Description of problem:
When I follow the general flow to upgrade a single-host HE environment from 3.5 to 3.6, the host automatically restarts in the middle of the yum update.

Version-Release number of selected component (if applicable):
3.5
==========================
# uname -r
3.10.0-229.26.1.el7.x86_64
# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.1 (Maipo)
# rpm -qa | grep sanlock
libvirt-lock-sanlock-1.2.8-16.el7_1.5.x86_64
sanlock-python-3.2.2-2.el7.x86_64
sanlock-lib-3.2.2-2.el7.x86_64
sanlock-3.2.2-2.el7.x86_64
# rpm -qa | grep vdsm-4
vdsm-4.16.31-1.el7ev.x86_64
# rpm -qa | grep ovirt-hosted
ovirt-hosted-engine-setup-1.2.6.1-1.el7ev.noarch
ovirt-hosted-engine-ha-1.2.8-1.el7ev.noarch

3.6
==========================
# uname -r
3.10.0-327.4.4.el7.x86_64
# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.2 (Maipo)
# rpm -qa | grep sanlock
sanlock-3.2.4-2.el7_2.x86_64
sanlock-python-3.2.4-2.el7_2.x86_64
libvirt-lock-sanlock-1.2.17-13.el7_2.2.x86_64
sanlock-lib-3.2.4-2.el7_2.x86_64
# rpm -qa | grep vdsm-4
vdsm-4.17.14-0.el7ev.noarch
# rpm -qa | grep ovirt-hosted
ovirt-hosted-engine-setup-1.3.2-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.3.6-1.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy HE with a single host on a 3.5 environment
2. Enable global maintenance
3. Upgrade the engine VM from 3.5 to 3.6
4. Power off the engine VM
5. Upgrade the host from 3.5 to 3.6 (add the 3.6 repos and run yum update)

Actual results:
The host reboots for some reason (I think sanlock restarts it) in the middle of yum update, which can be the source of many problems (for example, I got a kernel panic on the new kernel).

Expected results:
yum update finishes, and the upgrade from 3.5 to 3.6 succeeds without any errors.

Additional info:
As I said, I think the sanlock release mechanism reboots the host via the watchdog device, so I tried to work around the problem:
1. Deploy HE with a single host on a 3.5 environment
2. Enable global maintenance
3. Upgrade the engine VM from 3.5 to 3.6
4. Power off the engine VM
5. Disable ovirt-ha-agent
6. Reboot the host to guarantee that no sanlock locks are held on it
7. Upgrade the host from 3.5 to 3.6 (add the 3.6 repos and run yum update)
8. Enable and start ovirt-ha-agent
9. Disable global maintenance
With these steps the upgrade succeeded (this is why I filed the bug with high severity rather than urgent).

sanlock status before the update:
daemon 550a494b-d1af-42ae-92f7-dc80c6f81e29.master-vds
p -1 helper
p -1 listener
p 25485 HostedEngine
p -1 status
s hosted-engine:1:/var/run/vdsm/storage/d611aacd-1193-4f53-9e2f-2d8e2ef461ab/e6ca46c8-2274-402b-b72f-bcee0f0cbf93/b825c73d-d99b-4852-8697-00d630569d32:0
s d611aacd-1193-4f53-9e2f-2d8e2ef461ab:1:/rhev/data-center/mnt/10.35.64.11\:_vol_RHEV_Virt_alukiano__HE__upgrade__1/d611aacd-1193-4f53-9e2f-2d8e2ef461ab/dom_md/ids:0
r d611aacd-1193-4f53-9e2f-2d8e2ef461ab:e9afb90e-fabc-4a7f-ac06-fa0577362b4e:/rhev/data-center/mnt/10.35.64.11\:_vol_RHEV_Virt_alukiano__HE__upgrade__1/d611aacd-1193-4f53-9e2f-2d8e2ef461ab/images/f60a9083-1d15-4825-afe7-92adeae48b28/e9afb90e-fabc-4a7f-ac06-fa0577362b4e.lease:0:2 p 25485
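For reference, the workaround steps above as a minimal shell sketch (these are the standard hosted-engine and systemd tools; the 3.6 repo configuration step is elided):

  hosted-engine --set-maintenance --mode=global   # step 2: global maintenance
  # ... upgrade the engine VM to 3.6 here ...
  hosted-engine --vm-shutdown                     # step 4: power off the engine VM
  systemctl disable ovirt-ha-agent                # step 5
  reboot                                          # step 6: guarantees no sanlock locks are held
  # after the reboot, with the 3.6 repos in place:
  yum update                                      # step 7
  systemctl enable ovirt-ha-agent                 # step 8
  systemctl start ovirt-ha-agent
  hosted-engine --set-maintenance --mode=none     # step 9: disable global maintenance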
Please see this one:
https://bugzilla.redhat.com/show_bug.cgi?id=1282187#c31

The issue is exactly this: you have to manually stop ovirt-ha-agent, otherwise it will keep a lock, and upgrading sanlock while it has an active lock can cause a reboot.

Probably we can just document it really well.
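For illustration, a minimal sketch of that manual step, with a check that the lock is really gone before yum replaces the sanlock rpm (stopping ovirt-ha-broker as well is an assumption here, to make sure nothing restarts the agent's monitors):

  systemctl stop ovirt-ha-agent
  systemctl stop ovirt-ha-broker   # assumption: also stop the broker
  sanlock client status            # no "s hosted-engine:..." lockspace line should remain before yum update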
Maybe we can just provide a script, something like "prepare single HE host for upgrade", because if the user forgets a step, a reboot in the middle of yum update can corrupt the whole system.
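Something like the following, as a hypothetical sketch (the script name and exact step list are illustrative, not an existing tool):

  #!/bin/bash
  # prepare-single-he-host-for-upgrade.sh -- hypothetical helper, not shipped anywhere
  set -e
  hosted-engine --set-maintenance --mode=global
  hosted-engine --vm-shutdown || true      # the engine VM may already be powered off
  systemctl disable ovirt-ha-agent
  systemctl stop ovirt-ha-agent
  echo "Now reboot the host, run yum update, then re-enable ovirt-ha-agent and leave global maintenance."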
(In reply to Simone Tiraboschi from comment #1)
> Please see this one:
> https://bugzilla.redhat.com/show_bug.cgi?id=1282187#c31
>
> The issue is exactly this: you have to manually stop ovirt-ha-agent,
> otherwise it will keep a lock, and upgrading sanlock while it has an active
> lock can cause a reboot.
>
> Probably we can just document it really well.

Is there a documentation bug?
(In reply to Artyom from comment #2)
> Maybe we can just provide a script, something like "prepare single HE host
> for upgrade", because if the user forgets a step, a reboot in the middle of
> yum update can corrupt the whole system.

Such an upgrade should be done in /local/ maintenance as well, for exactly this reason. Global maintenance is designed for HE VM maintenance, and here you are maintaining the host itself.
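For reference, the two maintenance modes this distinction is about, via the hosted-engine CLI:

  hosted-engine --set-maintenance --mode=global   # cluster-wide: HA agents stop acting on the engine VM
  hosted-engine --set-maintenance --mode=local    # this host only: its HA score drops so the engine VM moves away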
We are talking about a single-host upgrade, so I cannot put the host into maintenance from the engine, and putting it into local maintenance via the hosted-engine CLI will leave the HE VM running on the host (because the agent has no better host to run the VM on).
I encountered this problem also on an HE environment with two hosts (iSCSI). It looks like it depends on how many packages you need to upgrade: more packages mean more time, so sanlock manages to hit its timeout and reboots the host via the watchdog device.
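If the timing theory holds, the exposure can be gauged before starting the upgrade; a rough sketch (the line count is only an approximation of the transaction size):

  yum check-update | wc -l   # rough count of pending packages; a bigger transaction runs longer
  sanlock client status      # the lockspaces that are at risk while the transaction runs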
Raising severity because of comment 6. I believe we need to merge the patch from https://bugzilla.redhat.com/show_bug.cgi?id=1282187 into the 3.5 z-stream as well.
Doron, I believe that won't help (it is needed, but not good enough). The sanlock package needs to be updated, and that might cause a machine reboot, because sanlock still has an active resource: the engine VM itself.
Created attachment 1112425 [details]
logs from HE environment with two hosts

Host master-vds10.qa.lab.tlv.redhat.com has sanlock locks:

[root@master-vds10 ~]# sanlock client status
daemon 3609a107-a6cc-429e-b26b-40e930539348.master-vds
p -1 helper
p -1 listener
p -1 status
p 5080
s hosted-engine:2:/var/run/vdsm/storage/c8739b1f-432b-4d63-9028-746260ed9834/32dfd735-6f65-4c44-86fb-e38d4809aaba/c6f2604f-701a-4e34-a1fb-cb8a59b23a54:0
s c8739b1f-432b-4d63-9028-746260ed9834:2:/dev/c8739b1f-432b-4d63-9028-746260ed9834/ids:0

hosted-engine CLI:

--== Host 2 status ==--

Status up-to-date              : False
Hostname                       : master-vds10.qa.lab.tlv.redhat.com
Host ID                        : 2
Engine status                  : unknown stale-data
Score                          : 0
Local maintenance              : True
Host timestamp                 : 66493
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=66493 (Thu Jan 7 13:48:12 2016)
    host-id=2
    score=0
    maintenance=True
    state=LocalMaintenance

ovirt-ha-agent service:

[root@master-vds10 ~]# systemctl status ovirt-ha-agent
ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled)
   Active: inactive (dead) since Thu 2016-01-07 13:48:22 IST; 1min 51s ago
  Process: 23177 ExecStop=/usr/lib/systemd/systemd-ovirt-ha-agent stop (code=exited, status=0/SUCCESS)
 Main PID: 5555 (code=exited, status=0/SUCCESS)

From the engine side:

<name>hosted_engine_2</name>
<comment />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/storage" rel="storage" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/nics" rel="nics" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/numanodes" rel="numanodes" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/tags" rel="tags" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/permissions" rel="permissions" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/statistics" rel="statistics" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/hooks" rel="hooks" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/fenceagents" rel="fenceagents" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/katelloerrata" rel="katelloerrata" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/devices" rel="devices" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/networkattachments" rel="networkattachments" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/unmanagednetworks" rel="unmanagednetworks" />
<link href="/ovirt-engine/api/hosts/ac1dad21-b1cd-4d69-ae6e-bbd8409e16b4/storageconnectionextensions" rel="storageconnectionextensions" />
<address>master-vds10.qa.lab.tlv.redhat.com</address>
<certificate>
    <organization>qa.lab.tlv.redhat.com</organization>
    <subject>O=qa.lab.tlv.redhat.com,CN=master-vds10.qa.lab.tlv.redhat.com</subject>
</certificate>
<status>
    <state>maintenance</state>
</status>

Versions:

[root@master-vds10 ~]# rpm -qa | grep vdsm
vdsm-yajsonrpc-4.16.30-0.el7.centos.noarch
vdsm-xmlrpc-4.16.30-0.el7.centos.noarch
vdsm-jsonrpc-4.16.30-0.el7.centos.noarch
vdsm-python-4.16.30-0.el7.centos.noarch
vdsm-4.16.30-0.el7.centos.x86_64
vdsm-cli-4.16.30-0.el7.centos.noarch
vdsm-python-zombiereaper-4.16.30-0.el7.centos.noarch
[root@master-vds10 ~]# rpm -qa | grep hosted
ovirt-hosted-engine-ha-1.2.8-1.el7.centos.noarch
ovirt-hosted-engine-setup-1.2.6.1-1.el7.centos.noarch
We have two distinct issues here:

1. hosted-engine with just one host: the engine VM cannot migrate anywhere else, so it should be powered off, as we say in the release notes.

2. https://bugzilla.redhat.com/show_bug.cgi?id=1282187: we fixed it on 3.6 but not on 3.5, and upstream we will not have another 3.5.z release.

The user has to put the host into (local) maintenance and the cluster into global maintenance; the engine VM should be running somewhere else. If a lock is still there, the user has to manually remove it before upgrading rpms, if sanlock is in the list.

We also have to properly document this.
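For the documentation, the pre-upgrade check could look like this (a hedged sketch; package names per the version lists in the description):

  yum list updates sanlock sanlock-lib sanlock-python   # is sanlock part of this update?
  sanlock client status                                 # any remaining "s ..." line is a lock that must be removed first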
Bug tickets must have version flags set prior to targeting them to a release. Please ask the maintainer to set the correct version flags, and only then set the target milestone.
(In reply to Yaniv Kaul from comment #3)
> Is there a documentation bug?

https://bugzilla.redhat.com/show_bug.cgi?id=1293971
This issue happens because https://bugzilla.redhat.com/show_bug.cgi?id=1282187 has not been fixed on 3.5.z.

Workaround: after stopping ovirt-ha-agent, and before running yum update to update the sanlock rpm, run:

source /etc/ovirt-hosted-engine/hosted-engine.conf
vdsClient -s 0 stopMonitoringDomain {$sdUUID}
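Before going on to yum update, it is worth re-checking that the lockspaces are actually gone; a minimal check (the <sdUUID> placeholder stands for the storage domain UUID from the sourced config):

  sanlock client status   # if "s hosted-engine:..." or "s <sdUUID>:..." lines remain, the locks were not released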
Moving to QE for testing the workaround. It can't be fixed in oVirt 3.5.z, since we stopped supporting it. In RHEV, this is tracked by bug #1298461.
Hi Simone, I did all the steps from comment 13, but it looks like they do not really help, because sanlock still has locks for HE:

[root@rose05 yum.repos.d]# systemctl stop ovirt-ha-agent
[root@rose05 yum.repos.d]# source /etc/ovirt-hosted-engine/hosted-engine.conf
[root@rose05 yum.repos.d]# vdsClient -s 0 stopMonitoringDomain {$sdUUID}
OK
[root@rose05 yum.repos.d]# sanlock client status
daemon d930e53e-3c4a-424f-b642-af0c8ea8493c.rose05.qa.
p -1 helper
p -1 listener
p -1 status
p 8932
s hosted-engine:1:/var/run/vdsm/storage/78626267-83ac-4f89-a971-84b75d46bee1/0cff6ab6-08fe-4357-b27e-2be1a4718dcd/1335757f-b381-436f-92d4-b1b1f096b4c5:0
s 78626267-83ac-4f89-a971-84b75d46bee1:1:/rhev/data-center/mnt/10.35.64.11\:_vol_RHEV_Virt_alukiano__HE__upgrade/78626267-83ac-4f89-a971-84b75d46bee1/dom_md/ids:0

So can I move it back to ASSIGNED?
Also, the workaround "sanlock client shutdown -f 1" works fine.
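Put together, the sequence that worked here, as a hedged sketch ($sdUUID comes from the sourced config file; -f 1 forces the sanlock daemon to drop its lockspaces and exit, so a subsequent client status call is expected to fail to connect):

  systemctl stop ovirt-ha-agent
  source /etc/ovirt-hosted-engine/hosted-engine.conf
  vdsClient -s 0 stopMonitoringDomain "$sdUUID"
  sanlock client shutdown -f 1   # force sanlock to release everything before the rpm is replaced
  yum update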
(In reply to Artyom from comment #15)
> So can I move it back to ASSIGNED?
> Also, the workaround "sanlock client shutdown -f 1" works fine.

The issue was on 3.5.z; we fixed it here:
https://bugzilla.redhat.com/show_bug.cgi?id=1298461

There is not really much we can do on 3.6.z, since the issue happens during the upgrade and not after it.
OK, I will wait until https://bugzilla.redhat.com/show_bug.cgi?id=1298461 is ON_QA and then verify both bugs.
Verified on ovirt-hosted-engine-ha-1.3.3.7-1.el7ev.noarch