Bug 1784906 - Upon updating to 4.3.7, the CPU feature MDS is no longer presented and the hosted engine will not boot.
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: ovirt-node
Classification: oVirt
Component: Installation & Update
Version: 4.3
Hardware: x86_64
OS: Unspecified
Target Milestone: ovirt-4.3.9
Assignee: Sandro Bonazzola
QA Contact: Wei Wang
URL:
Whiteboard:
Depends On: 1753541
Blocks:
Reported: 2019-12-18 16:19 UTC by Joshua Kocinski
Modified: 2020-03-11 16:54 UTC

Fixed In Version:
Doc Type:
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-11 16:30:37 UTC
oVirt Team: Node
Embargoed:
sbonazzo: ovirt-4.3?
peyu: testing_plan_complete+
cshao: testing_ack?


Attachments
sosreport from 4.3.5.x (16.70 MB, application/x-xz)
2020-01-02 20:54 UTC, Joshua Kocinski
sosreport from 4.3.7 (16.42 MB, application/x-xz)
2020-01-02 20:55 UTC, Joshua Kocinski


Links
System                             ID       Private  Priority  Status  Summary  Last Updated
Red Hat Knowledge Base (Solution)  4393691  0        None      None    None     2020-02-19 08:38:08 UTC

Description Joshua Kocinski 2019-12-18 16:19:05 UTC
Description of problem:

After applying the 4.3.7 update, the hosted engine no longer boots. I determined that the cause is the MDS CPU feature no longer being present in the upgraded environment.

Version-Release number of selected component (if applicable):

Pre upgrade: ovirt-node-ng-4.3.5.2-0.20190805.0 (3.10.0-957.27.2.el7.x86_64)
Post upgrade: ovirt-node-ng-4.3.7-0.20191121.0 (3.10.0-1062.4.3.el7.x86_64)

How reproducible:

I was able to reproduce this on two different systems. System details are included below in the "Additional info" section.

Steps to Reproduce:
1. Install oVirt node 4.3.5.2
2. Verify presence of MDS:
[root@heavy ~]# dmesg |grep MDS
[    0.047694] MDS: Mitigation: Clear CPU buffers
3. Install upgrade package and reboot
4. Verify non-presence of MDS:
[root@heavy ~]# dmesg |grep MDS
[    0.047694] MDS: Vulnerable: Clear CPU buffers attempted, no microcode
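
The dmesg check in steps 2 and 4 can be made a little more robust. The following is a minimal sketch, not part of the original report: the function names `mds_state` and `check_host` are hypothetical, and it assumes the running kernel exposes the sysfs file /sys/devices/system/cpu/vulnerabilities/mds and reports the `md_clear` flag in /proc/cpuinfo when the microcode provides the feature.

```shell
#!/bin/sh
# Sketch only: mds_state and check_host are hypothetical helper names.

# Classify an MDS status line (from dmesg or from the sysfs file).
mds_state() {
    case "$1" in
        *"Mitigation: Clear CPU buffers"*)
            echo mitigated ;;
        *"Vulnerable: Clear CPU buffers attempted, no microcode"*)
            echo no-microcode ;;
        *"Not affected"*)
            echo not-affected ;;
        *)
            echo unknown ;;
    esac
}

# Report the MDS state of the running host without relying on the dmesg
# ring buffer, which can be overwritten on long-running systems.
check_host() {
    # Assumption: this sysfs file exists on the installed kernel.
    mds_sysfs=/sys/devices/system/cpu/vulnerabilities/mds
    if [ -r "$mds_sysfs" ]; then
        echo "sysfs:   $(mds_state "$(cat "$mds_sysfs")")"
    else
        echo "sysfs:   not available"
    fi
    # md_clear is the CPU flag that goes missing when the microcode is not loaded.
    if grep -qw md_clear /proc/cpuinfo 2>/dev/null; then
        echo "cpuinfo: md_clear present"
    else
        echo "cpuinfo: md_clear absent"
    fi
}
```

Calling `check_host` before and after the upgrade should show the state changing from mitigated to no-microcode on an affected system.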

Actual results:

hosted-engine will not start because it requires the MDS CPU flag to be present.

Expected results:

hosted-engine should start.

Additional info:

Both systems in my lab are based on Supermicro X9DRW mainboards with Intel(R) Xeon(R) CPU E5-2670 processors. I was able to work around this issue for now by changing the grub default entry to be the older system/kernel.

Is this maybe a kernel issue?

Comment 2 peyu 2019-12-23 06:57:02 UTC
I ran an upgrade test and could not reproduce this bug.


Version-Release number of selected component (if applicable):
ovirt-node-ng-installer-4.3.5-2019080513.el7.iso
ovirt-node-ng-image-update-4.3.7-1.el7.noarch.rpm


Test steps:
1. Install ovirt-node-ng-installer-4.3.5-2019080513.el7.iso
2. Verify presence of MDS:
   # dmesg |grep MDS
3. Install upgrade package and reboot
4. Verify non-presence of MDS:
   # dmesg |grep MDS


Test results:
1. The result of step 2:
[    0.024125] MDS: Mitigation: Clear CPU buffers
[    0.345002] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.

2. The result of step 4:
[    0.024395] MDS: Mitigation: Clear CPU buffers
[    0.345858] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.


Would you please deploy hosted-engine and try to start it?

Comment 3 Wei Wang 2019-12-23 07:36:32 UTC
(In reply to peyu from comment #2)

peyu,
According to the reporter's comment 0, this is an upgrade bug that occurs in an environment where HE is already deployed (the reporter determined it to be caused by the MDS CPU feature no longer being present in the new environment). If you need the specific machine, I will hand it over to you after I finish running the HE automation tests.

Comment 4 Yuval Turgeman 2019-12-25 08:42:44 UTC
What version of microcode_ctl is installed on the node? Also, can you please share some logs (an sosreport would be good)?

Comment 5 Joshua Kocinski 2019-12-30 02:29:01 UTC
I am traveling and won't be back until the 2nd. I will upload SOS reports as soon as I get back. In the meantime, all I can get is the version of microcode_ctl prior to the upgrade. That version is:

microcode_ctl-2.1-47.5.el7_6.x86_64

Comment 6 Joshua Kocinski 2020-01-02 20:47:54 UTC
version of microcode_ctl from 4.3.7 release:

microcode_ctl-2.1-53.3.el7_7.x86_64

I will attach sosreports separately.

Comment 7 Joshua Kocinski 2020-01-02 20:54:01 UTC
Created attachment 1649293
sosreport from 4.3.5.x

sosreport from 4.3.5.x (prior to update)

Comment 8 Joshua Kocinski 2020-01-02 20:55:23 UTC
Created attachment 1649304
sosreport from 4.3.7

sosreport from 4.3.7 (after update)

Comment 9 Joshua Kocinski 2020-01-02 21:00:46 UTC
I have added the requested sosreports and commented with the relevant versions of microcode_ctl package.

Checking, I still see the "no microcode" message:

[root@breathe ~]# dmesg |grep MDS
[    0.048436] MDS: Vulnerable: Clear CPU buffers attempted, no microcode

Let me know if there is anything else I can do to help isolate this issue.

Comment 10 Joshua Kocinski 2020-01-09 01:50:40 UTC
After doing a bit of digging myself, I found a workaround in this bug:

https://bugzilla.redhat.com/show_bug.cgi?id=1753541#c13

Specifically, if I do this:

 * Boot into 4.3.7
 * install -D /dev/null /etc/microcode_ctl/ucode_with_caveats/force-intel-06-2d-07
 * dracut -f --early-microcode
 * Reboot

Then the system passes my original tests with 4.3.7:

[root@breathe ~]# dmesg |grep MDS
[    0.048602] MDS: Mitigation: Clear CPU buffers
[root@breathe ~]# cat /sys/devices/system/cpu/cpu0/microcode/version
0x718

Further, I can start the HostedEngine VM on the 4.3.7 host.
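
The workaround steps above can be collected into a small script. This is only a sketch: the helper names (`format_signature`, `cpu_field`, `host_signature`, `apply_workaround`) are hypothetical, and it assumes that the `06-2d-07` part of the force file name is the CPU family, model, and stepping in hexadecimal, so the script refuses to run on any other CPU. The `install` and `dracut` commands and the file path are the ones quoted in this comment.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; the path and commands come from
# the workaround quoted above.

FORCE_DIR=/etc/microcode_ctl/ucode_with_caveats

# Format family/model/stepping (decimal, as shown in /proc/cpuinfo) as the hex
# signature used in the force file name, e.g. 6 45 7 -> 06-2d-07.
format_signature() {
    printf '%02x-%02x-%02x\n' "$1" "$2" "$3"
}

# Read one field of the first CPU entry in /proc/cpuinfo.
cpu_field() {
    awk -F':' -v key="$1" '
        { k = $1; gsub(/[ \t]+$/, "", k) }
        k == key { v = $2; gsub(/^[ \t]+/, "", v); print v; exit }
    ' /proc/cpuinfo
}

host_signature() {
    format_signature "$(cpu_field 'cpu family')" \
                     "$(cpu_field 'model')" \
                     "$(cpu_field 'stepping')"
}

# Must run as root. Creates the force file and rebuilds the initramfs with
# early microcode; the reboot is left to the administrator.
apply_workaround() {
    sig=$(host_signature)
    if [ "$sig" != "06-2d-07" ]; then
        echo "CPU signature is $sig, not 06-2d-07; not applying" >&2
        return 1
    fi
    install -D /dev/null "$FORCE_DIR/force-intel-06-2d-07" &&
        dracut -f --early-microcode &&
        echo "Done. Reboot, then re-check with: dmesg | grep MDS"
}
```

Nothing runs until `apply_workaround` is called explicitly, which keeps the script safe to source on hosts where the workaround may not apply.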

If I understand the notes in bug #1753541 correctly, the microcode for my CPU which mitigates MDS was blacklisted due to some sort of issue with similar CPUs on other systems (possibly in bug #1758382) and this "force" workaround is necessary to have the system install the microcode.

Is this not a bug then? Should this be documented somewhere?

Thanks.

-J

Comment 11 Sandro Bonazzola 2020-01-14 09:05:24 UTC
In 4.3.9 we are going to consume 7.8, so for now I am targeting this to 4.3.9.

Comment 12 Wei Wang 2020-01-16 03:14:38 UTC
Re-tested with the steps below:
1. Install the ovirt-node-ng-installer-4.3.5-2019080513.el7.iso build
[root@hp-dl388g9-04 ~]# dmesg |grep MDS
[    0.044526] MDS: Vulnerable: Clear CPU buffers attempted, no microcode

2. Deploy hosted-engine with ovirt-engine-appliance-4.3-20190731.1.el7.x86_64.rpm
[root@hp-dl388g9-05 ~]# hosted-engine --vm-status


--== Host hp-dl388g9-05.lab.eng.pek2.redhat.com (id: 1) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : hp-dl388g9-05.lab.eng.pek2.redhat.com
Host ID                            : 1
Engine status                      : {"health": "good", "vm": "up", "detail": "Up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : eed8ccf1
local_conf_timestamp               : 4433
Host timestamp                     : 4433
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=4433 (Thu Jan 16 09:06:48 2020)
	host-id=1
	score=3400
	vm_conf_refresh_time=4433 (Thu Jan 16 09:06:48 2020)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineUp
	stopped=False
[root@hp-dl388g9-05 ~]# dmesg |grep MDS
[    0.044592] MDS: Vulnerable: Clear CPU buffers attempted, no microcode

3. Install the upgrade package and reboot
[root@hp-dl388g9-05 ~]# imgbase w
You are on ovirt-node-ng-4.3.7-0.20191121.0+1
[root@hp-dl388g9-05 ~]# dmesg |grep MDS
[    0.045146] MDS: Vulnerable: Clear CPU buffers attempted, no microcode
[root@hp-dl388g9-05 ~]# hosted-engine --vm-status


--== Host hp-dl388g9-05.lab.eng.pek2.redhat.com (id: 1) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : hp-dl388g9-05.lab.eng.pek2.redhat.com
Host ID                            : 1
Engine status                      : {"health": "good", "vm": "up", "detail": "Up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : a3fc8a7e
local_conf_timestamp               : 484
Host timestamp                     : 483
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=483 (Thu Jan 16 10:35:58 2020)
	host-id=1
	score=3400
	vm_conf_refresh_time=484 (Thu Jan 16 10:35:58 2020)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineUp
	stopped=False
You have new mail in /var/spool/mail/root

In addition, I installed the ovirt-node-ng-installer-4.3.7-2019112110.el7.iso build and then deployed hosted-engine with ovirt-engine-appliance-4.3-20191121.1.el7.x86_64.rpm:
[root@hp-dl388g9-04 ~]# hosted-engine --vm-status


--== Host hp-dl388g9-04.lab.eng.pek2.redhat.com (id: 1) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : hp-dl388g9-04.lab.eng.pek2.redhat.com
Host ID                            : 1
Engine status                      : {"health": "good", "vm": "up", "detail": "Up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : ca0a81bf
local_conf_timestamp               : 4793
Host timestamp                     : 4793
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=4793 (Thu Jan 16 09:40:30 2020)
	host-id=1
	score=3400
	vm_conf_refresh_time=4793 (Thu Jan 16 09:40:30 2020)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineUp
	stopped=False
[root@hp-dl388g9-04 ~]# dmesg |grep MDS
[    0.045049] MDS: Vulnerable: Clear CPU buffers attempted, no microcode
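
Because the engine VM can take a while to report good health after a reboot, a single status check may be misleading. The sketch below polls the `Engine status` line shown in the output above. The function names `engine_health` and `wait_for_engine` are hypothetical, and the 600-second timeout and 30-second interval are arbitrary assumptions, not recommended values.

```shell
#!/bin/sh
# Sketch only: engine_health and wait_for_engine are hypothetical helper names.

# Extract the "health" value from `hosted-engine --vm-status` output on stdin.
engine_health() {
    sed -n 's/^Engine status[[:space:]]*:.*"health": *"\([^"]*\)".*/\1/p' | head -n 1
}

# Poll until the engine reports good health or the timeout expires.
# $1 = timeout in seconds (assumed default 600)
# $2 = polling interval in seconds (assumed default 30)
wait_for_engine() {
    timeout_s=${1:-600}
    interval_s=${2:-30}
    waited=0
    while [ "$waited" -lt "$timeout_s" ]; do
        if [ "$(hosted-engine --vm-status 2>/dev/null | engine_health)" = "good" ]; then
            return 0
        fi
        sleep "$interval_s"
        waited=$((waited + interval_s))
    done
    return 1
}
```

For example, `wait_for_engine 900 60 && echo "engine is healthy"` would wait up to 15 minutes before giving up.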

QE cannot reproduce this issue, and it does not appear to be related to the presence of MDS. I have two thoughts on this:
1. The reporter should wait a while after the upgrade for the engine VM to come up and report good health. If the VM status is checked while it is still being retrieved, the check can report a failure.

2. If the engine VM still cannot be retrieved after a long wait, the issue may be specific to particular hardware. If so,

rhbugzilla,
Could you please help to verify this bug after it is fixed?
Thanks!

Comment 13 Joshua Kocinski 2020-01-30 02:07:05 UTC
Wei,

First, let me say I don't think this is a bug/issue with the hosted engine. Rather, I think this is situational, caused by upstream/downstream changes to the microcode packages. I am more than happy to test/validate after any further updates, especially now that I know how to work around the microcode issue.

-Joshua

Comment 14 cshao 2020-02-19 09:05:18 UTC
This is a microcode issue. According to bug 1753541, it was fixed in microcode_ctl-2.1-54.el7, and the microcode_ctl version in the latest RHVH (redhat-virtualization-host-4.3.9-20200204.0.el7_8 with microcode_ctl-2.1-61.el7.x86_64) is higher than the fixed version, so the bug should be gone. Feel free to reopen it if you can still reproduce the issue in the future.

Thanks.
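
To check whether a given host already carries a microcode_ctl build at or above the fixed version named in this comment, a rough comparison can be sketched as follows. The function name `version_ge` is hypothetical, and the comparison relies on GNU `sort -V`, which only approximates RPM version ordering; it is adequate for the simple version-release strings that appear in this bug.

```shell
#!/bin/sh
# Sketch only: version_ge is a hypothetical helper. sort -V (GNU coreutils)
# approximates, but does not exactly reproduce, RPM version comparison.

FIXED_VERSION="2.1-54.el7"   # fixed version named in this comment

# Succeeds when version $1 is greater than or equal to version $2.
version_ge() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Report whether the installed microcode_ctl is at or above the fixed version.
check_installed() {
    installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}\n' microcode_ctl) || return 2
    if version_ge "$installed" "$FIXED_VERSION"; then
        echo "microcode_ctl $installed: at or above $FIXED_VERSION"
    else
        echo "microcode_ctl $installed: older than $FIXED_VERSION"
        return 1
    fi
}
```

With the versions reported in this bug, `version_ge 2.1-53.3.el7_7 2.1-54.el7` fails, while `version_ge 2.1-61.el7 2.1-54.el7` succeeds.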

Comment 15 Sandro Bonazzola 2020-02-25 11:57:07 UTC
Moving back to the NEW state.
CentOS and oVirt Node NG are still shipping microcode_ctl-2.1-53.7.el7_7.x86_64, which is the problematic microcode_ctl package.
In CentOS Git I see imports/c7-beta/microcode_ctl-2.1-55.el7 (https://git.centos.org/rpms/microcode_ctl/tree/02e5ee5ae4518967956768ffbb6662c30053d67a). For your specific case it may be worth rebuilding the package from there; it comes from the RHEL 7.8 Beta import into CentOS.

We are going to include it in oVirt Node once CentOS 7.8 is GA.

Comment 17 Michal Skrivanek 2020-03-11 16:30:37 UTC
it doesn't seem worth tracking, we're going to pick it up on rebuild, but even before that one can just update microcode directly from Intel or from comment#15

Comment 18 Sandro Bonazzola 2020-03-11 16:54:11 UTC
(In reply to Michal Skrivanek from comment #17)
> it doesn't seem worth tracking, we're going to pick it up on rebuild, but
> even before that one can just update microcode directly from Intel or from
> comment#15

OK, just note that this way there won't be any mention of this issue being fixed in the release notes.

