Bug 1784906
| Summary: | Upon updating to 4.3.7, the CPU feature MDS is no longer presented and the hosted engine will not boot. | | |
|---|---|---|---|
| Product: | [oVirt] ovirt-node | Reporter: | Joshua Kocinski <rhbugzilla> |
| Component: | Installation & Update | Assignee: | Sandro Bonazzola <sbonazzo> |
| Status: | CLOSED NEXTRELEASE | QA Contact: | Wei Wang <weiwang> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.3 | CC: | bugs, cshao, lsvaty, mavital, michal.skrivanek, mtessun, nlevy, peyu, qiyuan, sbonazzo, shlei, weiwang, yaniwang, yturgema |
| Target Milestone: | ovirt-4.3.9 | Keywords: | Reopened |
| Target Release: | --- | Flags: | sbonazzo: ovirt-4.3?; peyu: testing_plan_complete+; cshao: testing_ack? |
| Hardware: | x86_64 | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-03-11 16:30:37 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Node | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1753541 | | |
| Bug Blocks: | | | |
| Attachments: | | | |

Description
Joshua Kocinski
2019-12-18 16:19:05 UTC
I made an upgrade test and did not reproduce this bug.

Version-Release number of selected component (if applicable):
ovirt-node-ng-installer-4.3.5-2019080513.el7.iso
ovirt-node-ng-image-update-4.3.7-1.el7.noarch.rpm

Test steps:
1. Install ovirt-node-ng-installer-4.3.5-2019080513.el7.iso
2. Verify presence of MDS:
# dmesg |grep MDS
3. Install upgrade package and reboot
4. Verify non-presence of MDS:
# dmesg |grep MDS

Test results:
1. The result of step 2:
[ 0.024125] MDS: Mitigation: Clear CPU buffers
[ 0.345002] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
2. The result of step 4:
[ 0.024395] MDS: Mitigation: Clear CPU buffers
[ 0.345858] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.

Would you please deploy hosted-engine and try to start it?

(In reply to peyu from comment #2)
> I made an upgrade test and did not reproduce this bug.
>
> Version-Release number of selected component (if applicable):
> ovirt-node-ng-installer-4.3.5-2019080513.el7.iso
> ovirt-node-ng-image-update-4.3.7-1.el7.noarch.rpm
>
> Test steps:
> 1. Install ovirt-node-ng-installer-4.3.5-2019080513.el7.iso
> 2. Verify presence of MDS:
> # dmesg |grep MDS
> 3. Install upgrade package and reboot
> 4. Verify non-presence of MDS:
> # dmesg |grep MDS
>
> Test results:
> 1. The result of step 2:
> [ 0.024125] MDS: Mitigation: Clear CPU buffers
> [ 0.345002] MDS CPU bug present and SMT on, data leak possible. See
> https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more
> details.
>
> 2. The result of step 4:
> [ 0.024395] MDS: Mitigation: Clear CPU buffers
> [ 0.345858] MDS CPU bug present and SMT on, data leak possible. See
> https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more
> details.
>
> Would you please deploy hosted-engine and try to start it?

peyu,
According to the reporter's comment 0, this is an upgrade bug that shows up in an environment where HE has already been deployed ("I was able to determine this to be caused by the MDS cpu feature no longer being present in the new environment."). If you need the special machine, I will give it to you after I run HE automation testing.

What version of microcode_ctl is installed on the node? Also, can you please share some logs (an sosreport would be good)?

I am traveling and won't be back until the 2nd. I will upload SOS reports as soon as I get back. In the meantime, all I can get is the version of microcode_ctl prior to the upgrade. That version is:
microcode_ctl-2.1-47.5.el7_6.x86_64

Version of microcode_ctl from the 4.3.7 release:
microcode_ctl-2.1-53.3.el7_7.x86_64

I will attach sosreports separately.

Created attachment 1649293 [details]
sosreport from 4.3.5.x
sosreport from 4.3.5.x (prior to update)
Created attachment 1649304 [details]
sosreport from 4.3.7
sosreport from 4.3.7 (after update)
I have added the requested sosreports and commented with the relevant versions of the microcode_ctl package.

Checking, I still see the "no microcode" message:

[root@breathe ~]# dmesg |grep MDS
[ 0.048436] MDS: Vulnerable: Clear CPU buffers attempted, no microcode

Let me know if there is anything else I can do to help isolate this issue.

After doing a bit of digging myself, I found a workaround in this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1753541#c13

Specifically, if I do this:
* Boot into 4.3.7
* install -D /dev/null /etc/microcode_ctl/ucode_with_caveats/force-intel-06-2d-07
* dracut -f --early-microcode
* Reboot

Then the system passes my original tests with 4.3.7:

[root@breathe ~]# dmesg |grep MDS
[ 0.048602] MDS: Mitigation: Clear CPU buffers
[root@breathe ~]# cat /sys/devices/system/cpu/cpu0/microcode/version
0x718

Further, I can start the HostedEngine VM on the 4.3.7 host.

If I understand the notes in bug #1753541 correctly, the microcode for my CPU which mitigates MDS was blacklisted due to some sort of issue with similar CPUs on other systems (possibly in bug #1758382) and this "force" workaround is necessary to have the system install the microcode.

Is this not a bug then? Should this be documented somewhere?

Thanks.
-J

In 4.3.9 we are going to consume 7.8 so for now targeting this to 4.3.9.

Re-test with below steps:
1. Install ovirt-node ovirt-node-ng-installer-4.3.5-2019080513.el7.iso build
[root@hp-dl388g9-04 ~]# dmesg |grep MDS
[ 0.044526] MDS: Vulnerable: Clear CPU buffers attempted, no microcode
2. Deploy hosted-engine with ovirt-engine-appliance-4.3-20190731.1.el7.x86_64.rpm
[root@hp-dl388g9-05 ~]# hosted-engine --vm-status
--== Host hp-dl388g9-05.lab.eng.pek2.redhat.com (id: 1) status ==--
conf_on_shared_storage : True
Status up-to-date : True
Hostname : hp-dl388g9-05.lab.eng.pek2.redhat.com
Host ID : 1
Engine status : {"health": "good", "vm": "up", "detail": "Up"}
Score : 3400
stopped : False
Local maintenance : False
crc32 : eed8ccf1
local_conf_timestamp : 4433
Host timestamp : 4433
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=4433 (Thu Jan 16 09:06:48 2020)
host-id=1
score=3400
vm_conf_refresh_time=4433 (Thu Jan 16 09:06:48 2020)
conf_on_shared_storage=True
maintenance=False
state=EngineUp
stopped=False
[root@hp-dl388g9-05 ~]# dmesg |grep MDS
[ 0.044592] MDS: Vulnerable: Clear CPU buffers attempted, no microcode
3. Install upgrade package, reboot
[root@hp-dl388g9-05 ~]# imgbase w
You are on ovirt-node-ng-4.3.7-0.20191121.0+1
[root@hp-dl388g9-05 ~]# dmesg |grep MDS
[ 0.045146] MDS: Vulnerable: Clear CPU buffers attempted, no microcode
[root@hp-dl388g9-05 ~]# hosted-engine --vm-status
--== Host hp-dl388g9-05.lab.eng.pek2.redhat.com (id: 1) status ==--
conf_on_shared_storage : True
Status up-to-date : True
Hostname : hp-dl388g9-05.lab.eng.pek2.redhat.com
Host ID : 1
Engine status : {"health": "good", "vm": "up", "detail": "Up"}
Score : 3400
stopped : False
Local maintenance : False
crc32 : a3fc8a7e
local_conf_timestamp : 484
Host timestamp : 483
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=483 (Thu Jan 16 10:35:58 2020)
host-id=1
score=3400
vm_conf_refresh_time=484 (Thu Jan 16 10:35:58 2020)
conf_on_shared_storage=True
maintenance=False
state=EngineUp
stopped=False
You have new mail in /var/spool/mail/root
In addition, I installed the ovirt-node ovirt-node-ng-installer-4.3.7-2019112110.el7.iso build, then deployed hosted-engine with ovirt-engine-appliance-4.3-20191121.1.el7.x86_64.rpm
[root@hp-dl388g9-04 ~]# hosted-engine --vm-status
--== Host hp-dl388g9-04.lab.eng.pek2.redhat.com (id: 1) status ==--
conf_on_shared_storage : True
Status up-to-date : True
Hostname : hp-dl388g9-04.lab.eng.pek2.redhat.com
Host ID : 1
Engine status : {"health": "good", "vm": "up", "detail": "Up"}
Score : 3400
stopped : False
Local maintenance : False
crc32 : ca0a81bf
local_conf_timestamp : 4793
Host timestamp : 4793
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=4793 (Thu Jan 16 09:40:30 2020)
host-id=1
score=3400
vm_conf_refresh_time=4793 (Thu Jan 16 09:40:30 2020)
conf_on_shared_storage=True
maintenance=False
state=EngineUp
stopped=False
[root@hp-dl388g9-04 ~]# dmesg |grep MDS
[ 0.045049] MDS: Vulnerable: Clear CPU buffers attempted, no microcode
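
The dmesg checks in the transcripts in this report show two distinct kernel messages: "MDS: Mitigation: Clear CPU buffers" and "MDS: Vulnerable: Clear CPU buffers attempted, no microcode". As a minimal sketch of how that check could be automated, the Python below classifies captured dmesg text. The function name classify_mds and its return labels are hypothetical, and it only recognizes the two message forms quoted in this report.

```python
import re

# Matches kernel log lines such as
# "[ 0.048436] MDS: Vulnerable: Clear CPU buffers attempted, no microcode".
# The "MDS CPU bug present and SMT on" line has no colon after "MDS",
# so it is deliberately not matched.
_MDS_LINE = re.compile(r"\bMDS:\s*(Mitigation|Vulnerable)\b")


def classify_mds(dmesg_text):
    """Classify captured dmesg output (hypothetical helper).

    Returns "mitigated", "vulnerable", or "unknown" when no MDS status
    line of the two forms seen in this report is found.
    """
    for line in dmesg_text.splitlines():
        match = _MDS_LINE.search(line)
        if match:
            if match.group(1) == "Mitigation":
                return "mitigated"
            return "vulnerable"
    return "unknown"
```

Feeding it the output of `dmesg | grep MDS` from the hosts above would label every QE run as "vulnerable" and the reporter's post-workaround boot as "mitigated".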
QE cannot reproduce this issue, and the issue is not related to the presence of MDS. I see two possibilities:
1. The reporter should wait longer after the upgrade for the engine VM to come up and report good health. If the VM status is checked while the VM is still being retrieved, the check can return a failed result.
2. If the engine VM still cannot be retrieved after a long wait, this issue may occur only on specific hardware. If so,
rhbugzilla,
could you please help to verify this bug after it is fixed?
Thanks!
Wei,

First, let me say I don't think this is a bug/issue with the hosted engine. Rather, I think this is situational due to upstream/downstream changing the microcode packages. I am more than happy to test/validate after any further updates, especially now that I know how to work around the microcode issue.

-Joshua

This is a microcode issue. According to bug 1753541, this issue has been fixed after microcode_ctl-2.1-54.el7, and the microcode_ctl version in the latest RHVH (redhat-virtualization-host-4.3.9-20200204.0.el7_8 + microcode_ctl-2.1-61.el7.x86_64) is higher than the fixed version, so the bug should be gone. Feel free to re-open it if you can still reproduce this issue in the future. Thanks.

Moving back to new state. CentOS and oVirt Node NG are still shipping microcode_ctl-2.1-53.7.el7_7.x86_64, which is the problematic microcode_ctl package.

In CentOS Git I see imports/c7-beta/microcode_ctl-2.1-55.el7 (https://git.centos.org/rpms/microcode_ctl/tree/02e5ee5ae4518967956768ffbb6662c30053d67a); for your specific case it may be worth rebuilding the package from there. It's from the RHEL 7.8 Beta import in CentOS. We are going to include it in oVirt Node once CentOS 7.8 goes GA.

It doesn't seem worth tracking; we're going to pick it up on rebuild, but even before that one can just update microcode directly from Intel or from comment #15.

(In reply to Michal Skrivanek from comment #17)
> it doesn't seem worth tracking, we're going to pick it up on rebuild, but
> even before that one can just update microcode directly from Intel or from
> comment#15

OK, just note that this way the release notes won't mention that this issue has been fixed.
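
The closing comments turn on whether a given microcode_ctl build is older or newer than 2.1-54.el7, the version cited from bug 1753541. The Python below is a rough sketch of that comparison for the package strings quoted in this report. It is a simplification, not RPM's real version-comparison algorithm: it compares only the version and the leading numeric part of the release. The helper names parse_nvr and has_fix are hypothetical, and treating 2.1-54.el7 itself as fixed is an assumption.

```python
# Build cited in this report as containing the fix (per bug 1753541).
FIXED_IN = ("2.1", "54.el7")


def parse_nvr(package):
    """Split 'name-version-release[.arch]' into (name, version, release)."""
    for arch in (".x86_64", ".noarch"):
        if package.endswith(arch):
            package = package[: -len(arch)]
    name, version, release = package.rsplit("-", 2)
    return name, version, release


def _numeric_prefix(text):
    """Leading dot-separated integers, e.g. '53.7.el7_7' -> (53, 7)."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def has_fix(package, fixed=FIXED_IN):
    """True if the package is at or beyond the fixed build.

    Assumption: the fixed build itself counts as fixed, and only the
    numeric prefixes of version and release are compared.
    """
    _, version, release = parse_nvr(package)
    installed = (_numeric_prefix(version), _numeric_prefix(release))
    required = (_numeric_prefix(fixed[0]), _numeric_prefix(fixed[1]))
    return installed >= required
```

Under these assumptions, microcode_ctl-2.1-53.7.el7_7.x86_64 is reported as lacking the fix, while microcode_ctl-2.1-55.el7 and microcode_ctl-2.1-61.el7.x86_64 are reported as having it, which matches the comments above.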