Description of problem:
Upon applying the 4.3.7 update, the hosted engine would no longer boot. I was able to determine this to be caused by the MDS cpu feature no longer being present in the new environment.

Version-Release number of selected component (if applicable):
Pre upgrade: ovirt-node-ng-4.3.5.2-0.20190805.0 (3.10.0-957.27.2.el7.x86_64)
Post upgrade: ovirt-node-ng-4.3.7-0.20191121.0 (3.10.0-1062.4.3.el7.x86_64)

How reproducible:
I was able to reproduce this on two different systems. System details are included below in the Additional info section.

Steps to Reproduce:
1. Install oVirt node 4.3.5.2
2. Verify presence of MDS:
   [root@heavy ~]# dmesg |grep MDS
   [ 0.047694] MDS: Mitigation: Clear CPU buffers
3. Install upgrade package and reboot
4. Verify non-presence of MDS:
   [root@heavy ~]# dmesg |grep MDS
   [ 0.047694] MDS: Vulnerable: Clear CPU buffers attempted, no microcode

Actual results:
hosted-engine will not start due to requirement for presence of MDS cpu flag.

Expected results:
hosted-engine should start.

Additional info:
Both systems in my lab are based on Supermicro X9DRW mainboards with Intel(R) Xeon(R) CPU E5-2670 processors. I was able to work around this issue for now by changing the grub default entry to be the older system/kernel. Is this maybe a kernel issue?
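The check in steps 2 and 4 can be scripted so the two boots are compared the same way. The following is a minimal sketch, not part of the original report: the function name `mds_state` is hypothetical, and it only classifies the status strings that appear in this thread.

```shell
#!/bin/sh
# Classify an MDS status line as printed by the kernel.
# mds_state is a hypothetical helper name used for illustration only.
mds_state() {
    case "$1" in
        *"Mitigation:"*)  echo "mitigated" ;;     # e.g. "MDS: Mitigation: Clear CPU buffers"
        *"Vulnerable"*)   echo "vulnerable" ;;    # e.g. "MDS: Vulnerable: ... no microcode"
        *"Not affected"*) echo "not-affected" ;;
        *)                echo "unknown" ;;
    esac
}
```

Usage, assuming the same `dmesg | grep MDS` command as in the report: `mds_state "$(dmesg | grep 'MDS:')"`. Recent kernels also expose the same status text in /sys/devices/system/cpu/vulnerabilities/mds, which can be passed to the function instead.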
I ran an upgrade test and did not reproduce this bug.

Version-Release number of selected component (if applicable):
ovirt-node-ng-installer-4.3.5-2019080513.el7.iso
ovirt-node-ng-image-update-4.3.7-1.el7.noarch.rpm

Test steps:
1. Install ovirt-node-ng-installer-4.3.5-2019080513.el7.iso
2. Verify presence of MDS:
   # dmesg |grep MDS
3. Install upgrade package and reboot
4. Verify non-presence of MDS:
   # dmesg |grep MDS

Test results:
1. The result of step 2:
   [ 0.024125] MDS: Mitigation: Clear CPU buffers
   [ 0.345002] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
2. The result of step 4:
   [ 0.024395] MDS: Mitigation: Clear CPU buffers
   [ 0.345858] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.

Would you please deploy hosted-engine and try to start it?
(In reply to peyu from comment #2)
> I made a upgrade test and did not reproduce this bug.
>
> Version-Release number of selected component (if applicable):
> ovirt-node-ng-installer-4.3.5-2019080513.el7.iso
> ovirt-node-ng-image-update-4.3.7-1.el7.noarch.rpm
>
> Test steps:
> 1. Install ovirt-node-ng-installer-4.3.5-2019080513.el7.iso
> 2. Verify presence of MDS:
> # dmesg |grep MDS
> 3. Install upgrade package and reboot
> 4. Verify non-presence of MDS:
> # dmesg |grep MDS
>
> Test results:
> 1. The result of step 2:
> [ 0.024125] MDS: Mitigation: Clear CPU buffers
> [ 0.345002] MDS CPU bug present and SMT on, data leak possible. See
> https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more
> details.
>
> 2. The result of step 4:
> [ 0.024395] MDS: Mitigation: Clear CPU buffers
> [ 0.345858] MDS CPU bug present and SMT on, data leak possible. See
> https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more
> details.
>
> Would you please deploy hosted-engine and try to start it?

peyu,
According to the reporter's comment 0 ("I was able to determine this to be caused by the MDS cpu feature no longer being present in the new environment."), this is an upgrade bug that shows up in an environment where hosted engine (HE) has already been deployed. If you need the special machine, I will give it to you after I run the HE automation testing.
What version of microcode_ctl is installed on the node? Also, can you please share some logs (an sosreport would be good)?
I am traveling and won't be back until the 2nd. I will upload SOS reports as soon as I get back. In the meantime, all I can get is the version of microcode_ctl prior to the upgrade. That version is: microcode_ctl-2.1-47.5.el7_6.x86_64
Version of microcode_ctl from the 4.3.7 release: microcode_ctl-2.1-53.3.el7_7.x86_64

I will attach sosreports separately.
Created attachment 1649293: sosreport from 4.3.5.x (prior to update)
Created attachment 1649304: sosreport from 4.3.7 (after update)
I have added the requested sosreports and commented with the relevant versions of the microcode_ctl package.

Checking, I still see the "no microcode" message:

[root@breathe ~]# dmesg |grep MDS
[ 0.048436] MDS: Vulnerable: Clear CPU buffers attempted, no microcode

Let me know if there is anything else I can do to help isolate this issue.
After doing a bit of digging myself, I found a workaround in this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1753541#c13

Specifically, if I do this:
* Boot into 4.3.7
* install -D /dev/null /etc/microcode_ctl/ucode_with_caveats/force-intel-06-2d-07
* dracut -f --early-microcode
* Reboot

Then the system passes my original tests with 4.3.7:

[root@breathe ~]# dmesg |grep MDS
[ 0.048602] MDS: Mitigation: Clear CPU buffers
[root@breathe ~]# cat /sys/devices/system/cpu/cpu0/microcode/version
0x718

Further, I can start the HostedEngine VM on the 4.3.7 host.

If I understand the notes in bug #1753541 correctly, the microcode for my CPU which mitigates MDS was blacklisted due to some sort of issue with similar CPUs on other systems (possibly in bug #1758382), and this "force" workaround is necessary to have the system install the microcode.

Is this not a bug then? Should this be documented somewhere?

Thanks.
-J
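For hosts with a different CPU, the file name in the workaround above would presumably differ. The following is a minimal sketch that only prints the commands rather than running them. It assumes the `06-2d-07` part of the file name is the CPU family, model, and stepping in two-digit hex (which matches the E5-2670 in this report: family 6, model 45, stepping 7); the helper names `cpu_field`, `force_file_name`, and `workaround_commands` are hypothetical.

```shell
#!/bin/sh
# Read one numeric field ("cpu family", "model", "stepping") from /proc/cpuinfo.
# The exact-key match keeps "model" from also matching "model name".
cpu_field() {
    awk -F': ' -v key="$1" '$1 ~ "^" key "[ \t]*$" { print $2; exit }' /proc/cpuinfo
}

# Build the force file name from decimal family, model, and stepping.
# Assumption: the name encodes these three values as two-digit hex.
force_file_name() {
    printf 'force-intel-%02x-%02x-%02x\n' "$1" "$2" "$3"
}

# Print (do not run) the workaround steps from the comment above.
workaround_commands() {
    name=$(force_file_name "$1" "$2" "$3")
    echo "install -D /dev/null /etc/microcode_ctl/ucode_with_caveats/$name"
    echo "dracut -f --early-microcode"
    echo "reboot"
}
```

Usage: `workaround_commands "$(cpu_field 'cpu family')" "$(cpu_field model)" "$(cpu_field stepping)"`. Review the printed commands before running them as root, since forcing a blacklisted microcode update bypasses the caveat that caused it to be held back.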
In 4.3.9 we are going to consume 7.8 so for now targeting this to 4.3.9.
Re-tested with the steps below:

1. Install the ovirt-node ovirt-node-ng-installer-4.3.5-2019080513.el7.iso build

[root@hp-dl388g9-04 ~]# dmesg |grep MDS
[ 0.044526] MDS: Vulnerable: Clear CPU buffers attempted, no microcode

2. Deploy hosted-engine with ovirt-engine-appliance-4.3-20190731.1.el7.x86_64.rpm

[root@hp-dl388g9-05 ~]# hosted-engine --vm-status

--== Host hp-dl388g9-05.lab.eng.pek2.redhat.com (id: 1) status ==--

conf_on_shared_storage : True
Status up-to-date : True
Hostname : hp-dl388g9-05.lab.eng.pek2.redhat.com
Host ID : 1
Engine status : {"health": "good", "vm": "up", "detail": "Up"}
Score : 3400
stopped : False
Local maintenance : False
crc32 : eed8ccf1
local_conf_timestamp : 4433
Host timestamp : 4433
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=4433 (Thu Jan 16 09:06:48 2020)
    host-id=1
    score=3400
    vm_conf_refresh_time=4433 (Thu Jan 16 09:06:48 2020)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineUp
    stopped=False

[root@hp-dl388g9-05 ~]# dmesg |grep MDS
[ 0.044592] MDS: Vulnerable: Clear CPU buffers attempted, no microcode

3. Install the upgrade package and reboot

[root@hp-dl388g9-05 ~]# imgbase w
You are on ovirt-node-ng-4.3.7-0.20191121.0+1
[root@hp-dl388g9-05 ~]# dmesg |grep MDS
[ 0.045146] MDS: Vulnerable: Clear CPU buffers attempted, no microcode
[root@hp-dl388g9-05 ~]# hosted-engine --vm-status

--== Host hp-dl388g9-05.lab.eng.pek2.redhat.com (id: 1) status ==--

conf_on_shared_storage : True
Status up-to-date : True
Hostname : hp-dl388g9-05.lab.eng.pek2.redhat.com
Host ID : 1
Engine status : {"health": "good", "vm": "up", "detail": "Up"}
Score : 3400
stopped : False
Local maintenance : False
crc32 : a3fc8a7e
local_conf_timestamp : 484
Host timestamp : 483
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=483 (Thu Jan 16 10:35:58 2020)
    host-id=1
    score=3400
    vm_conf_refresh_time=484 (Thu Jan 16 10:35:58 2020)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineUp
    stopped=False

In addition, I installed the ovirt-node ovirt-node-ng-installer-4.3.7-2019112110.el7.iso build and then deployed hosted-engine with ovirt-engine-appliance-4.3-20191121.1.el7.x86_64.rpm

[root@hp-dl388g9-04 ~]# hosted-engine --vm-status

--== Host hp-dl388g9-04.lab.eng.pek2.redhat.com (id: 1) status ==--

conf_on_shared_storage : True
Status up-to-date : True
Hostname : hp-dl388g9-04.lab.eng.pek2.redhat.com
Host ID : 1
Engine status : {"health": "good", "vm": "up", "detail": "Up"}
Score : 3400
stopped : False
Local maintenance : False
crc32 : ca0a81bf
local_conf_timestamp : 4793
Host timestamp : 4793
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=4793 (Thu Jan 16 09:40:30 2020)
    host-id=1
    score=3400
    vm_conf_refresh_time=4793 (Thu Jan 16 09:40:30 2020)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineUp
    stopped=False

[root@hp-dl388g9-04 ~]# dmesg |grep MDS
[ 0.045049] MDS: Vulnerable: Clear CPU buffers attempted, no microcode

QE cannot reproduce this issue.
Also, the issue does not appear to be related to the presence of MDS. I have two thoughts on this:
1. I suggest the reporter wait a long time after the upgrade for the engine VM to come up with health "good". Sometimes, if the VM status is checked while it is still being retrieved, the result shows as failed.
2. If the engine VM still cannot be retrieved after a long time, the issue may be specific to particular machines.

If so, rhbugzilla, could you please help verify this bug after it is fixed? Thanks!
Wei,

First, let me say I don't think this is a bug/issue with the hosted engine. Rather, I think this is situational, due to upstream/downstream changing the microcode packages. I am more than happy to test/validate after any further updates, especially now that I know how to work around the microcode issue.

-Joshua
This is a microcode issue. According to bug 1753541, it has been fixed as of microcode_ctl-2.1-54.el7, and the microcode_ctl version in the latest RHVH (redhat-virtualization-host-4.3.9-20200204.0.el7_8 + microcode_ctl-2.1-61.el7.x86_64) is higher than the fixed version, so the bug should be gone. Feel free to re-open it if you can still reproduce this issue in the future. Thanks.
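To check whether a given host already carries a microcode_ctl release at or above the fixed one, the release strings can be compared. The following is a rough sketch: the function name `release_at_least` is hypothetical, and it relies on GNU `sort -V`, which only approximates RPM's own version comparison but is sufficient for the release strings quoted in this thread.

```shell
#!/bin/sh
# Succeed if release $1 is at least release $2, using GNU sort's version ordering.
# Approximation only: RPM's comparison rules differ in some edge cases.
release_at_least() {
    lowest=$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)
    [ "$lowest" = "$2" ]
}
```

Usage, with 54.el7 taken from the comment above as the fixed release: `release_at_least "$(rpm -q --qf '%{RELEASE}\n' microcode_ctl)" 54.el7 && echo "fixed package installed"`.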
Moving back to new state. CentOS and oVirt Node NG are still shipping microcode_ctl-2.1-53.7.el7_7.x86_64, which is the problematic microcode_ctl package. In CentOS Git I see imports/c7-beta/microcode_ctl-2.1-55.el7 (https://git.centos.org/rpms/microcode_ctl/tree/02e5ee5ae4518967956768ffbb6662c30053d67a); for your specific case it may be worth rebuilding the package from there. It comes from the RHEL 7.8 Beta import in CentOS. We are going to include it in oVirt Node once CentOS 7.8 reaches GA.
It doesn't seem worth tracking; we're going to pick it up on rebuild. Even before that, one can just update the microcode directly from Intel or from comment #15.
(In reply to Michal Skrivanek from comment #17)
> it doesn't seem worth tracking, we're going to pick it up on rebuild, but
> even before that one can just update microcode directly from Intel or from
> comment#15

OK, just note that this way there won't be any mention of this issue being fixed in the release notes.