Since the Spectre/Meltdown fixes landed in the kernel (we think), any attempt at nested virt seems to crash the host VM (the VM that runs the nested guest) a relatively short time after the guest is launched. tflink has run into this with taskotron, and I have with openqa. Tim talked about it a little with Patrick, and Patrick seemed to think it was likely related to the Spectre/Meltdown fixes (not sure which). This is why openqa01 and openqa-stg01 have died shortly after 4am every day for the last little while: that's when createhdds kicks in and tries to create some disk images, which involves launching a VM (via virt-install). Unfortunately the kernel trace does not appear to make it to the system logs, so I don't have one to attach ATM, but this should be relatively easy to reproduce.
Can you try booting with nopti on the kernel command line in some combination of host/guest?
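When testing the host/guest combinations, it may help to confirm on each machine that the option actually took effect. The following is only a minimal sketch, not something from this report: the helper names `pti_disabled` and `nested_virt_enabled` are made up for illustration, and it assumes the usual `/proc/cmdline` and `/sys/module/kvm_intel/parameters/nested` (or `kvm_amd`) paths are present.

```python
from pathlib import Path


def pti_disabled(cmdline: str) -> bool:
    """Return True if the kernel command line asks for page table isolation to be off.

    Both 'nopti' and 'pti=off' are treated as requesting this.
    """
    tokens = cmdline.split()
    return "nopti" in tokens or "pti=off" in tokens


def nested_virt_enabled(value: str) -> bool:
    """Interpret the contents of a kvm 'nested' module parameter file.

    Depending on kernel version and module, the file holds 'Y'/'N' or '1'/'0'.
    """
    return value.strip() in {"1", "Y", "y"}


def read_text_or_none(path: str):
    """Read a file, returning None if it is missing or unreadable."""
    try:
        return Path(path).read_text()
    except OSError:
        return None


if __name__ == "__main__":
    cmdline = read_text_or_none("/proc/cmdline")
    if cmdline is not None:
        print("PTI disabled on command line:", pti_disabled(cmdline))

    # Only one of these modules is normally loaded, depending on the CPU vendor.
    for module in ("kvm_intel", "kvm_amd"):
        nested = read_text_or_none(f"/sys/module/{module}/parameters/nested")
        if nested is not None:
            print(f"{module} nested virt enabled:", nested_virt_enabled(nested))
```

Running it once on the host VM and once in the nested guest should show which side of the chain booted with the option.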
Will do - I'll just have to set up a reproducer somewhere outside of one of our automated testing systems :P
I am assuming these are running Fedora all the way down the chain? host/virt host/guest?
In my case, I was running an F27 VM inside an F27 VM on an EL7 virthost. As far as I know, Adam was running Fedora for all of the systems.
Mine would be F26/F27 in F27 in ??? - I don't know what infra runs openqa01 and openqa-stg01 VMs (which are the 'F27' in that chain) on. Sorry I still didn't get the reproducer, I was going to do it in the Super Secret Meeting this morning but we had to close laptops, then this afternoon I had a ton of meetings 'n' stuff. I'll do it soon, really I promise.
Sorry for the delay, but I just tried this, and the crash still seems to happen even with 'nopti' on the kernel command line. This was with kernel 4.14.16-300.fc27.
Trying with 4.15.3 now.
This looks to be fixed in 4.15.3! openQA staging has rebuilt 3 disk images without crashing so far. With 4.14 it would never even get one done. Marking as fixed, will re-open if it reoccurs.
Spoke too soon, I just got lucky :( First attempt with 4.15.3 got through two images but crashed halfway through the third. After rebooting, another try crashed halfway through the first image it attempted.
Note: still happening with 4.15.14-300.fc27.x86_64 at least. I just crashed the production openQA server three times by trying to re-generate an image on it, having forgotten about this bug...
*********** MASS BUG UPDATE ************** We apologize for the inconvenience. There are a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 27 kernel bugs. Fedora 27 has now been rebased to 4.17.7-100.fc27. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel. If you have moved on to Fedora 28, and are still experiencing this issue, please change the version to Fedora 28. If you experience different issues, please open a new bug report for those.
This is still valid with 4.17.7-200.fc28.x86_64. Just tested it on openqa-stg01, attempted two runs, system crashed both times.
*********** MASS BUG UPDATE ************** We apologize for the inconvenience. There are a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 28 kernel bugs. Fedora 28 has now been rebased to 4.20.5-100.fc28. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel. If you have moved on to Fedora 29, and are still experiencing this issue, please change the version to Fedora 29. If you experience different issues, please open a new bug report for those.
I do believe this really did go away recently, as I'm pretty sure I did a disk image build on one of the servers by mistake and it worked. So I'm gonna close it again. Will reopen again if it turns out I was wrong again. :)