This might be a very tricky one to figure out, but I thought I'd best file it in case it worries people.

When kernel 4.10 showed up in Fedora Rawhide - in 20161218.n.1 - several openQA tests started failing. The same set of tests failed for 20161218.n.1 and 20161219.n.0 in both production and staging openQA.

The affected set of tests appears to be 'any x86_64 install with a package set other than minimal' - the other package sets tested are Server and KDE (there would also be Workstation, but there are dep issues in Workstation ATM which prevent it being installed). The symptom is that the install runs fine, but when booting the installed system, the boot process appears to hang shortly after grub.

I've just figured out that this appears to be related to the use of '-cpu host' in the openQA tests. When I adjust openQA so that the test VMs are launched with '-cpu Nehalem' instead of '-cpu host', they seem to run fine.

We have two different types of worker host machine (the boxes on which the qemu processes actually run) for openQA ATM: some have Xeon E5540 CPUs, some have Xeon E5-2450 CPUs. We have seen this problem on both. e.g. here's a failure on qa05 (E5540):

https://openqa.fedoraproject.org/tests/51527

and here's a failure on qa14 (E5-2450):

https://openqa.fedoraproject.org/tests/51589

I seem to be able to produce a similar effect on my desktop, which has a Core i7-2600K CPU: if I boot a Rawhide 20161219.n.0 install with virt-manager's 'Copy host CPU configuration' box checked, I get a similar result.
That doesn't actually do '-cpu host', instead it does this:

-cpu Westmere,+vme,+ds,+acpi,+ss,+ht,+tm,+pbe,+pclmuldq,+dtes64,+monitor,+ds_cpl,+vmx,+est,+tm2,+xtpr,+pdcm,+pcid,+tsc-deadline,+xsave,+osxsave,+avx,+arat,+xsaveopt,+rdtscp

From that scenario, I can get a traceback, which looks like this:

[ 2.217005] CPU: 0 PID: 139 Comm: cryptomgr_test Not tainted 4.10.0-0.rc0.git4.1.fc26.x86_64 #1
[ 2.217005] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1.fc26 04/01/2014
[ 2.217005] task: ffff9f0475b9b100 task.stack: ffffb443c075c000
[ 2.217005] RIP: 0010:aes_ctr_enc_128_avx_by8+0xa/0x1250
[ 2.217005] RSP: 0018:ffffb443c075f868 EFLAGS: 00010206
[ 2.217005] RAX: 0000000000000010 RBX: 0000000000000000 RCX: ffff9f0475b2f000
[ 2.217005] RDX: ffff9f0475b7a470 RSI: ffffb443c075fda8 RDI: ffff9f0475b2f000
[ 2.217005] RBP: ffffb443c075f870 R08: 0000000000000040 R09: ffff9f0475b2f000
[ 2.217005] R10: ffff9f0475b2f000 R11: ffff9f0440000000 R12: ffff9f0475b7a470
[ 2.217005] R13: ffff9f0475bae290 R14: ffffffffb5f16e00 R15: ffff9f0475bae240
[ 2.217005] FS: 0000000000000000(0000) GS:ffff9f04bfc00000(0000) knlGS:0000000000000000
[ 2.217005] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2.217005] CR2: 00007fd34fc59844 CR3: 000000007ce0c000 CR4: 00000000000006f0
[ 2.217005] Call Trace:
[ 2.217005] ? aesni_ctr_enc_avx_tfm+0x41/0x50
[ 2.217005] ctr_crypt+0x87/0x1c0
[ 2.217005] simd_skcipher_encrypt+0xb7/0xc0
[ 2.217005] __test_skcipher+0x387/0xcd0
[ 2.217005] ? kvm_clock_read+0x25/0x30
[ 2.217005] ? sched_clock+0x9/0x10
[ 2.217005] ? sched_clock_local+0x17/0x80
[ 2.217005] ? sched_clock+0x9/0x10
[ 2.217005] ? sched_clock_local+0x17/0x80
[ 2.217005] ? rcu_read_lock_sched_held+0x4a/0x80
[ 2.217005] ? __kmalloc+0x2e5/0x320
[ 2.217005] ? crypto_create_tfm+0x2a/0xe0
[ 2.217005] ? crypto_create_tfm+0x41/0xe0
[ 2.217005] ? crypto_spawn_tfm2+0x34/0x60
[ 2.217005] ? cryptd_skcipher_init_tfm+0x1d/0x40
[ 2.217005] ? crypto_skcipher_init_tfm+0x88/0x190
[ 2.217005] ? crypto_create_tfm+0x41/0xe0
[ 2.217005] ? crypto_alloc_tfm+0x79/0x120
[ 2.217005] ? crypto_alloc_skcipher+0x19/0x20
[ 2.217005] ? cryptd_alloc_skcipher+0x77/0xc0
[ 2.217005] ? simd_skcipher_init+0x24/0x40
[ 2.217005] ? crypto_skcipher_init_tfm+0x88/0x190
[ 2.217005] ? crypto_create_tfm+0x41/0xe0
[ 2.217005] test_skcipher+0x27/0xa0
[ 2.217005] alg_test_skcipher+0x47/0xb0
[ 2.217005] alg_test+0x1c4/0x3e0
[ 2.217005] ? __schedule+0x302/0xae0
[ 2.217005] cryptomgr_test+0x41/0x50
[ 2.217005] kthread+0x10f/0x150
[ 2.217005] ? crypto_acomp_scomp_free_ctx+0x30/0x30
[ 2.217005] ? kthread_create_on_node+0x60/0x60
[ 2.217005] ? kthread_create_on_node+0x60/0x60
[ 2.217005] ret_from_fork+0x2a/0x40
[ 2.217005] Code: c1 31 73 d9 08 c5 79 7e c8 41 89 42 08 eb 05 c4 41 7a 7f 0a 4c 89 f4 41 5f 41 5e 41 5d 41 5c c3 90 49 83 f8 10 0f 82 31 12 00 00 <c5> 79 6f 0d de c3 96 00 c5 7a 6f 06 c4 42 39 00 c1 4d 89 c2 49
[ 2.217005] RIP: aes_ctr_enc_128_avx_by8+0xa/0x1250 RSP: ffffb443c075f868
[ 2.249428] ---[ end trace 890405263e80f7ad ]---
[ 2.249775] note: cryptomgr_test[139] exited with preempt_count 1
[ 2.250337] cryptomgr_test (139) used greatest stack depth: 12416 bytes left

Prior to the 20161218.n.1 compose - i.e. with 20161215.n.0 and earlier - the tests were running fine with '-cpu host'.
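A note on reading the register dump in the trace: the CR4 value can be decoded by hand, and a small script makes that less error-prone. The sketch below is only an illustration. The bit positions are the architectural ones from the Intel SDM (only a subset is listed), the helper name `decode_cr4` is made up for this example, and the interpretation is an assumption rather than anything confirmed in this report. It may be relevant because the faulting byte marked `<c5>` in the Code: line appears to be a VEX prefix, i.e. an AVX-encoded instruction, and AVX instructions need the OS to have set CR4.OSXSAVE.

```python
# Architectural CR4 bit positions (subset; taken from the Intel SDM).
CR4_BITS = {
    4: "PSE",
    5: "PAE",
    6: "MCE",
    7: "PGE",
    9: "OSFXSR",
    10: "OSXMMEXCPT",
    18: "OSXSAVE",
}


def decode_cr4(value):
    """Return the names of the listed CR4 bits that are set in value."""
    return [name for bit, name in sorted(CR4_BITS.items()) if value & (1 << bit)]


# CR4 value copied verbatim from the trace in this report.
cr4 = int("00000000000006f0", 16)
flags = decode_cr4(cr4)
print(flags)
print("OSXSAVE set:", "OSXSAVE" in flags)
```

If OSXSAVE is indeed clear in the guest while the AVX code path is still selected, that would be consistent with the crash in aes_ctr_enc_128_avx_by8, but that is a guess to be checked, not a diagnosis.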
> I've just figured out that this appears to be related to the use of '-cpu
> host' in the openQA tests. When I adjust openQA so that the test VMs are
> launched with '-cpu Nehalem' instead of '-cpu host', they seem to run fine.

Westmere was the first to introduce HW AES-NI acceleration[1]; Nehalem was its predecessor, so that probably explains why it works with that option.

[1] https://en.wikipedia.org/wiki/AES_instruction_set#Intel_and_AMD_x86_architecture
There look to be significant changes to AES-NI in this cycle, so I'd start there, e.g.:

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=85671860caaca2f3831f675d48356810731a33eb
In case it helps triage at all, I do *not* see the bug when using '-cpu host' on a Core i7-3537U.
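For comparing a guest that boots with one that hangs, it may help to record which of the relevant CPU features each guest actually sees. The following is a rough sketch, assuming a Linux guest where /proc/cpuinfo is readable; the function names `cpu_flags` and `report` are hypothetical helpers for this example, and the default list of flags is just the set that seems relevant to the AES-NI/AVX code path in the trace.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def report(cpuinfo_text, wanted=("aes", "avx", "xsave", "pclmulqdq")):
    """Map each flag name in wanted to whether the CPU advertises it."""
    flags = cpu_flags(cpuinfo_text)
    return {name: name in flags for name in wanted}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(report(f.read()))
    except OSError as exc:
        # Not a Linux system, or /proc is not mounted.
        print("could not read /proc/cpuinfo:", exc)
```

Running this in a guest started with '-cpu host' on each host type, and in one started with '-cpu Nehalem', would give a quick side-by-side of what differs.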
*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 25 kernel bugs.

Fedora 25 has now been rebased to 4.10.9-200.fc25. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 26, and are still experiencing this issue, please change the version to Fedora 26.

If you experience different issues, please open a new bug report for those.
*********** MASS BUG UPDATE ************** This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 2 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.