1406609 – Installed system with kernel 4.10 fails to boot in 'qemu-kvm -cpu host' on Xeon E5540 and E5-2450



Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	25
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-12-21 03:13 UTC by Adam Williamson
Modified:	2019-01-09 12:54 UTC
CC List:	10 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2017-04-28 17:09:12 UTC
Type:	Bug
Embargoed:
Dependent Products:
Flags:	jforbes: needinfo?


Description Adam Williamson 2016-12-21 03:13:55 UTC

This might be a very tricky one to figure out, but I thought I'd best file it in case it worries people.

When kernel 4.10 showed up in Fedora Rawhide - in 20161218.n.1 - several openQA tests started failing. The same set of tests failed for 20161218.n.1 and 20161219.n.0 in both production and staging openQA.

The affected set of tests appears to be 'any x86_64 install with a package set other than minimal' - the other package sets tested are Server and KDE (there would also be Workstation, but there are dep issues in Workstation ATM which prevent it being installed).

The symptom is that the install runs fine, but when booting the installed system, the boot process appears to hang shortly after grub.

I've just figured out that this appears to be related to the use of '-cpu host' in the openQA tests. When I adjust openQA so that the test VMs are launched with '-cpu Nehalem' instead of '-cpu host', they seem to run fine.

We have two different types of worker host machine (the boxes on which the qemu processes actually run) for openQA ATM: some have Xeon E5540 CPUs, some have Xeon E5-2450 CPUs. We have seen this problem on both. e.g. here's a failure on qa05 (E5540):

https://openqa.fedoraproject.org/tests/51527

and here's a failure on qa14 (E5-2450):

https://openqa.fedoraproject.org/tests/51589

I seem to be able to produce a similar effect on my desktop, which has a Core i7-2600K CPU: if I boot a Rawhide 20161219.n.0 install with virt-manager's 'Copy host CPU configuration' box checked, I get a similar result. That doesn't actually use '-cpu host'; instead it passes this:

-cpu Westmere,+vme,+ds,+acpi,+ss,+ht,+tm,+pbe,+pclmuldq,+dtes64,+monitor,+ds_cpl,+vmx,+est,+tm2,+xtpr,+pdcm,+pcid,+tsc-deadline,+xsave,+osxsave,+avx,+arat,+xsaveopt,+rdtscp
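To make the option string above easier to compare against other CPU configurations, here is a minimal Python sketch that splits it into a base model and its explicitly added features. The helper name parse_cpu_spec is hypothetical, and the sketch only handles the 'model,+feat,-feat' form shown above, not every syntax QEMU accepts.

```python
def parse_cpu_spec(spec):
    """Split a '-cpu' value of the form 'model,+feat,-feat' into
    (model, enabled, disabled). Hypothetical helper for illustration."""
    model, *rest = spec.split(",")
    enabled, disabled = set(), set()
    for item in rest:
        if item.startswith("+"):
            enabled.add(item[1:])
        elif item.startswith("-"):
            disabled.add(item[1:])
    return model, enabled, disabled


# The exact value quoted in this report.
SPEC = ("Westmere,+vme,+ds,+acpi,+ss,+ht,+tm,+pbe,+pclmuldq,+dtes64,+monitor,"
        "+ds_cpl,+vmx,+est,+tm2,+xtpr,+pdcm,+pcid,+tsc-deadline,+xsave,"
        "+osxsave,+avx,+arat,+xsaveopt,+rdtscp")

model, enabled, disabled = parse_cpu_spec(SPEC)
# Show which of a few features are explicitly added on top of the base model.
print(model, sorted(enabled & {"avx", "xsave", "osxsave", "aes"}))
```

Note that a feature missing from the '+' list may still be part of the base model, so this only shows what is added explicitly.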

From that scenario, I can get a traceback, which looks like this:

[    2.217005] CPU: 0 PID: 139 Comm: cryptomgr_test Not tainted 4.10.0-0.rc0.git4.1.fc26.x86_64 #1
[    2.217005] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1.fc26 04/01/2014
[    2.217005] task: ffff9f0475b9b100 task.stack: ffffb443c075c000
[    2.217005] RIP: 0010:aes_ctr_enc_128_avx_by8+0xa/0x1250
[    2.217005] RSP: 0018:ffffb443c075f868 EFLAGS: 00010206
[    2.217005] RAX: 0000000000000010 RBX: 0000000000000000 RCX: ffff9f0475b2f000
[    2.217005] RDX: ffff9f0475b7a470 RSI: ffffb443c075fda8 RDI: ffff9f0475b2f000
[    2.217005] RBP: ffffb443c075f870 R08: 0000000000000040 R09: ffff9f0475b2f000
[    2.217005] R10: ffff9f0475b2f000 R11: ffff9f0440000000 R12: ffff9f0475b7a470
[    2.217005] R13: ffff9f0475bae290 R14: ffffffffb5f16e00 R15: ffff9f0475bae240
[    2.217005] FS:  0000000000000000(0000) GS:ffff9f04bfc00000(0000) knlGS:0000000000000000
[    2.217005] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.217005] CR2: 00007fd34fc59844 CR3: 000000007ce0c000 CR4: 00000000000006f0
[    2.217005] Call Trace:
[    2.217005]  ? aesni_ctr_enc_avx_tfm+0x41/0x50
[    2.217005]  ctr_crypt+0x87/0x1c0
[    2.217005]  simd_skcipher_encrypt+0xb7/0xc0
[    2.217005]  __test_skcipher+0x387/0xcd0
[    2.217005]  ? kvm_clock_read+0x25/0x30
[    2.217005]  ? sched_clock+0x9/0x10
[    2.217005]  ? sched_clock_local+0x17/0x80
[    2.217005]  ? sched_clock+0x9/0x10
[    2.217005]  ? sched_clock_local+0x17/0x80
[    2.217005]  ? rcu_read_lock_sched_held+0x4a/0x80
[    2.217005]  ? __kmalloc+0x2e5/0x320
[    2.217005]  ? crypto_create_tfm+0x2a/0xe0
[    2.217005]  ? crypto_create_tfm+0x41/0xe0
[    2.217005]  ? crypto_spawn_tfm2+0x34/0x60
[    2.217005]  ? cryptd_skcipher_init_tfm+0x1d/0x40
[    2.217005]  ? crypto_skcipher_init_tfm+0x88/0x190
[    2.217005]  ? crypto_create_tfm+0x41/0xe0
[    2.217005]  ? crypto_alloc_tfm+0x79/0x120
[    2.217005]  ? crypto_alloc_skcipher+0x19/0x20
[    2.217005]  ? cryptd_alloc_skcipher+0x77/0xc0
[    2.217005]  ? simd_skcipher_init+0x24/0x40
[    2.217005]  ? crypto_skcipher_init_tfm+0x88/0x190
[    2.217005]  ? crypto_create_tfm+0x41/0xe0
[    2.217005]  test_skcipher+0x27/0xa0
[    2.217005]  alg_test_skcipher+0x47/0xb0
[    2.217005]  alg_test+0x1c4/0x3e0
[    2.217005]  ? __schedule+0x302/0xae0
[    2.217005]  cryptomgr_test+0x41/0x50
[    2.217005]  kthread+0x10f/0x150
[    2.217005]  ? crypto_acomp_scomp_free_ctx+0x30/0x30
[    2.217005]  ? kthread_create_on_node+0x60/0x60
[    2.217005]  ? kthread_create_on_node+0x60/0x60
[    2.217005]  ret_from_fork+0x2a/0x40
[    2.217005] Code: c1 31 73 d9 08 c5 79 7e c8 41 89 42 08 eb 05 c4 41 7a 7f 0a 4c 89 f4 41 5f 41 5e 41 5d 41 5c c3 90 49 83 f8 10 0f 82 31 12 00 00 <c5> 79 6f 0d de c3 96 00 c5 7a 6f 06 c4 42 39 00 c1 4d 89 c2 49 
[    2.217005] RIP: aes_ctr_enc_128_avx_by8+0xa/0x1250 RSP: ffffb443c075f868
[    2.249428] ---[ end trace 890405263e80f7ad ]---
[    2.249775] note: cryptomgr_test[139] exited with preempt_count 1
[    2.250337] cryptomgr_test (139) used greatest stack depth: 12416 bytes left
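For anyone triaging the trace above, here is a small Python sketch of how two of its fields could be read: the CR4 value and the byte the kernel marks with <..> in the Code: line. The helper names are hypothetical, and the CR4 table is only the subset of architectural bit names needed for this value. It is a reading aid, not a diagnosis.

```python
import re

# Subset of x86 CR4 bit names (bit number -> name), enough for this trace.
CR4_BITS = {
    4: "PSE", 5: "PAE", 6: "MCE", 7: "PGE",
    9: "OSFXSR", 10: "OSXMMEXCPT", 18: "OSXSAVE",
}


def cr4_flags(value):
    """Return the names of the known CR4 bits set in value."""
    return [name for bit, name in sorted(CR4_BITS.items()) if (value >> bit) & 1]


def faulting_byte(code_line):
    """Return the byte the kernel wrapped in <..>, or None if absent."""
    match = re.search(r"<([0-9a-f]{2})>", code_line)
    return int(match.group(1), 16) if match else None


CR4 = 0x6f0  # from "CR4: 00000000000006f0" in the trace
CODE = "0f 82 31 12 00 00 <c5> 79 6f 0d de c3 96 00"  # tail of the Code: line

print(cr4_flags(CR4))
print(hex(faulting_byte(CODE)))
```

In 64-bit mode 0xc5 is the two-byte VEX prefix, so the marked instruction is AVX-encoded, and the CR4 value shown does not have the OSXSAVE bit set. Whether that combination is the actual cause is not established in this report.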

Prior to the 20161218.n.1 compose - i.e. with 20161215.n.0 and earlier - the tests were running fine with '-cpu host'.

Comment 1 Peter Robinson 2016-12-21 03:26:08 UTC

> I've just figured out that this appears to be related to the use of '-cpu
> host' in the openQA tests. When I adjust openQA so that the test VMs are
> launched with '-cpu Nehalem' instead of '-cpu host', they seem to run fine.

Westmere was the first to introduce hardware AES-NI acceleration[1]; Nehalem was its predecessor, which probably explains why it works with that option.

[1] https://en.wikipedia.org/wiki/AES_instruction_set#Intel_and_AMD_x86_architecture
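One way to check what a given '-cpu' setting actually exposes is to look at the flags line of /proc/cpuinfo inside the guest. A minimal Python sketch follows; the helper name cpu_flags and the sample text are made up for illustration and are not output from any of the machines in this report.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of
    /proc/cpuinfo-style text, or an empty set if there is none."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


# Made-up sample in /proc/cpuinfo format, not taken from a real host.
SAMPLE = "processor\t: 0\nflags\t\t: fpu vme sse2 ssse3 sse4_1 sse4_2 popcnt\n"

flags = cpu_flags(SAMPLE)
# Report whether the features relevant to this bug are visible.
print({name: name in flags for name in ("aes", "avx", "xsave")})
```

On a real guest the input would be the contents of /proc/cpuinfo, for example cpu_flags(open("/proc/cpuinfo").read()).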

Comment 2 Peter Robinson 2016-12-21 03:28:54 UTC

There look to be significant changes to AES-NI in this cycle, so I'd start there, e.g.:

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=85671860caaca2f3831f675d48356810731a33eb

Comment 3 Adam Williamson 2016-12-21 03:37:26 UTC

In case it helps triage at all, I do *not* see the bug when using '-cpu host' on a Core i7-3537U.

Comment 4 Justin M. Forbes 2017-04-11 14:50:56 UTC

*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 25 kernel bugs.

Fedora 25 has now been rebased to 4.10.9-200.fc25.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 26, and are still experiencing this issue, please change the version to Fedora 26.

If you experience different issues, please open a new bug report for those.

Comment 5 Justin M. Forbes 2017-04-28 17:09:12 UTC

*********** MASS BUG UPDATE **************
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 2 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.
