Bug 1578846

Summary: unstable production kernels 4.16.x under Xenserver
Product: Fedora
Reporter: customercare
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED CURRENTRELEASE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: rawhide
CC: airlied, bskeggs, ewk, hdegoede, ichavero, itamar, jarodwilson, jcline, jglisse, john.j5live, jonathan, josef, kernel-maint, linville, mchehab, mjg59, steved
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-07-13 16:39:29 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1588395

Description customercare 2018-05-16 13:31:09 UTC
Description of problem:

Kernels in the 4.16.x series crash after varying runtimes inside XenServer VMs.

#####
##### 4.16.5+ are HIGHLY UNSTABLE #####
#####

We have different VMs and Xen versions running; the problem is not bound to a specific one.

Check this boot log:

reboot   system boot  4.15.17-200.fc26 Mon May  7 00:24   still running
reboot   system boot  4.16.5-200.fc27. Sun May  6 19:29 - 23:49  (04:19)
reboot   system boot  4.16.5-200.fc27. Sun May  6 16:57 - 23:49  (06:52)
reboot   system boot  4.16.5-200.fc27. Sun May  6 15:17 - 23:49  (08:32)
reboot   system boot  4.16.5-200.fc27. Sun May  6 13:48 - 23:49  (10:00)
reboot   system boot  4.16.5-200.fc27. Sat May  5 21:53 - 23:49 (1+01:55)
reboot   system boot  4.16.5-200.fc27. Sat May  5 11:59 - 23:49 (1+11:50)
reboot   system boot  4.16.5-200.fc27. Fri May  4 12:24 - 23:49 (2+11:25)
reboot   system boot  4.16.5-200.fc27. Fri May  4 11:54 - 23:49 (2+11:55)
reboot   system boot  4.15.17-200.fc26 Fri May  4 11:04 - 11:53  (00:48)
reboot   system boot  4.15.12-201.fc26 Thu Mar 29 12:12 - 11:03 (35+22:51)
reboot   system boot  4.15.9-200.fc26. Wed Mar 14 23:40 - 12:10 (14+11:30)
reboot   system boot  4.14.14-200.fc26 Sat Jan 27 04:33 - 23:39 (46+19:05)
reboot   system boot  4.14.13-200.fc26 Mon Jan 15 17:10 - 04:33 (11+11:22)
reboot   system boot  4.13.13-200.fc26 Thu Nov 23 20:47 - 17:10 (52+20:22)
reboot   system boot  4.13.13-100.fc25 Thu Nov 23 20:13 - 20:47  (00:33)
reboot   system boot  4.11.12-200.fc25 Thu Jul 27 14:46 - 20:13 (119+06:27)

next server:

reboot   system boot  4.15.9-200.fc26. Wed May 16 14:05   still running
reboot   system boot  4.16.7-100.fc26. Wed May 16 13:36 - 14:06  (00:30)
reboot   system boot  4.16.7-100.fc26. Wed May 16 10:33 - 14:06  (03:33)
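
(The listings above match the output of last(1); for reference, a minimal way to regenerate them, assuming /var/log/wtmp has not been rotated away:)

  # print reboot records, newest first, from /var/log/wtmp
  last reboot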


Version-Release number of selected component (if applicable):

4.16.x

How reproducible:

Reproduces on multiple VMs and Xen versions; time to crash varies (see the boot logs above).

Steps to Reproduce:
1. Boot a 4.16.x kernel inside a XenServer VM.
2. Let the VM run under normal load.
3. Wait for the spontaneous reset (anywhere from minutes to days).

Actual results:

The kernel crashes roughly 30-90 minutes into a run.

Expected results:

Stable operation, as with 4.15.17.

Additional info:


For more information on how we discovered the bug, see:

https://bugzilla.redhat.com/show_bug.cgi?id=1575403

Comment 1 Jeremy Cline 2018-05-30 19:54:27 UTC
Hi,

Please attach the complete kernel log from a boot that crashes. You can get the log with "journalctl -k" and use the "-b" flag to select a previous boot log.
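
For example, a minimal sketch (the "-1" boot offset and the output filename are illustrative; pick whichever boot actually crashed):

  # list the boots journald knows about, then dump the kernel log of the previous one
  journalctl --list-boots
  journalctl -k -b -1 > previous-boot-kernel.log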

Thanks.

Comment 2 customercare 2018-10-02 16:34:49 UTC
Kernel 4.18.10-100 crashed after 8 1/2 hours on a low-traffic production server.

reboot   system boot  4.15.17-200.fc26 Tue Oct  2 18:30   still running
reboot   system boot  4.18.10-100.fc27 Tue Oct  2 18:19 - 18:30  (00:11)
reboot   system boot  4.18.10-100.fc27 Tue Oct  2 09:51 - 18:30  (08:39)

Comment 3 customercare 2019-04-08 15:11:49 UTC
Currently testing kernel 5.0.5 for stability against XenServer 7.6; result pending.

Comment 4 customercare 2019-04-19 15:20:17 UTC
Preliminary result: 5.0.6 kernels seem to be stable again with Xen.

Comment 5 customercare 2019-04-19 15:27:47 UTC
@Jeremy:

Due to the nature of the crashes, all logs are lost to the ensuing filesystem corruption. It is a spontaneous reset of the entire VM; as a result, you don't even see an oops message on the connected display, because it is overwritten by the server's restart.
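
Where on-disk logs are lost like this, one workaround (a sketch, not something attempted in this report) is to stream kernel messages to another machine with netconsole; the addresses, ports, interface name, and MAC below are illustrative assumptions:

  # on the crashing VM: forward printk output over UDP
  # local port 6665 on eth0 -> listener 192.168.1.5:6666 (MAC of the next hop)
  modprobe netconsole netconsole=6665@192.168.1.2/eth0,6666@192.168.1.5/00:11:22:33:44:55
  # on the listening host (BSD netcat syntax):
  nc -u -l 6666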

I upgraded several VMs to 5.0.x kernels last week and all seem to run stably again. Whatever fixed the problem must have landed in the last few 4.20.x releases or in 5.0.x, since I tried several 4.20.x kernels and all of those tested failed.
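
To pin down the fixing change, one could bisect the mainline tree between a broken and a working release; a sketch, assuming v4.20 still crashes and v5.0 survives, with each step requiring a kernel build plus a boot test in an affected Xen VM:

  git bisect start --term-old=broken --term-new=fixed
  git bisect broken v4.20
  git bisect fixed v5.0
  # after building and boot-testing each suggested commit:
  git bisect broken   # if the VM still resets
  git bisect fixed    # if it stays up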

Comment 6 customercare 2020-06-06 06:59:44 UTC
Request to close