1579925 – Lenovo T500 Laptop does not boot with the latest kernel 4.16.8-300.fc28.x86_64

Status:	CLOSED EOL
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Version:	29
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance

Reported:	2018-05-18 16:53 UTC by Albert Flügel
Modified:	2019-11-27 23:32 UTC
CC List:	21 users
Last Closed:	2019-11-27 23:32:48 UTC
Type:	Bug

Attachments
Screen photo with the stack after the hang and the watchdog timeout (807.03 KB, image/jpeg) 2018-10-06 13:54 UTC, Albert Flügel
Screen photo with the messages before the hang with loglevel=6 (700.15 KB, image/jpeg) 2018-10-06 13:57 UTC, Albert Flügel
Screen photo about 80 seconds after the hang (2.91 MB, image/jpeg) 2019-01-29 20:18 UTC, Albert Flügel
Screen photo about 80 seconds after the hang with 4.20.8-100 (2.19 MB, image/jpeg) 2019-02-24 12:56 UTC, Albert Flügel

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Linux Kernel	201645	0	None	None	None	2019-05-09 16:30:59 UTC

Description Albert Flügel 2018-05-18 16:53:43 UTC

Description of problem:
T500 Laptop does not boot. I see the following normal line:

 Booting 'Fedora (4.16.8-300.fc28.x86_64) 28 (Twenty Eight)'

Then nothing happens for 60.8 seconds, after which I see the following lines:
[  60.817084] INFO: rcu_sched detected stalls on CPUs/tasks:
[  60.817145] o1-...!: (0 ticks this GP) idle=910/0/0 softirq=584/584 fqs=0
[  60.817197] o(detected by 0, t=60002 jiffies, g=-153, c=-154, q=28)
[  60.818077] rcu_sched kthread starved for 60002 jiffies! g18446744073709551463 c18446744073709551462 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=1

The same lines, with different timestamps and jiffies values and slightly
different g=, c= and q= values, appear again after 240, 300, 480, 540, 720,
780 and 960 seconds. Then I reset the laptop, as I did not have any hope
of seeing anything else.
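
The very large g and c numbers in the "kthread starved" line appear to be the same counters as the negative g= and c= values two lines earlier, just printed as unsigned 64-bit integers. A minimal Python sketch of that reading follows; the function name is made up for illustration, and the claim that both lines show the same counters is an assumption based on the numbers matching:

```python
def as_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as a signed (two's complement) one."""
    value &= (1 << 64) - 1          # keep only the low 64 bits
    return value - (1 << 64) if value >= (1 << 63) else value


# Values copied from the "rcu_sched kthread starved" line above.
g = as_signed64(18446744073709551463)
c = as_signed64(18446744073709551462)
print(g, c)  # -153 -154, matching g=-153, c=-154 in the "detected by" line
```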


Version-Release number of selected component (if applicable):
4.16.8-300.fc28

How reproducible:
always

Steps to Reproduce:
1. Install the named kernel
2. Boot


Actual results:
Laptop does not boot, see above

Expected results:
Laptop boots normally

Additional info:
The laptop has an Intel(R) Core(TM)2 Duo CPU P9600 2.66 GHz processor, if that matters.
Booting the predecessor 4.16.7-300 works normally.
I see the problem neither on an AMD FX6300, nor on a Pentium M i686,
nor in a virtual machine with an i7-3930K as host CPU.

Comment 1 Albert Flügel 2018-05-19 12:29:20 UTC

With 4.16.9-300 from the updates-testing repo the problem does not happen every time, but on about every 3rd boot. When the boot works, about every 2nd time I briefly see the following message:

[  10.0something] ata1: COMRESET failed (errno: 16)

Comment 2 Joachim Frieben 2018-05-21 07:54:03 UTC

The same issue occurs fairly frequently on a Lenovo ThinkPad T400 upon reboot, but not after powering the system off and on. I think I saw this issue with recent kernels prior to kernel-4.16.8-300.fc28, too.

Comment 3 Albert Flügel 2018-05-24 19:08:55 UTC

With 4.16.11-300 I see the same behaviour at reboot as with 4.16.9-300, with the additional oddity that the display backlight is reduced to minimum.
I cannot confirm that it happens only at reboot and not after power cycling.
Most times the laptop boots at power-on, but then it always shows the "COMRESET failed" message for a moment.
If it does not boot, I see the rcu_sched stall messages quoted in the bug description.

Comment 4 Albert Flügel 2018-05-27 12:12:23 UTC

With 4.16.12-300 from updates-testing the (probably only alleged) CPU stalls do not happen anymore, neither during reboot nor at power-on. I still see the "COMRESET failed (errno=-16)" message, which delays boot by 10 seconds.
This probably has to do with the fact that the laptop does not have a conventional hard disk, but a (Kingston) SSD. That would also explain why there aren't many more bug reports related to Lenovo laptops (I would at least expect them). It would be interesting to know whether Mr. Frieben also has an SSD in his T400.

Comment 5 Joachim Frieben 2018-05-30 17:09:06 UTC

(In reply to Albert Flügel from comment #4)
Same issue when I booted into kernel 4.16.12-300.fc28 just after installing it. And no, I am using a standard hard drive.

Comment 6 mmarget 2018-06-04 21:20:16 UTC

I have the same issue on a ThinkPad T500 booting kernel-4.16.13-300.fc28.x86_64 after swapping out the standard hard drive for an Integral SSD. Kernel 4.16.3-301 boots fine with the SSD.
With the old standard hard drive, kernels 4.16.12-300 and 4.16.11-300 were booting without issues.

Comment 7 Jeremy Cline 2018-06-05 14:04:37 UTC

Hi folks,

Before anything else I'd recommend trying out the 4.17 kernel to see whether the issue is already fixed[0]. Since this sounds like a race condition that doesn't always happen, I recommend booting a number of times to see if you can trigger it.

According to the documentation, there's supposed to be a stack trace after the stall warning. Can you remove "quiet" and "rhgb" from the kernel command line to see if that makes the stack trace show up? Including exactly what's printed at 240 and 300 seconds would be helpful as well, since it can give clues about what state the CPU is stuck in.
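
For a one-off test, the two options can be deleted from the kernel line in the boot loader's edit mode. What remains is simply the same command line without those two words. As a minimal sketch (the helper name and the sample command line are hypothetical, not taken from an affected machine):

```python
def drop_params(cmdline: str, unwanted=("quiet", "rhgb")) -> str:
    """Return the kernel command line without the given bare parameters."""
    kept = [word for word in cmdline.split() if word not in unwanted]
    return " ".join(kept)


# Hypothetical command line for illustration only.
sample = "ro root=/dev/mapper/fedora-root rhgb quiet"
print(drop_params(sample))  # ro root=/dev/mapper/fedora-root
```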

For those that can reliably reproduce this, it would be very helpful to:

* Confirm the problem is not present in 4.16.7 and is present in 4.16.8+

* Bisect the kernel[1] and determine the exact commit that introduced the problem.

This will greatly increase the chances of this getting fixed. Thanks!


[0] https://koji.fedoraproject.org/koji/buildinfo?buildID=1088633
[1] https://docs.fedoraproject.org/quick-docs/en-US/kernel/troubleshooting.html#bisecting-the-kernel

Comment 8 Albert Flügel 2018-06-06 11:23:43 UTC

With 4.17 (which is an FC29 RPM; however, I guess it makes no big difference) it happens on about every 5th boot or power cycle.
What I generally get (also printed by older kernels) is:
at around 6.28 seconds: ata1: link is slow to respond, please be patient (ready=0)
It would be nice if this did not happen, because it seems to cause unnecessary wait time.

In the following text the on-screen messages are typed from a video, so I mostly leave off the timestamps to save typing. I hope things remain clear.

In the failing case after the messages

[0.85]
Non-volatile memory driver v1.3
Linux agpgart interface v0.183
ACPI: Battery Slot [BAT0] (battery present)

there is a 13 seconds wait time, then:
[13.9] random: crng init done

Next comes what I have already posted, then the stack:
[60.84]
INFO: rcu_sched detected stalls on CPUs/tasks:
o1-...!: (0 ticks this GP) idle=688/0/0 softirq=171/171 fqs=0
o(detected by 0, t=60002 jiffies, g=-142, c=-143, q=9)
Sending NMI from CPU 0 to CPU 1:
NMI backtrace for cpu 1 skipped: idling at acpi_processor_ffh_cstate_enter+0x65/0xb0
rcu_sched kthread starved for 60002 jiffies! g18446744073709551474 c18446744073709551473 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=1
RCU grace-period kthread stack dump:
rcu_sched       I    0     9      2 0x80000000
Call Trace:
 ? __schedule+0x234/0x850
 schedule+0x28/0x80
 schedule_timeout+0x166/0x380
 ? __next_timer_interrupt+0xc0/0xc0
 rcu_gp_kthread+0x368/0x830
 ? rcu_process_callbacks+0x4f0/0x4f0
 kthread+0x112/0x130
 ? kthread_create_worker_on_cpu+0x70/0x70
 ret_from_fork+0x35/0x40

Next messages appear at 240.84. They are mostly identical to the above.
Except for the jiffies counter the only differences are:
idle=328/0/0 ... fqs=1
... t=240007

Same for 300.85:
idle=cc4/0/0 ... fqs=0
o(... t=60002 ... g=-141, c=-142, q=8)

From my experience the problem clearly appeared between 4.16.7 and 4.16.8.

Another thing that seems new to me with 4.17.0 is the following 3 messages
from 0.910186 to 0.923187 regarding tpm:
[    0.861628] tpm_tis 00:05: 1.2 TPM (device-id 0x1020, rev-id 6)
[    0.866548] ACPI: Battery Slot [BAT0] (battery present)
[    0.910186] tpm tpm0: A TPM error (6) occurred attempting to read a pcr value
[    0.910268] tpm tpm0: TPM is disabled/deactivated (0x6)
[    0.923187] tpm tpm0: A TPM error (6) occurred attempting get random
[    0.923628] ahci 0000:00:1f.2: version 3.0

They appear during a successful boot, when the machine does not hang.

Bisecting the kernel takes even more time. I'll do this only if no one has a clue by then.

Comment 9 Jeremy Cline 2018-06-06 16:27:35 UTC

Hi Albert,

The TPM messages are unrelated.

Bisecting between a single stable release will be quite fast because the change set is very small (both in terms of commit count and change) and will point us at the exact change that introduced this problem. Doing so will save us both time since the other options are for me to guess at likely patches to revert, build, and have you test, or take this directly to upstream where you'll need to guess who to email since we don't know exactly which change introduced the regression. It's quite possible the first thing they'll ask you to do is bisect, anyway.

Comment 10 Albert Flügel 2018-06-06 19:17:09 UTC

Side note: I've just seen it happen with 4.16.12-300 as well.

I'll do the bisecting as soon as I find the time.

Comment 11 Joachim Frieben 2018-06-07 05:47:44 UTC

(In reply to Albert Flügel from comment #8)
- current kernel-4.16.13-300.fc28 fails to boot most of the time on a Lenovo ThinkPad T400.
- after deactivating TPM in the BIOS, TPM-related messages show up for kernel-4.16.13-300.fc28, too.

Comment 12 mmarget 2018-06-08 00:46:23 UTC

While bisecting I found out that when you build your own kernel, the initramfs file is about 3x larger than the one provided by Fedora. It needs more time to unpack while booting, which results in the bug not showing up. I will try again with a smaller initramfs.

Comment 13 Albert Flügel 2018-06-08 07:17:14 UTC

Not only this, the /usr/lib/modules/<version> directory gets huge, mostly because the kernel object files are not compressed. I'd like to complain in general about the "quick docs" on how to bisect and build the kernel. I followed the advice, but did something wrong:
git bisect start 4.16.7 4.17.8
No error message, no warning, just nothing. Then I built (resulting in a full root filesystem, because I'm such an old-fashioned idiot who separates system and data filesystems), only to see in the end that it had built 4.17.0. That is of no use, and the following git bisect commands fail. Up to this point, hours wasted.
For someone who does this for the first time, the "quick docs" are IMHO inappropriate. At least please write down an example. How can I know that I have to add the letter v before the version? Why does git bisect not complain? Is there a make target to compress the kernel objects of the modules?
I'd really like to open a documentation bug report. Plus one against git bisect.
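
For readers hitting the same trap: release tags in the kernel git trees are spelled with a leading "v" (v4.16.7, not 4.16.7), and `git bisect start` takes the bad revision first, then the good one. A small sketch that builds the commands for a given version pair follows. The helper names are invented here, and the versions are the ones named in comment 7:

```python
def kernel_tag(version: str) -> str:
    """Turn '4.16.7' into the tag name 'v4.16.7' (idempotent)."""
    return version if version.startswith("v") else "v" + version


def bisect_commands(good: str, bad: str) -> list:
    """Command lines that start a bisect between two release tags."""
    return [
        "git bisect start",
        f"git bisect bad {kernel_tag(bad)}",
        f"git bisect good {kernel_tag(good)}",
    ]


for line in bisect_commands(good="4.16.7", bad="4.16.8"):
    print(line)
```

Marking the two ends in separate commands, rather than passing both to `git bisect start`, keeps it obvious which revision is which.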

Comment 14 mmarget 2018-06-08 13:31:56 UTC

git bisect concludes to this:

1ab4ca7c59d45b2563754053e9b9fb7c40bdf795 is the first bad commit
commit 1ab4ca7c59d45b2563754053e9b9fb7c40bdf795
Author: Peter Zijlstra <peterz>
Date:   Mon Apr 30 12:00:12 2018 +0200

    x86/tsc: Fix mark_tsc_unstable()
    
    commit e3b4f79025e0a4eb7e2a2c7d24dadfa1e38893b0 upstream.
    
    mark_tsc_unstable() also needs to affect tsc_early, Now that
    clocksource_mark_unstable() can be used on a clocksource irrespective of
    its registration state, use it on both tsc_early and tsc.
    
    This does however require cs->list to be initialized empty, otherwise it
    cannot tell the registation state before registation.
    
    Fixes: aa83c45762a2 ("x86/tsc: Introduce early tsc clocksource")
    Signed-off-by: Peter Zijlstra (Intel) <peterz>
    Signed-off-by: Thomas Gleixner <tglx>
    Tested-by: Diego Viola <diego.viola>
    Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki>
    Cc: len.brown
    Cc: rjw
    Cc: rui.zhang
    Cc: stable.org
    Link: https://lkml.kernel.org/r/20180430100344.533326547@infradead.org
    Signed-off-by: Greg Kroah-Hartman <gregkh>

:040000 040000 b829d8dfc910976ccaa33e332cc7e25eb9734f5b 39fd7ceba2809b37dbfcece1085c67e4b04b608e M	arch

Comment 15 Jeremy Cline 2018-06-08 19:13:52 UTC

Thanks mmarget! I *think* I see what's happening. Can people test out [0] once it's finished building (after making sure you can reproduce it on 4.16.14, I expect you should) and let me know if you can reproduce it there?

Albert, thanks for the feedback on the documentation. I was the author of those docs and it's good to hear where they are lacking. I'll see about improving them. As for git, I'm surprised it didn't complain, but you'll have to take that up with git's upstream.

[0] https://koji.fedoraproject.org/koji/taskinfo?taskID=27492345

Comment 16 mmarget 2018-06-09 00:50:25 UTC

Hi again, I can reproduce the bug on 4.16.14 as well as the kernel you provided. 

As for the documentation stuff, I found it much more intuitive to use the following:

git bisect start
git bisect good <tag>
git bisect bad <tag>

This way git also complains if you give it an invalid revision.

Comment 17 Albert Flügel 2018-06-09 09:59:18 UTC

Yes, I also used this synopsis now. However, I obtained only "good" kernels and git bisect led to this:
dec316ea18281d2892324a4bfeb4d5a8a6605e69 is the first bad commit
commit dec316ea18281d2892324a4bfeb4d5a8a6605e69
Author: Greg Kroah-Hartman <gregkh>
Date:   Wed May 9 09:53:14 2018 +0200

    Linux 4.16.8

:100644 100644 1c5d5d8c45e215f553014a5385d0b087b752414a 5da6ffd69209aeefa3ebc54b1be956a8e8693ee4 M      Makefile

which I do not believe. Somehow the bug (race condition) is not hit. I also observed that the initrd is 4 times bigger. But I don't think this is what hides the bug here, as the initrd is unpacked by the loader (or has this changed? As I understand it, the drivers in the initrd are (not always, but in many cases) needed to access the root filesystem).

And during the last git next i got this message, don't know if that matters:
[b11873bfabc767fccc7e57b9a13f34b039386ff5] tracing: Fix bad use of igrab in trace_uprobe.c

I'll try this 4.16.14-301 next.

Comment 18 Albert Flügel 2018-06-09 10:10:37 UTC

4.16.14-301 does not help. rcu_sched detects stalls ...

Comment 19 Albert Flügel 2018-06-09 10:16:00 UTC

I'm trying to bisect from 4.16.7 to 4.16.9 now. If it's really the last commit of 4.16.8, this will verify it - and it costs only one more step thanks to the binary search.
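
The "one more step" estimate follows from bisection halving the candidate range on each build: N commits need roughly ceil(log2(N)) steps, so doubling the range adds a single step. A quick check with made-up commit counts (the real number of commits between these tags is not stated in this report):

```python
import math


def bisect_steps(commits: int) -> int:
    """Approximate number of builds needed to bisect `commits` candidates."""
    return math.ceil(math.log2(commits)) if commits > 1 else 0


# Hypothetical sizes: one stable release vs. two releases of similar size.
print(bisect_steps(50))   # 6
print(bisect_steps(100))  # 7
```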

Comment 20 mmarget 2018-06-09 10:36:55 UTC

Albert, as I wrote earlier: when you build the kernel yourself, the resulting initramfs is much larger than the ones provided by Fedora and takes longer to load at boot time. That means you cannot reproduce the bug this way.

What I did: after "make modules_install && make install" I changed /boot/grub2/grub.cfg so that the custom-built kernel loads initramfs-4.16.8-300.fc28.x86_64 instead.
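
The edit amounts to pointing the initrd line of the custom kernel's menu entry at the stock image. A sketch of that substitution on a text snippet follows. The entry is a simplified, hypothetical grub.cfg fragment, and the `.img` file names and the `-custom` version suffix are assumptions:

```python
import re


def use_stock_initramfs(entry: str, stock_image: str) -> str:
    """Replace the image path on initrd/initrd16/initrdefi lines of one menu entry."""
    pattern = re.compile(r"^([ \t]*initrd(?:16|efi)?[ \t]+)\S+", flags=re.MULTILINE)
    return pattern.sub(lambda m: m.group(1) + stock_image, entry)


# Simplified, hypothetical menu entry body for a self-built kernel.
entry = (
    "linux16 /vmlinuz-4.16.8-custom root=/dev/sda2 ro\n"
    "initrd16 /initramfs-4.16.8-custom.img\n"
)
print(use_stock_initramfs(entry, "/initramfs-4.16.8-300.fc28.x86_64.img"))
```

A manual change like this is likely to be lost whenever grub.cfg is regenerated, so it is only suitable for a bisect session.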

Comment 21 Albert Flügel 2018-06-09 12:28:40 UTC

Folks,
could it be that tsc_unstable should be initialized to 1 or 0 in arch/x86/kernel/tsc.c?
When I set
tsc=unstable or tsc=reliable or tsc=noirqtime
on the kernel command line, the laptop ALWAYS boots with 4.16.14-301.
I tried at least 15 times. Without this parameter it hangs in at least 2 of 3 boots.
So for me the workaround is tsc=unstable. It seems to be set later by the kernel
anyway if I don't do it:
Jun  8 07:18:27 tiramisu kernel: tsc: Marking TSC unstable due to TSC halts in idle

Your job, i think.
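
To keep the workaround across reboots, the parameter has to end up on the kernel command line of every boot entry. The intended end state can be sketched as follows; the helper name and the sample command lines are hypothetical:

```python
def with_param(cmdline: str, param: str = "tsc=unstable") -> str:
    """Return cmdline with `param` set exactly once, replacing any earlier value for the same key."""
    key = param.split("=", 1)[0]
    words = [w for w in cmdline.split() if w.split("=", 1)[0] != key]
    return " ".join(words + [param])


print(with_param("ro root=/dev/sda2 rhgb quiet"))
# ro root=/dev/sda2 rhgb quiet tsc=unstable
print(with_param("ro tsc=reliable quiet"))
# ro quiet tsc=unstable
```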

Comment 22 Jeremy Cline 2018-06-12 20:13:09 UTC

tsc_unstable is declared static so it should be initialized to 0.

Upstream can't reproduce this so we're going to have to narrow things down a bit. I don't think this is related to the power policy since that was always enabled in Fedora 28, but for the sake of thoroughness, can you see if you can reproduce it with "ahci.mobile_lpm_policy=0" on the kernel command line?

Also, it might be helpful to learn which call to "mark_tsc_unstable" is responsible for this. Adding "initcall_debug" should print out the initcalls as they're executed, so the last one before it hangs would be interesting to know.

Thanks!

Comment 23 Albert Flügel 2018-06-13 09:49:22 UTC

Currently the hang occurs extremely rarely. No idea why. Probably the rainy weather or earth radiation. Nonetheless I've seen a hang without ahci.mobile_lpm_policy=0 and without tsc=unstable but with initcall_debug (rhgb and quiet removed). I did not see any additional output compared to what I already posted.

Whether ahci.mobile_lpm_policy=0 helps instead of tsc=unstable is currently hard to tell because of the rare occurrence.

A vague impression is that the hang is also prevented by removing quiet and/or adding initcall_debug. Do they cause different timing?

However, what is wrong with initializing tsc_unstable to 0?

Comment 24 Albert Flügel 2018-06-13 10:33:15 UTC

In the meantime i've seen several hangs with ahci.mobile_lpm_policy=0 , this seems not to help. There seems to be a difference between 4.16.14-300 and 4.16.14-301: With 4.16.14-300 i see no message at all, just a blinking cursor on the black screen (waited 5 minutes for sth. to happen), with 4.16.14-301 i see after 60 seconds the well-known rcu_sched messages. Could be coincidence. However i report it here, probably it gives a hint.

Comment 25 Jeremy Cline 2018-06-18 18:08:54 UTC

Thanks for confirming that. It's unfortunate, but if it really is a race condition (which seems more and more likely) slowing things down with extra logging may be enough to make it not happen or happen very infrequently.

Explicitly marking tsc=unstable is an okay workaround, but ideally we should track the race down and fix it.

Comment 26 Justin M. Forbes 2018-07-23 14:59:25 UTC

*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 28 kernel bugs.

Fedora 28 has now been rebased to 4.17.7-200.fc28. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you experience different issues, please open a new bug report for those.

Comment 27 Joachim Frieben 2018-07-23 19:16:49 UTC

Lenovo ThinkPad T400 got stuck right after booting into kernel 4.17.7-200.fc28 for the very first time.

Comment 28 Albert Flügel 2018-07-27 06:43:52 UTC

Still the same with 4.17.7-200.fc28. It does not always get stuck, but often and when it does, i see an additional line after the 60000 ms pause:
RCU grace-period kthread stack dump:

and nothing more.

Comment 29 Joachim Frieben 2018-08-11 09:00:03 UTC

I have seen this issue up to kernel 4.17.7-200.fc28; after installing that kernel I added "tsc=unstable" permanently to the kernel boot options. Since then, the issue has not occurred again.

Comment 30 Albert Flügel 2018-08-11 15:26:13 UTC

Still the same with 4.17.12-200

Comment 31 Albert Flügel 2018-08-27 10:12:00 UTC

Still the same with 4.17.18-200

Comment 32 Laura Abbott 2018-10-01 21:29:27 UTC

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 28 kernel bugs.
 
Fedora 28 has now been rebased to 4.18.10-300.fc28. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.
 
If you have moved on to Fedora 29, and are still experiencing this issue, please change the version to Fedora 29.
 
If you experience different issues, please open a new bug report for those.

Comment 33 Joachim Frieben 2018-10-02 09:48:02 UTC

Issue is still present for kernel-4.18.10-300.fc28 unless kernel option "tsc=unstable" is added.

Comment 34 Albert Flügel 2018-10-03 10:21:33 UTC

Still the same with 4.18.11-200

Comment 35 Albert Flügel 2018-10-03 18:02:19 UTC

Hello Joachim Frieben,
can you try a patch i attempted:
http://getwings.ddns.net/pub/kernel-4.8.11-201.fc28.tar
Thank you !
Here 20 boots were ok without tsc=unstable with a mixture of reboot, power-cycle, battery-off

Comment 36 Joachim Frieben 2018-10-03 18:43:51 UTC

(In reply to Albert Flügel from comment #35)
Please attach your patch to this bug report and ask Laura to do a scratch build after reviewing it. Thanks!

Comment 37 Albert Flügel 2018-10-03 18:56:44 UTC

In the meantime it turned out that it did not help. About the 26th boot hung again like we know it :-( Sorry for the noise ...

Comment 38 Albert Flügel 2018-10-06 13:54:15 UTC

Created attachment 1491101
Screen photo with the stack after the hang and the watchdog timeout

Comment 39 Albert Flügel 2018-10-06 13:57:04 UTC

Created attachment 1491102
Screen photo with the messages before the hang with loglevel=6

After increasing the loglevel I got additional messages, which can be seen in the attached screen photos. I hope this is helpful.

Comment 40 Albert Flügel 2018-10-16 19:26:56 UTC

Still the same with 4.18.13-200.
A polite question only: is there any interest in understanding and fixing this?

Comment 41 Joachim Frieben 2018-10-17 17:15:19 UTC

(In reply to Albert Flügel from comment #40)
I will file a bug upstream because the issue is still around and requires kernel option "tsc=unstable" as a workaround.

Comment 42 Albert Flügel 2018-11-04 21:19:12 UTC

Just for info: still the same with 4.18.16-200

Comment 43 Justin M. Forbes 2019-01-29 16:26:19 UTC

*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 28 kernel bugs.

Fedora 28 has now been rebased to 4.20.5-100.fc28. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 29, and are still experiencing this issue, please change the version to Fedora 29.

If you experience different issues, please open a new bug report for those.

Comment 44 Albert Flügel 2019-01-29 20:18:51 UTC

Created attachment 1524794
Screen photo about 80 seconds after the hang

I can't find 4.20.5 in the current updates or updates-testing repo, so I tried with 4.20.4-100.
The result is much the same. Please see the screen photo.

Comment 45 Joachim Frieben 2019-02-01 05:43:37 UTC

(In reply to Albert Flügel from comment #44)
While I was not able to reproduce this issue with kernels of the 4.19.x series, I can confirm that kernel 4.20.4 leads to frequent lock-ups on both Fedora 28 and 29.

Comment 46 Albert Flügel 2019-02-24 12:54:52 UTC

Indeed, I could not trigger the problem with either 4.19.3 or 4.19.5. I did not try other 4.19 kernels. However, with the current 4.20.8-100 I still see it. Please see the attached screen photo, taken with earlyprintk=console loglevel=6.

Comment 47 Albert Flügel 2019-02-24 12:56:10 UTC

Created attachment 1538160
Screen photo about 80 seconds after the hang with 4.20.8-100

Comment 48 Joachim Frieben 2019-02-25 07:28:23 UTC

(In reply to Albert Flügel from comment #46)
Please try kernel 4.20.11-100.fc28 which seems to boot without issue and, moreover, fixes a serious vulnerability.

Comment 49 Albert Flügel 2019-02-26 19:07:01 UTC

The first 2 attempts with 4.20.11-100 both hung immediately. No change from my perspective. I could attach a screen photo again, but as it shows exactly the same messages, I'll skip this.

Comment 50 Albert Flügel 2019-03-21 10:46:21 UTC

Still the same with 4.20.14-100, but I may have a new finding: the problem only appears when the laptop is in a docking station. I hope I did enough experiments for significance: 10 boots without a problem outside the docking station, while the 2nd attempt in the docking station led to a hang. Can anyone confirm?

Comment 51 Joachim Frieben 2019-03-22 06:08:29 UTC

(In reply to Albert Flügel from comment #50)
No, this issue affected my notebook from the beginning, always undocked. In my case, the issue has evolved though in a way such that the system hangs frequently at boot but without ever showing the error message "INFO: rcu_sched detected stalls on CPUs/tasks: ..".

Comment 52 Albert Flügel 2019-03-23 17:01:09 UTC

These messages are still issued and can be made visible e.g. with the kernel arguments earlyprintk=console loglevel=5

Comment 53 Joachim Frieben 2019-03-26 15:47:52 UTC

(In reply to Albert Flügel from comment #52)
Correct, adding these kernel options was not necessary for older kernels but when added, the latest kernel 5.0.3 hangs frequently because of the rcu_sched issue.

Comment 54 Ben Cotton 2019-05-02 19:21:26 UTC

This message is a reminder that Fedora 28 is nearing its end of life.
On 2019-May-28 Fedora will stop maintaining and issuing updates for
Fedora 28. It is Fedora's policy to close all bug reports from releases
that are no longer maintained. At that time this bug will be closed as
EOL if it remains open with a Fedora 'version' of '28'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 28 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 56 Albert Flügel 2019-05-11 16:38:02 UTC

still the same with 5.0.13-300 on Fedora-30. Should i open a new bug for this due to new Fedora version ?

Comment 57 Joachim Frieben 2019-05-11 21:30:12 UTC

(In reply to Albert Flügel from comment #56)
I do see this issue with 5.0.13-300.fc30 on Fedora 30, too. Since Fedora 28 is reaching EOL this month and Fedora 29 receives the same kernel updates as Fedora 30, I recommend setting the version to 29.

Comment 58 Justin M. Forbes 2019-08-20 17:43:59 UTC

*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 29 kernel bugs.

Fedora 29 has now been rebased to 5.2.9-100.fc29. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 30, and are still experiencing this issue, please change the version to Fedora 30.

If you experience different issues, please open a new bug report for those.

Comment 59 Albert Flügel 2019-08-31 12:13:25 UTC

With 5.2.9 I haven't experienced the problem yet. I tried quite a number of times.
But as it always seemed to be partly a matter of chance, I would not give the all-clear yet.

Comment 60 Albert Flügel 2019-10-13 10:15:42 UTC

The problem seems not to occur anymore.
However, I wonder whether this is due to other timeouts during early boot that unintentionally bring things into the correct order (the laptop needs around 15 seconds longer to boot than with the kernels that still showed the problem). These items take quite long:

[    3.159464] .... <unrelated stuff, just the previous entry in the syslog to show the timestamp>
[    6.710290] ata1: link is slow to respond, please be patient (ready=0)
[   11.374286] ata1: COMRESET failed (errno=-16)
[   11.683998] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
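
The waits are easier to see as differences between consecutive timestamps. A small sketch that extracts the bracketed timestamps from the four lines above and prints the gaps follows. The regular expression assumes the usual "[ seconds.micros]" prefix, and the first log line is shortened:

```python
import re

LOG = """\
[    3.159464] ....
[    6.710290] ata1: link is slow to respond, please be patient (ready=0)
[   11.374286] ata1: COMRESET failed (errno=-16)
[   11.683998] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
"""


def gaps(log: str) -> list:
    """Seconds elapsed between each pair of consecutive log lines."""
    stamps = [float(t) for t in re.findall(r"^\[\s*(\d+\.\d+)\]", log, flags=re.MULTILINE)]
    return [round(b - a, 6) for a, b in zip(stamps, stamps[1:])]


print(gaps(LOG))  # [3.550826, 4.663996, 0.309712]
```

Read this way, roughly 8 of the first 12 seconds pass while waiting for the ata1 link.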

And i always see "Failed to start Setup Virtual Console" at around 14.5 on screen, not in syslog.
(Side note: i tried to disable this pseudoservice in systemd, but it seems impossible, probably due to dependencies ?)
In the syslog there are these sections taking some time related to display:
[   13.757654] [drm] ib test on ring 0 succeeded in 0 usecs
[   14.412748] [drm] ib test on ring 5 succeeded
...
[   14.413252] [drm]   Encoders:
[   14.413253] [drm]     CRT1: INTERNAL_KLDSCP_DAC1
[   16.325038] [drm] Cannot find any crtc or sizes

These are the tsc related messages:
[    0.000000] tsc: Fast TSC calibration using PIT
[    0.000000] tsc: Detected 2659.872 MHz processor
[    0.321171] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x26572ab1400, max_idle_ns: 440795213444 ns
[    0.713192] clocksource: Switched to clocksource tsc-early
[    1.310565] tsc: Marking TSC unstable due to TSC halts in idle

Please decide yourself, if you consider the problem really fixed and whether you want to close this bug.

Comment 61 Ben Cotton 2019-10-31 18:46:22 UTC

This message is a reminder that Fedora 29 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 29 on 2019-11-26.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '29'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 29 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 62 Ben Cotton 2019-11-27 23:32:48 UTC

Fedora 29 changed to end-of-life (EOL) status on 2019-11-26. Fedora 29 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.
