Bug 1514734

Summary:

[abrt] kernel-PAE-core: __do_softirq(): WARNING: CPU: 0 PID: 0 at kernel/rcu/tree.c:2821 rcu_process_callbacks+0x436/0x460

Product:

[Fedora] Fedora

Reporter:

Claude Frantz <Claude.Frantz>

Component:

kernel

Assignee:

Kernel Maintainer List <kernel-maint>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

unspecified

Docs Contact:

Priority:

unspecified

Version:

CC:

airlied, bskeggs, Claude.Frantz, ewk, fatkasuvayu, fedora2021q2, hdegoede, ichavero, itamar, jarodwilson, jeremy, jforbes, jglisse, john.j5live, jonathan, josef, kernel-maint, labbott, linville, mchehab, mjg59, oggust, steved

Target Milestone:

---

Target Release:

---

Hardware:

i686

OS:

Unspecified

Whiteboard:

abrt_hash:e295f92f80cdf8992a61fd5079554cada79237c9;

Last Closed:

2018-07-30 13:41:50 UTC

Bug Depends On:

Bug Blocks:

1489998

Attachments:

Description	Flags
File: backtrace	none
File: cpuinfo	none
File: dmesg	none
File: kernel_tainted_long	none
File: not-reportable	none
File: proc_modules	none
File: suspend_stats	none
traceback	none

Description Claude Frantz 2017-11-18 07:06:08 UTC

Version-Release number of selected component:
kernel-PAE-core-4.13.12-200.fc26

Additional info:
reporter:       libreport-2.9.1
cmdline:        BOOT_IMAGE=/vmlinuz-4.13.12-200.fc26.i686+PAE root=UUID=30a9af7c-df05-4249-a2ad-b920bcbd4f45 ro rd.md=0 rd.lvm=0 rd.dm=0 rd.luks=0 vconsole.font=latarcyrheb-sun16 vconsole.keymap=de rhgb acpi_backlight=vendor acpi_osi=Linux resume=/dev/sda6 quiet LANG=en_US.UTF-8
crash_function: __do_softirq
kernel:         4.13.12-200.fc26.i686+PAE
kernel_tainted_short: GDW
runlevel:       N 5
type:           Kerneloops

Truncated backtrace:
WARNING: CPU: 0 PID: 0 at kernel/rcu/tree.c:2821 rcu_process_callbacks+0x436/0x460
Modules linked in: fuse ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ccm ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables sunrpc coretemp kvm_intel kvm uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core irqbypass videodev iTCO_wdt iTCO_vendor_support snd_hda_codec_via arc4 media snd_hda_codec_generic ath9k snd_hda_intel ath9k_common ath9k_hw snd_hda_codec joydev lpc_ich snd_hda_core mac80211 snd_hwdep snd_seq snd_seq_device snd_pcm ath asus_laptop cfg80211 sparse_keymap
 snd_timer rfkill tpm_tis input_polldev tpm_tis_core snd tpm soundcore acpi_cpufreq dm_multipath i915 serio_raw i2c_algo_bit drm_kms_helper atl1e drm video
CPU: 0 PID: 0 Comm: swapper/0 Tainted: G      D W       4.13.12-200.fc26.i686+PAE #1
Hardware name: ASUSTeK Computer Inc.         P50IJ               /P50IJ     , BIOS 203     12/04/2009
task: dd2d2280 task.stack: dd2ca000
EIP: rcu_process_callbacks+0x436/0x460
EFLAGS: 00010002 CPU: 0
EAX: 00000000 EBX: f77ddb40 ECX: 00000002 EDX: 00000001
ESI: f77ddb60 EDI: dd2f0940 EBP: f70c7fc8 ESP: f70c7f98
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
CR0: 80050033 CR2: b7f01e87 CR3: 1d498000 CR4: 000406f0
Call Trace:
 <SOFTIRQ>
 __do_softirq+0xb1/0x260
 ? takeover_tasklets+0x1b0/0x1b0
 do_softirq_own_stack+0x24/0x30
 </SOFTIRQ>
 irq_exit+0xbd/0xd0
 smp_apic_timer_interrupt+0x38/0x50
 apic_timer_interrupt+0x39/0x40
EIP: cpuidle_enter_state+0x144/0x360
EFLAGS: 00000246 CPU: 0
EAX: 00000000 EBX: dd335e30 ECX: 3e349739 EDX: 00000000
ESI: 3e349739 EDI: 000000ca EBP: dd2cbf24 ESP: dd2cbef0
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
 ? trace_event_raw_event_sched_kthread_stop_ret+0x7b/0xa0
 cpuidle_enter+0x14/0x20
 call_cpuidle+0x21/0x40
 do_idle+0x174/0x1d0
 cpu_startup_entry+0x65/0x70
 rest_init+0x9c/0xa0
 start_kernel+0x404/0x41d
 i386_start_kernel+0x94/0x98
 startup_32_smp+0x16b/0x16d
Code: 04 2f dd 0f 8f d8 fd ff ff 8b 15 60 04 2f dd 89 53 64 e9 ca fd ff ff 8d b6 00 00 00 00 0f ff e9 29 fc ff ff 0f ff e9 04 fd ff ff <0f> ff e9 e7 fd ff ff 8b 55 dc 89 f0 e8 e9 df 6d 00 e9 53 fc ff

Comment 1 Claude Frantz 2017-11-18 07:06:26 UTC

Created attachment 1354507 [details]
File: backtrace

Comment 2 Claude Frantz 2017-11-18 07:06:28 UTC

Created attachment 1354508 [details]
File: cpuinfo

Comment 3 Claude Frantz 2017-11-18 07:06:32 UTC

Created attachment 1354509 [details]
File: dmesg

Comment 4 Claude Frantz 2017-11-18 07:06:34 UTC

Created attachment 1354510 [details]
File: kernel_tainted_long

Comment 5 Claude Frantz 2017-11-18 07:06:36 UTC

Created attachment 1354511 [details]
File: not-reportable

Comment 6 Claude Frantz 2017-11-18 07:06:39 UTC

Created attachment 1354512 [details]
File: proc_modules

Comment 7 Claude Frantz 2017-11-18 07:06:41 UTC

Created attachment 1354513 [details]
File: suspend_stats

Comment 8 Laura Abbott 2018-02-28 03:56:59 UTC

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale. The kernel moves very fast so bugs may get fixed as part of a kernel update. Due to this, we are doing a mass bug update across all of the Fedora 26 kernel bugs.
 
Fedora 26 has now been rebased to 4.15.4-200.fc26.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.
 
If you have moved on to Fedora 27, and are still experiencing this issue, please change the version to Fedora 27.
 
If you experience different issues, please open a new bug report for those.

Comment 9 fednuc 2018-02-28 14:15:11 UTC

I'm seeing something very similar on 4.15.4-300:

[Wed Feb 28 14:03:57 2018] WARNING: CPU: 6 PID: 0 at kernel/rcu/tree.c:2792 rcu_process_callbacks+0x4cb/0x4e0

and:

[Wed Feb 28 14:03:57 2018] Call Trace:
[Wed Feb 28 14:03:57 2018]  <IRQ>
[Wed Feb 28 14:03:57 2018]  __do_softirq+0xe7/0x2cb
[Wed Feb 28 14:03:57 2018]  irq_exit+0xf1/0x100
[Wed Feb 28 14:03:57 2018]  smp_apic_timer_interrupt+0x6c/0x120
[Wed Feb 28 14:03:57 2018]  apic_timer_interrupt+0xa2/0xb0
[Wed Feb 28 14:03:57 2018]  </IRQ>

This happens when trying to use a set of disks behind an eSATA port multiplier.

After this, disconnecting the disks doesn't produce any dmesg output, sync hangs, etc., and a restart seems to be the only thing that gets things back to normal.

I didn't see this a week ago, on 4.15.3, or any time previously, though it may be unrelated to the kernel update.

I can't change the Fedora version BTW; someone else will need to do that if necessary.

Comment 10 fednuc 2018-02-28 16:08:03 UTC

Obviously not statistically significant yet, but I booted into 4.15.3 and didn't see the same error when connecting/mounting etc. the eSATA box.

Before that I saw it each of the three times I tried to do the same on 4.15.4.

Comment 11 fednuc 2018-03-05 16:08:07 UTC

Two more data points:

4.15.6: same failure seen when using eSATA disks behind multiplier.
4.15.3: worked fine.

So currently 100% of 4 attempts on >= 4.15.4 have failed as above, 100% of 2 attempts on 4.15.3 (since first seeing this issue) have *not* failed.

Starting to look more and more like a kernel regression - what's the best way of dealing with this issue so it doesn't languish in RHBZ?

Comment 12 fednuc 2018-03-12 12:58:35 UTC

Upstream bug (the patch has apparently been accepted, presumably into master, but was not in the stable trees as of 3 days ago): https://bugzilla.kernel.org/show_bug.cgi?id=198861

Comment 13 Björn Augustsson 2018-03-16 16:23:18 UTC

From upstream:
> Kernels 4.15.10 and 4.14.27 include patch "scsi: core: Avoid that ATA error handling can trigger a kernel hang or oops".

So 4.15.10 should do the trick.
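
A minimal sketch of checking whether a kernel release string is at or past the fixed stable releases quoted above. It relies only on the two versions named in the upstream note (4.15.10 and 4.14.27); for any other series it returns None rather than guessing. The function names `parse_release` and `includes_fix` are hypothetical, chosen for illustration.

```python
import re

# Stable releases that include the patch, per the upstream note quoted above.
# Keys are (major, minor) series; values are the first fixed patch level.
FIXED_IN = {(4, 14): 27, (4, 15): 10}


def parse_release(release):
    """Extract (major, minor, patch) from a string such as '4.15.10-200.fc26.x86_64'."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(g) for g in m.groups())


def includes_fix(release):
    """Return True or False for the 4.14 and 4.15 stable series, None otherwise."""
    major, minor, patch = parse_release(release)
    threshold = FIXED_IN.get((major, minor))
    if threshold is None:
        # Series not covered by the quoted statement: no claim is made.
        return None
    return patch >= threshold
```

For example, `includes_fix("4.15.9-300.fc27.x86_64")` is False and `includes_fix("4.15.10-200.fc26.x86_64")` is True; the string from `uname -r` can be passed in directly.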

Comment 14 Suvayu 2018-03-18 12:22:31 UTC

Created attachment 1409466 [details]
traceback

Looks like I have the same problem.  I have attached my traceback to the bug.  I'll test out 4.15.10 from updates-testing and see if it addresses my issue.

Comment 15 Suvayu 2018-03-19 02:15:22 UTC

Does it make a difference that I am not using a PAE kernel (I use x86_64)?  I tried 4.15.10 from updates-testing and still experience a freeze; curiously, though, I don't see a traceback now.

In fact, I have a weird issue.  The traceback and the ata errors don't always show up in the journal.  For example with 4.13.9, I see ata errors in the journal like this:

ata5.00: exception Emask 0x11 SAct 0x7ff7ffff SErr 0x400000 action 0x6 frozen
ata5.00: irq_stat 0x48000008, interface fatal error
ata5: SError: { Handshk }
ata5.00: failed command: WRITE FPDMA QUEUED
ata5.00: cmd 61/58:00:60:72:4c/05:00:17:00:00/40 tag 0 ncq dma 700416 out
ata5.00: status: { DRDY }
ata5.00: failed command: WRITE FPDMA QUEUED
ata5.00: cmd 61/a8:08:b8:77:4c/02:00:17:00:00/40 tag 1 ncq dma 348160 out
ata5.00: status: { DRDY }

but no traceback.  However, for 4.15.x kernels up to 4.15.9 it's the opposite: I do not see the ata errors, but I do see the traceback I attached above.  On upgrading to 4.15.10, I see the above ata errors again, but the traceback is missing.  The freezes are a constant through all these kernels though.
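
A rough sketch of how the comparison described above could be made repeatable: count libata error lines and the RCU warning in a saved kernel log for each boot. The patterns are derived only from the lines quoted in this bug, so they are assumptions about the log format and may miss other variants; `summarise` is a hypothetical helper name.

```python
import re

# Matches libata error lines of the kinds quoted in this bug, for example
# "ata5.00: exception Emask ...", "ata5.00: failed command: ...", "ata5: SError: ...".
ATA_ERROR = re.compile(r"\bata\d+(?:\.\d+)?: (?:exception Emask|failed command|SError)")

# Matches the warning reported in this bug; the line number in tree.c varies by kernel.
RCU_WARNING = re.compile(
    r"WARNING: CPU: \d+ PID: \d+ at kernel/rcu/tree\.c:\d+ rcu_process_callbacks"
)


def summarise(lines):
    """Count ATA error lines and RCU warnings in an iterable of kernel log lines."""
    counts = {"ata_error": 0, "rcu_warning": 0}
    for line in lines:
        if RCU_WARNING.search(line):
            counts["rcu_warning"] += 1
        elif ATA_ERROR.search(line):
            counts["ata_error"] += 1
    return counts
```

The input could be the lines of a file saved from `journalctl -k` for the boot in question, for example `summarise(open("kernel.log"))`, where `kernel.log` is a placeholder file name.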

Comment 16 fednuc 2018-03-19 10:09:32 UTC

Suvayu, this isn't limited to PAE kernels, no.

The backtrace is the result of a bug introduced into the kernel in 4.15.4 (see the upstream bug); it shouldn't happen, and has been fixed in 4.15.10+.

The ATA errors are (very likely) a result of a poor-quality link (or failing/buggy hardware), and aren't (or are very unlikely to be) a kernel bug.

The freezes are probably related to the faulty/failing disk hardware/link.


....


In general reply to this bug, 4.15.10 seems to have fixed this issue - I saw some link resets (normal for this crap eSATA box) but no backtrace, and didn't end up in a state that required a reboot.

Comment 17 Suvayu 2018-03-24 05:39:41 UTC

Stephen, sorry about my late response. Thank you for your comments, they are reassuring.  If it's alright, I would like to ask a follow-up question.

The old drive on my system is not mounted at a critical point.  In fact I boot without it, and mount it when I need some dump space.  The system freezes happen both when I'm using the drive and when I'm not (as in, unmounted, or mounted but with no process accessing files in the partition).  When I'm using it, the freeze will happen, it's just a matter of time; when I'm not, it's quite random.  Also, for a disk-related freeze where the partition is non-critical, I would expect the process accessing files in that partition to freeze and go into "uninterruptible sleep", rather than the whole system freezing instantaneously.

Do you think this points to other problems besides my disks?  I am having graphics issues (kernel support is incomplete), so I boot with nomodeset.  All critical components in my system are brand new.

Comment 18 Claude Frantz 2018-03-24 06:49:19 UTC

Suvayu, I suggest running a long self-test on your drive and examining the report carefully. There may be a firmware update for the drive that resolves the problem. Remember that, on a PC, a drive can freeze the whole system via the controller if it malfunctions, or even when a hidden or badly documented option is used. The kernel is not always able to recognize such behaviour. Please be careful and make sure that the drive itself is working well.
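
One possible way to run the long self-test suggested above is with `smartctl` from smartmontools; this is an assumption, since the comment does not name a tool. The sketch below only wraps three standard invocations (`-t long` to start the test, `-l selftest` to read the self-test log, `-a` for the full report). The helper names and the device path `/dev/sdX` are placeholders.

```python
import subprocess


def selftest_commands(device):
    """Return the smartctl command lines for a long self-test on `device`."""
    return {
        # Start the long (extended) self-test; it runs inside the drive in the background.
        "start": ["smartctl", "-t", "long", device],
        # Show the self-test log once the test has finished.
        "log": ["smartctl", "-l", "selftest", device],
        # Print all SMART information, including attributes and error logs.
        "all": ["smartctl", "-a", device],
    }


def run_step(step, device):
    """Run one of the steps above and return the completed process (usually needs root)."""
    return subprocess.run(
        selftest_commands(device)[step],
        capture_output=True,
        text=True,
        check=False,  # smartctl uses non-zero exit codes to flag drive problems
    )
```

A typical sequence would be `run_step("start", "/dev/sdX")`, waiting for the duration that smartctl reports, and then printing `run_step("log", "/dev/sdX").stdout`.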

Comment 19 Justin M. Forbes 2018-07-23 15:29:19 UTC

*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 27 kernel bugs.

Fedora 27 has now been rebased to 4.17.7-100.fc27.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 28, and are still experiencing this issue, please change the version to Fedora 28.

If you experience different issues, please open a new bug report for those.

Comment 20 fednuc 2018-07-28 16:25:03 UTC

This is fixed; I can't close it.

Comment 21 Justin M. Forbes 2018-07-30 13:41:50 UTC

Thanks for the update.