Bug 1255763 - Kernel 4.1.5 not stable - crashes on nightly backup
Status: CLOSED ERRATA
Product: Fedora
Classification: Fedora
Component: kernel
Version: 22
Hardware: x86_64 Linux
Priority: unspecified
Severity: urgent
Assigned To: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Reported: 2015-08-21 09:39 EDT by Gerhard Wiesinger
Modified: 2016-01-10 08:51 EST

Doc Type: Bug Fix
Last Closed: 2015-10-19 08:12:01 EDT
Type: Bug

Attachments: None

Description Gerhard Wiesinger 2015-08-21 09:39:30 EDT
Description of problem:
I'm having big problems with the Fedora FC22 kernel 4.1.5 (this happened with every 4.1.x kernel from FC22 that I tried), which is not stable at all. During the nightly backup jobs (database dumps, rsync via ssh, etc.) the machine crashes reproducibly every night with the stack trace below. The same message repeats on different CPUs roughly every 1 to 10 s.

Kernel 4.0.8 from Fedora FC22 works well with long uptimes, and earlier kernel versions were also highly stable. Kernel 4.1.4/4.1.5 had a lot of RAID fixes, so I tried again, but it didn't help. So something critical must have changed between 4.0.8 and 4.1.2 and later.

I'm running 2 RAID5 volumes, each with LVM and cryptsetup on top. After the crash the RAID does a resync.

Machine:
- Mainboard: ASUS - M3N-H HDMI with latest BIOS
- CPU: AMD Phenom II X4 940 Black Edition, 4x 3.00GHz, boxed (HDZ940XCGIBOX)
- NIC: HP Broadcom Netxtreme Gigabit PCIe network card 482914-001 (BCM5761)

If you need further information please let me know.

Version-Release number of selected component (if applicable):
kernel-4.1.5-200.fc22.x86_64

How reproducible:
Crashes during nightly backup activity (see above); reproducible with kernel 4.1.x

Steps to Reproduce:
1. Just run the machine

Actual results:
[63525.726812] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [ping:18283]
[63525.734015] Modules linked in: tun ebtable_filter ebtables bridge stp llc cfg80211 rfkill ipt_MASQUERADE nf_nat_masquerade_ipv4 ip6t_REJECT nf_reject_ipv6 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat xt_CHECKSUM xt_conntrack nf_conntrack iptable_mangle iptable_security ip6table_filter ip6_tables iptable_raw hwmon_vid snd_hda_codec_hdmi lnbp21 stb6100 stb0899 snd_hda_codec_realtek snd_hda_codec_generic kvm_amd kvm snd_hda_intel snd_hda_controller snd_hda_codec snd_hda_core edac_core edac_mce_amd mantis snd_hwdep mantis_core snd_seq k10temp snd_seq_device dvb_core snd_pcm snd_timer snd soundcore shpchp i2c_nforce2 asus_atk0110 acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc dm_crypt raid1 ata_generic raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx pata_acpi raid6_pq nouveau i2c_algo_bit drm_kms_helper ttm mxm_wmi drm tg3 serio_raw ptp pps_core firewire_ohci forcedeth firewire_core crc_itu_t pata_amd video wmi uas usb_storage
[63525.825481] CPU: 1 PID: 18283 Comm: ping Tainted: G      D W L  4.1.5-200.fc22.x86_64 #1
[63525.833809] Hardware name: System manufacturer System Product Name/M3N-H/HDMI, BIOS ASUS M3N-H/HDMI ACPI BIOS Revision 2603 06/11/2010
[63525.845863] task: ffff88019de5c520 ti: ffff880117f50000 task.ti: ffff880117f50000
[63525.853325] RIP: 0010:[<ffffffff81121cc2>] [<ffffffff81121cc2>] smp_call_function_many+0x222/0x280
[63525.862366] RSP: 0018:ffff880117f53c58  EFLAGS: 00000202
[63525.867663] RAX: 0000000000000003 RBX: 0000000000000293 RCX: 0000000000000000
[63525.874781] RDX: ffff88023fc1b8c8 RSI: 0000000000000008 RDI: ffff880237406bb0
[63525.881897] RBP: ffff880117f53c98 R08: 0000000000000000 R09: 000000000000000d
[63525.889015] R10: ffffffff813ad019 R11: ffffffff813acfa4 R12: ffff880117f53c28
[63525.896131] R13: ffff880117f53bc8 R14: ffffffff813acfa4 R15: 00000000000082d2
[63525.903249] FS:  00007f4227e48700(0000) GS:ffff88023fc40000(0000) knlGS:0000000000000000
[63525.911319] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[63525.917051] CR2: 00007fb542e34000 CR3: 0000000014833000 CR4: 00000000000006e0
[63525.924166] Stack:
[63525.926176]  0000000000000001 0100000000000001 0000000000000002 0000000000000000
[63525.933620]  ffffffff81069d90 0000000000000000 ffff880117f53db0 0000000000000001
[63525.941067]  ffff880117f53cc8 ffffffff81121d81 ffffc90001130000 0000000000000000
[63525.948513] Call Trace:
[63525.950956]  [<ffffffff81069d90>] ? unmap_pte_range+0xe0/0xe0
[63525.956688]  [<ffffffff81121d81>] on_each_cpu+0x31/0x60
[63525.961901]  [<ffffffff8106bcd1>] change_page_attr_set_clr+0x421/0x530
[63525.968412]  [<ffffffff8106c8bf>] set_memory_ro+0x2f/0x40
[63525.973797]  [<ffffffff81191e99>] bpf_prog_select_runtime+0x29/0x40
[63525.980047]  [<ffffffff81699130>] bpf_prepare_filter+0x160/0x180
[63525.986038]  [<ffffffff81699462>] sk_attach_filter+0xe2/0x190
[63525.991772]  [<ffffffff810dee91>] ? pick_next_task_fair+0x7e1/0x980
[63525.998022]  [<ffffffff8166b005>] sock_setsockopt+0x3f5/0x9a0
[63526.003755]  [<ffffffff81665966>] SyS_setsockopt+0xd6/0xf0
[63526.009225]  [<ffffffff810250d7>] ? syscall_trace_leave+0xc7/0x140
[63526.015391]  [<ffffffff817a1e6e>] system_call_fastpath+0x12/0x71
[63526.021382] Code: 05 78 a2 c0 00 89 c1 0f 8d 73 fe ff ff 48 98 49 8b 16 48 03 14 c5 a0 77 d2 81 8b 42 18 a8 01 74 c8 0f 1f 84 00 00 00 00 00 f3 90 <8b> 42 18 a8 01 75 f7 eb b5 0f b6 4d c8 4c 89 ea 4c 89 e6 44 89

Expected results:
No crash

Additional info:
Also reported on the Linux kernel mailing list:
http://www.spinics.net/lists/kernel/msg2060516.html
Comment 1 Gerhard Wiesinger 2015-09-21 12:56:22 EDT
Any update on this critical issue?
Comment 2 Laura Abbott 2015-09-21 14:10:56 EDT
The 'D' in the taint flags indicates the kernel was dying. Can you attach the full dmesg log?
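
For anyone reading the taint field in the trace above ("Tainted: G      D W L"), here is a minimal sketch of a decoder. It assumes the letter meanings from the kernel's tainted-kernels documentation and lists only a subset of them; the function name and the regular expression are made up for illustration.

```python
import re

# A subset of taint letters and their documented meanings.
# Letters not listed here are reported as "unknown" by this sketch.
TAINT_FLAGS = {
    "G": "no proprietary module was loaded",
    "P": "a proprietary module was loaded",
    "D": "kernel has died recently (an oops or BUG occurred)",
    "W": "kernel issued a warning earlier",
    "L": "a soft lockup occurred earlier",
    "O": "an out-of-tree module was loaded",
}


def decode_taint(line):
    """Return (letter, meaning) pairs from a 'CPU: ... Tainted: ...' line.

    Returns an empty list for 'Not tainted' lines or lines without a taint field.
    """
    match = re.search(r"Tainted: ((?:[A-Z] *)+)", line)
    if match is None:
        return []
    letters = [c for c in match.group(1) if c != " "]
    return [(c, TAINT_FLAGS.get(c, "unknown")) for c in letters]


if __name__ == "__main__":
    sample = "CPU: 1 PID: 18283 Comm: ping Tainted: G      D W L  4.1.5-200.fc22.x86_64 #1"
    for letter, meaning in decode_taint(sample):
        print(letter, "-", meaning)
```

The 'D' flag is the relevant one here: it means an earlier oops already happened, so the soft lockup trace shown in the description is a follow-up event rather than the original failure.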
Comment 3 Gerhard Wiesinger 2015-09-21 14:15:39 EDT
Unfortunately not. The kernel is dead and that was the only information I got from the serial console (the message repeats, so the serial console is flooded with it).
Comment 4 Laura Abbott 2015-09-21 14:20:58 EDT
Can you try:

# echo 1 > /proc/sys/kernel/panic_on_oops 
# echo 1 > /proc/sys/kernel/panic_on_unrecovered_nmi

This will panic on the first oops or the first unrecovered NMI.
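
To confirm that the two settings took effect without changing anything on a production machine, a read-only check along these lines may help. This is a minimal sketch: the function name and the `root` parameter are hypothetical and exist only so the check can be pointed at a different directory.

```python
from pathlib import Path

# The two sysctl files named in the commands above.
SETTINGS = ("panic_on_oops", "panic_on_unrecovered_nmi")


def read_panic_settings(root="/proc/sys/kernel"):
    """Return {name: value} for each setting file that exists under root.

    Only reads the files; it never writes to them.
    """
    values = {}
    for name in SETTINGS:
        path = Path(root) / name
        if path.exists():
            values[name] = int(path.read_text().strip())
    return values


if __name__ == "__main__":
    for name, value in read_panic_settings().items():
        print(f"{name} = {value}")
```

Values written with echo do not survive a reboot, so the check is worth repeating after each restart.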
Comment 5 Gerhard Wiesinger 2015-09-21 14:37:46 EDT
This is a production system. I already had several crashes with RAID rebuilds and I don't want to risk any loss of data.

Isn't the information from the call trace sufficient?
Comment 6 Laura Abbott 2015-09-21 15:54:20 EDT
The call trace isn't helpful because it's not the first one, hence the request to capture everything. Can you at least try the latest stable kernel? 4.1.5 had a known HID corruption bug.
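
Since the first report is the one that matters, a captured serial console log can be scanned for the earliest problem line. The following is a minimal sketch; the marker list is an assumption chosen for illustration and is not exhaustive, and the function name is hypothetical.

```python
import re

# Substrings that commonly appear in kernel problem reports (assumed list).
MARKERS = ("BUG:", "WARNING:", "Oops", "hard LOCKUP", "soft lockup")

# Matches the "[seconds.microseconds]" prefix of a kernel log line.
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")


def first_event(log_text):
    """Return (timestamp, line) for the earliest marked line, or None."""
    earliest = None
    for line in log_text.splitlines():
        stamp_match = TIMESTAMP.match(line)
        if stamp_match is None or not any(m in line for m in MARKERS):
            continue
        stamp = float(stamp_match.group(1))
        if earliest is None or stamp < earliest[0]:
            earliest = (stamp, line)
    return earliest


if __name__ == "__main__":
    import sys
    print(first_event(sys.stdin.read()))
```

The timestamps are seconds since boot, so the comparison is only meaningful within a single boot's log.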
Comment 7 Gerhard Wiesinger 2015-09-22 16:55:08 EDT
Updated to 4.1.6-201.fc22.x86_64, let's see what happens tonight ...
Comment 8 Gerhard Wiesinger 2015-09-23 01:27:53 EDT
Still happens:
echo 1 > /proc/sys/kernel/panic_on_oops 
echo 1 > /proc/sys/kernel/panic_on_unrecovered_nmi
[82296.198648] ------------[ cut here ]------------
[82296.203265] WARNING: CPU: 0 PID: 22377 at kernel/watchdog.c:331 watchdog_overflow_callback+0x82/0xc0()
[82296.212550] Watchdog detected hard LOCKUP on cpu 0
[82296.217156] Modules linked in: tun ebtable_filter ebtables bridge stp llc cfg80211 rfkill ip6t_REJECT ipt_MASQUERADE nf_reject_ipv6 nf_nat_masquerade_ipv4 nf_conntrack_ipv6 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_defrag_ipv6 nf_nat_ipv4 xt_conntrack nf_nat nf_conntrack xt_CHECKSUM iptable_mangle iptable_security iptable_raw ip6table_filter ip6_tables hwmon_vid snd_hda_codec_hdmi lnbp21 snd_hda_codec_realtek stb6100 stb0899 snd_hda_codec_generic snd_hda_intel kvm_amd kvm snd_hda_controller edac_core snd_hda_codec edac_mce_amd mantis snd_hda_core mantis_core k10temp snd_hwdep snd_seq snd_seq_device dvb_core snd_pcm shpchp snd_timer snd soundcore asus_atk0110 acpi_cpufreq i2c_nforce2 nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc dm_crypt raid1 ata_generic pata_acpi raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq nouveau mxm_wmi i2c_algo_bit drm_kms_helper ttm tg3 drm serio_raw ptp pps_core firewire_ohci pata_amd firewire_core forcedeth crc_itu_t video wmi uas usb_storage
[82296.308814] CPU: 0 PID: 22377 Comm: rsync Not tainted 4.1.6-201.fc22.x86_64 #1
[82296.316024] Hardware name: System manufacturer System Product Name/M3N-H/HDMI, BIOS ASUS M3N-H/HDMI ACPI BIOS Revision 2603 06/11/2010
[82296.328077]  0000000000000000 00000000049f50fb ffff88023fc05a60 ffffffff81799a6d
[82296.335522]  0000000000000000 ffff88023fc05ab8 ffff88023fc05aa0 ffffffff810a165a
[82296.342967]  0000000000000000 ffff880237414800 0000000000000000 ffff88023fc05c00
[82296.350415] Call Trace:
[82296.352857]  <NMI>  [<ffffffff81799a6d>] dump_stack+0x45/0x57
[82296.358617]  [<ffffffff810a165a>] warn_slowpath_common+0x8a/0xc0
[82296.364614]  [<ffffffff810a16e5>] warn_slowpath_fmt+0x55/0x70
[82296.370348]  [<ffffffff8115a1f2>] watchdog_overflow_callback+0x82/0xc0
[82296.376867]  [<ffffffff811a192b>] __perf_event_overflow+0x9b/0x250
[82296.383038]  [<ffffffff811a2554>] perf_event_overflow+0x14/0x20
[82296.388947]  [<ffffffff8102dd8d>] x86_pmu_handle_irq+0x13d/0x1a0
[82296.394946]  [<ffffffff8102c09b>] perf_event_nmi_handler+0x2b/0x50
[82296.401118]  [<ffffffff81018fa8>] nmi_handle+0x88/0x130
[82296.406330]  [<ffffffff81019522>] default_do_nmi+0x42/0x110
[82296.411889]  [<ffffffff810196e8>] do_nmi+0xf8/0x170
[82296.416756]  [<ffffffff817a23c8>] end_repeat_nmi+0x1a/0x1e
[82296.422235]  [<ffffffff8179fd6d>] ? _raw_spin_lock_irq+0x3d/0x50
[82296.428227]  [<ffffffff8179fd6d>] ? _raw_spin_lock_irq+0x3d/0x50
[82296.434217]  [<ffffffff8179fd6d>] ? _raw_spin_lock_irq+0x3d/0x50
[82296.440209]  <<EOE>>  [<ffffffffa0329cca>] drop_one_stripe+0x3a/0xc0 [raid456]
[82296.447441]  [<ffffffffa0329d95>] raid5_cache_scan+0x45/0x60 [raid456]
[82296.453960]  [<ffffffff811bf179>] shrink_slab+0x219/0x3d0
[82296.459354]  [<ffffffff811c3d1c>] shrink_zone+0x2dc/0x2f0
[82296.464747]  [<ffffffff811c3eb3>] do_try_to_free_pages+0x183/0x420
[82296.470911]  [<ffffffff811c4267>] try_to_free_pages+0x117/0x170
[82296.476816]  [<ffffffff811b6de7>] __alloc_pages_nodemask+0x5c7/0xa00
[82296.483154]  [<ffffffff811b5f36>] ? get_page_from_freelist+0x2b6/0xaa0
[82296.489666]  [<ffffffff81200411>] alloc_pages_current+0x91/0x110
[82296.495665]  [<ffffffff81209775>] new_slab+0x85/0x4d0
[82296.500705]  [<ffffffff8120ac1a>] __slab_alloc+0x24a/0x5a0
[82296.506177]  [<ffffffff812ce94f>] ? ext4_alloc_inode+0x1f/0x1c0
[82296.512081]  [<ffffffff812ce94f>] ? ext4_alloc_inode+0x1f/0x1c0
[82296.517987]  [<ffffffff8120c083>] kmem_cache_alloc+0x1d3/0x240
[82296.523803]  [<ffffffff812c5274>] ? search_dir+0xc4/0x120
[82296.529189]  [<ffffffff812ce94f>] ext4_alloc_inode+0x1f/0x1c0
[82296.534921]  [<ffffffff8124705d>] alloc_inode+0x1d/0xa0
[82296.540133]  [<ffffffff8124877b>] iget_locked+0xdb/0x180
[82296.545433]  [<ffffffff812bb572>] ext4_iget+0x42/0xab0
[82296.550558]  [<ffffffff812456e9>] ? __d_alloc+0x29/0x190
[82296.555856]  [<ffffffff812bc015>] ext4_iget_normal+0x35/0x40
[82296.561502]  [<ffffffff812c5faa>] ext4_lookup+0xca/0x160
[82296.566802]  [<ffffffff81234c7d>] lookup_real+0x1d/0x70
[82296.572014]  [<ffffffff81236252>] __lookup_hash+0x42/0x60
[82296.577398]  [<ffffffff812362b3>] lookup_slow+0x43/0xc0
[82296.582611]  [<ffffffff8123b36e>] path_lookupat+0x7fe/0xc20
[82296.588169]  [<ffffffff8120c065>] ? kmem_cache_alloc+0x1b5/0x240
[82296.594161]  [<ffffffff8123c286>] ? getname_flags+0x56/0x200
[82296.599807]  [<ffffffff8123b7b7>] filename_lookup+0x27/0xc0
[82296.605364]  [<ffffffff8123d5b3>] user_path_at_empty+0x63/0xd0
[82296.611184]  [<ffffffff812ed1f5>] ? __ext4_journal_stop+0x45/0xd0
[82296.617260]  [<ffffffff8123d631>] user_path_at+0x11/0x20
[82296.622561]  [<ffffffff8122ff9a>] vfs_fstatat+0x6a/0xd0
[82296.627773]  [<ffffffff81230581>] SYSC_newlstat+0x31/0x60
[82296.633166]  [<ffffffff81023a15>] ? do_audit_syscall_entry+0x55/0x80
[82296.639504]  [<ffffffff81024d1b>] ? syscall_trace_enter_phase1+0x14b/0x1b0
[82296.646362]  [<ffffffff811473f6>] ? __audit_syscall_exit+0x1f6/0x290
[82296.652699]  [<ffffffff810250d7>] ? syscall_trace_leave+0xc7/0x140
[82296.658864]  [<ffffffff812306be>] SyS_newlstat+0xe/0x10
[82296.664076]  [<ffffffff817a002e>] system_call_fastpath+0x12/0x71
[82296.670068] ---[ end trace 3be0ce8b251dc860 ]---
[82339.056536] INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 2, t=60005 jiffies, g=2932852, c=2932851, q=0)
[82339.067739]  ffff8801919bf488 ffff88016c979dd0 ffff88023362a800 ffff88023362a808
[82339.075192]  ffff8801919bf4b8 ffffffffa0329cca 0000000000000080 000000000000002a
[82339.082639]  ffff8801919bf580 ffff88023362a800 ffff8801919bf4e8 ffffffffa0329d95
[82339.090084] Call Trace:
[82339.092560]  [<ffffffffa0329cca>] drop_one_stripe+0x3a/0xc0 [raid456]
[82339.098999]  [<ffffffffa0329d95>] raid5_cache_scan+0x45/0x60 [raid456]
[82339.105521]  [<ffffffff811bf179>] shrink_slab+0x219/0x3d0
[82339.110913]  [<ffffffff811c3d1c>] shrink_zone+0x2dc/0x2f0
[82339.116306]  [<ffffffff811c3eb3>] do_try_to_free_pages+0x183/0x420
[82339.122477]  [<ffffffff811c4267>] try_to_free_pages+0x117/0x170
[82339.128393]  [<ffffffff811b6de7>] __alloc_pages_nodemask+0x5c7/0xa00
[82339.134740]  [<ffffffff811b5f36>] ? get_page_from_freelist+0x2b6/0xaa0
[82339.141260]  [<ffffffff81200411>] alloc_pages_current+0x91/0x110
[82339.147259]  [<ffffffff81209775>] new_slab+0x85/0x4d0
[82339.152305]  [<ffffffff8120ac1a>] __slab_alloc+0x24a/0x5a0
[82339.157789]  [<ffffffff812ce94f>] ? ext4_alloc_inode+0x1f/0x1c0
[82339.163701]  [<ffffffff812ce94f>] ? ext4_alloc_inode+0x1f/0x1c0
[82339.169612]  [<ffffffff8120c083>] kmem_cache_alloc+0x1d3/0x240
[82339.175439]  [<ffffffff812c5274>] ? search_dir+0xc4/0x120
[82339.180835]  [<ffffffff812ce94f>] ext4_alloc_inode+0x1f/0x1c0
[82339.186576]  [<ffffffff8124705d>] alloc_inode+0x1d/0xa0
[82339.191795]  [<ffffffff8124877b>] iget_locked+0xdb/0x180
[82339.197104]  [<ffffffff812bb572>] ext4_iget+0x42/0xab0
[82339.202238]  [<ffffffff812456e9>] ? __d_alloc+0x29/0x190
[82339.207545]  [<ffffffff812bc015>] ext4_iget_normal+0x35/0x40
[82339.213199]  [<ffffffff812c5faa>] ext4_lookup+0xca/0x160
[82339.218508]  [<ffffffff81234c7d>] lookup_real+0x1d/0x70
[82339.223726]  [<ffffffff81236252>] __lookup_hash+0x42/0x60
[82339.229121]  [<ffffffff812362b3>] lookup_slow+0x43/0xc0
[82339.234343]  [<ffffffff8123b36e>] path_lookupat+0x7fe/0xc20
[82339.239911]  [<ffffffff8120c065>] ? kmem_cache_alloc+0x1b5/0x240
[82339.245908]  [<ffffffff8123c286>] ? getname_flags+0x56/0x200
[82339.251562]  [<ffffffff8123b7b7>] filename_lookup+0x27/0xc0
[82339.257130]  [<ffffffff8123d5b3>] user_path_at_empty+0x63/0xd0
[82339.262960]  [<ffffffff812ed1f5>] ? __ext4_journal_stop+0x45/0xd0
[82339.269045]  [<ffffffff8123d631>] user_path_at+0x11/0x20
[82339.274351]  [<ffffffff8122ff9a>] vfs_fstatat+0x6a/0xd0
[82339.279574]  [<ffffffff81230581>] SYSC_newlstat+0x31/0x60
[82339.284967]  [<ffffffff81023a15>] ? do_audit_syscall_entry+0x55/0x80
[82339.291313]  [<ffffffff81024d1b>] ? syscall_trace_enter_phase1+0x14b/0x1b0
[82339.298181]  [<ffffffff811473f6>] ? __audit_syscall_exit+0x1f6/0x290
[82339.304525]  [<ffffffff810250d7>] ? syscall_trace_leave+0xc7/0x140
[82339.310700]  [<ffffffff812306be>] SyS_newlstat+0xe/0x10
[82339.315920]  [<ffffffff817a002e>] system_call_fastpath+0x12/0x71
[82356.026990] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[82356.034073] ata2.00: failed command: FLUSH CACHE EXT
[82356.039055] ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 30
[82356.039055]          res 40/00:ff:00:78:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[82356.052691] ata2.00: status: { DRDY }
[82356.056396] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[82356.063447] ata1.00: failed command: FLUSH CACHE EXT
[82356.068409] ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 21
[82356.068409]          res 40/00:ff:00:78:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[82356.082027] ata1.00: status: { DRDY }
[82356.085712] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[82356.092781] ata3.00: failed command: FLUSH CACHE EXT
[82356.097771] ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 13
[82356.097771]          res 40/00:ff:00:78:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[82356.111418] ata3.00: status: { DRDY }
[82361.386401] ata2.00: qc timeout (cmd 0xec)
[82361.390553] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[82361.396655] ata2.00: revalidation failed (errno=-5)
[82361.401569] ata1.00: qc timeout (cmd 0xec)
[82361.405692] ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[82361.411798] ata1.00: revalidation failed (errno=-5)
[82361.427349] ata3.00: qc timeout (cmd 0xec)
[82361.431468] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[82361.437552] ata3.00: revalidation failed (errno=-5)
[82371.711661] ata2.00: qc timeout (cmd 0xec)
[82371.715765] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[82371.721849] ata2.00: revalidation failed (errno=-5)
[82371.726716] ata2: limiting SATA link speed to 1.5 Gbps
[82371.731879] ata1.00: qc timeout (cmd 0xec)
[82371.735981] ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[82371.742067] ata1.00: revalidation failed (errno=-5)
[82371.746942] ata1: limiting SATA link speed to 1.5 Gbps
[82371.752100] ata3.00: qc timeout (cmd 0xec)
[82371.756225] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[82371.762329] ata3.00: revalidation failed (errno=-5)
[82371.767207] ata3: limiting SATA link speed to 1.5 Gbps
[82402.032221] ata1.00: qc timeout (cmd 0xec)
[82402.036355] ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[82402.042457] ata1.00: revalidation failed (errno=-5)
[82402.047353] ata1.00: disabled
[82402.050338] ata1.00: device reported invalid CHS sector 0
[82402.055754] ata3.00: qc timeout (cmd 0xec)
[82402.059852] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[82402.065933] ata3.00: revalidation failed (errno=-5)
[82402.070799] ata3.00: disabled
[82402.073763] ata3.00: device reported invalid CHS sector 0
[82402.079187] ata2.00: qc timeout (cmd 0xec)
[82402.083332] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[82402.089426] ata2.00: revalidation failed (errno=-5)
[82402.094327] ata2.00: disabled
[82402.097320] ata2.00: device reported invalid CHS sector 0
[82402.362973] blk_update_request: I/O error, dev sda, sector 1953519935
[82402.369423] md: super_written gets error=-5, uptodate=0
[82402.374645] md/raid:md0: Disk failure on sda1, disabling device.
[82402.374645] md/raid:md0: Operation continuing on 2 devices.
[82402.386316] blk_update_request: I/O error, dev sda, sector 1225584391
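
When reading a trace like the one above, note that frames printed with a leading "?" are addresses found on the stack that the unwinder could not confirm as part of the call chain. A minimal sketch for listing only the unmarked frames follows; the function name and the regular expression are hypothetical and assume the "[<address>] symbol+0xoff/0xsize" layout shown in this log.

```python
import re

# One stack frame: "[<addr>] [? ]symbol+0xoffset/0xsize"
FRAME = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([A-Za-z0-9_.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def reliable_frames(trace_text):
    """Return symbol names of frames not marked with '?', in printed order."""
    names = []
    for match in FRAME.finditer(trace_text):
        if match.group(1) is None:  # no '?' marker in front of the symbol
            names.append(match.group(2))
    return names


if __name__ == "__main__":
    import sys
    for name in reliable_frames(sys.stdin.read()):
        print(name)
```

Applied to the hard lockup trace, the frames after <<EOE>> run from drop_one_stripe and raid5_cache_scan in raid456 back through shrink_slab and the page allocator to the lstat system call issued by rsync.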
Comment 9 Gerhard Wiesinger 2015-09-23 15:01:12 EDT
Any conclusion from the call trace?
Comment 10 Gerhard Wiesinger 2015-10-18 07:19:51 EDT
Fixed in 4.2.3-200.fc22.x86_64

But another bug is triggered:
https://bugzilla.redhat.com/show_bug.cgi?id=1272645
Comment 11 Josh Boyer 2015-10-19 08:12:01 EDT
Thank you for letting us know.
Comment 12 Andrej Podzimek 2016-01-09 10:02:37 EST
This is still a problem on 4.2.8-300: https://andrej.podzimek.org/2016-01-09.jpg
The hangs occur at random, in roughly 50% of boot attempts, making the system more or less unusable. After a series of soft lockup backtraces, it ultimately freezes with a hard lockup. Is there a known workaround for this? Could one, for example, temporarily disable some of the recent memory protection features to circumvent the crash?
Comment 13 Andrej Podzimek 2016-01-09 11:38:03 EST
There are also other spurious failures on repeated boot attempts. All of them appear to be related to memory management during kernel module loading. So I tried to add 'udev.children-max=1 rd.udev.children-max=1' to the kernel command line, but sadly this didn't help. It reduces the probability of boot failures somewhat, but it doesn't stop the failures from occurring.

A few observations:

1. The problem appears to be confined to the initial ramdisk phase. Once the system boots up, it doesn't fail; it can run for days, suspend and resume, use both the internal and discrete GPU, and so on. Only the boot process is prone to failures.

2. Unsuccessful boots are not recorded in the logs at all. 'journalctl -k -b -1' shows the previous successful boot, ignoring the failure(s) in between. So all the bad stuff happens in the initial ramdisk, before the root filesystem gets mounted.
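
To check on a successfully booted system that the 'udev.children-max=1 rd.udev.children-max=1' options were actually passed to the kernel, the command line can be read back from /proc/cmdline. This is a minimal sketch with a hypothetical function name; it splits on whitespace only and does not handle quoted values that contain spaces.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a dict; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


if __name__ == "__main__":
    with open("/proc/cmdline") as handle:
        params = parse_cmdline(handle.read())
    for key in ("udev.children-max", "rd.udev.children-max"):
        print(key, "=", params.get(key, "(not set)"))
```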
Comment 14 Andrej Podzimek 2016-01-10 08:51:46 EST
Because the OP's hardware configuration appears to be way too different from mine and because the issue has become a really bad blocker, I filed a separate bug 1297188 (https://bugzilla.redhat.com/show_bug.cgi?id=1297188). Removing NEEDINFO from here.
