1067615 – Kernel > 3.12 leads to file system errors

Summary: Kernel > 3.12 leads to file system errors

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	xorg-x11-drv-nouveau
Sub Component:
Version:	20
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Ben Skeggs
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2014-02-20 18:03 UTC by Erinn Looney-Triggs
Modified:	2014-07-10 21:00 UTC
CC List:	16 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2014-07-10 21:00:23 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
Picture of exception with BTRFS (530.92 KB, image/jpeg) 2014-02-20 18:03 UTC, Erinn Looney-Triggs
EXT4 errors part 1 (9.77 MB, image/jpeg) 2014-02-20 21:39 UTC, Erinn Looney-Triggs
EXT4 errors part 2 (2.03 MB, image/jpeg) 2014-02-20 21:41 UTC, Erinn Looney-Triggs

Description Erinn Looney-Triggs 2014-02-20 18:03:57 UTC

Created attachment 865637 [details]
Picture of exception with BTRFS

Description of problem:
If I use a kernel > 3.12, my file systems turn read-only with either BTRFS or EXT4. In some cases I am able to log in; in most cases I am not. Please see the attached picture for specifics.

After 5 reinstalls yesterday, adjusting filesystems etc., this problem is very reproducible for me. It happens every single time: with kernel < 3.13 everything works fine, with kernel > 3.12 I get file system errors.

Version-Release number of selected component (if applicable):
kernel-3.13.3-201.fc20.x86_64

Additional info:
Hardware: Lenovo ThinkPad T440p with Samsung SSD.

I will attempt to capture the same info using ext4 later today.

Comment 1 Erinn Looney-Triggs 2014-02-20 21:39:40 UTC

Created attachment 865694 [details]
EXT4 errors part 1

Comment 2 Erinn Looney-Triggs 2014-02-20 21:41:13 UTC

Created attachment 865695 [details]
EXT4 errors part 2

Includes output from mount.

Comment 3 Ric Wheeler 2014-02-21 01:42:08 UTC

Usually, if you try to install multiple file systems and see similar errors, this is a sign of bad storage.


You might be able to test for this by booting from a USB stick (live image) and running something like smartctl or looking for disk related errors?

Comment 4 Eric Sandeen 2014-02-21 01:50:58 UTC

From the errors, everything looks like on-disk corruption.

However, it does seem unlikely that it's failing hardware, if it all continues to work fine with kernel 3.12.

Ric's suggestion of booting a livecd or something to poke around some more might be good.  That might allow you to better capture dmesg, as well.

Comment 5 Lukáš Czerner 2014-02-21 13:06:30 UTC

This really looks like hardware corruption. However, the btrfs case probably also shows a real bug in the btrfs error handling code, because we hit this:

VM_BUG_ON_PAGE(!PageLocked(page), page);


However when we say hardware corruption we usually mean anything that file system resides on. In this case it seems that you're using dm-thinp target so the real problem might be in the thin provisioning code.

Can you provide more information?

- most importantly, please take thin provisioning out of the mix completely so we can rule it out, or confirm it.
- what file system / storage setup you're using
- what your dm library version is (dmsetup --version)
- and whether smartctl on the real hardware shows anything problematic.


Thanks!
-Lukas

Comment 6 Erinn Looney-Triggs 2014-02-23 13:43:57 UTC

If this is thin provisioned, and honestly I can't tell whether it is, that was purely a mistake, most likely by myself or else by the installer, as that wasn't the route I was aiming for. My guess is that you are confusing the host name thin-mint with thin provisioning:
  LV   VG                Attr       LSize  Pool Origin Data%  Move Log Cpy%Sync Convert
  home fedora_thin-mint2 -wi-ao---- 51.01g
  root fedora_thin-mint2 -wi-ao---- 19.53g

I believe the Attr column above indicates that the volumes are not thin provisioned, but this is not my area of expertise.
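
The attribute reading above can be checked mechanically. The following is a minimal Python sketch, assuming the lv_attr convention documented in lvs(8), where the first character gives the volume type (V for a thin volume, t for a thin pool, T for thin pool data). The helper names thin_kind and parse_lvs are hypothetical, and the sample text is the lvs output quoted in this comment.

```python
# Sketch only: classify LVs by the first character of the lvs "Attr" column.
# The mapping covers just the thin-provisioning codes listed in lvs(8).
THIN_CODES = {"V": "thin volume", "t": "thin pool", "T": "thin pool data"}


def thin_kind(lv_attr):
    """Return a description if lv_attr marks a thin-provisioning LV, else None."""
    if not lv_attr:
        raise ValueError("empty lv_attr")
    return THIN_CODES.get(lv_attr[0])


def parse_lvs(text):
    """Return (lv, vg, attr) tuples from lvs output, skipping the header row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] != "LV":
            rows.append((fields[0], fields[1], fields[2]))
    return rows


# lvs output as quoted in this comment.
SAMPLE = """\
  LV   VG                Attr       LSize  Pool Origin Data%  Move Log Cpy%Sync Convert
  home fedora_thin-mint2 -wi-ao---- 51.01g
  root fedora_thin-mint2 -wi-ao---- 19.53g
"""

if __name__ == "__main__":
    for lv, vg, attr in parse_lvs(SAMPLE):
        print(vg, lv, thin_kind(attr) or "not thin provisioned")
```

Both volumes here start with `-` in the Attr column and have an empty Pool column, which is consistent with plain linear volumes rather than thin ones.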

I am using EXT4, though as I said, similar behavior (from my perspective) is observed with BTRFS. It is a very basic layout, generally what you would get from anaconda:
mount
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime,seclabel)
devtmpfs on /dev type devtmpfs (rw,nosuid,seclabel,size=8150448k,nr_inodes=2037612,mode=755)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
selinuxfs on /sys/fs/selinux type selinuxfs (rw,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,seclabel)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,seclabel,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,seclabel,mode=755)
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,seclabel,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
efivarfs on /sys/firmware/efi/efivars type efivarfs (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
/dev/mapper/fedora_thin--mint2-root on / type ext4 (rw,relatime,seclabel,data=ordered)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=35,pgrp=1,timeout=300,minproto=5,maxproto=5,direct)
mqueue on /dev/mqueue type mqueue (rw,relatime,seclabel)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,seclabel)
tmpfs on /tmp type tmpfs (rw,seclabel)
debugfs on /sys/kernel/debug type debugfs (rw,relatime)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw,relatime)
configfs on /sys/kernel/config type configfs (rw,relatime)
sunrpc on /proc/fs/nfsd type nfsd (rw,relatime)
/dev/sda7 on /boot type ext4 (rw,relatime,seclabel,data=ordered)
/dev/sda2 on /boot/efi type vfat (rw,relatime,fmask=0077,dmask=0077,codepage=437,iocharset=ascii,shortname=winnt,errors=remount-ro)
/dev/mapper/fedora_thin--mint2-home on /home type ext4 (rw,relatime,seclabel,data=ordered)
fusectl on /sys/fs/fuse/connections type fusectl (rw,relatime)
gvfsd-fuse on /run/user/1607600003/gvfs type fuse.gvfsd-fuse (rw,nosuid,nodev,relatime,user_id=1607600003,group_id=1607600003)



dmsetup --version
Library version:   1.02.82 (2013-10-04)
Driver version:    4.25.0


smartctl -a /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.11.10-301.fc20.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG MZ7TD256HAFV-000L9
Serial Number:    S17LNSAD910153
LU WWN Device Id: 5 002538 500000000
Firmware Version: DXT02L5Q
User Capacity:    256,060,514,304 bytes [256 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Feb 23 06:29:53 2014 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (53956) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  40) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       309
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       316
175 Program_Fail_Count_Chip 0x0032   100   100   010    Old_age   Always       -       0
176 Erase_Fail_Count_Chip   0x0032   100   100   010    Old_age   Always       -       0
177 Wear_Leveling_Count     0x0013   098   098   005    Pre-fail  Always       -       17
178 Used_Rsvd_Blk_Cnt_Chip  0x0013   100   100   010    Pre-fail  Always       -       0
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
180 Unused_Rsvd_Blk_Cnt_Tot 0x0013   100   100   010    Pre-fail  Always       -       6240
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   061   051   000    Old_age   Always       -       39
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   253   253   000    Old_age   Always       -       0
233 Media_Wearout_Indicator 0x003a   199   199   000    Old_age   Always       -       10348
234 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       -       0
235 Unknown_Attribute       0x0012   099   099   000    Old_age   Always       -       53
236 Unknown_Attribute       0x0012   099   099   000    Old_age   Always       -       4
237 Unknown_Attribute       0x0012   099   099   000    Old_age   Always       -       17
238 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
  255        0    65535  Read_scanning was never started
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
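
For readers who want to scan a report like the one above without reading every row, here is a minimal Python sketch. It assumes the usual smartctl rule that an attribute counts as failed when its normalized VALUE is at or below THRESH, or when the WHEN_FAILED column is not "-". The function names are hypothetical, and the sample rows are copied from the output above.

```python
# Sketch only: pull the attribute table out of `smartctl -a` text and flag
# rows that look unhealthy. Not a replacement for reading the full report.

def parse_smart_attributes(text):
    """Return a list of dicts, one per SMART attribute row."""
    rows = []
    for line in text.splitlines():
        f = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(f) >= 10 and f[0].isdigit():
            rows.append({
                "id": int(f[0]),
                "name": f[1],
                "value": int(f[3]),
                "worst": int(f[4]),
                "thresh": int(f[5]),
                "type": f[6],
                "when_failed": f[8],
                "raw": f[9],
            })
    return rows


def suspicious(rows):
    """Rows whose WHEN_FAILED is set, or whose VALUE is at or below THRESH."""
    flagged = []
    for r in rows:
        below = r["thresh"] > 0 and r["value"] <= r["thresh"]
        if r["when_failed"] != "-" or below:
            flagged.append(r)
    return flagged


# Three rows copied from the smartctl output in this comment.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
177 Wear_Leveling_Count     0x0013   098   098   005    Pre-fail  Always       -       17
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x003e   253   253   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    flagged = suspicious(parse_smart_attributes(SAMPLE))
    print("flagged attributes:", [r["name"] for r in flagged])
```

On the rows above nothing is flagged, which matches the PASSED self-assessment. A clean SMART report does not rule out corruption introduced elsewhere in the stack.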

Comment 7 Leslie Satenstein 2014-02-25 03:26:07 UTC

USB drive files are corrupted when external drive is formatted to other than NTFS.

A possible Kernel bug. Please follow my explanation.
Linux fedora20.fedora20 3.13.3-201.fc20.x86_64 #1 SMP Fri Feb 14 19:08:32 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

I have two external hard drives. The first is a 1 TB drive partitioned equally into NTFS and ext3.
The second is a 2 TB drive from a different vendor, also partitioned equally (NTFS and ext3).
I connect one at a time to the computer.

On the external drives, it does not matter whether the layout is P1=NTFS, P2=ext3 or the reverse, or whether the drive is entirely formatted as ext2, ext3, or btrfs (tested with all 3).

I have run tests with 4 GB and with 8 GB of system memory (RAM) on the same machine.

With 4 GB of RAM, neither external drive suffers corruption on transfers in either direction.

With 8 GB of RAM and the Fedora network.install.iso files, USB backup transfers to the ext3 and NTFS partitions complete without corruption.

With 8 GB of RAM and files larger than 4 GB (for example, a Fedora dvd.iso image), transfers to the external ext3 partition arrive corrupted, while transfers to the external NTFS partition do not. The direction of transfer does not matter. The front part of the transferred file is not corrupted.

Transfers from internal hard drive to internal hard drive complete free of corruption.

Writing a DVD image to an 8 GB USB flash drive shows no corruption.

I am testing with kernel 3.13.

What can I provide you? smartctl does not report on USB-attached hard drives.

Comment 8 Leslie Satenstein 2014-02-26 12:19:01 UTC

Please remove my comment 7.  The problem was resolved by
a) Power Off computer and power on, to enter bios settings.
b) Select option to initialize bios default parameters
c) Save bios update.

I believe that when the bios settings were initially created with memory at 4 gigs, the 4 gig value was stored in the settings.  By redoing the default, the bios picked up the additional 4 gigs, and problem appears resolved.

What I cannot answer is why NTFS file access was unaffected.

Sorry if any research effort was spent on this. My own tests took over 40 non-consecutive hours, and the fix was a discovery.

Comment 9 Erinn Looney-Triggs 2014-02-27 14:44:35 UTC

OK folks, unless you have other suggestions: things look fine to me from the smartctl perspective. Nevertheless, I will go ahead and replace the SSD to test that, since corruption is ongoing whenever I boot with kernel 3.13.

-Erinn

Comment 10 Erinn Looney-Triggs 2014-03-01 20:53:14 UTC

Well, I have narrowed this down considerably: the corruption is caused by the NVidia card in the system, see https://github.com/Bumblebee-Project/bbswitch/issues/78.

While testing a minimal install, I noticed that the newer kernel did not cause any issues; it is only when the NVidia card gets involved that things go downhill pretty quickly.

Still no solution but getting closer. In the meantime that BTRFS issue may be worth having a look at, or perhaps opening another bug for?

-Erinn

Comment 11 Erinn Looney-Triggs 2014-03-03 15:26:30 UTC

Ok folks here is the deal as best as I can work it out right now. 

So far this issue appears to occur only on Lenovo ThinkPad T440p models equipped with the Nvidia GeForce GT 730M. In addition, one needs to be using plymouth for the boot process and a kernel >= 3.13.x. To put it another way, a minimal install, which doesn't seem to use plymouth if my understanding is correct, will not trigger this issue regardless of kernel version.

This issue was originally found by the bbswitch folks here: https://github.com/Bumblebee-Project/bbswitch/issues/78. Essentially, they were experiencing corruption of memory and filesystems, and a lot of other breakage, when they powered down the Nvidia card and then powered it back up.

A number of workarounds are available: you can disable the Nvidia card using ACPI calls (see the bbswitch link for more info), you can downgrade the laptop's BIOS/UEFI firmware to 1.14, or, as I found, you can add the nomodeset parameter to the kernel boot line, which has the effect of falling back to the Intel graphics driver, which works without issues (so far).

Can we get this reassigned to some of the graphics folks? I believe we should be able to track down the changes to nouveau (I assume) in 3.13 that lead to this issue.

-Erinn

Comment 12 Erinn Looney-Triggs 2014-03-07 17:10:11 UTC

Hello? Is there anybody out there? Just nod if you can hear me :)

Comment 13 Eric Sandeen 2014-03-07 19:22:57 UTC

Ok, sounds like it's an xorg driver problem then?

Comment 14 Ben Skeggs 2014-03-07 23:37:40 UTC

I don't actually think this is a graphics driver problem.  The problem is a firmware issue, one that happens as a result of executing the ACPI tables to turn on the GPU (if I understand correctly after following the bbswitch bug links).

The symptoms described here are slightly more severe, but I'd suggest trying the workaround mentioned here[1] - disabling "Intel Rapid Start Technology" in the firmware setup.

This should also confirm/deny whether it's actually the bug we're seeing here.

Thanks,
Ben.

[1] http://forums.lenovo.com/t5/W-Series-ThinkPad-Laptops/HOWTO-Brick-a-W540-in-easy-steps/m-p/1414465#M43530

Comment 15 Erinn Looney-Triggs 2014-03-10 16:04:39 UTC

Well, the interesting point to come back to here is that this doesn't occur with 3.11.x but does with > 3.11.x.

Disabling Rapid Start did not resolve the issue. Corruption still occurred with the same symptoms.

The only workarounds I know of so far are:
nomodeset to force use of intel graphics instead of nvidia
BIOS/UEFI firmware <= 1.14
kernel <= 3.11.x
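
As a rough illustration, two of the workarounds listed above can be checked from values the running system already exposes. This is a minimal Python sketch under stated assumptions: the release string has the form printed by `uname -r`, the command line is the text of /proc/cmdline, and the function names are hypothetical. The firmware version is not checked here.

```python
import re


def kernel_version(release):
    """Return (major, minor) from a release string such as '3.13.3-201.fc20.x86_64'."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return int(m.group(1)), int(m.group(2))


def workaround_active(release, cmdline):
    """Name the workaround from the list above that applies, or None."""
    if kernel_version(release) <= (3, 11):
        return "kernel <= 3.11.x"
    if "nomodeset" in cmdline.split():
        return "nomodeset"
    return None


if __name__ == "__main__":
    # Release strings taken from this bug; the command lines are made up.
    print(workaround_active("3.11.10-301.fc20.x86_64", "ro quiet"))
    print(workaround_active("3.13.3-201.fc20.x86_64", "ro quiet nomodeset"))
    print(workaround_active("3.13.3-201.fc20.x86_64", "ro quiet"))
```

On a live system the inputs would come from `platform.release()` and from reading /proc/cmdline.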

-Erinn

Comment 16 Ben Skeggs 2014-03-10 21:43:02 UTC

Further down in the Lenovo forums thread, a Lenovo representative mentions that a 2.05 firmware for the W530 (2.07 for the T540p) is now available which corrects the issue. It would be interesting to see whether it resolves this issue too.

Comment 17 Erinn Looney-Triggs 2014-03-10 22:40:38 UTC

This is happening on a Lenovo T440P for me, with the latest BIOS update 2.19, I think, installed. I am not able to test on any other platforms.

-Erinn

Comment 18 Matt Hewitt 2014-03-21 14:05:30 UTC

I've been experiencing the same issue with a Dell XPS 15z. The system uses bumblebee and a Geforce 525M.

Can also confirm it only affects 3.12+ kernels. 3.11 works fine.

Comment 19 Petr Pachl 2014-05-09 19:05:56 UTC

This also appeared for me on a Lenovo T440P. A month ago it was fine with kernel 3.11, but the update to 3.13 forced me to reinstall. After updating everything except the kernel (still version 3.11), installing and starting bumblebee crashed the system again, even with the old kernel. It has to be connected with something else as well. Maybe some update of the nvidia drivers?
I also tried to downgrade the BIOS to 1.14 (it had been 1.18 all the time from the beginning), but the installation disk didn't work for me. It just keeps restarting with no change to the BIOS.

Any new suggestions?

Petr

Comment 20 Erinn Looney-Triggs 2014-07-10 21:00:23 UTC

This problem is fixed in kernel 3.15.x. According to the source quoted here, this appears to be the commit that fixed it:


Did a git bisect and found that the problem was gone after the following commit:
```
commit faae404ebdc6bba744919d82e64c16448eb24a36
Author: Bob Moore <Robert.Moore>
Date:   Tue Feb 11 10:25:27 2014 +0800

    ACPICA: Add "Windows 2013" string to _OSI support.
    
    This urgent patch is cherry picked from ACPICA upstream.
    It is reported that some platforms fail to boot without this new _OSI
    string.
    
    This change adds this string for Windows 8.1 and Server 2012 R2.
    
    Reported-by: Zhang Rui <rui.zhang>
    Signed-off-by: Bob Moore <Robert.Moore>
    Signed-off-by: Lv Zheng <lv.zheng>
    Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki>

diff --git a/drivers/acpi/acpica/utosi.c b/drivers/acpi/acpica/utosi.c
index e210e51..1e5525e 100644
--- a/drivers/acpi/acpica/utosi.c
+++ b/drivers/acpi/acpica/utosi.c
@@ -74,6 +74,7 @@ static struct acpi_interface_info acpi_default_supported_interfaces[] = {
        {"Windows 2006 SP2", NULL, 0, ACPI_OSI_WIN_VISTA_SP2},  /* Windows Vista SP2 - Added 09/2010 */
        {"Windows 2009", NULL, 0, ACPI_OSI_WIN_7},      /* Windows 7 and Server 2008 R2 - Added 09/2009 */
        {"Windows 2012", NULL, 0, ACPI_OSI_WIN_8},      /* Windows 8 and Server 2012 - Added 08/2012 */
+       {"Windows 2013", NULL, 0, ACPI_OSI_WIN_8},      /* Windows 8.1 and Server 2012 R2 - Added 01/2014 */

        /* Feature Group Strings */

```
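
To make the effect of that one-line change concrete, here is a toy Python model of an _OSI lookup. It is not ACPICA code. The two lists contain only the strings visible in the diff context, and the firmware_path function is a hypothetical stand-in for whatever branching the laptop's ACPI tables perform on the answer.

```python
# Toy model only: firmware asks the OS "do you support interface X?" via _OSI,
# and the OS answers from a table of strings like the one patched above.

# Only the entries visible in the diff context, before and after the patch.
SUPPORTED_BEFORE = ["Windows 2006 SP2", "Windows 2009", "Windows 2012"]
SUPPORTED_AFTER = SUPPORTED_BEFORE + ["Windows 2013"]


def osi(supported, interface):
    """Answer an _OSI query: True if the OS claims to support the interface."""
    return interface in supported


def firmware_path(supported):
    """Hypothetical firmware logic: use the newest interface the OS claims."""
    for name in ("Windows 2013", "Windows 2012", "Windows 2009"):
        if osi(supported, name):
            return name
    return "fallback"


if __name__ == "__main__":
    print("before patch:", firmware_path(SUPPORTED_BEFORE))
    print("after patch: ", firmware_path(SUPPORTED_AFTER))
```

If the firmware on the affected machines behaves differently depending on this answer, that would be consistent with comment 14's view that the root cause lies in the firmware rather than in the graphics driver. That reading is an assumption, not something this bug confirms.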

Source: https://github.com/Bumblebee-Project/bbswitch/issues/78#issuecomment-48600484
