770754 – file system corruption after hibernation (possible i915 modesetting memory corruption)


Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	xorg-x11-drv-intel
Sub Component:
Version:	16
Hardware:	i686
OS:	Linux
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Adam Jackson
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	797478
Depends On:
Blocks:

Reported:	2011-12-29 00:41 UTC by Thomas Quinn
Modified:	2012-11-30 13:10 UTC
CC List:	14 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2012-11-30 07:43:27 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
kernel panic backtrace (2.69 KB, application/octet-stream) 2011-12-29 00:41 UTC, Thomas Quinn	no flags
Very similar panic, which does not seem to occur when i915 is not loaded (326.62 KB, image/jpeg) 2012-03-19 21:47 UTC, Bojan Smojver	no flags
dmesg (94.51 KB, text/plain) 2012-11-30 02:38 UTC, Aaron Kaplan	no flags
dmesg with i915.modeset=0 (112.00 KB, text/plain) 2012-11-30 12:47 UTC, Aaron Kaplan	no flags

Description Thomas Quinn 2011-12-29 00:41:59 UTC

Created attachment 549883
kernel panic backtrace

Description of problem:

After resuming from a hibernation, various symptoms of file system corruption occur, including kernel panics.

Version-Release number of selected component (if applicable): kernel-3.1.6-1.fc16.i686


How reproducible: intermittent


Steps to Reproduce:
1. Hibernate the system (e.g. using the hibernate button in the logout dialog in Xfce).
2. Resume the system by powering on.
3. Perform file-system-intensive activity (e.g. update an rpm package).
  
Actual results:
failing mkdir, kernel panics


Expected results:
Normal file system operation

Additional info:
See the attached panic log

Comment 1 Stanislaw Gruszka 2011-12-31 11:44:12 UTC

Is this 100% reproducible for you? Can you reproduce it when booting with the i915.modeset=0 kernel parameter?
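
Before each test run it may help to confirm which mode the kernel was actually booted in. A minimal sketch, assuming the parameter is spelled exactly as above; the function name modeset_disabled is hypothetical and only inspects the string passed to it:

```shell
# Succeed if the given kernel command line contains i915.modeset=0.
# modeset_disabled is a hypothetical helper name, not an existing tool.
modeset_disabled() {
    # Pad with spaces so the parameter matches only as a whole word.
    case " $1 " in
        *" i915.modeset=0 "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

On a running system this could be called as: modeset_disabled "$(cat /proc/cmdline)" && echo "i915 modesetting is disabled"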

Comment 2 Thomas Quinn 2012-01-02 05:20:53 UTC

(In reply to comment #1)
> Is this 100% reproducible for you? Can you reproduce it when booting with
> the i915.modeset=0 kernel parameter?

The corruption occurs about 1 out of 4 times that I hibernate and resume.

I've tried reproducing the problem with i915.modeset=0, and have not been able to, even after about 10 hibernate/resume cycles.
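
As a rough check on how strong that datapoint is: if each hibernate/resume cycle independently corrupted the file system with probability 1/4 (an assumption based on the "about 1 out of 4" estimate above), ten clean cycles in a row would happen by chance only a few percent of the time. A small sketch of the arithmetic, with a hypothetical function name:

```shell
# Probability of n clean cycles in a row when each cycle fails
# independently with probability p. Hypothetical helper for illustration.
clean_run_probability() {
    LC_ALL=C awk -v p="$1" -v n="$2" 'BEGIN { printf "%.3f\n", (1 - p) ^ n }'
}

clean_run_probability 0.25 10   # prints 0.056
```

So under these assumptions the ten clean cycles are suggestive but not conclusive on their own.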

Comment 3 Dave Jones 2012-01-03 16:05:13 UTC

The modesetting datapoint is a useful one.

This is a duplicate of bug 744275, but let's keep this open for now to focus on that, as it sounds like modesetting causes memory corruption when we hibernate.

*** This bug has been marked as a duplicate of bug 744275 ***

Comment 4 Dave Jones 2012-01-03 16:05:57 UTC

Derp. I never meant to dupe this. Fixing.

Comment 5 Stanislaw Gruszka 2012-02-27 07:56:59 UTC

*** Bug 797478 has been marked as a duplicate of this bug. ***

Comment 6 Bojan Smojver 2012-03-19 21:47:28 UTC

Created attachment 571236
Very similar panic, which does not seem to occur when i915 is not loaded

Comment 7 Adam Jackson 2012-06-05 15:19:34 UTC

Pretty sure this was a dupe of 744275.  Please reopen if you can reproduce this with a current kernel from updates.

Comment 8 Thomas Quinn 2012-06-05 16:01:24 UTC

Just tried hibernating and resuming with kernel 3.3.7-1.fc17.i686 on the same machine I reported the bug for (Asus eeePc 900HA).  I now get:
EXT4-fs error (device dm-0): ext4_mb_generate_buddy:739: group8, 25068 clusters in bitmap, 25067 in gd
JBD2: Spotted dirty metadata buffer (dev = dm-0, blocknr = 0). There's a risk of filesystem corruption in case of system crash.

Comment 9 Aaron Kaplan 2012-11-02 16:40:15 UTC

I'm still having ext4 corruption after hibernation in F17 with kernel 3.6.3-1.fc17. i915.modeset=0 seems to prevent it. Should I file a separate bug for F17? Anything else I can do to move this along?

Comment 10 Stanislaw Gruszka 2012-11-12 11:37:13 UTC

Can you reproduce the corruption with the test_hib.sh script and the checkmem.c program, as described here:
https://bugzilla.redhat.com/show_bug.cgi?id=701857#c24
?

If so, what hardware do you have (output of "lspci -nnvv" for the VGA controller)?
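
One possible way to collect just the VGA controller entry is to find its slot first and then ask lspci for the verbose output of that slot. This is a sketch only; vga_slots is a hypothetical helper that reads "lspci -nn" output on stdin:

```shell
# Print the PCI slot (first field) of every VGA controller line on stdin.
# vga_slots is a hypothetical helper name.
vga_slots() {
    awk '/VGA compatible controller/ { print $1 }'
}

# Example use on the affected machine:
#   for slot in $(lspci -nn | vga_slots); do lspci -nnvv -s "$slot"; done
```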

Comment 11 Thomas Quinn 2012-11-18 01:39:50 UTC

I ran about 30 hibernate cycles with the test_hib.sh script.  No errors reported, but a couple of times it hung during the reboot.  A hard reset got it going again.

Comment 12 Aaron Kaplan 2012-11-18 03:17:27 UTC

Ran test_hib.sh for 31 cycles, no corruption detected.

Comment 13 Stanislaw Gruszka 2012-11-19 14:42:25 UTC

(In reply to comment #11)
> No errors
> reported, but a couple of times it hung during the reboot.

Not sure if this is corruption-related; perhaps it is a suspend/resume bug.

Let's try this at night:

"
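# Run as root: suspend and resume in a loop with PM tracing enabled.
while true; do
    echo 0 > /sys/class/rtc/rtc0/wakealarm      # clear any pending wake alarm
    echo +120 > /sys/class/rtc/rtc0/wakealarm   # wake up 120 seconds from now
    sync                                        # flush file system buffers first
    echo 1 > /sys/power/pm_trace                # enable suspend/resume tracing
    pm-suspend
    sleep 60
done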
"

The script suspends and resumes in an endless loop with error detection enabled. Once a suspend or resume fails and the system is rebooted, dmesg should contain information about which driver is responsible for the failure, so attach dmesg here (restarting the system again will erase that information).

Note that on failure this will overwrite your hardware clock, so you will need to set it again in the BIOS or with "date" followed by "hwclock --systohc".
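
After the reboot that follows a failed cycle, the pm_trace result can be picked out of the kernel log. A minimal sketch, assuming the kernel reports the trace with "Magic number" and "hash matches" lines; pm_trace_report is a hypothetical helper name:

```shell
# Print only the pm_trace result lines from a kernel log read on stdin.
# pm_trace_report is a hypothetical helper name.
pm_trace_report() {
    grep -E 'Magic number|hash matches'
}

# Example use right after rebooting from a failed suspend/resume:
#   dmesg | pm_trace_report
```

The function exits non-zero when no such lines are present, which is expected if no trace was recorded.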

Comment 14 Aaron Kaplan 2012-11-19 15:09:34 UTC

I've never seen filesystem corruption after suspending, only after hibernating (and pretty much every time I hibernate). Are you sure it's useful to run this script that uses pm-suspend rather than pm-hibernate?

I don't mean to hijack this ticket, happy to file a separate one if that's indicated, but so far my symptoms and Thomas Quinn's seem consistent.

Comment 15 Stanislaw Gruszka 2012-11-19 15:43:09 UTC

The instructions from comment 13 were intended for Thomas, to diagnose his hibernate reboot problems.

I just realized that the in-kernel corruption detection works only on the -debug kernel variant. Did you run test_hib.sh on kernel-debug? If not, please retest after installing and booting that kernel (I'm sorry for not telling you about that earlier).

Comment 16 Aaron Kaplan 2012-11-20 12:31:45 UTC

I was not previously using a debug kernel, but I tried again with it, and test_hib.sh still didn't detect any corruption after running overnight.

Comment 17 Aaron Kaplan 2012-11-21 03:05:48 UTC

The script from comment 13 ran for about 12 hours, until I interrupted it.

Comment 18 Thomas Quinn 2012-11-23 02:57:05 UTC

I also repeated the script with a -debug kernel.  61 cycles and no corruption detected.

Comment 19 Stanislaw Gruszka 2012-11-27 16:09:18 UTC

Hmm, so it looks like the file system corruption is not caused by memory corruption from i915 or another driver. How does the file system corruption manifest itself on your systems?

Comment 20 Aaron Kaplan 2012-11-27 16:27:25 UTC

ext4 errors in /var/log/messages when I resume after hibernation, and again whenever I mount that filesystem until I've fscked it. Only hibernating to disk, not suspending to RAM, causes these errors. When I booted with i915.modeset=0 and then hibernated, the corruption didn't occur, but since I've only tried once I'm not confident saying that that fixes it for sure.

Comment 21 Stanislaw Gruszka 2012-11-27 16:30:21 UTC

OK, let's look at those errors. Please boot the system, then hibernate and resume, and attach dmesg here if the errors are present; if not, repeat the hibernate/resume cycles (perhaps using a script).

Comment 22 Aaron Kaplan 2012-11-30 02:37:37 UTC

No script needed, corruption happens reliably every time I hibernate. Attaching dmesg. See "EXT4-fs error" near the end.

Comment 23 Aaron Kaplan 2012-11-30 02:38:19 UTC

Created attachment 654714
dmesg

Comment 24 Stanislaw Gruszka 2012-11-30 07:43:27 UTC

Before the ext4 errors, there are:

[  500.926215] end_request: I/O error, dev dm-0, sector 1953458048
[  500.926220] Buffer I/O error on device dm-0, logical block 244182256
[  500.926236] end_request: I/O error, dev dm-0, sector 1953458048
[  500.926238] Buffer I/O error on device dm-0, logical block 244182256

which indicates a data read (or write) problem on the disk. This could be a hardware malfunction or a device driver problem. It is not related to the bug originally reported here. Please open a new bug report after making sure this is not a hardware issue.

The i915 bug originally reported here is fixed; closing.
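
To see whether block-layer I/O errors come before the file system errors in a given log, the relevant lines can be filtered out in their original order. A sketch only: the patterns are taken from the messages quoted in this report and may differ on other kernel versions, and storage_errors is a hypothetical helper name:

```shell
# Print block-layer and ext4/JBD2 error lines from a kernel log on stdin,
# preserving their order. storage_errors is a hypothetical helper name.
storage_errors() {
    grep -E 'end_request: I/O error|Buffer I/O error|EXT4-fs error|JBD2:'
}

# Example use after a hibernate/resume cycle:
#   dmesg | storage_errors
```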

Comment 25 Aaron Kaplan 2012-11-30 12:47:37 UTC

Created attachment 654999
dmesg with i915.modeset=0

As I mentioned in my first comment, setting i915.modeset=0 makes my problem go away. I'm attaching dmesg from after booting with i915.modeset=0 and then hibernating and resuming four or five times. Neither the I/O errors nor the filesystem errors occur as they do without the modeset=0 kernel option.

I'm quite confident that this is not just chance. In the course of testing I have hibernated with the normal settings dozens of times and seen filesystem errors every single time; and I've hibernated about six times with i915.modeset=0 and not seen any filesystem errors.

Comment 26 Stanislaw Gruszka 2012-11-30 12:57:37 UTC

So this must be related to traffic on the PCIe bus: the i915 device with modeset=1 does something that breaks the disk controller. But it is not related to the issue reported here. Please open a separate bug report, provide "lspci -vnn" output, and link to the information you already provided here.

Comment 27 Aaron Kaplan 2012-11-30 13:10:36 UTC

Filed bug 882232
