Bug 166452

Summary:	swsusp kills sbp2-controlled disks
Product:	[Fedora] Fedora	Reporter:	Alexandre Oliva <oliva>
Component:	kernel	Assignee:	Dave Jones <davej>
Status:	CLOSED UPSTREAM	QA Contact:	Brian Brock <bbrock>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	5	CC:	pfrields, stefan-r-rhbz, wtogami
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2006-10-17 21:46:39 UTC	Type:	---

Description Alexandre Oliva 2005-08-21 20:18:08 UTC

Description of problem:
After my first successful swsusp on my Athlon64 notebook, with kernel
2.6.12-1.1504_FC5, the system resumed successfully, but an external hard disk
connected through a FireWire port didn't come back up, which failed the RAID 1
members it held.  Another external disk connected through a USB port came back
up just fine.  I then tried swsusp with both disks connected over USB, but
that froze during suspend, even though I hadn't re-added the faulty RAID
members, so the degraded arrays were not the cause of the freeze.  I'll try
again once RAID reconstruction is complete.

Here are the kinds of errors I got on the console and in /var/log/messages after
the system resumed:

Aug 21 16:39:47 livre kernel: Restarting tasks... done
Aug 21 16:39:47 livre kernel: ieee1394: sbp2: aborting sbp2 command
Aug 21 16:39:47 livre kernel: scsi1 : destination target 1, lun 0
Aug 21 16:39:47 livre kernel:         command: Read (10): 28 00 0f 9b 31 ac 00 00 68 00
Aug 21 16:39:47 livre kernel: ieee1394: sbp2: aborting sbp2 command
Aug 21 16:39:47 livre kernel: scsi1 : destination target 1, lun 0
Aug 21 16:39:47 livre kernel:         command: Test Unit Ready: 00 00 00 00 00 00
Aug 21 16:39:47 livre kernel: ieee1394: sbp2: reset requested
Aug 21 16:39:47 livre kernel: ieee1394: sbp2: Generating sbp2 fetch agent reset
Aug 21 16:39:47 livre kernel: ieee1394: sbp2: aborting sbp2 command
Aug 21 16:39:47 livre kernel: scsi1 : destination target 1, lun 0
Aug 21 16:39:47 livre kernel:         command: Test Unit Ready: 00 00 00 00 00 00
Aug 21 16:39:47 livre kernel: ieee1394: sbp2: reset requested
Aug 21 16:39:47 livre kernel: ieee1394: sbp2: Generating sbp2 fetch agent reset
Aug 21 16:39:47 livre kernel: ieee1394: sbp2: aborting sbp2 command
Aug 21 16:39:47 livre kernel: scsi1 : destination target 1, lun 0
Aug 21 16:39:47 livre kernel:         command: Test Unit Ready: 00 00 00 00 00 00
Aug 21 16:39:47 livre kernel: ieee1394: sbp2: reset requested
Aug 21 16:39:47 livre kernel: ieee1394: sbp2: Generating sbp2 fetch agent reset
Aug 21 16:39:47 livre kernel: ieee1394: sbp2: aborting sbp2 command
Aug 21 16:39:47 livre kernel: scsi1 : destination target 1, lun 0
Aug 21 16:39:47 livre kernel:         command: Test Unit Ready: 00 00 00 00 00 00
Aug 21 16:39:47 livre kernel: scsi: Device offlined - not ready after error recovery: host 1 channel 0 id 1 lun 0
Aug 21 16:39:47 livre kernel: SCSI error : <1 0 1 0> return code = 0x50000
Aug 21 16:39:47 livre kernel: end_request: I/O error, dev sda, sector 261829036
Aug 21 16:39:47 livre kernel: scsi1 (1:0): rejecting I/O to offline device
Aug 21 16:39:47 livre kernel: raid1: Disk failure on sda11, disabling device.
Aug 21 16:39:47 livre kernel:   Operation continuing on 1 devices
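
The raw values in the log above can be cross-checked by hand. The following is
a minimal sketch, not part of the original report: it decodes the Read (10)
command bytes and the SCSI return code, assuming the standard READ(10) layout
(4-byte big-endian LBA, 2-byte big-endian transfer length) and the Linux
midlayer convention of one byte each for driver, host, message and status. The
helper names (decode_read10, split_result) are made up for illustration, and
only a few host-byte names are listed.

```python
import struct

# A few host-byte codes as defined by the Linux SCSI midlayer (not exhaustive).
HOST_BYTE_NAMES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x03: "DID_TIME_OUT",
    0x05: "DID_ABORT",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
}


def decode_read10(cdb_hex):
    """Return (lba, block_count) for a READ(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) command")
    # Layout: opcode, flags, LBA (32-bit BE), group, length (16-bit BE), control
    _opcode, _flags, lba, _group, length, _control = struct.unpack(">BBIBHB", cdb)
    return lba, length


def split_result(result):
    """Split a 32-bit SCSI result into its four one-byte fields."""
    return {
        "status": result & 0xFF,
        "msg": (result >> 8) & 0xFF,
        "host": (result >> 16) & 0xFF,
        "driver": (result >> 24) & 0xFF,
    }


lba, blocks = decode_read10("28 00 0f 9b 31 ac 00 00 68 00")
print(lba, blocks)  # 261829036 104

fields = split_result(0x50000)
print(HOST_BYTE_NAMES.get(fields["host"], hex(fields["host"])))  # DID_ABORT
```

The decoded LBA matches the sector in the end_request line, so the I/O error
belongs to the aborted Read (10). The host byte 0x05 indicates that the command
was aborted rather than rejected by the disk with a SCSI status.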

Comment 1 Alexandre Oliva 2005-10-06 03:38:08 UTC

More info (running kernel 2.6.13-1.1588_FC5.x86_64 IIRC): with swap on the
sbp2-controlled disk, suspend halts after stopping all tasks.  It appears
that, once all tasks are stopped, the kernel thread that services the disk
(khpsbpkt) is dead or at least stopped, which prevents access to the disk, so
suspend never completes.

Comment 2 Dave Jones 2005-11-08 04:44:32 UTC

Is this any better? I special-cased the firewire thread so that it does not suspend.

Comment 3 Alexandre Oliva 2005-11-08 17:30:46 UTC

No luck :-(  It suspended to swap on a non-sbp2 disk and looked like it would
come back OK, but then it started printing those messages about resetting sbp2,
and the system remained unusable.  I suspect that, after a while, something
would time out and kill the RAID members on the sbp2 disk, like before, but I
decided not to take the chance.  I didn't even try RAID on sbp2, since the
simpler test didn't work, but I guess I could if you think it would be useful :-)

Comment 4 Alexandre Oliva 2005-11-08 17:37:26 UTC

Thanks for beating me to trying that, but no luck with 1654 :-(  It suspended to
swap on a non-sbp2 disk and looked like it would come back OK, but then it
started printing those messages about resetting sbp2, and the system remained
unusable.  I suspect that, after a while, something would time out and kill the
RAID members on the sbp2 disk, like before, but I decided not to take the
chance.  I didn't even try RAID on sbp2, since the simpler test didn't work, but
I guess I could if you think it would be useful :-)

Comment 5 Stefan Richter 2006-02-14 17:32:05 UTC

I added sbp2's behaviour through suspend/resume cycles to my list of issues to
investigate. I won't make progress soon due to time constraints.

Comment 6 Stefan Richter 2006-02-23 18:00:35 UTC

I suppose this is actually a bug in ohci1394. No functions are implemented
to save and restore the controller's state, except for UniNorth-based Macs.

Comment 7 Dave Jones 2006-10-17 00:31:54 UTC

A new kernel update has been released (Version: 2.6.18-1.2200.fc5)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks' time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

In the last few updates, some users upgrading from FC4->FC5
have reported that installing a kernel update has left their
systems unbootable. If you have been affected by this problem
please check you only have one version of device-mapper & lvm2
installed.  See bug 207474 for further details.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

If this bug has been fixed, but you are now experiencing a different
problem, please file a separate bug for the new problem.

Thank you.

Comment 8 Stefan Richter 2006-10-17 06:22:03 UTC

This is still not fixed upstream. We have some partly tested code in
linux1394-2.6.git but there may be additional work needed to enable high-level
functions like sbp2 after resume.

Comment 9 Dave Jones 2006-10-17 21:46:39 UTC

Thanks for the pointer to the upstream bug, Stefan.

As there are not going to be Fedora-specific changes here, it makes more sense
to track this exclusively upstream rather than spam this bug with continual
"does it work yet" requests.