Bug 786610 - PCI device reset can cause a kernel bug
Summary: PCI device reset can cause a kernel bug
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Don Dutile (Red Hat)
QA Contact: Endre "Hrebicek" Balint-Nagy
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-02-01 22:07 UTC by Don Dutile (Red Hat)
Modified: 2014-09-24 01:29 UTC (History)

Fixed In Version: kernel-2.6.32-238.el6
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-06-20 08:21:15 UTC
Target Upstream Version:


Attachments
Test script used to reproduce error, and verify fix. (166 bytes, application/x-shellscript)
2012-02-01 22:07 UTC, Don Dutile (Red Hat)


Links
System: Red Hat Product Errata
ID: RHSA-2012:0862
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: Red Hat Enterprise Linux 6 kernel security, bug fix and enhancement update
Last Updated: 2012-06-20 12:55:00 UTC

Description Don Dutile (Red Hat) 2012-02-01 22:07:08 UTC
Created attachment 558926
Test script used to reproduce error, and verify fix.

Description of problem:
pci_block_user_cfg_access was designed for the use case in which a single
context, the IPR driver, temporarily delays user-space accesses to the
config space via sysfs. This assumption became invalid once
pci_dev_reset was added as another locking instance. Today, if you run two
loops in parallel that reset the same device via sysfs, you end up with
a kernel BUG because pci_block_user_cfg_access detects the broken assumption.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Log in as root (to be able to write to the device's sysfs files).
2. Run two shell scripts in parallel, each resetting the same PCI device with
   a 1 second delay between resets. Pick a device (an extra NIC, for example)
   that the host is not using.
3.
  
Actual results:
Host hangs within 5 seconds

Expected results:
The two threads can run indefinitely.
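
For reference, the reproduction steps above can be sketched as a single script that starts both reset loops itself. This is only a sketch, not the attached test script: the function names reset_loop and run_parallel_resets are made up for illustration, and the device address in the closing comment is the one from the reproducer in this bug, so substitute a device your host is not using.

```shell
#!/bin/bash
# Sketch only: reset_loop and run_parallel_resets are hypothetical helper names.

# reset_loop FILE COUNT DELAY
# Write "1" to FILE COUNT times, sleeping DELAY seconds after each write.
reset_loop() {
  i=1
  while [ "$i" -le "$2" ]; do
    echo "[-- $i iteration -- $(date) -- ]"
    echo 1 > "$1" || return 1
    sleep "$3"
    i=$((i + 1))
  done
}

# run_parallel_resets FILE [COUNT] [DELAY]
# Start two reset loops against the same FILE and wait for both to finish.
# COUNT defaults to 1000 and DELAY to 1 second.
run_parallel_resets() {
  reset_loop "$1" "${2:-1000}" "${3:-1}" &
  pid1=$!
  reset_loop "$1" "${2:-1000}" "${3:-1}" &
  pid2=$!
  rc=0
  wait "$pid1" || rc=1
  wait "$pid2" || rc=1
  return "$rc"
}

# Example (as root, against a device the host is not using):
# run_parallel_resets /sys/bus/pci/devices/0000:05:00.0/reset
```

On an affected kernel the host is expected to hang shortly after the two loops start; on a fixed kernel both loops should run to completion and the function returns 0.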


Additional info:
A backport of upstream commit fb51ccbf217c1c994607b6519c7d85250928553d
resolves this problem.
Note: a straight cherry-pick/backport will break kABI because the commit
      renames a member of the pci_dev structure, so the backport must be
      modified to maintain kABI.

Comment 2 RHEL Program Management 2012-02-01 23:29:19 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 3 Don Dutile (Red Hat) 2012-02-14 19:16:40 UTC
Test script to reproduce error, and verify (posted) fix:

#!/bin/bash
for i in {1..1000}; do
  echo "[-- $i iteration -- $(date) -- ]"
  echo 1 > /sys/bus/pci/devices/0000\:05\:00.0/reset
  echo "sleep 1 secs"
  sleep 1
done

Comment 4 Aristeu Rozanski 2012-02-24 21:56:01 UTC
Patch(es) available on kernel-2.6.32-238.el6

Comment 7 Endre "Hrebicek" Balint-Nagy 2012-02-28 12:45:33 UTC
Good job!
The 220.el6 kernel hung before the second iteration of the reproducer;
the 238.el6 kernel has survived 744+ iterations so far.
After the 1000th iteration I'll set this BZ to VERIFIED state.

Comment 9 errata-xmlrpc 2012-06-20 08:21:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0862.html

