Bug 786610

Summary: PCI device reset can cause a kernel bug
Product: Red Hat Enterprise Linux 6 Reporter: Don Dutile (Red Hat) <ddutile>
Component: kernel Assignee: Don Dutile (Red Hat) <ddutile>
Status: CLOSED ERRATA QA Contact: Endre "Hrebicek" Balint-Nagy <endre>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 6.2 CC: benl, kzhang, mjenner
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: kernel-2.6.32-238.el6 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-06-20 08:21:15 UTC Type: ---
Attachments:
Description: Test script used to reproduce error, and verify fix. Flags: none

Description Don Dutile (Red Hat) 2012-02-01 22:07:08 UTC
Created attachment 558926 [details]
Test script used to reproduce error, and verify fix.

Description of problem:
pci_block_user_cfg_access was designed for the use case in which a single
context, the IPR driver, temporarily delays user space accesses to the
config space via sysfs. This assumption became invalid when
pci_dev_reset was added as a locking instance. Today, if you run two
loops in parallel that reset the same device via sysfs, you end up with
a kernel BUG as pci_block_user_cfg_access detects the broken assumption.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Log in as root (to be able to write to the sysfs files of a device).
2. Run two shell scripts, each resetting the same PCI device, with a 1 second
   delay between each reset.  Pick a device (an extra NIC, for example) that
   the host is not using.
  
Actual results:
Host hangs within 5 seconds

Expected results:
The two threads can run indefinitely.


Additional info:
A backport of upstream commit fb51ccbf217c1c994607b6519c7d85250928553d
resolves this problem.
Note: a straight cherry-pick/backport will break kABI since it renames a
      pci_dev structure element, so the backport must be modified to
      maintain kABI.

Comment 2 RHEL Program Management 2012-02-01 23:29:19 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 3 Don Dutile (Red Hat) 2012-02-14 19:16:40 UTC
Test script to reproduce error, and verify (posted) fix:

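#!/bin/bash
# Reset the same PCI device once per second; run two copies in parallel
# (as root) to reproduce the bug.
for i in {1..1000}; do
  echo "[-- $i iteration -- $(date) -- ]"
  # Writing 1 to the sysfs reset file triggers a reset of the device.
  echo 1 > /sys/bus/pci/devices/0000\:05\:00.0/reset
  echo "sleep 1 secs"
  sleep 1
done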
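
The reproduction steps call for two such loops running at once against the same device. A minimal sketch of a wrapper that starts both loops and waits for them might look like the following. The function names reset_loop and run_parallel_resets, the DELAY variable, and the loop tags are illustrative and not part of the attached script. It has to be run as root against a device the host is not using, and on an unfixed kernel it is expected to hang the host.

```shell
#!/bin/bash
# Sketch only: reset_loop, run_parallel_resets and DELAY are illustrative names.

# Write 1 to the given reset file a fixed number of times, pausing between writes.
reset_loop() {
  local reset_file=$1 iterations=$2 tag=$3
  local i
  for ((i = 1; i <= iterations; i++)); do
    echo "[$tag] iteration $i -- $(date)"
    echo 1 > "$reset_file"
    sleep "${DELAY:-1}"   # 1 second between resets unless DELAY is set
  done
}

# Start two loops in the background on the same reset file, then wait for both.
run_parallel_resets() {
  local reset_file=$1 iterations=${2:-1000}
  reset_loop "$reset_file" "$iterations" loop-A &
  local pid_a=$!
  reset_loop "$reset_file" "$iterations" loop-B &
  local pid_b=$!
  wait "$pid_a" "$pid_b"
}

# Example invocation (device address taken from the script above):
# run_parallel_resets /sys/bus/pci/devices/0000:05:00.0/reset 1000
```

Keeping the reset file as a parameter lets the same loop be pointed at any device, or at a scratch file for a dry run.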

Comment 4 Aristeu Rozanski 2012-02-24 21:56:01 UTC
Patch(es) available on kernel-2.6.32-238.el6

Comment 7 Endre "Hrebicek" Balint-Nagy 2012-02-28 12:45:33 UTC
Good job!
The 220.el6 kernel hung before the second iteration of the reproducer;
the 238.el6 kernel has survived 744+ iterations so far.
After the 1000th iteration I'll set this BZ to the VERIFIED state.
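
To check whether a given RHEL 6 kernel is at or past the fixed build, a rough sketch like the following may help. It assumes the release string has the form 2.6.32-<build>.el6 followed by an optional architecture, as printed by uname -r. The names has_reset_fix and FIXED_BUILD are hypothetical. A kernel with a lower build number could still carry a backport of the fix, which this check cannot detect.

```shell
#!/bin/bash
# Sketch only: has_reset_fix and FIXED_BUILD are hypothetical names.
FIXED_BUILD=238   # from "Fixed In Version: kernel-2.6.32-238.el6"

# Returns 0 if the release is at or past the fixed build, 1 if it is older,
# and 2 if the string does not look like 2.6.32-<build>.*
has_reset_fix() {
  local release=$1
  case $release in
    2.6.32-*) ;;
    *) return 2 ;;
  esac
  local build=${release#2.6.32-}   # drop the base version prefix
  build=${build%%.*}               # keep only the leading build number
  case $build in
    ''|*[!0-9]*) return 2 ;;
  esac
  [ "$build" -ge "$FIXED_BUILD" ]
}

# Example: has_reset_fix "$(uname -r)" && echo "fix should be present"
```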

Comment 9 errata-xmlrpc 2012-06-20 08:21:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0862.html