Bugzilla will be upgraded to version 5.0 on a still to be determined date in the near future. The original upgrade date has been delayed.
Bug 614206 - WARNING: at fs/buffer.c:1159 mark_buffer_dirty+0x82/0xa0() after device removal
WARNING: at fs/buffer.c:1159 mark_buffer_dirty+0x82/0xa0() after device removal
Status: CLOSED DUPLICATE of bug 591466
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel (Show other bugs)
6.0
All Linux
low Severity medium
: rc
: ---
Assigned To: Red Hat Kernel Manager
Red Hat Kernel QE team
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2010-07-13 16:58 EDT by Babu Moger
Modified: 2010-07-15 12:54 EDT (History)
6 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-07-15 12:54:33 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
Messages file showing the OOps(time stamp Jul 13 17:18:12) (2.23 MB, text/plain)
2010-07-13 17:03 EDT, Babu Moger
no flags Details
Kernel abrt files from /var/spool/abrt/ (358.51 KB, application/x-gzip-compressed)
2010-07-13 17:05 EDT, Babu Moger
no flags Details

  None (edit)
Description Babu Moger 2010-07-13 16:58:10 EDT
Description of problem:
We have noticed a kernel oops during our testing with fibre Channel storage.
After this point system becomes very un-reliable. Kernel shows following output during the problem. 

Jul 13 17:18:12 kswc-kenpachi kernel: ------------[ cut here ]------------
Jul 13 17:18:12 kswc-kenpachi kernel: WARNING: at fs/buffer.c:1159 mark_buffer_dirty+0x82/0xa0() (Not tainted)
Jul 13 17:18:12 kswc-kenpachi kernel: Hardware name: PowerEdge R710
Jul 13 17:18:12 kswc-kenpachi kernel: Modules linked in: ext3 jbd fuse nfs lockd fscache nfs_acl auth_rpcgss autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables be2iscsi bnx2i cnic uio cxgb3i iw_cxgb3 cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi dm_mirror dm_region_hash dm_log uinput dcdbas serio_raw iTCO_wdt iTCO_vendor_support i7core_edac edac_core bnx2 ext4 mbcache jbd2 mppVhba(U) sr_mod cdrom ata_generic pata_acpi ata_piix qla2xxx scsi_transport_fc scsi_tgt mptsas mptscsih mptbase scsi_transport_sas mppUpper(U) sg sd_mod crc_t10dif dm_mod [last unloaded: microcode]
Jul 13 17:18:12 kswc-kenpachi kernel: Pid: 21671, comm: LinuxSmash Not tainted 2.6.32-44.el6.x86_64 #1
Jul 13 17:18:12 kswc-kenpachi kernel: Call Trace:
Jul 13 17:18:12 kswc-kenpachi kernel: [<ffffffff8106a883>] warn_slowpath_common+0x83/0xc0
Jul 13 17:18:12 kswc-kenpachi kernel: [<ffffffff8106a8d4>] warn_slowpath_null+0x14/0x20
Jul 13 17:18:12 kswc-kenpachi kernel: [<ffffffff81198702>] mark_buffer_dirty+0x82/0xa0
Jul 13 17:18:12 kswc-kenpachi kernel: [<ffffffffa04ddff7>] ext3_commit_super.clone.0+0x57/0x80 [ext3]
Jul 13 17:18:12 kswc-kenpachi kernel: [<ffffffffa04de09f>] ext3_handle_error+0x7f/0xd0 [ext3]
Jul 13 17:18:12 kswc-kenpachi kernel: [<ffffffffa04de182>] __ext3_std_error+0x92/0xa0 [ext3]
Jul 13 17:18:12 kswc-kenpachi kernel: [<ffffffffa04d4d20>] ext3_reserve_inode_write+0x50/0xa0 [ext3]
Jul 13 17:18:12 kswc-kenpachi kernel: [<ffffffffa04d8af3>] ext3_orphan_del+0x93/0x1c0 [ext3]
Jul 13 17:18:12 kswc-kenpachi kernel: [<ffffffffa04d6c92>] ext3_truncate+0x322/0x9f0 [ext3]
Jul 13 17:18:12 kswc-kenpachi kernel: [<ffffffff81058f92>] ? finish_task_switch+0x42/0xd0
Jul 13 17:18:12 kswc-kenpachi kernel: [<ffffffff811211be>] vmtruncate+0x5e/0x70
Jul 13 17:18:12 kswc-kenpachi kernel: [<ffffffff811a1287>] __blockdev_direct_IO+0xb17/0xc40
Jul 13 17:18:12 kswc-kenpachi kernel: [<ffffffff81055363>] ? __wake_up+0x53/0x70
Jul 13 17:18:12 kswc-kenpachi kernel: [<ffffffffa04d525a>] ext3_direct_IO+0xba/0x240 [ext3]
Jul 13 17:18:12 kswc-kenpachi kernel: [<ffffffffa04d6030>] ? ext3_get_block+0x0/0x120 [ext3]
Jul 13 17:18:12 kswc-kenpachi kernel: [<ffffffff8110cbd2>] generic_file_direct_write+0xc2/0x190
Jul 13 17:18:12 kswc-kenpachi kernel: [<ffffffff8110e3c5>] __generic_file_aio_write+0x345/0x480
Jul 13 17:18:12 kswc-kenpachi kernel: [<ffffffff8110e56f>] generic_file_aio_write+0x6f/0xe0
Jul 13 17:18:12 kswc-kenpachi kernel: [<ffffffff81169d9a>] do_sync_write+0xfa/0x140
Jul 13 17:18:12 kswc-kenpachi kernel: [<ffffffff81090d50>] ? autoremove_wake_function+0x0/0x40
Jul 13 17:18:12 kswc-kenpachi kernel: [<ffffffff810dd63e>] ? call_rcu+0xe/0x10
Jul 13 17:18:12 kswc-kenpachi kernel: [<ffffffff8124c357>] ? cic_free_func+0x87/0xa0
Jul 13 17:18:12 kswc-kenpachi kernel: [<ffffffff81206f9b>] ? selinux_file_permission+0xfb/0x150
Jul 13 17:18:12 kswc-kenpachi kernel: [<ffffffff811fa486>] ? security_file_permission+0x16/0x20
Jul 13 17:18:12 kswc-kenpachi kernel: [<ffffffff8116a098>] vfs_write+0xb8/0x1a0
Jul 13 17:18:12 kswc-kenpachi kernel: [<ffffffff810d3709>] ? audit_syscall_exit+0x239/0x270
Jul 13 17:18:12 kswc-kenpachi kernel: [<ffffffff8116ab92>] sys_pwrite64+0x82/0xa0
Jul 13 17:18:12 kswc-kenpachi kernel: [<ffffffff8104c69c>] sys32_pwrite+0x1c/0x20
Jul 13 17:18:12 kswc-kenpachi kernel: [<ffffffff8104badf>] sysenter_dispatch+0x7/0x2e
Jul 13 17:18:12 kswc-kenpachi kernel: ---[ end trace fc6fde635caecda3 ]---


Version-Release number of selected component (if applicable):

# uname -a
Linux kswc-kenpachi 2.6.32-44.el6.x86_64 #1 SMP Wed Jul 7 15:47:50 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
Happens most of the time but may not happen everytime.

Steps to Reproduce:
1. Install RHEL 6. Upgrade the kernel  to 2.6.32-44.
2. Present 64 luns using LSI storage. Connect the storage using fibre channel switch.
3. Install RDAC driver and configure multipath.
4. Start i/o.
5. When i/o is running powercycle the switch causing devices to go away.
6. I/O should fail gracefully.
  
Actual results:

We have noticed system OOps during this time with the following messages.
Jul 13 17:18:42 kswc-kenpachi abrt: Kerneloops: Reported 1 kernel oopses to Abrt
Jul 13 17:18:42 kswc-kenpachi abrtd: Directory 'kerneloops-1279059522-1' creation detected


Expected results:

System should report i/o errors and exit gracefully with any oops.
Additional info:
Comment 2 Babu Moger 2010-07-13 17:03:06 EDT
Created attachment 431593 [details]
Messages file showing the OOps(time stamp Jul 13 17:18:12)
Comment 3 Babu Moger 2010-07-13 17:05:32 EDT
Created attachment 431594 [details]
Kernel abrt files from /var/spool/abrt/
Comment 4 Eric Sandeen 2010-07-14 11:56:34 EDT
So, this was down an ext3 error (__ext3_std_error) path, coming from ext3_reserve_inode_write().

We got these errors just prior:

Jul 13 17:18:12 kswc-kenpachi kernel: sd 5:0:0:5: [sdg] Unhandled error code
Jul 13 17:18:12 kswc-kenpachi kernel: sd 5:0:0:5: [sdg] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jul 13 17:18:12 kswc-kenpachi kernel: sd 5:0:0:5: [sdg] CDB: Write(10): 2a 00 00 0c e4 e8 00 00 80 00
Jul 13 17:18:12 kswc-kenpachi kernel: end_request: I/O error, dev sdg, sector 845032
...
Jul 13 17:18:12 kswc-kenpachi kernel: end_request: I/O error, dev sdg, sector 1067160
...

...
Jul 13 17:18:12 kswc-kenpachi kernel: Aborting journal on device sdg.
Jul 13 17:18:12 kswc-kenpachi kernel: EXT3-fs error (device sdg) in ext3_reserve_inode_write: Journal has aborted

due to the:

> 5. When i/o is running powercycle the switch causing devices to go away.

The warning was in mark_buffer_dirty, warning that the buffer was not marked uptodate:  WARN_ON_ONCE(!buffer_uptodate(bh));

going down the ext3_commit_super() path from ext3_handle_error.

So in the end we have a not-uptodate superblock buffer which gets marked dirty on the way to commit, and mark_buffer_dirty() warns us that it is not uptodate.

It's likely not uptodate because of the IO errors ... note that this warning isn't an oops or a BUG(), and should not lead directly to any system instability.

Just because abrt called this an "oops" doesn't make it so, it tends to call any backtrace an "oops" sadly; the noisiness in the logs is not a huge problem per se.

You mentioned that the system becomes unreliable after this happens.  In what way does it become unreliable?  That seems to be the actual bug here, but it's not well explained.
Comment 5 Babu Moger 2010-07-14 13:02:40 EDT
After this problem system locks up with NMI. This happens when the devices come back when the FC switch is back online(from powercycle). NMI lockup happens when the system starts scanning the scsi devices. Here is the console output..

EXT3-fs error (device sdg) in ext3_reserve_inode_write: Journal has aborted
EXT3-fs error (device sdg) in ext3_orphan_del: Journal has aborted
EXT3-fs error (device sdg) in ext3_truncate: Journal has aborted
journal commit I/O error
ext3_abort called.
EXT3-fs error (device sdg): ext3_journal_start_sb: 
Remounting filesystem read-only
EXT3-fs error (device sdg): ext3_get_inode_loc: unable to read inode block - inode=24622, block=98373
EXT3-fs error (device sdg): ext3_get_inode_loc: unable to read inode block - inode=24622, block=98373
EXT3-fs error (device sdg): ext3_get_inode_loc: unable to read inode block - inode=24622, block=98373


BUG: NMI Watchdog detected LOCKUP on CPU1, ip ffffffff814da62f, registers:
CPU 1 
Modules linked in: ext3 jbd fuse nfs lockd fscache nfs_acl auth_rpcgss autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables be2iscsi bnx2i cnic uio cxgb3i iw_cxgb3 cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi dm_mirror dm_region_hash dm_log uinput dcdbas serio_raw iTCO_wdt iTCO_vendor_support i7core_edac edac_core bnx2 ext4 mbcache jbd2 mppVhba(U) sr_mod cdrom ata_generic pata_acpi ata_piix qla2xxx scsi_transport_fc scsi_tgt mptsas mptscsih mptbase scsi_transport_sas mppUpper(U) sg sd_mod crc_t10dif dm_mod [last unloaded: microcode]

Modules linked in: ext3 jbd fuse nfs lockd fscache nfs_acl auth_rpcgss autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables be2iscsi bnx2i cnic uio cxgb3i iw_cxgb3 cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi dm_mirror dm_region_hash dm_log uinput dcdbas serio_raw iTCO_wdt iTCO_vendor_support i7core_edac edac_core bnx2 ext4 mbcache jbd2 mppVhba(U) sr_mod cdrom ata_generic pata_acpi ata_piix qla2xxx scsi_transport_fc scsi_tgt mptsas mptscsih mptbase scsi_transport_sas mppUpper(U) sg sd_mod crc_t10dif dm_mod [last unloaded: microcode]
Pid: 384, comm: mpp_dcr/1 Tainted: G        W  2.6.32-44.el6.x86_64 #1 PowerEdge R710
RIP: 0010:[<ffffffff814da62f>]  [<ffffffff814da62f>] _spin_lock_irqsave+0x2f/0x40
RSP: 0018:ffff88041712fce0  EFLAGS: 00000097
RAX: 0000000000000a5d RBX: ffff8803e90b0000 RCX: 0000000000000a5c
RDX: 0000000000000286 RSI: 000000000000001e RDI: ffff880417764f00
RBP: ffff88041712fce0 R08: ffff8803f8c20358 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff8803e90b0000
R13: 000000000000001e R14: ffff88041712fd8f R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff880036820000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000022e22f8 CR3: 00000003f68c9000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process mpp_dcr/1 (pid: 384, threadinfo ffff88041712e000, task ffff880417234a60)
Stack:
 ffff88041712fd30 ffffffffa004762f ffff88041712fd10 ffff88041712fd8e
<0> ffff88041712fd60 ffff8803f8c20358 ffff8803e90b0000 ffff88041712fde8
<0> ffff88041712fdf0 ffff8803e057b9c0 ffff88041712fdc0 ffffffffa01443d4
Call Trace:
 [<ffffffffa004762f>] mppCmn_NewSelectPath+0x3f/0xc0 [mppUpper]
 [<ffffffffa01443d4>] mppLnx_do_queuecommand+0xa4/0xb60 [mppVhba]
 [<ffffffffa0138438>] mppLnx_process_ios+0x118/0x170 [mppVhba]
 [<ffffffffa0138320>] ? mppLnx_process_ios+0x0/0x170 [mppVhba]
 [<ffffffff8108b6c0>] worker_thread+0x170/0x2a0
 [<ffffffff81090d50>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8108b550>] ? worker_thread+0x0/0x2a0
 [<ffffffff810909e6>] kthread+0x96/0xa0
 [<ffffffff810141ca>] child_rip+0xa/0x20
 [<ffffffff81090950>] ? kthread+0x0/0xa0
 [<ffffffff810141c0>] ? child_rip+0x0/0x20
Comment 6 Eric Sandeen 2010-07-14 13:46:39 EDT
This lockup is in the mppUpper (RDAC?) module, which is not something we can debug or support, I think.

Perhaps we have an issue with the WARNING() and the superblock buffer, I need to think about that little more, but this looks like a different bug, in a 3rd party kernel module.

-Eric
Comment 7 Ric Wheeler 2010-07-14 13:55:23 EDT
It would be interesting to try this with RHEL native multipathing - dm multipath.
Comment 8 Babu Moger 2010-07-14 14:07:19 EDT
Ok.. We will try with native multipathing and let you know the results..
Comment 9 Ric Wheeler 2010-07-14 14:17:21 EDT
Great - let us know if you have questions about multipathing...
Comment 10 RHEL Product and Program Management 2010-07-15 10:34:18 EDT
This issue has been proposed when we are only considering blocker
issues in the current Red Hat Enterprise Linux release. It has
been denied for the current Red Hat Enterprise Linux release.

** If you would still like this issue considered for the current
release, ask your support representative to file as a blocker on
your behalf. Otherwise ask that it be considered for the next
Red Hat Enterprise Linux release. **
Comment 11 Eric Sandeen 2010-07-15 12:54:33 EDT
I'm going to dup this one to bug 591466 for the mark_buffer_dirty issue.

If there are other problems that arise in testing, please open new bugs.

Thanks!

-Eric

*** This bug has been marked as a duplicate of bug 591466 ***

Note You need to log in before you can comment on or make changes to this bug.