Bug 663068
Summary:          genesis w/flocks on NFS can't verify write
Product:          Red Hat Enterprise Linux 5
Component:        kernel
Version:          5.6
Hardware:         All
OS:               Linux
Status:           CLOSED DUPLICATE
Severity:         medium
Priority:         medium
Target Milestone: rc
Target Release:   ---
Reporter:         Nate Straz <nstraz>
Assignee:         Jeff Layton <jlayton>
QA Contact:       Cluster QE <mspqa-list>
CC:               adas, bfields, bmarzins, dhowells, jlayton, rpeterso, rwheeler, steved, swhiteho
Doc Type:         Bug Fix
Clone Of:         245024
Last Closed:      2011-05-26 17:17:56 UTC
Description
Nate Straz
2010-12-14 15:34:39 UTC
*** Bug 223263 has been marked as a duplicate of this bug. ***

Spoke with Jeff Layton, who agreed to take it for now, since we can recreate it outside of GFS and GFS2. Let me, Abhi, or Nate know if you want help recreating it.

I wonder whether this may be a duplicate of bug 672981. The problem sounds like it may be somewhat related. Once we've had some time to go over that patch, I'll put it in my test kernels and perhaps we can see whether it helps this issue too.

Nate, I've added Trond's patch to the jtltest.131 kernels on my people.redhat.com page: http://people.redhat.com/jlayton/ ...if you could rerun this test with those kernels, that would be great. If it happens to fix this issue, then perhaps we can get this into 5.7 after all.

Nate ran the test over the weekend. It made it most of the weekend and then failed at ~1am Monday morning. While Trond's patch seems to help, it doesn't seem to be a complete fix. The next step is to re-run this with RHEL6 clients to see if this is still an upstream issue. He's doing that this week and should have results by week's end...

This week I have run the genesis load with the same RHEL5.6 server running kernel-2.6.18-238.el5, against RHEL6.0 clients running kernel-2.6.32-71.el6.x86_64 and RHEL6.1 clients running kernel-2.6.32-94.el6.x86_64. I have not hit the data comparison error at all during the four days of testing. I have hit the short read several times on RHEL6.0 and RHEL6.1, using both ext3 and GFS under the NFS export.

I also hit a new symptom on the RHEL6.1 clients. After hitting a short read, on both systems (one with ext3 under NFS and one with GFS under NFS) one of the genesis processes got stuck on a CPU.

west-09, running NFS on ext3:

creat_file(): Short read(3, buf, 1024) (nbytes is 0) on gendir_0/udiehtlcyxdvubptsaaudgtm
genesis (32625) exited with status 1
Killing all pids.
genesis (32626) terminated with signal 15:
genesis (32627) terminated with signal 15:
genesis (32628) terminated with signal 15:

BUG: soft lockup - CPU#2 stuck for 61s! [genesis:32630]
Modules linked in: nfs fscache nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc cpufreq_ondemand powernow_k8 freq_table ipv6 dm_mirror dm_region_hash dm_log dcdbas serio_raw qla2xxx scsi_transport_fc scsi_tgt amd64_edac_mod edac_core edac_mce_amd k8temp hwmon tg3 sg i2c_piix4 shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom sata_svw ata_generic pata_acpi pata_serverworks radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod [last unloaded: scsi_wait_scan]
CPU 2:
Modules linked in: [same module list as above]
Pid: 32630, comm: genesis Not tainted 2.6.32-94.el6.x86_64 #1 PowerEdge SC1435
RIP: 0010:[<ffffffff81117f00>]  [<ffffffff81117f00>] clear_page_dirty_for_io+0x60/0x100
RSP: 0000:ffff880119ec7b58  EFLAGS: 00000246
RAX: ffff880118672848 RBX: ffff880119ec7b68 RCX: ffff880028035b98
RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffffea000397a9d8
RBP: ffffffff8100bc8e R08: ffff880028035b98 R09: fed8b06043dc8407
R10: 00000000002fc970 R11: 0000000000000400 R12: ffff880119ec7ac0
R13: ffffffff8108a130 R14: ffff8801190b2b30 R15: 0000000000000001
FS:  00007f874964a700(0000) GS:ffff880028280000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f1e57302c08 CR3: 0000000118231000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffffa0478fd0>] ? nfs_wb_page+0x60/0xd0 [nfs]
 [<ffffffffa0466c1e>] ? nfs_launder_page+0x4e/0x80 [nfs]
 [<ffffffff8111be52>] ? invalidate_inode_pages2_range+0x322/0x3b0
 [<ffffffff8111bef7>] ? invalidate_inode_pages2+0x17/0x20
 [<ffffffffa046af88>] ? nfs_revalidate_mapping+0x128/0x170 [nfs]
 [<ffffffffa0467c77>] ? nfs_file_read+0x77/0x130 [nfs]
 [<ffffffff81165c8a>] ? do_sync_read+0xfa/0x140
 [<ffffffff8118f4f4>] ? writeback_single_inode+0x154/0x2c0
 [<ffffffff8108a0f0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8120537b>] ? selinux_file_permission+0xfb/0x150
 [<ffffffff811f87e6>] ? security_file_permission+0x16/0x20
 [<ffffffff811666b5>] ? vfs_read+0xb5/0x1a0
 [<ffffffff810cca12>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff811667f1>] ? sys_read+0x51/0x90
 [<ffffffff8100b172>] ? system_call_fastpath+0x16/0x1b

west-10, running NFS on GFS:

creat_file(): Short read(3, buf, 1024) (nbytes is 0) on gendir_0/udiehtlcyxdvubptsaaudgtm
genesis (30566) exited with status 1
Killing all pids.
genesis (30567) terminated with signal 15:
genesis (30568) terminated with signal 15:
genesis (30569) terminated with signal 15:

BUG: soft lockup - CPU#2 stuck for 61s! [genesis:30570]
Modules linked in: nfs fscache nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc cpufreq_ondemand powernow_k8 freq_table ipv6 dm_mirror dm_region_hash dm_log dcdbas serio_raw amd64_edac_mod edac_core edac_mce_amd k8temp hwmon qla2xxx scsi_transport_fc scsi_tgt tg3 sg i2c_piix4 shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom sata_svw pata_acpi ata_generic pata_serverworks radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod [last unloaded: scsi_wait_scan]
CPU 2:
Modules linked in: [same module list as above]
Pid: 30570, comm: genesis Not tainted 2.6.32-94.el6.x86_64 #1 PowerEdge SC1435
RIP: 0010:[<ffffffff814c51e0>]  [<ffffffff814c51e0>] out_of_line_wait_on_bit_lock+0x0/0x90
RSP: 0018:ffff880112f6db00  EFLAGS: 00000202
RAX: 00000000ffffffff RBX: ffff880112f6db68 RCX: 0000000000000082
RDX: ffffffffa0482aa0 RSI: 0000000000000007 RDI: ffff880114da45a0
RBP: ffffffff8100bc8e R08: ffff880028032b80 R09: fecd25ff96756807
R10: 00000000002e018c R11: 0000000000000400 R12: ffffffffa0482aa0
R13: ffff880114da45a0 R14: 0000000000000007 R15: 0000000000000400
FS:  00007fd1fc79e700(0000) GS:ffff880028280000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003fd78672b0 CR3: 0000000112d2b000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffffa0490e5a>] ? nfs_commit_inode+0xaa/0x1c0 [nfs]
 [<ffffffffa0490fe9>] ? nfs_wb_page+0x79/0xd0 [nfs]
 [<ffffffffa047ec1e>] ? nfs_launder_page+0x4e/0x80 [nfs]
 [<ffffffff8111be52>] ? invalidate_inode_pages2_range+0x322/0x3b0
 [<ffffffff8111bef7>] ? invalidate_inode_pages2+0x17/0x20
 [<ffffffffa0482f88>] ? nfs_revalidate_mapping+0x128/0x170 [nfs]
 [<ffffffffa047fc77>] ? nfs_file_read+0x77/0x130 [nfs]
 [<ffffffff81165c8a>] ? do_sync_read+0xfa/0x140
 [<ffffffff8118f4f4>] ? writeback_single_inode+0x154/0x2c0
 [<ffffffff8108a0f0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8120537b>] ? selinux_file_permission+0xfb/0x150
 [<ffffffff811f87e6>] ? security_file_permission+0x16/0x20
 [<ffffffff811666b5>] ? vfs_read+0xb5/0x1a0
 [<ffffffff810cca12>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff811667f1>] ? sys_read+0x51/0x90
 [<ffffffff8100b172>] ? system_call_fastpath+0x16/0x1b

Created attachment 477043 [details]
console logs from west-09 and west-10 with soft lockup messages
I looked more closely and there are back traces from several processes in the logs. I've attached the complete console logs from the past two days for inspection.
I think the RHEL6 problems may be related to bug 672305, but it's hard to be sure. It may be good to open a separate bug for that...

It's another longshot, but I have some patches in my latest test kernel that I suspect may help this. The existing code can end up returning too early from fsync, close or setattr -- before writeback is done. Nate, would you be able to rerun this test on the kernels here? http://people.redhat.com/jlayton/ Even if it doesn't help, it'll be good to rule out these problems as the cause.

I ran the test kernel for a day with three processes and didn't hit anything. I'm going to restart with six processes and let it run through the weekend.

Thanks Nate. I'll set this to needinfo to remind me that we're waiting on results...

I ran 2.6.18-246.el5.jtltest.135 on five client systems for a week and was not able to reproduce the data comparison error, nor the short read. Thumbs up!

Hi Nate, I think all of the relevant patches for this are now in the latest 5.7 builds. Could you re-test this with a -262.el5 kernel or later?

Got the load started on a -262.el5 kernel. It ran all night, but I'm going to let it go through the weekend. I figured out this morning that I had a problem with my command line and only one process was running on each NFS client instead of three, so I restarted the test this morning.

I've run the load from 8 NFS clients to one server, all running -262.el5, for four days. I have not reproduced the data corruption issues I was seeing before. Since this is in the 5.7 kernel, shouldn't all of the 5.7 flags be set?

Excellent news. I think the right thing to do here is to close this as a duplicate of bug 441730, since that patchset is what I suspect ultimately fixed this. I'll go ahead and do that; please reopen this one if that's a problem.

*** This bug has been marked as a duplicate of bug 441730 ***