Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1171317

Summary:

xfs may crash after unmount if a log write is delayed

Product:

Red Hat Enterprise Linux 6

Reporter:

Harold Miller <hamiller>

Component:

kernel

Assignee:

Brian Foster <bfoster>

kernel sub component:

XFS

QA Contact:

Zorro Lang <zlang>

Status:

CLOSED ERRATA

Docs Contact:

Severity:

medium

Priority:

medium

CC:

bfoster, eguan, esandeen, plambri, vlad.bespalov

Version:

6.6

Target Milestone:

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

kernel-2.6.32-537.el6

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2015-07-22 08:34:06 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
Script to cause crash	none
Proposd patch	none

Description Harold Miller 2014-12-05 21:39:44 UTC

Created attachment 965235 [details]
Script to cause crash

Description of problem:If a specific XFS log write is delayed, e.g. in a failing drive, and while it is not complete XFS goes into shutdown mode and is unmounted, then a crash accessing freed memory is possible.


Version-Release number of selected component (if applicable):/linux-2.6/xfs_buf.c


How reproducible: See attached repro-script


Steps to Reproduce:
1.Execute script
2.
3.

Actual results:Crash with the following call trace:

<1>BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8
<1>IP: [<ffffffff8152ce7e>] _spin_lock+0xe/0x30
<4>PGD aeb4c067 PUD aeb1a067 PMD 0
<4>Oops: 0002 [#1] SMP
<4>last sysfs file: /sys/devices/pci0000:00/0000:00:15.0/0000:03:00.0/host2/port-2:2/end_device-2:2/target2:0:2/2:0:2:0/delete
<4>CPU 2
<4>Modules linked in: xfs(U) exportfs 8021q garp stp llc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntra
ck ip6table_filter ip6_tables ipv6 vsock(U) fiop(P)(U) ppdev parport_pc parport microcode vmware_balloon e1000 sg vmci(U) i2c_piix4 i2c_core shpchp ext4 jbd2 mbcache sd_mod crc_t10dif sr_mod
 cdrom mptsas mptscsih mptbase scsi_transport_sas pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: mperf]
<4>
<4>Pid: 4844, comm: xfslogd/2 Tainted: P        W  ---------------    2.6.32-504.1.3.el6.x86_64 #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
<4>RIP: 0010:[<ffffffff8152ce7e>]  [<ffffffff8152ce7e>] _spin_lock+0xe/0x30
<4>RSP: 0018:ffff8800a603fdb0  EFLAGS: 00010286
<4>RAX: 0000000000010000 RBX: ffff8800b91e4780 RCX: ffffe8ffffd086c8
<4>RDX: ffff8800bc87abb8 RSI: 0000000000000002 RDI: 00000000000000a8
<4>RBP: ffff8800a603fdb0 R08: 0000000000000000 R09: 0000000000000000
<4>R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000000a8
<4>R13: 0000000000000000 R14: 0000000000000002 R15: ffffe8ffffd086c8
<4>FS:  00007f3aec62b700(0000) GS:ffff880002300000(0000) knlGS:0000000000000000
<4>CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>CR2: 00000000000000a8 CR3: 00000000aeb56000 CR4: 00000000000407e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process xfslogd/2 (pid: 4844, threadinfo ffff8800a603e000, task ffff880037af2aa0)
<4>Stack:
<4> ffff8800a603fde0 ffffffffa052bf79 ffff8800b91e4780 0000000000000000
<4><d> ffff8800bc87ab00 ffff8800a603ffd8 ffff8800a603fe10 ffffffffa052c739
<4><d> ffff8800a603fe30 ffff8800bc87ab00 ffffe8ffffd086c0 ffffffffa0547780
<4>Call Trace:
<4> [<ffffffffa052bf79>] xlog_state_done_syncing+0x39/0xb0 [xfs]
<4> [<ffffffffa052c739>] xlog_iodone+0x59/0xb0 [xfs]
<4> [<ffffffffa0547780>] ? xfs_buf_iodone_work+0x0/0x50 [xfs]
<4> [<ffffffffa05477a6>] xfs_buf_iodone_work+0x26/0x50 [xfs]
<4> [<ffffffff81097fe0>] worker_thread+0x170/0x2a0
<4> [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
<4> [<ffffffff81097e70>] ? worker_thread+0x0/0x2a0
<4> [<ffffffff8109e66e>] kthread+0x9e/0xc0


Expected results: unmount complete without crash


Additional info: proposed patch attached

Comment 1 Harold Miller 2014-12-05 21:41:30 UTC

Created attachment 965236 [details]
Proposd patch

Comment 2 Eric Sandeen 2014-12-05 22:38:57 UTC

Interesting, the reporter first reported this upstream in 2013:

http://marc.info/?t=137123029400001&r=1&w=2

Comment 3 Eric Sandeen 2014-12-05 22:49:10 UTC

Note that the attached patch is not a proposed patch to fix the problem, but a patch which provokes the error.

Comment 8 Harold Miller 2014-12-10 16:46:45 UTC

OK, I unchecked the 'Private' box, and will update customer.

Comment 9 Harold Miller 2014-12-11 16:57:32 UTC

SFDC Case closed, customer will monitor BZ directly. Issue NOT fixed yet...

Comment 10 Brian Foster 2015-01-15 16:56:00 UTC

I think the difference between this crashing and not is the fs shutdown. This means we skip writing the unmount record and the associated log write wait logic. Using the attached delay hack patch, I can reproduce by issuing a godown and umount once a delayed log write is triggered. Without the shutdown, the umount sequence blocks on the delayed I/O.

I think this is similar to bug 1112214 on rhel7 and fundamentally addressed by the same fix to make the workqueue per-mount such that it is destroyed at umount. The problem is that rhel6 does not have the newer workqueue implementation that has lended itself to the lower-cost creation of many per-mount wq's in XFS. I see that we have created at least one single-threaded per-mount workqueue in rhel6. I don't think that's really an option for xfslogd, however, as it is responsible for all metadata I/O completion along with log I/O completion.

We do have some code in xfs_log_unmount_write() that appears to attempt to handle the unmount of a shutdown fs scenario. I'll have to dig more into that to see why it doesn't prevent this issue...

Comment 11 Brian Foster 2015-01-15 18:17:31 UTC

... and that code doesn't help us because all of the iclogs are in error state.

Basically, what we have is a branch of code in the unmount path that runs explicitly when we've had a log I/O shutdown and attempts to process and wait on the current iclog if it isn't in an error state. As part of log I/O shutdown, we set _all_ of the iclogs to an error state (xlog_state_ioerror()). So it appears this code doesn't really do anything at all.

Comment 12 Brian Foster 2015-01-15 18:25:23 UTC

I think what we might need is the following commit or some variant thereof:

9c23eccc xfs: unmount does not wait for shutdown during unmount

Comment 13 Brian Foster 2015-01-16 15:48:49 UTC

Confirmed that the commit in comment #12 addresses the problem via the delayed I/O reproducer. I have a series undergoing additional testing.

Comment 14 RHEL Program Management 2015-01-16 15:49:44 UTC

This request was evaluated by Red Hat Product Management for
inclusion in a Red Hat Enterprise Linux release.  Product
Management has requested further review of this request by
Red Hat Engineering, for potential inclusion in a Red Hat
Enterprise Linux release for currently deployed products.
This request is not yet committed for inclusion in a release.

Comment 16 Rafael Aquini 2015-02-25 15:41:54 UTC

Patch(es) available on kernel-2.6.32-537.el6

Comment 19 Zorro Lang 2015-03-05 07:50:20 UTC

Reproduced on kernel 504 with attached patch and script(test on scsi_debug disk):

BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8
IP: [<ffffffff8152ce5e>] _spin_lock+0xe/0x30
PGD 0 
Oops: 0002 [#1] SMP 
last sysfs file: /sys/devices/pseudo_0/adapter0/host6/target6:0:0/6:0:0:0/delete
CPU 41 
Modules linked in: scsi_debug xfs(U) exportfs autofs4 cpufreq_ondemand acpi_cpufreq freq_table mperf ipv6 ipmi_devintf ipmi_si ipmi_msghandler microcode iTCO_wdt iTCO_vendor_support cdc_ether usbnet mii serio_raw i2c_i801 i2c_core lpc_ich mfd_core shpchp sg i7core_edac edac_core ioatdma dca bnx2 ext4 jbd2 mbcache sd_mod crc_t10dif pata_acpi ata_generic ata_piix megaraid_sas wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_debug]

Pid: 47574, comm: xfslogd/41 Not tainted 2.6.32-504.el6.x86_64 #1 IBM System x3850 X5 -[7143OQE]-/Node 1, Processor Card
RIP: 0010:[<ffffffff8152ce5e>]  [<ffffffff8152ce5e>] _spin_lock+0xe/0x30
RSP: 0018:ffff881032491db0  EFLAGS: 00010286
RAX: 0000000000010000 RBX: ffff881077de03c0 RCX: ffffe8f7ffd24948
RDX: ffff88107865ebb8 RSI: 0000000000000002 RDI: 00000000000000a8
RBP: ffff881032491db0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: 00000000000000a8
R13: 0000000000000000 R14: 0000000000000002 R15: ffffe8f7ffd24948
FS:  0000000000000000(0000) GS:ffff88048e520000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000000000a8 CR3: 0000000001a85000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process xfslogd/41 (pid: 47574, threadinfo ffff881032490000, task ffff881078edb500)
Stack:
 ffff881032491de0 ffffffffa051af79 ffff881077de03c0 0000000000000000
<d> ffff88107865eb00 ffff881032491fd8 ffff881032491e10 ffffffffa051b739
<d> ffff881032491e30 ffff88107865eb00 ffffe8f7ffd24940 ffffffffa0536780
Call Trace:
 [<ffffffffa051af79>] xlog_state_done_syncing+0x39/0xb0 [xfs]
 [<ffffffffa051b739>] xlog_iodone+0x59/0xb0 [xfs]
 [<ffffffffa0536780>] ? xfs_buf_iodone_work+0x0/0x50 [xfs]
 [<ffffffffa05367a6>] xfs_buf_iodone_work+0x26/0x50 [xfs]
 [<ffffffff81097fe0>] worker_thread+0x170/0x2a0
 [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff81097e70>] ? worker_thread+0x0/0x2a0
 [<ffffffff8109e66e>] kthread+0x9e/0xc0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
Code: e5 0f 1f 44 00 00 fa 66 0f 1f 44 00 00 f0 81 2f 00 00 00 01 74 05 e8 22 bb d6 ff c9 c3 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 <f0> 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 0f b7 17 eb f5 
RIP  [<ffffffff8152ce5e>] _spin_lock+0xe/0x30
 RSP <ffff881032491db0>
CR2: 00000000000000a8

Comment 20 Zorro Lang 2015-03-06 17:57:50 UTC

test passed, after run this reproducer more than 100 times on kernel 540

Comment 22 errata-xmlrpc 2015-07-22 08:34:06 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1272.html