Bug 1086163
| Summary: | Guest os works abnormally after running 35 days; XFS is stuck at xlog_grant_head_wait() | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Xu Han <xuhan> |
| Component: | kernel | Assignee: | Radim Krčmář <rkrcmar> |
| Status: | CLOSED DUPLICATE | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 7.0 | CC: | acathrow, hhuang, juzhang, kwolf, lcapitulino, lersek, michen, virt-maint, vyasevic, xfu, xuhan |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-05-20 17:51:22 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Created attachment 884822 [details]
dmesg log
Guest dmesg log which was generated from 2nd mem dump.
host res info:
# free -m
total used free shared buffers cached
Mem: 3664 3561 102 1 0 512
-/+ buffers/cache: 3049 614
Swap: 3855 1900 1955
# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 23
Model name: Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz
Stepping: 10
CPU MHz: 2826.179
BogoMIPS: 5652.35
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 6144K
NUMA node0 CPU(s): 0-3
Created attachment 884868 [details]
screen shot
Command:
mysql-test: sudo -u mysql ./mysql-test-run
mysql-bench: run-all-tests --server=mysql --cmp=mysql --user=root --password=$PW
I think the network outage is a symptom, not the cause. The fact that the dbus service can't receive a reply points to a wider issue with the VM. Attempts to switch consoles on the system (sending Alt-Ctl-3) failed. Based on the network config on the command line, I don't think this is network related. The VM is using an e1000 NIC without any acceleration. It might be plugged into an OVS switch on the host side, but at this point that isn't the issue. The issue is that the VM appears halted and doesn't respond to any key presses.

-vlad

Vlad, while I agree with you, I think that the starting point to debug this issue is to understand why the network is not working. That's the main symptom we have. Can you help me with that?
As far as I could look into this issue, it seems that systemd-logind calls bus_method_call_with_reply(), which will in turn call dbus_connection_send_with_reply_and_block() which is the function that's failing because a reply is not received within a timeout. This could happen because of several reasons. A non-working network is one of them.
Xu Han, I'm under the impression that you were able to keep a shell open in the guest before the problem happened. Is this true? If it is, can you please do the following:
1. Send the latest dmesg output (if you have it, and if attachment #884822 [details] is not the latest one already)
2. Run "systemd-loginctl list-sessions". If it doesn't work, run "strace systemd-loginctl list-sessions" (hopefully you have strace)
(In reply to Luiz Capitulino from comment #8)
> Xu Han, I'm under the impression that you were able to keep a shell open in the guest before the problem happened. Is this true? If it is, can you please do the following:

Yep, there is a login shell on tty1 running mysql tests. Unfortunately, it has been out of control since this issue happened. (I could access the guest as described in comment 4.) The dmesg output I pasted in comment 1 was fetched from the guest memory dump, and the memory dump was generated after this issue happened (comment 6, '2nd-guest-mem.dump'). Sorry, I have no idea how to gather more debug info :(

(In reply to Luiz Capitulino from comment #8)
> Vlad, while I agree with you, I think that the starting point to debug this issue is to understand why the network is not working. That's the main symptom we have. Can you help me with that?

Sure.

> As far as I could look into this issue, it seems that systemd-logind calls bus_method_call_with_reply(), which will in turn call dbus_connection_send_with_reply_and_block(), which is the function that's failing because a reply is not received within a timeout. This could happen for several reasons. A non-working network is one of them.

The network we are talking about here is either loopback (less likely) or unix domain sockets (more likely). This is internal to the system and not part of the qemu network stack itself.

I tried using the Magic SysRq key on the VM in question and it didn't respond. I was trying to get a stack dump of the current processes. The CPUs are still showing as halted; that might explain why the OS is not responding.

-vlad

Here's the latest info on this BZ:
1. We have no access to the guest whatsoever, as all vCPUs are halted. We do have access to QEMU through QMP, though, and we also have a guest memory dump.
2. Here's a stack dump of the systemd-logind process. Nothing unusual:
crash> bt 622
PID: 622 TASK: ffff8800babac1a0 CPU: 1 COMMAND: "systemd-logind"
#0 [ffff8800b9541d68] schedule at ffffffff815c83b0
#1 [ffff8800b9541de0] schedule_hrtimeout_range_clock at ffffffff815c779d
#2 [ffff8800b9541e78] schedule_hrtimeout_range at ffffffff815c77d3
#3 [ffff8800b9541e88] ep_poll at ffffffff811e86d3
#4 [ffff8800b9541f38] sys_epoll_wait at ffffffff811e9675
#5 [ffff8800b9541f80] system_call_fastpath at ffffffff815d2c99
3. Also, some guest network info. The ens4 lost its IP (considering it had one)
crash> net
NET_DEVICE NAME IP ADDRESS(ES)
ffff8800bad47000 lo 127.0.0.1
ffff8800b8842000 ens4
crash>
4. Now, something which looks puzzling to me: query-block says that the image file (which is partition /dev/sdb6) has actual-size=0:
{"return": [{"io-status": "ok", "device": "drive-virtio0", "locked": false, "removable": false, "inserted": {"iops_rd": 0, "image": {"virtual-size": 32212254720, "filename": "/dev/sdb6", "format": "raw", "actual-size": 0}, "iops_wr": 0, "ro": false, "backing_file_depth": 0, "drv": "raw", "iops": 0, "bps_wr": 0, "encrypted": false, "bps": 0, "bps_rd": 0, "file": "/dev/sdb6", "encryption_key_missing": false}, "type": "unknown"}]}
qemu-img info tells the same thing. Xu Han, this image can't be empty, right? Kevin, is it possible for actual-size=0 to be reported even so (say, if QEMU wasn't able to determine the device's actual size)?
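For context on where actual-size comes from: on POSIX systems an "allocated size" is conventionally derived from stat()'s st_blocks field, which counts 512-byte units. A minimal Python sketch of that calculation (the helper name is mine; this models the behavior, it is not QEMU's code):

```python
import os

def actual_size(path):
    """Model of deriving an allocated size from stat(): st_blocks counts
    512-byte units actually allocated. For a block device node such as
    /dev/sdb6, fstat() reports st_blocks == 0, so the computed size is 0
    even though the device holds data."""
    st = os.stat(path)
    return st.st_blocks * 512
```

On a regular file this tracks allocated space (and can be smaller than the virtual size for sparse files); on a block device node it comes out as 0, matching the query-block output above.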
(In reply to Luiz Capitulino from comment #11)
> [...]
> qemu-img info tells the same thing. Xu Han, this image can't be empty, right? Kevin, is it possible to show actual-size=0 (say, QEMU wasn't able to determine the device's actual size).

Hi Xuhan,

Can you add your comment?

Best Regards,
Junyi

(In reply to Luiz Capitulino from comment #11)
> 4. Now, something which looks puzzling to me: query-block says that the image file (which is partition /dev/sdb6) has actual-size=0:
> [...]
> qemu-img info tells the same thing. Xu Han, this image can't be empty, right?

sdb6 is not empty. I got similar output on another VM (qcow2 format):

# qemu-img info /dev/sdb7
image: /dev/sdb7
file format: qcow2
virtual size: 30G (32212254720 bytes)
disk size: 0
cluster_size: 65536
Format specific information:
    compat: 1.1
    lazy refcounts: false

{"return": [{"io-status": "ok", "device": "drive-scsi0-0-0", "locked": false, "removable": false, "inserted": {"iops_rd": 0, "image": {"virtual-size": 32212254720, "filename": "/dev/sdb7", "cluster-size": 65536, "format": "qcow2", "actual-size": 0, "format-specific": {"type": "qcow2", "data": {"compat": "1.1", "lazy-refcounts": false}}, "dirty-flag": false}, "iops_wr": 0, "ro": false, "backing_file_depth": 0, "drv": "qcow2", "iops": 0, "bps_wr": 0, "encrypted": false, "bps": 0, "bps_rd": 0, "file": "/dev/sdb7", "encryption_key_missing": false}, "type": "unknown"}]}

(In reply to Luiz Capitulino from comment #11)
> 3. Also, some guest network info. The ens4 lost its IP (considering it had one)
>
> crash> net
> NET_DEVICE NAME IP ADDRESS(ES)
> ffff8800bad47000 lo 127.0.0.1
> ffff8800b8842000 ens4

ens4 had an IP address which was generated by a DHCP server.

(In reply to Luiz Capitulino from comment #11)
> 4. Now, something which looks puzzling to me: query-block says that the image file (which is partition /dev/sdb6) has actual-size=0:
> [...]
> qemu-img info tells the same thing. Xu Han, this image can't be empty, right? Kevin, is it possible to show actual-size=0 (say, QEMU wasn't able to determine the device's actual size).

On block devices (as opposed to regular files), you always get an actual-size of 0, because that is what fstat() returns for st_blocks. I guess we could change the host_device driver to return the full size instead if this is confusing.

I find it confusing myself, but it's minor. I thought that the fact that actual-size=0 could have something to do with this bug, but it doesn't. Thanks.

This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux.

Radim, I couldn't find anything wrong in the guest; the next thing I'd do would be to check on the host side why the guest's vCPUs are stopped.

As long as the qemu process is running (which I imagine is the case, given that the monitor responds), one could also carefully attach to the qemu process and peek around the VCPU threads (just make sure you detach before exiting gdb).

I didn't find anything unusual in kvm/qemu:
- qemu scheduled vcpus, which periodically handled irqs
- I could switch consoles with alt+f<n> and write in them (login wasn't there, which is understandable given the systemd-logind failure)
- new systemd-logind timeouts popped up between writing
- SysRq didn't work because we have 'kernel.sysrq = 16' (insane modern default)

I was getting more desperate in trying to find out what could have gone wrong, messed up, and lost the guest after gdb hung when single-stepping qemu-kvm... So I looked into the guest dump files after that.
The guest was weird: a lot (113) of processes were 'UN' (uninterruptible), like dbus-daemon, systemd, and dhclient [1], which hadn't been scheduled for more than 22 hours. They were all waiting on xfs. I think it is the same bug 1052789. Going to dig deeper into xfs structures to confirm that, tomorrow.

---
1:
PID: 1633 TASK: ffff8800b96c8000 CPU: 3 COMMAND: "dhclient"
#0 [ffff8800b95b9ae8] __schedule at ffffffff815c7e9d
#1 [ffff8800b95b9b50] schedule at ffffffff815c83d9
#2 [ffff8800b95b9b60] xlog_grant_head_wait at ffffffffa01edfed [xfs]
#3 [ffff8800b95b9bb0] xlog_grant_head_check at ffffffffa01ee155 [xfs]
#4 [ffff8800b95b9bf0] xfs_log_reserve at ffffffffa01f188f [xfs]
#5 [ffff8800b95b9c20] xfs_trans_reserve at ffffffffa01ad524 [xfs]
#6 [ffff8800b95b9c68] xfs_vn_update_time at ffffffffa01a3b76 [xfs]
#7 [ffff8800b95b9ca8] update_time at ffffffff811bc575
#8 [ffff8800b95b9cd8] file_update_time at ffffffff811bc7e0
#9 [ffff8800b95b9d10] xfs_file_aio_write_checks at ffffffffa019b5ab [xfs]
#10 [ffff8800b95b9d58] xfs_file_buffered_aio_write at ffffffffa019b63e [xfs]
#11 [ffff8800b95b9de0] xfs_file_aio_write at ffffffffa019b821 [xfs]
#12 [ffff8800b95b9e30] do_sync_write at ffffffff811a1ec0
#13 [ffff8800b95b9f00] vfs_write at ffffffff811a25fd
#14 [ffff8800b95b9f40] sys_write at ffffffff811a2fc9
#15 [ffff8800b95b9f80] system_call_fastpath at ffffffff815d2c99
RIP: 00007faeb9e92c30 RSP: 00007fff5409b5c8 RFLAGS: 00010202
RAX: 0000000000000001 RBX: ffffffff815d2c99 RCX: 0000000000000033
RDX: 000000000000026b RSI: 00007faebb1a3000 RDI: 0000000000000004
RBP: 00007faebb1a3000 R8: 00007faebb189840 R9: 000000000000001f
R10: 00007faebb189840 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000000026b R14: 00007faebd11c9f0 R15: 000000000000026b
ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b

Wow, nice, it really looks like the same issue. Maybe Xu Han could run two VMs to reproduce the problem: one with the same kernel as in the description and the other with kernel-3.10.0-113.el7, and see which one reproduces the problem? Having to wait more than a month to get the results looks cumbersome though. But what are our options? Run the same test case as bug 1052789? Does anyone have better ideas to confirm it's the same issue by testing?

The generic/270 xfstest is executing right now; delays caused by the iscsi backing could have played a role in our original bug, but the test requires two dedicated partitions, so a raw file is used instead, which makes it almost useless for us. It took me a few hours to debug their compilation process, though, so I wanted to try at least. I wasn't very keen on reading xfs code after that; still, crash analysis is one of our better chances, as this bug might not be hit again. Does QE have enough capacity to run two guests?

It looks similar to 1091136 as well, but we have an empty xlog->cil->xc_cil.
crash> bt 12568
PID: 12568 TASK: ffff880086995780 CPU: 0 COMMAND: "kworker/0:0"
#0 [ffff88002313fc70] __schedule at ffffffff815c7e9d
#1 [ffff88002313fcd8] schedule at ffffffff815c83d9
#2 [ffff88002313fce8] xlog_grant_head_wait at ffffffffa01edfed [xfs]
#3 [ffff88002313fd38] xlog_grant_head_check at ffffffffa01ee155 [xfs]
#4 [ffff88002313fd78] xfs_log_reserve at ffffffffa01f188f [xfs]
#5 [ffff88002313fda8] xfs_trans_reserve at ffffffffa01ad524 [xfs]
#6 [ffff88002313fdf0] xfs_fs_log_dummy at ffffffffa019d775 [xfs]
#7 [ffff88002313fe10] xfs_log_worker at ffffffffa01f1168 [xfs]
#8 [ffff88002313fe28] process_one_work at ffffffff8107860b
#9 [ffff88002313fe70] worker_thread at ffffffff810793bb
#10 [ffff88002313fed0] kthread at ffffffff8107fc10
#11 [ffff88002313ff50] ret_from_fork at ffffffff815d2bec
crash> p &((struct xlog *)0xffff8800b89a5800)->l_cilp->xc_cil
$1 = (struct list_head *) 0xffff8800b8ba8e08
crash> p ((struct xlog *)0xffff8800b89a5800)->l_cilp->xc_cil
$2 = {
next = 0xffff8800b8ba8e08,
prev = 0xffff8800b8ba8e08
}
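For readers of that output: the kernel's struct list_head marks an empty list by having next and prev both point back at the head itself, which is exactly what $1/$2 show for xc_cil (next == prev == 0xffff8800b8ba8e08, the address of the head). A tiny Python model of that convention (class and method names mine):

```python
class ListHead:
    """Minimal model of the kernel's circular doubly-linked list_head."""
    def __init__(self):
        # An initialized, empty list points at itself in both directions.
        self.next = self
        self.prev = self

    def empty(self):
        # Same test as the kernel's list_empty(): head->next == head.
        return self.next is self

    def add(self, node):
        # list_add(): insert node right after the head.
        node.next = self.next
        node.prev = self
        self.next.prev = node
        self.next = node
```

So the $2 output above, with next == prev equal to the head's own address, means the CIL held no pending log items in this dump.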
XFS was slowly leaking log space until it couldn't reserve more, and a lot of processes got stuck at that point. (See http://oss.sgi.com/pipermail/xfs/2014-May/036276.html for the fix and details.) The patch is being backported.

*** This bug has been marked as a duplicate of bug 1091136 ***
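The pile-up of uninterruptible ('UN') tasks seen in the dump can also be surveyed on a live Linux system straight from /proc, without crash; a rough sketch (the helper name is mine, and /proc/<pid>/wchan availability depends on kernel config):

```python
import os

def uninterruptible_tasks(proc="/proc"):
    """Return (pid, comm, wchan) for tasks in D (uninterruptible sleep),
    roughly what `crash> ps` shows as 'UN' on a dump. wchan names the
    kernel function the task is blocked in, e.g. xlog_grant_head_wait
    for the processes in this report."""
    stuck = []
    for pid in filter(str.isdigit, os.listdir(proc)):
        try:
            with open(os.path.join(proc, pid, "stat")) as f:
                # Skip past "pid (comm)" -- comm may contain spaces.
                state = f.read().rsplit(")", 1)[1].split()[0]
            if state == "D":
                with open(os.path.join(proc, pid, "comm")) as f:
                    comm = f.read().strip()
                with open(os.path.join(proc, pid, "wchan")) as f:
                    wchan = f.read().strip()
                stuck.append((int(pid), comm, wchan))
        except OSError:
            continue  # task exited (or wchan unreadable) while scanning
    return stuck
```

On a healthy system this list is usually empty or transient; 113 entries all blocked in xfs functions for hours, as here, points at a stuck log reservation rather than slow I/O.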
Description of problem:
One guest running mysql bench & test for 35 days then works abnormally.
1. Cannot access the guest network.
2. Cannot log in to the guest through tty (got the messages below).

[3093270.843198] systemd-logind[622]: Failed to issue method call: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

Executing 'query-cpus' always gets:

{"return": [{"current": true, "CPU": 0, "pc": -2130427674, "halted": true, "thread_id": 2986}, {"current": false, "CPU": 1, "pc": -2130427674, "halted": true, "thread_id": 2987}, {"current": false, "CPU": 2, "pc": -2130427674, "halted": true, "thread_id": 2988}, {"current": false, "CPU": 3, "pc": -2130427674, "halted": true, "thread_id": 2989}]}

(PC did not change.)

Some guest info:

KERNEL: /usr/lib/debug/lib/modules/3.10.0-97.el7.x86_64/vmlinux
DUMPFILE: guest-mem.dump
CPUS: 4
DATE: Thu Apr 10 13:18:44 2014
UPTIME: 35 days, 19:10:13
LOAD AVERAGE: 113.00, 112.99, 112.95
TASKS: 298
NODENAME: mysql.el7.x86_64.longevity
RELEASE: 3.10.0-97.el7.x86_64
VERSION: #1 SMP Tue Feb 25 15:22:08 EST 2014
MACHINE: x86_64 (2826 Mhz)
MEMORY: 3 GB

Something to clarify: after finding this issue, I did 'sync' & 'drop_caches' on the host; I also did the following twice:

{ "execute": "dump-guest-memory", "arguments": { "protocol": "file:guest-mem.dump", "paging": true } }

Not sure if the above operations would affect debugging.

Version-Release number of selected component (if applicable):
guest kernel: 3.10.0-97.el7.x86_64
qemu-kvm-rhev-1.5.3-49.el7.x86_64
host kernel: 3.10.0-95.el7.x86_64

How reproducible:
1/1

Steps to Reproduce:
1.
2.
3.

Actual results:
Guest os worked abnormally.

Expected results:
Guest os works properly.
Additional info:

QEMU CLI:
/usr/libexec/qemu-kvm -M pc-i440fx-rhel7.0.0 -cpu Penryn -m 3G -smp 4,threads=1,cores=4,sockets=1 -enable-kvm -name dell-op760-02-RHEL-7.0-x86_64-mysql -uuid 1ab4f15a-054a-494e-a113-6fcd0af703d9 -nodefconfig -nodefaults -no-shutdown -k en-us -rtc base=utc,clock=host,driftfix=slew -qmp tcp:0:5000,server,nowait -boot order=c,menu=on -iscsi initiator-name=iqn.1994-05.com.redhat:longevity-test -vga qxl -global qxl-vga.vram_size=67108864 -spice port=6000,password=123@qwe -drive file=/dev/sdb6,if=none,id=drive-virtio0,cache=none,aio=native,rerror=stop,werror=stop -device virtio-blk-pci,drive=drive-virtio0,id=os-disk,scsi=off,bootindex=1 -netdev tap,id=tap0,vhost=off,script=/etc/ovs-ifup,downscript=/etc/ovs-ifdown -device e1000,netdev=tap0,mac=54:e1:d2:c3:b4:a3,id=net0 -chardev socket,id=charserial0,path=/var/local/qemu/dell-op760-02-RHEL-7.0-x86_64-mysql/console,server,nowait -device isa-serial,chardev=charserial0,id=serial0 -device virtio-serial-pci,id=virtio-serial0,max_ports=16 -chardev socket,id=channel0,path=/var/local/qemu/dell-op760-02-RHEL-7.0-x86_64-mysql/virtserial,server,nowait -device virtserialport,chardev=channel0,name=org.linux-kvm.port.0,bus=virtio-serial0.0,id=port1 -chardev socket,id=qemu-ga0,path=/var/local/qemu/dell-op760-02-RHEL-7.0-x86_64-mysql/qemu-ga,server,nowait -device virtserialport,chardev=qemu-ga0,name=org.qemu.guest_agent.0,bus=virtio-serial0.0,id=port2 -device virtio-balloon-pci,id=balloon0
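The "all vCPUs halted with a frozen PC" symptom from the description can be checked without eyeballing raw JSON by polling QMP and diffing two query-cpus snapshots. A hedged sketch (function names are mine; the host/port assume this report's `-qmp tcp:0:5000,server,nowait`):

```python
import json
import socket

def query_cpus(host="127.0.0.1", port=5000):
    """Issue query-cpus over a QMP TCP socket and return the 'return' list.
    QMP requires reading the greeting and negotiating capabilities first."""
    with socket.create_connection((host, port)) as s:
        f = s.makefile("rw")
        json.loads(f.readline())                                  # greeting
        f.write(json.dumps({"execute": "qmp_capabilities"}) + "\n")
        f.flush()
        json.loads(f.readline())                                  # empty ack
        f.write(json.dumps({"execute": "query-cpus"}) + "\n")
        f.flush()
        return json.loads(f.readline())["return"]

def pcs_stalled(before, after):
    """True if every vCPU is halted and its program counter has not moved
    between two query-cpus snapshots -- the symptom reported here."""
    return all(b["halted"] and a["halted"] and b["pc"] == a["pc"]
               for b, a in zip(before, after))
```

Called twice a few seconds apart, `pcs_stalled(query_cpus(), query_cpus())` returning True for a guest that should be busy is a strong hint the guest is wedged rather than merely off the network.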