Bug 637619

Summary: starting tcpdump causes machine to freeze for a while
Product: [Fedora] Fedora Reporter: Maciej Żenczykowski <zenczykowski>
Component: kernelAssignee: Neil Horman <nhorman>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium Docs Contact:
Priority: low    
Version: 13CC: dougsland, gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda, nhorman
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Fixed In Version: kernel- Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-12-22 19:52:49 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Description Flags
fragment of /var/log/messages (with: dmesg -n 8)
patch to allow af_packet to use multiple single pages for ring buffer none

Description Maciej Żenczykowski 2010-09-26 21:59:02 UTC

# rpm -q kernel tcpdump


# tcpdump -i wlan0 (ath9k)
# tcpdump -i eth0 (sky2)

occasionally causes the machine to hard freeze (no mouse cursor movement in X, no caps lock response).

I have no idea what the problem is, but I've had this happen multiple times over the course of the past months, and it's always recovered, but it can take 30-40 seconds of waiting for a totally unresponsive machine (no mouse movement updates, no screen updates, caps lock doesn't toggle).  Once the machines becomes responsive everything works, including tcpdump, and the queue of input events that queued up prompty gets emptied.  Furthermore I believe stopping and restarting tcpdump will no longer exhibit this behaviour (unsure of this though).

It's like initializing the packet dumping core requires some huge memory allocation or something and that freezes the kernel or something...

Comment 1 Neil Horman 2010-09-27 12:39:11 UTC
Does a keyboard based sysrq-t generate a task dump in the log?  If not, it sounds like theres a spinlock getting held for way too long during some AF_PACKET socket operation. I'd suggest you start with an strace to see what system call path your starting in when the hang occurs, and then use system tap to instrument how far through that path you get before you hang.  That should make such an offensive spin lock jump right out at you.

Comment 2 Maciej Żenczykowski 2010-10-01 03:44:22 UTC
# strace tcpdump -i eth0 -n

bind(3, {sa_family=AF_PACKET, proto=0x03, if2, pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0
getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
setsockopt(3, SOL_PACKET, PACKET_ADD_MEMBERSHIP, "\2\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0", 16) = 0
setsockopt(3, SOL_PACKET, PACKET_AUXDATA, [1], 4) = 0
getsockopt(3, SOL_PACKET, PACKET_HDRLEN, [28], [4]) = 0
setsockopt(3, SOL_PACKET, PACKET_VERSION, [1], 4) = 0
setsockopt(3, SOL_PACKET, PACKET_RESERVE, [4], 4) = 0
setsockopt(3, SOL_PACKET, PACKET_RX_RING, {block_size=131072, block_nr=31, frame_size=65600, frame_nr=31}, 16) = -1 ENOMEM (Cannot allocate memory)
setsockopt(3, SOL_PACKET, PACKET_RX_RING, {block_size=131072, block_nr=30, frame_size=65600, frame_nr=30}, 16) = 0
mmap(NULL, 3932160, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0) = 0x7f3070dca000
getgid()                                = 0
setgid(0)                               = 0
getuid()                                = 0
setuid(0)                               = 0

The PACKET_RX_RING seems to be where it pauses (not sure if the first one or the second one though, I think the first one).  So it does indeed look like that probably requires a 128kB linear block of memory (order=5 ?) - guessing.

Comment 3 Maciej Żenczykowski 2010-10-01 03:52:01 UTC
Considering this machine has 4GB ram, but it is mostly used

Mem:   4040432k total,  3710612k used,   329820k free,   174184k buffers
Swap:  8388604k total,   560300k used,  7828304k free,   204596k cached

this is possibly triggering some sort of memory defragmentation?

Curious side note:
- I was running this in an xterm under X on a laptop (MacBookPro4,1) with builtin USB keyboard, and attached USB mouse and keyboard (for the sysrq key)
- afterwards both keyboards seemed to almost immediately die (I managed to control+C tcpdump though
- however the mouse worked fine, unplugging and replugging the mouse and external keyboard, resulted once more in a working mouse, but not a working keyboard
- switching (Ctrl+Alt+F2) out of X and back fixed the keyboard...

I'll attach the 10K line /var/log/messages log from Alt+SysRQ + T from while it was frozen.

Comment 4 Maciej Żenczykowski 2010-10-01 03:54:00 UTC
Created attachment 450939 [details]
fragment of /var/log/messages (with: dmesg -n 8)

I don't really know how to interpret this...

Comment 5 Maciej Żenczykowski 2010-10-01 04:23:49 UTC
Hmm, the keyboard going unresponsive might have been a red herring, maybe it just missed the release of the alt key from the alt+sysrq+T combo and X still though the alt key was held - that would certainly explain the observed behaviour.


Also, based on:


struct tpacket_req {
    unsigned int    tp_block_size;  /* Minimal size of contiguous block */
    unsigned int    tp_block_nr;    /* Number of blocks */
    unsigned int    tp_frame_size;  /* Size of frame */
    unsigned int    tp_frame_nr;    /* Total number of frames */

// This structure is defined in /usr/include/linux/if_packet.h and establishes a
// circular buffer (ring) of unswappable memory.

it looks like

setsockopt(3, SOL_PACKET, PACKET_RX_RING, {block_size=131072, block_nr=31,
frame_size=65600, frame_nr=31}, 16) = -1 ENOMEM (Cannot allocate memory)

is actually attempting 31 allocations of 128kB each (in order to be able to capture 31 frames of 65600 bytes each (ie. slightly more than 64kB per frame).

The comment at


states this uses __get_free_pages()

The code that actually implements this is at

the part where we block is most likely:

  order = get_order(req->tp_block_size);
  pg_vec = alloc_pg_vec(req, order);

which translates to

  pg_vec = alloc_pg_vec( "req->tp_block_nr" == 31 , 5);

which calls alloc_one_pg_vec_page(5) 31 times in a row.

Which results in 31 calls in a row to:

  __get_free_pages(GFP_KERNEL | __GFP_COMP | __GFP_ZERO | __GFP_NOWARN, 5);

This translates to:

__GFP_WAIT      /* Can wait and reschedule? */
__GFP_IO        /* Can start physical IO? */
__GFP_FS        /* Can call down to low-level FS? */
__GFP_NOWARN    /* Suppress page allocation failure warning */
__GFP_COMP      /* Add compound page metadata */
__GFP_ZERO      /* Return zeroed page on success */

Since GFP_KERNEL (ie __GFP_{WAIT|IO|FS} ) has been passed, it is not immediately clear why this is blocking the machine.

Comment 6 Maciej Żenczykowski 2010-10-01 04:32:14 UTC
The only things I can think of at this point are some sort of bad interactions with disk page out and memory defrag under this level of memory pressure.

Perhaps the system doesn't actually block, but all userspace processes very quickly block on disk access?

But would that explain why even cpu usage graphs stop updating?

I'm pretty much at the limits of my knowledge here.


I will just add, that this temp lockup that I caught was much much shorter than the last one I complained about, that one was somewhere on the order of just shy of a minute, this one was closer to 5-10 seconds.

I was also unable to replicate this a day or two ago when there was still quite some free ram in the system and it was not long after a reboot, a point at which chrome [the only running app, besides a couple xterm/ssh/mc and X with some gnome applets] with a bazillion tabs hadn't yet used all the ram...]

Comment 7 Neil Horman 2010-10-01 13:32:02 UTC
You're right on all accounts.  We're trying to allocate 31 blocks of 128k apiece, and on a memory constrained system that kicks in kswapd, which can quickly lock out the rest of the system as it scans the lru lists for pages to evict.  Thats what the task dump in your message log is showing.  That everyone is blocked on I/O while kswapd tries to free enough memory to satisfy all the order 5 requests that the packet socket is making.  Unfortunately I don't know what we can do about this.  We have an established interface that allows user space to dictate how much memory the socket should try to allocate, and its got to be linear because the interface uses that to memory map packets to user space.  Of the things I could think we could do:

1) teach af_packet in the kernel to use non-linear memory segments to store packets.  This would help solve the high order allocation problem, but may result in performance degradation, as it would require several more kernel traps per large frame read from tcpdump

2) add some predictive watermarking to the code so that we just return ENOMEM without actually trying to allocate all the ram, thus avoiding the big spins on memory allocation.

I'm inclined to think that option 1 is the right way to go, simply because it feels less like a hack and is more likely to work consistently.  Lemme see if I can come up with a patch for you.

Comment 8 Maciej Żenczykowski 2010-10-01 18:07:04 UTC
ad 1) it's not clear to me why the frame buffers have to be linear (after all userspace handles non physically linear memory all the time)... If we were dma-ing directly into them, I would understand... but I can't see that being the way this works, we are most likely memcopying sk_buff data into these buffers, and doing so in kernel, as such this memory should easily be virtual... maybe there's a lack of an interfaces that would allow you to map vmalloc'ed memory into userspace (perhaps because there's no way to mark it as non movable or something to prevent the double mapping from being broken?).

It is pretty clear the problem here is that we need both a linear virtual kernel visible mapping, and a linear virtual userspace app visible mapping.  Perhaps that was easiest to achieve with a linear physically mapping and maybe someone just went for an easy way out?

ad 2) yeah, this would be a bit of a hack

3) perhaps it would make sense to just fix get_user_pages(order > 0) to perform 'memory defrag'.  ie. pick a linear area of memory of the proper size, that only has free and virtually mapped pages, and move the virtually mapped pages out of the way (one page at a time into empty order=0 pages), and use that...

After all... 31 * 128kB is under 4MB, a cpu can copy at max 4MB (1024 * 4kB) of data out of the way in a fraction of the time that any disk request would take.
In general dropping caches, flushing buffers, etc, to realize this order > 0 request just doesn't seem a good idea.


I think (3) would be a still more generic solution, and (1) is better than (2).

Of course implementing both (2) and (3) would probably be 'the right way to fix this' (tm).

Comment 9 Neil Horman 2010-10-01 18:20:23 UTC
Strictly speaking, we don't need it to be contiguous, thats just an artifact of how AF_PACKET implements its memory mapped interface.  

I'm currently writing a patch to modify AF_PACKET to use a iovec type structure rather than a contiguous buffer.  Instead of allocating 1 order 5 area, I'm going to allocate 5 order 1 areas and maintain separate pointers to them in an array.  Then I'll use vmap to create a contiguous virtual mapping to those pages so that we can still treat it like one region.

Comment 10 Maciej Żenczykowski 2010-10-01 18:38:42 UTC

(1 order 5 == 32 order 0)

Comment 11 Neil Horman 2010-10-02 18:43:38 UTC
Created attachment 451213 [details]
patch to allow af_packet to use multiple single pages for ring buffer

It still needs some cleanup, and its diffed against the net-next tree, but should apply to fedora.  It teaches af_packet to use multiple order-0 page allocations rather than one 32 page, order 5 allocation, so as to avoid triggering lots of blocking during defrag.  Functional for me.  Can you please test and see if it prevents the stalls you've been seeing? Thanks!

Comment 12 Neil Horman 2010-10-04 15:46:53 UTC
Maciej, any results on the patch?

Comment 13 Neil Horman 2010-10-04 18:34:21 UTC
awesome, I'll send this upstream then, and backport it once its accepted.  Thanks!

Comment 14 Neil Horman 2010-10-04 18:36:52 UTC
sorry, wrong bz, please ignore

Comment 15 Maciej Żenczykowski 2010-10-05 00:24:46 UTC
I'll get back to you on the results of testing when I have the spare time to reboot my machine (unfortunately being used for actual work at the moment).

Comment 16 Neil Horman 2010-10-05 12:52:24 UTC
Ok, please let me know when you reboot your system.  Thanks.

Comment 17 Neil Horman 2010-10-13 15:55:28 UTC
ping, any feedback here, I'd like to get this post this upstream if it fixes your problem

Comment 18 Maciej Żenczykowski 2010-10-19 02:26:02 UTC
I'm afraid I'm bogged down with some work stuff, and haven't been able to test this.  However, I should have time next weekend.

Comment 19 Neil Horman 2010-10-19 10:42:56 UTC

Comment 20 Neil Horman 2010-10-25 16:48:07 UTC
I managed to free up a test system that I could tinker with this on, and it prevented swapping on a loaded system for me quite well, so I'm going to push this upstream shortly.

Comment 21 Neil Horman 2010-10-25 19:18:54 UTC

sent upstream

Comment 22 Neil Horman 2010-12-17 15:44:18 UTC
Ok, this and a few cleanup patches wen't into net-next.  I'll backport them to F13 shortly.

Comment 23 Fedora Update System 2010-12-19 23:56:33 UTC
kernel- has been submitted as an update for Fedora 14.

Comment 24 Fedora Update System 2010-12-21 13:55:08 UTC
kernel- has been submitted as an update for Fedora 14.

Comment 25 Fedora Update System 2010-12-22 00:03:31 UTC
kernel- has been pushed to the Fedora 14 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update kernel'.  You can provide feedback for this update here: https://admin.fedoraproject.org/updates/kernel-

Comment 26 Fedora Update System 2010-12-22 19:51:43 UTC
kernel- has been pushed to the Fedora 14 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 27 Fedora Update System 2011-02-05 20:13:55 UTC
kernel- has been submitted as an update for Fedora 13.

Comment 28 Maciej Żenczykowski 2011-02-05 21:18:16 UTC
I'm still semi-regularly seeing disk-io-based 30+ second freezes (even seen 5+ minutes), however at least they're not being caused by running tcpdump, as such I think this particular bug is indeed fixed..

Comment 29 Maciej Żenczykowski 2011-02-08 20:17:09 UTC
Just saw a ~25 second freeze when running tcpdump on
Not really sure what this means... (has the patch actually been backported?)

Comment 30 Fedora Update System 2011-02-24 15:50:04 UTC
kernel- has been submitted as an update for Fedora 13.

Comment 31 Fedora Update System 2011-03-07 21:05:57 UTC
kernel- has been pushed to the Fedora 13 stable repository.  If problems still persist, please make note of it in this bug report.