With:

# rpm -q kernel tcpdump
kernel-2.6.34.7-56.fc13.x86_64
tcpdump-4.0.0-3.20090921gitdf3cb4.fc12.x86_64

running:

# tcpdump -i wlan0   (ath9k)
or
# tcpdump -i eth0    (sky2)

occasionally causes the machine to hard freeze: no mouse cursor movement in X, no screen updates, no caps lock response. I have no idea what the problem is, but it has happened multiple times over the past months. The machine has always recovered, but it can take 30-40 seconds of waiting while it is totally unresponsive. Once the machine becomes responsive everything works, including tcpdump, and the input events that queued up are promptly replayed. Furthermore, I believe stopping and restarting tcpdump no longer exhibits this behaviour (though I'm not sure of that). It's as if initializing the packet capture core requires some huge memory allocation, and that freezes the kernel.
Does a keyboard-based sysrq-t generate a task dump in the log? If not, it sounds like there's a spinlock being held for way too long during some AF_PACKET socket operation. I'd suggest you start with an strace to see which system call path you're starting in when the hang occurs, and then use SystemTap to instrument how far through that path you get before the hang. That should make such an offending spinlock jump right out at you.
# strace tcpdump -i eth0 -n
...
bind(3, {sa_family=AF_PACKET, proto=0x03, if2, pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0
getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
setsockopt(3, SOL_PACKET, PACKET_ADD_MEMBERSHIP, "\2\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0", 16) = 0
setsockopt(3, SOL_PACKET, PACKET_AUXDATA, [1], 4) = 0
getsockopt(3, SOL_PACKET, PACKET_HDRLEN, [28], [4]) = 0
setsockopt(3, SOL_PACKET, PACKET_VERSION, [1], 4) = 0
setsockopt(3, SOL_PACKET, PACKET_RESERVE, [4], 4) = 0
setsockopt(3, SOL_PACKET, PACKET_RX_RING, {block_size=131072, block_nr=31, frame_size=65600, frame_nr=31}, 16) = -1 ENOMEM (Cannot allocate memory)
setsockopt(3, SOL_PACKET, PACKET_RX_RING, {block_size=131072, block_nr=30, frame_size=65600, frame_nr=30}, 16) = 0
mmap(NULL, 3932160, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0) = 0x7f3070dca000
getgid()                                = 0
setgid(0)                               = 0
getuid()                                = 0
setuid(0)                               = 0
socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 4
...

The PACKET_RX_RING setsockopt seems to be where it pauses (not sure whether it's the first call or the second, though I think the first). So it does indeed look like this probably requires 128 kB linear blocks of memory (order=5 ?) - guessing.
Considering this machine has 4 GB of RAM, but it is mostly used:

Mem:   4040432k total,  3710612k used,   329820k free,   174184k buffers
Swap:  8388604k total,   560300k used,  7828304k free,   204596k cached

this is possibly triggering some sort of memory defragmentation?

Curious side note:
- I was running this in an xterm under X on a laptop (MacBookPro4,1) with a builtin USB keyboard, plus an attached USB mouse and keyboard (for the sysrq key)
- afterwards both keyboards seemed to die almost immediately (I managed to Ctrl+C tcpdump though); the mouse kept working
- unplugging and replugging the mouse and external keyboard resulted once more in a working mouse, but not a working keyboard
- switching out of X (Ctrl+Alt+F2) and back fixed the keyboard

I'll attach the 10K-line /var/log/messages log from Alt+SysRq+T taken while the machine was frozen.
Created attachment 450939 [details]
fragment of /var/log/messages (with: dmesg -n 8)

I don't really know how to interpret this...
Hmm, the keyboard going unresponsive might have been a red herring; maybe X just missed the release of the Alt key from the Alt+SysRq+T combo and still thought the Alt key was held - that would certainly explain the observed behaviour.

---

Also, based on:
http://lxr.linux.no/linux+v2.6.34.7/Documentation/networking/packet_mmap.txt#L174

    struct tpacket_req {
        unsigned int tp_block_size; /* Minimal size of contiguous block */
        unsigned int tp_block_nr;   /* Number of blocks */
        unsigned int tp_frame_size; /* Size of frame */
        unsigned int tp_frame_nr;   /* Total number of frames */
    };
    // This structure is defined in /usr/include/linux/if_packet.h and establishes a
    // circular buffer (ring) of unswappable memory.

it looks like

    setsockopt(3, SOL_PACKET, PACKET_RX_RING, {block_size=131072, block_nr=31, frame_size=65600, frame_nr=31}, 16) = -1 ENOMEM (Cannot allocate memory)

is actually attempting 31 allocations of 128 kB each, in order to be able to capture 31 frames of 65600 bytes each (i.e. slightly more than 64 kB per frame).

The comment at http://lxr.linux.no/linux+v2.6.34.7/Documentation/networking/packet_mmap.txt#L240 states this uses __get_free_pages().

The code that actually implements this is at http://lxr.linux.no/linux+v2.6.34.7/net/packet/af_packet.c#L2276 and the part where we block is most likely:

    order = get_order(req->tp_block_size);
    pg_vec = alloc_pg_vec(req, order);

which translates to pg_vec = alloc_pg_vec("req->tp_block_nr" == 31, 5), which calls alloc_one_pg_vec_page(5) 31 times in a row. That results in 31 calls in a row to:

    __get_free_pages(GFP_KERNEL | __GFP_COMP | __GFP_ZERO | __GFP_NOWARN, 5);

The flags translate to:

    GFP_KERNEL == __GFP_WAIT | __GFP_IO | __GFP_FS

    __GFP_WAIT   /* Can wait and reschedule? */
    __GFP_IO     /* Can start physical IO? */
    __GFP_FS     /* Can call down to low-level FS? */
    __GFP_NOWARN /* Suppress page allocation failure warning */
    __GFP_COMP   /* Add compound page metadata */
    __GFP_ZERO   /* Return zeroed page on success */

Since GFP_KERNEL (i.e. __GFP_{WAIT|IO|FS}) has been passed, it is not immediately clear why this is blocking the machine.
The only things I can think of at this point are some sort of bad interaction between disk page-out and memory defragmentation under this level of memory pressure. Perhaps the system doesn't actually block, but all userspace processes very quickly block on disk access? But would that explain why even the CPU usage graphs stop updating? I'm pretty much at the limits of my knowledge here.

---

I will just add that this temporary lockup I caught was much shorter than the last one I complained about: that one was just shy of a minute, this one was closer to 5-10 seconds. I was also unable to replicate this a day or two ago, not long after a reboot, when there was still quite some free RAM in the system - a point at which chrome (the only running app, besides a couple of xterm/ssh/mc sessions and X with some gnome applets) with its bazillion tabs hadn't yet used all the RAM.
You're right on all counts. We're trying to allocate 31 blocks of 128k apiece, and on a memory-constrained system that kicks in kswapd, which can quickly lock out the rest of the system as it scans the LRU lists for pages to evict. That's what the task dump in your message log is showing: everyone is blocked on I/O while kswapd tries to free enough memory to satisfy all the order-5 requests that the packet socket is making.

Unfortunately I don't know what we can do about this. We have an established interface that allows user space to dictate how much memory the socket should try to allocate, and it's got to be linear because the interface uses that to memory-map packets to user space. Of the things I think we could do:

1) teach af_packet in the kernel to use non-linear memory segments to store packets. This would help solve the high-order allocation problem, but may result in performance degradation, as it would require several more kernel traps per large frame read from tcpdump

2) add some predictive watermarking to the code so that we just return ENOMEM without actually trying to allocate all the RAM, thus avoiding the big spins on memory allocation.

I'm inclined to think that option 1 is the right way to go, simply because it feels less like a hack and is more likely to work consistently. Lemme see if I can come up with a patch for you.
ad 1) it's not clear to me why the frame buffers have to be physically linear (after all, userspace handles non-physically-linear memory all the time)... If we were DMA-ing directly into them I would understand, but I can't see that being the way this works; we are most likely memcpy-ing sk_buff data into these buffers, and doing so in the kernel, so this memory could easily be virtual. Maybe there's a lack of an interface that would allow you to map vmalloc'ed memory into userspace (perhaps because there's no way to mark it as non-movable or something, to prevent the double mapping from being broken?). It is pretty clear the problem here is that we need both a linear virtual kernel-visible mapping and a linear virtual userspace-visible mapping. Perhaps that was easiest to achieve with a physically linear mapping, and maybe someone just went for the easy way out?

ad 2) yeah, this would be a bit of a hack

3) perhaps it would make sense to just fix __get_free_pages(order > 0) to perform 'memory defrag': i.e. pick a linear area of memory of the proper size that only has free and virtually mapped pages, move the virtually mapped pages out of the way (one page at a time, into empty order-0 pages), and use that. After all, 31 * 128 kB is under 4 MB, and a CPU can copy 4 MB (1024 * 4 kB pages) of data out of the way in a fraction of the time that any disk request would take. In general, dropping caches, flushing buffers, etc. to satisfy an order > 0 request just doesn't seem like a good idea.

---

I think (3) would be a still more generic solution, and (1) is better than (2). Of course, implementing both (2) and (3) would probably be 'the right way to fix this' (tm).
Strictly speaking, we don't need it to be physically contiguous; that's just an artifact of how AF_PACKET implements its memory-mapped interface. I'm currently writing a patch to modify AF_PACKET to use an iovec-type structure rather than a contiguous buffer. Instead of allocating 1 order-5 area, I'm going to allocate 5 order-1 areas and maintain separate pointers to them in an array. Then I'll use vmap to create a contiguous virtual mapping over those pages so that we can still treat it like one region.
Awesome. (1 order 5 == 32 order 0)
Created attachment 451213 [details]
patch to allow af_packet to use multiple single pages for ring buffer

It still needs some cleanup, and it's diffed against the net-next tree, but it should apply to Fedora. It teaches af_packet to use multiple order-0 page allocations rather than one 32-page order-5 allocation, so as to avoid triggering lots of blocking during defrag. It's functional for me. Can you please test it and see if it prevents the stalls you've been seeing? Thanks!
Maciej, any results on the patch?
awesome, I'll send this upstream then, and backport it once it's accepted. Thanks!
sorry, wrong bz, please ignore
I'll get back to you on the results of testing when I have the spare time to reboot my machine (unfortunately being used for actual work at the moment).
Ok, please let me know when you reboot your system. Thanks.
ping, any feedback here? I'd like to post this upstream if it fixes your problem
I'm afraid I'm bogged down with some work stuff, and haven't been able to test this. However, I should have time next weekend.
ok
I managed to free up a test system that I could tinker with this on, and it prevented swapping on a loaded system for me quite well, so I'm going to push this upstream shortly.
http://marc.info/?l=linux-netdev&m=128803359421924&w=3 sent upstream
Ok, this and a few cleanup patches went into net-next. I'll backport them to F13 shortly.
kernel-2.6.35.10-69.fc14 has been submitted as an update for Fedora 14. https://admin.fedoraproject.org/updates/kernel-2.6.35.10-69.fc14
kernel-2.6.35.10-72.fc14 has been submitted as an update for Fedora 14. https://admin.fedoraproject.org/updates/kernel-2.6.35.10-72.fc14
kernel-2.6.35.10-72.fc14 has been pushed to the Fedora 14 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with su -c 'yum --enablerepo=updates-testing update kernel'. You can provide feedback for this update here: https://admin.fedoraproject.org/updates/kernel-2.6.35.10-72.fc14
kernel-2.6.35.10-72.fc14 has been pushed to the Fedora 14 stable repository. If problems still persist, please make note of it in this bug report.
kernel-2.6.34.8-67.fc13 has been submitted as an update for Fedora 13. https://admin.fedoraproject.org/updates/kernel-2.6.34.8-67.fc13
I'm still semi-regularly seeing disk-I/O-based 30+ second freezes (I've even seen 5+ minutes), but at least they're not being caused by running tcpdump, so I think this particular bug is indeed fixed.
Just saw a ~25 second freeze when running tcpdump on 2.6.35.10-80.fc14.x86_64. Not really sure what this means... (has the patch actually been backported?)
kernel-2.6.34.8-68.fc13 has been submitted as an update for Fedora 13. https://admin.fedoraproject.org/updates/kernel-2.6.34.8-68.fc13
kernel-2.6.34.8-68.fc13 has been pushed to the Fedora 13 stable repository. If problems still persist, please make note of it in this bug report.