Description of problem:

We are suffering from a bridge lockup problem similar to what had been
reported several years ago at
http://xen.1045712.n5.nabble.com/intermittent-network-hangs-td2552446.html.
When the problem happens, the Xen domains are not pingable for about 70
seconds on recent RHEL releases.

Steps to Reproduce:

1. Create a bridge named mybridge as follows:

# ifcfg-mybridge
DEVICE=mybridge
BOOTPROTO=static
DHCPCLASS=
IPADDR=192.168.1.11
NETMASK=255.255.255.0
ONBOOT=yes
TYPE=Bridge
DELAY=0

# ifcfg-eth0
DEVICE=eth0
ONBOOT=yes
BRIDGE=mybridge

2. Pass the bridge to the two domains by adding the following line to each of
their Xen configuration files under /etc/xen (with a different MAC address
for each domain):

vif = [ "mac=xxxxx, bridge=mybridge" ]

3. From a client machine, ssh to domain1:

# ssh domain1-ip

4. From domain1, ssh to dom0:

# ssh dom0-ip

5. From dom0, ssh to domain2:

# ssh domain2-ip

6. From domain2, cat a big file (larger than 5 MB):

# cat <file-name>

Actual results:

Normally, after repeating step 6 several times, the cat process hangs
somewhere and none of the domains except dom0 are pingable. With some older
Xen+RHEL versions the bridge locks up forever; with some recent releases the
domains become pingable again after appearing locked up for about 70 seconds.

Regards
Shared rings

A shared ring consists of three structures: a shared page (containing ring
entries and some fields), a backend descriptor, and a frontend descriptor.
The frontend submits requests and consumes responses; the backend consumes
requests and replaces them in-place with responses.

Summary of interesting fields by owner:

  backend desc:         shared page:            frontend desc:
  - rsp_prod_pvt        - req_prod              - req_prod_pvt
  - req_cons            - rsp_prod              - rsp_cons
  - nr_ents             - req_event             - nr_ents
  - shared page ptr     - rsp_event             - shared page ptr
                        - array of req/resp
                          unions

The "nr_ents" fields always hold the number of entries in the shared array;
they are read-only after initialization.

Here's a diagram of the various indices into the array of req/resp unions:

  +--------------------------+  low offsets
  |                          |
  |  unused (free)           |
  |                          |
  +--------------------------+ <--- frontend.rsp_cons
  |                          |
  |  unprocessed responses   |
  |                          |
  +--------------------------+ <--- ring.rsp_prod
  |                          |
  |  response commit area    |
  |                          |
  +--------------------------+ <--- backend.rsp_prod_pvt
  |                          |
  |  requests in-flight      |
  |  (currently being        |    progress
  |  processed by the        |       |
  |  backend)                |       |
  |                          |       v
  +--------------------------+ <--- backend.req_cons
  |                          |
  |  requests not yet opened |
  |  by the backend          |
  |                          |
  +--------------------------+ <--- ring.req_prod
  |                          |
  |  request commit area     |
  |                          |
  +--------------------------+ <--- frontend.req_prod_pvt
  |                          |
  |  unused (free)           |
  |                          |
  +--------------------------+  high offsets

(The ring.req_event and ring.rsp_event indices are not yet shown.)

All the indices grow toward higher offsets (downwards on the diagram), and
wrap around. For any given type (req or rsp), the following inequality should
hold (in wrap-around arithmetic):

  type_cons <= type_prod <= type_prod_pvt

Plus, requests always precede responses.

The frontend increases req_prod_pvt and places new requests in the "request
commit area", then flushes the slots (wmb()), and then syncs the
backend-visible req_prod to req_prod_pvt. In other words, the commit area is
only ever used temporarily, as a means to publish new requests to the backend
atomically.
-- RING_PUSH_REQUESTS()

The backend opens requests at req_cons. The structure supports multiple
requests being processed by the backend in parallel, i.e. the placement of
completed responses can happen in a detached way even within the backend.
-- RING_HAS_UNCONSUMED_REQUESTS()

The backend places the completed responses at rsp_prod_pvt, and then commits
them atomically by flushing them and updating rsp_prod to rsp_prod_pvt.
-- RING_PUSH_RESPONSES()

The frontend consumes responses at rsp_cons.
-- RING_HAS_UNCONSUMED_RESPONSES()

The shared req_event and rsp_event fields implement an ingenious notification
pattern. In this pattern, the above "methods" are replaced like this:

  RING_PUSH_REQUESTS            --> RING_PUSH_REQUESTS_AND_CHECK_NOTIFY   (front)
  RING_HAS_UNCONSUMED_REQUESTS  --> RING_FINAL_CHECK_FOR_REQUESTS         (back)
  RING_PUSH_RESPONSES           --> RING_PUSH_RESPONSES_AND_CHECK_NOTIFY  (back)
  RING_HAS_UNCONSUMED_RESPONSES --> RING_FINAL_CHECK_FOR_RESPONSES        (front)

The FINAL_CHECK macros set a "work_to_do" boolean, and are meant to be called
from a loop. As long as the work_to_do flag is set, the loop must not be
exited. Once the work_to_do flag is unset on return, the consumer (which can
be either the frontend -- response consumer -- or the backend -- request
consumer) can go to sleep.
Namely, the corresponding PUSH_AND_CHECK macros report back to the producer
whether it has to wake the consumer, after having pushed several requests at
once. This is implemented as follows (see the sketch after this list):

- The FINAL_CHECK() macro for a given type sets the corresponding type_event
  index to type_cons + 1, basically saying "this is where I stop consuming
  and go to sleep; wake me if you produce anything past this offset".

- The PUSH_AND_CHECK() macro for a given type examines whether the "commit
  area" *crossed* the type_event index while the entries were being produced.
  If so, the producer wakes the consumer.
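To make this concrete, here is a simplified, self-contained C sketch of the
two halves for the request direction. It is modelled on the ring.h macros
rather than copied from them; the struct layouts, the barrier stand-ins and
the trivial "has unconsumed requests" test are simplifications for
illustration only.

#define mb()  __sync_synchronize()   /* stand-ins for the kernel barriers */
#define wmb() __sync_synchronize()

typedef unsigned int RING_IDX;       /* free-running index, wraps naturally */

struct sring {                       /* fields living in the shared page */
    RING_IDX req_prod, req_event;
    RING_IDX rsp_prod, rsp_event;
    /* array of req/resp unions follows in the real layout */
};

struct front_ring {                  /* frontend-private descriptor */
    RING_IDX req_prod_pvt, rsp_cons;
    struct sring *sring;
};

struct back_ring {                   /* backend-private descriptor */
    RING_IDX rsp_prod_pvt, req_cons;
    struct sring *sring;
};

/* Consumer side (the FINAL_CHECK step): before going to sleep, publish
 * "wake me past this offset" in req_event, then look once more so that a
 * request produced in the meantime is not missed. */
static int final_check_for_requests(struct back_ring *r)
{
    if (r->sring->req_prod != r->req_cons)
        return 1;                              /* work to do, stay in the loop */
    r->sring->req_event = r->req_cons + 1;     /* "notify me past this offset" */
    mb();                                      /* publish event before re-check */
    return r->sring->req_prod != r->req_cons;  /* did a request race in? */
}

/* Producer side (the PUSH_AND_CHECK step): commit the privately produced
 * requests, then ask for a notification only if the freshly committed range
 * (old, new] crossed req_event. */
static int push_requests_and_check_notify(struct front_ring *r)
{
    RING_IDX old = r->sring->req_prod;
    RING_IDX new = r->req_prod_pvt;

    wmb();                                     /* flush the request slots first */
    r->sring->req_prod = new;
    mb();                                      /* commit before reading req_event */
    return (RING_IDX)(new - r->sring->req_event) < (RING_IDX)(new - old);
}

The crossing test in the producer is true exactly when req_event falls in the
half-open interval (old, new] under wrap-around arithmetic, which is what
"crossed the type_event index" means above.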
> I'll have to look at netfront/netback in more detail to understand how the
> various requests are interleaved in the rings, and if abundance of any kind
> of request (eg. [1]) can starve another kind (eg. [3]).

It seems very unlikely. Requests/replies for different streams go on the same
ring, so it is indeed possible, though unlikely, that replies to the first
hop are starving the others and causing strange behavior of TCP congestion
control. You can try asking mst about this, it should be the same in
KVM/virtio-net.
(In reply to comment #20)
> Requests/replies for different streams go on the same ring, so it is indeed
> possible, though unlikely, that replies to the first hop are starving the
> others and causing strange behavior of TCP congestion control.

Repeating the diagram with some labels added:

                                         domU
+---------------------------------------------+
| [b]sshd ... ssh[c]      [f]sshd ... cat[g]  |
|     :         :            :                |
|  +----------------netfront---------------+  |
|  |  :         :            :             |  |
+--+---------------------------------------+--+
      :         :            :
      v [3]     ^ [2]        v [1]
      :         :            :
+--+---------------------------------------+--+
|  |  :         :            :             |  |
|  +----------------netback----------------+  |
|     :         :            :                |
|     :      [d]sshd ..... ssh[e]             |
+---------------------------------------------+
      :                                  dom0
      :
  ssh (external) [a]

Suppose ssh[e] is blocked on write() (towards [d]sshd). Also suppose the tx
ring is mostly full of [1] requests, and there are only a few tx requests of
kind [3]. Suppose the receive socket buffer of ssh[e] has already become full
(after draining some data from [1]).

When this happens, the netback driver could still get [1]-requests out of the
tx queue. Since the ssh[e] receive socket buffer is full, those TCP segments
would simply be dropped somewhere in the networking code, and the TCP layer
would be responsible for later retransmission. However, some free space would
have been created in the tx ring anyway, and the send socket buffer of
[b]sshd should get a chance to submit tx requests of kind [3]. What I'm
implying is that TCP congestion control should work at a higher level;
ssh[e] being unable to consume does not imply that the tx ring will block.

OTOH I think that you meant the following instead: the jerky behavior of data
transfer along the full chain turns the TCP timers upside down, and then at
least one TCP sender will refrain from sending for a minute, even though the
ring & socket buffer statuses could accommodate the transfer. But this would
only affect a single TCP stream, and according to the described symptoms, the
guest is not pingable at all during this period. (Which BTW also makes my
point (5) in comment 19 moot.)
The one-in-four behavior can actually be reproduced easily (at least here, on
a simple home network with stock F15 and a Linux 2.6.20 router): just ping a
non-existent address:

$ ping 192.168.10.124 | awk '{ "date +%X"|getline a; close("date +%X"); print a " | " $0}'
04:37:44 PM | PING 192.168.10.124 (192.168.10.124) 56(84) bytes of data.
04:37:44 PM | From 192.168.10.243 icmp_seq=2 Destination Host Unreachable
04:37:44 PM | From 192.168.10.243 icmp_seq=3 Destination Host Unreachable
04:37:44 PM | From 192.168.10.243 icmp_seq=4 Destination Host Unreachable
04:37:48 PM | From 192.168.10.243 icmp_seq=6 Destination Host Unreachable
04:37:48 PM | From 192.168.10.243 icmp_seq=7 Destination Host Unreachable
04:37:48 PM | From 192.168.10.243 icmp_seq=8 Destination Host Unreachable
04:37:52 PM | From 192.168.10.243 icmp_seq=10 Destination Host Unreachable
04:37:52 PM | From 192.168.10.243 icmp_seq=11 Destination Host Unreachable
04:37:52 PM | From 192.168.10.243 icmp_seq=12 Destination Host Unreachable
04:37:56 PM | From 192.168.10.243 icmp_seq=14 Destination Host Unreachable
04:37:56 PM | From 192.168.10.243 icmp_seq=15 Destination Host Unreachable
04:37:56 PM | From 192.168.10.243 icmp_seq=16 Destination Host Unreachable

(Note that icmp_seq starts at 1, so the replies for seq 1, 5, 9 and 13 are
missing, i.e. one in every four.)

Comparing the tcpdump run with one from such a ping can help, perhaps.
The jumps are there in comment 24 too; I didn't notice them before.

Could you please try this on your router (see kernel/net/ipv4/icmp.c)?

sysctl -w net.ipv4.icmp_ratemask=$((16#1810))

(Should turn off ratelimiting "dest unreachable".)

Thanks.
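(For reference: 16#1810 is bash for hex 0x1810. Assuming the router still
has the documented default ratemask of 6168 = 0x1818, which covers ICMP types
3, 4, 11 and 12, the new value is that default with bit 3 = ICMP_DEST_UNREACH
cleared: 0x1818 & ~(1 << 3) = 0x1810. So only "destination unreachable" loses
its rate limit; the other types keep theirs.)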
In netback, netif_be_int() seems to be set up as the interrupt handler at the
local end of an interdomain event channel:

frontend_changed() -> connect() -> connect_rings() -> netif_map() ->
bind_evtchn_to_irqhandler(netif->evtchn, netif_be_int, ...)

I think it is responsible for handling notifications from netfront.

netif_be_int() -> add_to_net_schedule_list_tail() -> list_add_tail() ->
netif_get() -> maybe_schedule_tx_action() -> net_tx_action() -> netif_rx() ->
netif_schedulable() -> netbk_queue_full() -> netif_wake_queue()

netif_be_int() first adds the interface to the list of interfaces that need
handling. Then it calls maybe_schedule_tx_action(), which checks some
conditions and schedules "net_tx_tasklet" (function net_tx_action()).

net_tx_action() has the leading comment "Called after netfront has
transmitted". It seems to consist of two loops. The first loop iterates over
all interfaces that need handling (see add_to_net_schedule_list_tail() above)
and all outstanding requests for those interfaces
(RING_FINAL_CHECK_FOR_REQUESTS()). It checks for credits (bandwidth
limiting), which I'm skipping now. Anyway, it prepares skbs and stores them
in "tx_queue". I think the second loop is only entered if at least one skb
was prepared. The second loop processes all "tx_queue" entries and passes
them on to the network stack (netif_rx()).

I must have glossed over a thousand details. Anyway, what catches my eye is
the condition for scheduling net_tx_tasklet in maybe_schedule_tx_action().
(Multiple interfaces can schedule the same net_tx_action() tasklet function,
which in turn can handle multiple interfaces; see "net_schedule_list".)

static inline void maybe_schedule_tx_action(void)
{
        smp_mb();
        if ((NR_PENDING_REQS < (MAX_PENDING_REQS/2)) &&
            !list_empty(&net_schedule_list))
                tasklet_schedule(&net_tx_tasklet);
}

#define MAX_PENDING_REQS 256

#define NR_PENDING_REQS (MAX_PENDING_REQS - pending_prod + pending_cons)

The loops described above manipulate the pending_* scalars and arrays. I
think they implement some kind of global request storage across all
interfaces. I checked the dom0 core:

crash> print pending_cons
$2 = 10384
crash> print pending_prod
$3 = 10511
crash> print 256 - pending_prod + pending_cons
$4 = 129

So the condition to schedule "net_tx_tasklet" must have evaluated to false
(129 < 128 does not hold). Note that NR_PENDING_REQS == 129 also equals the
"TX in flight" value in comment 35.

Netback may be waiting for the NR_PENDING_REQS value to go down, i.e. for
"pending_prod" to increase and/or "pending_cons" to decrease. "pending_cons"
seems never to be decreased in the source, so "pending_prod" must increase.
I can see two functions that increase "pending_prod"; both seem to be called
only by net_tx_action():

net_tx_action() -> net_tx_action_dealloc()
                -> netbk_tx_check_mop()

I'm most probably misunderstanding something, but this seems to be wrong. If
maybe_schedule_tx_action() decides not to schedule "net_tx_tasklet" /
net_tx_action(), then net_tx_action() won't have a chance to restore the
condition that would make it schedulable by maybe_schedule_tx_action() again.

There's another function scheduling "net_tx_tasklet": netif_idx_release().
But that one is only called by netbk_tx_check_mop() (see above),
net_tx_action() and netif_page_release(). The latter is a destructor
function, set up in netback_init() for the (pending?) pages that netback
seems to get from the balloon.

So the only way to decrease NR_PENDING_REQS seems to be via net_tx_action(),
whose scheduling depends on NR_PENDING_REQS being low enough...
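To make the circular dependency concrete, here is a small standalone model
(an illustration only, not netback code) that just plugs the counter values
from the dom0 core into the macro definitions quoted above:

#include <stdio.h>

#define MAX_PENDING_REQS 256
#define NR_PENDING_REQS (MAX_PENDING_REQS - pending_prod + pending_cons)

/* counter values taken from the dom0 core above */
static unsigned int pending_prod = 10511;
static unsigned int pending_cons = 10384;

/* mirrors the gate in maybe_schedule_tx_action(); the schedule list is
 * assumed to be non-empty, as in the hung state */
static int would_schedule_tasklet(void)
{
    return NR_PENDING_REQS < (MAX_PENDING_REQS / 2);
}

int main(void)
{
    printf("NR_PENDING_REQS = %u, schedule net_tx_tasklet: %s\n",
           NR_PENDING_REQS, would_schedule_tasklet() ? "yes" : "no");
    /* prints 129 and "no"; and since only net_tx_action() itself increases
     * pending_prod, nothing outside the tasklet can flip this to "yes" */
    return 0;
}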
http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00464.html

Especially
http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00699.html

Upstream has "netbk_tx_pending_timeout", we don't; and "netif_idx_release"
could be delayed by the dom0 NIC driver.

http://xenbits.xensource.com/xen-unstable.hg/rev/15182
http://xenbits.xensource.com/linux-2.6.18-xen.hg/rev/959
Created attachment 524068 [details]
log msg when netbk vif changes its mind about scheduling net_tx_tasklet

This is a tracing patch. I built & tested it locally.

The idea is that multiple netbk vifs call maybe_schedule_tx_action(), and
then net_tx_action() handles them all. This patch tries to keep track, per
netbk vif, of the decision about scheduling or not scheduling the tasklet.
Whenever the decision is inverted (on behalf of a given netbk vif), a message
should be logged, like

vif1.0: inverting scheduled_tx to 1
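The attachment itself is not reproduced here. Purely to illustrate the idea
described above (this is a sketch, not the actual patch), the change might
look roughly like the following; the per-vif "scheduled_tx" field and the
netif parameter are hypothetical additions for tracing:

/* sketch only -- the real change is attachment 524068 */
static inline void maybe_schedule_tx_action(netif_t *netif)
{
        int decision;

        smp_mb();
        decision = (NR_PENDING_REQS < (MAX_PENDING_REQS/2)) &&
                   !list_empty(&net_schedule_list);

        if (decision != netif->scheduled_tx) {     /* hypothetical tracing field */
                netif->scheduled_tx = decision;
                printk(KERN_DEBUG "%s: inverting scheduled_tx to %d\n",
                       netif->dev->name, decision); /* e.g. "vif1.0: ..." */
        }

        if (decision)
                tasklet_schedule(&net_tx_tasklet);
}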
(In reply to comment #47)
> Whenever the decision is inverted (on behalf of a given netbk vif), a
> message should be logged

printk() claims to be callable from any context:

- comment before the function's definition: "This is printk. It can be called
  from any context. We want it to work."

- http://lwn.net/Kernel/LDD3/ chapter 4, "How Messages Get Logged": "Another
  feature of the Linux approach to messaging is that printk can be invoked
  from anywhere, even from an interrupt handler, with no limit on how much
  data can be printed."
Hello Tom,

(In reply to comment #0)
>
> We are suffering from a bridge lockup problem
>
> 1. Create a bridge named mybridge as follows:

If you configure the bridge manually and leave out the netloop
interfaces/driver, then the NIC driver on the bridge can reference
data-carrying pages for an indefinite time, pages that the netback driver
depends on to complete its transfers. The network-bridge script configures
netloop precisely in order to break this linkage between the NIC driver and
netback. The necessity of netloop is not a bug; it was a design decision in
the dom0 version that is shipped in RHEL-5.

> When the problem happens, the Xen domains are not pingable for about 70
> seconds on recent RHEL releases.

The length of the lockup depends on how fast (or slowly) the dom0 processes
give up waiting for the data transfer. When a dom0 process closes its
socket(s), for example by exiting, the socket buffer pages are (or may be)
returned to netback, and transmission can resume (for a time).