Bug 55963
Summary: | eth gdb still lock ups. | | |
---|---|---|---|
Product: | [Retired] eCos | Reporter: | Andrew Lunn <andrew.lunn> |
Component: | Ethernet drivers | Assignee: | Hugo Tyson <hmt> |
Status: | CLOSED WONTFIX | QA Contact: | George Thomas <gthomas> |
Severity: | medium | Priority: | medium |
Version: | 1.5.2 | CC: | hmt, jlarmour, jskov, nickg |
Hardware: | strongarm | OS: | Linux |
Doc Type: | Bug Fix | Last Closed: | 2003-06-20 16:06:06 UTC |

Description
Andrew Lunn
2001-11-09 15:43:26 UTC
Created attachment 37041 [details]
Trace from debug session
Created attachment 37042 [details]
ecos.ecc being used to build flood
I spent a little more time on this....

It locks up directly after receiving an NTP broadcast. After the broadcast the host retries the TCP segment, but the target does not respond.

I have reproduced this in the ASCOM branch but not in the trunk. So something is missing.

Date: Tue, 4 Dec 2001 18:59:17 GMT
X-Authentication-Warning: masala.cambridge.redhat.com: hmt set sender to hmt using -f
From: Hugo Tyson <hmt>
To: andrew.lunn
Cc: alexs, CASEs, hmt
In-reply-to: <200111061633.fA6GXjs29640.redhat.com> (message from Hugo Tyson on Tue, 6 Nov 2001 16:33:45 GMT)
Subject: Re: CASE 106437

Oh dear oh dear - there's a load of additional changes of which I was unaware to add to the earlier patch I sent. This involves locking (disabling interrupts) around every single call into RedBoot in the virtual vectors, ie. every diag_printf(), that sort of thing.

The changes are large, but repetitive, so I enclose a complete new file rather than a patch. The file is

    $PACKAGES/hal/common/$VERSION/include/hal_if.h

and it just drops into the tree we released to you for ecos-1.5.2. That took some fiddling because quite a lot of other stuff has changed too in the meanwhile.

It is of course best to rebuild RedBoot as well as all and any eCos apps, but I'm really not sure that this is completely necessary - for the simple reason that once we are within RedBoot (in the stubs or whatever) interrupts are off anyway, so this new second layer of locking is unnecessary. But for best results, do rebuild RedBoot (or equivalent bootstraps/stubs) if you see any problems at all, please.

Thanks, and sorry for the time it's taken to fully get this out to you.

- Huge

Created attachment 39728 [details]
Attached the new hal_if.h so i don't lose it!
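The locking described in the message above can be pictured with a small host-side sketch. It is not taken from the attached hal_if.h; it only assumes the general pattern of saving the interrupt state, disabling interrupts, making the call into RedBoot, and restoring the saved state. Every name here (`sim_irq_enabled`, `sim_disable_interrupts`, `sim_restore_interrupts`, `redboot_write_char`, `vv_write_guarded`) is hypothetical; on a real target the HAL's own interrupt save/restore macros would take the place of the simulated flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated CPU interrupt-enable flag (hypothetical; stands in for the
 * real processor state that the HAL would save and restore). */
static bool sim_irq_enabled = true;

/* Bookkeeping for the sketch: how many calls reached "RedBoot", and how
 * many of those ran with interrupts still enabled. */
static int total_calls = 0;
static int unguarded_calls = 0;

/* Save the current state and disable interrupts; returns the old state. */
static bool sim_disable_interrupts(void)
{
    bool old = sim_irq_enabled;
    sim_irq_enabled = false;
    return old;
}

/* Restore the saved state rather than blindly enabling, so that a call
 * made with interrupts already off leaves them off. */
static void sim_restore_interrupts(bool old)
{
    sim_irq_enabled = old;
}

/* Stand-in for a RedBoot service reached through the virtual vectors,
 * e.g. the routine that emits one character on the debug channel. */
static void redboot_write_char(char c)
{
    (void)c;
    total_calls++;
    if (sim_irq_enabled)
        unguarded_calls++;   /* an ISR could re-enter RedBoot here */
}

/* Guarded call: interrupts are off for the duration of the call. */
static void vv_write_guarded(char c)
{
    bool old = sim_disable_interrupts();
    redboot_write_char(c);
    sim_restore_interrupts(old);
}

/* Write a whole string through the guarded path. */
static void vv_write_string_guarded(const char *s)
{
    while (*s)
        vv_write_guarded(*s++);
}
```

Calling `vv_write_string_guarded("hi\n")` makes three guarded calls and leaves the interrupt flag as it found it; calling `redboot_write_char()` directly is the unguarded case the new header is meant to eliminate.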
Nope, still broken. In fact, it's more broken :-(

I rebuilt everything. I can connect with gdb and download the image. I then hit c and it locks up. ^C causes gdb to send packets over TCP, but the target never responds. A break point at cyg_start has no effect; it's never reached, or the stub does not inform gdb. There are no packets sent to the network. No dhcp, no pings etc. Nothing.

If i turn the flood elf into srec and from the redboot serial console tftp to load it and execute it, it works as expected, flood pinging the server returned in the dhcp response.

If it prints *anything* before the network has been completely initialized (eg. printing the bootp record) then it will hang. I know you know that, but I'm just checking.

Have you tried a breakpoint in the init routine of the ether driver? (dunno what platform you're trying here). It should get that far... (before cyg_start IIRC)

No i didn't know this. It worked before. I'm using the same test case as before and it ran until the NTP broadcast was received.

If this is true the default values for CYGHWR_NET_DRIVER_ETH?_BOOTP_SHOW should be changed. They default to 1.

Also your test programs are broken.....

    void
    net_test(cyg_addrword_t param)
    {
        struct protoent *p;

        diag_printf("Start Flood PING test\n");
        init_all_network_interfaces();
        diag_printf("Interfaces up:\n");
        cyg_kmem_print_stats();

So it will do a diag_printf before the interfaces are up!

Hum, when you say network, are you including or excluding the network interface?

The printing thing might be a red herring; but a) it probably depends on the hardware, b) most people debug over serial and like to know how network startup is getting along.

I just tried it with a PLC2 (SA1110 + SMSC LAN91C96) and it works AOK with network debug even with all that printing in there.

It's an "if in doubt, turn off that printing" thing.

Ah, actually it probably matters far more with > 1 interface, 'cos DHCP has to switch eth0 off whilst dealing with eth1, and vice versa.

Printing before even starting to init the interfaces is OK, sorry for that confusion. So maybe it's down to NTP broadcasts...

After a night's sleep, i've now correctly built redboot. (I hope). It still does not work, but at least we are back to NTP killing it.....

    arm-elf-gdb flood
    tar rem 192.168.11.229:9000
    load
    b floodsend
    c

At this point dhcp does its stuff and then prints out the network buffer stats. It then hits the break point.

Twiddle thumbs until NTP broadcast occurs.

    cont

We are dead. No response from target.....

    14:38:04.310337 tuxp.ma.tech.ascom.ch.ntp > 192.168.9.255.ntp: v4 bcast strat 6 poll 6 prec -16
    14:38:04.311661 192.168.11.229.9000 > tuxp2.ma.tech.ascom.ch.3799: P 10247:10311(64) ack 329234 win 1458
    14:38:04.330283 tuxp2.ma.tech.ascom.ch.3799 > 192.168.11.229.9000: . ack 10311 win 32120 (DF)

Something interesting here. On receiving the ntp broadcast the target immediately sends something to gdb!

Turning on remotedebug does not provide anything interesting....

    Breakpoint 1, floodsend (param=0) at /lhome/lunn/ecos-1.5.2/packages/net/tcpip/v1_5_2/tests/flood.c:294
    294 in /lhome/lunn/ecos-1.5.2/packages/net/tcpip/v1_5_2/tests/flood.c
    (gdb) c
    Continuing.
    Sending packet: $s#73...Timed out.

Hum.... I used tcpdump to catch the packet from the target after the NTP broadcast.

    14:51:56.335155 192.168.11.229.9000 > tuxp2.ma.tech.ascom.ch.3802: P 10247:10311(64) ack 2835958213 win 1458
        4500 0068 0e97 0000 4006 d39e c0a8 0be5
        c0a8 0b25 2328 0eda 0000 2807 a909 49c5
        5018 05b2 4cd1 0000 244f 3431 3533 3533
        3435 3532 3534 3230 3436 3431 3439 3443
        3341 3230 3343 3334 3345 3733 3633 3638
        3635 3634 3245 3633 3738 3738 3230 3230
        3230 3230 3230 3230

Decoding that by hand....

    4500 0068 0e97 0000 4006 d39e c0a8 0be5
    E.   .h   ..   ..   @.   ..   ..   ..
    c0a8 0b25 2328 0eda 0000 2807 a909 49c5
    ..   .%   #(   ..   ..   (.   ..   I.
    5018 05b2 4cd1 0000 244f 3431 3533 3533
    P.   ..   L.   ..   $O   41   53   53
    3435 3532 3534 3230 3436 3431 3439 3443
    45   52   54   20   46   41   49   4C
    3341 3230 3343 3334 3345 3733 3633 3638
    3A   23   34   41   49   73   63   68
    3635 3634 3245 3633 3738 3738 3230 3230
    65   43   2E   63   78   78   30   20
    3230 3230 3230 3230
    20   20   20

This looks like a gdb response....

    $O4153534552204641494C3A23344149736368432E6378783020202020

Again decoding the hex by hand....

    4153 5345 5254 2046 4149 4C3A 2334 4149
    AS   SE   RT    F   AI   L:   #4   AI
    7363 6843 2E63 7878 3020 2020 20
    sc   hC   .c   xx   0

Well it starts out good....Anybody understand the stuff at the end?

Try adding [the equivalent of] RedBoot's start_console()/end_console() to the eCos 'assert' functions. That way, the assert message will come out on the serial port directly.

I used a different solution to what Gary suggested, but i have an answer...

    static void Cyg_Scheduler::unlock_inner(unsigned int = 0)
    /export/home/lunn/ecos-1.5.2/packages/kernel/v1_5_2/src/sched/sched.cxx
    Bad next thread

It will take some more work to get the line number... but it's not needed since that message only appears once in the function.....

This assert says that the next thread chosen to be scheduled does not seem to be an instance of a schedulable thread. It is another indication that some sort of data corruption has taken place. Since it seems to happen after an NTP broadcast, we should concentrate on that area of the code.

A Status Report....

I was able to reproduce a variety of failures debugging over the network using eCos and RedBoot trunk sources. These were caused by numerous things:

1) Asserts firing in the app ethernet driver, because of differences in the way it is called by RedBoot when under heavy load/low on memory. The real problem though is that it was not possible to report the assert, nor take a breakpoint on cyg_assert_fail(), because the driver itself is used so to do, and the call in question already went through RedBoot, so it's not expecting to be called back at that moment.
All asserts and APIs are fixed; it was the plc2 with SMSC LAN91C96 ether driver I was using.

2) i) Messages such as "Out of MBUFs" and the like being printed by RedBoot's and the application's generic ethernet drivers - both of the eth_drv.c files. Again the problem is that it was not possible to do the print, because the driver itself is used so to do; the recursion is unexpected and uncontrolled. I added configury to steer these messages to a specific console (serial port) so a) you can see them, and b) they do not screw up the debug protocol by recursing into the network driver - same problem again.

   ii) There remained a further printf in if_ethersubr.c which I have now simply deleted - it printed a message if it sees a packet of unrecognized address family.

   iii) There may be other unguarded printfs that I have not yet located.

3) When out of memory the RedBoot generic ether handler forwarded bogus "packets" to the application stack, by accidentally using a null pointer to get at the data. That's all fixed.

The files affected are (excluding the lan91c96 driver):

    ./io/eth/current/cdl/eth_drivers.cdl
    ./io/eth/current/src/net/eth_drv.c
    ./io/eth/current/src/stand_alone/eth_drv.c
    ./redboot/current/cdl/redboot.cdl
    ./redboot/current/src/net/net_io.c
    ./net/tcpip/current/src/sys/net/if_ethersubr.c

NB: these changes mean that to run a network app, via network debugging, you must configure *both* the application, and RedBoot, specially. That's if the network is going to be busy enough to provoke any "low memory" messages, anyway - a simple ping test will likely be fine whatever.

Also, having configured thus, any warning messages will come out *in clear* on the serial device selected - even if you change your mind and use that serial port for the debug channel. This should be harmless, but you won't see the warnings. Of course, RedBoot cannot print any warnings if serial debugging, but the application network stack might.

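For reference, application output sent over the debug channel travels to GDB as `$O` console-output packets, the same packet type that carried the assert text captured with tcpdump earlier in this report. In the GDB remote protocol the body of an `O` packet is the message with each character written as two hex digits. A minimal sketch of a decoder follows; `hex_digit`, `decode_o_packet` and `captured_payload` are names made up for this example, and `captured_payload` is simply the 64-byte TCP payload from the tcpdump hex dump above written out as ASCII.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Convert one hex digit to its value, or -1 if it is not a hex digit. */
static int hex_digit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decode the body of a GDB remote "$O<hex>" console-output packet.
 * Decoding stops at '#' (start of the checksum), at the end of the
 * string, or when out has room left only for the terminating NUL.
 * Returns the number of characters decoded, or -1 if the packet does
 * not start with "$O" or contains a malformed hex pair. */
static int decode_o_packet(const char *pkt, char *out, size_t out_size)
{
    size_t n = 0;

    if (out_size == 0 || pkt[0] != '$' || pkt[1] != 'O')
        return -1;
    pkt += 2;
    while (pkt[0] != '\0' && pkt[0] != '#' && n + 1 < out_size) {
        int hi = hex_digit(pkt[0]);
        int lo = (hi < 0) ? -1 : hex_digit(pkt[1]);
        if (hi < 0 || lo < 0)
            return -1;
        out[n++] = (char)(hi * 16 + lo);
        pkt += 2;
    }
    out[n] = '\0';
    return (int)n;
}

/* TCP payload of the packet captured at 14:51:56, as ASCII text. */
static const char captured_payload[] =
    "$O415353455254204641494C3A203C343E"
    "73636865642E637878202020202020";
```

Run over `captured_payload`, the decoder yields 31 characters: `ASSERT FAIL: <4>sched.cxx` followed by six padding spaces. That agrees with the assert in sched.cxx identified later in the thread.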
The relevant config options are:

For the app:

    eth_drivers.cdl: cdl_component CYGPKG_IO_ETH_DRIVERS_WARN_FORCE_CONSOLE {
    eth_drivers.cdl: cdl_option CYGPKG_IO_ETH_DRIVERS_WARN_FORCE_CONSOLE_NUMBER {

(optionally disable CYGPKG_IO_ETH_DRIVERS_WARN_NO_MBUFS in the same package)

For RedBoot:

    redboot.cdl: cdl_option CYGDBG_REDBOOT_NET_DEBUG_CONSOLE_NUMBER {

The bug (3) could have caused the if_ethersubr.c printf mentioned in (2)(ii), even with all other warning messages turned off. But given (3) fixed, it is extremely unlikely that (2)(ii) would bite - or any other unguarded printfs, now that the app stack is getting only good packets rather than injections of random data.

I was able to run a network-debugged network app (snmpping changed to run forever) on our house network with plenty of "foreign" traffic (including NTP) overnight and all was well. snmpwalk was hammering at it, together with periods of flood pings of size 2800, so that the 10Mbit network segment was >40% used, and had constant collisions while the floodping was active.

This is with the configury to steer RedBoot's and the app's network buffer warning messages to serial 1, which I watched using minicom. Many "out of MBUFs" warnings appeared there, and if RedBoot's diagnostics are enabled (CYGPKG_IO_ETH_DRIVERS_WARN_NO_MBUFS, enabled by default) then it too reports dropping packets &c.

I merged all the relevant changes into the ASCOM branch, and it is not so reliable; it hangs with the application dead as soon as you overload the network (though not RedBoot's net presence, you can ping it). Perhaps the LAN91C96 driver needs the regular "tickle" function which the trunk provides to keep it alive when the stack's output queue becomes full, or the fix for the internal application stack queue being full and never restarting [CASE 106613] is needed. But I would expect that to bite when debugging over serial also; it seems not to, serial debug is as solidly reliable as ever.
I have not yet investigated those options because this is getting too close to importing the complete trunk into that branch. ;-(

More when time permits...

Interesting; it appears that a branch app works fine with a trunk RedBoot, and a trunk app fails with a branch RedBoot. Useful datum, that. Supports my opinion that it's not the app net stack - suggests it's something in the stubs.

2002-01-30. Main development trunk of eCos/RedBoot.

After a lot of elapsed time, I now believe that I have reliable network debugging of a network-using application, on a live network where other hosts are eg. flood pinging the target even whilst the app is stopped.

There were many issues which stopped it working, which are unique to the networked debug situation, distinct from the serial debug case (where sharing with the "real" async serial device is banned). These are the issues and their solutions in the order they were discovered and dealt with (initially by Gary, thanks Gary - I'm not claiming all the credit here by any means):

o Network debug traffic such as $O packets from application printf()s would suffer descheduling and interrupts from the ethernet device, so the virtual vector call into RedBoot was invoked re-entrantly and the GDB protocol got out of sync.
  Solution: disable interrupts in all virtual vector calls.

o Forwarding packets and tx completion info from RedBoot's net driver to the app net driver would allow the app driver to wake up the stack anyway.
  Solution: lock the scheduler in all virtual vector calls.

o The RedBoot driver would sometimes print messages if it was out of memory. Most of these were routed deliberately to a serial console, but a couple would re-entrantly try to print using GDB protocol and the network driver itself.
  Solution: catch all of these, and add a control to steer the messages to any serial console so you can see them on e.g. a PLC2 where only serial #1 is taken out to a connector.

o The application driver would also print messages if it was out of memory. These were always just printed, and would try to use GDB protocol and the network driver itself. This did not matter in normal use, so long as the act of printing (ie. calls from within the GDB stubs within RedBoot) did not itself provoke the messages. But if it did - or the message was provoked from any RedBoot/GDB-protocol traffic passing packets on to the application - recursion death follows.
  Solution: add a control optionally to force the messages to a serial console so they don't mutually recurse. Choice of console.

o Red Hat's internal testing adds some code to the assert system to dump out the assert message before hitting the traditional breakpoint on cyg_assert(). This is fine, so long as the assert does not fire within a call to do debug traffic from RedBoot itself! Again, recursion death follows. There were a few cases where asserts were firing in various ethernet drivers 'cos of differences in the API when called from RedBoot and from the application.
  Solution: these were fixed so that no asserts were thrown in callbacks to the app from within RedBoot.
  Solution: also added control optionally to force the messages to a serial console so that I could see what on earth was going on if an assert did fire. Arguably we should add this facility to public releases; having the info spew out a serial line *before* hitting the breakpoint (which then crashes 'cos you're now in a recursion) or trying to print the messages over the debug channel (ditto) is a good thing. I'll see what the team says. cyg_assert_msg() is the routine.

All of the above made it possible to run an application which prints stuff via the GDB protocol over the network debug channel whilst using the network heavily. One other issue remained:

o When you hit ^C to stop the application, or it takes a breakpoint as part of debugging, an application assert would fire.
  Without all the above fixes this just manifested as a hang. The complaint was stack out of bounds 'cos we were running on RedBoot's exception stack. The real reason was that having stopped the app and entered RedBoot uncooperatively, the scheduler wasn't locked. So if packets arrive, or a transmit completes, a callback from RedBoot into the app occurred to forward the packet &c, and the application tried to awaken the network stack and so on; either you get asserts because you're not really running on an eCos stack, or recursion death follows.
  Solution: lock the scheduler in the RedBoot net driver whenever you make a callback into the "previous" (application) driver. This prevents the resulting flag signalling op in the app from trying to reschedule, and thus prevents both recursion death and any of those asserts getting tested.

This may be redundant with the second issue's solution above, locking the scheduler synchronously when calling into RedBoot from the app. But I think that fix is best left in place; other RedBoot features such as querying the fconfig info may not be re-entrant either.

Having realized that the steering of that special diagnostic output was implemented in 3 places (RedBoot eth driver, app eth driver and assert mechanism) and that we might want to generalize it, and that to get a reliable system you had to rebuild the app and RedBoot with special configuration, and install that new RedBoot, I changed it all to use RedBoot's fconfig system. The config items are:

    Force console for special debug messages: true
    Console number for special debug messages: 1

nicknames:

    info_console_force: true
    info_console_number: 1

The default is false, so no redirection occurs. RedBoot's net debug messages would go to serial 0 in this case, and application ethernet messages would go via the standard GDB channel.

Given all that, and setting the fconfig items as above, I now have a reliable network debug system on the PLC2.

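The routing rule just described for the two fconfig items can be written down as a few lines of C. This is only a sketch of the rule as stated in the comment above, not RedBoot code: the struct, the `CHANNEL_GDB` marker and both function names are hypothetical, and how the real drivers read fconfig is not shown.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical in-memory copy of the two fconfig items named above. */
struct info_console_cfg {
    bool force;    /* info_console_force  */
    int  number;   /* info_console_number */
};

/* Sketch-only marker meaning "send via the GDB debug channel". */
#define CHANNEL_GDB (-1)

/* Console for RedBoot's own net debug messages:
 * serial 0 by default, the configured console when forced. */
static int redboot_info_console(const struct info_console_cfg *cfg)
{
    return cfg->force ? cfg->number : 0;
}

/* Channel for application ethernet driver messages:
 * the GDB channel by default, the configured console when forced. */
static int app_info_channel(const struct info_console_cfg *cfg)
{
    return cfg->force ? cfg->number : CHANNEL_GDB;
}
```

With `force` false the two functions return 0 and `CHANNEL_GDB` respectively; with `force` true and `number` 1 both return 1, which is the setting reported as reliable on the PLC2.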
Last time I tried, I was unable to get any of this to work with the RedBoot built from the ASCOM branch (1.5.2 IIRC) - the only thing to suggest is either a complete import of everything used to build RedBoot, which I don't like, or wait for 2.0, which I do like.

All of this will be in anoncvs soon enough.

The end.

1) The assert stuff sounds interesting. We already have something like this since we write the assert message into Flash and then cause a reboot. Having official eCos support for this would be good.

2) Have you looked at the effect all this interrupt disabling and scheduling locking has on latency? Have you run the original acceptance test we had for EBSA? Are the timing requirements still met?

Re: Andrew's question 2 about latency:

All bets are off if you are debugging (ie. you stop the system). This also applies to *any* debug printouts and always has; there's no news here. All the latency testing I ever did switches the latency test system off and on around all printf()s; the released testcases all do this.

For a real application which does *not* call diag_printf() and the like (ie. does not generate GDB $O packets and wait for the '+' ACK from GDB) (and does not hit any asserts either) there will not be any of this to-ing and fro-ing into RedBoot so latency is unaffected.

There remain a couple of unconditional diag_printf()s in the application ethernet driver, "warning: eth_recv out of MBUFs" and "out of MBUFs [2]", which would provoke this. I guess we should change those also to be conditioned off by the "verbosity" setting in cyg_io_eth_net_debug. I'll fiddle with it.

- Huge

My mistake; you can turn 'em *all* off with

    CYGDBG_IO_ETH_DRIVERS_DEBUG = 0
    CYGPKG_IO_ETH_DRIVERS_WARN_NO_MBUFS = 0

so no change is needed. They're both on by default, mind.

- Huge

This bug has moved to http://bugs.ecos.sourceware.org/show_bug.cgi?id=55963