Bug 55963

Summary:

eth gdb still locks up.

Product:

[Retired] eCos

Reporter:

Andrew Lunn <andrew.lunn>

Component:

Ethernet drivers

Assignee:

Hugo Tyson <hmt>

Status:

CLOSED WONTFIX

QA Contact:

George Thomas <gthomas>

Severity:

medium

Priority:

medium

Version:

1.5.2

CC:

hmt, jlarmour, jskov, nickg

Hardware:

strongarm

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2003-06-20 16:06:06 UTC

Attachments:

Description	Flags
Trace from debug session	none
ecos.ecc being used to build flood	none
Attached the new hal_if.h so I don't lose it!	none

Description Andrew Lunn 2001-11-09 15:43:26 UTC

Description of Problem:

A continuation of CASE #106437

The patches supplied by Hugo have been applied.  Both RedBoot and the test
program have been recompiled. I'm using 'flood' as a test program. I'm still
finding that gdb loses connection to the target during a debug session.

I will attach both a log from the gdb session and the ecos.ecc file for
flood.   

Version-Release number of selected component (if applicable):

ecos-1.5.2-final
(It says current in the ecos.ecc, but that's just because we renamed all the
directories. Don't ask why.)

How Reproducible:

100% within a minute if you actively debug.

Steps to Reproduce:
1. Connect and download flood using gdb over ethernet.
2. Set a breakpoint in net_test.
3. c
4. Single step, examine variables, continue, ^C, single step etc.
   until gdb loses contact.

Actual Results:

See trace attached.

Expected Results:

It should keep going!

Comment 1 Andrew Lunn 2001-11-09 15:44:42 UTC

Created attachment 37041
Trace from debug session

Comment 2 Andrew Lunn 2001-11-09 15:46:08 UTC

Created attachment 37042
ecos.ecc being used to build flood

Comment 3 Andrew Lunn 2001-11-19 13:08:32 UTC

I spent a little more time on this....

It locks up directly after receiving an NTP broadcast. After the broadcast the host retries the TCP segment, but the target does not respond.

Comment 4 Hugo Tyson 2001-11-30 16:44:21 UTC

I have reproduced this in the ASCOM branch but
not in the trunk.  So something is missing.

Comment 5 Hugo Tyson 2001-12-04 19:02:26 UTC

Date: Tue, 4 Dec 2001 18:59:17 GMT
From: Hugo Tyson <hmt>
To: andrew.lunn
Cc: alexs, CASEs, hmt
In-reply-to: <200111061633.fA6GXjs29640.redhat.com> (message
	from Hugo Tyson on Tue, 6 Nov 2001 16:33:45 GMT)
Subject: Re: CASE 106437

Oh dear oh dear - there's a load of additional changes of which I was
unaware to add to the earlier patch I sent.  This involves locking
(disabling interrupts) around every single call into RedBoot in the virtual
vectors ie. every diag_printf(), that sort of thing.  The changes are
large, but repetitive, so I enclose a complete new file rather than patch.

The file is $PACKAGES/hal/common/$VERSION/include/hal_if.h and it just
drops into the tree we released to you for ecos-1.5.2.  That took some
fiddling because quite a lot of other stuff has changed too in the
meanwhile.

It is of course best to rebuild RedBoot as well as all and any eCos
apps, but I'm really not sure that this is completely necessary - for the
simple reason that once we are within RedBoot (in the stubs or whatever)
interrupts are off anyway, so this new second layer of locking is
unnecessary.  But for best results, do rebuild RedBoot (or equivalent
bootstraps/stubs) if you see any problems at all, please.

Thanks, and sorry for the time it's taken to fully get this out to you.

	- Huge

Comment 6 Andrew Lunn 2001-12-05 15:17:14 UTC

Created attachment 39728
Attached the new hal_if.h so I don't lose it!

Comment 7 Andrew Lunn 2001-12-06 15:53:55 UTC

Nope, still broken. In fact, it's more broken :-(

I rebuilt everything. I can connect with gdb and download the image. I then hit
c and it locks up. ^C causes gdb to send packets over TCP, but the target never
responds. A breakpoint at cyg_start has no effect: either it's never reached, or
the stub does not inform gdb. There are no packets sent to the network. No dhcp,
no pings etc. Nothing.

If I turn the flood elf into srec and, from the RedBoot serial console, tftp to
load it and execute it, it works as expected, flood pinging the server returned
in the dhcp response.

Comment 8 Hugo Tyson 2001-12-06 16:35:26 UTC

If it prints *anything* before the network has been completely initialized
(eg. printing the bootp record) then it will hang.  I know you know that, but
I'm just checking.  Have you tried a breakpoint in the init routine of the ether
driver?  (dunno what platform you're trying here).  It should get that far...
(before cyg_start IIRC)

Comment 9 Andrew Lunn 2001-12-06 16:55:13 UTC

No, I didn't know this. It worked before. I'm using the same test case as before
and it ran until the NTP broadcast was received.

If this is true the default values for CYGHWR_NET_DRIVER_ETH?_BOOTP_SHOW should
be changed. They default to 1. Also your test programs are broken.....

void
net_test(cyg_addrword_t param)
{
    struct protoent *p;

    diag_printf("Start Flood PING test\n");
    init_all_network_interfaces();
    diag_printf("Interfaces up:\n");
    cyg_kmem_print_stats();

So it will do a diag_printf before the interfaces are up!

Hum, when you say network, are you including or excluding the network interface?

Comment 10 Hugo Tyson 2001-12-06 17:57:11 UTC

The printing thing might be a red herring; but a) it probably depends
on the hardware, b) most people debug over serial and like to know how network
startup is getting along.  I just tried it with a PLC2 (SA1110 + SMSC LAN91C96)
and it works AOK with network debug even with all that printing in there.
It's an "if in doubt, turn off that printing" thing.  Ah, actually it probably
matters far more with > 1 interface, 'cos DHCP has to switch eth0 off whilst
dealing with eth1, and vice versa.

Printing before even starting to init the interfaces is OK, sorry for that
confusion.

So maybe it's down to NTP broadcasts...

Comment 11 Andrew Lunn 2001-12-07 13:48:04 UTC

After a night's sleep, I've now correctly built RedBoot. (I hope).

It still does not work, but at least we are back to NTP killing it.....

arm-elf-gdb flood
tar rem 192.168.11.229:9000
load
b floodsend
c

At this point dhcp does its stuff and then prints out the network buffer stats.
It then hits the breakpoint.

Twiddle thumbs until NTP broadcast occurs.

cont

We are dead. No response from target.....

14:38:04.310337 tuxp.ma.tech.ascom.ch.ntp > 192.168.9.255.ntp: v4 bcast strat 6 poll 6 prec -16
14:38:04.311661 192.168.11.229.9000 > tuxp2.ma.tech.ascom.ch.3799: P 10247:10311(64) ack 329234 win 1458
14:38:04.330283 tuxp2.ma.tech.ascom.ch.3799 > 192.168.11.229.9000: . ack 10311 win 32120 (DF)

Something interesting here. On receiving the NTP broadcast the target immediately
sends something to gdb! Turning on remotedebug does not provide anything
interesting....

Breakpoint 1, floodsend (param=0)
    at /lhome/lunn/ecos-1.5.2/packages/net/tcpip/v1_5_2/tests/flood.c:294
294     in /lhome/lunn/ecos-1.5.2/packages/net/tcpip/v1_5_2/tests/flood.c
(gdb) c
Continuing.
Sending packet: $s#73...Timed out.
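
For readers unfamiliar with the framing in that last line: `$s#73` is a single-step request in GDB's remote serial protocol, framed as `$`, the payload, `#`, and a two-digit hex checksum that is the modulo-256 sum of the payload bytes. A minimal sketch of that framing follows; the function names `rsp_checksum` and `rsp_frame` are invented for this illustration and are not taken from the eCos or GDB sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Modulo-256 sum of the payload bytes, as used for the "#xx" trailer. */
unsigned char rsp_checksum(const char *payload)
{
    unsigned int sum = 0;
    size_t len = strlen(payload);
    for (size_t i = 0; i < len; i++)
        sum += (unsigned char)payload[i];
    return (unsigned char)(sum & 0xffu);
}

/*
 * Write "$<payload>#<checksum>" into out.
 * Returns 0 on success, -1 if out is too small.
 */
int rsp_frame(const char *payload, char *out, size_t out_size)
{
    assert(payload != NULL && out != NULL);
    int n = snprintf(out, out_size, "$%s#%02x", payload,
                     (unsigned int)rsp_checksum(payload));
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

Framing the payload `"s"` this way gives `$s#73`, since the ASCII code of `s` is 0x73. The "Timed out" above means gdb never saw the `+` acknowledgement for that packet.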

Comment 12 Andrew Lunn 2001-12-07 14:22:35 UTC

Hum.... I used tcpdump to catch the packet from the target after the NTP broadcast.

14:51:56.335155 192.168.11.229.9000 > tuxp2.ma.tech.ascom.ch.3802: P 10247:10311(64) ack 2835958213 win 1458
                         4500 0068 0e97 0000 4006 d39e c0a8 0be5
                         c0a8 0b25 2328 0eda 0000 2807 a909 49c5
                         5018 05b2 4cd1 0000 244f 3431 3533 3533
                         3435 3532 3534 3230 3436 3431 3439 3443
                         3341 3230 3343 3334 3345 3733 3633 3638
                         3635 3634 3245 3633 3738 3738 3230 3230
                         3230 3230 3230 3230

Decoding that by hand....
4500 0068 0e97 0000 4006 d39e c0a8 0be5    E. .h .. .. @. .. .. ..
c0a8 0b25 2328 0eda 0000 2807 a909 49c5    .. .% #( .. .. (. .. I.
5018 05b2 4cd1 0000 244f 3431 3533 3533    P. .. L. .. $O 41 53 53
3435 3532 3534 3230 3436 3431 3439 3443    45 52 54 20 46 41 49 4C
3341 3230 3343 3334 3345 3733 3633 3638    3A 20 3C 34 3E 73 63 68
3635 3634 3245 3633 3738 3738 3230 3230    65 64 2E 63 78 78 20 20
3230 3230 3230 3230                        20 20 20 20

This looks like a gdb response....
$O415353455254204641494C3A203C343E73636865642E637878202020202020

Again decoding the hex by hand....
4153 5345 5254 2046 4149 4C3A 203C 343E     AS SE RT  F AI L:  < 4>
7363 6865 642E 6378 7820 2020 2020 20       sc he d. cx x

Well it starts out good....Anybody understand the stuff at the end?
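
Decoding the capture mechanically avoids transcription slips. The 64 payload bytes in the tcpdump output above are ASCII: `$O` followed by hex digit pairs, which is how GDB's remote protocol carries console output. Below is a minimal sketch of a decoder in C. The names `hex_digit_value`, `decode_console_packet` and `captured_payload` are invented for this illustration; the payload string is simply those 64 bytes written out as characters.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The 64 TCP payload bytes from the capture, written as ASCII. */
const char captured_payload[] =
    "$O415353"            /* 244f 3431 3533 3533 */
    "455254204641494C"
    "3A203C343E736368"
    "65642E6378782020"
    "20202020";

/* Value of one ASCII hex digit, or -1 if c is not a hex digit. */
int hex_digit_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Decode the text carried by a "$O<hex pairs>" console packet.
 * Stops at '#', at any other non-hex character, at the end of the
 * string, or when out is full.  Returns the number of characters
 * written (excluding the NUL), or -1 if this is not a $O packet.
 */
int decode_console_packet(const char *packet, char *out, size_t out_size)
{
    assert(packet != NULL && out != NULL && out_size > 0);
    if (strncmp(packet, "$O", 2) != 0)
        return -1;

    const char *p = packet + 2;
    size_t n = 0;
    while (n + 1 < out_size) {
        int hi = hex_digit_value(p[0]);
        if (hi < 0)
            break;
        int lo = hex_digit_value(p[1]);
        if (lo < 0)
            break;
        out[n++] = (char)(hi * 16 + lo);
        p += 2;
    }
    out[n] = '\0';
    return (int)n;
}
```

Run over `captured_payload`, this yields 31 characters: `ASSERT FAIL: <4>sched.cxx` followed by six spaces. The segment carries no `#` checksum, so the rest of the message presumably arrives in a later segment.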

Comment 13 George Thomas 2001-12-07 17:23:27 UTC

Try adding [the equivalent of] RedBoot's start_console()/end_console()
to the eCos 'assert' functions.  That way, the assert message will come
out on the serial port directly.

Comment 14 Andrew Lunn 2001-12-07 17:46:03 UTC

I used a different solution to what Gary suggested, but i have an answer...

static void Cyg_Scheduler::unlock_inner(unsigned int = 0)
/export/home/lunn/ecos-1.5.2/packages/kernel/v1_5_2/src/sched/sched.cxx
Bad next thread

It will take some more work to get the line number... but it's not needed since
that message only appears once in the function.....

Comment 15 George Thomas 2001-12-07 18:07:14 UTC

This assert says that the next thread chosen to be scheduled does not seem
to be an instance of a schedulable thread.  It is another indication that
some sort of data corruption has taken place.  Since it seems to happen after
an NTP broadcast, we should concentrate on that area of the code.

Comment 16 Hugo Tyson 2001-12-14 15:07:24 UTC

A Status Report....

I was able to reproduce a variety of failures debugging over the network
using eCos and RedBoot trunk sources.  These were caused by numerous
things:

1) Asserts firing in the app ethernet driver, because of differences in the
   way it is called by RedBoot when under heavy load/low on memory.

   The real problem though is that it was not possible to report the
   assert, nor take a breakpoint on cyg_assert_fail(), because the driver
   itself is used so to do, and the call in question already went through
   RedBoot, so it's not expecting to be called back at that moment.  All
   asserts and APIs are fixed, it was the plc2 with SMSC LAN91C96 ether
   driver I was using.

2) i) Messages such as "Out of MBUFs" and the like being printed by
   RedBoot's and the application's generic ethernet drivers - both of the
   eth_drv.c files.  Again the problem is that it was not possible to do
   the print, because the driver itself is used so to do, the recursion is
   unexpected and uncontrolled.

   I added configury to steer these messages to a specific console (serial
   port) so a) you can see them, and b) they do not screw up the debug
   protocol by recursing into the network driver - same problem again.

   ii) There remained a further printf in if_ethersubr.c which I have now
   simply deleted - it printed a message if it sees a packet of
   unrecognized address family.

   iii) There may be other unguarded printfs that I have not yet located.

3) When out of memory the RedBoot generic ether handler forwarded bogus
   "packets" to the application stack, by accidentally using a null pointer
   to get at the data.  That's all fixed.
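
Item (3) comes down to checking the buffer before forwarding. The fragment below is a minimal sketch of that kind of guard, not the actual RedBoot change: `forward_frame`, `count_delivery` and the counters are hypothetical names, and a NULL `buf` stands in for "the buffer allocation failed because memory ran out".

```c
#include <assert.h>
#include <stddef.h>

typedef void (*app_recv_fn)(const unsigned char *data, size_t len);

size_t delivered_packets = 0;   /* frames handed to the application stack */
size_t dropped_packets = 0;     /* frames discarded by the guard */

/* Hypothetical application-side receive hook; it only counts deliveries. */
void count_delivery(const unsigned char *data, size_t len)
{
    assert(data != NULL && len > 0);
    delivered_packets++;
}

/*
 * Forward a received frame to the application stack.
 * Returns 1 if the frame was forwarded, 0 if it was dropped.
 */
int forward_frame(const unsigned char *buf, size_t len, app_recv_fn deliver)
{
    if (buf == NULL || len == 0 || deliver == NULL) {
        /* Drop it: never hand a bogus "packet" up to the application. */
        dropped_packets++;
        return 0;
    }
    deliver(buf, len);
    return 1;
}
```

Dropping silently matters here because, as item (2) notes, printing a warning from this path can itself recurse into the network driver.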

The files affected are (excluding the lan91c96 driver):

 ./io/eth/current/cdl/eth_drivers.cdl
 ./io/eth/current/src/net/eth_drv.c
 ./io/eth/current/src/stand_alone/eth_drv.c

 ./redboot/current/cdl/redboot.cdl
 ./redboot/current/src/net/net_io.c

 ./net/tcpip/current/src/sys/net/if_ethersubr.c

NB: these changes mean that to run a network app, via network debugging,
you must configure *both* the application, and RedBoot, specially.  That's
if the network is going to be busy enough to provoke any "low memory"
messages, anyway - a simple ping test will likely be fine whatever.  Also,
having configured thus, any warning messages will come out *in clear* on
the serial device selected - even if you change your mind and use that
serial port for the debug channel.  This should be harmless, but you won't
see the warnings.  Of course, RedBoot cannot print any warnings if serial
debugging, but the application network stack might.

The relevant config options are:

For the app:
eth_drivers.cdl:       cdl_component CYGPKG_IO_ETH_DRIVERS_WARN_FORCE_CONSOLE {
eth_drivers.cdl:       cdl_option CYGPKG_IO_ETH_DRIVERS_WARN_FORCE_CONSOLE_NUMBER {
 (optionally disable CYGPKG_IO_ETH_DRIVERS_WARN_NO_MBUFS in the same package)

For RedBoot:
redboot.cdl:           cdl_option CYGDBG_REDBOOT_NET_DEBUG_CONSOLE_NUMBER {



The bug (3) could have caused the if_ethersubr.c printf mentioned in
(2)(ii), even with all other warnings messages turned off.  But given (3)
fixed, it is extremely unlikely that (2)(ii) would bite - or any other
unguarded printfs, now that the app stack is getting only good packets
rather than injections of random data.


I was able to run a network-debugged network app, (snmpping changed to run
forever) on our house network with plenty of "foreign" traffic (including
NTP) overnight and all was well.  snmpwalk was hammering at it, together
with periods of flood pings of size 2800, so that the 10Mbit network
segment was >40% used, and had constant collisions while the floodping was
active.

This is with the configury to steer RedBoot's and the app's network buffer
warning messages to serial 1, which I watched using minicom.  Many "out of
MBUFs" warnings appeared there, and if RedBoot's diagnostics are enabled
(CYGPKG_IO_ETH_DRIVERS_WARN_NO_MBUFS, enabled by default) then it too
reports dropping packets &c.


I merged all the relevant changes into the ASCOM branch, and it is not so
reliable; it hangs with the application dead as soon as you overload the
network (though not RedBoot's net presence, you can ping it).  Perhaps the
LAN91C96 driver needs the regular "tickle" function which the trunk
provides to keep it alive when the stack's output queue becomes full, or
the fix for the internal application stack queue being full and never
restarting [CASE 106613] is needed.  But I would expect that to bite when
debugging over serial also; it seems not to, serial debug is as solidly
reliable as ever.

I have not yet investigated those options because this is getting too close
to importing the complete trunk into that branch. ;-(
More when time permits...

Comment 17 Hugo Tyson 2001-12-14 17:06:51 UTC

Interesting; it appears that a branch app works fine with a trunk RedBoot,
and a trunk app fails with a branch RedBoot.  Useful datum, that.  Supports
my opinion that it's not the app net stack - suggests it's something in
the stubs.

Comment 18 Hugo Tyson 2002-01-30 16:26:32 UTC

2002-01-30.  Main development trunk of eCos/RedBoot.

After a lot of elapsed time, I now believe that I have reliable network
debugging of a network-using application, on a live network where other
hosts are eg. flood pinging the target even whilst the app is stopped.

There were many issues which stopped it working, which are unique to the
networked debug situation, distinct from the serial debug case (where
sharing with the "real" async serial device is banned).

These are the issues and their solutions in the order they were discovered
and dealt with (initially by Gary, thanks Gary - I'm not claiming all the
credit here by any means):

 o Network debug traffic such as $O packets from application printf()s
   would suffer descheduling and interrupts from the ethernet device, so
   the virtual vector call into RedBoot was invoked re-entrantly and the
   GDB protocol got out of sync.  Solution: disable interrupts in all
   virtual vector calls.

 o Forwarding packets and tx completion info from RedBoot's net driver to
   the app net driver would allow the app driver to wake up the stack anyway.
   Solution: lock the scheduler in all virtual vector calls.

 o The RedBoot driver would sometimes print messages if it was out of
   memory.  Most of these were routed deliberately to a serial console, but
   a couple would re-entrantly try to print using GDB protocol and the
   network driver itself.  Solution: catch all of these, and add a control
   to steer the messages to any serial console so you can see them on
   e.g. a PLC2 where only serial #1 is taken out to a connector.

 o The application driver would also print messages if it was out of
   memory.  These were always just printed, and would try to use GDB
   protocol and the network driver itself.  This did not matter in normal
   use, so long as the act of printing (ie. calls from within the GDB stubs
   within RedBoot) did not itself provoke the messages.  But if it did - or
   the message was provoked from any RedBoot/GDB-protocol traffic passing
   packets on to the application - recursion death follows.  Solution: add
   a control optionally to force the messages to a serial console so they
   don't mutually recurse.  Choice of console.

 o Red Hat's internal testing adds some code to the assert system to dump
   out the assert message before hitting the traditional breakpoint on
   cyg_assert().  This is fine, so long as the assert does not fire within
   a call to do debug traffic from RedBoot itself!  Again, recursion death
   follows.  There were a few cases where asserts were firing in various
   ethernet drivers 'cos of differences in the API when called from RedBoot
   and from the application.  Solution: these were fixed so that no
   asserts were thrown in callbacks to the app from within RedBoot.
   Solution: also added control optionally to force the messages to a
   serial console so that I could see what on earth was going on if an
   assert did fire.

   Arguably we should add this facility to public releases; having the info
   spew out a serial line *before* hitting the breakpoint (which then
   crashes 'cos you're now in a recursion) or trying to print the messages
   over the debug channel (ditto) is a good thing.  I'll see what the team
   says.  cyg_assert_msg() is the routine.
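
The first issue in the list above (re-entrant virtual vector calls) can be illustrated with a small host-side simulation. This is a sketch under stated assumptions, not the hal_if.h change itself: a plain flag stands in for the CPU interrupt state (real code would use the HAL's interrupt disable/restore macros), and `redboot_service`, `simulated_eth_interrupt` and `guarded_redboot_service` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

static bool irq_enabled = true;  /* simulated CPU interrupt-enable flag */
static int  depth = 0;           /* current nesting inside the RedBoot service */
static int  deepest = 0;         /* deepest nesting seen so far */

void simulated_eth_interrupt(void);

/* Stand-in for a non-reentrant service reached through a virtual vector. */
void redboot_service(void)
{
    depth++;
    if (depth > deepest)
        deepest = depth;
    /* An ethernet interrupt can only arrive here if interrupts are enabled. */
    if (irq_enabled)
        simulated_eth_interrupt();
    depth--;
}

/* The handler runs with interrupts masked and calls the same service. */
void simulated_eth_interrupt(void)
{
    bool saved = irq_enabled;
    irq_enabled = false;
    redboot_service();
    irq_enabled = saved;
}

/* The fix: save the interrupt state, disable, call, then restore. */
void guarded_redboot_service(void)
{
    int  before = depth;
    bool saved = irq_enabled;
    irq_enabled = false;
    redboot_service();
    irq_enabled = saved;
    assert(depth == before);
}

int  deepest_nesting(void)    { return deepest; }
void reset_nesting(void)      { depth = 0; deepest = 0; }
bool interrupts_enabled(void) { return irq_enabled; }
```

Calling `redboot_service()` directly reaches a nesting depth of 2 in this model, which is the re-entrancy that puts the GDB protocol out of sync. Going through `guarded_redboot_service()` keeps the depth at 1 and leaves the interrupt flag as it found it.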

All of the above made it possible to run an application which prints stuff
via the GDB protocol over the network debug channel whilst using the
network heavily.  One other issue remained:

 o When you hit ^C to stop the application, or it takes a breakpoint as
   part of debugging, an application assert would fire.  Without all the
   above fixes this just manifested as a hang. The complaint was stack out
   of bounds 'cos we were running on RedBoot's exception stack.  The real
   reason was that having stopped the app and entered RedBoot
   uncooperatively, the scheduler wasn't locked.  So if packets arrive, or
   a transmit completes, a callback from RedBoot into the app occurred to
   forward the packet &c, and the application tried to awaken the network
   stack and so on; either you get asserts because you're not really
   running on an eCos stack, or recursion death follows.  Solution: lock
   the scheduler in the RedBoot net driver whenever you make a callback
   into the "previous" (application) driver.  This prevents the resulting
   flag signalling op in the app from trying to reschedule, and thus
   prevents both recursion death and any of those asserts getting tested.
   This may be redundant with the second issue's solution above, locking
   the scheduler synchronously when calling into RedBoot from the app.  But
   I think that fix is best left in place; other RedBoot features such as
   querying the fconfig info may not be re-entrant either.
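
The scheduler-lock fix for that last issue can be modelled the same way. In this sketch the `sim_` functions are hypothetical stand-ins for the kernel's scheduler lock and for the flag-signalling operation in the application driver. As a simplification, the deferred reschedule is performed at unlock time, whereas the real system would presumably carry it out once the application is running on its own stack again.

```c
#include <assert.h>
#include <stdbool.h>

int  sched_lock = 0;            /* > 0 means rescheduling is deferred */
bool resched_pending = false;   /* a wake-up arrived while locked */
bool inside_callback = false;   /* app driver callback running from RedBoot */
int  unsafe_switches = 0;       /* switches attempted inside the callback */
int  deferred_switches = 0;     /* switches postponed until unlock */

void sim_scheduler_lock(void)
{
    sched_lock++;
}

void sim_scheduler_unlock(void)
{
    assert(sched_lock > 0);
    sched_lock--;
    if (sched_lock == 0 && resched_pending) {
        resched_pending = false;
        deferred_switches++;
    }
}

/* Flag-signalling operation in the app: wakes the network stack thread. */
void sim_wake_network_thread(void)
{
    if (sched_lock > 0) {
        resched_pending = true;   /* remember it, do not switch now */
    } else if (inside_callback) {
        unsafe_switches++;        /* would switch on RedBoot's stack */
    }
}

/* RedBoot forwarding a received packet to the application's driver. */
void sim_forward_packet(bool lock_around_callback)
{
    if (lock_around_callback)
        sim_scheduler_lock();
    inside_callback = true;
    sim_wake_network_thread();
    inside_callback = false;
    if (lock_around_callback)
        sim_scheduler_unlock();
}
```

With `lock_around_callback` false, the wake-up counts as an unsafe switch, which corresponds to the assert or recursion described above. With it true, the wake-up is only recorded and handled after the callback has returned.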

Having realized that the steering of that special diagnostic output was
implemented in 3 places (RedBoot eth driver, app eth driver and assert
mechanism) and that we might want to generalize it, and that to get a
reliable system you had to rebuild the app and RedBoot with special
configuration, and install that new RedBoot, I changed it all to use
RedBoot's fconfig system.  The config items are:

    Force console for special debug messages: true
    Console number for special debug messages: 1

nicknames:    

    info_console_force: true
    info_console_number: 1

The default is false, so no redirection occurs.  RedBoot's net debug
messages would go to serial 0 in this case, and application ethernet
messages would go via the standard GDB channel.

Given all that, and setting the fconfig items as above, I now have a
reliable network debug system on the PLC2.

Last time I tried, I was unable to get any of this to work with the RedBoot
built from the ASCOM branch (1.5.2 IIRC) - the only thing to suggest is
either a complete import of everything used to build RedBoot, which I don't
like, or wait for 2.0, which I do like.  All of this will be in anoncvs
soon enough.

The end.

Comment 19 Andrew Lunn 2002-02-04 15:55:07 UTC

1) The assert stuff sounds interesting. We already have something like this
since we write the assert message into Flash and then cause a reboot. Having
official eCos support for this would be good.

2) Have you looked at the effect all this interrupt disabling and scheduler
locking has on latency? Have you run the original acceptance test we had for
EBSA? Are the timing requirements still met?

Comment 20 Hugo Tyson 2002-02-04 17:56:51 UTC

Re: Andrew's question 2 about latency:

All bets are off if you are debugging (ie. you stop the system).
This also applies to *any* debug printouts and always has; there's no
news here.  All the latency testing I ever did switches the latency
test system off and on around all printf()s; the released testcases
all do this.

For a real application which does *not* call diag_printf() and the like
(ie. does not generate GDB $O packets and wait for the '+' ACK from GDB)
(and does not hit any asserts either)
there will not be any of this to-ing and fro-ing into RedBoot so latency
is unaffected.

There remain a couple of unconditional diag_printf()s in the application
ethernet driver "warning: eth_recv out of MBUFs" and "out of MBUFs [2]"
which would provoke this.  I guess we should change those also to be
conditioned off by the "verbosity" setting in cyg_io_eth_net_debug.

I'll fiddle with it.

	- Huge

Comment 21 Hugo Tyson 2002-02-04 18:01:02 UTC

My mistake; you can turn 'em *all* off with 
CYGDBG_IO_ETH_DRIVERS_DEBUG = 0
and
CYGPKG_IO_ETH_DRIVERS_WARN_NO_MBUFS = 0
so no change is needed.
They're both on by default, mind.

	- Huge

Comment 22 Alex Schuilenburg 2003-06-20 16:06:06 UTC

This bug has moved to http://bugs.ecos.sourceware.org/show_bug.cgi?id=55963