Bug 106516
Summary: | LTC4814-Installation with CTC causes device resets. | | |
---|---|---|---|
Product: | Red Hat Enterprise Linux 3 | Reporter: | Robert J Brenneman <rjbrenn> |
Component: | kernel | Assignee: | Pete Zaitcev <zaitcev> |
Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock> |
Severity: | medium | Priority: | medium |
Version: | 3.0 | CC: | elfert, petrides, tos |
Hardware: | s390 | OS: | Linux |
Doc Type: | Bug Fix | Last Closed: | 2005-09-09 00:03:10 UTC |

Description
Robert J Brenneman
2003-10-07 22:05:32 UTC
Sad, isn't it. It is a long-standing problem; the upstream (that is, IBM Boeblingen) does not seem to care. It sits on my plate and gets pushed back by IBM's demands to integrate features in support of zfcp, zcrypt, etc. Phillip wrote a smaller program which seems to reproduce it. I may be able to get to it on Oct 24, if I'm lucky and Ingolf has not found a new exciting feature to integrate (ha!).

See also bug 104823, bug 104084 (DO NOT DUP - PROTECTION BITS).

Requestor, please do not dump long logs into the comments box. Attach them instead.

Created attachment 95044 [details]
Phil's test, modified for devel6
Robert, there's something you can do to help here. On the failing system, try to install by whatever means are necessary, then run the attached test. It needs to be modified with the actual IP addresses of the installation. See if you can get it to fail. The purpose of the exercise is to have a test case which does not require the installer environment, in case I come up with a test kernel. If you are unable to use this test (admittedly, quickly made up), then it's no problem, just mark the bug.

In this section of code:

    #if 0 /* pknirsch */
        strcpy(dev.device, "ctc0");
        inet_aton("192.168.20.123", &dev.ip);
        inet_aton("192.168.20.9", &dev.ptpaddr);
        inet_aton("255.255.255.0", &dev.netmask);
        inet_aton("192.168.20.0", &dev.network);
        inet_aton("192.168.20.255", &dev.broadcast);
        inet_aton("192.168.20.9", &dev.gateway);
        dev.set = PUMP_INTFINFO_HAS_NETMASK | PUMP_INTFINFO_HAS_IP |
                  PUMP_INTFINFO_HAS_NETWORK | PUMP_INTFINFO_HAS_BROADCAST |
                  PUMP_NETINFO_HAS_GATEWAY | PUMP_INTFINFO_HAS_PTPADDR;
    #else /* P3 for devel6 */
        strcpy(dev.device, "ctc0");
        inet_aton("192.168.5.118", &dev.ip);
        // inet_aton("192.168.20.6", &dev.ptpaddr); NOT USED BY pump-devel! True!
        inet_aton("255.255.255.255", &dev.netmask);
        inet_aton("192.168.20.0", &dev.broadcast); // Unconditionally used
        inet_aton("192.168.20.6", &dev.gateway);
        dev.set = PUMP_INTFINFO_HAS_NETMASK | PUMP_INTFINFO_HAS_IP |
                  PUMP_NETINFO_HAS_GATEWAY;
    #endif

It looks like you're always running the else block, so I'll change the IPs to match our setup here. If that's not what you're looking for, let me know. Otherwise I'll post results assuming I'm only to run the else block. Also, which package do I need to get pumpif.h?

Please use <pump.h>.

    [zaitcev@devel6 zaitcev]$ rpm -qf /usr/include/pump.h
    pump-devel-0.8.19-1

Thanks for looking at it.

pump-devel is not included with the Taroon RC4 iso images. Is there somewhere I can download it?

Created attachment 95521 [details]
pumpif.h (substitute for pump.h)
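The address setup quoted in the comment above calls inet_aton() without checking its result, so a mistyped address would go unnoticed when the test is adapted to another installation. A minimal sketch of a checked variant follows; `parse_ipv4` is a hypothetical helper name and is not part of pump or of the attached test.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>

/* Hypothetical helper (not part of pump or nettest): parse a dotted-quad
 * string into *out and report a bad string instead of ignoring it.
 * Returns 0 on success, -1 on failure. */
int parse_ipv4(const char *label, const char *text, struct in_addr *out)
{
    /* inet_aton() returns nonzero for a valid address and 0 otherwise. */
    if (inet_aton(text, out) == 0) {
        fprintf(stderr, "invalid %s address: %s\n", label, text);
        return -1;
    }
    return 0;
}
```

Each inet_aton() call in the test could be routed through such a helper, so that the test stops before touching the interface if any address fails to parse.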
Created attachment 95522 [details]
pumpif.c (substitute for libpump.a)
What's the incantation to get pumpif.c compiled as libpump.a and its symbols exported to the linker for nettest?

Created attachment 95523 [details]
Makefile
Created attachment 95524 [details]
Makefile (right one this time)
OK - nettest is working now... So what effect am I looking for? A set of iterations and delays that consistently break the connection?

The original report was:

    ctc0: RX channel restart
    ch-0c00: Busy !
    ctc0: Timeout during TX init handshake
    ctc0: TX channel restart

then, repeated "TX channel restart". Phil's nettest imitates what the installer does. It is possible that repeated running of nettest clears problems, so it would only be apparent if there's a significant delay between re-runs, e.g.

    while true; do
        sleep 20
        ./nettest
    done

BTW, be prepared for a loss of connectivity due to the default route being dropped. It has to be restored if the ctc under test is the primary connectivity of the VM guest.

Robert, if you can reproduce the hang, please do this on the console:

    cat /proc/net/ctc/ctc0/statistics

[Also I'm making "private" the erroneous comment by Bill to get it out of the way]

------ Additional Comments From billgo.com 2003-28-10 17:30 -------
subject--:Subject: [Bug 106516] LTC4814-Installation with CTC causes device resets.

Created attachment 95723 [details]
ctc2 (passes unit & Phil's nettest)
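The statistics check requested above could also be automated inside a reproduction loop. The sketch below assumes the statistics file contains a line of the form `Device FSM state: <name>` and that a device stuck in startup reports a name beginning with `StartWait`; both are assumptions about the driver's output format, and the function names are hypothetical.

```c
#include <string.h>

/* Copy the value that follows "Device FSM state:" in `text` into `out`.
 * Returns 0 on success, -1 if the key is missing, the value is empty,
 * or `out` is too small to hold it. */
int ctc_device_state(const char *text, char *out, size_t out_len)
{
    static const char key[] = "Device FSM state:";
    const char *p = strstr(text, key);
    size_t n;

    if (p == NULL)
        return -1;
    p += sizeof key - 1;
    while (*p == ' ' || *p == '\t')      /* skip blanks after the colon */
        p++;
    n = strcspn(p, "\r\n");              /* value runs to end of line */
    while (n > 0 && (p[n - 1] == ' ' || p[n - 1] == '\t'))
        n--;                             /* trim trailing blanks */
    if (n == 0 || n >= out_len)
        return -1;
    memcpy(out, p, n);
    out[n] = '\0';
    return 0;
}

/* Returns 1 if the device state name starts with "StartWait" (assumed
 * to mean the device never finished starting), 0 otherwise. */
int ctc_is_stuck_starting(const char *text)
{
    char state[64];

    if (ctc_device_state(text, state, sizeof state) != 0)
        return 0;
    return strncmp(state, "StartWait", 9) == 0;
}
```

A caller would read /proc/net/ctc/ctc0/statistics into a buffer after each nettest run and stop the loop once `ctc_is_stuck_starting()` returns 1, leaving the failed state in place for inspection.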
What do you know, it appears I fixed it (see ctc2 attachment).

Modified in 2.4.21-4.11.EL.

This fix is something which really needs to be discussed with IBM Germany / Boeblingen. They might have a better fix, or a rewrite.

What OS is running on the peer machine? If it is Linux as well, did you ever try to use protocol 1 (Linux<->Linux)? This uses a slightly safer initial handshake and also removes the restriction of only transferring IP packets. -FE (Volker Tosta for Fritz Elfert)

IMHO, remote OS has nothing to do with the problem. The root cause is that the locking in the ctc is half absent, half broken. The driver as present on DW or in Marcelo tree is not SMP safe at all. It only works as long as the ifconfig thread is out of the picture.

Sorry, but you're wrong. It doesn't have to do with SMP-awareness. Using your little test program I found the real error: halt_IO(), used at various locations for aborting I/O operations, does not always succeed, and thus interrupts are still delivered when the driver is not expecting them. This happens especially if the device is shut down and started quickly. In fact, this possibility is documented in PoP. In the current driver, however, halt_IO() returning a Busy condition is treated as a fatal error. I will come up with a more elaborate implementation of shutting down I/O properly ...

BTW: The remote OS makes a LOT of difference, because I've never seen any doc about the high-level handshake procedures of VM/TCP, z/OS or OS/390. These are confidential, and in order to make the ctc driver GPL'd, I took a clean-room approach and tried to reverse engineer the handshake. Therefore there may be some glitches or unexpected behavior which, of course, does not happen when using protocol 1 with Linux on both sides. -FE

I observe what Fritz described as well, it's a different problem. When ctc races SMP, all sorts of nasties happen.
What broke our patience was that it started to BUG() in timer code due to spurious calls to init_timer() in the fsm.c:add_timer(). It just runs amok unlocked, and my patch fixes that.

When HaltIO fails, it looks like this:

    ctc0: RX channel restart
    ch-0150: Busy !
    ctc0: TX channel restart
    ch-0151: Busy !
    ctc0: Timeout during TX init handshake
    ctc0: TX channel restart <---- dead here
    sh-2.05b# cat /proc/net/ctc/ctc0/statistics
    Device FSM state: StartWait RX
    RX channel FSM state: Stopped
    TX channel FSM state: TX idle
    ......

The device will not leave RX_START_WAIT. If Fritz comes with a patch to fix the HaltIO problem, it is going to be welcome. But it does not invalidate my assertion of locking problems in ctc in the least.

This was fixed in RHEL3 U2.

Here is a copy of the Errata System message: "An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2004-188.html"

Obviously, the latest release kernel should be used (RHSA-2005:472), and U6 is currently in beta and will be released soon (RHSA-2005:663).
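The HaltIO discussion in this thread notes that the driver treats a Busy result from a halt request as fatal. The following user-space sketch only illustrates the alternative of treating Busy as retryable; it is not the fix shipped in the errata and not the patch Fritz proposed. `halt_fn`, `halt_with_retry`, `struct fake_channel` and `fake_halt` are all hypothetical names, and the callback does not have the kernel's halt_IO() signature.

```c
#include <errno.h>

/* Hypothetical callback standing in for a halt request; this is NOT the
 * kernel's halt_IO() signature. Returns 0 or a negative errno value. */
typedef int (*halt_fn)(void *ctx);

/* Issue the halt up to max_tries times. -EBUSY means "try again"; any
 * other result is returned at once. Returns -EBUSY if every attempt was
 * refused (or if max_tries is not positive). */
int halt_with_retry(halt_fn halt, void *ctx, int max_tries)
{
    int rc = -EBUSY;
    int attempt;

    for (attempt = 0; attempt < max_tries; attempt++) {
        rc = halt(ctx);
        if (rc != -EBUSY)
            return rc;
        /* A real driver would wait for the pending interrupt here
         * instead of retrying immediately; that part is omitted. */
    }
    return rc;
}

/* Demonstration stand-in: refuses the first `busy_left` halt requests. */
struct fake_channel {
    int busy_left;
    int calls;
};

int fake_halt(void *ctx)
{
    struct fake_channel *ch = ctx;

    ch->calls++;
    if (ch->busy_left > 0) {
        ch->busy_left--;
        return -EBUSY;
    }
    return 0;
}
```

With a channel that refuses two requests, `halt_with_retry(fake_halt, &ch, 5)` succeeds on the third call; with one that keeps refusing, the caller gets -EBUSY back after the last attempt and can decide whether to give up.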