The following has been reported by IBM LTC:

Installation with CTC causes device resets.

When installing using the CTC driver, the driver loses its sync at the point in the installation immediately following selection of the NFS install method. The driver tries to recover, but fails about 50% of the time, leaving a hung installation.

Here is the output of the system console:

Linux version 2.4.21-3.EL (bhcompile.redhat.com) (gcc version 3.2.3 20030502 (Red Hat Linux 3.2.3-20)) #1 SMP Fri Sep 19 13:58:11 EDT 2003
We are running under VM (31 bit mode)
This machine has no PFIX support
This machine has an IEEE fpu
On node 0 totalpages: 65536
zone(0): 65536 pages.
zone(1): 0 pages.
zone(2): 0 pages.
Kernel command line: root=/dev/ram0 ro ip=off ramdisk_size=40000 DASD=710C,05FD CHANDEV=ctc0,0x0C00,0x0C01
Highest subchannel number detected (hex) : 0012
Calibrating delay loop... 406.32 BogoMIPS
Memory: 246720k/262144k available (2078k kernel code, 0k reserved, 547k data, 316k init)
Dentry cache hash table entries: 32768 (order: 6, 262144 bytes)
Inode cache hash table entries: 16384 (order: 5, 131072 bytes)
Mount cache hash table entries: 512 (order: 0, 4096 bytes)
Buffer cache hash table entries: 16384 (order: 4, 65536 bytes)
Page-cache hash table entries: 65536 (order: 6, 262144 bytes)
debug: Initialization complete
POSIX conformance testing by UNIFIX
Detected 2 CPU's
Boot cpu address 0
cpu 0 phys_idx=0 vers=FF ident=040356 machine=7060 unused=0000
cpu 1 phys_idx=1 vers=FF ident=040356 machine=7060 unused=0000
Starting migration thread for cpu 0
Starting migration thread for cpu 1
init_mach : starting machine check handler
Linux NET4.0 for Linux 2.4
Based upon Swansea University Computer Society NET3.039
Initializing RT netlink socket
mach_handler : ready
mach_handler : waiting for wakeup
Starting kswapd
VFS: Disk quotas vdquot_6.5.1
aio_setup: num_physpages = 16384
aio_setup: sizeof(struct page) = 52
pty: 2048 Unix98 ptys configured
NET4: Frame Diverter 0.46
RAMDISK driver initialized: 256 RAM disks of 40000K size 1024 blocksize
md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
Initializing Cryptographic API
NET4: Linux TCP/IP 1.0 for NET4.0
IP: routing cache hash table of 1024 buckets, 16Kbytes
TCP: Hash tables configured (established 8192 bind 16384)
Linux IP multicast router 0.06 plus PIM-SM
Initializing IPsec netlink socket
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
RAMDISK: Compressed image found at block 0
Freeing initrd memory: 5262k freed
VFS: Mounted root (ext2 filesystem) readonly.
Freeing unused kernel memory: 79k freed
Starting the S/390 initrd to configure networking. Version is 1.01
Enter the FQDN of your new Linux guest (e.g. s390.redhat.com):
rjbrenn.pdl.pok.ibm.com
Enter which kind of network device do you intend to use (e.g. ctc, escon, iucv, eth, hsi, tr):
ctc
Enter the IP address of your new Linux guest:
9.12.21.224
Enter the network address of the new Linux guest:
9.12.21.0
Enter the IP of your ctc/escon/iucv point-to-point partner:
172.16.21.224
CTC driver Version: 1.57 with CHANDEV support initialized
divert: not allocating divert_blk for non-ethernet device ctc0
ctc0: read: ch 0c00 (irq 000e), write: ch 0c01 (irq 000f) proto: 0
Enter your DNS server(s), separated by colons (:):
ctc0: connected with remote side
9.12.18.2
Enter your DNS search domain(s) (if any), separated by colons (:):
pdl.pok.ibm.com
ctc0      Link encap:Serial Line IP
          inet addr:9.12.21.224  P-t-P:172.16.21.224  Mask:255.255.255.255
          UP POINTOPOINT RUNNING NOARP  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
127.0.0.1       0.0.0.0         255.255.255.255 UH    0      0        0 lo
172.16.21.224   0.0.0.0         255.255.255.255 UH    0      0        0 ctc0
0.0.0.0         172.16.21.224   0.0.0.0         UG    0      0        0 ctc0
Starting portmap.
Journalled Block Device driver loaded
Starting telnetd and sshd to allow login over the network.
Connect now to 9.12.21.224 to start the installation.
sshd(pam_unix)[59]: session opened for user root by (uid=0)
* modules to insert vga16fb
* module(s) vga16fb not found
* load module set done
* 254348 kB are available
* modules to insert cramfs vfat sunrpc lockd nfs loop isofs floppy
* loaded cramfs from /modules/modules.cgz
* loaded sunrpc from /modules/modules.cgz
* loaded lockd from /modules/modules.cgz
* loaded nfs from /modules/modules.cgz
* loaded loop from /modules/modules.cgz
* module(s) vfat isofs floppy not found
* inserted /tmp/cramfs.o
* inserted /tmp/sunrpc.o
* inserted /tmp/lockd.o
* inserted /tmp/nfs.o
loop: loaded (max 8 devices)
* inserted /tmp/loop.o
* load module set done
* modules to insert ide-cd
* module(s) ide-cd not found
* load module set done
* modules to insert sd_mod sr_mod
* module(s) sd_mod sr_mod not found
* load module set done
* modules to insert dasd_mod
* loaded dasd_mod from /modules/modules.cgz
dasd: initializing...
dasd: Registered successfully to major no 94
dasd: initialization finished
* inserted /tmp/dasd_mod.o
* load module set done
* modules to insert dasd_diag_mod dasd_fba_mod dasd_eckd_mod
* loaded dasd_diag_mod from /modules/modules.cgz
* loaded dasd_fba_mod from /modules/modules.cgz
* loaded dasd_eckd_mod from /modules/modules.cgz
dasd(diag): DIAG discipline initializing
dasd: <devno: 710c> Add lowmem page :0f0d3000
dasd: <devno: 710c> Add lowmem page :0f0d2000
dasd: <devno: 710c> Free lowmem page :0f0d2000
dasd: <devno: 710c> Free lowmem page :0f0d3000
dasd: <devno: 05fd> Add lowmem page :0f0d3000
dasd: <devno: 05fd> Add lowmem page :0f0d2000
dasd: <devno: 05fd> Free lowmem page :0f0d2000
dasd: <devno: 05fd> Free lowmem page :0f0d3000
* inserted /tmp/dasd_diag_mod.o
dasd(fba): FBA discipline initializing
dasd: <devno: 710c> Add lowmem page :0f0d2000
dasd: <devno: 710c> Add lowmem page :0f0ce000
dasd: <devno: 710c> Free lowmem page :0f0ce000
dasd: <devno: 710c> Free lowmem page :0f0d2000
dasd: <devno: 05fd> Add lowmem page :0f0d2000
dasd: <devno: 05fd> Add lowmem page :0f0ce000
dasd(fba): /dev/dasdb ( 94: 4),05fd@0d: 9336/10(CU:6310/80) 128MB at(512 B/blk)
Partition check:
dasdb:CMS1/ SWAPLN: dasdb1
dasd(fba): We are interested in: Dev 9336/00 @ CU 6310/00
dasd(fba): We are interested in: Dev 3370/00 @ CU 3880/00
* inserted /tmp/dasd_fba_mod.o
dasd(eckd): ECKD discipline initializing
dasd: <devno: 710c> Add lowmem page :0f006000
dasd: <devno: 710c> Add lowmem page :0f004000
dasd(eckd): /dev/dasda ( 94: 0),710c@01: 3390/0A(CU:3990/02) Cyl:3339 Head:15 Sec:224
dasd(eckd): /dev/dasda ( 94: 0),710c@01: 3390/0A(CU:3990/02): Configuration data read
dasd: waiting for responses...
dasd(eckd): /dev/dasda ( 94: 0),710c@01: (4kB blks): 2404080kB at 48kB/trk compatible disk layout
dasda:VOL1/ 0X710C:
dasd(eckd): We are interested in: CU 3880/00
dasd(eckd): We are interested in: CU 3990/00
dasd(eckd): We are interested in: CU 2105/00
dasd(eckd): We are interested in: CU 9343/00
* inserted /tmp/dasd_eckd_mod.o
* load module set done
* no firewire controller found
* no pcic controller found
* probing buses
* finished bus probing
* found nothing
* going to set language to en_US.UTF-8
* setting language to en_US.UTF-8
* need to set up networking
* going to pick interface
* going to do getNetConfig
* doing kickstart... setting it up
ctc0: RX channel restart
ch-0c00: Busy !
ctc0: Timeout during TX init handshake
ctc0: TX channel restart
ctc0: Timeout during TX init handshake
ctc0: TX channel restart
ctc0: Timeout during TX init handshake
ctc0: TX channel restart
ctc0: Timeout during TX init handshake
ctc0: TX channel restart
ctc0: Timeout during TX init handshake
ctc0: TX channel restart
ctc0: Timeout during TX init handshake
ctc0: TX channel restart

This is more of a problem with hardware CTCs than with virtual CTCs. When doing an install in an LPAR where a hardware CTC is the only connection to the world, the CTC driver does not appear to be able to recover by itself. The other end of the CTC has to be cycled off and on again, the Linux ctc interface has to be cycled down and then up again, and the default route pointing to the CTC peer has to be added. All of these actions have to happen before a timeout expires in the installer. If the timeout expires, the install stops and does not appear to be recoverable. The timeout appears to be on a reverse lookup of the NFS/FTP install server address.
Sad, isn't it. It is a long-standing problem; the upstream (that is, IBM Boeblingen) does not seem to care. It sits on my plate and gets pushed back by IBM's demands to integrate features in support of zfcp, zcrypt, etc.

Phillip wrote a smaller program which seems to reproduce it. I may be able to get to it on Oct 24, if I'm lucky and Ingolf has not found a new exciting feature to integrate (ha!).

See also bug 104823, bug 104084 (DO NOT DUP - PROTECTION BITS).

Requestor, please do not dump long logs into the comments box. Attach them instead.
Created attachment 95044 [details] Phil's test, modified for devel6
Robert, there's something you can do to help here. On the failing system, try to install by whatever means are necessary, then run the attached test. It needs to be modified with the actual IP addresses of the installation. See if you can get it to fail. The purpose of the exercise is to have a test case which does not require the installer environment, in case I come up with a test kernel. If you are unable to use this test (admittedly, quickly made up), then it's no problem, just mark the bug.
In this section of code:

#if 0 /* pknirsch */
    strcpy(dev.device, "ctc0");
    inet_aton("192.168.20.123", &dev.ip);
    inet_aton("192.168.20.9", &dev.ptpaddr);
    inet_aton("255.255.255.0", &dev.netmask);
    inet_aton("192.168.20.0", &dev.network);
    inet_aton("192.168.20.255", &dev.broadcast);
    inet_aton("192.168.20.9", &dev.gateway);
    dev.set = PUMP_INTFINFO_HAS_NETMASK | PUMP_INTFINFO_HAS_IP |
              PUMP_INTFINFO_HAS_NETWORK | PUMP_INTFINFO_HAS_BROADCAST |
              PUMP_NETINFO_HAS_GATEWAY | PUMP_INTFINFO_HAS_PTPADDR;
#else /* P3 for devel6 */
    strcpy(dev.device, "ctc0");
    inet_aton("192.168.5.118", &dev.ip);
    // inet_aton("192.168.20.6", &dev.ptpaddr); NOT USED BY pump-devel! True!
    inet_aton("255.255.255.255", &dev.netmask);
    inet_aton("192.168.20.0", &dev.broadcast); // Unconditionally used
    inet_aton("192.168.20.6", &dev.gateway);
    dev.set = PUMP_INTFINFO_HAS_NETMASK | PUMP_INTFINFO_HAS_IP |
              PUMP_NETINFO_HAS_GATEWAY;
#endif

It looks like you're always running the else block, so I'll change the IPs to match our setup here. If that's not what you're looking for, let me know. Otherwise I'll post results assuming I'm to run only the else block.
Also - Which package do I need to get pumpif.h ?
Please use <pump.h>.

[zaitcev@devel6 zaitcev]$ rpm -qf /usr/include/pump.h
pump-devel-0.8.19-1

Thanks for looking at it.
pump-devel is not included with the Taroon RC4 iso images. Is there somewhere I can download it?
Created attachment 95521 [details] pumpif.h (substitute for pump.h)
Created attachment 95522 [details] pumpif.c (substitute for libpump.a)
What's the incantation to get pumpif.c compiled as libpump.a and its symbols exported to the linker for nettest?
Created attachment 95523 [details] Makefile
Created attachment 95524 [details] Makefile (right one this time)
OK - nettest is working now... So what effect am I looking for? A set of iterations and delays that consistently break the connection?
The original report was:

ctc0: RX channel restart
ch-0c00: Busy !
ctc0: Timeout during TX init handshake
ctc0: TX channel restart

then, repeated "TX channel restart".

Phil's nettest imitates what the installer does. It is possible that repeated running of nettest clears problems, so it would only be apparent if there's a significant delay between re-runs, e.g.

while true; do
    sleep 20
    ./nettest
done

BTW, be prepared for a loss of connectivity due to the default route being dropped. It has to be restored if the ctc under test is the primary connectivity of the VM guest.
Robert, if you can reproduce the hang, please do this on the console:

cat /proc/net/ctc/ctc0/statistics

[Also I'm making "private" the erroneous comment by Bill to get it out of the way]
------ Additional Comments From billgo.com 2003-28-10 17:30 ------- subject--:Subject: [Bug 106516] LTC4814-Installation with CTC causes device resets.
Created attachment 95723 [details] ctc2 (passes unit & Phil's nettest)
What do you know, it appears I fixed it (see ctc2 attachment).
Modified in 2.4.21-4.11.EL. This fix is something which really needs to be discussed with IBM Germany / Boeblingen. They might have a better fix, or a rewrite.
What OS is running on the peer machine? If it is Linux as well, did you ever try to use protocol 1 (Linux<->Linux)? This uses a slightly safer initial handshake and also removes the restriction of only transferring IP packets.

-FE (Volker Tosta for Fritz Elfert)
IMHO, remote OS has nothing to do with the problem. The root cause is that the locking in the ctc is half absent, half broken. The driver as present on DW or in Marcelo tree is not SMP safe at all. It only works as long as ifconfig thread is out of the picture.
Sorry, but you're wrong. It doesn't have to do with SMP-awareness. Using your little test program I found the real error: halt_IO(), which is used at various locations for aborting IO operations, does not always succeed, and thus there are still interrupts delivered when the driver is not aware of that. This happens especially if the device is shut down and started quickly. In fact, this possibility is documented in PoP. In the current driver, however, halt_IO() returning a Busy condition is treated as a fatal error. I will come up with a more elaborate implementation of shutting down I/O properly ...

BTW: The remote OS makes a LOT of difference, because I've never seen any doc about the high-level handshake procedures of VM/TCP, z/OS or OS/390. These are confidential, and in order to make the ctc driver GPL'd, I took a clean-room approach and tried to reverse engineer the handshake. Therefore there may be some glitches or unexpected behavior which, of course, does not happen when using protocol 1 with Linux on both sides.

-FE
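To illustrate the point about a Busy return from halt_IO(), here is a minimal user-space sketch of treating Busy as retryable rather than fatal. It is not driver code: `struct fake_channel`, `try_halt()` and `halt_with_retry()` are hypothetical names made up for this example, the real halt_IO() signature is not shown in this bug, and a real driver would presumably wait for the pending interrupt or arm a timer between attempts instead of looping.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a channel; NOT the kernel's data structure. */
struct fake_channel {
    int busy_left;  /* how many more halt attempts will report busy */
    int halted;     /* set to 1 once a halt attempt is accepted */
};

/* Hypothetical stand-in for halt_IO(): 0 on success, -EBUSY while busy. */
static int try_halt(struct fake_channel *ch)
{
    if (ch->busy_left > 0) {
        ch->busy_left--;
        return -EBUSY;
    }
    ch->halted = 1;
    return 0;
}

/*
 * Retry the halt while the channel reports busy, up to max_tries attempts.
 * Returns 0 on success, or the last error (-EBUSY if it never succeeded).
 */
static int halt_with_retry(struct fake_channel *ch, int max_tries)
{
    int rc = -EBUSY;
    int i;

    for (i = 0; i < max_tries; i++) {
        rc = try_halt(ch);
        if (rc != -EBUSY)
            break;
        /* A real driver would wait here (interrupt or timer), not spin. */
    }
    return rc;
}
```

The bounded retry count is only there so the sketch cannot loop forever; what the right bound or wait mechanism would be in the ctc driver is not something this bug answers.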
I observe what Fritz described as well, but it's a different problem. When ctc races on SMP, all sorts of nasties happen. What broke our patience was that it started to BUG() in timer code due to spurious calls to init_timer() in the fsm.c:add_timer(). It just runs amok unlocked, and my patch fixes that.

When HaltIO fails, it looks like this:

ctc0: RX channel restart
ch-0150: Busy !
ctc0: TX channel restart
ch-0151: Busy !
ctc0: Timeout during TX init handshake
ctc0: TX channel restart <---- dead here
sh-2.05b# cat /proc/net/ctc/ctc0/statistics
Device FSM state: StartWait RX
RX channel FSM state: Stopped
TX channel FSM state: TX idle
......

The device will not leave RX_START_WAIT.

If Fritz comes with a patch to fix the HaltIO problem, it is going to be welcome. But it does not invalidate my assertion of locking problems in ctc in the least.
This was fixed in RHEL3 U2. Here is a copy of the Errata System message: "An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2004-188.html" Obviously, the latest release kernel should be used (RHSA-2005:472), and U6 is currently in beta and will be released soon (RHSA-2005:663).