Bug 106516

Summary: LTC4814-Installation with CTC causes device resets.
Product: Red Hat Enterprise Linux 3 Reporter: Robert J Brenneman <rjbrenn>
Component: kernel Assignee: Pete Zaitcev <zaitcev>
Status: CLOSED ERRATA QA Contact: Brian Brock <bbrock>
Severity: medium Docs Contact:
Priority: medium    
Version: 3.0 CC: elfert, petrides, tos
Target Milestone: ---   
Target Release: ---   
Hardware: s390   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2005-09-09 00:03:10 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description                              Flags
Phil's test, modified for devel6         none
pumpif.h (substitute for pump.h)         none
pumpif.c (substitute for libpump.a)      none
Makefile                                 none
Makefile (right one this time)           none
ctc2 (passes unit & Phil's nettest)      none

Description Robert J Brenneman 2003-10-07 22:05:32 UTC
The following has been reported by IBM LTC:
Installation with CTC causes device resets.
When installing using the CTC driver, the CTC driver loses its sync at the
point in the installation immediately following selection of the NFS install
method. The driver tries to recover, and it fails ~50% of the time, leaving a
hung installation.

Here is the output of the system console:

Linux version 2.4.21-3.EL (bhcompile.redhat.com) (gcc version 3.2.3 20030502 (Red Hat Linux 3.2.3-20)) #1 SMP Fri Sep 19 13:58:11 EDT 2003
                                                    
We are running under VM (31 bit mode)                                          
                                                    
This machine has no PFIX support                                               
                                                    
This machine has an IEEE fpu                                                   
                                                    
On node 0 totalpages: 65536                                                    
                                                    
zone(0): 65536 pages.                                                          
                                                    
zone(1): 0 pages.                                                              
                                                    
zone(2): 0 pages.                                                              
                                                    
Kernel command line: root=/dev/ram0 ro ip=off ramdisk_size=40000 DASD=710C,05FD CHANDEV=ctc0,0x0C00,0x0C01
                                                    
Highest subchannel number detected (hex) : 0012                                
                                                    
Calibrating delay loop...                                                      
                                                    
406.32 BogoMIPS                                                                
                                                    
Memory: 246720k/262144k available (2078k kernel code, 0k reserved, 547k data, 316k init)
Dentry cache hash table entries: 32768 (order: 6, 262144 bytes)                
                                                    
Inode cache hash table entries: 16384 (order: 5, 131072 bytes)                 
                                                    
Mount cache hash table entries: 512 (order: 0, 4096 bytes)                     
                                                    
Buffer cache hash table entries: 16384 (order: 4, 65536 bytes)                 
                                                    
Page-cache hash table entries: 65536 (order: 6, 262144 bytes)                  
                                                    
debug: Initialization complete                                                 
                                                    
POSIX conformance testing by UNIFIX                                            
                                                    
Detected 2 CPU's                                                               
                                                    
Boot cpu address  0                                                            
                                                    
cpu 0 phys_idx=0 vers=FF ident=040356 machine=7060 unused=0000                 
                                                    
cpu 1 phys_idx=1 vers=FF ident=040356 machine=7060 unused=0000                 
                                                    
Starting migration thread for cpu 0                                            
                                                    
Starting migration thread for cpu 1                                            
                                                    
init_mach : starting machine check handler                                     
                                                    
Linux NET4.0 for Linux 2.4                                                     
                                                    
Based upon Swansea University Computer Society NET3.039                        
                                                    
Initializing RT netlink socket                                                 
                                                    
mach_handler : ready                                                           
                                                    
mach_handler : waiting for wakeup                                              
                                                    
Starting kswapd                                                                
                                                    
VFS: Disk quotas vdquot_6.5.1                                                  
                                                    
aio_setup: num_physpages = 16384                                               
                                                    
aio_setup: sizeof(struct page) = 52                                            
                                                    
pty: 2048 Unix98 ptys configured                                               
                                                    
NET4: Frame Diverter 0.46                                                      
                                                    
RAMDISK driver initialized: 256 RAM disks of 40000K size 1024 blocksize        
                                                    
md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27                           
                                                    
md: Autodetecting RAID arrays.                                                 
                                                    
md: autorun ...                                                                
                                                    
md: ... autorun DONE.                                                          
                                                    
Initializing Cryptographic API                                                 
                                                    
NET4: Linux TCP/IP 1.0 for NET4.0                                              
                                                    
IP: routing cache hash table of 1024 buckets, 16Kbytes                         
                                                    
TCP: Hash tables configured (established 8192 bind 16384)                      
                                                    
Linux IP multicast router 0.06 plus PIM-SM                                     
                                                    
                                                                               
                                                    
Initializing IPsec netlink socket                                              
                                                    
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.                            
                                                    
RAMDISK: Compressed image found at block 0                                     
                                                    
Freeing initrd memory: 5262k freed                                             
                                                    
VFS: Mounted root (ext2 filesystem) readonly.                                  
                                                    
Freeing unused kernel memory: 79k freed                                        
                                                    
Starting the S/390 initrd to configure networking. Version is 1.01             
                                                    
Enter the FQDN of your new Linux guest (e.g. s390.redhat.com):                 
                                                    
rjbrenn.pdl.pok.ibm.com                                                        
                                                    
Enter which kind of network device do you intend to use                        
                                                    
 (e.g. ctc, escon, iucv, eth, hsi, tr):                                        
                                                    
ctc                                                                            
                                                    
Enter the IP address of your new Linux guest:                                  
                                                    
9.12.21.224                                                                    
                                                    
Enter the network address of the new Linux guest:                              
                                                    
9.12.21.0                                                                      
                                                    
Enter the IP of your ctc/escon/iucv point-to-point partner:                    
                                                    
172.16.21.224                                                                  
                                                    
CTC driver Version: 1.57 with CHANDEV support initialized                      
                                                    
divert: not allocating divert_blk for non-ethernet device ctc0                 
                                                    
ctc0: read: ch 0c00 (irq 000e), write: ch 0c01 (irq 000f) proto: 0             
                                                    
Enter your DNS server(s), separated by colons (:):                             
                                                    
ctc0: connected with remote side                                               
                                                    
9.12.18.2                                                                      
                                                    
Enter your DNS search domain(s) (if any), separated by colons (:):             
                                                    
pdl.pok.ibm.com                                                                
                                                    
ctc0      Link encap:Serial Line IP                                            
                                                    
          inet addr:9.12.21.224  P-t-P:172.16.21.224  Mask:255.255.255.255     
                                                    
          UP POINTOPOINT RUNNING NOARP  MTU:1500  Metric:1                     
                                                    
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0                   
                                                    
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0                 
                                                    
          collisions:0 txqueuelen:100                                          
                                                    
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)                               
                                                    
                                                                               
                                                    
lo        Link encap:Local Loopback                                            
                                                    
          inet addr:127.0.0.1  Mask:255.0.0.0                                  
                                                    
          UP LOOPBACK RUNNING  MTU:16436  Metric:1                             
                                                    
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0                   
                                                    
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0                 
                                                    
          collisions:0 txqueuelen:0                                            
                                                    
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)                               
                                                    
                                                                               
                                                    
Kernel IP routing table                                                        
                                                    
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface  
                                                    
127.0.0.1       0.0.0.0         255.255.255.255 UH    0      0        0 lo     
                                                    
172.16.21.224   0.0.0.0         255.255.255.255 UH    0      0        0 ctc0   
                                                    
0.0.0.0         172.16.21.224   0.0.0.0         UG    0      0        0 ctc0   
                                                    
Starting portmap.                                                              
                                                    
Journalled Block Device driver loaded                                          
                                                    
                                                                               
                                                    
Starting telnetd and sshd to allow login over the network.                     
                                                    
                                                                               
                                                    
Connect now to 9.12.21.224 to start the installation.                          
                                                    
sshd(pam_unix)[59]: session opened for user root by (uid=0)
                                                                        
* modules to insert vga16fb                                                    
                                                    
* module(s) vga16fb not found                                                  
                                                    
* load module set done                                                         
                                                    
* 254348 kB are available                                                      
                                                    
* modules to insert cramfs vfat sunrpc lockd nfs loop isofs floppy             
                                                    
* loaded cramfs from /modules/modules.cgz                                      
                                                    
* loaded sunrpc from /modules/modules.cgz                                      
                                                    
* loaded lockd from /modules/modules.cgz                                       
                                                    
* loaded nfs from /modules/modules.cgz                                         
                                                    
* loaded loop from /modules/modules.cgz                                        
                                                    
* module(s) vfat isofs floppy not found                                        
                                                    
* inserted /tmp/cramfs.o                                                       
                                                    
* inserted /tmp/sunrpc.o                                                       
                                                    
* inserted /tmp/lockd.o                                                        
                                                    
* inserted /tmp/nfs.o                                                          
                                                    
loop: loaded (max 8 devices)                                                   
                                                    
* inserted /tmp/loop.o                                                         
                                                    
* load module set done                                                         
                                                    
* modules to insert ide-cd                                                     
                                                    
* module(s) ide-cd not found                                                   
                                                    
* load module set done                                                         
                                                    
* modules to insert sd_mod sr_mod                                              
                                                    
* module(s) sd_mod sr_mod not found                                            
                                                    
* load module set done                                                         
                                                    
* modules to insert dasd_mod                                                   
                                                    
* loaded dasd_mod from /modules/modules.cgz                                    
                                                    
dasd: initializing...                                                          
                                                    
dasd: Registered successfully to major no 94                                   
                                                    
dasd: initialization finished                                                  
                                                    
* inserted /tmp/dasd_mod.o                                                     
                                                    
* load module set done                                                         
                                                    
* modules to insert dasd_diag_mod dasd_fba_mod dasd_eckd_mod                   
                                                    
* loaded dasd_diag_mod from /modules/modules.cgz                               
                                                    
* loaded dasd_fba_mod from /modules/modules.cgz                                
                                                    
* loaded dasd_eckd_mod from /modules/modules.cgz                               
                                                    
dasd(diag): DIAG discipline initializing                                       
                                                    
dasd: <devno: 710c> Add lowmem page :0f0d3000                                  
                                                    
dasd: <devno: 710c> Add lowmem page :0f0d2000                                  
                                                    
dasd: <devno: 710c> Free lowmem page :0f0d2000                                 
                                                    
dasd: <devno: 710c> Free lowmem page :0f0d3000                                 
                                                    
dasd: <devno: 05fd> Add lowmem page :0f0d3000                                  
                                                    
dasd: <devno: 05fd> Add lowmem page :0f0d2000                                  
                                                    
dasd: <devno: 05fd> Free lowmem page :0f0d2000                                 
                                                    
dasd: <devno: 05fd> Free lowmem page :0f0d3000                                 
                                                    
* inserted /tmp/dasd_diag_mod.o                                                
                                                    
dasd(fba): FBA  discipline initializing                                        
                                                    
dasd: <devno: 710c> Add lowmem page :0f0d2000                                  
                                                    
dasd: <devno: 710c> Add lowmem page :0f0ce000                                  
                                                    
dasd: <devno: 710c> Free lowmem page :0f0ce000                                 
                                                    
dasd: <devno: 710c> Free lowmem page :0f0d2000                                 
                                                    
dasd: <devno: 05fd> Add lowmem page :0f0d2000                                  
                                                    
dasd: <devno: 05fd> Add lowmem page :0f0ce000                                  
                                                    
dasd(fba): /dev/dasdb  ( 94:  4),05fd@0d: 9336/10(CU:6310/80) 128MB at(512 B/blk)
Partition check:                                                               
                                                    
 dasdb:CMS1/  SWAPLN:                                                          
                                                    
 dasdb1                                                                        
                                                    
dasd(fba): We are interested in: Dev 9336/00 @ CU 6310/00                      
                                                    
dasd(fba): We are interested in: Dev 3370/00 @ CU 3880/00                      
                                                    
* inserted /tmp/dasd_fba_mod.o                                                 
                                                    
dasd(eckd): ECKD discipline initializing                                       
                                                    
dasd: <devno: 710c> Add lowmem page :0f006000                                  
                                                    
dasd: <devno: 710c> Add lowmem page :0f004000                                  
                                                    
dasd(eckd): /dev/dasda  ( 94:  0),710c@01: 3390/0A(CU:3990/02) Cyl:3339 Head:15 Sec:224
dasd(eckd): /dev/dasda  ( 94:  0),710c@01: 3390/0A(CU:3990/02): Configuration data read
dasd: waiting for responses...                                                 
                                                    
dasd(eckd): /dev/dasda  ( 94:  0),710c@01: (4kB blks): 2404080kB at 48kB/trk compatible disk layout
 dasda:VOL1/  0X710C:                                                          
                                                    
dasd(eckd): We are interested in: CU 3880/00                                   
                                                    
dasd(eckd): We are interested in: CU 3990/00                                   
                                                    
dasd(eckd): We are interested in: CU 2105/00                                   
                                                    
dasd(eckd): We are interested in: CU 9343/00                                   
                                                    
* inserted /tmp/dasd_eckd_mod.o                                                
                                                    
* load module set done                                                         
                                                    
* no firewire controller found                                                 
                                                    
* no pcic controller found                                                     
                                                    
* probing buses                                                                
                                                    
* finished bus probing                                                         
                                                    
* found nothing                                                                
                                                    
* going to set language to en_US.UTF-8                                         
                                                    
* setting language to en_US.UTF-8                                              
                                                    
* need to set up networking                                                    
                                                    
* going to pick interface                                                      
                                                    
* going to do getNetConfig                                                     
                                                    
* doing kickstart... setting it up                                             
                                                    
ctc0: RX channel restart 
ch-0c00: Busy ! 
ctc0: Timeout during TX init handshake                                         
                                                    
ctc0: TX channel restart                                                       
                                                    
ctc0: Timeout during TX init handshake                                         
                                                    
ctc0: TX channel restart                                                       
                                                    
ctc0: Timeout during TX init handshake                                         
                                                    
ctc0: TX channel restart                                                       
                                                    
ctc0: Timeout during TX init handshake                                         
                                                    
ctc0: TX channel restart                                                       
                                                    
ctc0: Timeout during TX init handshake                                         
                                                    
ctc0: TX channel restart                                                       
                                                    
ctc0: Timeout during TX init handshake                                         
                                                    
ctc0: TX channel restart

This is more of a problem with hardware CTCs than with virtual CTCs. When doing
an install in an LPAR where hardware CTC is the only connection to the world,
the CTC driver does not appear to be able to recover itself. The other end of
the CTC has to be cycled off and on again, the Linux ctc interface needs to
be cycled down then up again, and the default route pointing to the CTC peer
needs to be re-added. All these actions have to happen before a timeout pops in the
installer. If the timeout pops, the install stops and does not appear to be
recoverable. The timeout appears to be on a reverse lookup of the NFS/FTP
install server address.
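
For reference, the Linux-side part of that manual recovery can be sketched roughly as below. This is only an illustration under assumptions: the interface name and peer address are the ones from the console log above, the run and recover_ctc helpers are made-up names, and the script only prints the commands unless DRY_RUN=0 is set. Cycling the other end of the CTC is not covered.

```shell
#!/bin/sh
# Sketch of the Linux-side recovery steps described above.
# IFACE and PEER are taken from the console log in this report; adjust them.
IFACE=${IFACE:-ctc0}
PEER=${PEER:-172.16.21.224}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands (default), 0 = execute them

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Cycle the interface down and up, then re-add the default route via the peer.
recover_ctc() {
    run ifconfig "$IFACE" down
    run ifconfig "$IFACE" up
    run route add default gw "$PEER" "$IFACE"
}

recover_ctc
```

With the defaults this just lists the three commands. Running it with DRY_RUN=0 from the installer shell would have to finish before the installer timeout mentioned above.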

Comment 1 Pete Zaitcev 2003-10-07 22:21:23 UTC
Sad, isn't it. It is a long standing problem; the upstream (that is,
IBM Boeblingen) does not seem to care. It sits on my plate and
gets pushed back by IBM's demands to integrate features in support
of zfcp, zcrypt, etc.

Phillip wrote a smaller program which seems to reproduce it.

I may be able to get to it on Oct 24, if I'm lucky and Ingolf
has not found a new exciting feature to integrate (ha!).

See also bug 104823, bug 104084 (DO NOT DUP - PROTECTION BITS).

Requestor, please do not dump long logs into comments box.
Attach them instead.

Comment 2 Pete Zaitcev 2003-10-09 02:43:10 UTC
Created attachment 95044 [details]
Phil's test, modified for devel6

Comment 3 Pete Zaitcev 2003-10-09 02:47:15 UTC
Robert, there's something you can do to help here. On the failing
system, try to install by whatever means are necessary, then
run the attached test. It needs to be modified with the actual IP
addresses of the installation. See if you can get it to fail.

The purpose of the exercise is to have a test case which does not
require the installer environment, in case I come up with a test kernel.

If you are unable to use this test (admittedly, quickly made up),
then it's no problem, just mark the bug.


Comment 4 Robert J Brenneman 2003-10-27 15:45:30 UTC
In this section of code:

#if 0 /* pknirsch */
    strcpy(dev.device, "ctc0");
    inet_aton("192.168.20.123", &dev.ip);
    inet_aton("192.168.20.9", &dev.ptpaddr);
    inet_aton("255.255.255.0", &dev.netmask);
    inet_aton("192.168.20.0", &dev.network);
    inet_aton("192.168.20.255", &dev.broadcast);
    inet_aton("192.168.20.9", &dev.gateway);
    dev.set = PUMP_INTFINFO_HAS_NETMASK | PUMP_INTFINFO_HAS_IP |
        PUMP_INTFINFO_HAS_NETWORK | PUMP_INTFINFO_HAS_BROADCAST |
        PUMP_NETINFO_HAS_GATEWAY | PUMP_INTFINFO_HAS_PTPADDR;
#else
    /* P3 for devel6 */
    strcpy(dev.device, "ctc0");
    inet_aton("192.168.5.118", &dev.ip);
    // inet_aton("192.168.20.6", &dev.ptpaddr); NOT USED BY pump-devel! True!
    inet_aton("255.255.255.255", &dev.netmask);
    inet_aton("192.168.20.0", &dev.broadcast); // Unconditionally used
    inet_aton("192.168.20.6", &dev.gateway);
    dev.set = PUMP_INTFINFO_HAS_NETMASK | PUMP_INTFINFO_HAS_IP |
        PUMP_NETINFO_HAS_GATEWAY;
#endif

It looks like you're always running the else block, so I'll change the IPs to
match our setup here. If that's not what you're looking for, let me know.
Otherwise I'll post results assuming I'm only to run the else block.


Comment 5 Robert J Brenneman 2003-10-27 15:49:39 UTC
Also - Which package do I need to get pumpif.h ?

Comment 6 Pete Zaitcev 2003-10-27 17:14:52 UTC
Please use <pump.h>.
[zaitcev@devel6 zaitcev]$ rpm -qf /usr/include/pump.h
pump-devel-0.8.19-1

Thanks for looking at it.


Comment 7 Robert J Brenneman 2003-10-27 17:38:47 UTC
pump-devel is not included with the Taroon RC4 iso images. Is there somewhere I
can download it?

Comment 8 Pete Zaitcev 2003-10-27 17:49:04 UTC
Created attachment 95521 [details]
pumpif.h (substitute for pump.h)

Comment 9 Pete Zaitcev 2003-10-27 17:50:06 UTC
Created attachment 95522 [details]
pumpif.c (substitute for libpump.a)

Comment 10 Robert J Brenneman 2003-10-27 18:01:38 UTC
What's the incantation to get pumpif.c compiled as libpump.a and its symbols
exported to the linker for nettest?

Comment 11 Pete Zaitcev 2003-10-27 18:04:34 UTC
Created attachment 95523 [details]
Makefile

Comment 12 Pete Zaitcev 2003-10-27 18:05:39 UTC
Created attachment 95524 [details]
Makefile (right one this time)

Comment 13 Robert J Brenneman 2003-10-27 18:28:01 UTC
OK - nettest is working now...

So what effect am I looking for? A set of iterations and delays that
consistently break the connection?

Comment 14 Pete Zaitcev 2003-10-27 20:01:45 UTC
The original report was:

ctc0: RX channel restart 
ch-0c00: Busy ! 
ctc0: Timeout during TX init handshake
ctc0: TX channel restart

then repeated "TX channel restart". Phil's nettest imitates
what the installer does.

It is possible that repeated running of nettest clears problems,
so it would only be apparent if there's a significant delay between
re-runs, e.g.

while true; do
  sleep 20
  ./nettest
done

BTW, be prepared for a loss of connectivity due to the default route
being dropped. It has to be restored if the ctc under test is
the primary connectivity of the VM guest.


Comment 16 Pete Zaitcev 2003-10-29 17:08:29 UTC
Robert, if you can reproduce the hang, please do this on the console:

cat /proc/net/ctc/ctc0/statistics

[Also, I'm marking the erroneous comment by Bill as "private" to get it out of the way]


Comment 17 Bill Goodrich 2003-11-03 20:01:28 UTC
------ Additional Comments From billgo.com  2003-28-10 17:30 -------
subject--:Subject: [Bug 106516] LTC4814-Installation with CTC causes device resets. 

Comment 29 Pete Zaitcev 2003-11-05 03:53:02 UTC
Created attachment 95723 [details]
ctc2 (passes unit & Phil's nettest)

Comment 30 Pete Zaitcev 2003-11-05 03:55:52 UTC
What do you know, it appears I fixed it (see ctc2 attachment).

Comment 31 Pete Zaitcev 2003-11-09 06:31:39 UTC
Modified in 2.4.21-4.11.EL.

This fix is something which really needs to be discussed with
IBM Germany / Boeblingen. They might have a better fix, or a rewrite.


Comment 32 Ingolf Salm 2003-11-12 17:15:57 UTC
What OS is running on the peer machine? If it is Linux as well, did 
you ever try to use protocol 1 (Linux<->Linux)?
This uses a slightly safer initial handshake and also removes the 
restriction of only transferring IP packets.

-FE

(Volker Tosta for Fritz Elfert)

Comment 33 Pete Zaitcev 2003-11-12 17:42:10 UTC
IMHO, the remote OS has nothing to do with the problem. The root cause
is that the locking in the ctc is half absent, half broken.
The driver as present on DW or in the Marcelo tree is not SMP safe at all.
It only works as long as the ifconfig thread is out of the picture.


Comment 34 Fritz Elfert 2003-11-13 19:20:32 UTC
Sorry, but you're wrong. It doesn't have to do with SMP-awareness.
Using your little test program I found the real error: halt_IO(),
used at various locations for aborting I/O operations, does not
always succeed, and thus interrupts are still delivered when the
driver is not expecting them. This happens especially if the device
is shut down and started quickly. In fact, this possibility is
documented in PoP. In the current driver, however, halt_IO()
returning a Busy condition is treated as a fatal error. I will come up
with a more elaborate implementation of shutting down I/O properly
...
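
To illustrate the idea of not treating a busy halt as fatal, here is a minimal
userspace sketch in C. It is not the driver's code: fake_halt() is a
hypothetical stand-in for the halt request (it is NOT the kernel's halt_IO()),
and struct fake_channel and halt_with_retry() are names made up for this
example. A real driver would presumably also have to wait between attempts or
wait for the pending interrupt, which is left out here.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a channel; not a kernel structure. */
struct fake_channel {
    int busy_left;  /* how many more halt attempts will report busy */
    int halted;     /* set to 1 once a halt attempt went through */
};

/* Hypothetical halt request: reports -EBUSY until busy_left reaches zero. */
static int fake_halt(struct fake_channel *ch)
{
    if (ch->busy_left > 0) {
        ch->busy_left--;
        return -EBUSY;
    }
    ch->halted = 1;
    return 0;
}

/* Retry a busy halt up to max_tries times instead of giving up on the
 * first -EBUSY. Returns 0 on success, -EBUSY if still busy afterwards. */
static int halt_with_retry(struct fake_channel *ch, int max_tries)
{
    int rc = -EBUSY;
    int i;

    assert(max_tries >= 0);
    for (i = 0; i < max_tries && rc == -EBUSY; i++)
        rc = fake_halt(ch);
    return rc;
}
```

With a channel that reports busy twice, halt_with_retry(&ch, 5) succeeds on the
third attempt. With a channel that stays busy longer than max_tries, the caller
still gets -EBUSY and can decide what to do next.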
 
BTW: 
The remote OS makes a LOT of difference, because I've never seen any
doc about the high-level handshake procedures of VM/TCP, z/OS or
OS/390. These are confidential, and in order to make the ctc driver
GPL'd, I took a clean-room approach and tried to reverse engineer the
handshake. Therefore there may be some glitches or unexpected
behavior which, of course, does not happen when using protocol 1
with Linux on both sides.
 
-FE 
 

Comment 35 Pete Zaitcev 2003-12-11 00:29:27 UTC
I observe what Fritz described as well; it's a different problem.

When ctc races SMP, all sorts of nasties happen. What broke our
patience was that it started to BUG() in timer code due to
spurious calls to init_timer() in the fsm.c:add_timer().
It just runs amok unlocked, and my patch fixes that.
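
As a rough userspace analogy of the timer problem (not the actual patch, and
not kernel code), the sketch below arms a timer only while holding a lock and
refuses to re-initialise one that is already pending. It assumes POSIX threads;
struct sketch_timer and the sketch_timer_*() functions are hypothetical names
invented for this illustration.

```c
#include <pthread.h>

/* Hypothetical userspace model of a timer; not the kernel's timer_list. */
struct sketch_timer {
    pthread_mutex_t lock;  /* serialises every change to the fields below */
    int pending;           /* 1 while the timer is armed */
    int init_count;        /* how often the timer was (re)initialised */
    int expires;           /* requested expiry, arbitrary units */
};

static void sketch_timer_setup(struct sketch_timer *t)
{
    pthread_mutex_init(&t->lock, NULL);
    t->pending = 0;
    t->init_count = 0;
    t->expires = 0;
}

/* Arm the timer under the lock. Returns 0 if armed, -1 if it was already
 * pending, in which case it is left untouched rather than re-initialised. */
static int sketch_timer_add(struct sketch_timer *t, int expires)
{
    int rc = -1;

    pthread_mutex_lock(&t->lock);
    if (!t->pending) {
        t->init_count++;      /* stands in for initialising the timer */
        t->expires = expires;
        t->pending = 1;
        rc = 0;
    }
    pthread_mutex_unlock(&t->lock);
    return rc;
}

/* Disarm the timer under the lock. Returns 1 if it was pending, else 0. */
static int sketch_timer_del(struct sketch_timer *t)
{
    int was_pending;

    pthread_mutex_lock(&t->lock);
    was_pending = t->pending;
    t->pending = 0;
    pthread_mutex_unlock(&t->lock);
    return was_pending;
}
```

The point of the sketch is only that the "is it pending?" check and the
initialisation happen under the same lock, so two threads cannot both
initialise the same timer.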

When HaltIO fails, it looks like this:

ctc0: RX channel restart
ch-0150: Busy !
ctc0: TX channel restart
ch-0151: Busy !
ctc0: Timeout during TX init handshake
ctc0: TX channel restart
<---- dead here

sh-2.05b# cat /proc/net/ctc/ctc0/statistics
Device FSM state: StartWait RX
RX channel FSM state: Stopped
TX channel FSM state: TX idle
......

The device will not leave RX_START_WAIT.
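
The dead state shown above could be checked for mechanically. A minimal sketch
in C, assuming the statistics text has already been read into a string:
ctc_looks_stuck() is a hypothetical helper, and treating this particular pair
of states as "dead" is an assumption based only on the output quoted here.

```c
#include <string.h>

/* Returns 1 if the statistics text shows the device FSM in "StartWait RX"
 * while the RX channel FSM is "Stopped", else 0. The field labels are
 * taken from the output quoted above. */
static int ctc_looks_stuck(const char *stats)
{
    return strstr(stats, "Device FSM state: StartWait RX") != NULL &&
           strstr(stats, "RX channel FSM state: Stopped") != NULL;
}
```

A test loop could call this after each nettest run and stop as soon as it
returns 1, so the hung state is preserved for inspection.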

If Fritz comes up with a patch to fix the HaltIO problem,
it is going to be welcome. But it does not invalidate
my assertion of locking problems in ctc in the least.


Comment 36 Ernie Petrides 2005-09-09 00:03:10 UTC
This was fixed in RHEL3 U2.  Here is a copy of the Errata System message:

 "An errata has been issued which should help the problem described in this bug
  report.  This report is therefore being closed with a resolution of ERRATA.
  For more information on the solution and/or where to find the updated files,
  please follow the link below. You may reopen this bug report if the solution
  does not work for you.

  http://rhn.redhat.com/errata/RHSA-2004-188.html"

Obviously, the latest release kernel should be used (RHSA-2005:472), and U6 is
currently in beta and will be released soon (RHSA-2005:663).