Bug 84529 - Samba 2.2.5-10 appears to have stability problems
Summary: Samba 2.2.5-10 appears to have stability problems
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: samba
Version: 8.0
Hardware: All
OS: Linux
Target Milestone: ---
Assignee: Jay Fenlason
QA Contact: David Lawrence
Depends On:
Reported: 2003-02-18 15:05 UTC by Larry Troan
Modified: 2016-04-18 09:39 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Last Closed: 2003-03-24 17:44:15 UTC

Attachments
dag_rack_02_01.log.gz (277.94 KB, text/plain)
2003-02-21 21:59 UTC, Larry Troan
dag_rack_02_02.log.gz (256.89 KB, text/plain)
2003-02-21 22:00 UTC, Larry Troan
log.smbd.gz (330 bytes, text/plain)
2003-02-21 22:01 UTC, Larry Troan
smbd.log.gz (1.79 KB, text/plain)
2003-02-21 22:02 UTC, Larry Troan

Description Larry Troan 2003-02-18 15:05:28 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0rc1) Gecko/20020424

Description of problem:
This issue concerns clients dropping off when using the version of Samba
packaged with RH 8.0 (2.2.5-10). Stress testing over the weekend showed that
Dell TT DFCT42718 ("Linux network, clients time out under stress") still exists,
so the latest stable Samba release from samba.org (2.2.6) was installed to see
whether it would solve the problem. After the install, the same stress was run
against the server, and no clients had dropped off as of this afternoon, 3 full
days later.

Version-Release number of selected component (if applicable):

How reproducible:

ISSUE TRACKER 12161 opened by Dell as Sev 2. Bugzilla entered by hand because
data exceeded buffer limit.

Comment 1 Larry Troan 2003-02-18 15:09:09 UTC
FROM Issue Tracker...
NOTE: Pasted below are the details of DFCT42718 from Dell Team Track (may not be
completely up to date)

Find All Issues: Item ID 42718  Bhutani, Amit
12/10/2002 1:58:24 PM
Standard Fields
Issue Id:  DFCT42718  
Title:  Linux Network, clients time out under stress.  
Description:  Update 26 May 2002, Michael E Brown
Red Hat bug: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=59861, see
Action Plan below.

Update 26 May 2002, Michael E Brown
I am told that this has been filed under the Red Hat and Samba bug tracking
systems. Can whoever has these bug numbers please annotate this teamtrack with
that info, as well as send them to me?

Update 27 Feb 2002, Ken Bignell, See Notes.
*****UPDATE Ken Bignell, 25 Jan 2002.  *******
SNaC able to reproduce using only one PRO/1000F in slot 7 of a Slimmerlot.
Further testing to see if we can repro in slot other than 7.

Clients time out while performing ReadRite and 0.bat. Sessions are still
established but clients are not performing the required stress.

OS: Linux RedHat 7.2 (2.4
BIOS: X05/X04
ESM: X19
CPUs: 4x 1.6GHz
MEMORY: 8x 256MB Samsung DIMMs
7982 Disabled

PCI2: 39160 BIOS=2.57s2s, Driver=native Linux driver.
------------------sdb---/f   ..... shared as Samba drive f
------------------sdc---/g   .... shared as Samba drive g
------------------sdd---/h   .... shared as Samba drive h
------------------sde---/i   ..... shared as Samba drive i
PCI3: Pro1000F, driver=A00 (5.2)
PCI4: Pro1000F, driver=A00 (5.2)
PCI5: Pro1000F, driver=A00 (5.2)
PCI6: Pro1000F, driver=A00 (5.2)
PCI7: Pro1000F, driver=A00 (5.2)
PCI8: Pro1000F, driver=A00 (5.2)  
Operating System:  Linux   Phase Found:  Product Test  
Problem Area:  Operating System   Severity:  1  
Build Found:   
Build Fixed:  Samba-2.2.5-6 and above  
Assigned To:  Bhutani, Amit   Component:  (None)  
Operating System Version:   
Phase Introduced:  (None)   Platform Found:  Merlot  
Priority:  (None)   Test Blockage?:  No  
Version:  1.0  
Version/Build Found:   
User Fields
Issue Type:    Defect  
Notes: Close :   
Notes: Reject Reason:   
Reject Type:  (None)  
Notes: Resolved:  11/7/2002 2:02:56 PM - Bhutani, Amit:
Amit Bhutani 11/2/2002**********
According to Red Hat Bugzilla(#59861), this problem was root caused down to
Samba. This issue is fixed by upgrading Samba to 2.2.5-6 or a higher version.

RH 8.0 comes with a base samba version of 2.2.5-10, so this issue should not
exist on RH 8.0. (SNaC or whoever submitted the original issue needs to regress
and confirm this.)

As for RH 7.2, 7.3 and AS 2.1 customers I am attaching a tech sheet that will
basically instruct the customer to upgrade his samba packages.

Resolution:  Solution Found   PlatAff: ALL:  (Not Checked)  
PlatAff: Altima (PE 1500SC):  (Not Checked)   PlatAff: Ares:  (Not Checked)  
PlatAff: Ares BU:  (Not Checked)   PlatAff: Bayonet (PE 350):  (Not Checked)  
PlatAff: Avalon:  (Not Checked)   PlatAff: Beetle:  (Not Checked)  
PlatAff: Bordeaux:  (Not Checked)   PlatAff: Boxster (PE 2650):  (Not Checked)  
PlatAff: Cactus Jack:  (Not Checked)   PlatAff: Chameleon (PE 2300):  (Not
PlatAff: Civic:  (Not Checked)   PlatAff: Dagger (PE 1650):  (Not Checked)  
PlatAff: Diamond (PE 8450):  (Not Checked)   PlatAff: Discovery:  (Not Checked)  
PlatAff: Eagle (PE 6100):  (Not Checked)   PlatAff: Emerald (PE 6300):  (Not
PlatAff: Everglades:  (Not Checked)   PlatAff: Gecko (PE 1300):  (Not Checked)  
PlatAff: Gecko II (PE 1400):  (Not Checked)   PlatAff: GeckoSC (PE 1400SC): 
(Not Checked)  
PlatAff: Iguana (PE 2400):  (Not Checked)   PlatAff: Iguana FC (PE 2400ex): 
(Not Checked)  
PlatAff: Jaguar (PE 4500):  (Not Checked)   PlatAff: Lexus (PE 2500):  (Not
PlatAff: Merlot:  (Not Checked)   PlatAff: Napa (PE 6400):  (Not Checked)  
PlatAff: Newt (PE 300):  (Not Checked)   PlatAff: Opal (PE 4400):  (Not Checked)  
PlatAff: Onyx (PE 2200):  (Not Checked)   PlatAff: PA/PE Switchblade :  (Not
PlatAff: Raven/II (PE 4200):  (Not Checked)   PlatAff: Razor (PA110):  (Not
PlatAff: Razor II (PA110):  (Not Checked)   PlatAff: Redwood:  (Checked)  
PlatAff: Sabre (PA120/PE1550):  (Not Checked)   PlatAff: Sapphire (PE 4300): 
(Not Checked)  
PlatAff: Shredder HD:  (Not Checked)   PlatAff: Shredder HP:  (Not Checked)  
PlatAff: Slimerald (PE 6350):  (Not Checked)   PlatAff: SlimFast (PE 2450): 
(Not Checked)  
PlatAff: SlimMerlot:  (Not Checked)   PlatAff: SlimNapa (PE 6450):  (Not Checked)  
PlatAff: Slimphire (PE Slimphire):  (Not Checked)   PlatAff: Tupac:  (Not Checked)  
PlatAff: Viper (PE 2550):  (Not Checked)   PlatAff: WendyO (750N):  (Not Checked)  
PlatAff: WendyO (755N):  (Not Checked)   PlatAff: Yellowstone:  (Not Checked)  
Platforms Affected:  (None)   Requested Fix Date:   
Steps to Reproduce:   
Reproducability:  (None)  
Test Case Number:   
Vendor Issue #:   
Version Fixed:   
Action Plan In Place:  (Not Checked)  
Action Plan (Root Cause):  ###( 03/29/2002 -- Michael E Brown)###
Suggested action plan is to notify IPS of this issue and issue a PSQN. If
customers hit this issue we should advise them to downgrade to Samba version
2.0.9, which is the latest working version.

###( 03/26/2002 -- Michael E Brown)###
This issue has been entered into the Red Hat bug tracking system (bugzilla) as:

The Dell Linux team has been added to this bug so that we may track its
progress. At this time two things need to happen:

1) Need to give copies of the test tools that show this problem to the people
who can fix the bug, ie. attach the scripts to the bugzilla. The Samba folks do
not have test tools that show this bug. SNaC needs to do this.

2) Samba has (as of 3/24/2002) asked Intel to check out the latest version of
Samba to see if the error still exists. This needs to be followed up on. SNaC
can ask Intel the status on this.  
Action Plan Owner:  (None)   Action Plan Resolution Date:   
Estimated fix committed in:  (None)   SW Status::  N/A  
Debug State:  (None)  
Advanced Fields
Actual Time to Fix (Effort):     Est. Hours to Fix:  0  
Project Lead:  Locklear, David  
Project:  Red Hat Linux  
Release Notes:  Investigation by Extended Team  
Target Fix Date:   
Power Rating:  1a  
External Attachment:   
System Fields
Submit Date:  11/12/2001 12:46:44 PM   Submitter:  Jones, James P  
Active/Inactive:  Active   Assigned Date:  11/12/2001 12:59:37 PM  
Close Date:     Fixed Date:  11/7/2002 2:02:56 PM  
Last Modified Date:  11/7/2002 2:02:56 PM   Last Modifier:  Bhutani, Amit  
Last State Change Date:  11/7/2002 2:02:56 PM   Last State Changer:  Bhutani, Amit  
Owner:  Jones, James P   Re-Assign Date:  10/20/2002 1:00:13 AM  
Rejected Date:     Un-Assign Date:   
State:   Fixed  

Redwood added as affected platform  by Jones, James P (10/25/2002 1:32:13 PM)

-----Original Message-----
From: Kirkpatrick, Kimberly  
Sent: Friday, October 25, 2002 1:29 PM
To: Jones, James P
Subject: DFCT42718

Can you please add Redwood to platforms affected for this defect.

Kimberly Kirkpatrick
ESG Server Product Test
Dell Computer Corporation
Office:(512) 725-1652
Pager: (512) 907-9223

Issue duplicated on Redwood platform in product test - 10/03/2002, verified
10/24/2002  by King, Scott C (10/24/2002 11:23:53 AM)
This issue still exists - duplicated on Redwood and verified to be the same on
10/24/02 via Samba logs.  Redwood needs to be added to the platforms affected
by this issue.

Moving to Linux Project by agreement of Merlot Team and ST Team  by Locklear,
David (4/1/2002 1:21:33 PM)
See attached email

Multiple systems, multiple NICs  by Bignell, Ken (2/27/2002 12:03:20 PM)
Intel and SNaC have been successful at reproducing this issue on multiple PCI-X
platforms with different vendors' NICs.  This issue looks more and more like a
Linux issue.  Intel was unable to reproduce on a Unix (Solaris) system under the
same conditions.  SNaC has asked Intel to continue to investigate in the
direction of why the PRO/1000F fails so much sooner than other NICs, but a copy
of this issue is being assigned to Dell's Red Hat Linux team.

  by Amor, Mohammed (2/25/2002 10:10:28 AM)
Update from Intel:
Wednesday phone call ARs:
1.  Intel to run NFS clients.  If these run for ~72 hours, Dell will
consider this a Samba issue.
2.  Intel to look at origins of Samba error messages.  Intel will set up a
handler for signal isolation from Samba clients.
3.  Dell (Bignell) to review server and client logs for similarity of
failure messages.
4.  Dell will try to set up an automated Windows setup that will repeat the
'F5' command in a Windows environment.
5.  Dell will report on NFS share test currently running.

As of mid-day today, current status is:
1.  NFS share clients at Intel failed as fast or faster than Samba clients.
2.  Set in progress.  Intel is setting up a smaller test network outside the
lab to reproduce in a different environment, and will use this setup to
create the handler.
3.  No update.
4.  No update.
5.  No update.

 by Amor, Mohammed (2/13/2002 9:38:12 AM)

  by Amor, Mohammed (2/11/2002 3:07:33 PM)
Here is the latest update from Intel:

Below is an inventory of equipment where we've seen failures.  We've tried
just about all permutations.  

In summary, the issue appears to be independent of server, nic, switch, or
Samba s/w version.  Monday, we're going to dig deeper into diagnostic tools
(if available) on the Samba server s/w.

Dell 6650 w/ Serverworks PCI-X GCHE chipset
Dell 4400 PCI
IBM pre-production w/ Serverworks PCI-X GCLE chipset

Cisco 4006
Intel 480T

Intel PRO/1000 F  (Kodiak II)
Intel PRO/1000 XF (Eldridge A4)
Intel PRO/1000 XT (Barrow A4)
3Com 3C996B-T (BRCM 5701)

Samba Server software:
2.2.1a (ships w/ RH 7.2)
2.2.3a (latest from samba.org)
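
The inventory above can be expanded into a full test matrix. The following is a
minimal Python sketch and not part of Intel's update: the function name
build_test_matrix and the dictionary keys are assumptions made for
illustration, and the cross product lists every possible combination rather
than the combinations that were actually run.

```python
from itertools import product

# Equipment and software copied from the inventory in this update.
SERVERS = [
    "Dell 6650 w/ Serverworks PCI-X GCHE chipset",
    "Dell 4400 PCI",
    "IBM pre-production w/ Serverworks PCI-X GCLE chipset",
]
SWITCHES = ["Cisco 4006", "Intel 480T"]
NICS = [
    "Intel PRO/1000 F",
    "Intel PRO/1000 XF",
    "Intel PRO/1000 XT",
    "3Com 3C996B-T",
]
SAMBA_VERSIONS = ["2.2.1a", "2.2.3a"]


def build_test_matrix(servers, switches, nics, samba_versions):
    """Return one dict per server/switch/NIC/Samba combination (hypothetical helper)."""
    return [
        {"server": server, "switch": switch, "nic": nic, "samba": samba}
        for server, switch, nic, samba in product(
            servers, switches, nics, samba_versions
        )
    ]


matrix = build_test_matrix(SERVERS, SWITCHES, NICS, SAMBA_VERSIONS)
print(len(matrix))  # 3 servers * 2 switches * 4 NICs * 2 Samba versions = 48
```

Recording a pass/fail result against each entry would make it easier to confirm
that the failure really is independent of every variable.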

The client failures:
file open errors
file write errors
file read errors
access denied errors

Reproduced on a Jaguar (PE 4600) also.  by Bignell, Ken (1/30/2002 9:43:25 AM)
Reproduced on Jaguar with 1000F in slot 2, 16 100Mb clients running RR and WW. 
Took ~12 hours to fail.

Reduced scope of set up, still produced failure.  by Bignell, Ken (1/30/2002
9:41:23 AM)
The Slimmerlot issue was reproduced using a single network interface in slot 7.
The clients were running RR until they stopped communicating with the server
less than 24 hours into the test. Almost all of the clients lost their
connection to the server and reported "failure to open file". A few clients
continued running without ever reporting an error. All clients still had network
connectivity and could be reconnected to the server. None of the log files on
the server show any sign of errors. The adapter /proc files only show a few
Rx_FIFO and Rx_Missed errors. So far I have not been able to reproduce the
problem with a Pro1000XT.

Reduced failure to one NIC.  by Bignell, Ken (1/25/2002 12:29:35 PM)
*****UPDATE Ken Bignell, 25 Jan 2002.  *******
SNaC able to reproduce using only one PRO/1000F in slot 7 of a Slimmerlot.
Further testing to see if we can repro in slot other than 7.

ARP mask may be part of the issue. SNaC Investigating.  by Bignell, Ken
(1/22/2002 7:12:08 PM)
Thomas was able to reproduce the Linux client test time out issue on SlimMerlot
with only two PRO/1000F NICs active.  All 6 were installed, but only two active.
 He ran 4 or 5 clients against NIC 5 for a few hours, then started a couple of
clients on NIC 1.  After about an hour, client tests timed out on NIC 5.  We
took a look at the NICs and their ARP tables and Thomas noticed that some of the
clients on NIC 1 had the MAC for NIC 3 in their ARP table.  We tried again and
saw that all of the receiving was being done by NIC 3, and all of the transmit
was being done by NIC 1 and 5.  Linux has an ARP mask feature (present at least
as far back as the 2.2 kernel, so in Red Hat 7.0 and later) that allows any NIC
in a host connected to the same physical network to reply to an ARP request for
any IP on that host.  So if NIC 1 and 2 are connected to the same unsegmented
switch, NIC 2 could answer an ARP request for NIC 1's IP address with its own
MAC address.  This may or may not be the issue.  We need to know if PT's
test was set up with at least some of the NICs in the same physical network, I
will check with them tomorrow.  We are also trying a known failing config with
the ARP mask feature turned off to see if the issue goes away.  We will know
more tomorrow.  I will try to keep everyone updated on our progress.  
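
The ARP table check described above could be automated. This is a minimal
Python sketch under stated assumptions: the function name find_arp_mismatches,
the IP addresses, and the MAC addresses are all hypothetical, and the client
ARP entries are assumed to have been collected by hand into a dictionary.

```python
def find_arp_mismatches(server_nics, client_arp_table):
    """Compare a client's ARP entries with the server's real IP-to-MAC layout.

    server_nics:      dict mapping each server IP to the MAC of the NIC
                      configured with that IP.
    client_arp_table: dict mapping IP to the MAC the client has cached.
    Returns a list of (ip, expected_mac, seen_mac) for every server IP that
    the client resolved to a different NIC's MAC address.
    """
    mismatches = []
    for ip, seen_mac in sorted(client_arp_table.items()):
        expected_mac = server_nics.get(ip)
        if expected_mac is None:
            continue  # not one of the server's addresses
        if expected_mac.lower() != seen_mac.lower():
            mismatches.append((ip, expected_mac.lower(), seen_mac.lower()))
    return mismatches


# Hypothetical example: the client resolved NIC 1's IP to NIC 3's MAC.
server_nics = {
    "192.168.1.1": "00:02:b3:00:00:01",  # NIC 1 (made-up address)
    "192.168.1.3": "00:02:b3:00:00:03",  # NIC 3 (made-up address)
}
client_arp_table = {
    "192.168.1.1": "00:02:B3:00:00:03",
    "192.168.1.3": "00:02:b3:00:00:03",
}
for ip, expected, seen in find_arp_mismatches(server_nics, client_arp_table):
    print(f"{ip}: expected {expected}, client has {seen}")
```

A non-empty result on any client would match the symptom Thomas observed, where
one NIC ends up receiving traffic addressed to another.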

Reassign to Ken Bignell  by Locklear, David (1/15/2002 8:46:54 AM)
Ken has current action on this issue.  Recreating failure in SNaC lab.

Continuing on.  by Jones, James P (12/18/2001 4:38:24 PM)
Slot 1 Adview Card
Slot 2 39160
Slot 3 1000 (traffic) 16 clients
Slot 4 1000 (traffic) 16 clients
Slot 5 1000 (traffic) 16 clients
Slot 6 1000 (traffic) 16 clients
Slot 7 1000 (no traffic, idle) enabled
Slot 8 1000 (no traffic, idle) enabled

All clients running RR, 0.bat, both 2 and 10 mb.
1,2,5,6 on test menu.
While running in the configuration above, the system ran SMB traffic for 24
hours with no problems, errors or client faults or drop offs.
I then added traffic from 16 clients to each of slots 7 and 8; within an hour,
the system hard locked with no recovery.

  by Amor, Mohammed (12/18/2001 2:34:01 PM)
Ken Bignell tested this issue with a Jag and has not seen the failure.

Recreated Several more times.  by Jones, James P (12/14/2001 4:57:28 PM)
In an effort to narrow the issue down to a root cause, I have run several
variations of the configuration with slight differences.

Slot 1 Adview Card
Slot 2 39160
Slot 3 1000 (traffic) 16 clients
Slot 4 1000 (traffic) 16 clients
Slot 5 1000 (no traffic, idle) enabled
Slot 6 1000 (no traffic, idle) enabled
Slot 7 1000 (traffic) 16 clients
Slot 8 1000 (traffic) 16 clients
All clients running RR, 0.bat, both 2 and 10 mb.
1,2,5,6 on test menu.
While running in the configuration above, the system ran SMB traffic for 49
hours with no problems, errors or client faults or drop offs.
In the 50th hour, I added traffic from 16 clients to each of slots 5 and 6;
within an hour, the system hard locked with no recovery.  

Change History
11/12/2001 12:59:37 PM by Locklear, David
 Problem Area Changed From Operating System To NICs
 Last State Change Date Changed From 11/12/2001 12:46:44 PM To 11/12/2001
12:59:37 PM
 Owner Changed From Locklear, David To Bignell, Ken
 State Changed From Created To Assigned Via Transition: Assign
 Last State Changer Changed From Jones, James P To Locklear, David
 Last Modifier Changed From Jones, James P To Locklear, David
 Assigned To Changed From (None) To Bignell, Ken
 Last Modified Date Changed From 11/12/2001 12:46:44 PM To 11/12/2001 12:59:37 PM
 Assigned Date Changed From Unassigned To 11/12/2001 12:59:37 PM
12/5/2001 7:41:21 AM by Bignell, Ken
 Assigned To Changed From Bignell, Ken To Amor, Mohammed
 Last Modified Date Changed From 11/12/2001 12:59:37 PM To 12/5/2001 7:41:21 AM
 Last Modifier Changed From Locklear, David To Bignell, Ken


Comment 2 Jay Fenlason 2003-02-18 17:17:11 UTC
The latest Samba erratum for 8.0 is 2.2.7-2.  The latest released version from
samba.org is 2.2.7a (for production) or 2.2.8pre1 (for testing).  2.2.2 through
2.2.6 have a major remote-root security hole and should not be used.

Have they tried a more recent version of samba?
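
The version range named above can be checked mechanically. This is a minimal
Python sketch, assuming only the range stated in this comment (2.2.2 through
2.2.6); the function names are hypothetical, and the check looks at the
upstream version number only, so it knows nothing about fixes a distribution
may have backported into a package release such as 2.2.5-10.

```python
import re


def parse_samba_version(version):
    """Return (major, minor, patch) from strings such as '2.2.5-10' or '2.2.7a'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized Samba version: {version!r}")
    return tuple(int(part) for part in match.groups())


def in_reported_vulnerable_range(version):
    """True if the upstream version falls in the 2.2.2 to 2.2.6 range cited above."""
    return (2, 2, 2) <= parse_samba_version(version) <= (2, 2, 6)


for candidate in ["2.2.1a", "2.2.3a", "2.2.5-10", "2.2.6", "2.2.7-2", "2.2.7a"]:
    print(candidate, in_reported_vulnerable_range(candidate))
```

Suffixes such as "a", "-10", or "pre1" are ignored by design, since only the
three numeric components are compared.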

Comment 3 Larry Troan 2003-02-21 21:59:24 UTC
Created attachment 90263 [details]

Comment 4 Larry Troan 2003-02-21 22:00:11 UTC
Created attachment 90264 [details]

Comment 5 Larry Troan 2003-02-21 22:01:28 UTC
Created attachment 90265 [details]

Comment 6 Larry Troan 2003-02-21 22:02:07 UTC
Created attachment 90266 [details]

Comment 7 Larry Troan 2003-03-04 20:26:02 UTC
Event posted 02-24-2003 01:14pm by Bhutani with duration of 0.00
For 8.0 we have tried 2.2.5-10 and 2.2.7-2 several times in various
combinations of other variables such as hardware, infrastructure, etc. They have
all failed. Are we expected to try the non-Red Hat samba RPMs, i.e. the latest
from samba.org?
If yes, which one should we try: 2.2.7a or 2.2.8pre1?

Status set to: Waiting on Tech
Event posted 02-27-2003 02:40pm by Bhutani with duration of 0.00
Did anybody at Red Hat take a look at the samba error logs?

Any feedback is appreciated, as this issue is causing several platform
schedules to slip.
Event posted 02-28-2003 02:28pm by Bhutani with duration of 0.00
Since there is no response from RH's end, Dell is starting to test with 2.2.7a-5
(latest from rawhide)


Comment 8 Larry Troan 2003-03-18 20:51:42 UTC
Event posted 03-12-2003 08:29pm by Bhutani with duration of 0.00        
Root cause found.
- In one case it was a bad switch, which failed diags.
- In another case it was the "Unicast Port Storm Control Filter" setting, which
was enabled on one of the switches. Disabling that "feature" caused the failure
to go away.

Test being regressed on all failing configurations currently.

Status set to: Fix Pending (e.g., do not believe this is a Linux problem)
