Bug 84529
Summary: | Samba 2.2.5-10 appears to have stability problems
---|---
Product: | [Retired] Red Hat Linux
Component: | samba
Version: | 8.0
Hardware: | All
OS: | Linux
Status: | CLOSED NOTABUG
Severity: | medium
Priority: | medium
Reporter: | Larry Troan <ltroan>
Assignee: | Jay Fenlason <fenlason>
QA Contact: | David Lawrence <dkl>
CC: | abartlet, ichute, jfeeney, mitr, tao
Doc Type: | Bug Fix
Last Closed: | 2003-03-24 17:44:15 UTC
Description
Larry Troan
2003-02-18 15:05:28 UTC
FROM Issue Tracker...

NOTE: Pasted below are the details of DFCT42718 from Dell Team Track (may not be completely updated).

Find All Issues: Item ID 42718
Bhutani, Amit 12/10/2002 1:58:24 PM
Now showing Issues 1 - 1 of 1

Standard Fields

Issue Id: DFCT42718
Title: Linux Network, clients time out under stress.
Description:

Update 26 May 2002, Michael E Brown
Red Hat bug: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=59861, see Action Plan below.

Update 26 May 2002, Michael E Brown
I am told that this has been filed under the Red Hat and Samba bug tracking systems. Can whoever has these bug numbers please annotate this teamtrack with that info, as well as send them to me?

Update 27 Feb 2002, Ken Bignell
See Notes.

*****UPDATE Ken Bignell, 25 Jan 2002.*******
SNaC able to reproduce using only one PRO/1000F in slot 7 of a Slimmerlot. Further testing to see if we can repro in slot other than 7.
***************************************************

Clients time out while performing ReadRite and 0.bat. Sessions are still established but clients are not performing required stress.

OS: Linux RedHat 7.2 (2.4
BIOS: X05/X04
ESM: X19
CPUs: 4x 1.6GHz
MEMORY: 8x 256MB Samsung DIMMs
2x POWER SUPPLIES
BROADCOM LOM DISABLED
BROADCOM LOM DISABLED
7982 Disabled

SLOTS
--------------
PCI1
PCI2: 39160 BIOS=2.57s2s, Driver=native Linux driver.
------------------sda---/
                     ---/boot
                     ---/swap
------------------sdb---/f ..... shared as Samba drive f
------------------sdc---/g ..... shared as Samba drive g
------------------sdd---/h ..... shared as Samba drive h
------------------sde---/i ..... shared as Samba drive i
PCI3: Pro1000F, driver=A00 (5.2)
PCI4: Pro1000F, driver=A00 (5.2)
PCI5: Pro1000F, driver=A00 (5.2)
PCI6: Pro1000F, driver=A00 (5.2)
PCI7: Pro1000F, driver=A00 (5.2)
PCI8: Pro1000F, driver=A00 (5.2)

Operating System: Linux
Phase Found: Product Test
Problem Area: Operating System
Severity: 1
Build Found:
Build Fixed: Samba-2.2.5-6 and above
Assigned To: Bhutani, Amit
Component: (None)
Operating System Version:
Phase Introduced: (None)
Platform Found: Merlot
Priority: (None)
Test Blockage?: No
Version: 1.0
Version/Build Found:

User Fields

Issue Type: Defect
Notes: Close:
Notes: Reject Reason:
Reject Type: (None)
Notes: Resolved:
11/7/2002 2:02:56 PM - Bhutani, Amit:
Amit Bhutani 11/2/2002**********
According to Red Hat Bugzilla (#59861), this problem was root caused down to Samba. This issue is fixed by upgrading Samba to 2.2.5-6 or a higher version. RH 8.0 comes with a base Samba version of 2.2.5-10, so this issue should not exist on RH 8.0. (SNaC or whoever submitted the original issue needs to regress and confirm this.) As for RH 7.2, 7.3 and AS 2.1 customers, I am attaching a tech sheet that will basically instruct the customer to upgrade his Samba packages.
***********************************************************************************

Resolution: Solution Found

PlatAff (Checked): Redwood
PlatAff (Not Checked): ALL; Altima (PE 1500SC); Ares; Ares BU; Bayonet (PE 350); Avalon; Beetle; Bordeaux; Boxster (PE 2650); Cactus Jack; Chameleon (PE 2300); Civic; Dagger (PE 1650); Diamond (PE 8450); Discovery; Eagle (PE 6100); Emerald (PE 6300); Everglades; Gecko (PE 1300); Gecko II (PE 1400); GeckoSC (PE 1400SC); Iguana (PE 2400); Iguana FC (PE 2400ex); Jaguar (PE 4500); Lexus (PE 2500); Merlot; Napa (PE 6400); Newt (PE 300); Opal (PE 4400); Onyx (PE 2200); PA/PE Switchblade; Raven/II (PE 4200); Razor (PA110); Razor II (PA110); Sabre (PA120/PE1550); Sapphire (PE 4300); Shredder HD; Shredder HP; Slimerald (PE 6350); SlimFast (PE 2450); SlimMerlot; SlimNapa (PE 6450); Slimphire (PE Slimphire); Tupac; Viper (PE 2550); WendyO (750N); WendyO (755N); Yellowstone
Platforms Affected: (None)
Requested Fix Date:
Steps to Reproduce:
Reproducibility: (None)
Test Case Number:
Vendor Issue #:
Version Fixed:
Workaround:
Action Plan In Place: (Not Checked)
Action Plan (Root Cause):

###( 03/29/2002 -- Michael E Brown)###
Suggested action plan is to notify IPS of this issue and issue a PSQN. If customers hit this issue we should advise them to downgrade to Samba version 2.0.9, which is the latest working version.

###( 03/26/2002 -- Michael E Brown)###
This issue has been entered into the Red Hat bug tracking system (bugzilla) as: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=59861
The Dell Linux team has been added to this bug so that we may track its progress. At this time two things need to happen:
1) Need to give copies of the test tools that show this problem to the people who can fix the bug, i.e. attach the scripts to the bugzilla. The Samba folks do not have test tools that show this bug. SNaC needs to do this.
2) Samba has (as of 3/24/2002) asked Intel to check out the latest version of Samba to see if the error still exists. This needs to be followed up on. SNaC can ask Intel the status on this.

Action Plan Owner: (None)
Action Plan Resolution Date:
Estimated fix committed in: (None)
SW Status:: N/A
Debug State: (None)

Advanced Fields

Actual Time to Fix (Effort):
Est. Hours to Fix: 0
Project Lead: Locklear, David
Project: Red Hat Linux
Release Notes: Investigation by Extended Team
Target Fix Date:
Power Rating: 1a
External Attachment:

System Fields

Submit Date: 11/12/2001 12:46:44 PM
Submitter: Jones, James P
Active/Inactive: Active
Assigned Date: 11/12/2001 12:59:37 PM
Close Date:
Fixed Date: 11/7/2002 2:02:56 PM
Last Modified Date: 11/7/2002 2:02:56 PM
Last Modifier: Bhutani, Amit
Last State Change Date: 11/7/2002 2:02:56 PM
Last State Changer: Bhutani, Amit
Owner: Jones, James P
Re-Assign Date: 10/20/2002 1:00:13 AM
Rejected Date:
Un-Assign Date:
State: Fixed

Notes

Redwood added as affected platform
by Jones, James P (10/25/2002 1:32:13 PM)
-----Original Message-----
From: Kirkpatrick, Kimberly
Sent: Friday, October 25, 2002 1:29 PM
To: Jones, James P
Subject: DFCT42718

James
Can you please add Redwood to platforms affected for this defect.
Thanks
Kimberly Kirkpatrick
ESG Server Product Test
Dell Computer Corporation
Office:(512) 725-1652 Pager: (512) 907-9223

Issue duplicated on Redwood platform in product test - 10/03/2002, verified 10/24/2002
by King, Scott C (10/24/2002 11:23:53 AM)
This issue still exists - duplicated on Redwood and verified to be the same on 10/24/02 via Samba logs. Redwood platform needs to be added to those affected by this issue.

Moving to Linux Project by agreement of Merlot Team and ST Team
by Locklear, David (4/1/2002 1:21:33 PM)
See attached email

Multiple systems, multiple NICs
by Bignell, Ken (2/27/2002 12:03:20 PM)
Intel and SNaC have been successful at reproducing this issue on multiple PCI-X platforms with different vendors' NICs. This issue looks more and more like a Linux issue. Intel was unable to reproduce on a Unix (Solaris) system under the same conditions. SNaC has asked Intel to continue to investigate why the PRO/1000F fails so much sooner than other NICs, but a copy of this issue is being assigned to Dell's Red Hat Linux team.
by Amor, Mohammed (2/25/2002 10:10:28 AM)
Update from Intel:
Wednesday phone call ARs:
1. Intel to run NFS clients. If these run for ~72 hours, Dell will consider this a Samba issue.
2. Intel to look at origins of Samba error messages. Intel will set up a handler for signal isolation from Samba clients.
3. Dell (Bignell) to review server and client logs for similarity of failure messages.
4. Dell will try to set up an automated Windows setup that will repeat the 'F5' command in a Windows environment.
5. Dell will report on NFS share test currently running.
As of mid-day today, current status is:
1. NFS share clients at Intel failed as fast or faster than Samba clients.
2. Set in progress. Intel is setting up a smaller test network outside the lab to reproduce in a different environment, and will use this setup to create the handler.
3. No update.
4. No update.
5. No update.

by Amor, Mohammed (2/13/2002 9:38:12 AM)
PLEASE SEE ATTACHED FILE FOR TODAY'S UPDATE FROM INTEL.

by Amor, Mohammed (2/11/2002 3:07:33 PM)
Here is the latest update from Intel:
Below is an inventory of equipment where we've seen failures. We've tried just about all permutations. In summary, the issue appears to be independent of server, nic, switch, or Samba s/w version. Monday, we're going to dig deeper into diagnostic tools (if available) on the Samba server s/w.
Servers:
Dell 6650 w/ Serverworks PCI-X GCHE chipset
Dell 4400 PCI
IBM pre-production w/ Serverworks PCI-X GCLE chipset
Switches:
Cisco 4006
Intel 480T
Nics:
Intel PRO/1000 F (Kodiak II)
Intel PRO/1000 XF (Eldridge A4)
Intel PRO/1000 XT (Barrow A4)
3Com 3C996B-T (BRCM 5701)
Samba Server software:
2.2.1a (ships w/ RH 7.2)
2.2.3a (latest from samba.org)
The client failures:
file open errors
file write errors
file read errors
access denied errors

Reproduced on a Jaguar (PE 4600) also.
by Bignell, Ken (1/30/2002 9:43:25 AM)
Reproduced on Jaguar with 1000F in slot 2, 16 100Mb clients running RR and WW. Took ~12 hours to fail.
Reduced scope of set up, still produced failure.
by Bignell, Ken (1/30/2002 9:41:23 AM)
The Slimmerlot issue was reproduced using a single network interface in slot 7. The clients were running RR until they stopped communicating with the server less than 24 hours into the test. Almost all of the clients lost their connection to the server and reported "failure to open file". A few clients continued running without ever reporting an error. All clients still had network connectivity and could be reconnected to the server. None of the log files on the server show any sign of errors. The adapter /proc files only show a few Rx_FIFO and Rx_Missed errors. So far I have not been able to reproduce the problem with a Pro1000XT.

Reduced failure to one NIC.
by Bignell, Ken (1/25/2002 12:29:35 PM)
*****UPDATE Ken Bignell, 25 Jan 2002.*******
SNaC able to reproduce using only one PRO/1000F in slot 7 of a Slimmerlot. Further testing to see if we can repro in slot other than 7.
***************************************************

ARP mask may be part of the issue. SNaC Investigating.
by Bignell, Ken (1/22/2002 7:12:08 PM)
Thomas was able to reproduce the Linux client test time out issue on SlimMerlot with only two PRO/1000F NICs active. All 6 were installed, but only two active. He ran 4 or 5 clients against NIC 5 for a few hours, then started a couple of clients on NIC 1. After about an hour, client tests timed out on NIC 5. We took a look at the NICs and their ARP tables and Thomas noticed that some of the clients on NIC 1 had the MAC for NIC 3 in their ARP table. We tried again and saw that all of the receiving was being done by NIC 3, and all of the transmit was being done by NIC 1 and 5.
Linux has a feature (at least as far back as the 2.2 kernel, so it is in Red Hat 7.0 and later), an ARP mask, that allows any NIC in a host connected to the same physical network to reply to any ARP for any IP on that host. So if NIC 1 and 2 are connected to the same switch that is not segmented, NIC 2 could reply to an ARP on the MAC address of NIC 1. This may or may not be the issue. We need to know if PT's test was set up with at least some of the NICs in the same physical network; I will check with them tomorrow. We are also trying a known failing config with the ARP mask feature turned off to see if the issue goes away. We will know more tomorrow. I will try to keep everyone updated on our progress.

Reassign to Ken Bignell
by Locklear, David (1/15/2002 8:46:54 AM)
Ken has current action on this issue.

Recreating failure in SNaC lab. Continuing on.
by Jones, James P (12/18/2001 4:38:24 PM)
Slot 1 Adview Card
Slot 2 39160
Slot 3 1000 (traffic) 16 clients
Slot 4 1000 (traffic) 16 clients
Slot 5 1000 (traffic) 16 clients
Slot 6 1000 (traffic) 16 clients
Slot 7 1000 (no traffic, idle) enabled
Slot 8 1000 (no traffic, idle) enabled
All clients running RR, 0.bat, both 2 and 10 mb. 1,2,5,6 on test menu.
While running in the configuration above, the system ran SMB traffic for 24 hours with no problems, errors or client faults or drop offs. I added 16-client traffic to slots 7 and 8 each; within an hour, the system hard locked with no recovery.

by Amor, Mohammed (12/18/2001 2:34:01 PM)
Ken Bignell tested this issue with a Jag and has not seen the failure.

Recreated several more times.
by Jones, James P (12/14/2001 4:57:28 PM)
In efforts to narrow the issue down to a root cause, I have run several variations of the configuration with slight differences.
Slot 1 Adview Card
Slot 2 39160
Slot 3 1000 (traffic) 16 clients
Slot 4 1000 (traffic) 16 clients
Slot 5 1000 (no traffic, idle) enabled
Slot 6 1000 (no traffic, idle) enabled
Slot 7 1000 (traffic) 16 clients
Slot 8 1000 (traffic) 16 clients
All clients running RR, 0.bat, both 2 and 10 mb. 1,2,5,6 on test menu.
While running in the configuration above, the system ran SMB traffic for 49 hours with no problems, errors or client faults or drop offs. On the 50th hour, I added 16-client traffic to slots 5 and 6 each; within an hour, the system hard locked with no recovery.

Change History

11/12/2001 12:59:37 PM by Locklear, David
Problem Area Changed From Operating System To NICs
Last State Change Date Changed From 11/12/2001 12:46:44 PM To 11/12/2001 12:59:37 PM
Owner Changed From Locklear, David To Bignell, Ken
State Changed From Created To Assigned Via Transition: Assign
Last State Changer Changed From Jones, James P To Locklear, David
Last Modifier Changed From Jones, James P To Locklear, David
Assigned To Changed From (None) To Bignell, Ken
Last Modified Date Changed From 11/12/2001 12:46:44 PM To 11/12/2001 12:59:37 PM
Assigned Date Changed From Unassigned To 11/12/2001 12:59:37 PM

12/5/2001 7:41:21 AM by Bignell, Ken
Assigned To Changed From Bignell, Ken To Amor, Mohammed
Last Modified Date Changed From 11/12/2001 12:59:37 PM To 12/5/2001 7:41:21 AM
Last Modifier Changed From Locklear, David To Bignell, Ken
--------------------------------------------------------------------------------

The latest Samba erratum for 8.0 is 2.2.7-2. The latest released version from samba.org is 2.2.7a (for production) or 2.2.8pre1 (for testing). 2.2.2 through 2.2.6 have a major remote-root security hole and should not be used. Have they tried a more recent version of Samba?

Created attachment 90263 [details]
dag_rack_02_01.log.gz
Created attachment 90264 [details]
dag_rack_02_02.log.gz
Created attachment 90265 [details]
log.smbd.gz
Created attachment 90266 [details]
smbd.log.gz
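For anyone triaging the version numbers mentioned in this report, the range check from the comment above (2.2.2 through 2.2.6 should not be used) can be sketched as follows. This is only an illustration: the function names are made up for this example, and it compares the upstream version number alone. It ignores suffixes such as "a" or "pre1" and the RPM release, so it cannot tell whether a vendor package carries a backported fix.

```python
import re
from typing import Tuple

# Range quoted in the comment above: upstream 2.2.2 through 2.2.6.
AFFECTED_MIN = (2, 2, 2)
AFFECTED_MAX = (2, 2, 6)


def upstream_triple(version: str) -> Tuple[int, int, int]:
    """Return the leading major.minor.patch numbers of a version string.

    Suffixes such as "a", "pre1" or an RPM release ("-10") are ignored.
    (Hypothetical helper, not part of Samba or RPM tooling.)
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognised Samba version: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def in_affected_range(version: str) -> bool:
    """True if the upstream version lies in the 2.2.2 to 2.2.6 range."""
    return AFFECTED_MIN <= upstream_triple(version) <= AFFECTED_MAX


if __name__ == "__main__":
    # Version strings that appear in this report.
    for v in ("2.0.9", "2.2.1a", "2.2.3a", "2.2.5-10",
              "2.2.7-2", "2.2.7a", "2.2.8pre1"):
        print(v, "in range" if in_affected_range(v) else "outside range")
```

Tuple comparison keeps the check short and avoids string comparison pitfalls (for example "2.2.10" sorting before "2.2.6" as text).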
FROM ISSUE TRACKER
-----------------------
Event posted 02-24-2003 01:14pm by Bhutani with duration of 0.00
For 8.0 we have tried 2.2.5-10 AND 2.2.7-2 several times in various combinations of other variables like hardware, infrastructure etc. They have all failed. Are we expected to try the non-Red Hat Samba RPMs, i.e. the latest from samba.org? If YES, then which one should we try - 2.2.7a or 2.2.8pre1?
Status set to: Waiting on Tech
-----------------------
Event posted 02-27-2003 02:40pm by Bhutani with duration of 0.00
Did anybody at Red Hat take a look at the Samba error logs? Any feedback is appreciated, as this issue is causing several platform schedules to slip.
-----------------------
Event posted 02-28-2003 02:28pm by Bhutani with duration of 0.00
Since there is no response from RH's end, Dell is starting to test with 2.2.7a-5 (latest from rawhide).

FROM ISSUE TRACKER
Event posted 03-12-2003 08:29pm by Bhutani with duration of 0.00
Root cause found.
- In one case it was a bad switch. Failed diags.
- In another case the "Unicast Port Storm Control Filter" setting was enabled on one of the switches. Disabling that "feature" caused the failure to go away.
Test being regressed on all failing configurations currently.
Status set to: Fix Pending (e.g., do not believe this is a Linux problem)