Bug 246093 - [EMC 5.1 bug] CIFS mount to EMC NAS causes hang on file access : RFC1001 size 135 bigger than SMB for Mid=
Summary: [EMC 5.1 bug] CIFS mount to EMC NAS causes hang on file access : RFC1001 size...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.0
Hardware: All
OS: Linux
Priority: high
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Don Domingo
QA Contact: Martin Jenner
URL:
Whiteboard:
Depends On:
Blocks: 217104 222082
Reported: 2007-06-28 14:33 UTC by Jose Plans
Modified: 2018-10-19 19:51 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-07-27 01:13:05 UTC
Target Upstream Version:
Embargoed:


Attachments
patch -- request extended info on CIFS open operations (1.21 KB, patch)
2007-06-28 15:51 UTC, Jeff Layton
patch -- on a NTCreateX call, zero out the bcc (500 bytes, patch)
2007-07-02 15:41 UTC, Jeff Layton
patch -- on a NTCreateX call, zero out the bcc (500 bytes, patch)
2007-07-02 15:41 UTC, Jeff Layton
patch -- on a NTCreateX call, zero out the bcc (500 bytes, patch)
2007-07-02 15:41 UTC, Jeff Layton
patch -- on a NTCreateX call, zero out the bcc (500 bytes, patch)
2007-07-02 15:42 UTC, Jeff Layton
patch -- on a NTCreateX call, zero out the bcc (500 bytes, patch)
2007-07-02 15:42 UTC, Jeff Layton


Links
System ID Private Priority Status Summary Last Updated
Novell 247090 0 None None None Never

Description Jose Plans 2007-06-28 14:33:04 UTC
Description of problem:

After a CIFS mount, it is impossible to read or write any file on it. The
application that attempts the I/O hangs, and the following syslog messages
appear on the console in a loop:

--
kernel:  CIFS VFS: server not responding
kernel:  CIFS VFS: No response for cmd 162 mid 380
kernel:  CIFS VFS: RFC1001 size 135 bigger than SMB for Mid=384
--

The kernel version used is :
2.6.18-8.1.6.el5 686

The CIFS server used is :
EMC NAS model is EMC NS502G with firmware 5.5.19.

This issue has not been reproduced so far in other CIFS servers such as RHEL4
samba 3.0.10.


Additional info:

https://bugzilla.novell.com/show_bug.cgi?id=247090, a similar problem reported
on OpenSUSE. It seems to be a problem in EMC.

Comment 2 Jeff Layton 2007-06-28 15:51:16 UTC
Created attachment 158133 [details]
patch -- request extended info on CIFS open operations

This patch makes CIFS request extended information in open operations. Windows
apparently always requests this info with create operations. The EMC NAS
appliance assumes that all clients request it and never actually checks. So
when we get the response, there's more data in it than the client expects, and
the client complains about a possible buffer overrun.

The workaround here is to mimic what windows does and request this extra info,
even though we don't actually do anything with it. This means that these
packets will be slightly larger on the wire, but other than that there
shouldn't be any negative effects.

Comment 3 Jeff Layton 2007-06-28 17:47:35 UTC
Note that the above patch is unconfirmed, AFAIK. We need to have someone having
this problem test it. I'm building a set of test kernels now that contain this
patch.


Comment 4 Jeff Layton 2007-06-28 19:58:37 UTC
I've posted a set of RHEL5 test kernels with this patch on my people page:

http://people.redhat.com/jlayton/

...could you have the customer test these someplace non-critical and see if they
resolve the issue? I'd feel better about proposing this for 5.1 if we have some
confirmation that it fixes the issue.


Comment 5 Jeff Layton 2007-06-29 14:27:18 UTC
Got an email from customer that the peoplepage kernels didn't help. We'll be
working via the ticketing system from here on out though.

Comment 6 Jeff Layton 2007-06-29 14:30:31 UTC
This patch also doesn't seem to be quite correct. It does this:

pSMB->OpenFlags |= REQ_EXTENDED_INFO;

...but maybe it should be something like this?

pSMB->OpenFlags |= cpu_to_le32(REQ_EXTENDED_INFO);

...also I'm guessing that makes the client request extra info from the server,
but we probably need to do something else to make the client actually expect
this extra info in the reply. Though maybe this should be doing that and I'm
missing something.


Comment 7 Jeff Layton 2007-06-29 14:33:28 UTC
note that the endianness change above shouldn't matter if the customer's using
x86 or x86_64
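
To illustrate the endianness point from comments 6 and 7: SMB fields are
little-endian on the wire, so a flag OR'd into a wire-format field has to be
converted first. Below is a minimal userspace sketch, not the kernel code. It
assumes REQ_EXTENDED_INFO is 0x10 (the Create Flags bit quoted later in this
bug); to_le32() is a hypothetical stand-in for the kernel's cpu_to_le32(), and
the other function names are made up for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Assumed value: bit 0x10 of the NTCreateX Create Flags field. */
#define REQ_EXTENDED_INFO 0x00000010u

/* Hypothetical stand-in for cpu_to_le32(): returns a value whose in-memory
 * byte layout is little-endian regardless of the host byte order. */
static uint32_t to_le32(uint32_t host)
{
    uint8_t b[4] = {
        (uint8_t)(host & 0xff),
        (uint8_t)((host >> 8) & 0xff),
        (uint8_t)((host >> 16) & 0xff),
        (uint8_t)((host >> 24) & 0xff),
    };
    uint32_t wire;

    memcpy(&wire, b, sizeof(wire));
    return wire;
}

/* OR the flag into a field that is already stored in wire (LE) order. */
static uint32_t set_extended_info(uint32_t open_flags_le)
{
    return open_flags_le | to_le32(REQ_EXTENDED_INFO);
}

/* Return byte i of the field as it would appear on the wire. */
static uint8_t wire_byte(uint32_t field_le, int i)
{
    uint8_t b[4];

    memcpy(b, &field_le, sizeof(b));
    return b[i];
}
```

On little-endian hosts such as x86 and x86_64, to_le32() leaves the value
unchanged, which is why the unconverted assignment behaves the same there.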


Comment 8 Jeff Layton 2007-07-02 15:41:49 UTC
Created attachment 158340 [details]
patch -- on a NTCreateX call, zero out the bcc

Steve French clarified that his intent with that patch was to try to trick the
server into filling out the BCC field correctly. Apparently, it still doesn't.
I can only assume that Windows machines don't even look at the BCC in a
NTCreateX reply.

Here's a patch I've proposed in a private email to Steve. If I've gotten it right,
it should force the bcc to be 0 in a NTCreateX call and that should work around
this bug. I'm not sure if I've gotten this right and whether we want to make
this conditional on something, however, so awaiting his response.
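
For readers without access to the attachment, here is a rough sketch of the
idea only; it is not the attached patch. It assumes the client derives the
expected SMB length from the fixed 32-byte header, the WordCount byte, the
parameter words, the 2-byte BCC field and the BCC value itself, and it simply
ignores the server-supplied BCC for NTCreateX (command 0xA2, which is the
"cmd 162" in the log). The function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define SMB_HEADER_LEN         32   /* fixed SMB1 header size */
#define SMB_COM_NT_CREATE_ANDX 0xA2 /* decimal 162 */

/* Expected message length:
 * header + WordCount byte + parameter words + 2-byte BCC field + BCC bytes.
 * For NTCreateX replies the BCC reported by the server is treated as 0. */
static size_t smb_expected_len(uint8_t command, uint8_t word_count,
                               uint16_t bcc)
{
    if (command == SMB_COM_NT_CREATE_ANDX)
        bcc = 0; /* workaround idea: do not trust the server's BCC here */

    return SMB_HEADER_LEN + 1 + 2u * word_count + 2 + bcc;
}
```

Whether such an override should be unconditional or limited to particular
servers is exactly the open question raised above.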

Comment 9 Jeff Layton 2007-07-02 15:41:54 UTC
Created attachment 158341 [details]
patch -- on a NTCreateX call, zero out the bcc


Comment 10 Jeff Layton 2007-07-02 15:41:58 UTC
Created attachment 158342 [details]
patch -- on a NTCreateX call, zero out the bcc


Comment 11 Jeff Layton 2007-07-02 15:42:05 UTC
Created attachment 158343 [details]
patch -- on a NTCreateX call, zero out the bcc


Comment 12 Jeff Layton 2007-07-02 15:42:14 UTC
Created attachment 158344 [details]
patch -- on a NTCreateX call, zero out the bcc


Comment 13 Jeff Layton 2007-07-03 16:20:08 UTC
I've built a new series of test kernels that contain the patch in comment #12.
Please ask the customer to test these on a non-critical machine and let us know
if the problem goes away.

Note that this is truly just a test patch. A final patch (if any) will likely
look different.



Comment 14 Andrius Benokraitis 2007-07-05 21:24:21 UTC
Adding EMC...

Wayne, can you add the appropriate people in EMC for this issue?

Comment 15 Jeff Layton 2007-07-06 01:01:32 UTC
Some things EMC could help us with here:

1) clarify what hw/sw revs have this bug (for our support folks)
2) clarify what's being done to fix the bug (an ETA would be great)
3) test the patch in comment #12 (or the kernels on my people page), and let us
know if it works around the issue

...any help is appreciated!

Comment 16 Wayne Berthiaume 2007-07-06 17:19:18 UTC
Added Xiangping and Li...

Comment 17 Wayne Berthiaume 2007-07-09 18:18:46 UTC
The following fix was added to the EMC Celerra DART code 5.5.27.5 which should 
resolve this issue. The fix is to take into account the extended response bit 
from the NTCreateX request in Create Flag field to return or not an extended 
response. Before the fix, the extended response was returned based on the 
client type we had determined during the negotiation.
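
The behaviour change described here can be sketched as two predicates. This is
an illustration under assumptions, not DART code: the client_type enum and
both function names are hypothetical, and how the Celerra actually classifies
clients during negotiation is not stated in this bug. Only the 0x10 Create
Flags bit comes from the thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Extended Response bit in the NTCreateX Create Flags field. */
#define NT_CREATE_EXTENDED_RESPONSE 0x00000010u

/* Hypothetical classification made at negotiation time. */
enum client_type { CLIENT_OTHER, CLIENT_W2K };

/* Before the fix: decision based only on the negotiated client type. */
static bool extended_before_fix(enum client_type negotiated,
                                uint32_t create_flags)
{
    (void)create_flags; /* the request flag was not consulted */
    return negotiated == CLIENT_W2K;
}

/* After the fix: decision based only on the flag in the request. */
static bool extended_after_fix(enum client_type negotiated,
                               uint32_t create_flags)
{
    (void)negotiated;
    return (create_flags & NT_CREATE_EXTENDED_RESPONSE) != 0;
}
```

A client that never sets the bit receives the short response after the fix,
however it was classified during negotiation.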



Comment 18 Jeff Layton 2007-07-09 18:37:45 UTC
Excellent. Does the patch also correct the BCC value in the response? I believe
it's always supposed to be 0, but we were often seeing random values in that
field, leading us to believe that it might be uninitialized.



Comment 19 Issue Tracker 2007-07-10 04:13:20 UTC
This event sent from IssueTracker by rrajaram 
 issue 125277

Comment 20 Wayne Berthiaume 2007-07-10 17:01:09 UTC
The Bcc seemed 'random' because the client was expecting a short response 
(it seems the client was not taking into account the WordCount provided in the 
response). What the client read as the Bcc was actually part of the extended info.
With the fix, the client will receive a 'short' response, the WordCount is 
now as expected by the client, and the Bcc is correctly set to 0.
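
A small sketch of why a Bcc read at the wrong offset looks random. In an SMB1
message the 2-byte Bcc follows the parameter words, so its position depends on
the word count used to find it. The value 34 is the parameter word count of a
standard NT_CREATE_ANDX response; 50 is an assumed size for the extended
response, picked because 32 + 1 + 2*50 + 2 gives the 135 bytes reported in the
log. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define SMB_HEADER_LEN 32 /* fixed SMB1 header size */

/* WordCount byte as sent by the server; msg points at the SMB header. */
static uint8_t wire_word_count(const uint8_t *msg)
{
    return msg[SMB_HEADER_LEN];
}

/* Read the little-endian Bcc that follows word_count parameter words. */
static uint16_t read_bcc(const uint8_t *msg, uint8_t word_count)
{
    size_t off = SMB_HEADER_LEN + 1 + 2u * (size_t)word_count;

    return (uint16_t)(msg[off] | (msg[off + 1] << 8));
}
```

If the reply carries 50 parameter words but the Bcc is looked up as if there
were 34, read_bcc() returns two bytes from the middle of the extended info
rather than the real Bcc of 0.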


Comment 24 Jeff Layton 2007-07-11 15:02:22 UTC
Given that there is now a server-side patch, I'm going to propose that we close
this as NOTABUG and recommend that we consider a release note so that people
hitting this bug are directed to EMC to have their Celerra patched.


Comment 25 Jeff Layton 2007-07-11 15:13:45 UTC
First pass at release note text:

Some versions of EMC's Celerra product have a known bug in their handling of
CIFS NTCreateX calls. This bug is characterized by kernel messages similar to
the following, 

kernel:  CIFS VFS: server not responding
kernel:  CIFS VFS: No response for cmd 162 mid 380
kernel:  CIFS VFS: RFC1001 size 135 bigger than SMB for Mid=384

Programs doing I/O to the mountpoint will hang. Users experiencing this issue
should contact EMC support, and reference EMC case number [need this info].


Comment 26 Don Domingo 2007-07-11 23:19:39 UTC
Thanks Jeff. will add this note to the 5.1 release notes when EMC case number
has been supplied.

Comment 27 Jeff Layton 2007-07-12 12:05:47 UTC
Andrius, could you track down the EMC case ID for this problem for the release note?

Comment 28 Andrius Benokraitis 2007-07-12 13:26:11 UTC
Wayne, do you happen to have the EMC case ID for this problem?

Comment 29 Wayne Berthiaume 2007-07-12 14:36:52 UTC
Hi Andrius.

I have an internal Remedy case number at this time. I'm working on getting a 
PRIMUS case number assigned so CS will have something to reference.

Regards,
Wayne.

Comment 31 Don Domingo 2007-07-12 23:53:08 UTC
setting NEEDINFO=berthiaume_wayne

release note will be added when we have the case number. 

thanks!

Comment 32 Rob Quagliozzi 2007-07-20 08:29:23 UTC
Hi!

just thought I'd mention that I've just tested the i686 on Jeff's peoplepage, 
and it works well so far - all CIFS problems seem to have gone. I'm unable to 
upgrade our Celerra's Dart code, and this is a good stop-gap until we get a 
new NAS/SAN in two months.

Thanks!
Rob

Comment 33 Don Domingo 2007-07-22 23:16:18 UTC
Hi Wayne, do we have the case number yet?

please be advised that the deadline for release notes is August 1, 2007. thanks!

Comment 34 Wayne Berthiaume 2007-07-25 22:07:08 UTC
Hi Don.

   Truly a tough question for you. In order for the PRIMUS case to be 
accurate, is it possible to disclose which versions of RHEL this issue will be 
seen in? This BZ only references RHEL 5.0 but my understanding is it is seen in 
Fedora 6 as well.

Thanks and regards,
Wayne.

Comment 35 Don Domingo 2007-07-26 01:59:19 UTC
Hi Wayne,

the release notes i'm doing right now are specific to RHEL5.1 (i am not part of
the Fedora documentation team). anyhow, won't you be able to provide a case
number for the RHEL5.1 release notes with the understanding / assumption that
only RHEL is affected? 

if you need a complete list of affected systems, perhaps Jeff Layton can help us
out? as of now, setting NEEDINFO=wayne again. 

thanks!

Comment 36 Don Domingo 2007-07-26 03:23:14 UTC
Andrius has provided the EMC case number via email. below is the RHEL5.1 release
note for this issue, added under "Known Issues":

<quote>
Some versions of the EMC Celerra are unable to properly handle CIFS NTCreateX
calls. Programs performing an I/O to the mountpoint will hang, following these
kernel messages:

kernel:  CIFS VFS: server not responding
kernel:  CIFS VFS: No response for cmd 162 mid 380
kernel:  CIFS VFS: RFC1001 size 135 bigger than SMB for Mid=384

If you encounter this issue, please contact EMC and reference case number 19189788.
</quote>

please advise if any revisions are in order. thanks!

Comment 37 Wayne Berthiaume 2007-07-26 12:03:59 UTC
The EMC Primus case number is emc165978.

ID: emc165978
Domain: EMC1
Solution Class: 3.X Compatibility

Goal       After recent Linux upgrade, accessing CIFS shares on EMC NAS causes 
Linux CIFS client to hang or possibly panic

Fact       Product: Celerra

Fact       Protocol: Server Message Block (SMB)

Fact       Protocol: Common Internet File System (CIFS)

Fact       OS: SuSE Linux 10.2 (2.6.18 Kernel, 1.45 CIFS)

Fact       OS: Fedora Core 6 (2.6.20 Kernel, 1.47 CIFS)

Fact       EMC SW: NAS Code 5.5.26.x and below

Symptom    Accessing CIFS shares on EMC NAS causes Linux CIFS client to hang or 
possibly panic

Symptom    Celerra server log contains the following or similar error messages 
when a client attempts to read a file: 

2007-03-30 15:25:19: SMB: 3:  Client=10.0.0.1 OS='Linux version 2.6.20-1.2925.fc6', LM='CIFS VFS Client for Linux' not registered capa=0xd0dc (R=8/8)
2007-03-30 15:25:19: SMB: 3:  Client=10.0.0.1 OS=Linux version 2.6.20-1.2925.fc6 LM=CIFS VFS Client for Linux Extra=- type=- (1)
2007-03-30 15:25:19: SMB: 3:  Client=10.0.0.1 OS='Linux version 2.6.20-1.2925.fc6', LM='CIFS VFS Client for Linux' not registered capa=0xd0dc (R=8/8)
2007-03-30 15:25:19: SMB: 3:  Client=10.0.0.1 OS=Linux version 2.6.20-1.2925.fc6 LM=CIFS VFS Client for Linux Extra=- type=- (1)



Change     Recent upgrade to Linux client

Cause      EMC DART OS was modified to return an extended response to NTCreateX 
only if bit 0x10 is set in the Create Flags, instead of assuming that the 
extended info is required for a W2K client (see below).  With the code 
modification, the extended response is only returned when the Extended Response 
Create Flag is set. 

Affected versions include:

Fedora Core 6 (32-bit)
Kernel: 2.6.20-1.2925
CIFS Module Version: 1.47

SuSe Linux 10.2
Kernel: 2.6.18.2-34-default
CIFS Module Version: 1.45

SMB Header
 NT Create AndX Request
  Create Flags: 0x00000010
  .... .... .... .... .... .... ...1 .... = Extended Response: Extended responses required
  .... .... .... .... .... .... .... 0... = Create Directory: Target of open can be a file
  .... .... .... .... .... .... .... .0.. = Batch Oplock: Does NOT request batch oplock
  .... .... .... .... .... .... .... ..0. = Exclusive Oplock: Does NOT request oplock

Fix        Upgrade to NAS Code 5.5.27.5 or later

 

Comment 39 Jeff Layton 2007-07-26 13:37:59 UTC
As best I can tell, this will be an issue on all versions of RHEL4 and RHEL5,
though with versions before 4.5, the error will look a bit different. It'll
probably also be seen on any relatively recent version of fedora too (>=FC3).

Not sure about anything before that...


Comment 40 Jay Turner 2007-07-26 14:01:40 UTC
I'm not exactly sure what this bug is asking for now.  Appears this issue has
been identified as a server-side fix and we've prepared a release note pointing
affected users to EMC for a patch.  Sounds like there's nothing left for Red Hat
to do, save making sure the release note lands??

Comment 41 Wayne Berthiaume 2007-07-26 15:53:17 UTC
Yes, the bug is array side and fixed as noted in comment #37 with an upgrade to 
the Celerra DART code. All that is left for Red Hat is documentation in the 
event a customer looks to Red Hat for the solution. EMC has the bug documented 
in its PRIMUS case for use by both customers and CS to determine bug and fix.

Should be able to close this once Red Hat has documentation in place.

Comment 42 Andrius Benokraitis 2007-07-26 17:25:57 UTC
This issue will become a KBASE entry close to as follows:

===============================
Title: Why does the CIFS client on Red Hat Enterprise Linux versions 4 
or 5 hang when accessing shares on EMC NAS?

Some versions of EMC's Celerra product (NAS Code 5.5.26.x and below), 
have a known bug in their handling of CIFS NTCreateX calls. This issue 
is characterized by kernel messages similar to the following:

kernel:  CIFS VFS: server not responding
kernel:  CIFS VFS: No response for cmd 162 mid 380
kernel:  CIFS VFS: RFC1001 size 135 bigger than SMB for Mid=384

After a CIFS mount, it is impossible to read/write any file on it and 
the application that tries the I/O hangs.

To solve this, upgrade to NAS Code 5.5.27.5 or later. The EMC Primus 
case number is emc165978.
===============================

Comment 43 Andrius Benokraitis 2007-07-26 17:27:34 UTC
Don - I'll leave it to you if you think this needs a release note in addition to
the KBASE article Evan is creating. Please set to CLOSED/NOTABUG after your
decision to close this issue completely.

Comment 44 RHEL Program Management 2007-07-26 17:44:21 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 45 Don Domingo 2007-07-27 01:13:05 UTC
Andrius, i think it's best to have the note present in both Kbase and Release
Note. as such, adding the following under "Other Updates" (integrating it with a
note on CIFS update to 1.48aRH):

<quote>
Note that for users of the EMC Celerra product (NAS Code 5.5.26.x and below) the
CIFS client hangs when accessing shares on EMC NAS. This issue is characterized
by the following kernel messages:

kernel:  CIFS VFS: server not responding
kernel:  CIFS VFS: No response for cmd 162 mid 380
kernel:  CIFS VFS: RFC1001 size 135 bigger than SMB for Mid=384

After a CIFS mount, it becomes impossible to read/write any file on it and any
application that attempts an I/O on the mountpoint will hang. To resolve this
issue, upgrade to NAS Code 5.5.27.5 or later. The EMC Primus case number is
emc165978.
</quote>

closing this bug. thanks!

Comment 46 Evan McNabb 2007-08-06 18:28:23 UTC
The kbase submission has been accepted and is accessible at:

http://kbase.redhat.com/faq/FAQ_85_11046   (RHEL4 category)
http://kbase.redhat.com/faq/FAQ_103_11046  (RHEL5 category)

Comment 47 Don Domingo 2007-08-08 00:22:59 UTC
minor edit to release note:

<quote>
After a CIFS mount, it becomes impossible to read/write any file on it and any
application that attempts an I/O on the mountpoint will hang. To resolve this
issue, upgrade to NAS Code 5.5.27.5 or later (use EMC Primus case number emc165978).
</quote>

