Bug 998963
| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | CTDB: iozone gives read block error when run on SMB mount on Windows | | |
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | surabhi <sbhaloth> |
| Component: | samba | Assignee: | Ira Cooper <ira> |
| Status: | CLOSED EOL | QA Contact: | surabhi <sbhaloth> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 2.1 | CC: | lmohanty, poelstra, rjoseph, sdharane, ssaha, surs, vagarwal |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | ctdb | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-12-03 17:14:17 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 956495 | | |
| Attachments: | | | |
Description (surabhi, 2013-08-20 12:42:49 UTC)
Please post the Windows version and the protocol version negotiated on the wire.

In a simple test (Windows 7 client, Samba 3.6.3 server, EXT4 file system, SMB2 protocol negotiated) I ran the following sequence:

    # service smbd stop; sleep 2; service smbd start

That was sufficient to produce an error very similar to the one given above, even though the Windows client very quickly reconnected to the server (once it was running again).

The underlying problem is that once the TCP connection is broken, both the client and the server lose a lot of state information. In particular, they lose things like file handles. In order to restart the connection, the client has to re-establish the TCP connection, reopen all files, and re-apply for all locks before any other client takes those locks away. Since, on the client side, the application holds the file handles, it becomes the application's job to re-establish the connection. Few applications actually do this, and some will crash so badly that they bring down the whole Windows system. I've seen this happen (Blue Screen).

There are new features of SMB2 and SMB3 that allow the client and server to maintain more state information so that connection state can be re-established. These features (durable handles, resilient handles, and persistent handles) are not available in Samba 3.6.x. They are being developed for Samba 4.1.x and above.

So the behavior listed in this ticket is the expected behavior, even in a clustered environment. Samba 3.6.x with CTDB provides scale-out cluster support as well as coarse-grained failover. By coarse-grained, I mean what you see here: the behavior is the same as it would be if a single-node server crashed hard and came back up again quickly. Request that QE close this ticket with appropriate settings.

Tried the test on a Windows 7 client with max protocol = SMB2, samba version samba-3.6.9-159.1.el6rhs.x86_64, XFS/Samba. Tried it on a single node. With smb stop and start the test is failing.
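The stop/wait/start sequence above can be wrapped in a small script. This is a sketch under assumptions: the `smbd` service name and the 2-second window come from the comment above, and the `run` wrapper only prints each command (a dry run) so the sequence can be reviewed before executing it for real on a test server.

```shell
#!/bin/sh
# Dry-run sketch of the single-node restart test described above.
# 'run' echoes each step; replace the echo with "$@" to execute for real.
run() { echo "+ $*"; }

run service smbd stop    # drops all client TCP connections
run sleep 2              # long enough for the Windows client to notice
run service smbd start   # client reconnects, but open handles are gone
```

With durable-handle support absent in Samba 3.6.x, any iozone I/O in flight during the 2-second window is expected to fail even though the share comes back almost immediately.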
Removing the blocker keyword. Is there a test application that we can use to see the behavior where the client does the proper re-establishment of the connection after a reboot while using CTDB? I do agree that iozone does not do all these things, so the error is probably inevitable. Also, we should at least run the same test on a single node with Samba + XFS only and see what happens when the server is rebooted while live I/O is going on.

OK to not have this as a blocker.

To confirm live failover with CTDB I tried the tests below, and in both of them the I/O stopped:

1. In the CTDB setup, I mounted the volume on the Windows client with the virtual IP of node1 and started the iozone test. When I brought down the network of node1 (`ip link set dev eth0 down`), the iozone I/O failed.
2. In the second test, I shut down node1 while I/O was running, and iozone failed after the shutdown.

I tried iozone I/O on a single node with XFS + Samba. I rebooted the Samba server and, as expected, I/O failed. Because I/O fails for a single-node Samba server (in case of unavailability), we run CTDB (i.e. a clustered implementation of Samba servers) for node failover and IP takeover. We will try a simple application other than iozone to see whether it works with CTDB, and update.

See comment #4. We can bring down smbd, wait the 2 seconds for all smbd daemons to stop, and then restart smbd. When I did this against Samba + EXT4, iozone failed within the 2-second delay period, but by the time I had typed `dir<enter>`, the share was available again and the temporary iozone.tmp file was visible. A similar test should work against a clustered configuration, as long as we allow enough time (perhaps 4 seconds instead of 2) for the failover to occur:

1) Start the iozone test on the Windows client.
2) Shut down smbd on the connected server node.
3) Allow iozone to fail.
4) Use the dir command to see if the share is available.
5) Once the share is available again, delete iozone.tmp and restart iozone.
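The server-side part of those five steps might be scripted like this. It is a sketch: the `smbd` service name and the 4-second failover window are taken from the comment above, and the `run` wrapper is again a dry run that prints each step rather than executing it.

```shell
#!/bin/sh
# Dry-run sketch of the clustered variant of the restart test (steps 1-5
# above). 'run' echoes each step; swap the echo for "$@" to execute.
run() { echo "+ $*"; }

# Steps 2-3: stop smbd on the connected node and let iozone fail.
run service smbd stop
run sleep 4              # extra time for CTDB IP failover (assumption)
run service smbd start
# Steps 4-5 happen on the Windows client:
#   dir            - check that the share is reachable again
#   del iozone.tmp - then restart iozone
```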
Here are the scenarios we tried. I ran the iozone test in the same environment with an NFS mount and it worked fine: I executed iozone on the NFS mount with the virtual IP and powered off the node; iozone continued and the node takeover happened. After powering the node back on, iozone still continued.

These are some tests I executed to check whether CTDB failover and failback happen for a simple application (i.e. an application other than iozone). The application is a simple Perl script which creates files and folders on the SMB mount point.

1. Started the script on the mount point. Powered off the node whose virtual IP the share was mounted with; the script kept running continuously and creating files. It did not give any error, which suggests the file handle got migrated to the other node and CTDB failover works fine with a simple application.
2. In the second scenario, started the same script and rebooted the node; this time the script kept running fine. But once the node came back to healthy (in the output of `ctdb status`), the script threw the error:

    print() on closed filehandle FH at Win_CreateDirTreeNFiles.pl line 73. <------- with reboot

So it looks like failover happens but failback does not happen properly.

Please indicate the platform on which the Perl scripts are running. A packet capture would also help. Also note that a comparison between NFS and SMB is not particularly useful in this case, since NFSv3 is a stateless protocol and failover *should* work as described above. SMB is stateful, and what we are particularly losing in a failover or failback situation is state. ...and a CTDB failback is basically the same as a failover.

Chris, the script was running on a Windows 7 client machine. We will provide a packet trace once the setup is free. I will attach the script to this bug in case you want to take a look. In the comment above, the NFS result and the Perl script on the Win7 client are separate issues.
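The two node-failure scenarios, plus the `ctdb status` check, could be driven from the node like this. Interface name and ordering are assumptions for this setup, and the wrapper is once more a dry run that prints the commands for review.

```shell
#!/bin/sh
# Dry-run sketch of the failover/failback checks described above.
# 'run' echoes each step; replace the echo with "$@" to execute for real.
run() { echo "+ $*"; }

# Scenario 1: cut node1's network while the client writes via its virtual IP.
run ip link set dev eth0 down
# On a surviving node, confirm the virtual IP was taken over.
run ctdb status
# Failback: restore node1 and watch for the script's closed-filehandle error.
run ip link set dev eth0 up
run ctdb status
```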
I agree that NFSv3 is a stateless protocol and we should not compare it with SMB. As you said, "CTDB failback is basically the same as a failover", so I am worried about why the script does not fail when we take down the node but fails when the node comes back.

Created attachment 790970 [details]: Perl script used to generate I/O

Attached the Perl script which can be used to generate I/O.
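For the packet trace requested above, a minimal server-side capture during a failback might look like the following. The interface name and the plain port 445 filter are assumptions for this setup; the `run` wrapper keeps this a dry run.

```shell
#!/bin/sh
# Dry-run sketch: capture SMB traffic during a failback for later analysis.
# 'run' echoes the command; replace the echo with "$@" to execute as root.
run() { echo "+ $*"; }

# -s 0 captures full packets; -w writes a pcap readable in Wireshark.
run tcpdump -i eth0 -s 0 -w ctdb-failback.pcap port 445
```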
Since durable and persistent handle support is not available in Samba 3.6, and since SMB is a stateful protocol, the expected behavior is that a connection loss or loss of the server will cause an I/O failure even if the client reconnects to another server in the cluster. There are other BZs open against banning and failover issues in CTDB. The original bug reported in this BZ is actually expected behavior.

Thank you for submitting this issue for consideration in Red Hat Gluster Storage. The release you requested us to review is now End of Life. Please see https://access.redhat.com/support/policy/updates/rhs/

If you can reproduce this bug against a currently maintained version of Red Hat Gluster Storage, please feel free to file a new report against the current release.