Bug 902684
| Summary: | Crash seen on ssl_setup_connection() |
|---|---|
| Product: | [Community] GlusterFS |
| Component: | transport |
| Status: | CLOSED CURRENTRELEASE |
| Severity: | high |
| Priority: | unspecified |
| Version: | mainline |
| Target Milestone: | --- |
| Target Release: | --- |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Fixed In Version: | glusterfs-3.4.0 |
| Doc Type: | Bug Fix |
| Reporter: | Avra Sengupta <asengupt> |
| Assignee: | Jeff Darcy <jdarcy> |
| CC: | gluster-bugs, kparthas |
| Keywords: | Reopened |
| Type: | Bug |
| Last Closed: | 2013-07-24 17:47:24 UTC |
Description
Avra Sengupta
2013-01-22 09:19:17 UTC
What is the nature of this alleged race? Which code paths are involved? Bug 873367 was a race, fixed by http://review.gluster.org/#change,4158 a couple of months ago. How does http://review.gluster.org/#change,4118 break it again? The stack trace shows that we haven't even gotten to any glusterd code yet; we're still in the transport code for accepting a connection. The way we crash seems to indicate that we have a bad pointer to an SSL structure, and that it was good when we spawned the thread. How did it become bad? Have the 4118 changes caused us to tear down a connection before we're even finished setting it up? Or maybe corrupted memory somehow? FWIW, I just ran bug-873367.t on my systems with current master plus the 4118 patch ten times with no failures of any sort. That suggests that maybe it is a race condition, but its nature is still unclear.

After debugging this for a while, I realized something: we should never be in any of this code in glusterd. The options that control both SSL and private-thread usage only affect the server and client volfiles - never glusterd, NFS, etc. If use_ssl and own_thread are set for a glusterd socket transport, that means something else is amiss. There should be two INFO messages in the glusterd log saying whether these features are enabled. Avra, can you please check a failed run and report what those messages say the state was when glusterd started?

Hi Jeff, I am extremely sorry, as it was a miss on my part. The core was not generated by glusterd, but by glusterfsd. KP posted another patchset in http://review.gluster.org/#change,4118, after which I don't observe the crash at all. I am closing this bug for now, and will re-open it if I encounter this issue again. Thanks.

Regards,
Avra

Hi Jeff, I am re-opening this bug, as I still see the crash whenever I run bug-873367.t. I isolated the point in bug-873367.t that actually causes the crash. After the last umount (Test 14), we stop the volume in Test 15.
When the volume stop is executed, socket_disconnect is invoked. The ssl_ssl in this call is corrupted and is causing the crash. I am attaching the gdb logs from the core. Please let me know if any other info is required.

Regards,
Avra

Created attachment 689771 [details]
GDB logs
The key here seems to be the fact that ssl_enabled=true and use_ssl=false, which is a combination that only happens for client connections when we're connecting to a glusterd portmapper. Sure enough, elsewhere in the gdb log we see that the translator name is patchy-client-0 and client functions are in the backtraces. Relatedly, the fact that we have a non-NULL value for ssl_ssl implies that use_ssl must have been true at some time in the past, so the fact that it's false now implies that the socket was reconnected since then. Indeed, we can see that socket_gen=5, which indicates multiple disconnect/reconnect cycles. I added some code to print out these values during socket transitions, and the only place I see socket_gen=5 is in the NFS daemon. Here's the full sequence of operations on that socket:

* Connected with ssl_enabled=true and use_ssl=false (portmapper connection)
* Disconnected
* Connected with ssl_enabled=true and use_ssl=true (brick connection)
* Disconnected
* Connected with ssl_enabled=true and use_ssl=false

Another portmapper connection? Why? And why does it cause the test to fail (for you but not for me or during regression tests) when it doesn't even use NFS? I don't know the answers to those questions, but I do know that we can prevent that last connection from blowing up on the invalid ssl_ssl pointer when it's disconnected again. That's what http://review.gluster.com/#change,4449 does, but I'm concerned that we might just be masking a problem in the higher-level disconnect/reconnect logic.

CHANGE: http://review.gluster.org/4449 (socket: null out priv->ssl_ssl on disconnect) merged in master by Anand Avati (avati)

> That's what http://review.gluster.com/#change,4449 does, but I'm concerned that
> we might just be masking a problem in the higher-level disconnect/reconnect
> logic.
The rpc_clnt layer's reconnect logic resets remote-port to zero on the first connect to the brick process. This is done to ensure that the client xl is not taken by surprise when a brick goes down (resulting in a disconnect) and comes back up listening on a different port.
[Ref: rpc-clnt.c: rpc_clnt_notify, RPC_TRANSPORT_CONNECT case.]
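The port-reset behaviour described above can be sketched roughly as follows. This is a hypothetical, much-simplified model using made-up names (`rpc_clnt_conn_t`, `rpc_clnt_handle_connect`), not the actual rpc-clnt.c code:

```c
#include <stdio.h>

/* Hypothetical, stripped-down stand-in for the client's connection state. */
typedef struct {
    int remote_port;   /* 0 means "query the glusterd portmapper first" */
} rpc_clnt_conn_t;

/* Sketch of the RPC_TRANSPORT_CONNECT case in rpc_clnt_notify: once a
 * connect to the brick succeeds, reset remote-port to zero so that the
 * next reconnect goes back through the portmapper instead of assuming
 * the brick is still listening on its old port. */
static void rpc_clnt_handle_connect(rpc_clnt_conn_t *conn)
{
    conn->remote_port = 0;
}
```

The consequence for this bug: every reconnect after a brick disconnect begins with a portmapper connection (use_ssl=false), which is exactly the kind of connection that later trips over the stale ssl_ssl pointer.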
FWIW, I used to get the same crash (on my laptop) that Avra was seeing in his setup.
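For reference, the crash-avoidance fix discussed above (change 4449, "socket: null out priv->ssl_ssl on disconnect") can be sketched as below. This is a hypothetical, much-simplified model, not the actual socket.c code: `socket_priv_t` and `socket_teardown_ssl` are made-up names, and real teardown would go through SSL_shutdown()/SSL_free() rather than free().

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, stripped-down stand-in for the socket private state;
 * the field names mirror those discussed in this bug. */
typedef struct {
    void *ssl_ssl;   /* SSL object left over from an SSL-enabled connect */
    bool  use_ssl;   /* whether the current connection uses SSL */
} socket_priv_t;

/* On disconnect, release the SSL object and null the pointer, so that a
 * later reconnect with use_ssl=false can never dereference (or double-
 * free) a stale ssl_ssl. */
static void socket_teardown_ssl(socket_priv_t *priv)
{
    if (priv->ssl_ssl != NULL) {
        free(priv->ssl_ssl);   /* stand-in for SSL_shutdown + SSL_free */
        priv->ssl_ssl = NULL;
    }
}
```

Nulling the pointer makes the teardown idempotent across the connect/disconnect cycles listed in the thread, though, as noted above, it may only be masking a higher-level reconnect-logic problem.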