Bug 611938
Summary: | [RHEL5u3] System panic at sunrpc xprt_autoclose() | | |
---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Flavio Leitner <fleitner> |
Component: | kernel | Assignee: | Jeff Layton <jlayton> |
Status: | CLOSED ERRATA | QA Contact: | Network QE <network-qe> |
Severity: | high | Docs Contact: | |
Priority: | high | | |
Version: | 5.3 | CC: | apatrick, bfields, dtian, dwysocha, jonstanley, kzhang, moshiro, shsch21, steved, tao, trond.myklebust, yugzhang |
Target Milestone: | rc | | |
Target Release: | --- | | |
Hardware: | All | | |
OS: | Linux | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2011-01-13 21:41:12 UTC | Type: | --- |
Description
Flavio Leitner
2010-07-06 21:52:08 UTC
My first thought here is that the rpc_xprt was allocated and then freed, but a work queue item was left scheduled that would call ->xprt_autoclose() later. In the meantime, the memory was reallocated for another purpose, so by the time the work item ran, the rpc_xprt appeared corrupted.

I looked over the changelog and didn't see any recent change (-128.el5 or newer) that could introduce a problem like that. But I need to read the code carefully before saying anything more concrete.

I basically agree with Flavio's assessment. What prevents the xprt from being torn down while there's still a task_cleanup job in the workqueue? I don't see anything that does right offhand. The task_cleanup job doesn't hold a reference to the xprt or anything. There may be some magic with the xprt->state handling that is supposed to prevent this, but it's not clear to me. Steve, Bruce -- any insight?

The other thing that might be interesting is to know whether the rpc_clnt that points to the rpc_xprt has also been freed.

Hi Jeff,

Unfortunately, the stack trace didn't show the rpc_clnt because the work queue received a struct rpc_xprt * pointer directly. So I searched the kernel memory space for the rpc_xprt pointer value and subtracted the offset of rpc_clnt.cl_xprt from each address where it was found. Here is the output:

crash> size -o rpc_clnt.cl_xprt
struct rpc_clnt {
  [0x8] struct rpc_xprt *cl_xprt;
}

[1]               [2]               [3]
ffff810073d05bb8  ffff810073d05bb0  X
ffff810073d05bd0  ffff810073d05bc8  X
ffff810073d05be8  ffff810073d05be0  X
ffff810073d05d50  ffff810073d05d48  ~
ffff810073d05d70  ffff810073d05d68  X
ffff810073d05d90  ffff810073d05d88  X
ffff810073d05da8  ffff810073d05da0  X
ffff810073d05dc0  ffff810073d05db8  X
ffff810073d05dd8  ffff810073d05dd0  ~
ffff810073d05e10  ffff810073d05e08  ~

Column [1] lists the addresses that contain the struct rpc_xprt * pointer value. Column [2] is the address in [1] minus 0x8, i.e. the candidate rpc_clnt base address. Column [3] is my assessment of the contents at that address: X means not a match, ~ means could be, but unlikely.
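For anyone following along, the arithmetic behind column [2] is the usual "member address minus member offset" step. Below is a minimal userspace sketch of it. The struct `fake_rpc_clnt` and the helper `candidate_clnt_base()` are hypothetical stand-ins invented for illustration; only the 0x8 offset of cl_xprt is taken from the crash output above, and the real struct rpc_clnt has many more fields.

```c
#include <stddef.h>
#include <stdint.h>

struct rpc_xprt;                      /* opaque here; only the pointer matters */

/*
 * Hypothetical stand-in for struct rpc_clnt. Only the layout up to
 * cl_xprt is modelled, so that cl_xprt lands at offset 0x8 as reported
 * by crash.
 */
struct fake_rpc_clnt {
    uint64_t         first_field;     /* placeholder for whatever sits at 0x0 */
    struct rpc_xprt *cl_xprt;         /* offset 0x8 */
};

/*
 * Given an address where the rpc_xprt pointer value was found in memory,
 * return the address a struct rpc_clnt would start at if that hit were
 * its cl_xprt member.
 */
static uint64_t candidate_clnt_base(uint64_t hit_addr)
{
    return hit_addr - offsetof(struct fake_rpc_clnt, cl_xprt);
}
```

Calling `candidate_clnt_base(0xffff810073d05bb8ULL)` yields `0xffff810073d05bb0`, matching the first row of the table. Whether a candidate really is an rpc_clnt still has to be judged by inspecting the memory there, which is what column [3] records.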
The vmcore is available for you at:
megatron.gsslab.rdu.redhat.com:/cores/20100702001907/work

As far as I could see, there is no valid rpc_clnt using the rpc_xprt pointer.

fbl

In looking over this code, I don't see anything that prevents the xprt from being freed before the task_cleanup workqueue job runs. It seems like the xprt_autoclose workqueue job ought to hold a reference to the xprt until it completes. The XPRT_LOCKED bit is set for the duration of the job, so it's possible that it is supposed to prevent this somehow. It's not completely clear to me whether it does.

Bruce pointed out this patch, which might be the culprit:

commit 66af1e558538137080615e7ad6d1f2f80862de01
Author: Trond Myklebust <Trond.Myklebust>
Date:   Tue Nov 6 10:18:36 2007 -0500

    SUNRPC: Fix a race in xs_tcp_state_change()

    When scheduling the autoclose RPC call, we want to ensure that we don't
    race against the test_bit() call in xprt_clear_locked().

    Signed-off-by: Trond Myklebust <Trond.Myklebust>

....it's hard to be sure though from the core. How reproducible is this problem?

...and even with that patch, I don't see anything that prevents the xprt from being freed while xprt_autoclose is queued to the workqueue.

This is a tricky one, indeed.

Version-Release number of selected component:
Red Hat Enterprise Linux Version Number: RHEL5
Release Number : 5.3
Architecture : x86_64
Kernel Version : 2.6.18-128.2.1
Related Package Version : none
Related Middleware / Application : none

Drivers or hardware or architecture dependency: none

How reproducible: Unknown

Steps to Reproduce: Unknown

Actual Results: The system panicked.

Expected Results: The system doesn't panic.

Hardware configuration:
Model : PRIMERGY BX920S1
CPU Info : Intel(R) Xeon(R) CPU 2.93 GHz
Memory Info : 2GB

I wonder whether we need something like

diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index dcd0132..2a1f664 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1129,6 +1129,7 @@ static void xprt_destroy(struct kref *kref)
 	rpc_destroy_wait_queue(&xprt->sending);
 	rpc_destroy_wait_queue(&xprt->resend);
 	rpc_destroy_wait_queue(&xprt->backlog);
+	cancel_work_sync(&xprt->task_cleanup);
 	/*
 	 * Tear down transport state and free the rpc_xprt
 	 */

(totally untested)? I don't see what prevents task_cleanup running after xprt destruction in the latest upstream.

Adding Trond to the cc list to get his opinion...

Trond, this is older RHEL-5 code, but it seems upstream is similar in this regard. The question can be summarized as "what prevents an xprt from being freed before the task_cleanup job can run?". It seems like the task_cleanup job ought to hold an xprt reference, or xs_destroy (or something along those lines) should be canceling the task_cleanup job (as Bruce's patch suggests).

wrt Bruce's patch -- yeah, that seems reasonable at first glance. I was thinking it might be better in xs_destroy though, since we cancel the connect_worker there already. Thoughts?

Oops, missed Trond on the cc list. Trond, the summary is in comment 11 on this bug. Any help would be appreciated.

xs_destroy is socket-specific and task_cleanup is an xprt thing, so I think it should go in the xprt cleanup somewhere, assuming cancelling is the right thing to do.

Bruce sent a patch to Trond upstream and he has taken it in for 2.6.36:
http://www.spinics.net/lists/netdev/msg137026.html

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products.
This request is not yet committed for inclusion in an Update release.

in kernel-2.6.18-219.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2011-0017.html

*** Bug 585594 has been marked as a duplicate of this bug. ***
*** Bug 709505 has been marked as a duplicate of this bug. ***
*** Bug 915757 has been marked as a duplicate of this bug. ***
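For readers landing here from one of the duplicates, the idea discussed in this bug (cancel the pending task_cleanup work before the rpc_xprt is freed) can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: `struct work`, `queue_work_item()`, `cancel_work_item()`, `run_queue()` and the `toy_xprt` type are all hypothetical, single-threaded stand-ins, not kernel APIs. In particular, the kernel's cancel_work_sync() also waits for an already-running instance to finish, which this toy does not model.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, single-threaded stand-in for a kernel work item. */
struct work {
    void (*func)(struct work *);
    struct work *next;
    bool pending;
};

static struct work *queue_head;       /* pending work, newest first */
static int autoclose_runs;            /* how many times the callback ran */

static void queue_work_item(struct work *w)
{
    if (w->pending)
        return;                       /* already queued */
    w->pending = true;
    w->next = queue_head;
    queue_head = w;
}

/* Unlink w if it is queued; returns true if it was pending. */
static bool cancel_work_item(struct work *w)
{
    struct work **pp;

    for (pp = &queue_head; *pp; pp = &(*pp)->next) {
        if (*pp == w) {
            *pp = w->next;
            w->pending = false;
            return true;
        }
    }
    return false;
}

/* Run and drain everything that is still queued. */
static void run_queue(void)
{
    while (queue_head) {
        struct work *w = queue_head;

        queue_head = w->next;
        w->pending = false;
        w->func(w);
    }
}

/* Toy transport that embeds its cleanup work, like rpc_xprt.task_cleanup. */
struct toy_xprt {
    int state;
    struct work task_cleanup;
};

static void toy_autoclose(struct work *w)
{
    /* Recover the owning object from the embedded member. */
    struct toy_xprt *xprt = (struct toy_xprt *)
        ((char *)w - offsetof(struct toy_xprt, task_cleanup));

    xprt->state = 0;                  /* would touch freed memory if xprt were gone */
    autoclose_runs++;
}

static struct toy_xprt *toy_xprt_create(void)
{
    struct toy_xprt *x = calloc(1, sizeof(*x));

    if (x) {
        x->state = 1;
        x->task_cleanup.func = toy_autoclose;
    }
    return x;
}

static void toy_xprt_destroy(struct toy_xprt *x)
{
    /* Without this cancel, a later run_queue() would dereference freed memory. */
    cancel_work_item(&x->task_cleanup);
    free(x);
}

/* Queue the cleanup work, destroy the owner, then drain the queue. */
static int demo_destroy_with_pending_work(void)
{
    struct toy_xprt *x = toy_xprt_create();

    if (!x)
        return -1;
    queue_work_item(&x->task_cleanup);
    toy_xprt_destroy(x);
    run_queue();                      /* nothing left to run */
    return autoclose_runs;
}
```

In this model `demo_destroy_with_pending_work()` returns 0 because the work item was unlinked before the free. The alternative raised in the thread, having the queued job hold its own reference on the xprt, would instead delay the free until the callback has run; the sketch does not attempt to show that variant.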