Description of problem:
"nfsstat -s" shows negative value.

How reproducible:
When nfsd has taken enough calls to overflow a signed int.

Steps to Reproduce:
Run nfsd service for a long time with many clients doing operations, and run "nfsstat -s".

# nfsstat -s
2007年 10月 24日 水曜日 05:29:02 JST
Server rpc stats:
   calls       badcalls   badauth    badclnt    xdrcall
>> -2147459460 0          0          0          0

Server nfs v3:
null            getattr         setattr         lookup          access          readlink
3           0%  961806770  44%  34884113    1%  356848932  16%  311303203  14%  0           0%
read            write           create          mkdir           symlink         mknod
58992809    2%  63067804    2%  69266297    3%  12816       0%  0           0%  0           0%
remove          rmdir           rename          link            readdir         readdirplus
103983458   4%  2808        0%  5945055     0%  36564688    1%  34420909    1%  73349994    3%
fsstat          fsinfo          pathconf        commit
31820       0%  177         0%  0           0%  36532676    1%

Actual results:
"nfsstat -s" shows negative value.

Expected results:
"nfsstat -s" shows correct value.

Additional info:
/proc/net/rpc/nfsd formats its value as an unsigned int, but nfsstat parses this unsigned int with atoi(). It should use atoll().
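For anyone wanting to sanity-check the number in the report, here is a minimal sketch of the arithmetic. It assumes a 32-bit two's-complement int and that the counter had passed 2^31 but not yet wrapped past 2^32; under those assumptions the displayed -2147459460 would correspond to an unsigned count of 2147507836. signed_view() is a hypothetical helper written for this comment, not anything from nfs-utils.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not nfs-utils code): return the value that a
 * 32-bit two's-complement signed int would display for the same bit
 * pattern as the given unsigned counter. The arithmetic is done in
 * 64 bits so that no out-of-range conversion is involved.
 */
int64_t signed_view(uint32_t counter)
{
    int64_t v = (int64_t)counter;

    if (counter > (uint32_t)INT32_MAX)
        v -= INT64_C(4294967296);   /* subtract 2^32 */

    /* The result always fits in the 32-bit signed range. */
    assert(v >= INT32_MIN && v <= INT32_MAX);
    return v;
}
```

Note that atoi() itself gives no guarantee for input that does not fit in an int, so the wrapped value seen above is what this platform happened to produce rather than something to rely on.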
Created attachment 273161 [details] Patch to parse and print values correctly.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Looks simple enough...
This patch is not upstream and needs to go there first. I'm going to reset this for 4.8 and we can consider it there. We'll also need to clone this for 5.2.
...err make that 5.3...
As far as testing this... The easiest thing might be to custom roll a kernel that initializes the /proc fields to values that would cause this. I'll plan to do that to verify the fix and will post the kernel patch here so that QA can use it too...
Setting to NEEDINFO for now. Since customer rolled the patch, I'll give them the option of pushing it upstream. It looks sane AFAICT... If they'd rather I push it upstream, please let me know the name of the person to whom I should attribute it.
Created attachment 302629 [details]
patch -- parse and print values correctly

The other patch missed the heavily-used print_callstats() function. This corrects that and also changes the atoll() call to strtoul() (since the kernel should be printing unsigned numbers for these anyway). Tested with a hacked-up kernel that initialized many of these counters to 2^31+1.
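To illustrate the strtoul() approach without reproducing the patch, here is a rough sketch. parse_counters() is a hypothetical helper, not the code in the attachment, and the sample line used with it below is made up in the general "label followed by numbers" shape of the /proc statistics files.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not the nfs-utils code): skip the leading label
 * token of a statistics line, then parse up to max unsigned counters
 * with strtoul(). Returns the number of counters stored in out[].
 */
size_t parse_counters(const char *line, unsigned int *out, size_t max)
{
    size_t n = 0;

    assert(line != NULL && out != NULL);

    /* Skip the label (everything up to the first whitespace). */
    while (*line != '\0' && !isspace((unsigned char)*line))
        line++;

    while (n < max) {
        char *end;
        unsigned long v = strtoul(line, &end, 10);

        if (end == line)        /* no digits left on the line */
            break;
        out[n++] = (unsigned int)v;
        line = end;
    }
    return n;
}
```

Called on a line such as "rpc 2147483649 0 0 0 0", this should store 2147483649 in out[0] rather than a negative number. The values then need to be printed with %u rather than %d for the fix to be visible in the output.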
Created attachment 302630 [details]
kernel patch for verification

This patch initializes most of the counters that nfsstat reads to 2^31+1. I missed a few, which are still initialized to 0, but this turns out to be enough to convince me that the nfsstat patch works as expected when counters overflow.
Created attachment 302638 [details]
patch -- parse and print values correctly

Respun patch. David Richter pointed out that the sum variable in parse_pretty_statfile() should also be an unsigned int.
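A rough illustration of why the sum matters too: if the per-operation counters are unsigned but the total is accumulated in a signed int, the total can go negative and skew the percentages. The sketch below is not parse_pretty_statfile(); sum_counters() and percent_of() are hypothetical names, and the integer percentage is a simplification for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: accumulate the counters in an unsigned int. */
unsigned int sum_counters(const unsigned int *counts, size_t n)
{
    unsigned int total = 0;

    assert(counts != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        total += counts[i];
    return total;
}

/* Hypothetical helper: whole-number share of count in total. */
unsigned int percent_of(unsigned int count, unsigned int total)
{
    if (total == 0)             /* avoid dividing by zero */
        return 0;
    /* Widen before multiplying so count * 100 cannot overflow. */
    return (unsigned int)((unsigned long long)count * 100ULL / total);
}
```

With two counters of 2147483649 and 715827883, the unsigned total is 2863311532 and the shares come out as 75 and 25. An unsigned total still wraps modulo 2^32 on platforms where unsigned int is 32 bits, so this only pushes the limit out; it does not remove it.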
Patch is now in upstream nfs-utils...
Committed in nfs-utils-1.0.6-88.EL4
Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
The 'nfsstat' command now displays correct statistics. In previous versions, performing more than 2^31 RPC calls could cause the 'nfsstat' command to incorrectly display the number of calls as "negative". This was because 'nfsstat' printed statistics from /proc/net/rpc/* files as signed integers; with this version of nfs-utils, 'nfsstat' now reads and prints these statistics as unsigned integers.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0955.html
------- Comment From pradeep.com 2010-09-22 13:42 EDT-------
(In reply to comment #5)
> Hello,
> Please confirm the comment or package this is requested for. Please confirm the
> URL to the patch. Please confirm if the patch has been tested by IBM.

Here is the relevant portion of the diff for only this bug (diff is between dapl-2.0.25 which is the current version in RHEL5.5 and dapl-2.0.30 which is currently the latest dapl version)

elm3b198:/home/pradeep # diff -Nup dapl-2.0.25/dapl/openib_cma/device.c dapl-2.0.30/dapl/openib_cma/device.c
--- db2/RHEL5.5/dapl-2.0.25/dapl/openib_cma/device.c 2009-09-28 12:28:31.000000000 -0500
+++ dapl-2.0.30/dapl/openib_cma/device.c 2010-05-19 17:48:47.000000000 -0500
@@ -474,12 +401,6 @@ DAT_RETURN dapls_ib_close_hca(IN DAPL_HC
     dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " close_hca: %p->%p\n",
                  hca_ptr, hca_ptr->ib_hca_handle);
 
-    if (hca_ptr->ib_hca_handle != IB_INVALID_HANDLE) {
-        if (rdma_destroy_id(hca_ptr->ib_trans.cm_id))
-            return (dapl_convert_errno(errno, "ib_close_device"));
-        hca_ptr->ib_hca_handle = IB_INVALID_HANDLE;
-    }
-
     dapl_os_lock(&g_hca_lock);
     if (g_ib_thread_state != IB_THREAD_RUN) {
         dapl_os_unlock(&g_hca_lock);
@@ -508,6 +429,23 @@ DAT_RETURN dapls_ib_close_hca(IN DAPL_HC
         dapl_os_sleep_usec(1000);
     }
 bail:
+
+    if (hca_ptr->ib_trans.ib_cq)
+        ibv_destroy_comp_channel(hca_ptr->ib_trans.ib_cq);
+
+    if (hca_ptr->ib_trans.ib_cq_empty) {
+        struct ibv_comp_channel *channel;
+        channel = hca_ptr->ib_trans.ib_cq_empty->channel;
+        ibv_destroy_cq(hca_ptr->ib_trans.ib_cq_empty);
+        ibv_destroy_comp_channel(channel);
+    }
+
+    if (hca_ptr->ib_hca_handle != IB_INVALID_HANDLE) {
+        if (rdma_destroy_id(hca_ptr->ib_trans.cm_id))
+            return (dapl_convert_errno(errno, "ib_close_device"));
+        hca_ptr->ib_hca_handle = IB_INVALID_HANDLE;
+    }
+
     return (DAT_SUCCESS);
 }

This patch has been tested within IBM and yes this is requested for RHEL5.5z-stream DAT-2.0 i.e. the dapl-2.0 package.