Escalated to Bugzilla from IssueTracker
Event posted on 2009-08-13 05:30 BST by T.Miki (1) Category Defect Report (2) Abstract The "sar -d " command outputs invalid data. (3) Symptom "sar -d -f xxxx" outputs invalid data. The xxxx file includes unsigned long-long lapped around data of nr_ios,rd_sect or wr_sect. Please see my test result bleow. # sar -d -f el5_sadc_log_ffff_rd0 Linux 2.6.18-128.el5PAE (rhel5u3x86-vm) 08/11/09 13:36:00 DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util 13:36:03 dev8-0 1.00 0.00 37.21 37.33 0.00 1.00 1.00 0.10 13:36:03 dev8-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 13:36:03 dev8-2 20401.66 461404.32 97706.31 27.41 335.34 16.44 6.92 14122.82 << invalid 13:36:03 dev253-0 4.65 0.00 37.21 8.00 0.00 0.21 0.21 0.10 13:36:03 dev253-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 13:36:06 dev8-0 1.00 0.00 24.00 24.00 0.00 1.00 1.00 0.10 13:36:06 dev8-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 13:36:06 dev8-2 1.00 0.00 24.00 24.00 0.00 1.00 1.00 0.10 13:36:06 dev253-0 3.00 0.00 24.00 8.00 0.00 0.56 0.33 0.10 13:36:06 dev253-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 13:36:09 dev8-0 1.00 0.00 10.67 10.67 0.00 0.33 0.33 0.03 13:36:09 dev8-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 13:36:09 dev8-2 1.00 0.00 10.67 10.67 0.00 0.33 0.33 0.03 13:36:09 dev253-0 1.33 0.00 10.67 8.00 0.00 0.25 0.25 0.03 13:36:09 dev253-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 13:36:12 dev8-0 1.00 0.00 18.60 18.67 0.00 0.33 0.33 0.03 13:36:12 dev8-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 13:36:12 dev8-2 1.00 0.00 18.60 18.67 0.00 0.33 0.33 0.03 13:36:12 dev253-0 2.33 0.00 18.60 8.00 0.00 0.29 0.14 0.03 13:36:12 dev253-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Average: dev8-0 1.00 0.00 22.63 22.67 0.00 0.67 0.67 0.07 Average: dev8-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Average: dev8-2 5109.65 115543.01 24480.53 27.40 83.97 16.43 6.92 3536.62 Average: dev253-0 2.83 0.00 22.63 8.00 0.00 0.32 0.24 0.07 Average: dev253-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 --- el5_sadc_log_ffff_rd0 contents of disk_stats[ ] for dev8-2 00000780: 0800 0000 0100 0000 1b31 15ff ffff ffff 00000790: ffff ffff ffff ffff 607c 0400 0000 0000 000007a0: 0000 0000 0000 0000 c8ab 0800 0000 0000 000007b0: 0ebb 0600 0000 0000 867c 0600 0000 0000 000007c0: d666 0f00 0000 0000 deef 0000 0000 0000 000007d0: 0800 0000 0200 0000 c22d 1500 0000 0000 00001000: 1b31 1500 0000 0000 0000 0000 0000 0000. 00001010: d07c 0400 0000 0000 0000 0000 0000 0000 00001020: c8ab 0800 0000 0000 11bb 0600 0000 0000. 00001030: 897c 0600 0000 0000 d966 0f00 0000 0000 00001040: e1ef 0000 0000 0000 0800 0000 0200 0000 (4) Environment RHEL5.3 sysstat-7.0.2-3.el5 (5) Recreation Steps Please execute the sar command below. # sar -d -f el5_sadc_log_ffff_rd0 the "el5_sadc_log_ffff_rd0" is attatched file. Regarding the "el5_sadc_log_ffff_rd0" file, I created this file by "/usr/lib/sa/sadc -d 3 4 el5_sadc_log_ffff_rd0". And I have modified this file to recreate the reported problem. 00000780: 0800 0000 0100 0000 1b31 1500 0000 0000 00000790: 0000 0000 0000 0000 607c 0400 0000 0000 000007a0: 0000 0000 0000 0000 c8ab 0800 0000 0000 000007b0: 0ebb 0600 0000 0000 867c 0600 0000 0000 000007c0: d666 0f00 0000 0000 deef 0000 0000 0000 000007d0: 0800 0000 0200 0000 c22d 1500 0000 0000 ↓ 00000780: 0800 0000 0100 0000 1b31 15ff ffff ffff <<<<<< changed 00000790: ffff ffff ffff ffff 607c 0400 0000 0000 <<<<<< changed 000007a0: 0000 0000 0000 0000 c8ab 0800 0000 0000 000007b0: 0ebb 0600 0000 0000 867c 0600 0000 0000 000007c0: d666 0f00 0000 0000 deef 0000 0000 0000 000007d0: 0800 0000 0200 0000 c22d 1500 0000 0000 (6) Investigation I have verified that check_disk_reg( ) function of sa_common.c clears the disk_stats[ ] in line 523-530. If the nr_ios,rd_sect or wr_sect reaches near the unsigned long-long lapped around, disk_stats[ ] struct will be broken by this logic. Following is my verified information by gdb. Breakpoint 1, check_disk_reg (file_hdr=0x8055380, st_disk=0x805424c, curr=1, ref=0, pos=2) at sa_common.c:510 510 st_disk_i = st_disk[curr] + pos; (gdb) s 506 { (gdb) 510 st_disk_i = st_disk[curr] + pos; (gdb) 512 while (index < file_hdr->sa_nr_disk) { (gdb) 513 st_disk_j = st_disk[ref] + index; (gdb) 514 if ((st_disk_i->major == st_disk_j->major) && (gdb) p/x st_disk_i->major $5 = 0x8 (gdb) p/x st_disk_j->major $6 = 0x8 (gdb) p/x st_disk_i->minor $7 = 0x2 (gdb) p/x st_disk_j->minor $8 = 0x2 (gdb) p/x st_disk_i->nr_ios $9 = 0xefe1 (gdb) p/x st_disk_j->nr_ios $10 = 0xefde (gdb)p/x st_disk_i->rd_sect $11 = 0x15311b (gdb) p/x rd_sect_j->rd_sect $12 = 0xffffffffff15311b (gdb) p/x st_disk_j->rd_sect st_disk_i->wr_sect $13 = 0x47cd0 (gdb) p/x st_disk_j->wr_sect $14 = 0x47c60 (gdb) s 523 if ((st_disk_i->nr_ios < st_disk_j->nr_ios) || ... <<<<<<<<<<<<<<<<<<<< bug!! (gdb) warning: Source file is more recent than executable. 96 return __builtin___memset_chk (__dest, __ch, __len, __bos0 (__dest)); (gdb) s 553 st_disk_j->major = st_disk_i->major; (gdb) s 554 st_disk_j->minor = st_disk_i->minor; (gdb) s 557 } (gdb) p/x st_disk_i->major $16 = 0x8 (gdb) p/x st_disk_j->major $17 = 0x8 (gdb) p/x st_disk_i->minor $18 = 0x2 (gdb) p/x st_disk_j->minor $19 = 0x2 (gdb) p/x st_disk_i->nr_ios $20 = 0xefe1 (gdb) p/x st_disk_j->nr_ios $21 = 0x0 (gdb) p/x st_disk_i->rd_sect $22 = 0x15311b (gdb) p/x st_disk_j->rd_sect $23 = 0x0 (gdb) p/x st_disk_i->wr_sect $24 = 0x47cd0 (gdb) p/x st_disk_j->wr_sect $25 = 0x0 Following is my patch data for sa_common.c. --- snip --- --- sa_common.c_ORG 2006-10-22 18:23:45.000000000 +0900 +++ sa_common.c 2009-08-11 14:44:59.000000000 +0900 @@ -520,14 +520,15 @@ * NB: AFAIK, such a device cannot be unregistered with current * kernels. */ +/* if ((st_disk_i->nr_ios < st_disk_j->nr_ios) || (st_disk_i->rd_sect < st_disk_j->rd_sect) || (st_disk_i->wr_sect < st_disk_j->wr_sect)) { - memset(st_disk_j, 0, DISK_STATS_SIZE); st_disk_j->major = st_disk_i->major; st_disk_j->minor = st_disk_i->minor; } +*/ return index; } index++; Following is the output of the sar command updated by above patch. # /usr/src/redhat/BUILD/sysstat-7.0.2/sar -d -f el5_sadc_log_ffff_rd0 Linux 2.6.18-128.el5PAE (rhel5u3x86-vm) 08/11/09 13:36:00 DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util 13:36:03 dev8-0 1.00 0.00 37.21 37.33 0.00 1.00 1.00 0.10 13:36:03 dev8-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 13:36:03 dev8-2 1.00 5573825.91 37.21 5592442.67 0.00 1.00 1.00 0.10 <<<<< corrected 13:36:03 dev253-0 4.65 0.00 37.21 8.00 0.00 0.21 0.21 0.10 13:36:03 dev253-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 13:36:06 dev8-0 1.00 0.00 24.00 24.00 0.00 1.00 1.00 0.10 13:36:06 dev8-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 13:36:06 dev8-2 1.00 0.00 24.00 24.00 0.00 1.00 1.00 0.10 13:36:06 dev253-0 3.00 0.00 24.00 8.00 0.00 0.56 0.33 0.10 13:36:06 dev253-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 13:36:09 dev8-0 1.00 0.00 10.67 10.67 0.00 0.33 0.33 0.03 13:36:09 dev8-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 13:36:09 dev8-2 1.00 0.00 10.67 10.67 0.00 0.33 0.33 0.03 13:36:09 dev253-0 1.33 0.00 10.67 8.00 0.00 0.25 0.25 0.03 13:36:09 dev253-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 13:36:12 dev8-0 1.00 0.00 18.60 18.67 0.00 0.33 0.33 0.03 13:36:12 dev8-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 13:36:12 dev8-2 1.00 0.00 18.60 18.67 0.00 0.33 0.33 0.03 13:36:12 dev253-0 2.33 0.00 18.60 8.00 0.00 0.29 0.14 0.03 13:36:12 dev253-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Average: dev8-0 1.00 0.00 22.63 22.67 0.00 0.67 0.67 0.07 Average: dev8-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Average: dev8-2 1.00 1395775.04 22.63 1398124.00 0.00 0.67 0.67 0.07 Average: dev253-0 2.83 0.00 22.63 8.00 0.00 0.32 0.24 0.07 Average: dev253-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 # (7) Related Documentation/Related Bugzilla # none (8) Attachments Following files are attached in this issue. - test file for recreation this problem(sadc -d 3 4) : el5_sadc_log_ffff_rd0 - recreated package : sysstat-7.0.2-3.el5.i386.rpm - recreated package source : sysstat-7.0.2-3.el5.src.rpm - recreated package debuginfo :sysstat-debuginfo-7.0.2-3.el5.i386.rpm - sosreport archive of recreated system : sosreport-rhel5u3x86-vm-629023-09de2b.tar.bz2 - md5 sum data of sosreport archive : sosreport-rhel5u3x86-vm-629023-09de2b.tar.bz2.md5 (9) Business Impacts The customer executes the sar command to monitor their system(*1) performance every 10 minutes . (*1) They are using this system on business. If this problem occurs, the performance data will show very serious. And the customer has to analyse system performance. This needless man power cost is so heavy loss for the customer. (10) Requests I would like you to investigate and fix this problem at 5.5. This event sent from IssueTracker by mpoole [Support Engineering Group] issue 329782
Created attachment 357439 [details] customer proposed patch to cope with wraparound
Created attachment 357440 [details] adjusted sar datafile with wraparound
The customer provided patch does raise some questions. In the comment prior to the code that is proposed to be removed it notes * Disk found. * If a counter has decreased, then we may assume that the * corresponding device was unregistered, then registered again. * NB: AFAIK, such a device cannot be unregistered with current * kernels. And the check tests whether any of the new values for the device are lower than the old ones. The problem is that the comment regarding devices being unregistered is incorrect. Hot-swap of disks, USB storage, etc can all cause a device to disappear and reappear with corresponding re-use of major & minor numbers. Since any change like this invalidates the statistics for the given device it is appropriate to zero values when this occurs. On the other hand the customer patch assumes this is never the case and lets the values wrap. This unfortunately gives us two conflicting use cases. It seems more prudent to make the assumption that if only one or two values wrap (and appear reduced) then zeroing is not applicable. If, on the other hand all of the counters appear to have reduced then it is more likely that this is a new device and old values should be assumed as 0. I have produced a patch based on these assumptions.
Created attachment 357445 [details] Alternate patch for counter wrap checks
This request was evaluated by Red Hat Product Management for inclusion, but this component is not scheduled to be updated in the current Red Hat Enterprise Linux release. If you would like this request to be reviewed for the next minor release, ask your support representative to set the next rhel-x.y flag to "?".
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: The "sar -d" command used to output invalid data about block devices. With this update, the sar utility recognizes disk registration and disk overflow statistics properly, and only correct and relevant data are now displayed.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-1005.html