Bug 1320790
Summary: | Version mismatch of Kernel and papi installed by yum. | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | satoken <ken.sato.ty> |
Component: | papi | Assignee: | William Cohen <wcohen> |
Status: | CLOSED ERRATA | QA Contact: | Michael Petlan <mpetlan> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | 7.2 | CC: | mbenitez, mcermak, mpetlan, ohudlick, wcohen |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | papi-5.2.0-15.el7 | Doc Type: | Bug Fix |
Doc Text: |
Cause: There was a mismatch in the layout of the perf data structure used by the kernel and papi.
Consequence: papi would not detect that the kernel supported userspace reads of the performance counters using the rdpmc instruction.
Fix: Rebuilt papi with libpfm-4.7, which supplies a perf data structure layout that matches what the kernel uses.
Result: PAPI now correctly identifies when the kernel supports fast userspace reads with rdpmc. However, PAPI does not currently use the rdpmc mechanism to read the counters.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2016-11-04 05:18:08 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
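The mismatch named in the Doc Text can be reproduced in isolation. The sketch below copies only the two capability unions from the headers quoted in the description; the helper names (make_kernel_caps, old_header_sees_rdpmc, new_header_sees_rdpmc) are hypothetical and exist only for illustration. In the older layout the bitfields are direct members of the union, so each of them starts at the same first bit, which is the bit the kernel header documents as "Always 0". This assumes a compiler that accepts uint64_t bitfields and C11 anonymous structs, such as GCC or Clang.

```c
#include <assert.h>
#include <stdint.h>

/* Capability union as quoted from the libpfm-4.4 header. The bitfields are
 * direct members of the union, so every one of them overlays the first bit. */
union old_caps {
    uint64_t capabilities;
    uint64_t cap_usr_time:1,
             cap_usr_rdpmc:1,
             cap_____res:62;
};

/* Capability union as quoted from the kernel header. The bitfields sit
 * inside an anonymous struct, so they occupy consecutive bits. */
union kernel_caps {
    uint64_t capabilities;
    struct {
        uint64_t cap_bit0:1,               /* always 0 */
                 cap_bit0_is_deprecated:1, /* always 1 */
                 cap_user_rdpmc:1,
                 cap_user_time:1,
                 cap_user_time_zero:1,
                 cap_____res:59;
    };
};

/* Hypothetical helper: build the capabilities word the way the kernel
 * header describes it, with the requested rdpmc and time flags. */
uint64_t make_kernel_caps(int rdpmc, int time)
{
    union kernel_caps k;

    assert((rdpmc == 0 || rdpmc == 1) && (time == 0 || time == 1));
    k.capabilities = 0;
    k.cap_bit0 = 0;
    k.cap_bit0_is_deprecated = 1;
    k.cap_user_rdpmc = (uint64_t)rdpmc;
    k.cap_user_time = (uint64_t)time;
    return k.capabilities;
}

/* Hypothetical helper: what code built against the old header reads. */
int old_header_sees_rdpmc(uint64_t raw)
{
    union old_caps o;

    o.capabilities = raw;
    return (int)o.cap_usr_rdpmc; /* overlays cap_bit0, which is always 0 */
}

/* Hypothetical helper: what code built against the kernel layout reads. */
int new_header_sees_rdpmc(uint64_t raw)
{
    union kernel_caps k;

    k.capabilities = raw;
    return (int)k.cap_user_rdpmc;
}
```

Calling old_header_sees_rdpmc(make_kernel_caps(1, 1)) yields 0 even though the kernel word advertises rdpmc, while new_header_sees_rdpmc() on the same word yields 1. That matches the Consequence above.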
Description
satoken
2016-03-24 02:30:09 UTC
The RHEL 7 version of papi isn't using the libpfm bundled in papi. It should be using the unbundled libpfm-4.4 in RHEL 7. However, there does seem to be a difference between the /usr/include/perfmon/perf_event.h and /usr/include/linux/perf_event.h definitions of perf_event_mmap_page. The bitfields in the struct perf_event_mmap_page definitions do not line up, and the cap_usr_rdpmc bitfield is read as 0.

libpfm4 version /usr/include/perfmon/perf_event.h:

struct perf_event_mmap_page {
        uint32_t        version;
        uint32_t        compat_version;
        uint32_t        lock;
        uint32_t        index;
        int64_t         offset;
        uint64_t        time_enabled;
        uint64_t        time_running;
        union {
                uint64_t capabilities;
                uint64_t cap_usr_time:1,
                         cap_usr_rdpmc:1,
                         cap_____res:62;
        } SWIG_NAME(rdmap_cap);
        uint16_t        pmc_width;
        uint16_t        time_shift;
        uint32_t        time_mult;
        uint64_t        time_offset;

        uint64_t        __reserved[120];
        uint64_t        data_head;
        uint64_t        data_tail;
};

kernel /usr/include/linux/perf_event.h:

struct perf_event_mmap_page {
        __u32   version;                /* version number of this structure */
        __u32   compat_version;         /* lowest version this is compat with */

        /*
         * Bits needed to read the hw events in user-space.
         *
         *   u32 seq, time_mult, time_shift, index, width;
         *   u64 count, enabled, running;
         *   u64 cyc, time_offset;
         *   s64 pmc = 0;
         *
         *   do {
         *     seq = pc->lock;
         *     barrier()
         *
         *     enabled = pc->time_enabled;
         *     running = pc->time_running;
         *
         *     if (pc->cap_usr_time && enabled != running) {
         *       cyc = rdtsc();
         *       time_offset = pc->time_offset;
         *       time_mult = pc->time_mult;
         *       time_shift = pc->time_shift;
         *     }
         *
         *     index = pc->index;
         *     count = pc->offset;
         *     if (pc->cap_user_rdpmc && index) {
         *       width = pc->pmc_width;
         *       pmc = rdpmc(index - 1);
         *     }
         *
         *     barrier();
         *   } while (pc->lock != seq);
         *
         * NOTE: for obvious reason this only works on self-monitoring
         *       processes.
         */
        __u32   lock;                   /* seqlock for synchronization */
        __u32   index;                  /* hardware event identifier */
        __s64   offset;                 /* add to hardware event value */
        __u64   time_enabled;           /* time event active */
        __u64   time_running;           /* time event on cpu */
        union {
                __u64   capabilities;
                struct {
                        __u64   cap_bit0                : 1, /* Always 0, deprecated, see commit 860f085b74e9 */
                                cap_bit0_is_deprecated  : 1, /* Always 1, signals that bit 0 is zero */

                                cap_user_rdpmc          : 1, /* The RDPMC instruction can be used to read counts */
                                cap_user_time           : 1, /* The time_* fields are used */
                                cap_user_time_zero      : 1, /* The time_zero field is used */
                                cap_____res             : 59;
                };
        };

        /*
         * If cap_user_rdpmc this field provides the bit-width of the value
         * read using the rdpmc() or equivalent instruction. This can be used
         * to sign extend the result like:
         *
         *   pmc <<= 64 - width;
         *   pmc >>= 64 - width; // signed shift right
         *   count += pmc;
         */
        __u16   pmc_width;

        /*
         * If cap_usr_time the below fields can be used to compute the time
         * delta since time_enabled (in ns) using rdtsc or similar.
         *
         *   u64 quot, rem;
         *   u64 delta;
         *
         *   quot = (cyc >> time_shift);
         *   rem = cyc & ((1 << time_shift) - 1);
         *   delta = time_offset + quot * time_mult +
         *              ((rem * time_mult) >> time_shift);
         *
         * Where time_offset,time_mult,time_shift and cyc are read in the
         * seqcount loop described above. This delta can then be added to
         * enabled and possible running (if index), improving the scaling:
         *
         *   enabled += delta;
         *   if (index)
         *     running += delta;
         *
         *   quot = count / running;
         *   rem  = count % running;
         *   count = quot * enabled + (rem * enabled) / running;
         */
        __u16   time_shift;
        __u32   time_mult;
        __u64   time_offset;
        /*
         * If cap_usr_time_zero, the hardware clock (e.g. TSC) can be calculated
         * from sample timestamps.
         *
         *   time = timestamp - time_zero;
         *   quot = time / time_mult;
         *   rem  = time % time_mult;
         *   cyc = (quot << time_shift) + (rem << time_shift) / time_mult;
         *
         * And vice versa:
         *
         *   quot = cyc >> time_shift;
         *   rem  = cyc & ((1 << time_shift) - 1);
         *   timestamp = time_zero + quot * time_mult +
         *               ((rem * time_mult) >> time_shift);
         */
        __u64   time_zero;
        __u32   size;                   /* Header size up to __reserved[] fields. */

        /*
         * Hole for extension of the self monitor capabilities
         */

        __u8    __reserved[118*8+4];    /* align to 1k. */

        /*
         * Control data for the mmap() data buffer.
         *
         * User-space reading the @data_head value should issue an smp_rmb(),
         * after reading this value.
         *
         * When the mapping is PROT_WRITE the @data_tail value should be
         * written by userspace to reflect the last read data, after issueing
         * an smp_mb() to separate the data read from the ->data_tail store.
         * In this case the kernel will not over-write unread data.
         *
         * See perf_output_put_handle() for the data ordering.
         */
        __u64   data_head;              /* head in the data section */
        __u64   data_tail;              /* user-space written tail */
};

libpfm-4.7 has revised the structure to match the kernel's. Rebuilding papi with libpfm-4.7 enables papi to identify the rdpmc support. However, a comment in perf_event.c:_pe_init_component() states that using rdpmc doesn't speed things up, so the rebuild doesn't change the behavior of papi:

   /* Detect if we can use rdpmc (or equivalent) */
   /* We currently do not use rdpmc as it is slower in tests */
   /* than regular read (as of Linux 3.5) */

A quick comparison between runs of papi_cost built with libpfm-4.7 and libpfm-4.4 shows pretty similar results.

papi built with libpfm-4.7:

$ ./papi_cost
Cost of execution for PAPI start/stop, read and accum.
This test takes a while. Please be patient...
ibwarn: [20461] umad_init: can't read ABI version from /sys/class/infiniband_mad/abi_version (No such file or directory): is ib_umad module loaded?
Performing loop latency test...
Total cost for loop latency over 1000000 iterations
min cycles   : 30
max cycles   : 24756
mean cycles  : 43.638454
std deviation: 69.410008

Performing start/stop test...
Total cost for PAPI_start/stop (2 counters) over 1000000 iterations
min cycles   : 9352
max cycles   : 2051884
mean cycles  : 9719.635600
std deviation: 2259.563806

Performing read test...
Total cost for PAPI_read (2 counters) over 1000000 iterations
min cycles   : 2510
max cycles   : 92982
mean cycles  : 2571.104060
std deviation: 311.403011

Performing read with timestamp test...
Total cost for PAPI_read_ts (2 counters) over 1000000 iterations
min cycles   : 2520
max cycles   : 27975
mean cycles  : 2584.767873
std deviation: 265.900891

Performing accum test...
Total cost for PAPI_accum (2 counters) over 1000000 iterations
min cycles   : 3688
max cycles   : 90314
mean cycles  : 3776.048480
std deviation: 418.108075

Performing reset test...
Total cost for PAPI_reset (2 counters) over 1000000 iterations
min cycles   : 1088
max cycles   : 42866
mean cycles  : 1135.465613
std deviation: 209.894743

Performing DERIVED_POSTFIX PAPI_read(3 counters) test...
Total cost for PAPI_read (1 derived_postfix counter) over 1000000 iterations
min cycles   : 3054
max cycles   : 108573
mean cycles  : 3138.790124
std deviation: 389.003901

Performing DERIVED_[ADD|SUB] PAPI_read(2 counters) test...
Total cost for PAPI_read (1 derived_[add|sub] counter) over 1000000 iterations
min cycles   : 3528
max cycles   : 34953
mean cycles  : 3637.639911
std deviation: 394.624337
cost.c PASSED

Try again with libpfm-4.4:

$ ./papi_cost
Cost of execution for PAPI start/stop, read and accum.
This test takes a while. Please be patient...
ibwarn: [31917] umad_init: can't read ABI version from /sys/class/infiniband_mad/abi_version (No such file or directory): is ib_umad module loaded?
Performing loop latency test...
Total cost for loop latency over 1000000 iterations
min cycles   : 30
max cycles   : 25152
mean cycles  : 42.540122
std deviation: 42.793157

Performing start/stop test...
Total cost for PAPI_start/stop (2 counters) over 1000000 iterations
min cycles   : 9292
max cycles   : 74740
mean cycles  : 9645.142076
std deviation: 962.684534

Performing read test...
Total cost for PAPI_read (2 counters) over 1000000 iterations
min cycles   : 2508
max cycles   : 72816
mean cycles  : 2569.350047
std deviation: 333.029009

Performing read with timestamp test...
Total cost for PAPI_read_ts (2 counters) over 1000000 iterations
min cycles   : 2522
max cycles   : 30132
mean cycles  : 2613.603685
std deviation: 325.067950

Performing accum test...
Total cost for PAPI_accum (2 counters) over 1000000 iterations
min cycles   : 3674
max cycles   : 70812
mean cycles  : 3748.156739
std deviation: 401.354167

Performing reset test...
Total cost for PAPI_reset (2 counters) over 1000000 iterations
min cycles   : 1080
max cycles   : 23826
mean cycles  : 1117.220298
std deviation: 193.372576

Performing DERIVED_POSTFIX PAPI_read(3 counters) test...
Total cost for PAPI_read (1 derived_postfix counter) over 1000000 iterations
min cycles   : 3034
max cycles   : 48381
mean cycles  : 3144.853020
std deviation: 317.766676

Performing DERIVED_[ADD|SUB] PAPI_read(2 counters) test...
Total cost for PAPI_read (1 derived_[add|sub] counter) over 1000000 iterations
min cycles   : 3524
max cycles   : 63168
mean cycles  : 3625.259589
std deviation: 322.423673
cost.c PASSED

An email was sent to the papi development list to better understand why rdpmc isn't currently being used. One of the PAPI developers, Vince Weaver, responded with:

On Mon, 28 Mar 2016, William Cohen wrote:

>    /* Detect if we can use rdpmc (or equivalent) */
>    /* We currently do not use rdpmc as it is slower in tests */
>    /* than regular read (as of Linux 3.5) */

> The rdpmc would be limited to speeding up reading of the counters for
> self monitoring tasks, but that could be useful in some cases. Is there
> a discussion about the experiments that led to the determination that
> using the rdpmc is slower? I didn't find anything on the mailing list.

there are various complications on the rdpmc support, I am currently working on getting it implemented.

In order for rdpmc support to be a win the mmap() support has to also be fast (and you have to do many reads in succession). It probably will be a net win in the end but it sometimes requires some extra flags to mmap().

If you're interested in the details that led to the comment in the source code there's a paper I wrote about it:
   Self-monitoring Overhead of the Linux perf_event Performance Counter Interface
   http://web.eece.maine.edu/~vweaver/projects/perf_events/overhead/

Vince

Okay then, let's consider this as VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2395.html
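
For readers who want to experiment with the arithmetic described in the kernel header comments quoted in the description, here is a minimal sketch of two of the steps: sign-extending a raw rdpmc value to pmc_width bits, and scaling a count by enabled/running time. It is not PAPI or libpfm code. The function names sign_extend_pmc and scale_count are hypothetical, and returning 0 when running is 0 is an assumption made here to avoid a division by zero. The shift-based sign extension relies on arithmetic right shift of signed values, which GCC and Clang provide.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend a raw counter value that is only `width` bits wide, following
 * the "pmc <<= 64 - width; pmc >>= 64 - width;" recipe in the header comment.
 * The left shift is done on the unsigned value to avoid undefined behavior;
 * the right shift on int64_t is arithmetic with GCC and Clang. */
int64_t sign_extend_pmc(uint64_t raw, unsigned width)
{
    unsigned shift;

    assert(width >= 1 && width <= 64);
    shift = 64 - width;
    return (int64_t)(raw << shift) >> shift;
}

/* Scale a count by enabled/running using the quotient/remainder split from
 * the header comment, which avoids overflowing count * enabled directly. */
uint64_t scale_count(uint64_t count, uint64_t enabled, uint64_t running)
{
    uint64_t quot, rem;

    if (running == 0)
        return 0; /* assumption: an event that never ran contributes 0 */

    quot = count / running;
    rem = count % running;
    return quot * enabled + (rem * enabled) / running;
}
```

For example, scale_count(1050, 300, 100) splits 1050 into quot = 10 and rem = 50, giving 10 * 300 + (50 * 300) / 100 = 3150. sign_extend_pmc(0xFFFFFFFFFFFF, 48) gives -1, because bit 47 of the 48-bit value is set.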