| Summary: | Version mismatch of Kernel and papi installed by yum. | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | satoken <ken.sato.ty> |
| Component: | papi | Assignee: | William Cohen <wcohen> |
| Status: | CLOSED ERRATA | QA Contact: | Michael Petlan <mpetlan> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 7.2 | CC: | mbenitez, mcermak, mpetlan, ohudlick, wcohen |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | papi-5.2.0-15.el7 | Doc Type: | Bug Fix |
| Doc Text: |
Cause: There was a mismatch in the layout of the perf data structure used by the kernel and papi.
Consequence: papi would not detect that the kernel supported userspace reads of the performance counters using the rdpmc instruction.
Fix: Rebuilt papi with libpfm-4.7 that supplies perf data structure layout that match what the kernel uses.
Result: PAPI now correctly identifies when the kernel support the fast userspace rdpmc. However, PAPI does not currently use the rdpmc mechanism to read the counters.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2016-11-04 05:18:08 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
|
Description
satoken
2016-03-24 02:30:09 UTC
RHEL 7 version of papi isn't using the libpfm bundled in papi. It should be using the unbundled libpfm-4.4 in RHEL7. However, there does seems to be a difference in /usr/include/perfmon/perf_event.h and /usr/include/linux/perf_eventh.h definitions of perf_event_mmap_page. The bitfields in struct perf_event_mmap_page definitions do not line up and the cap_usr_rdpmc bitfield is read as 0.
libpfm4 version /usr/include/perfmon/perf_event.h
struct perf_event_mmap_page {
uint32_t version;
uint32_t compat_version;
uint32_t lock;
uint32_t index;
int64_t offset;
uint64_t time_enabled;
uint64_t time_running;
union {
uint64_t capabilities;
uint64_t cap_usr_time:1,
cap_usr_rdpmc:1,
cap_____res:62;
} SWIG_NAME(rdmap_cap);
uint16_t pmc_width;
uint16_t time_shift;
uint32_t time_mult;
uint64_t time_offset;
uint64_t __reserved[120];
uint64_t data_head;
uint64_t data_tail;
};
kernel /usr/include/linux/perf_event.h:
struct perf_event_mmap_page {
__u32 version; /* version number of this structure */
__u32 compat_version; /* lowest version this is compat with */
/*
* Bits needed to read the hw events in user-space.
*
* u32 seq, time_mult, time_shift, index, width;
* u64 count, enabled, running;
* u64 cyc, time_offset;
* s64 pmc = 0;
*
* do {
* seq = pc->lock;
* barrier()
*
* enabled = pc->time_enabled;
* running = pc->time_running;
*
* if (pc->cap_usr_time && enabled != running) {
* cyc = rdtsc();
* time_offset = pc->time_offset;
* time_mult = pc->time_mult;
* time_shift = pc->time_shift;
* }
*
* index = pc->index;
* count = pc->offset;
* if (pc->cap_user_rdpmc && index) {
* width = pc->pmc_width;
* pmc = rdpmc(index - 1);
* }
*
* barrier();
* } while (pc->lock != seq);
*
* NOTE: for obvious reason this only works on self-monitoring
* processes.
*/
__u32 lock; /* seqlock for synchronization */
__u32 index; /* hardware event identifier */
__s64 offset; /* add to hardware event value */
__u64 time_enabled; /* time event active */
__u64 time_running; /* time event on cpu */
union {
__u64 capabilities;
struct {
__u64 cap_bit0 : 1, /* Always 0, deprec
ated, see commit 860f085b74e9 */
cap_bit0_is_deprecated : 1, /* Always 1, signal
s that bit 0 is zero */
cap_user_rdpmc : 1, /* The RDPMC instru
ction can be used to read counts */
cap_user_time : 1, /* The time_* field
s are used */
cap_user_time_zero : 1, /* The time_zero fi
eld is used */
cap_____res : 59;
};
};
/*
* If cap_user_rdpmc this field provides the bit-width of the value
* read using the rdpmc() or equivalent instruction. This can be used
* to sign extend the result like:
*
* pmc <<= 64 - width;
* pmc >>= 64 - width; // signed shift right
* count += pmc;
*/
__u16 pmc_width;
/*
* If cap_usr_time the below fields can be used to compute the time
* delta since time_enabled (in ns) using rdtsc or similar.
*
* u64 quot, rem;
* u64 delta;
*
* quot = (cyc >> time_shift);
* rem = cyc & ((1 << time_shift) - 1);
* delta = time_offset + quot * time_mult +
* ((rem * time_mult) >> time_shift);
*
* Where time_offset,time_mult,time_shift and cyc are read in the
* seqcount loop described above. This delta can then be added to
* enabled and possible running (if index), improving the scaling:
*
* enabled += delta;
* if (index)
* running += delta;
*
* quot = count / running;
* rem = count % running;
* count = quot * enabled + (rem * enabled) / running;
*/
__u16 time_shift;
__u32 time_mult;
__u64 time_offset;
/*
* If cap_usr_time_zero, the hardware clock (e.g. TSC) can be calculated
* from sample timestamps.
*
* time = timestamp - time_zero;
* quot = time / time_mult;
* rem = time % time_mult;
* cyc = (quot << time_shift) + (rem << time_shift) / time_mult;
*
* And vice versa:
*
* quot = cyc >> time_shift;
* rem = cyc & ((1 << time_shift) - 1);
* timestamp = time_zero + quot * time_mult +
* ((rem * time_mult) >> time_shift);
*/
__u64 time_zero;
__u32 size; /* Header size up to __reserved[] fields
. */
/*
* Hole for extension of the self monitor capabilities
*/
__u8 __reserved[118*8+4]; /* align to 1k. */
/*
* Control data for the mmap() data buffer.
*
* User-space reading the @data_head value should issue an smp_rmb(),
* after reading this value.
*
* When the mapping is PROT_WRITE the @data_tail value should be
* written by userspace to reflect the last read data, after issueing
* an smp_mb() to separate the data read from the ->data_tail store.
* In this case the kernel will not over-write unread data.
*
* See perf_output_put_handle() for the data ordering.
*/
__u64 data_head; /* head in the data section */
__u64 data_tail; /* user-space written tail */
};
The libpfm-4.7 has revised the structure to match up with the kernel. Rebuilding papi with libpfm-4.7 enable papi to identify the rdpmc support. However, there is a comment in perf_event.c:_pe_init_component() stating that using rdpmc doesn't speed things up and it doesn't change the behavior of papi: /* Detect if we can use rdpmc (or equivalent) */ /* We currently do not use rdpmc as it is slower in tests */ /* than regular read (as of Linux 3.5) A quick comparison between runs of papi_cost built with libpfm-4.7 and libpfm-4.4 shows pretty similar results: papi built with libpfm-4.7: $ ./papi_cost Cost of execution for PAPI start/stop, read and accum. This test takes a while. Please be patient... ibwarn: [20461] umad_init: can't read ABI version from /sys/class/infiniband_mad/abi_version (No such file or directory): is ib_umad module loaded? Performing loop latency test... Total cost for loop latency over 1000000 iterations min cycles : 30 max cycles : 24756 mean cycles : 43.638454 std deviation: 69.410008 Performing start/stop test... Total cost for PAPI_start/stop (2 counters) over 1000000 iterations min cycles : 9352 max cycles : 2051884 mean cycles : 9719.635600 std deviation: 2259.563806 Performing read test... Total cost for PAPI_read (2 counters) over 1000000 iterations min cycles : 2510 max cycles : 92982 mean cycles : 2571.104060 std deviation: 311.403011 Performing read with timestamp test... Total cost for PAPI_read_ts (2 counters) over 1000000 iterations min cycles : 2520 max cycles : 27975 mean cycles : 2584.767873 std deviation: 265.900891 Performing accum test... Total cost for PAPI_accum (2 counters) over 1000000 iterations min cycles : 3688 max cycles : 90314 mean cycles : 3776.048480 std deviation: 418.108075 Performing reset test... Total cost for PAPI_reset (2 counters) over 1000000 iterations min cycles : 1088 max cycles : 42866 mean cycles : 1135.465613 std deviation: 209.894743 Performing DERIVED_POSTFIX PAPI_read(3 counters) test... Total cost for PAPI_read (1 derived_postfix counter) over 1000000 iterations min cycles : 3054 max cycles : 108573 mean cycles : 3138.790124 std deviation: 389.003901 Performing DERIVED_[ADD|SUB] PAPI_read(2 counters) test... Total cost for PAPI_read (1 derived_[add|sub] counter) over 1000000 iterations min cycles : 3528 max cycles : 34953 mean cycles : 3637.639911 std deviation: 394.624337 cost.c PASSED Try again with libpfm-4.4 $ ./papi_cost Cost of execution for PAPI start/stop, read and accum. This test takes a while. Please be patient... ibwarn: [31917] umad_init: can't read ABI version from /sys/class/infiniband_mad/abi_version (No such file or directory): is ib_umad module loaded? Performing loop latency test... Total cost for loop latency over 1000000 iterations min cycles : 30 max cycles : 25152 mean cycles : 42.540122 std deviation: 42.793157 Performing start/stop test... Total cost for PAPI_start/stop (2 counters) over 1000000 iterations min cycles : 9292 max cycles : 74740 mean cycles : 9645.142076 std deviation: 962.684534 Performing read test... Total cost for PAPI_read (2 counters) over 1000000 iterations min cycles : 2508 max cycles : 72816 mean cycles : 2569.350047 std deviation: 333.029009 Performing read with timestamp test... Total cost for PAPI_read_ts (2 counters) over 1000000 iterations min cycles : 2522 max cycles : 30132 mean cycles : 2613.603685 std deviation: 325.067950 Performing accum test... Total cost for PAPI_accum (2 counters) over 1000000 iterations min cycles : 3674 max cycles : 70812 mean cycles : 3748.156739 std deviation: 401.354167 Performing reset test... Total cost for PAPI_reset (2 counters) over 1000000 iterations min cycles : 1080 max cycles : 23826 mean cycles : 1117.220298 std deviation: 193.372576 Performing DERIVED_POSTFIX PAPI_read(3 counters) test... Total cost for PAPI_read (1 derived_postfix counter) over 1000000 iterations min cycles : 3034 max cycles : 48381 mean cycles : 3144.853020 std deviation: 317.766676 Performing DERIVED_[ADD|SUB] PAPI_read(2 counters) test... Total cost for PAPI_read (1 derived_[add|sub] counter) over 1000000 iterations min cycles : 3524 max cycles : 63168 mean cycles : 3625.259589 std deviation: 322.423673 cost.c PASSED Email was sent to the papi development list to better understand why the rdpmc wasn't currently being used. One of the PAPI developers, Vince Weaver responded with: On Mon, 28 Mar 2016, William Cohen wrote: > /* Detect if we can use rdpmc (or equivalent) */ > /* We currently do not use rdpmc as it is slower in tests */ > /* than regular read (as of Linux 3.5) */ > The rdpmc would be limited to speeding up reading of the counters for > self monitoring tasks, but that could be useful in some cases. Is there > a discussion about the experiments that lead to the determination that > using the rdpmc is slower? I didn't find anything on the mailing list. there are various complications on the rdpmc support, I am currently working on getting it implemented. In order for rdpmc support to be a win the mmap() support has to also be fast (and you have to do many reads in succession). It probably will be a net win in the end but it sometimes requires some extra flags to mmap(). If you're interested in the details that led to the comment in the source code there's a paper I wrote about it: Self-monitoring Overhead of the Linux perf_event Performance Counter Interface http://web.eece.maine.edu/~vweaver/projects/perf_events/overhead/ Vince Okey then, let's consider this as VERIFIED. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-2395.html |