Bug 1320790

Summary: Version mismatch of Kernel and papi installed by yum.
Product: Red Hat Enterprise Linux 7 Reporter: satoken <ken.sato.ty>
Component: papiAssignee: William Cohen <wcohen>
Status: CLOSED ERRATA QA Contact: Michael Petlan <mpetlan>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 7.2CC: mbenitez, mcermak, mpetlan, ohudlick, wcohen
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: papi-5.2.0-15.el7 Doc Type: Bug Fix
Doc Text:
Cause: There was a mismatch in the layout of the perf data structure used by the kernel and papi. Consequence: papi would not detect that the kernel supported userspace reads of the performance counters using the rdpmc instruction. Fix: Rebuilt papi with libpfm-4.7 that supplies perf data structure layout that match what the kernel uses. Result: PAPI now correctly identifies when the kernel support the fast userspace rdpmc. However, PAPI does not currently use the rdpmc mechanism to read the counters.
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-11-04 05:18:08 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description satoken 2016-03-24 02:30:09 UTC
Description of problem:
When install papi by yum with default repository, I can get PAPI-5.2.0.
But, There is a difference in the definition of 'struct perf_event_mmap_page' between Kernel and PAPI-5.2.0.
So, when initialize PAPI, PAPI always determined that RDPMC command is disable.

Version-Release number of selected component (if applicable):
PAPI-5.2.0

How reproducible:
Install papi by yum with default repository.
Or check the definition of 'struct perf_event_mmap'.
Kernel : (ex:)linux-3.10.0-327.4.5.el7/include/uapi/linux/perf_event.h
PAPI   : PAPI-5.2.0/libpfm4/include/perfmon/perf_event.h

Steps to Reproduce:
1.Check the definition of 'struct perf_event_mmap'
2.
3.

Actual results:
PAPI always dosen't use RDPMC

Expected results:
If RDPMC command is available, PAPI uses RDPMC command.

Additional info:
Nothing.

Comment 2 William Cohen 2016-03-28 19:50:23 UTC
RHEL 7 version of papi isn't using the libpfm bundled in papi.  It should be using the unbundled libpfm-4.4 in RHEL7.  However, there does seems to be a difference in /usr/include/perfmon/perf_event.h and /usr/include/linux/perf_eventh.h definitions of perf_event_mmap_page.  The bitfields in struct perf_event_mmap_page definitions do not line up and the cap_usr_rdpmc bitfield is read as 0.

libpfm4 version /usr/include/perfmon/perf_event.h

struct perf_event_mmap_page {
	uint32_t	version;
	uint32_t	compat_version;
	uint32_t	lock;
	uint32_t	index;
	int64_t		offset;
	uint64_t	time_enabled;
	uint64_t	time_running;
	union {
		uint64_t capabilities;
		uint64_t cap_usr_time:1,
			 cap_usr_rdpmc:1,
			 cap_____res:62;
	} SWIG_NAME(rdmap_cap);
	uint16_t	pmc_width;
	uint16_t	time_shift;
	uint32_t	time_mult;
	uint64_t	time_offset;

	uint64_t	__reserved[120];
	uint64_t  	data_head;
	uint64_t	data_tail;
};



kernel /usr/include/linux/perf_event.h:

struct perf_event_mmap_page {
	__u32	version;		/* version number of this structure */
	__u32	compat_version;		/* lowest version this is compat with */

	/*
	 * Bits needed to read the hw events in user-space.
	 *
	 *   u32 seq, time_mult, time_shift, index, width;
	 *   u64 count, enabled, running;
	 *   u64 cyc, time_offset;
	 *   s64 pmc = 0;
	 *
	 *   do {
	 *     seq = pc->lock;
	 *     barrier()
	 *
	 *     enabled = pc->time_enabled;
	 *     running = pc->time_running;
	 *
	 *     if (pc->cap_usr_time && enabled != running) {
	 *       cyc = rdtsc();
	 *       time_offset = pc->time_offset;
	 *       time_mult   = pc->time_mult;
	 *       time_shift  = pc->time_shift;
	 *     }
	 *
	 *     index = pc->index;
	 *     count = pc->offset;
	 *     if (pc->cap_user_rdpmc && index) {
	 *       width = pc->pmc_width;
	 *       pmc = rdpmc(index - 1);
	 *     }
	 *
	 *     barrier();
	 *   } while (pc->lock != seq);
	 *
	 * NOTE: for obvious reason this only works on self-monitoring
	 *       processes.
	 */
	__u32	lock;			/* seqlock for synchronization */
	__u32	index;			/* hardware event identifier */
	__s64	offset;			/* add to hardware event value */
	__u64	time_enabled;		/* time event active */
	__u64	time_running;		/* time event on cpu */
	union {
		__u64	capabilities;
		struct {
			__u64	cap_bit0		: 1, /* Always 0, deprec
ated, see commit 860f085b74e9 */
				cap_bit0_is_deprecated	: 1, /* Always 1, signal
s that bit 0 is zero */

				cap_user_rdpmc		: 1, /* The RDPMC instru
ction can be used to read counts */
				cap_user_time		: 1, /* The time_* field
s are used */
				cap_user_time_zero	: 1, /* The time_zero fi
eld is used */
				cap_____res		: 59;
		};
	};

	/*
	 * If cap_user_rdpmc this field provides the bit-width of the value
	 * read using the rdpmc() or equivalent instruction. This can be used
	 * to sign extend the result like:
	 *
	 *   pmc <<= 64 - width;
	 *   pmc >>= 64 - width; // signed shift right
	 *   count += pmc;
	 */
	__u16	pmc_width;

	/*
	 * If cap_usr_time the below fields can be used to compute the time
	 * delta since time_enabled (in ns) using rdtsc or similar.
	 *
	 *   u64 quot, rem;
	 *   u64 delta;
	 *
	 *   quot = (cyc >> time_shift);
	 *   rem = cyc & ((1 << time_shift) - 1);
	 *   delta = time_offset + quot * time_mult +
	 *              ((rem * time_mult) >> time_shift);
	 *
	 * Where time_offset,time_mult,time_shift and cyc are read in the
	 * seqcount loop described above. This delta can then be added to
	 * enabled and possible running (if index), improving the scaling:
	 *
	 *   enabled += delta;
	 *   if (index)
	 *     running += delta;
	 *
	 *   quot = count / running;
	 *   rem  = count % running;
	 *   count = quot * enabled + (rem * enabled) / running;
	 */
	__u16	time_shift;
	__u32	time_mult;
	__u64	time_offset;
	/*
	 * If cap_usr_time_zero, the hardware clock (e.g. TSC) can be calculated
	 * from sample timestamps.
	 *
	 *   time = timestamp - time_zero;
	 *   quot = time / time_mult;
	 *   rem  = time % time_mult;
	 *   cyc = (quot << time_shift) + (rem << time_shift) / time_mult;
	 *
	 * And vice versa:
	 *
	 *   quot = cyc >> time_shift;
	 *   rem  = cyc & ((1 << time_shift) - 1);
	 *   timestamp = time_zero + quot * time_mult +
	 *               ((rem * time_mult) >> time_shift);
	 */
	__u64	time_zero;
	__u32	size;			/* Header size up to __reserved[] fields
. */

		/*
		 * Hole for extension of the self monitor capabilities
		 */

	__u8	__reserved[118*8+4];	/* align to 1k. */

	/*
	 * Control data for the mmap() data buffer.
	 *
	 * User-space reading the @data_head value should issue an smp_rmb(),
	 * after reading this value.
	 *
	 * When the mapping is PROT_WRITE the @data_tail value should be
	 * written by userspace to reflect the last read data, after issueing
	 * an smp_mb() to separate the data read from the ->data_tail store.
	 * In this case the kernel will not over-write unread data.
	 *
	 * See perf_output_put_handle() for the data ordering.
	 */
	__u64   data_head;		/* head in the data section */
	__u64	data_tail;		/* user-space written tail */
};

Comment 3 William Cohen 2016-03-28 20:33:46 UTC
The libpfm-4.7 has revised the structure to match up with the kernel.  Rebuilding papi with libpfm-4.7 enable papi to identify the rdpmc support.  
However, there is a comment in perf_event.c:_pe_init_component() stating that using rdpmc doesn't speed things up and it doesn't change the behavior of papi:

   /* Detect if we can use rdpmc (or equivalent) */
   /* We currently do not use rdpmc as it is slower in tests */
   /* than regular read (as of Linux 3.5)                

A quick comparison between runs of papi_cost built with libpfm-4.7 and libpfm-4.4 shows pretty similar results:

papi built with libpfm-4.7:
$ ./papi_cost 
Cost of execution for PAPI start/stop, read and accum.
This test takes a while. Please be patient...
ibwarn: [20461] umad_init: can't read ABI version from /sys/class/infiniband_mad/abi_version (No such file or directory): is ib_umad module loaded?

Performing loop latency test...

Total cost for loop latency over 1000000 iterations
min cycles   : 30
max cycles   : 24756
mean cycles  : 43.638454
std deviation: 69.410008
 
Performing start/stop test...

Total cost for PAPI_start/stop (2 counters) over 1000000 iterations
min cycles   : 9352
max cycles   : 2051884
mean cycles  : 9719.635600
std deviation: 2259.563806
 
Performing read test...

Total cost for PAPI_read (2 counters) over 1000000 iterations
min cycles   : 2510
max cycles   : 92982
mean cycles  : 2571.104060
std deviation: 311.403011
 
Performing read with timestamp test...

Total cost for PAPI_read_ts (2 counters) over 1000000 iterations
min cycles   : 2520
max cycles   : 27975
mean cycles  : 2584.767873
std deviation: 265.900891
 
Performing accum test...

Total cost for PAPI_accum (2 counters) over 1000000 iterations
min cycles   : 3688
max cycles   : 90314
mean cycles  : 3776.048480
std deviation: 418.108075
 
Performing reset test...

Total cost for PAPI_reset (2 counters) over 1000000 iterations
min cycles   : 1088
max cycles   : 42866
mean cycles  : 1135.465613
std deviation: 209.894743
 
Performing DERIVED_POSTFIX PAPI_read(3 counters)  test...
Total cost for PAPI_read (1 derived_postfix counter) over 1000000 iterations
min cycles   : 3054
max cycles   : 108573
mean cycles  : 3138.790124
std deviation: 389.003901
 
Performing DERIVED_[ADD|SUB] PAPI_read(2 counters)  test...
Total cost for  PAPI_read (1 derived_[add|sub] counter) over 1000000 iterations
min cycles   : 3528
max cycles   : 34953
mean cycles  : 3637.639911
std deviation: 394.624337
 cost.c                                       PASSED


Try again with libpfm-4.4

$ ./papi_cost
Cost of execution for PAPI start/stop, read and accum.
This test takes a while. Please be patient...
ibwarn: [31917] umad_init: can't read ABI version from /sys/class/infiniband_mad/abi_version (No such file or directory): is ib_umad module loaded?

Performing loop latency test...

Total cost for loop latency over 1000000 iterations
min cycles   : 30
max cycles   : 25152
mean cycles  : 42.540122
std deviation: 42.793157
 
Performing start/stop test...

Total cost for PAPI_start/stop (2 counters) over 1000000 iterations
min cycles   : 9292
max cycles   : 74740
mean cycles  : 9645.142076
std deviation: 962.684534
 
Performing read test...

Total cost for PAPI_read (2 counters) over 1000000 iterations
min cycles   : 2508
max cycles   : 72816
mean cycles  : 2569.350047
std deviation: 333.029009
 
Performing read with timestamp test...

Total cost for PAPI_read_ts (2 counters) over 1000000 iterations
min cycles   : 2522
max cycles   : 30132
mean cycles  : 2613.603685
std deviation: 325.067950
 
Performing accum test...

Total cost for PAPI_accum (2 counters) over 1000000 iterations
min cycles   : 3674
max cycles   : 70812
mean cycles  : 3748.156739
std deviation: 401.354167
 
Performing reset test...

Total cost for PAPI_reset (2 counters) over 1000000 iterations
min cycles   : 1080
max cycles   : 23826
mean cycles  : 1117.220298
std deviation: 193.372576
 
Performing DERIVED_POSTFIX PAPI_read(3 counters)  test...
Total cost for PAPI_read (1 derived_postfix counter) over 1000000 iterations
min cycles   : 3034
max cycles   : 48381
mean cycles  : 3144.853020
std deviation: 317.766676
 
Performing DERIVED_[ADD|SUB] PAPI_read(2 counters)  test...
Total cost for  PAPI_read (1 derived_[add|sub] counter) over 1000000 iterations
min cycles   : 3524
max cycles   : 63168
mean cycles  : 3625.259589
std deviation: 322.423673
 cost.c                                       PASSED

Comment 4 William Cohen 2016-03-28 21:08:59 UTC
Email was sent to the papi development list to better understand why the rdpmc wasn't currently being used.  One of the PAPI developers, Vince Weaver responded with:

On Mon, 28 Mar 2016, William Cohen wrote:

>    /* Detect if we can use rdpmc (or equivalent) */
>    /* We currently do not use rdpmc as it is slower in tests */
>    /* than regular read (as of Linux 3.5)                    */

> The rdpmc would be limited to speeding up reading of the counters for 
> self monitoring tasks, but that could be useful in some cases. Is there 
> a discussion about the experiments that lead to the determination that 
> using the rdpmc is slower?  I didn't find anything on the mailing list.

there are various complications on the rdpmc support, I am currently 
working on getting it implemented.

In order for rdpmc support to be a win the mmap() support has to also be 
fast (and you have to do many reads in succession).  It probably will be a 
net win in the end but it sometimes requires some extra flags to mmap().

If you're interested in the details that led to the comment in the source 
code there's a paper I wrote about it:
	Self-monitoring Overhead of the Linux perf_event Performance 
	Counter Interface
	http://web.eece.maine.edu/~vweaver/projects/perf_events/overhead/

Vince

Comment 7 Michael Petlan 2016-06-22 16:06:40 UTC
Okey then, let's consider this as VERIFIED.

Comment 9 errata-xmlrpc 2016-11-04 05:18:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2395.html