Bug 1362283
| Summary: | [RHELSA-7.3] perf-trace-test: process attach failed | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | PaulB <pbunyan> |
| Component: | kernel-aarch64 | Assignee: | Christopher Covington <cov> |
| kernel-aarch64 sub component: | Platform Enablement | QA Contact: | Jeff Bastian <jbastian> |
| Status: | CLOSED NOTABUG | Docs Contact: | |
| Severity: | medium | ||
| Priority: | medium | CC: | ernunes, jbastian, jfeeney, mlangsdo, mpetlan, pbunyan |
| Version: | 7.3 | ||
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | aarch64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2016-08-09 17:57:58 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
PaulB
2016-08-01 19:55:58 UTC
Cov, We've only reproduced this on Hawker. Can you look into it, please? Are the actual log files like /mnt/tests/tools/perf/Sanity/basic-test-aarch64/LOGS/trace/test_basic.sh.log available anywhere? Yes, they're in the Beaker job: https://beaker.engineering.redhat.com/recipes/2900823#task43526204 Click the "+3" link next to taskout.log and you'll see 3 more logs including logs.tar.bz2. Here's a direct link: https://beaker.engineering.redhat.com/recipes/2900823/tasks/43526204/logs/logs.tar.bz2 Extract that and you'll find trace/test_basic.sh.log Michael, can you take a look at this perf failure too? Thanks! Manually re-running the task, I get the following basic_attach.log
0.000 ( 0.000 ms): ... [continued]: nanosleep()) = 0
0.047 ( 0.011 ms): close(fd: 1 ) = 0
0.064 ( 0.008 ms): close(fd: 2 ) = 0
0.073 ( 0.000 ms): exit_group(
(In reply to Christopher Covington from comment #7) > Manually re-running the task, I get the following basic_attach.log > > 0.000 ( 0.000 ms): ... [continued]: nanosleep()) = 0 > 0.047 ( 0.011 ms): close(fd: 1 > ) = 0 > 0.064 ( 0.008 ms): close(fd: 2 > ) = 0 > 0.073 ( 0.000 ms): exit_group( The test does not fail with this log, does it. This test should attach perf trace to a sleep command and check whether all the syscalls are present. The "*sleep" one needs to be there, since it is finished when perf has been already attached. The "exit*" one should appear there and be unfinished. I am not sure whether "close" needs to be there too on all architectures, however, it seems to always be. Regexp not found: "sleep" Regexp not found: "close" Regexp not found: "exit" -- [ FAIL ] -- perf_trace :: test_basic :: attach process (command exitcode + output regexp parsing) ^^^ This failure means that none of the three was found. Also, it means that perf exited with non-zero exitcode, so maybe the attach failed, or something else happened to it. However, stderr is not redirected anywhere in this particular testcase, so the fact we don't see anything in the logs, is a bit strange. Can you reproduce the issue manually on that box? Just by running: sleep 10 & perf trace -p $! -o trace.log You also might try adding verbosity level by '-vv' option. This is how it looks like on an aarch64 box when it passes, so you may compare: ------------------------------------------------------------ perf_event_attr: type 2 size 112 config 0x11 { sample_period, sample_freq } 1 sample_type IP|TID|TIME|ID|CPU|PERIOD|RAW read_format ID disabled 1 mmap 1 comm 1 task 1 sample_id_all 1 exclude_guest 1 mmap2 1 comm_exec 1 { wakeup_events, wakeup_watermark } 1 ------------------------------------------------------------ sys_perf_event_open: pid 1944 cpu -1 group_fd -1 flags 0x8 ------------------------------------------------------------ perf_event_attr: type 2 size 112 config 0x10 { sample_period, sample_freq } 1 sample_type IP|TID|TIME|ID|CPU|PERIOD|RAW read_format ID disabled 1 sample_id_all 1 exclude_guest 1 { wakeup_events, wakeup_watermark } 1 ------------------------------------------------------------ sys_perf_event_open: pid 1944 cpu -1 group_fd -1 flags 0x8 mmap size 589824B perf event ring buffer mmapped per thread You may also try to compare perf trace against strace, just attach strace to an already runnining sleep command. (In reply to Michael Petlan from comment #8) > Can you reproduce the issue manually on that box? Just by running: > > sleep 10 & > perf trace -p $! -o trace.log [root@qualcomm-hawker-01 ~]# uname -rn qualcomm-hawker-01.ml3.eng.bos.redhat.com 4.5.0-3.el7.aarch64 [root@qualcomm-hawker-01 ~]# sleep 10 & [1] 13943 [root@qualcomm-hawker-01 ~]# perf trace -p $! -o trace.log [1]+ Done sleep 10 [root@qualcomm-hawker-01 ~]# cat trace.log 0.000 ( 0.000 ms): ... [continued]: nanosleep()) = 0 0.130 ( 0.025 ms): close(fd: 1 ) = 0 0.169 ( 0.018 ms): close(fd: 2 ) = 0 0.192 ( 0.000 ms): exit_group( ########################################################################### [root@qualcomm-hawker-01 ~]# sleep 10 & [1] 14100 [root@qualcomm-hawker-01 ~]# perf trace -vv -p $! -o trace.log ------------------------------------------------------------ perf_event_attr: type 2 size 112 config 0x11 { sample_period, sample_freq } 1 sample_type IP|TID|TIME|ID|CPU|PERIOD|RAW read_format ID disabled 1 mmap 1 comm 1 task 1 sample_id_all 1 exclude_guest 1 mmap2 1 comm_exec 1 { wakeup_events, wakeup_watermark } 1 ------------------------------------------------------------ sys_perf_event_open: pid 14100 cpu -1 group_fd -1 flags 0x8 ------------------------------------------------------------ perf_event_attr: type 2 size 112 config 0x10 { sample_period, sample_freq } 1 sample_type IP|TID|TIME|ID|CPU|PERIOD|RAW read_format ID disabled 1 sample_id_all 1 exclude_guest 1 { wakeup_events, wakeup_watermark } 1 ------------------------------------------------------------ sys_perf_event_open: pid 14100 cpu -1 group_fd -1 flags 0x8 mmap size 589824B perf event ring buffer mmapped per thread [1]+ Done sleep 10 [root@qualcomm-hawker-01 ~]# cat trace.log 0.000 ( 0.000 ms): ... [continued]: nanosleep()) = 0 0.113 ( 0.025 ms): close(fd: 1 ) = 0 0.151 ( 0.017 ms): close(fd: 2 ) = 0 0.173 ( 0.000 ms): exit_group( ########################################################################### [root@qualcomm-hawker-01 ~]# sleep 10 & [1] 14188 [root@qualcomm-hawker-01 ~]# strace -p $! Process 14188 attached restart_syscall(<... resuming interrupted call ...>) = 0 close(1) = 0 close(2) = 0 exit_group(0) = ? +++ exited with 0 +++ [1]+ Done sleep 10 The /tools/perf/Sanity/basic-test-aarch64 task runs a much shorter 'sleep 0.1', and when I try that it fails like this: [root@qualcomm-hawker-01 ~]# sleep 0.1 & perf trace -vv -p $! -o trace.log [1] 32040 couldn't open /proc/32040/status [1]+ Done sleep 0.1 [root@qualcomm-hawker-01 ~]# cat trace.log Problems initializing symbol libraries! I get the same failure with a sleep of 0.1 on x86_64 with Fedora 24. I cloned the /tools/perf/Sanity/basic-test-aarch64 task to a personal git repo and modified the sleep times to be a bit longer: -export CMD_BASIC_SLEEP="sleep 0.1" -export CMD_QUICK_SLEEP="sleep 0.01" -export CMD_LONGER_SLEEP="sleep 2" +export CMD_BASIC_SLEEP="sleep 10" +export CMD_QUICK_SLEEP="sleep 1" +export CMD_LONGER_SLEEP="sleep 20" See http://git.engineering.redhat.com/git/users/jbastian/perf-tests.git/commit/?id=36e6284a1dc7 This time, the test passed with no reg-exp failures: [root@qualcomm-hawker-01 trace]# cat test_basic.sh.log -- [ SKIP ] -- perf_trace :: test_basic :: help message :: testcase skipped -- [ PASS ] -- perf_trace :: test_basic :: basic execution -- [ PASS ] -- perf_trace :: test_basic :: duration threshold -- [ PASS ] -- perf_trace :: test_basic :: systemwide -- [ PASS ] -- perf_trace :: test_basic :: full timestamp -- [ PASS ] -- perf_trace :: test_basic :: summary -- [ PASS ] -- perf_trace :: test_basic :: attach process ## [ PASS ] ## perf_trace :: test_basic SUMMARY The test took about 15 minutes to run with this modification which is about the same as the original version. https://beaker.engineering.redhat.com/recipes/2944129#task44133904 You mean that the test runs for the same time when the sleep commands are 10 times longer, don't you? AFAIK the test should take less time than 15 minutes... It could be 15 minutes if the full tracepoint check was enabled. --- However, I would just replace the CMD_BASIC_SLEEP by CMD_LONGER_SLEEP in that particular place. I think that should be sufficient. Ideas? A strange thing is that the "sleep 0.1" was always OK and it has stopped working quite recently. That could point on a performance flaw somewhere. Oh, I'm sorry, I was reading the timestamps incorrectly yesterday. It took 5 minutes in the original version, but 15 minutes with the longer sleep times. Original: https://beaker.engineering.redhat.com/recipes/2900811#task43525827 Start: +05:27:08 End: +05:32:22 Elapsed: 5m 14s Modified: https://beaker.engineering.redhat.com/recipes/2944129#task44133904 Start: +00:37:40 End: +00:52:19 Elapsed: 14m 39s I will try again with CMD_LONGER_SLEEP instead of the 10x longer sleeps. I tried it again with CMD_LONGER_SLEEP (and original sleep values) and it took 10 minutes to run:
Task: https://beaker.engineering.redhat.com/recipes/2948217#task44168770
Start: +00:33:20
End: +00:43:12
Elapsed: 9m 52s
(In reply to Jeff Bastian from comment #16) > I tried it again with CMD_LONGER_SLEEP (and original sleep values) and it > took 10 minutes to run: Oh, and the most important thing: it was successful! Instead of attaching to a background process, why not launch the process with perf? This works much better:
[root@qualcomm-hawker-02 ~]# perf trace -o trace.log sleep 0.1
[root@qualcomm-hawker-02 ~]# cat trace.log
0.790 ( 0.026 ms): sleep/32174 brk( ) = 0x16a20000
0.956 ( 0.051 ms): sleep/32174 faccessat(dfd: CWD, filename: 0x847ca1a0, mode: 4 ) = -1 ENOENT No such file or directory
1.028 ( 0.048 ms): sleep/32174 openat(dfd: CWD, filename: 0x847ca048, flags: CLOEXEC ) = 4
1.064 ( 0.020 ms): sleep/32174 newfstat(fd: 4, statbuf: 0x3ffcf899c10 ) = 0
1.138 ( 0.094 ms): sleep/32174 ... [continued]: mmap()) = 0x3ff84780000
1.171 ( 0.017 ms): sleep/32174 close(fd: 4 ) = 0
1.336 ( 0.126 ms): sleep/32174 openat(dfd: CWD, filename: 0x847e2228, flags: CLOEXEC ) = 4
1.383 ( 0.031 ms): sleep/32174 read(fd: 4, buf: 0x3ffcf899e00, count: 832 ) = 832
1.419 ( 0.018 ms): sleep/32174 newfstat(fd: 4, statbuf: 0x3ffcf899c90 ) = 0
1.489 ( 0.051 ms): sleep/32174 mmap(arg0: 0, arg1: 1528816, arg2: 5, arg3: 2050, arg4: 4, arg5: 0 ) = 0x3ff84600000
1.580 ( 0.076 ms): sleep/32174 mmap(arg0: 4395973869568, arg1: 131072, arg2: 3, arg3: 2066, arg4: 4, arg5: 1376256) = 0x3ff84760000
1.745 ( 0.023 ms): sleep/32174 close(fd: 4 ) = 0
1.988 ( 0.065 ms): sleep/32174 mprotect(start: 0x3ff84760000, len: 65536, prot: READ ) = 0
2.069 ( 0.047 ms): sleep/32174 mprotect(start: 0x410000, len: 65536, prot: READ ) = 0
2.147 ( 0.048 ms): sleep/32174 mprotect(start: 0x3ff847d0000, len: 65536, prot: READ ) = 0
2.208 ( 0.046 ms): sleep/32174 munmap(addr: 0x3ff84780000, len: 36257 ) = 0
2.658 ( 0.021 ms): sleep/32174 brk( ) = 0x16a20000
2.709 ( 0.038 ms): sleep/32174 brk(brk: 0x16a50000 ) = 0x16a50000
2.740 ( 0.015 ms): sleep/32174 brk( ) = 0x16a50000
2.850 ( 0.061 ms): sleep/32174 openat(dfd: CWD, filename: 0x84733e10, flags: CLOEXEC ) = 4
2.902 ( 0.022 ms): sleep/32174 newfstat(fd: 4, statbuf: 0x3ff84771848 ) = 0
2.961 ( 0.045 ms): sleep/32174 mmap(arg0: 0, arg1: 109675168, arg2: 1, arg3: 2, arg4: 4, arg5: 0 ) = 0x3ff7dd60000
3.020 ( 0.018 ms): sleep/32174 close(fd: 4 ) = 0
103.384 (100.170 ms): sleep/32174 nanosleep(rqtp: 0x3ffcf89a928 ) = 0
103.478 ( 0.027 ms): sleep/32174 close(fd: 1 ) = 0
103.523 ( 0.021 ms): sleep/32174 close(fd: 2 ) = 0
103.549 ( 0.000 ms): sleep/32174 exit_group(
(In reply to Jeff Bastian from comment #18) > Instead of attaching to a background process, why not launch the process > with perf? Because the particular test tests the perf-trace's ability to attach something already running. Testsuite fixed upstream by the following commit [1], which has been also backported to both RHEL versions. This makes the testsuite run longer by 2 seconds, since the CMD_LONGER_SLEEP is used only where it is really necessary. And, the most important thing, it should pass next time... ;-)
I think we might close this bug as NOTABUG, thoughts?
-------------------------------------------------------------
[1] commit f86cc2ad1609723a3ae5fe8419f465be3082516e
Author: Michael Petlan <mpetlan>
Date: Tue Aug 9 19:22:54 2016 +0200
base_trace/test_basic.sh: fix the trace attach test to be more robust
When trying to attach to $CMD_BASIC_SLEEP, there may be too long latency,
so the sleep command does not run or is being finished when perf starts.
Thus, the testcase often fails on some environments.
That's why we need $CMD_LONGER_SLEEP instead.
diff --git a/base_trace/test_basic.sh b/base_trace/test_basic.sh
index 91a3593..9bcf036 100755
--- a/base_trace/test_basic.sh
+++ b/base_trace/test_basic.sh
@@ -112,7 +112,7 @@ print_results $PERF_EXIT_CODE $CHECK_EXIT_CODE "summary"
### attach process
# perf-trace should be able to attach an existing process by '-p PID'
-$CMD_BASIC_SLEEP &
+$CMD_LONGER_SLEEP &
$CMD_PERF trace -p $! -o $LOGS_DIR/basic_attach.log
PERF_EXIT_CODE=$?
(In reply to Michael Petlan from comment #19) > (In reply to Jeff Bastian from comment #18) > > Instead of attaching to a background process, why not launch the process > > with perf? > > Because the particular test tests the perf-trace's ability to attach > something already running. Oh... I guess that's a pretty good reason! Sorry for the stupid question. :) (In reply to Michael Petlan from comment #20) > I think we might close this bug as NOTABUG, thoughts? Agreed. It's been working well for me with the longer sleep time. We can re-open it if it still occurs with this change. |