Bug 1115953
| Field | Value |
|---|---|
| Summary: | Failed test cases |
| Product: | Red Hat Enterprise Linux 6 |
| Component: | pcp |
| Version: | 6.6 |
| Status: | CLOSED ERRATA |
| Reporter: | Miloš Prchlík <mprchlik> |
| Assignee: | Nathan Scott <nathans> |
| QA Contact: | Miloš Prchlík <mprchlik> |
| CC: | brolley, fche, mbenitez, scox, tlavigne |
| Severity: | unspecified |
| Priority: | unspecified |
| Target Milestone: | rc |
| Target Release: | --- |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Fixed In Version: | pcp-3.9.4-5.el6 |
| Doc Type: | Bug Fix |
| Type: | Bug |
| Last Closed: | 2014-10-14 06:44:33 UTC |
Description
Miloš Prchlík
2014-07-03 11:50:27 UTC
Hi Miloš,

Thanks for the test runs and great initial analysis. My findings very much mirror yours ... here are the notes I made as I went through the results. If I could get a reserved machine where the tests have been run, that'd be extremely helpful for verifying the fixes (and the nspr/valgrind workaround) I have so far. On the whole, the only one that has me really mystified at the moment is 061 - I'll dig deeper (test machine access will really help there). I've not yet looked into those platform-specific failures ... will do so shortly. Oh - on the missing packages question - we could add avahi and avahi-tools into the harness-installed set, and some more tests will begin to run (and one failing test will begin to pass). cheers.

061:
-logputresult: pdubuf=ADDR input len=NNN output len=124 posn=NNN
+logputresult: pdubuf=ADDR input len=NNN output len=144 posn=NNN
... need access to a machine with this problem, not reproducible here (I'm on an earlier RHEL6 though, so not apples-to-apples of course). Very odd, needs more investigation on my part.

128: join(1) finds the output of sort(1) two lines earlier to be unsorted. :P ... need access to a machine with this problem, not reproducible here. Hopefully the fix is as simple as giving the join invocation a locale matching that used by sort.

283: Avahi diagnostics in pmcd.log - probably we lack a running Avahi daemon. This warning message should be under a pmDebug guard - will fix.

386, 388, 391, 393, 395, 400, 405, 459, 490, 513, 526, 719, 720, 783, 875, 957: Valgrind finding a memory leak (PR_OpenFile/malloc - yeah, probable library init routine problem - thanks for the analysis & opening bug #1115952!).

600: Probable test script issue, daylight savings XYZ vs ABC whacko filtering issue. Will dig deeper; I need to take more time to understand the test a bit more though.

928: Test script issue - assumptions about the test machine timezone in the output. Odd that none of the other PCP folks have reported this outside .au ... oh well.
943: Probable test script issue, not catering for a lack of some per-process cgroups instrumentation in the kernel? ... weird though, I don't see this on earlier RHEL6 variants. Access to a machine with this problem would be good.

1006, 1008, 1011, 1014, 1016, 1021, 1022: Test script issue (well, build issue) - missing logs, fixed in subsequent pcp releases already, needs backport.

Hi Miloš,
I've included my notes below, from going through that last set of results.
Where there is a git commit ID next to the test#, that issue was fixed and included in a build from last week - pcp-3.9.4-4. That build has been moved through erratatool and it'd be great if we could get a new set of results and machine reservations if possible, to analyse the remaining issues (of which there will still be a handful - see below).
I did not observe the initial-primary-pmlogger-not-restarting behaviour you saw - so not sure what advice to offer there yet, sorry. :(
I found TESTOUT.log contains this output in a couple of spots:
/mnt/tests/CoreOS/pcp/Library/pcpcommon/lib.sh: line 98: rlIsRHEL7: command not found
... which is coming from the QE harness - I'll attach a patch here that I think resolves this, based on my relatively limited understanding. :)
I'd appreciate your thoughts on the ppc64 python issue too btw (test 972) - this may even be a ppc64-specific python or compiler bug perhaps? Not sure.
Anyway, my short-hand notes follow:
> All arches with very similar diff XXX.out vs XXX.out.bad:
- 943, 1006, 1008, 1011, 1014, 1015, 1016, 1021, 1022
061 - cdfa900f - log preamble non-deterministic size
600 - TODO (timezones) - passing locally though.
128 - d4a9e1e1 - sort vs join locale
283 - 39c9e564 - avahi_entry_group_new diagnostic
928 - c72755d3 - assumed au timezone
943 - 9279ce5b - handle empty cgroups procfs file
395,400 - 07423d8e - valgrind vs local ctx sample_init
-> PPC64 -> const char * strcpy (of size 3) -> libc accesses 8 bytes
1006, 1008, 1011, 1014, 1015, 1016, 1021, 1022 - 1408de56, 5e650b20, bc21c577, 8bfbd481, a875d38a, 81785d88, 0dcb2e77, e2675b31 - missing archives
386, 388, 391, 393, 395, 400, 405, 459, 490, 513, 526, 719, 720, 783, 875, 957 - fd6c591c - valgrind vs NSPR libs
x86_64: 018, 136, 314, 457
018 (passed - must be racy)
136 - long hostname might be causing log offset weirdness? (hp-ms-01-c35.moonshot1.lab.eng.rdu.redhat.com - pmcd.pmlogger.pmcd_host)
314 - 761e8a0 - taking a while to rotate logs, causing extra messages
457 - race in pmdalogger-rotation test, I think; for some reason more likely here? - test needs work, most likely
s390x: 034, 088, 136, 178, 515, 581, 651
034 (passed)
088 (passed)
136 (passed)
178 (passed)
(*) 515 - pmie rising/fallout test - missing some output? - test has timing dependence (pmie sampling on 0.4s interval, vs 2sec sleeps)
581 (passed)
651 (passed)
i686: 093, 457, 581, 713, 725, 741
093 (passed)
457 - exactly as for x86_64 (pmdalogger)
581 (passed)
581 [notrun - fips]
725 - ed7e935d - broken test, not handling 32-bit kernel ulong correctly
741 - not sure - a 2nd "local:" connection failed with "No route to host"?
(python APIs)
ppc64: 136, 177, 214, 428, 457, 972
136 (passed)
177 (passed)
214 (notrun?)
428 (passed)
457 - exactly as for x86_64 (pmdalogger)
972 - python ppc64-specific 2.6.6 bug??? (passes on RHEL7 with 2.7+)
problem is related to unsigned long long conversion via PyArg_Parse:
< value 50933760 -> 0x3093000
> value 814940160 -> 0x30930000
(looks like conversions off by 4 bits???)
(gdb) p /x (0x30930000 >> 4)
$1 = 0x3093000
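The 4-bit shift in that gdb output lines up exactly with the ppc64 page size: 0x3093000 is 12435 pages of 4096 bytes (the size the test assumed), while 0x30930000 is the same 12435 pages of 65536 bytes. A quick shell check of the arithmetic, using the values quoted above:

```shell
# 12435 pages * 4096-byte pages = the value the test expected
echo $((12435 * 4096))      # 50933760  (0x3093000)
# 12435 pages * 65536-byte ppc64 pages = the value actually produced
echo $((12435 * 65536))     # 814940160 (0x30930000)
# the page-size ratio is 16 = 2^4, i.e. the apparent "off by 4 bits"
echo $((65536 / 4096))      # 16
```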
Created attachment 919500 [details]
Resolve errors from QE test harness
(In reply to Nathan Scott from comment #3)
> Hi Miloš,
>
> I've included my notes below, from going through that last set of results.
> Where there is a git commit ID next to the test#, that issue was fixed and
> included in a build from last week - pcp-3.9.4-4. That build has been moved
> through erratatool and it'd be great if we could get a new set of results
> and machine reservations if possible, to analyse the remaining issues (of
> which there will still be a handful - see below).

New test run is ready, and results are available here: [1] Machines are reserved and accessible as usual.

- pcp-testsuite-light aka "-g sanity" failed to start on all but one machine, with the already mentioned "Starting pmlogger ... Arrgghhh ... pmlogger (pid=-P) failed to start after 20 seconds" error. However, when I logged in and ran this one manually, everything works nicely - pmlogger starts and plays nicely with the others. All instances of "-g sanity" passed except for one test case on s390x (ibm-z10-70.rhts.eng.bos.redhat.com):

vvvvvvvvv
--- /var/lib/pcp/testsuite/034.out 2014-07-15 01:37:15.000000000 -0400
+++ /var/lib/pcp/testsuite/034.out.bad 2014-07-30 03:41:26.383529178 -0400
@@ -6,8 +6,8 @@
 900 msec requested: no. of samples within range, avg within 10% of requested delay
 === src/mv-bar ===
-100 msec requested: no. of samples within range, avg within 10% of requested delay
-500 msec requested: no. of samples within range, avg within 10% of requested delay
+100 msec requested: no. of samples (49) unexpected, avg (128.53 msec) outside 10% of requested delay
+500 msec requested: no. of samples (12) unexpected, avg within 10% of requested delay
 900 msec requested: no. of samples within range, avg within 10% of requested delay
 === src/noti-bar ===
^^^^^^^^^

Also there are no FIPS results - I wasn't able to find a working RHEL6.6 FIPS-installable tree - will deliver as soon as this gets solved.
> I did not observe the initial-primary-pmlogger-not-restarting behaviour you
> saw - so not sure what advice to offer there yet, sorry. :(

I'm gonna try to add some debugging into this, because I observe it in our automatic runs, when ./check is run for the first time and pmlogger is expected to start. All other ./check calls during automatic runs (pcp-testsuite - pcp-testsuite-light runs as the first task) or manual runs are fine. How about extending that 20-second waiting period and checking what happens?

> I found TESTOUT.log contains this output in a couple of spots:
> /mnt/tests/CoreOS/pcp/Library/pcpcommon/lib.sh: line 98: rlIsRHEL7: command
> not found
> ... which is coming from the QE harness - I'll attach a patch here that I
> think resolves this, based on my relatively limited understanding. :)

Thanks for noticing, that was indeed a typo! :)

> I'd appreciate your thoughts on the ppc64 python issue too btw (test 972) -
> this may even be a ppc64-specific python or compiler bug perhaps? Not sure.

I took a look at this one, and it kind of makes sense to me. /var/lib/pcp/pmdas/zswap/pmdazswap.python reports (in the zswap_fetch_callback method) written_back_pages=12435, pagesize=65536 => 880476160 bytes, and that's what is reported in 972.out.bad. Those 4 bits you observe in the conversion come from the bigger page size (65536 instead of 4096) - would this work as a solution? The page size looks correct:

[root@ibm-p730-02-lp1 ~]# getconf PAGESIZE
65536
[root@ibm-p730-02-lp1 ~]#

> Anyway, my short-hand notes follow:
>
> > All arches with very similar diff XXX.out vs XXX.out.bad:
> - 943, 1006, 1008, 1011, 1014, 1015, 1016, 1021, 1022
>
> 061 - cdfa900f - log preamble non-deterministic size
> 600 - TODO (timezones) - passing locally though.
> 128 - d4a9e1e1 - sort vs join locale
> 283 - 39c9e564 - avahi_entry_group_new diagnostic
> 928 - c72755d3 - assumed au timezone
> 943 - 9279ce5b - handle empty cgroups procfs file
>
> 395,400 - 07423d8e - valgrind vs local ctx sample_init
> -> PPC64 -> const char * strcpy (of size 3) -> libc accesses 8 bytes
>
> 1006, 1008, 1011, 1014, 1015, 1016, 1021, 1022 - 1408de56, 5e650b20,
> bc21c577, 8bfbd481, a875d38a, 81785d88, 0dcb2e77, e2675b31 - missing archives
>
> 386, 388, 391, 393, 395, 400, 405, 459, 490, 513, 526, 719, 720, 783, 875,
> 957 - fd6c591c - valgrind vs NSPR libs
>
> x86_64: 018, 136, 314, 457
> 018 (passed - must be racy)
> 136 - long hostname might be causing log offset weirdness?
> (hp-ms-01-c35.moonshot1.lab.eng.rdu.redhat.com - pmcd.pmlogger.pmcd_host)
> 314 - 761e8a0 - taking a while to rotate logs, causing extra messages
> 457 - race in pmdalogger-rotation test, I think; for some reason more likely
> here? - test needs work, most likely
>
> s390x: 034, 088, 136, 178, 515, 581, 651
> 034 (passed)
> 088 (passed)
> 136 (passed)
> 178 (passed)
> (*) 515 - pmie rising/fallout test - missing some output? - test has timing
> dependence (pmie sampling on 0.4s interval, vs 2sec sleeps)
> 581 (passed)
> 651 (passed)
>
> i686: 093, 457, 581, 713, 725, 741
> 093 (passed)
> 457 - exactly as for x86_64 (pmdalogger)
> 581 (passed)
> 581 [notrun - fips]
> 725 - ed7e935d - broken test, not handling 32-bit kernel ulong correctly
> 741 - not sure - a 2nd "local:" connection failed with "No route to host"?
> (python APIs)
>
> ppc64: 136, 177, 214, 428, 457, 972
> 136 (passed)
> 177 (passed)
> 214 (notrun?)
> 428 (passed)
> 457 - exactly as for x86_64 (pmdalogger)
> 972 - python ppc64-specific 2.6.6 bug??? (passes on RHEL7 with 2.7+)
> problem is related to unsigned long long conversion via PyArg_Parse:
> < value 50933760 -> 0x3093000
> > value 814940160 -> 0x30930000
> (looks like conversions off by 4 bits???)
> (gdb) p /x (0x30930000 >> 4)
> $1 = 0x3093000

[1] https://beaker.engineering.redhat.com/tasks/executed?recipe_task_id=23117598&recipe_task_id=23117569&recipe_task_id=23117583&recipe_task_id=23117555&recipe_task_id=23116924&recipe_task_id=23116939&recipe_task_id=23117613&recipe_task_id=23116956

Hi Miloš,

Apologies for the slow response - timing didn't work out for us this time, as I was travelling right before your latest test run. I've gone through the results as well now (notes below) and think I will put together another build with the handful of fixes described below.

In summary though, with the exception of s390x, things look pretty good - we're in the ~2% failure range overall. On s390x it seems most of the failures (incl. that sanity one) are due to general platform slowness - we had this on RHEL7, ISTR, but it got better late in the game with unrelated-to-PCP updates. So, s390x is at a ~4% failure rate, but in areas that would be fine if the platform could schedule things a bit better. Nothing we can do about that, though.

So, I'll put together the next build shortly - if you do happen to find some FIPS machines, maybe wait for that build to appear before testing? Or another non-FIPS run with the latest build would be great too - I'm not going away for a while again :) so will be able to take a detailed look with you more quickly. Plus, we should be closer to 1% with the next set of results, I hope. Anyway, scribbled notes follow... cheers.

> [initial-primary-pmlogger-not-restarting] ... extend to 20sec?

As everything is running locally, I'd be surprised if increasing this helps - it seems more likely that the primary (-P) pmlogger is simply exiting immediately, or is not even being attempted to start (for some mysterious reason), or pmlc cannot connect to it (possibly a permissions issue).
The message comes when pmlc fails to connect to pmlogger (see the qa/common.check script, approx line 310) - there's a bit of debug code we could enable in there to try to track it down (I'll attach a patch shortly, or maybe do a new build with it included). It may be short/long hostname related, actually ... hmm, let me include a fix there too.

> [...zswap...]
> Those 4 bits you observe in conversion come from bigger page size

Yes! That is very likely to be it, and so this looks like a test failure, where it's assuming 4K pages (the input data is fixed, but the values are multiplied by the pagesize). I've committed 3f0672d4e4 upstream to address this - thanks!

- AVC errors:

type=SYSCALL msg=audit(1406583107.479:9687): arch=c000003e syscall=59 success=yes exit=0 a0=a4e360 a1=a4dd50 a2=a4e3a0 a3=30 items=0 ppid=10832 pid=10833 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="rsyslogd" exe="/sbin/rsyslogd" subj=unconfined_u:system_r:syslogd_t:s0 key=(null)
type=AVC msg=audit(1406583107.479:9687): avc: denied { append } for pid=10833 comm="rsyslogd" path="/var/lib/pcp/testsuite/555.full" dev=dm-0 ino=791743 scontext=unconfined_u:system_r:syslogd_t:s0 tcontext=unconfined_u:object_r:pcp_var_lib_t:s0 tclass=file

Ah, that's interesting - rsyslog is unable to write to the test's .full file. That's a failure due to an assumption in the test, not a genuine problem - I think we can fix that by getting it to write to a temp file and moving that to $seq.full at the end ... I'll prep a patch.

- Missing packages => not-run tests:
avahi, avahi-tools, xfsprogs (patch attached)

- Failure audit - x86_64:
128 - join(1) failed, file not sorted - argh, thought I'd fixed this one.
(need to look at a failing host again - worked OK when I had one)
136 - first sample is unexpectedly large, likely long hostname related.
457 - race condition in test for pmdalogger.
719 - valgrind 'no leaks possible' vs '0 errors from 0 contexts'
(possible side-effect of earlier NSS issue now being fixed)
720 - valgrind, as above
875 - valgrind, as above
1006 - missing qa/views directory - argh
1008 - missing qa/archives directory - argh
1011 - missing qa/archives, as above
1015 - missing qa/archives, as above
1016 - missing qa/archives, as above
1021 - missing qa/archives, as above
1022 - missing qa/archives, as above
-> x86_64: Failed 13 of 613 tests => ~2.1%

- Failure audit - i386:
093 - too few log reads?! (test filter failure)
581 - needs filter on network devices prefixed by 'usb' (filter failure)
741 - strange python API failure creating second local: context (unsure)
128 719 720 875 1006 1008 1011 1015 1016 1021 1022 - as for other platforms.
Failed 14 of 608 tests
-> i386: Failed 14 of 613 tests => ~2.3%

- Failure audit - ppc64:
400 - valgrind reports strncpy(3) reading more bytes than allowed.
972 - zswap pagesize issue (upstream 3f0672d4e4)
457 719 720 875 1006 1008 1011 1015 1016 1021 1022 - as for other platforms.
-> ppc64: Failed 13 of 606 tests => ~2.1%

- Failure audit - s390x:
034 - seems s390x cannot sample sufficiently quickly, or too quickly
044 - proc.psinfo.utime metric used, but PM_ERR_NOAGENT (no proc PMDA?)
088 - too many samples observed in the initially-created QA archives
102 - possible scheduling issue/race - no pmlogger, pmlc took 4sec to start?
177 - too many read(2) calls needed to read archive (possible dup of 088)
189 - pmie scheduling issues, kernel does not schedule quickly enough?
200 - unexpectedly long delays (>5sec/more) in pmcd connection timeouts
206 - another case of too many samples (possible dup of 088/177)
251 - yet another case of too many samples (possible dup of 088/177/206)
514 - pmie scheduling issues, dup of 189.
515 - pmie scheduling issues, dup of 189/514.
558 - pmie failed to sample in timely fashion, dup of 189/514/515.
128 719 720 875 1006 1008 1011 1015 1016 1021 1022 - as for other platforms.
-> s390x: Failed 23 of 612 tests => 3.8%

Created attachment 924351 [details]
Additional rpm dependencies for PCP QA testing
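As an aside, the temp-file idea floated above for the rsyslog AVC denial on 555.full would look something like this - a minimal sketch with illustrative names, not the actual QA harness patch:

```shell
# Sketch: have syslog output captured somewhere the confined daemon may
# write (a /tmp file, rather than appending directly to a pcp_var_lib_t
# file), then move the result into place once the test finishes.
seq=555                              # illustrative test number
tmp=$(mktemp /tmp/pcpqa.XXXXXX)      # temp location, writable by rsyslogd
echo "syslog lines captured during the test" >> "$tmp"
mv "$tmp" "$seq.full"                # final location the test checks
```

This sidesteps the SELinux denial because rsyslogd never touches the pcp_var_lib_t-labelled directory directly.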
Hi Nathan!

(In reply to Nathan Scott from comment #6)
> Hi Miloš,
>
> Apologies for the slow response - timing didn't work out for us this time,
> as I was travelling right before your latest test run. I've gone
> through the results as well now (notes below) and think I will put together
> another build with the handful of fixes described below.

I'm sorry for the delay on my side, too - I got stuck during the RHSCL testing cycle and I was sick and... Anyway, I have results for you with pcp-3.9.4-5, reserved machines and a few notes, as always.

- pcp-testsuite-light aka "./check -g sanity": https://beaker.engineering.redhat.com/tasks/executed?recipe_task_id=23417152&recipe_task_id=23417165&recipe_task_id=23417191&recipe_task_id=23417095&recipe_task_id=23417178&recipe_task_id=23417131&recipe_task_id=23417119&recipe_task_id=23417107
- pcp-testsuite aka "./check -g local": https://beaker.engineering.redhat.com/tasks/executed?recipe_task_id=23417153&recipe_task_id=23417166&recipe_task_id=23417096&recipe_task_id=23417179&recipe_task_id=23417120&recipe_task_id=23417108&recipe_task_id=23417192&recipe_task_id=23417132
- "-g sanity" did not start on 6 of 8 machines; it passed on the remaining 2 (x86_64 (non-FIPS) and s390x)
- "-g local" was killed by a watchdog on the FIPS x86_64 and ppc64 machines (but not on FIPS i386) - test case 840 (probably; the one after 832 according to logs from other machines) got stuck and caused termination of the whole task when its reserved time ran out.

Machines are reserved (and I'll keep them reserved at least till Monday, just to be sure none of them goes back to the pool before you can get your hands on it) and passwords are set as usual. Usually I have another set of machines reserved as my own pcp playground, to avoid any violation of results, logs, local files, etc. on the ones I provide to you, but these days there is quite a long queue in beaker, so I hope it won't take too long before I'll be able to debug pcp-testsuite-light more deeply.
> In summary though, with the exception of s390x, things look pretty good -
> we're in the ~2% failure range overall. On s390x it seems most of the
> failures (incl. that sanity one) are due to general platform slowness - we
> had this on RHEL7 ISTR, but it got better late in the game with
> unrelated-to-PCP updates. So, s390x is at ~4% failure rate, but in areas
> that would be fine if the platform could schedule things a bit better.
> Nothing we can do about that though.
>
> So, I'll put together the next build shortly - if you do happen to find some
> FIPS machines, maybe wait for that build to appear before testing? Or
> another non-FIPS run with latest build would be great too - I'm not going
> away for awhile again :) so will be able to take a detailed look with you
> more quickly. Plus, we should be closer to 1% with next set of results, I
> hope. Anyway, scribbled notes follow... cheers.

There were some troubles with the RHEL6.6 beaker trees with regard to FIPS (not installable, not available, ...) but this seems to be solved, so this time there are results for both non-FIPS and FIPS machines. You can find out which one you are inspecting at the beginning of TESTOUT.log:

:: [ 10:19:02 ] :: [ INFO ] :: ARCH=ppc64
:: [ 10:19:03 ] :: [ INFO ] :: FIPS_ENABLED=1

> > [initial-primary-pmlogger-not-restarting] ... extend to 20sec?
>
> As everything is running locally, I'd be surprised if increasing
> this helps - it seems more likely that the primary (-P) pmlogger
> is simply exiting immediately, or is not even being attempted to
> start (for some mysterious reason), or pmlc cannot connect to it
> (possibly a permissions issue).
>
> The message comes when pmlc fails to connect to pmlogger (see the
> qa/common.check script, approx line 310) - there's a bit of debug
> code we could enable in there to try track it down (I'll attach a
> patch shortly, or maybe do a new build with it included). It may
> be short/long hostname related actually ...
> hmm, let me include a
> fix there too.

Such a fix - or at least a pointer to what might be going on - would be very useful, because pcp-testsuite-light is effectively crippled because of this. This time it passed even on one non-s390x machine (but failed on the second s390x one, so...). I'll have some spare time during Thu/Fri and I plan to take a better look at this, equipped with gdb, strace & friends :)

> > [...zswap...]
> > Those 4 bits you observe in conversion come from bigger page size
>
> Yes! That is very likely to be it, and so this looks like a test
> failure, where its assuming 4K pages (input data is fixed, but the
> values are multiplied by the pagesize). I've committed 3f0672d4e4
> upstream to address this - thanks!

You are welcome.

> - AVC errors:
>
> type=SYSCALL msg=audit(1406583107.479:9687): arch=c000003e syscall=59
> success=yes exit=0 a0=a4e360 a1=a4dd50 a2=a4e3a0 a3=30 items=0 ppid=10832
> pid=10833 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0
> fsgid=0 tty=(none) ses=4294967295 comm="rsyslogd" exe="/sbin/rsyslogd"
> subj=unconfined_u:system_r:syslogd_t:s0 key=(null)
> type=AVC msg=audit(1406583107.479:9687): avc: denied { append } for
> pid=10833 comm="rsyslogd" path="/var/lib/pcp/testsuite/555.full" dev=dm-0
> ino=791743 scontext=unconfined_u:system_r:syslogd_t:s0
> tcontext=unconfined_u:object_r:pcp_var_lib_t:s0 tclass=file
>
> Ah, that's interesting - rsyslog unable to write to the tests .full file.
> That's a failure due to an assumption in the test, not a genuine problem -
> I think we can fix that by getting it to write to a temp file and moving
> that to $seq.full at the end ... I'll prep a patch.
>
> - Missing packages => not-run tests:
> avahi, avahi-tools, xfsprogs (patch attached)

Thanks for pointing these out to me - added! Please don't pay any attention to warnings caused by the missing xfsprogs package on some arches; I'll fix it later.
> - Failure audit - x86_64:
> 128 - join(1) failed, file not sorted - argh, thought I'd fixed this one.
> (need to look at a failing host again - worked OK when I had one)
> 136 - first sample is unexpected large, likely long hostname related.
> 457 - race condition in test for pmdalogger.
> 719 - valgrind 'no leaks possible' vs '0 errors from 0 contexts'
> (possible side-effect of earlier NSS issue now being fixed)
> 720 - valgrind, as above
> 875 - valgrind, as above
> 1006 - missing qa/views directory - argh
> 1008 - missing qa/archives directory - argh
> 1011 - missing qa/archives, as above
> 1015 - missing qa/archives, as above
> 1016 - missing qa/archives, as above
> 1021 - missing qa/archives, as above
> 1022 - missing qa/archives, as above
> -> x86_64: Failed 13 of 613 tests => ~2.1%
>
> - Failure audit - i386:
> 093 - too few log reads?! (test filter failure)
> 581 - needs filter on network devices prefixed by 'usb' (filter failure)
> 741 - strange python API failure creating second local: context (unsure)
> 128 719 720 875 1006 1008 1011 1015 1016 1021 1022 - as for other platforms.
> Failed 14 of 608 tests
> -> i386: Failed 14 of 613 tests => ~2.3%
>
> - Failure audit - ppc64:
> 400 - valgrind reports strncpy(3) reading more bytes than allowed.
> 972 - zswap pagesize issue (upstream 3f0672d4e4)
> 457 719 720 875 1006 1008 1011 1015 1016 1021 1022 - as for other platforms.
> -> ppc64: Failed 13 of 606 tests => ~2.1%
>
> - Failure audit - s390x:
> 034 - seems s390x cannot sample sufficiently quickly, or too quickly
> 044 - proc.psinfo.utime metric used, but PM_ERR_NOAGENT (no proc PMDA?)
> 088 - too many samples observed in the initially-created QA archives
> 102 - possible scheduling issue/race - no pmlogger, pmlc took 4sec to start?
> 177 - too many read(2) calls needed to read archive (possible dup of 088)
> 189 - pmie scheduling issues, kernel does not schedule quickly enough?
> 200 - unexpectedly long delays (>5sec/more) in pmcd connection timeouts
> 206 - another case of too many samples (possible dup of 088/177)
> 251 - yet another case of too many samples (possible dup of 088/177/206)
> 514 - pmie scheduling issues, dup of 189.
> 515 - pmie scheduling issues, dup of 189/514.
> 558 - pmie failed to sample in timely fashion, dup of 189/514/515.
> 128 719 720 875 1006 1008 1011 1015 1016 1021 1022 - as for other platforms.
> -> s390x: Failed 23 of 612 tests => 3.8%

Hi Miloš,
> Such a fix - or at least a pointer to what might be going on - would be very
> useful because pcp-testsuite-light is effectively crippled because of this.
> This time it passed even on one non-s390x machine (but failed on the second
> s390x one, so...). I'll have some spare time during Thu/Fri and I plan to
> take a better look at this, equipped with gdb, strace & friends :)
I don't have any insights as to a cause from digging through the logs, sorry.
It's gotta be something about the initial machine state, I guess... but exactly
what, I haven't been able to figure out. I think it will need a run with a
fresh machine setup to be interrupted before starting the tests, and yeah -
like you said - deeper analysis with those tools.
FWIW, on the full runs we did get, and except for the hangs, the indicators are
looking a lot better. Lots and lots of previously failing tests are passing -
results look a bit better than we achieved with RHEL7 even, so that's cool.
My other notes from logfile scanning...
840 is a new common failure - I've asked Dave Brolley, who wrote
this one and the PCP code behind it. It's new because we've now
installed the Avahi tools - previously it would've been _notrun().
s390x is looking much better now - s'funny same thing happened in
RHEL7 - something else in the system/kernel was changed I guess.
x86_64 (FIPS)
457 - known issue, pmdalogger test race
832-ish hang (could be Avahi related? could be FIPS)
-> 91% done, 1 failure! was going well up to the hang.
s390x (non-FIPS)
469 - help text is >84chars - heh, since fixed upstream but not
really an issue worth worrying about (cosmetic).
840 - Avahi issue reported above
-> 2 failures only - awesome result for s390x!
ppc64 (FIPS)
099-ish - hang (unknown - could be Avahi, could be FIPS?)
i686 (FIPS)
* Hmm - lots of never-seen-before network-related interruptions here
and difficulties getting pmie timers to fire quickly enough - I'm
wondering if something hardware - CPU/network? - related was going
on during this testing that affected this machine? (*) (oh, FIPS?)
* 160 - seemed to lose several pmstore PDUs
* 177 - lost connection to the pmlogger instance
* 295 - failures to connect to pmproxy
* 314 - possible slow pmie start, causing racey output
* 356 - pmie unable to schedule tasks quickly enough
* 428 - missing some sample values, again from pmstore
457 - known issue with pmdalogger test race condition
* 535 - pmie unable to schedule tasks quickly enough
* 558 - pmie unable to schedule tasks quickly enough
* 581 - pmie unable to schedule tasks quickly enough
741 - known issue on i386 with python and multiple contexts
840 - Avahi issue reported above
x86_64 (non-FIPS)
136 - log header is still slightly larger than expected
(earlier hostname fix has not resolved this, hmm).
840 - Avahi issue
s390x (non-FIPS)
-> just two failures 469 & 840 - same failures as above s390x - excellent!
ppc64 (non-FIPS)
400 - still strncpy-related valgrind failures, earlier workarounds
ineffective, ugh. Pretty sure this is due to glibc/valgrind
as it's only this platform, and not in newer versions of them.
457 - known issue with pmdalogger test race condition
840 - Avahi issue reported above
i386 (non-FIPS)
Failures:
136 - log header is still slightly larger than expected
(earlier hostname fix has not resolved this, ugh).
428 - seemed to lose several pmstore PDUs
457 - known issue with pmdalogger test race condition
713 - spurious "TCP connection reset by peer" failures.
741 - known issue on i386 with python and multiple contexts.
840 - Avahi issue reported above
This time the results are from a manual run, so it's just a list of log files. My comments on the failed test cases are below; any comment is welcome. Machines are available and accessible (search for Hostname in the logs) in case you'd like to check something locally. Our QE logs are copied here: http://nest.test.redhat.com/mnt/qa/scratch/mprchlik/pcp/

Only fails are mentioned below - the vast majority are broken test cases, known issues, or issues (probably) caused by a slow machine. Probably the only non-waivable ones are 709 and 722. Amazing job, guys, everything seems to work very well! :)

----------------------
128

--- 128.out 2014-08-06 02:52:20.000000000 -0400
+++ 128.out.bad 2014-09-10 09:51:47.019451436 -0400
@@ -5,4 +5,5 @@
 fetching pminfo values...
 comparing all values...
 (details in 128.full)
+join: file 1 is not in sorted order
 done

> 128: join(1) finds the output of sort(1) two lines earlier to be unsorted. :P
> ... need access to a machine with this problem, not reproducible here.
> Hopefully fix as simple as giving the join invocation a locale matching
> that used by sort.

Machine is available, ibm-p8-kvm-03-guest-08.rhts.eng.bos.redhat.com, root/redhat. But I think it's safe to waive, since it's a broken test case, not pcp itself.

----------------------
ppc64/136

--- 136.out 2014-08-06 02:52:20.000000000 -0400
+++ 136.out.bad 2014-09-10 05:17:20.715015250 -0400
@@ -1,5 +1,5 @@
 QA output created by 136
-MYLOG.0 700-800 bytes
+MYLOG.0 800-900 bytes
 MYLOG.1 600-700 bytes
 MYLOG.2 600-700 bytes
 MYLOG.3 600-700 bytes

> 136 - long hostname might be causing log offset weirdness?
> ...
> 136 - first sample is unexpectedly large, likely long hostname related.
> ...
> 136 - log header is still slightly larger than expected
> (earlier hostname fix has not resolved this, hmm).

The hostname is indeed quite long: ibm-p8-kvm-03-guest-08.rhts.eng.bos.redhat.com. Is this possibly fixed already in upstream?
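On the 128 locale theory quoted above, the suspected fix is just forcing an identical collation for both sort(1) and join(1); a minimal sketch of the idea with illustrative file names, not the actual test script:

```shell
# With differing locales, join(1) can consider sort(1) output unsorted;
# pinning LC_ALL (which overrides LC_COLLATE) for both commands keeps
# their collation orders consistent.
cd "$(mktemp -d)"
export LC_ALL=POSIX
printf 'b 1\na 2\n' > left.txt
printf 'a x\nb y\n' > right.txt
sort -k1,1 left.txt > left.sorted
sort -k1,1 right.txt > right.sorted
join -j 1 left.sorted right.sorted   # joins cleanly: "a 2 x", "b 1 y"
```

The same effect could be had by exporting one locale at the top of the test, which is presumably what the eventual d4a9e1e1 commit does.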
----------------------

s390x/156

--- 156.out	2014-08-06 02:45:57.000000000 -0400
+++ 156.out.bad	2014-09-10 05:20:27.472213082 -0400
@@ -170,7 +170,7 @@
 Terminate PMDA if already installed ...
 [...install files, make output...]
 Updating the PMCD control file, and notifying PMCD ...
-Check sample metrics have appeared ... 10+ warnings, 100+ metrics and 1200+ values
+Check sample metrics have appeared ... 140 warnings, 100+ metrics and 0 values
 ======================
 = sampledso DSO PMDA =
 ======================

No comments from you about this one, and no idea on my side either. It also seems machine-specific and a one-time event - we haven't encountered it during months of test runs, and I'm not able to reproduce it again.

----------------------

x86/160

--- 160.out	2014-08-06 02:49:56.000000000 -0400
+++ 160.out.bad	2014-09-10 08:26:04.310199840 -0400
@@ -5,10 +5,4 @@
 inst [ INST-1 or "INST-1" ] value "<host> QA-clientid bb"
 inst [ INST-1 or "INST-1" ] value "<host> QA-clientid ccc"
 inst [ INST-1 or "INST-1" ] value "<host> QA-clientid dddd"
-inst [ INST-1 or "INST-1" ] value "<host> QA-clientid eeeee"
-inst [ INST-1 or "INST-1" ] value "<host> QA-clientid ffffff"
-inst [ INST-1 or "INST-1" ] value "<host> QA-clientid ggggggg"
-inst [ INST-1 or "INST-1" ] value "<host> QA-clientid hhhhhhhh"
-inst [ INST-1 or "INST-1" ] value "<host> QA-clientid iiiiiiiii"
 inst [ INST-2 or "INST-2" ] value "<host> QA-clientid foo-bar-mumble"
-inst [ INST-3 or "INST-3" ] value "<host> QA-clientid here is a longer string to go in the ident place holder to try things out that are not quite so short and boring"

> 160 - seemed to lose several pmstore PDUs

Failed again here, but passed on a different machine, and is not reproducible again. Probably safe to waive. The machine was horribly slow; Nathan already mentioned this might be related.
----------------------

177

--- 177.out	2014-08-06 02:45:57.000000000 -0400
+++ 177.out.bad	2014-09-10 05:42:56.952007913 -0400
@@ -27,21 +27,21 @@
 === pmval -t 0.1 ===
 58-62 samples
-__pmLogRead-back 1-5 calls
-__pmLogRead-forw 51-57 calls
+__pmLogRead-back unexpected 8 calls, not 1-5
+__pmLogRead-forw unexpected 60 calls, not 51-57
 __pmLogRead-peek 1-1 calls
 _pmFetchInterp 116-120 calls
 === pmval -t 0.2 ===
 28-32 samples
 __pmLogRead-back 1-5 calls
-__pmLogRead-forw 51-57 calls
+__pmLogRead-forw unexpected 60 calls, not 51-57
 __pmLogRead-peek 1-1 calls
 _pmFetchInterp 58-60 calls
 === pmval -t 0.4 ===
 13-17 samples
 __pmLogRead-back 5-20 calls
-__pmLogRead-forw 51-57 calls
+__pmLogRead-forw unexpected 61 calls, not 51-57
 __pmLogRead-peek 1-1 calls
 _pmFetchInterp 29-30 calls

> 177 - too many read(2) calls needed to read archive (possible dup of 088)
> ...
> 177 - lost connection to the pmlogger instance

The machine was horribly slow (athlon4.rhts.eng.bos.redhat.com, root/redhat), which might be related, as suggested by the previous comment. I'm not able to reproduce it again, though.

----------------------

x86/200

--- 200.out	2014-08-06 02:49:56.000000000 -0400
+++ 200.out.bad	2014-09-10 09:54:08.745127273 -0400
@@ -12,14 +12,14 @@
 Starting pmcd ...
 pmFetch failed: IPC protocol failure
 pmReconnectContext: success
-delay: 6-10 secs
+delay: 14-16 secs
 PMCD_RECONNECT_TIMEOUT="2"
 Waiting for pmcd to terminate ...
 Starting pmcd ...
 pmFetch failed: IPC protocol failure
 pmReconnectContext: success
-delay: 6-10 secs
+delay: 11 secs
 PMCD_RECONNECT_TIMEOUT="1-2"
 Waiting for pmcd to terminate ...

> 200 - unexpectedly long delays (>5sec/more) in pmcd connection timeouts

Slow machine; would waive.

----------------------

x86/314

--- 314.out	2014-08-06 02:49:57.000000000 -0400
+++ 314.out.bad	2014-09-10 10:43:41.723191101 -0400
@@ -3,4 +3,5 @@
 Rotate, rotate...
 Shutdown pmie process
 Old logfile line count is in range
-New logfile line count is in range
+New logfile line count has value of 110
+New logfile line count is NOT in range 9. .. 63.

> 314 - 761e8a0 - taking awhile to rotate logs, causing extra messages
> ...
> 314 - possible slow pmie start, causing racey output

The very same horribly slow machine. Waived.

----------------------

x86/356

--- 356.out	2014-08-06 02:49:57.000000000 -0400
+++ 356.out.bad	2014-09-10 10:58:43.945932020 -0400
@@ -15,8 +15,73 @@
 [TIMESTAMP] pmie(PID) Info: Re-established connection to pmcd on host HOST
+sleepTight: negative delay (-3.490459). sched=1410361103.645210, cur=1410361107.135669
+Last sleepTight (retry) until: DATE.645210
+Last Task dump @ 0x1b55870
+  nth=0 delta=1.000 tick=10 next=(nil) prev=(nil)
+  eval time: DATE.645210
+  retry time: DATE.000000
+  host=athlon4.rhts.eng.bos.redhat.com (up)
+  rules:
+    expr_1
+...
+
+This sleepTight() entry: DATE.135665
+Harvest children done: DATE.135668
+Want sleepTight (eval) until: DATE.645210
+This Task dump @ 0x1b55870
+  nth=0 delta=1.000 tick=10 next=(nil) prev=(nil)
+  eval time: DATE.645210
+  retry time: DATE.000000
+  host=athlon4.rhts.eng.bos.redhat.com (up)
+  rules:
+    expr_1
+...
+
+sleepTight: negative delay (-2.520393). sched=1410361104.645210, cur=1410361107.165603
+Last sleepTight (eval) until: DATE.645210
+Last Task dump @ 0x1b55870
+  nth=0 delta=1.000 tick=11 next=(nil) prev=(nil)
+  eval time: DATE.645210
+  retry time: DATE.000000
+  host=athlon4.rhts.eng.bos.redhat.com (up)
+  rules:
+    expr_1
+...
+
+This sleepTight() entry: DATE.165601
+Harvest children done: DATE.165602
+Want sleepTight (eval) until: DATE.645210
+This Task dump @ 0x1b55870
+  nth=0 delta=1.000 tick=11 next=(nil) prev=(nil)
+  eval time: DATE.645210
+  retry time: DATE.000000
+  host=athlon4.rhts.eng.bos.redhat.com (up)
+  rules:
+    expr_1
+...
+
+sleepTight: negative delay (-1.530933). sched=1410361105.645210, cur=1410361107.176143
+Last sleepTight (eval) until: DATE.645210
+Last Task dump @ 0x1b55870
+  nth=0 delta=1.000 tick=12 next=(nil) prev=(nil)
+  eval time: DATE.645210
+  retry time: DATE.000000
+  host=athlon4.rhts.eng.bos.redhat.com (up)
+  rules:
+    expr_1
+...
-expr_1: value(s)
+This sleepTight() entry: DATE.176141
+Harvest children done: DATE.176142
+Want sleepTight (eval) until: DATE.645210
+This Task dump @ 0x1b55870
+  nth=0 delta=1.000 tick=12 next=(nil) prev=(nil)
+  eval time: DATE.645210
+  retry time: DATE.000000
+  host=athlon4.rhts.eng.bos.redhat.com (up)
+  rules:
+    expr_1
 ...
 [TIMESTAMP] pmie(PID) Info: evaluator exiting

> 356 - pmie unable to schedule tasks quickly enough

Again, slow machine, waiving.

----------------------

ppc64/400

--- 400.out	2014-08-06 02:52:22.000000000 -0400
+++ 400.out.bad	2014-09-10 09:45:13.064408028 -0400
@@ -14,10 +14,54 @@
 === filtered valgrind report ===
 Memcheck, a memory error detector
 Command: pminfo -x -L sampledso.event.records
+Invalid read of size 8
+   at 0x534881C: sample_init (sample.c:1157)
+   by 0x40FBD3B: __pmConnectLocal (connectlocal.c:390)
+   by 0x40C6C37: pmNewContext (context.c:509)
+   by 0x10002953: main (pminfo.c:602)
+ Address 0x4f2de90 is 0 bytes inside a block of size 4 alloc'd
...

> 400 - still strncpy-related valgrind failures, earlier workarounds
> ineffective, ugh.  Pretty sure this is due to glibc/valgrind
> as its only this platform and not in newer versions of them.

Probably safe to waive then.

----------------------

s390x/411

--- 411.out	2014-08-06 02:45:58.000000000 -0400
+++ 411.out.bad	2014-09-10 06:18:18.631610143 -0400
@@ -1,4 +1,5 @@
 QA output created by 411
+pminfo: Cannot connect to PMCD on host "local:": Connection refused
 You will need to choose an appropriate configuration for installation of
 the "simple" Performance Metrics Domain Agent (PMDA).
@@ -15,6 +16,8 @@
 Terminate PMDA if already installed ...
 [...install files, make output...]
 Updating the PMCD control file, and notifying PMCD ...
+Starting pmcd ...
+Starting pmlogger ...
 Check simple metrics have appeared ... 5 metrics and 9 values
 simple.now

Looks like a one-time-only issue. I'm not able to reproduce it; it might be some machine-specific issue. Probably safe to waive.

----------------------

ppc64/428

--- 428.out	2014-08-06 02:52:22.000000000 -0400
+++ 428.out.bad	2014-09-10 06:27:20.820474345 -0400
@@ -70,7 +70,6 @@
 Unique values for sample.string.write_me ...
 "1024"
-"16"
 "256"
 "4096"

> 428 - seemed to lose several pmstore PDUs

During the testing phase it failed on several other arches, but also passed many times. According to Nathan's comments it looks like nothing to worry about too much, as it might be caused by a very slow or busy machine. Would waive.

----------------------

457

--- 457.out	2014-08-06 02:52:22.000000000 -0400
+++ 457.out.bad	2014-09-10 06:30:42.413088706 -0400
@@ -157,10 +157,10 @@
     value 0
 logger.perfile.reg.bytes
-    value 0
+    value 56
 logger.perfile.reg.count
-    value 0
+    value 2
 Checking new log file
 logger.perfile.reg.queuemem
     value 0
@@ -178,10 +178,10 @@
     value 56
 logger.perfile.reg.bytes
-    value 56
+    value 112
 logger.perfile.reg.count
-    value 2
+    value 4
 === 4. non-existant file ===
 Check a missing file

> 457 - known issue with pmdalogger test race condition

Ok, waived.

----------------------

s390x/469

--- 469.out	2014-08-06 02:45:58.000000000 -0400
+++ 469.out.bad	2014-09-10 06:28:04.601438810 -0400
@@ -1 +1,2 @@
 QA output created by 469
+kernel.percpu.cpu.irq.hard, full help text is too long: 85

> 469 - help text is >84chars - heh, since fixed upstream but not
> really an issue worth worrying about (cosmetic).

Ok, waived, not a pcp issue.
----------------------

s390x/651

--- 651.out	2014-08-06 02:45:58.000000000 -0400
+++ 651.out.bad	2014-09-10 06:55:49.101212934 -0400
@@ -1,6 +1,7 @@
 QA output created by 651
 == pmproxy not running ==
-pmprobe: Cannot connect to PMCD on host "MYHOST": Connection refused
+[Wed Sep 10 06:55:44] pmprobe(55092) Warning: __pmConnectPMCD: bad version string from pmproxy: ""
+pmprobe: Cannot connect to PMCD on host "MYHOST": IPC protocol failure
 == simplest default case ==
 sample.long.hundred 1 100

I'm not able to reproduce it, and it's a single failure after many previous runs without any trouble at all. Probably safe to waive.

----------------------

x86/709

--- 709.out	2014-08-06 02:49:59.000000000 -0400
+++ 709.out.bad	2014-09-10 16:01:50.435298931 -0400
@@ -37,7 +37,7 @@
 9M 9M 9M 9M 9M 9M 9M 9M 9M 9M 9M 9M 9M 9M 9 9 9 9 9 9
 9M 9M 9M 9M 9M 9M 9M 9M 9M 9M 9M 9M 9M 9M 9 9 9 9 9 9
 -sN network interface names are okay
--sD disk names are okay
+No match for -sD
 #<--------CPU--------><----------Disks-----------><----------Network---------->
 #cpu sys inter ctxsw KBRead Reads KBWrit Writes KBIn PktIn KBOut PktOut
 9 9 9 9 9 9 9 9 9 9 9 9

Never seen this one before, and it manifested only on the x86 arch.

----------------------

722

--- 722.out	2014-08-06 02:52:23.000000000 -0400
+++ 722.out.bad	2014-09-10 07:04:46.349426042 -0400
@@ -223,6 +223,8 @@
 9 9K 9K 9K 9K 9K 9K 9% kworker/9:9
 9 9K 9K 9K 9K 9K 9K 9% kworker/9:9
 9 9K 9K 9M 9K 9M 9K 9% ssh-agent
+Invalid command
+Type 'h' to see a list of valid commands
 ATOP - Day Month 9 9:9:9 9 9:9:9 elapsed
 PRC | sys 9h9m | user 9d | #proc 9 | #zombie 9
 CPU | sys 9% | user 9% | irq 9% | idle 9% | wait 9% |
@@ -447,3 +449,5 @@
 9 9s 9s 9K 9K root 9 -- - S 9% kworker/9:9
 9 9s 9s 9K 9K root 9 -- - S 9% kworker/9:9
 9 9s 9s 9M 9K root 9 -- - S 9% ssh-agent
+Invalid command
+Type 'h' to see a list of valid commands

This one is strange. I'm unable to reproduce it by running "./check 722", even though it looks like some command/script at a lower level changed its parameters...
----------------------

840

--- 840.out	2014-08-06 02:52:23.000000000 -0400
+++ 840.out.bad	2014-09-10 07:09:50.743347205 -0400
@@ -1,2 +1 @@
 QA output created by 840
- hostname = [HOSTNAME.local]

> 840 is a new common failure - I've asked Dave Brolley who wrote
> this one and the PCP code behind it.  Its new because we've now
> installed the Avahi tools - previously would've been _notrun().

This one (probably!) caused automated beaker jobs to time out by running out of reserved time. This time no such thing happened, but the test case failed with the message shown above. Seems like an easy fix, but I'm not sure whether it means something is wrong with the pcp tools or it's just a broken test case.

----------------------

(In reply to Miloš Prchlík from comment #10)
> Probably only non-waivable are 709 and
> 722.

I had a look at these, but could reproduce neither of them:

709 looks like the extra output is from some tracing code, either in pcp or in the test case?

722 is strange indeed. It looks like some extra output from running pmatop.

These both look like tests of the python pmapi bindings. Perhaps scox has some insight.

(In reply to Miloš Prchlík from comment #10)
>
> ----------------------
>
> 840
>
> --- 840.out	2014-08-06 02:52:23.000000000 -0400
> +++ 840.out.bad	2014-09-10 07:09:50.743347205 -0400
> @@ -1,2 +1 @@
>  QA output created by 840
> - hostname = [HOSTNAME.local]
>
> > 840 is a new common failure - I've asked Dave Brolley who wrote
> > this one and the PCP code behind it.  Its new because we've now
> > installed the Avahi tools - previously would've been _notrun().
>
> This one (probably!) caused automated beaker jobs to time out by
> running out of reserved time. This time no such thing happened, but the test
> case failed with the message shown above. Seems like an easy fix, but I'm
> not sure whether it means something is wrong with the pcp tools or it's just
> a broken test case.
>
> ----------------------

This test could fail if:

1) the output of `hostname` does not match the hostname printed by avahi-browse. This would be a test case problem.

2) the mdns port on the test machine is not open (port 5353/udp). This would be a problem with the setup of the machine.

I can't tell from looking at the logs.

709

 -sN network interface names are okay
--sD disk names are okay
+No match for -sD

Nathan changed this test to work more predictably; this is a testing oddity that should be okay.

722

+Invalid command
+Type 'h' to see a list of valid commands

Odd indeed. pmatop is a screen interface that uses curses. I cannot duplicate it, but it should be fine since the output is otherwise okay.

mprchlik advises that this tracker bug is effectively corrected by the current build.

Thanks for following up, guys,

Dave wrote:
> [...]
> 2) the mdns port on the test machine is not open (port 5353/udp).
> This would be a problem with the setup of the machine.

The test should _notrun then, I guess, and we'd know for sure; e.g. qa/543 does:

echo | $PCP_BINADM_DIR/telnet-probe localhost 25 || _notrun "Noone listening on port 25"

Stan wrote:
> [...]
> Nathan changed this test to work more predictably;

(hmm, sounds like I may have botched that effort - sorry 'bout that)

> this is a testing oddity that should be okay.

I'll look into it, thanks! The 722.full should contain some detailed pointers.

cheers.

Verified for build pcp-3.9.4-5.el6.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1477.html
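A _notrun-style guard along the lines Nathan suggests could front qa/840. This is only a sketch: `_notrun` below is a local stand-in for the real PCP QA helper, and the avahi checks are assumptions about the test's prerequisites (mDNS is 5353/udp, so a TCP telnet-probe cannot verify it directly).

```shell
# Sketch of a prerequisite guard for qa/840, modelled on the qa/543
# pattern quoted above.  _notrun here is a stand-in for the PCP QA
# helper; the specific avahi checks are assumptions.
_notrun() { echo "[not run] $*"; }

check_qa840_prereqs() {
    # avahi-browse must exist for the test to compare hostnames at all
    command -v avahi-browse >/dev/null 2>&1 || { _notrun "avahi-browse not installed"; return 1; }
    # mDNS uses 5353/udp, so a TCP probe like telnet-probe cannot check it;
    # verifying the daemon is running is the closest equivalent guard
    pgrep -x avahi-daemon >/dev/null 2>&1 || { _notrun "avahi-daemon not running"; return 1; }
    return 0
}

if check_qa840_prereqs; then
    echo "qa/840 prerequisites satisfied"
fi
```

With such a guard in place, a machine with a closed or unconfigured mDNS setup would report "[not run]" with a reason instead of producing a spurious failure.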