Bug 1115953
| Field | Value |
|---|---|
| Summary: | Failed test cases |
| Product: | Red Hat Enterprise Linux 6 |
| Component: | pcp |
| Version: | 6.6 |
| Status: | CLOSED ERRATA |
| Reporter: | Miloš Prchlík <mprchlik> |
| Assignee: | Nathan Scott <nathans> |
| QA Contact: | Miloš Prchlík <mprchlik> |
| CC: | brolley, fche, mbenitez, scox, tlavigne |
| Severity: | unspecified |
| Priority: | unspecified |
| Target Milestone: | rc |
| Target Release: | --- |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Fixed In Version: | pcp-3.9.4-5.el6 |
| Doc Type: | Bug Fix |
| Type: | Bug |
| Last Closed: | 2014-10-14 06:44:33 UTC |
Description
Miloš Prchlík
2014-07-03 11:50:27 UTC
Hi Miloš,

Thanks for the test runs and great initial analysis. My findings very much mirror yours ... here are the notes I made as I went through the results. If I could get a reserved machine where the tests have been run, that'd be extremely helpful for verifying the fixes (and the nspr/valgrind workaround) I have so far. On the whole, the only one that has me really mystified at the moment is 061 - I'll dig deeper (test machine access will really help there). I've not yet looked into those platform-specific failures ... will do so shortly. Oh - on the missing packages question - we could add avahi and avahi-tools into the harness-installed set, and some more tests will begin to run (and one failing test will begin to pass). cheers.

061:
-logputresult: pdubuf=ADDR input len=NNN output len=124 posn=NNN
+logputresult: pdubuf=ADDR input len=NNN output len=144 posn=NNN
... need access to a machine with this problem, not reproducible here (I'm on an earlier RHEL6 though, so not apples-to-apples of course). Very odd, needs more investigation on my part.

128: join(1) finds the output of sort(1) two lines earlier to be unsorted. :P ... need access to a machine with this problem, not reproducible here. Hopefully the fix is as simple as giving the join invocation a locale matching that used by sort.

283: Avahi diagnostics in pmcd.log - probably we lack a running Avahi daemon. This warning message should be under a pmDebug guard - will fix.

386, 388, 391, 393, 395, 400, 405, 459, 490, 513, 526, 719, 720, 783, 875, 957: Valgrind finding a memory leak (PR_OpenFile/malloc - yeah, probable library init routine problem - thanks for the analysis & opening bug #1115952!).

600: Probable test script issue, daylight savings XYZ vs ABC whacko filtering issue. Will dig deeper; I need to take more time to understand the test a bit more though.

928: Test script issue - assumptions about the test machine timezone in the output. Odd that none of the other PCP folks have reported this outside .au ... oh well.
943: Probable test script issue, not catering for a lack of some per-process cgroups instrumentation in the kernel? ... weird though, I don't see this on earlier RHEL6 variants. Access to a machine with this problem would be good.

1006, 1008, 1011, 1014, 1016, 1021, 1022: Test script issue (well, build issue) - missing logs, fixed in subsequent pcp releases already, needs backport.

Hi Miloš,
I've included my notes below, from going through that last set of results.
Where there is a git commit ID next to the test#, that issue was fixed and included in a build from last week - pcp-3.9.4-4. That build has been moved through erratatool and it'd be great if we could get a new set of results and machine reservations if possible, to analyse the remaining issues (of which there will still be a handful - see below).
I did not observe the initial-primary-pmlogger-not-restarting behaviour you saw - so not sure what advice to offer there yet, sorry. :(
I found TESTOUT.log contains this output in a couple of spots:
/mnt/tests/CoreOS/pcp/Library/pcpcommon/lib.sh: line 98: rlIsRHEL7: command not found
... which is coming from the QE harness - I'll attach a patch here that I think resolves this, based on my relatively limited understanding. :)
I'd appreciate your thoughts on the ppc64 python issue too btw (test 972) - this may even be a ppc64-specific python or compiler bug perhaps? Not sure.
Anyway, my short-hand notes follow:
> All arches with very similar diff XXX.out vs XXX.out.bad:
- 943, 1006, 1008, 1011, 1014, 1015, 1016, 1021, 1022
061 - cdfa900f - log preamble non-deterministic size
600 - TODO (timezones) - passing locally though.
128 - d4a9e1e1 - sort vs join locale
283 - 39c9e564 - avahi_entry_group_new diagnostic
928 - c72755d3 - assumed au timezone
943 - 9279ce5b - handle empty cgroups procfs file
395,400 - 07423d8e - valgrind vs local ctx sample_init
-> PPC64 -> const char * strcpy (of size 3) -> libc accesses 8 bytes
1006, 1008, 1011, 1014, 1015, 1016, 1021, 1022 - 1408de56, 5e650b20, bc21c577, 8bfbd481, a875d38a, 81785d88, 0dcb2e77, e2675b31 - missing archives
386, 388, 391, 393, 395, 400, 405, 459, 490, 513, 526, 719, 720, 783, 875, 957 - fd6c591c - valgrind vs NSPR libs
x86_64: 018, 136, 314, 457
018 (passed - must be racy)
136 - long hostname might be causing log offset weirdness? (hp-ms-01-c35.moonshot1.lab.eng.rdu.redhat.com - pmcd.pmlogger.pmcd_host)
314 - 761e8a0 - taking a while to rotate logs, causing extra messages
457 - race in pmdalogger-rotation test, I think; for some reason more likely here? - test needs work, most likely
s390x: 034, 088, 136, 178, 515, 581, 651
034 (passed)
088 (passed)
136 (passed)
178 (passed)
(*) 515 - pmie rising/fallout test - missing some output? - test has timing dependence (pmie sampling on 0.4s interval, vs 2sec sleeps)
581 (passed)
651 (passed)
i686: 093, 457, 581, 713, 725, 741
093 (passed)
457 - exactly as for x86_64 (pmdalogger)
581 (passed)
581 [notrun - fips]
725 - ed7e935d - broken test, not handling 32-bit kernel ulong correctly
741 - not sure - a 2nd "local:" connection failed with "No route to host"?
(python APIs)
ppc64: 136, 177, 214, 428, 457, 972
136 (passed)
177 (passed)
214 (notrun?)
428 (passed)
457 - exactly as for x86_64 (pmdalogger)
972 - python ppc64-specific 2.6.6 bug??? (passes on RHEL7 with 2.7+)
problem is related to unsigned long long conversion via PyArg_Parse:
< value 50933760 -> 0x3093000
> value 814940160 -> 0x30930000
(looks like conversions off by 4 bits???)
(gdb) p /x (0x30930000 >> 4)
$1 = 0x3093000
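The 4-bit shift in that gdb output lines up exactly with the ppc64 page size: 0x3093000 is 12435 pages of 4096 bytes (the size the test assumed), while 0x30930000 is the same 12435 pages of 65536 bytes. A quick shell check of the arithmetic, using the values quoted above:

```shell
# 12435 pages * 4096-byte pages = the value the test expected
echo $((12435 * 4096))      # 50933760  (0x3093000)
# 12435 pages * 65536-byte ppc64 pages = the value actually produced
echo $((12435 * 65536))     # 814940160 (0x30930000)
# the page-size ratio is 16 = 2^4, i.e. the apparent "off by 4 bits"
echo $((65536 / 4096))      # 16
```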
Created attachment 919500 [details]
Resolve errors from QE test harness
(In reply to Nathan Scott from comment #3)
> Hi Miloš,
>
> I've included my notes below, from going through that last set of results.
> Where there is a git commit ID next to the test#, that issue was fixed and
> included in a build from last week - pcp-3.9.4-4. That build has been moved
> through erratatool and it'd be great if we could get a new set of results
> and machine reservations if possible, to analyse the remaining issues (of
> which there will still be a handful - see below).

New test run is ready, and results are available here: [1] Machines are reserved and accessible as usual.

- pcp-testsuite-light aka "-g sanity" failed to start on all but one machine, with the already mentioned "Starting pmlogger ... Arrgghhh ... pmlogger (pid=-P) failed to start after 20 seconds" error. However, when I logged in and ran this one manually, everything works nicely - pmlogger starts and plays nicely with the others. All instances of "-g sanity" passed except for one test case on s390x (ibm-z10-70.rhts.eng.bos.redhat.com):

vvvvvvvvv
--- /var/lib/pcp/testsuite/034.out 2014-07-15 01:37:15.000000000 -0400
+++ /var/lib/pcp/testsuite/034.out.bad 2014-07-30 03:41:26.383529178 -0400
@@ -6,8 +6,8 @@
 900 msec requested: no. of samples within range, avg within 10% of requested delay
 === src/mv-bar ===
-100 msec requested: no. of samples within range, avg within 10% of requested delay
-500 msec requested: no. of samples within range, avg within 10% of requested delay
+100 msec requested: no. of samples (49) unexpected, avg (128.53 msec) outside 10% of requested delay
+500 msec requested: no. of samples (12) unexpected, avg within 10% of requested delay
 900 msec requested: no. of samples within range, avg within 10% of requested delay
 === src/noti-bar ===
^^^^^^^^^

Also there are no FIPS results - I wasn't able to find a working RHEL6.6 FIPS-installable tree - will deliver as soon as this gets solved.
> I did not observe the initial-primary-pmlogger-not-restarting behaviour you
> saw - so not sure what advice to offer there yet, sorry. :(

I'm gonna try to add some debugging into this, because I observe it in our automatic runs, when ./check is run for the first time and pmlogger is expected to start. All other ./check calls during automatic runs (pcp-testsuite - pcp-testsuite-light runs as the first task) or manual runs are fine. How about extending that 20-second waiting period and checking what happens?

> I found TESTOUT.log contains this output in a couple of spots:
> /mnt/tests/CoreOS/pcp/Library/pcpcommon/lib.sh: line 98: rlIsRHEL7: command
> not found
> ... which is coming from the QE harness - I'll attach a patch here that I
> think resolves this, based on my relatively limited understanding. :)

Thanks for noticing, that was indeed a typo! :)

> I'd appreciate your thoughts on the ppc64 python issue too btw (test 972) -
> this may even be a ppc64-specific python or compiler bug perhaps? Not sure.

I took a look at this one, and it kind of makes sense to me. /var/lib/pcp/pmdas/zswap/pmdazswap.python reports (in the zswap_fetch_callback method) written_back_pages=12435, pagesize=65536 => 880476160 bytes, and that's what is reported in 972.out.bad. Those 4 bits you observe in the conversion come from the bigger page size (65536 instead of 4096) - would this work as a solution? The page size looks correct:

[root@ibm-p730-02-lp1 ~]# getconf PAGESIZE
65536
[root@ibm-p730-02-lp1 ~]#

> Anyway, my short-hand notes follow:
>
> > All arches with very similar diff XXX.out vs XXX.out.bad:
> - 943, 1006, 1008, 1011, 1014, 1015, 1016, 1021, 1022
>
> 061 - cdfa900f - log preamble non-deterministic size
> 600 - TODO (timezones) - passing locally though.
> 128 - d4a9e1e1 - sort vs join locale
> 283 - 39c9e564 - avahi_entry_group_new diagnostic
> 928 - c72755d3 - assumed au timezone
> 943 - 9279ce5b - handle empty cgroups procfs file
>
> 395,400 - 07423d8e - valgrind vs local ctx sample_init
> -> PPC64 -> const char * strcpy (of size 3) -> libc accesses 8 bytes
>
> 1006, 1008, 1011, 1014, 1015, 1016, 1021, 1022 - 1408de56, 5e650b20,
> bc21c577, 8bfbd481, a875d38a, 81785d88, 0dcb2e77, e2675b31 - missing archives
>
> 386, 388, 391, 393, 395, 400, 405, 459, 490, 513, 526, 719, 720, 783, 875,
> 957 - fd6c591c - valgrind vs NSPR libs
>
> x86_64: 018, 136, 314, 457
> 018 (passed - must be racy)
> 136 - long hostname might be causing log offset weirdness?
> (hp-ms-01-c35.moonshot1.lab.eng.rdu.redhat.com - pmcd.pmlogger.pmcd_host)
> 314 - 761e8a0 - taking a while to rotate logs, causing extra messages
> 457 - race in pmdalogger-rotation test, I think; for some reason more likely
> here? - test needs work, most likely
>
> s390x: 034, 088, 136, 178, 515, 581, 651
> 034 (passed)
> 088 (passed)
> 136 (passed)
> 178 (passed)
> (*) 515 - pmie rising/fallout test - missing some output? - test has timing
> dependence (pmie sampling on 0.4s interval, vs 2sec sleeps)
> 581 (passed)
> 651 (passed)
>
> i686: 093, 457, 581, 713, 725, 741
> 093 (passed)
> 457 - exactly as for x86_64 (pmdalogger)
> 581 (passed)
> 581 [notrun - fips]
> 725 - ed7e935d - broken test, not handling 32-bit kernel ulong correctly
> 741 - not sure - a 2nd "local:" connection failed with "No route to host"?
> (python APIs)
>
> ppc64: 136, 177, 214, 428, 457, 972
> 136 (passed)
> 177 (passed)
> 214 (notrun?)
> 428 (passed)
> 457 - exactly as for x86_64 (pmdalogger)
> 972 - python ppc64-specific 2.6.6 bug??? (passes on RHEL7 with 2.7+)
> problem is related to unsigned long long conversion via PyArg_Parse:
> < value 50933760 -> 0x3093000
> > value 814940160 -> 0x30930000
> (looks like conversions off by 4 bits???)
> (gdb) p /x (0x30930000 >> 4)
> $1 = 0x3093000

[1] https://beaker.engineering.redhat.com/tasks/executed?recipe_task_id=23117598&recipe_task_id=23117569&recipe_task_id=23117583&recipe_task_id=23117555&recipe_task_id=23116924&recipe_task_id=23116939&recipe_task_id=23117613&recipe_task_id=23116956

Hi Miloš,

Apologies for the slow response - timing didn't work out for us this time, as I was travelling right before your latest test run. I've gone through the results as well now (notes below) and think I will put together another build with the handful of fixes described below.

In summary though, with the exception of s390x, things look pretty good - we're in the ~2% failure range overall. On s390x it seems most of the failures (incl. that sanity one) are due to general platform slowness - we had this on RHEL7, ISTR, but it got better late in the game with unrelated-to-PCP updates. So, s390x is at a ~4% failure rate, but in areas that would be fine if the platform could schedule things a bit better. Nothing we can do about that, though.

So, I'll put together the next build shortly - if you do happen to find some FIPS machines, maybe wait for that build to appear before testing? Or another non-FIPS run with the latest build would be great too - I'm not going away for a while again :) so will be able to take a detailed look with you more quickly. Plus, we should be closer to 1% with the next set of results, I hope. Anyway, scribbled notes follow... cheers.

> [initial-primary-pmlogger-not-restarting] ... extend to 20sec?

As everything is running locally, I'd be surprised if increasing this helps - it seems more likely that the primary (-P) pmlogger is simply exiting immediately, or is not even being attempted to start (for some mysterious reason), or pmlc cannot connect to it (possibly a permissions issue).
The message comes when pmlc fails to connect to pmlogger (see the qa/common.check script, approx line 310) - there's a bit of debug code we could enable in there to try to track it down (I'll attach a patch shortly, or maybe do a new build with it included). It may be short/long hostname related, actually ... hmm, let me include a fix there too.

> [...zswap...]
> Those 4 bits you observe in conversion come from bigger page size

Yes! That is very likely to be it, and so this looks like a test failure, where it's assuming 4K pages (the input data is fixed, but the values are multiplied by the pagesize). I've committed 3f0672d4e4 upstream to address this - thanks!

- AVC errors:

type=SYSCALL msg=audit(1406583107.479:9687): arch=c000003e syscall=59 success=yes exit=0 a0=a4e360 a1=a4dd50 a2=a4e3a0 a3=30 items=0 ppid=10832 pid=10833 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="rsyslogd" exe="/sbin/rsyslogd" subj=unconfined_u:system_r:syslogd_t:s0 key=(null)
type=AVC msg=audit(1406583107.479:9687): avc: denied { append } for pid=10833 comm="rsyslogd" path="/var/lib/pcp/testsuite/555.full" dev=dm-0 ino=791743 scontext=unconfined_u:system_r:syslogd_t:s0 tcontext=unconfined_u:object_r:pcp_var_lib_t:s0 tclass=file

Ah, that's interesting - rsyslog is unable to write to the test's .full file. That's a failure due to an assumption in the test, not a genuine problem - I think we can fix that by getting it to write to a temp file and moving that to $seq.full at the end ... I'll prep a patch.

- Missing packages => not-run tests:
avahi, avahi-tools, xfsprogs (patch attached)

- Failure audit - x86_64:
128 - join(1) failed, file not sorted - argh, thought I'd fixed this one.
(need to look at a failing host again - worked OK when I had one)
136 - first sample is unexpectedly large, likely long hostname related.
457 - race condition in test for pmdalogger.
719 - valgrind 'no leaks possible' vs '0 errors from 0 contexts'
(possible side-effect of earlier NSS issue now being fixed)
720 - valgrind, as above
875 - valgrind, as above
1006 - missing qa/views directory - argh
1008 - missing qa/archives directory - argh
1011 - missing qa/archives, as above
1015 - missing qa/archives, as above
1016 - missing qa/archives, as above
1021 - missing qa/archives, as above
1022 - missing qa/archives, as above
-> x86_64: Failed 13 of 613 tests => ~2.1%

- Failure audit - i386:
093 - too few log reads?! (test filter failure)
581 - needs filter on network devices prefixed by 'usb' (filter failure)
741 - strange python API failure creating second local: context (unsure)
128 719 720 875 1006 1008 1011 1015 1016 1021 1022 - as for other platforms.
Failed 14 of 608 tests
-> i386: Failed 14 of 613 tests => ~2.3%

- Failure audit - ppc64:
400 - valgrind reports strncpy(3) reading more bytes than allowed.
972 - zswap pagesize issue (upstream 3f0672d4e4)
457 719 720 875 1006 1008 1011 1015 1016 1021 1022 - as for other platforms.
-> ppc64: Failed 13 of 606 tests => ~2.1%

- Failure audit - s390x:
034 - seems s390x cannot sample sufficiently quickly, or too quickly
044 - proc.psinfo.utime metric used, but PM_ERR_NOAGENT (no proc PMDA?)
088 - too many samples observed in the initially-created QA archives
102 - possible scheduling issue/race - no pmlogger, pmlc took 4sec to start?
177 - too many read(2) calls needed to read archive (possible dup of 088)
189 - pmie scheduling issues, kernel does not schedule quickly enough?
200 - unexpectedly long delays (>5sec/more) in pmcd connection timeouts
206 - another case of too many samples (possible dup of 088/177)
251 - yet another case of too many samples (possible dup of 088/177/206)
514 - pmie scheduling issues, dup of 189.
515 - pmie scheduling issues, dup of 189/514.
558 - pmie failed to sample in timely fashion, dup of 189/514/515.
128 719 720 875 1006 1008 1011 1015 1016 1021 1022 - as for other platforms.
-> s390x: Failed 23 of 612 tests => 3.8%

Created attachment 924351 [details]
Additional rpm dependencies for PCP QA testing
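As an aside, the temp-file idea floated above for the rsyslog AVC denial on 555.full would look something like this - a minimal sketch with illustrative names, not the actual QA harness patch:

```shell
# Sketch: have syslog output captured somewhere the confined daemon may
# write (a /tmp file, rather than appending directly to a pcp_var_lib_t
# file), then move the result into place once the test finishes.
seq=555                              # illustrative test number
tmp=$(mktemp /tmp/pcpqa.XXXXXX)      # temp location, writable by rsyslogd
echo "syslog lines captured during the test" >> "$tmp"
mv "$tmp" "$seq.full"                # final location the test checks
```

This sidesteps the SELinux denial because rsyslogd never touches the pcp_var_lib_t-labelled directory directly.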
Hi Nathan!

(In reply to Nathan Scott from comment #6)
> Hi Miloš,
>
> Apologies for the slow response - timing didn't work out for us this time,
> as I was travelling right before your latest test run. I've gone
> through the results as well now (notes below) and think I will put together
> another build with the handful of fixes described below.

I'm sorry for the delay on my side, too - I got stuck during the RHSCL testing cycle and I was sick and... Anyway, I have results for you with pcp-3.9.4-5, reserved machines and a few notes, as always.

- pcp-testsuite-light aka "./check -g sanity": https://beaker.engineering.redhat.com/tasks/executed?recipe_task_id=23417152&recipe_task_id=23417165&recipe_task_id=23417191&recipe_task_id=23417095&recipe_task_id=23417178&recipe_task_id=23417131&recipe_task_id=23417119&recipe_task_id=23417107
- pcp-testsuite aka "./check -g local": https://beaker.engineering.redhat.com/tasks/executed?recipe_task_id=23417153&recipe_task_id=23417166&recipe_task_id=23417096&recipe_task_id=23417179&recipe_task_id=23417120&recipe_task_id=23417108&recipe_task_id=23417192&recipe_task_id=23417132
- "-g sanity" did not start on 6 of 8 machines; it passed on the remaining 2 (x86_64 (non-FIPS) and s390x)
- "-g local" was killed by a watchdog on the FIPS x86_64 and ppc64 machines (but not on FIPS i386) - test case 840 (probably; the one after 832 according to logs from other machines) got stuck and caused termination of the whole task when its reserved time ran out.

Machines are reserved (and I'll keep them reserved at least till Monday, just to be sure none of them goes back to the pool before you can get your hands on it) and passwords are set as usual. Usually I have another set of machines reserved as my own pcp playground, to avoid any violation of results, logs, local files, etc. on the ones I provide to you, but these days there is quite a long queue in beaker, so I hope it won't take too long before I'll be able to debug pcp-testsuite-light more deeply.
> In summary though, with the exception of s390x, things look pretty good -
> we're in the ~2% failure range overall. On s390x it seems most of the
> failures (incl. that sanity one) are due to general platform slowness - we
> had this on RHEL7 ISTR, but it got better late in the game with
> unrelated-to-PCP updates. So, s390x is at ~4% failure rate, but in areas
> that would be fine if the platform could schedule things a bit better.
> Nothing we can do about that though.
>
> So, I'll put together the next build shortly - if you do happen to find some
> FIPS machines, maybe wait for that build to appear before testing? Or
> another non-FIPS run with latest build would be great too - I'm not going
> away for awhile again :) so will be able to take a detailed look with you
> more quickly. Plus, we should be closer to 1% with next set of results, I
> hope. Anyway, scribbled notes follow... cheers.

There were some troubles with the RHEL6.6 beaker trees with regard to FIPS (not installable, not available, ...) but this seems to be solved, so this time there are results for both non-FIPS and FIPS machines. You can find out which one you are inspecting at the beginning of TESTOUT.log:

:: [ 10:19:02 ] :: [ INFO ] :: ARCH=ppc64
:: [ 10:19:03 ] :: [ INFO ] :: FIPS_ENABLED=1

> > [initial-primary-pmlogger-not-restarting] ... extend to 20sec?
>
> As everything is running locally, I'd be surprised if increasing
> this helps - it seems more likely that the primary (-P) pmlogger
> is simply exiting immediately, or is not even being attempted to
> start (for some mysterious reason), or pmlc cannot connect to it
> (possibly a permissions issue).
>
> The message comes when pmlc fails to connect to pmlogger (see the
> qa/common.check script, approx line 310) - there's a bit of debug
> code we could enable in there to try track it down (I'll attach a
> patch shortly, or maybe do a new build with it included). It may
> be short/long hostname related actually ...
> hmm, let me include a
> fix there too.

Such a fix - or at least a pointer to what might be going on - would be very useful, because pcp-testsuite-light is effectively crippled because of this. This time it passed even on one non-s390x machine (but failed on the second s390x one, so...). I'll have some spare time during Thu/Fri and I plan to take a better look at this, equipped with gdb, strace & friends :)

> > [...zswap...]
> > Those 4 bits you observe in conversion come from bigger page size
>
> Yes! That is very likely to be it, and so this looks like a test
> failure, where its assuming 4K pages (input data is fixed, but the
> values are multiplied by the pagesize). I've committed 3f0672d4e4
> upstream to address this - thanks!

You are welcome.

> - AVC errors:
>
> type=SYSCALL msg=audit(1406583107.479:9687): arch=c000003e syscall=59
> success=yes exit=0 a0=a4e360 a1=a4dd50 a2=a4e3a0 a3=30 items=0 ppid=10832
> pid=10833 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0
> fsgid=0 tty=(none) ses=4294967295 comm="rsyslogd" exe="/sbin/rsyslogd"
> subj=unconfined_u:system_r:syslogd_t:s0 key=(null)
> type=AVC msg=audit(1406583107.479:9687): avc: denied { append } for
> pid=10833 comm="rsyslogd" path="/var/lib/pcp/testsuite/555.full" dev=dm-0
> ino=791743 scontext=unconfined_u:system_r:syslogd_t:s0
> tcontext=unconfined_u:object_r:pcp_var_lib_t:s0 tclass=file
>
> Ah, that's interesting - rsyslog unable to write to the tests .full file.
> That's a failure due to an assumption in the test, not a genuine problem -
> I think we can fix that by getting it to write to a temp file and moving
> that to $seq.full at the end ... I'll prep a patch.
>
> - Missing packages => not-run tests:
> avahi, avahi-tools, xfsprogs (patch attached)

Thanks for pointing these out to me - added! Please don't pay any attention to warnings caused by the missing xfsprogs package on some arches; I'll fix it later.
> - Failure audit - x86_64:
> 128 - join(1) failed, file not sorted - argh, thought I'd fixed this one.
> (need to look at a failing host again - worked OK when I had one)
> 136 - first sample is unexpected large, likely long hostname related.
> 457 - race condition in test for pmdalogger.
> 719 - valgrind 'no leaks possible' vs '0 errors from 0 contexts'
> (possible side-effect of earlier NSS issue now being fixed)
> 720 - valgrind, as above
> 875 - valgrind, as above
> 1006 - missing qa/views directory - argh
> 1008 - missing qa/archives directory - argh
> 1011 - missing qa/archives, as above
> 1015 - missing qa/archives, as above
> 1016 - missing qa/archives, as above
> 1021 - missing qa/archives, as above
> 1022 - missing qa/archives, as above
> -> x86_64: Failed 13 of 613 tests => ~2.1%
>
> - Failure audit - i386:
> 093 - too few log reads?! (test filter failure)
> 581 - needs filter on network devices prefixed by 'usb' (filter failure)
> 741 - strange python API failure creating second local: context (unsure)
> 128 719 720 875 1006 1008 1011 1015 1016 1021 1022 - as for other platforms.
> Failed 14 of 608 tests
> -> i386: Failed 14 of 613 tests => ~2.3%
>
> - Failure audit - ppc64:
> 400 - valgrind reports strncpy(3) reading more bytes than allowed.
> 972 - zswap pagesize issue (upstream 3f0672d4e4)
> 457 719 720 875 1006 1008 1011 1015 1016 1021 1022 - as for other platforms.
> -> ppc64: Failed 13 of 606 tests => ~2.1%
>
> - Failure audit - s390x:
> 034 - seems s390x cannot sample sufficiently quickly, or too quickly
> 044 - proc.psinfo.utime metric used, but PM_ERR_NOAGENT (no proc PMDA?)
> 088 - too many samples observed in the initially-created QA archives
> 102 - possible scheduling issue/race - no pmlogger, pmlc took 4sec to start?
> 177 - too many read(2) calls needed to read archive (possible dup of 088)
> 189 - pmie scheduling issues, kernel does not schedule quickly enough?
> 200 - unexpectedly long delays (>5sec/more) in pmcd connection timeouts
> 206 - another case of too many samples (possible dup of 088/177)
> 251 - yet another case of too many samples (possible dup of 088/177/206)
> 514 - pmie scheduling issues, dup of 189.
> 515 - pmie scheduling issues, dup of 189/514.
> 558 - pmie failed to sample in timely fashion, dup of 189/514/515.
> 128 719 720 875 1006 1008 1011 1015 1016 1021 1022 - as for other platforms.
> -> s390x: Failed 23 of 612 tests => 3.8%

Hi Miloš,
> Such a fix - or at least a pointer to what might be going on - would be very
> useful because pcp-testsuite-light is effectively crippled because of this.
> This time it passed even on one non-s390x machine (but failed on the second
> s390x one, so...). I'll have some spare time during Thu/Fri and I plan to
> take a better look at this, equipped with gdb, strace & friends :)
I don't have any insights as to a cause from digging through the logs, sorry.
It's gotta be something about the initial machine state, I guess... but exactly
what, I haven't been able to figure out. I think it will need a run with a
fresh machine setup to be interrupted before starting the tests, and yeah -
like you said - deeper analysis with those tools.
FWIW, on the full runs we did get, and except for the hangs, the indicators are
looking a lot better. Lots and lots of previously failing tests are passing -
results look a bit better than we achieved with RHEL7 even, so that's cool.
My other notes from logfile scanning...
840 is a new common failure - I've asked Dave Brolley, who wrote
this one and the PCP code behind it. It's new because we've now
installed the Avahi tools - previously it would've been _notrun().
s390x is looking much better now - s'funny same thing happened in
RHEL7 - something else in the system/kernel was changed I guess.
x86_64 (FIPS)
457 - known issue, pmdalogger test race
832-ish hang (could be Avahi related? could be FIPS)
-> 91% done, 1 failure! was going well up to the hang.
s390x (non-FIPS)
469 - help text is >84chars - heh, since fixed upstream but not
really an issue worth worrying about (cosmetic).
840 - Avahi issue reported above
-> 2 failures only - awesome result for s390x!
ppc64 (FIPS)
099-ish - hang (unknown - could be Avahi, could be FIPS?)
i686 (FIPS)
* Hmm - lots of never-seen-before network-related interruptions here
and difficulties getting pmie timers to fire quickly enough - I'm
wondering if something hardware - CPU/network? - related was going
on during this testing that affected this machine? (*) (oh, FIPS?)
* 160 - seemed to lose several pmstore PDUs
* 177 - lost connection to the pmlogger instance
* 295 - failures to connect to pmproxy
* 314 - possible slow pmie start, causing racey output
* 356 - pmie unable to schedule tasks quickly enough
* 428 - missing some sample values, again from pmstore
457 - known issue with pmdalogger test race condition
* 535 - pmie unable to schedule tasks quickly enough
* 558 - pmie unable to schedule tasks quickly enough
* 581 - pmie unable to schedule tasks quickly enough
741 - known issue on i386 with python and multiple contexts
840 - Avahi issue reported above
x86_64 (non-FIPS)
136 - log header is still slightly larger than expected
(earlier hostname fix has not resolved this, hmm).
840 - Avahi issue
s390x (non-FIPS)
-> just two failures 469 & 840 - same failures as above s390x - excellent!
ppc64 (non-FIPS)
400 - still strncpy-related valgrind failures, earlier workarounds
ineffective, ugh. Pretty sure this is due to glibc/valgrind
as it's only this platform, and not in newer versions of them.
457 - known issue with pmdalogger test race condition
840 - Avahi issue reported above
i386 (non-FIPS)
Failures:
136 - log header is still slightly larger than expected
(earlier hostname fix has not resolved this, ugh).
428 - seemed to lose several pmstore PDUs
457 - known issue with pmdalogger test race condition
713 - spurious "TCP connection reset by peer" failures.
741 - known issue on i386 with python and multiple contexts.
840 - Avahi issue reported above
This time the results are from a manual run, so it's just a list of log files. My comments on the failed test cases are below; any comment is welcome. Machines are available and accessible (search for Hostname in the logs) in case you'd like to check something locally. Our QE logs are copied here: http://nest.test.redhat.com/mnt/qa/scratch/mprchlik/pcp/

Only fails are mentioned below - the vast majority are broken test cases, known issues, or issues (probably) caused by a slow machine. Probably the only non-waivable ones are 709 and 722. Amazing job, guys, everything seems to work very well! :)

----------------------
128

--- 128.out 2014-08-06 02:52:20.000000000 -0400
+++ 128.out.bad 2014-09-10 09:51:47.019451436 -0400
@@ -5,4 +5,5 @@
 fetching pminfo values...
 comparing all values...
 (details in 128.full)
+join: file 1 is not in sorted order
 done

> 128: join(1) finds the output of sort(1) two lines earlier to be unsorted. :P
> ... need access to a machine with this problem, not reproducible here.
> Hopefully fix as simple as giving the join invocation a locale matching
> that used by sort.

Machine is available, ibm-p8-kvm-03-guest-08.rhts.eng.bos.redhat.com, root/redhat. But I think it's safe to waive, since it's a broken test case, not pcp itself.

----------------------
ppc64/136

--- 136.out 2014-08-06 02:52:20.000000000 -0400
+++ 136.out.bad 2014-09-10 05:17:20.715015250 -0400
@@ -1,5 +1,5 @@
 QA output created by 136
-MYLOG.0 700-800 bytes
+MYLOG.0 800-900 bytes
 MYLOG.1 600-700 bytes
 MYLOG.2 600-700 bytes
 MYLOG.3 600-700 bytes

> 136 - long hostname might be causing log offset weirdness?
> ...
> 136 - first sample is unexpectedly large, likely long hostname related.
> ...
> 136 - log header is still slightly larger than expected
> (earlier hostname fix has not resolved this, hmm).

The hostname is indeed quite long: ibm-p8-kvm-03-guest-08.rhts.eng.bos.redhat.com. Is this possibly fixed already in upstream?
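On the 128 locale theory quoted above, the suspected fix is just forcing an identical collation for both sort(1) and join(1); a minimal sketch of the idea with illustrative file names, not the actual test script:

```shell
# With differing locales, join(1) can consider sort(1) output unsorted;
# pinning LC_ALL (which overrides LC_COLLATE) for both commands keeps
# their collation orders consistent.
cd "$(mktemp -d)"
export LC_ALL=POSIX
printf 'b 1\na 2\n' > left.txt
printf 'a x\nb y\n' > right.txt
sort -k1,1 left.txt > left.sorted
sort -k1,1 right.txt > right.sorted
join -j 1 left.sorted right.sorted   # joins cleanly: "a 2 x", "b 1 y"
```

The same effect could be had by exporting one locale at the top of the test, which is presumably what the eventual d4a9e1e1 commit does.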
----------------------

s390x/156

--- 156.out	2014-08-06 02:45:57.000000000 -0400
+++ 156.out.bad	2014-09-10 05:20:27.472213082 -0400
@@ -170,7 +170,7 @@
 Terminate PMDA if already installed ...
 [...install files, make output...]
 Updating the PMCD control file, and notifying PMCD ...
-Check sample metrics have appeared ... 10+ warnings, 100+ metrics and 1200+ values
+Check sample metrics have appeared ... 140 warnings, 100+ metrics and 0 values
 ======================
 = sampledso DSO PMDA =
 ======================

No comments from you about this one, and no idea on my side either. It also seems machine-specific and a one-time event - we haven't encountered it during months of test runs, and I'm not able to reproduce it again.

----------------------

x86/160

--- 160.out	2014-08-06 02:49:56.000000000 -0400
+++ 160.out.bad	2014-09-10 08:26:04.310199840 -0400
@@ -5,10 +5,4 @@
 inst [ INST-1 or "INST-1" ] value "<host> QA-clientid bb"
 inst [ INST-1 or "INST-1" ] value "<host> QA-clientid ccc"
 inst [ INST-1 or "INST-1" ] value "<host> QA-clientid dddd"
-inst [ INST-1 or "INST-1" ] value "<host> QA-clientid eeeee"
-inst [ INST-1 or "INST-1" ] value "<host> QA-clientid ffffff"
-inst [ INST-1 or "INST-1" ] value "<host> QA-clientid ggggggg"
-inst [ INST-1 or "INST-1" ] value "<host> QA-clientid hhhhhhhh"
-inst [ INST-1 or "INST-1" ] value "<host> QA-clientid iiiiiiiii"
 inst [ INST-2 or "INST-2" ] value "<host> QA-clientid foo-bar-mumble"
-inst [ INST-3 or "INST-3" ] value "<host> QA-clientid here is a longer string to go in the ident place holder to try things out that are not quite so short and boring"

> 160 - seemed to lose several pmstore PDUs

Failed again here, but passed on a different machine, and is not reproducible again. Probably safe to waive. The machine was horribly slow; Nathan already mentioned this might be related.
----------------------

177

--- 177.out	2014-08-06 02:45:57.000000000 -0400
+++ 177.out.bad	2014-09-10 05:42:56.952007913 -0400
@@ -27,21 +27,21 @@
 === pmval -t 0.1 ===
 58-62 samples
-__pmLogRead-back 1-5 calls
-__pmLogRead-forw 51-57 calls
+__pmLogRead-back unexpected 8 calls, not 1-5
+__pmLogRead-forw unexpected 60 calls, not 51-57
 __pmLogRead-peek 1-1 calls
 _pmFetchInterp 116-120 calls
 === pmval -t 0.2 ===
 28-32 samples
 __pmLogRead-back 1-5 calls
-__pmLogRead-forw 51-57 calls
+__pmLogRead-forw unexpected 60 calls, not 51-57
 __pmLogRead-peek 1-1 calls
 _pmFetchInterp 58-60 calls
 === pmval -t 0.4 ===
 13-17 samples
 __pmLogRead-back 5-20 calls
-__pmLogRead-forw 51-57 calls
+__pmLogRead-forw unexpected 61 calls, not 51-57
 __pmLogRead-peek 1-1 calls
 _pmFetchInterp 29-30 calls

> 177 - too many read(2) calls needed to read archive (possible dup of 088)
> ...
> 177 - lost connection to the pmlogger instance

The machine was horribly slow (athlon4.rhts.eng.bos.redhat.com, root/redhat), which might be related, as suggested by the previous comment. I'm not able to reproduce it again, though.

----------------------

x86/200

--- 200.out	2014-08-06 02:49:56.000000000 -0400
+++ 200.out.bad	2014-09-10 09:54:08.745127273 -0400
@@ -12,14 +12,14 @@
 Starting pmcd ...
 pmFetch failed: IPC protocol failure
 pmReconnectContext: success
-delay: 6-10 secs
+delay: 14-16 secs
 PMCD_RECONNECT_TIMEOUT="2"
 Waiting for pmcd to terminate ...
 Starting pmcd ...
 pmFetch failed: IPC protocol failure
 pmReconnectContext: success
-delay: 6-10 secs
+delay: 11 secs
 PMCD_RECONNECT_TIMEOUT="1-2"
 Waiting for pmcd to terminate ...

> 200 - unexpectedly long delays (>5sec/more) in pmcd connection timeouts

Slow machine; would waive.

----------------------

x86/314

--- 314.out	2014-08-06 02:49:57.000000000 -0400
+++ 314.out.bad	2014-09-10 10:43:41.723191101 -0400
@@ -3,4 +3,5 @@
 Rotate, rotate...
 Shutdown pmie process
 Old logfile line count is in range
-New logfile line count is in range
+New logfile line count has value of 110
+New logfile line count is NOT in range 9. .. 63.

> 314 - 761e8a0 - taking awhile to rotate logs, causing extra messages
> ...
> 314 - possible slow pmie start, causing racey output

The very same horribly slow machine. Waived.

----------------------

x86/356

--- 356.out	2014-08-06 02:49:57.000000000 -0400
+++ 356.out.bad	2014-09-10 10:58:43.945932020 -0400
@@ -15,8 +15,73 @@
 [TIMESTAMP] pmie(PID) Info: Re-established connection to pmcd on host HOST
+sleepTight: negative delay (-3.490459). sched=1410361103.645210, cur=1410361107.135669
+Last sleepTight (retry) until: DATE.645210
+Last Task dump @ 0x1b55870
+  nth=0 delta=1.000 tick=10 next=(nil) prev=(nil)
+  eval time: DATE.645210
+  retry time: DATE.000000
+  host=athlon4.rhts.eng.bos.redhat.com (up)
+  rules:
+    expr_1
+...
+
+This sleepTight() entry: DATE.135665
+Harvest children done: DATE.135668
+Want sleepTight (eval) until: DATE.645210
+This Task dump @ 0x1b55870
+  nth=0 delta=1.000 tick=10 next=(nil) prev=(nil)
+  eval time: DATE.645210
+  retry time: DATE.000000
+  host=athlon4.rhts.eng.bos.redhat.com (up)
+  rules:
+    expr_1
+...
+
+sleepTight: negative delay (-2.520393). sched=1410361104.645210, cur=1410361107.165603
+Last sleepTight (eval) until: DATE.645210
+Last Task dump @ 0x1b55870
+  nth=0 delta=1.000 tick=11 next=(nil) prev=(nil)
+  eval time: DATE.645210
+  retry time: DATE.000000
+  host=athlon4.rhts.eng.bos.redhat.com (up)
+  rules:
+    expr_1
+...
+
+This sleepTight() entry: DATE.165601
+Harvest children done: DATE.165602
+Want sleepTight (eval) until: DATE.645210
+This Task dump @ 0x1b55870
+  nth=0 delta=1.000 tick=11 next=(nil) prev=(nil)
+  eval time: DATE.645210
+  retry time: DATE.000000
+  host=athlon4.rhts.eng.bos.redhat.com (up)
+  rules:
+    expr_1
+...
+
+sleepTight: negative delay (-1.530933). sched=1410361105.645210, cur=1410361107.176143
+Last sleepTight (eval) until: DATE.645210
+Last Task dump @ 0x1b55870
+  nth=0 delta=1.000 tick=12 next=(nil) prev=(nil)
+  eval time: DATE.645210
+  retry time: DATE.000000
+  host=athlon4.rhts.eng.bos.redhat.com (up)
+  rules:
+    expr_1
+...
-expr_1: value(s)
+This sleepTight() entry: DATE.176141
+Harvest children done: DATE.176142
+Want sleepTight (eval) until: DATE.645210
+This Task dump @ 0x1b55870
+  nth=0 delta=1.000 tick=12 next=(nil) prev=(nil)
+  eval time: DATE.645210
+  retry time: DATE.000000
+  host=athlon4.rhts.eng.bos.redhat.com (up)
+  rules:
+    expr_1
 ...
 [TIMESTAMP] pmie(PID) Info: evaluator exiting

> 356 - pmie unable to schedule tasks quickly enough

Again, slow machine, waiving.

----------------------

ppc64/400

--- 400.out	2014-08-06 02:52:22.000000000 -0400
+++ 400.out.bad	2014-09-10 09:45:13.064408028 -0400
@@ -14,10 +14,54 @@
 === filtered valgrind report ===
 Memcheck, a memory error detector
 Command: pminfo -x -L sampledso.event.records
+Invalid read of size 8
+   at 0x534881C: sample_init (sample.c:1157)
+   by 0x40FBD3B: __pmConnectLocal (connectlocal.c:390)
+   by 0x40C6C37: pmNewContext (context.c:509)
+   by 0x10002953: main (pminfo.c:602)
+ Address 0x4f2de90 is 0 bytes inside a block of size 4 alloc'd
...

> 400 - still strncpy-related valgrind failures, earlier workarounds
> ineffective, ugh.  Pretty sure this is due to glibc/valgrind
> as its only this platform and not in newer versions of them.

Probably safe to waive then.

----------------------

s390x/411

--- 411.out	2014-08-06 02:45:58.000000000 -0400
+++ 411.out.bad	2014-09-10 06:18:18.631610143 -0400
@@ -1,4 +1,5 @@
 QA output created by 411
+pminfo: Cannot connect to PMCD on host "local:": Connection refused
 You will need to choose an appropriate configuration for installation of
 the "simple" Performance Metrics Domain Agent (PMDA).
@@ -15,6 +16,8 @@
 Terminate PMDA if already installed ...
 [...install files, make output...]
 Updating the PMCD control file, and notifying PMCD ...
+Starting pmcd ...
+Starting pmlogger ...
 Check simple metrics have appeared ... 5 metrics and 9 values
 simple.now

Looks like a one-time-only issue. I'm not able to reproduce it; it might be some machine-specific issue. Probably safe to waive.

----------------------

ppc64/428

--- 428.out	2014-08-06 02:52:22.000000000 -0400
+++ 428.out.bad	2014-09-10 06:27:20.820474345 -0400
@@ -70,7 +70,6 @@
 Unique values for sample.string.write_me ...
 "1024"
-"16"
 "256"
 "4096"

> 428 - seemed to lose several pmstore PDUs

During the testing phase it failed on several other arches, but also passed many times. According to Nathan's comments it looks like nothing to worry about too much, as it might be caused by a very slow or busy machine. Would waive.

----------------------

457

--- 457.out	2014-08-06 02:52:22.000000000 -0400
+++ 457.out.bad	2014-09-10 06:30:42.413088706 -0400
@@ -157,10 +157,10 @@
     value 0
 logger.perfile.reg.bytes
-    value 0
+    value 56
 logger.perfile.reg.count
-    value 0
+    value 2
 Checking new log file
 logger.perfile.reg.queuemem
     value 0
@@ -178,10 +178,10 @@
     value 56
 logger.perfile.reg.bytes
-    value 56
+    value 112
 logger.perfile.reg.count
-    value 2
+    value 4
 === 4. non-existant file ===
 Check a missing file

> 457 - known issue with pmdalogger test race condition

Ok, waived.

----------------------

s390x/469

--- 469.out	2014-08-06 02:45:58.000000000 -0400
+++ 469.out.bad	2014-09-10 06:28:04.601438810 -0400
@@ -1 +1,2 @@
 QA output created by 469
+kernel.percpu.cpu.irq.hard, full help text is too long: 85

> 469 - help text is >84chars - heh, since fixed upstream but not
> really an issue worth worrying about (cosmetic).

Ok, waived, not a pcp issue.
----------------------

s390x/651

--- 651.out	2014-08-06 02:45:58.000000000 -0400
+++ 651.out.bad	2014-09-10 06:55:49.101212934 -0400
@@ -1,6 +1,7 @@
 QA output created by 651
 == pmproxy not running ==
-pmprobe: Cannot connect to PMCD on host "MYHOST": Connection refused
+[Wed Sep 10 06:55:44] pmprobe(55092) Warning: __pmConnectPMCD: bad version string from pmproxy: ""
+pmprobe: Cannot connect to PMCD on host "MYHOST": IPC protocol failure
 == simplest default case ==
 sample.long.hundred 1 100

I'm not able to reproduce it, and it's a single failure after many previous runs without any trouble at all. Probably safe to waive.

----------------------

x86/709

--- 709.out	2014-08-06 02:49:59.000000000 -0400
+++ 709.out.bad	2014-09-10 16:01:50.435298931 -0400
@@ -37,7 +37,7 @@
 9M 9M 9M 9M 9M 9M 9M 9M 9M 9M 9M 9M 9M 9M 9 9 9 9 9 9
 9M 9M 9M 9M 9M 9M 9M 9M 9M 9M 9M 9M 9M 9M 9 9 9 9 9 9
 -sN network interface names are okay
--sD disk names are okay
+No match for -sD
 #<--------CPU--------><----------Disks-----------><----------Network---------->
 #cpu sys inter ctxsw KBRead Reads KBWrit Writes KBIn PktIn KBOut PktOut
 9 9 9 9 9 9 9 9 9 9 9 9

Never seen this one before, and it manifested only on the x86 arch.

----------------------

722

--- 722.out	2014-08-06 02:52:23.000000000 -0400
+++ 722.out.bad	2014-09-10 07:04:46.349426042 -0400
@@ -223,6 +223,8 @@
 9 9K 9K 9K 9K 9K 9K 9% kworker/9:9
 9 9K 9K 9K 9K 9K 9K 9% kworker/9:9
 9 9K 9K 9M 9K 9M 9K 9% ssh-agent
+Invalid command
+Type 'h' to see a list of valid commands
 ATOP - Day Month 9 9:9:9 9 9:9:9 elapsed
 PRC | sys 9h9m | user 9d | #proc 9 | #zombie 9
 CPU | sys 9% | user 9% | irq 9% | idle 9% | wait 9% |
@@ -447,3 +449,5 @@
 9 9s 9s 9K 9K root 9 -- - S 9% kworker/9:9
 9 9s 9s 9K 9K root 9 -- - S 9% kworker/9:9
 9 9s 9s 9M 9K root 9 -- - S 9% ssh-agent
+Invalid command
+Type 'h' to see a list of valid commands

This one is strange. I'm unable to reproduce it by running "./check 722", even though it looks like some command/script at a lower level changed its parameters...
----------------------

840

--- 840.out	2014-08-06 02:52:23.000000000 -0400
+++ 840.out.bad	2014-09-10 07:09:50.743347205 -0400
@@ -1,2 +1 @@
 QA output created by 840
- hostname = [HOSTNAME.local]

> 840 is a new common failure - I've asked Dave Brolley who wrote
> this one and the PCP code behind it.  Its new because we've now
> installed the Avahi tools - previously would've been _notrun().

This one (probably!) caused automated beaker jobs to time out by running out of reserved time. This time no such thing happened, but the test case failed with the message shown above. Seems like an easy fix, but I'm not sure whether it means something is wrong with the pcp tools or it's just a broken test case.

----------------------

(In reply to Miloš Prchlík from comment #10)
> Probably only non-waivable are 709 and
> 722.

I had a look at these, but could reproduce neither of them:

709 looks like the extra output is from some tracing code, either in pcp or in the test case?

722 is strange indeed. It looks like some extra output from running pmatop.

These both look like tests of the python pmapi bindings. Perhaps scox has some insight.

(In reply to Miloš Prchlík from comment #10)
>
> ----------------------
>
> 840
>
> --- 840.out	2014-08-06 02:52:23.000000000 -0400
> +++ 840.out.bad	2014-09-10 07:09:50.743347205 -0400
> @@ -1,2 +1 @@
>  QA output created by 840
> - hostname = [HOSTNAME.local]
>
> > 840 is a new common failure - I've asked Dave Brolley who wrote
> > this one and the PCP code behind it.  Its new because we've now
> > installed the Avahi tools - previously would've been _notrun().
>
> This one (probably!) caused automated beaker jobs to time out by
> running out of reserved time. This time no such thing happened, but the test
> case failed with the message shown above. Seems like an easy fix, but I'm
> not sure whether it means something is wrong with the pcp tools or it's just
> a broken test case.
>
> ----------------------

This test could fail if:

1) the output of `hostname` does not match the hostname printed by avahi-browse. This would be a test case problem.

2) the mdns port on the test machine is not open (port 5353/udp). This would be a problem with the setup of the machine.

I can't tell from looking at the logs.

709

 -sN network interface names are okay
--sD disk names are okay
+No match for -sD

Nathan changed this test to work more predictably; this is a testing oddity that should be okay.

722

+Invalid command
+Type 'h' to see a list of valid commands

Odd indeed. pmatop is a screen interface that uses curses. I cannot duplicate it, but it should be fine since the output is otherwise okay.

mprchlik advises that this tracker bug is effectively corrected by the current build.

Thanks for following up, guys,

Dave wrote:
> [...]
> 2) the mdns port on the test machine is not open (port 5353/udp).
> This would be a problem with the setup of the machine.

The test should _notrun then, I guess, and we'd know for sure; e.g. qa/543 does:

echo | $PCP_BINADM_DIR/telnet-probe localhost 25 || _notrun "Noone listening on port 25"

Stan wrote:
> [...]
> Nathan changed this test to work more predictably;

(hmm, sounds like I may have botched that effort - sorry 'bout that)

> this is a testing oddity that should be okay.

I'll look into it, thanks! The 722.full should contain some detailed pointers.

cheers.

Verified for build pcp-3.9.4-5.el6.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1477.html
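A _notrun-style guard along the lines Nathan suggests could front qa/840. This is only a sketch: `_notrun` below is a local stand-in for the real PCP QA helper, and the avahi checks are assumptions about the test's prerequisites (mDNS is 5353/udp, so a TCP telnet-probe cannot verify it directly).

```shell
# Sketch of a prerequisite guard for qa/840, modelled on the qa/543
# pattern quoted above.  _notrun here is a stand-in for the PCP QA
# helper; the specific avahi checks are assumptions.
_notrun() { echo "[not run] $*"; }

check_qa840_prereqs() {
    # avahi-browse must exist for the test to compare hostnames at all
    command -v avahi-browse >/dev/null 2>&1 || { _notrun "avahi-browse not installed"; return 1; }
    # mDNS uses 5353/udp, so a TCP probe like telnet-probe cannot check it;
    # verifying the daemon is running is the closest equivalent guard
    pgrep -x avahi-daemon >/dev/null 2>&1 || { _notrun "avahi-daemon not running"; return 1; }
    return 0
}

if check_qa840_prereqs; then
    echo "qa/840 prerequisites satisfied"
fi
```

With such a guard in place, a machine with a closed or unconfigured mDNS setup would report "[not run]" with a reason instead of producing a spurious failure.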