Bug 621041 - core test failed on s390x RHEL6 - sched_setaffinity: Invalid argument
Status: CLOSED ERRATA
Product: Red Hat Hardware Certification Program
Classification: Red Hat
Component: Test Suite (tests)
Version: 1.2
Hardware: s390x Linux
Priority: urgent, Severity: high
Assigned To: Greg Nichols
QA Contact: CAI Qian
Keywords: Regression
Depends On:
Blocks:
Reported: 2010-08-04 01:56 EDT by qcui
Modified: 2010-09-21 05:19 EDT (History)

Doc Type: Bug Fix
Doc Text:
The V7-1.2-14 core test no longer fails on IBM System z Red Hat Enterprise Linux 6 and 64-bit PowerPC Red Hat Enterprise Linux 6.
Last Closed: 2010-09-20 08:12:50 EDT

Attachments
output.log of core test on ppc64 RHEL6 (161 bytes, application/octet-stream)
2010-08-04 01:56 EDT, qcui
output.log of core test on s390x RHEL6 (414 bytes, application/octet-stream)
2010-08-04 01:57 EDT, qcui
clocktest strace (4.28 KB, application/octet-stream)
2010-08-10 10:44 EDT, Greg Nichols
sosreport for ibm-z10-15 (362.55 KB, application/x-xz)
2010-08-10 10:49 EDT, Greg Nichols
clocktest.c patch to use _SC_NPROCESSORS_ONLN (768 bytes, patch)
2010-08-11 13:22 EDT, Greg Nichols
output.log of core test on s390x RHEL6 with v7-1.2-20 (1.85 KB, application/octet-stream)
2010-08-24 07:01 EDT, qcui

Description qcui 2010-08-04 01:56:21 EDT
Created attachment 436438
output.log of core test on ppc64 RHEL6

Description of problem:
The V7-1.2-14 core test failed with 'Error:"./CORE2" has output on stderr' on both s390x RHEL6 and ppc64 RHEL6. It ran successfully on s390x RHEL5.5 and ppc64 RHEL5.5.


Version-Release number of selected component (if applicable):
[root@ibm-js12-vios-01-lp3 ~]# uname -a
Linux ibm-js12-vios-01-lp3.rhts.eng.bos.redhat.com 2.6.32-54.el6.ppc64 #1 SMP Tue Jul 27 23:45:44 EDT 2010 ppc64 ppc64 ppc64 GNU/Linux
[root@ibm-js12-vios-01-lp3 ~]# v7 version
V7 version 1.2, release 14


How reproducible:
Every time


Steps to Reproduce:
1. Install v7-1.2-14
2. # v7 run --test core

  
Actual results:
Fail


Expected results:
Pass
Comment 1 qcui 2010-08-04 01:57:34 EDT
Created attachment 436439
output.log of core test on s390x RHEL6
Comment 2 Greg Nichols 2010-08-04 15:45:51 EDT
Could not reproduce this on RHEL6 snapshot 7, v7 1.2 R14, ppc64

Clock Info: ------------------------------------------
kernel: clocksource: timebase mult[7d0000] shift[22] registered
kernel: Switching to clocksource timebase

Clock Source per system log: timebase
Clock Source in /sys/devices/system/clocksource/clocksource*/current_clocksource: timebase

Running clock tests
Testing for clock jitter on 8 cpus
PASSED, largest jitter seen was 0.000290
clock direction test: start time 1280950497, stop time 1280950557, sleeptime 60, delta 0
PASSED



What RHEL6 build was used?
Comment 3 qcui 2010-08-04 22:40:36 EDT
(In reply to comment #2)
> What RHEL6 build was used?    

RHEL6 snapshot 8
Comment 4 Greg Nichols 2010-08-05 09:56:18 EDT
Reproduced on s390x, RHEL6 snapshot 7, not only via v7's core test but also by running clocktest directly:

[root@ibm-z10-09 core]# ./clocktest
Testing for clock jitter on 2 cpus
sched_setaffinity: Invalid argument
Comment 5 Greg Nichols 2010-08-05 10:12:43 EDT
It seems something is amiss - the cpu mask coming back looks broken:

./clocktest
Testing for clock jitter on 2 cpus
cpumask = ffec2198
cpu = 0
cpumask = ffec2198
cpumask = ffec2198
cpu = 1
cpumask = ffec2198
sched_setaffinity: Invalid argument
Comment 6 Greg Nichols 2010-08-05 11:42:49 EDT
Taking ppc64 off the summary. The ppc64 failure is tracked in bug #621348: the "tree" rpm needs to be installed.
Comment 14 Greg Nichols 2010-08-10 10:44:59 EDT
Created attachment 437907
clocktest strace
Comment 15 Greg Nichols 2010-08-10 10:49:34 EDT
Created attachment 437910
sosreport for ibm-z10-15
Comment 16 Karel Zak 2010-08-11 05:09:20 EDT
sched_setaffinity() returns EINVAL. From the man page:

EINVAL The affinity bit mask mask contains no processors that are currently physically on the system and permitted to the process according to any restrictions that may be imposed by the "cpuset" mechanism described in cpuset(7).


Greg, how many CPUs does the machine have? It's necessary to distinguish between configured and online CPUs.

The sosreport (comment #15) contains only one cpu in proc/cpuinfo.

sysconf(_SC_NPROCESSORS_CONF), which is used in the test, counts the "cpuN" directories in /sys/devices/system/cpu/. That means it returns the number of "present" cpus.

Please, check /sys/devices/system/cpu/online and /sys/devices/system/cpu/present.

I think it would be better to use _SC_NPROCESSORS_ONLN in the test.
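
For anyone who wants to do the check suggested above from C rather than by eye, here is a minimal sketch. It assumes the files hold a kernel cpulist such as "0-1,3" on a single line. The helper names count_cpulist() and count_cpus_in_file() are made up for this illustration and are not part of clocktest.

```c
#include <stdio.h>
#include <stdlib.h>

/* Count the CPUs named in a kernel cpulist string such as "0-1,3".
 * Returns -1 if the list is malformed. An empty list counts as 0. */
int count_cpulist(const char *s)
{
    int count = 0;

    while (*s != '\0' && *s != '\n') {
        char *end;
        long lo = strtol(s, &end, 10);
        long hi;

        if (end == s || lo < 0)
            return -1;              /* no digits, or a negative number */
        hi = lo;
        if (*end == '-') {          /* a range "lo-hi" */
            s = end + 1;
            hi = strtol(s, &end, 10);
            if (end == s || hi < lo)
                return -1;
        }
        count += (int)(hi - lo + 1);
        if (*end == ',')
            end++;                  /* move on to the next entry */
        else if (*end != '\0' && *end != '\n')
            return -1;              /* unexpected character */
        s = end;
    }
    return count;
}

/* Read the first line of a cpulist file (hypothetical helper).
 * Returns the CPU count, or -1 if the file cannot be opened or parsed. */
int count_cpus_in_file(const char *path)
{
    char buf[4096];
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof buf, f) == NULL)
        buf[0] = '\0';              /* empty file: treat as an empty list */
    fclose(f);
    return count_cpulist(buf);
}
```

Calling count_cpus_in_file() on /sys/devices/system/cpu/online and on /sys/devices/system/cpu/present and comparing the two results should show whether the machine has present-but-offline CPUs.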
Comment 18 Karel Zak 2010-08-11 10:41:43 EDT
(In reply to comment #17)
> My question in terms of hardware certification policy is: Is this change
> acceptable across all arches/systems?   Or, should we make this arch-specific
> to s390x, and even RHEL6+ specific?    

It's generic for all arches.

The difference between RHEL5 and RHEL6 is in how glibc implements
_SC_NPROCESSORS_CONF:

  - RHEL5 uses /proc/stat
  - RHEL6 uses /sys/devices/system/cpu/cpuN

The problem is that /proc/{stat,cpuinfo} contain on-line CPU(s) only. This means that RHEL5 glibc returns the same number for _SC_NPROCESSORS_CONF and _SC_NPROCESSORS_ONLN (fortunately this glibc bug is fixed in RHEL6). The correct behaviour is to use /sys/devices/system/cpu to get the number of configured CPUs.

So the bug in your test was invisible on RHEL5.

See (RHEL5, 4 CPUs, 2nd CPU is offline):

# grep -c process /proc/cpuinfo 
3

# grep -c cpu[[:digit:]] /proc/stat
3

# ls -d  /sys/devices/system/cpu/cpu[0-9] | grep -c cpu[[:digit:]]
4

# rpm -q glibc
glibc-2.5-49.el5_5.4

# uname -a
Linux x86-64-5s-m1.ss.eng.bos.redhat.com 2.6.18-194.8.1.el5xen #1 SMP Wed Jun 23 11:01:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
Comment 19 Karel Zak 2010-08-11 10:45:44 EDT
And yes, independently of the RHEL version, the test should use _SC_NPROCESSORS_ONLN. It's not possible to call sched_setaffinity() for off-line or unavailable CPU(s).
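
To illustrate the point about off-line CPUs, here is a hedged sketch of one way to pin to CPUs without hitting EINVAL. It is not what clocktest does. It walks the mask returned by sched_getaffinity() instead of counting from 0 to N-1, on the assumption that this mask only contains CPUs the process may currently run on. The function name pin_to_each_allowed_cpu() is invented for the example, and the fixed-size cpu_set_t limits it to systems with at most CPU_SETSIZE CPUs.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling process to each CPU it is currently allowed to run on,
 * one CPU at a time, then restore the original affinity mask.
 * Returns the number of CPUs that could not be pinned, or -1 if the
 * original mask could not be read or restored. */
int pin_to_each_allowed_cpu(void)
{
    cpu_set_t allowed, one;
    int failures = 0;

    if (sched_getaffinity(0, sizeof allowed, &allowed) != 0) {
        perror("sched_getaffinity");
        return -1;
    }

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (!CPU_ISSET(cpu, &allowed))
            continue;   /* skipped: asking for this CPU alone could give EINVAL */
        CPU_ZERO(&one);
        CPU_SET(cpu, &one);
        if (sched_setaffinity(0, sizeof one, &one) != 0) {
            perror("sched_setaffinity");
            failures++;
        }
        /* a per-CPU measurement would go here */
    }

    /* Put the process back on its original set of CPUs. */
    if (sched_setaffinity(0, sizeof allowed, &allowed) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return failures;
}
```

Note that even the number of online CPUs does not guarantee that CPU ids 0..N-1 are the online ones (for example, with 4 CPUs and the 2nd offline, the online ids would be 0, 2, 3), which is why this sketch iterates over the mask rather than over a count.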
Comment 20 Greg Nichols 2010-08-11 13:22:42 EDT
Created attachment 438238
clocktest.c patch to use _SC_NPROCESSORS_ONLN

The patch also prints a "Warning:" if the number of cpus online differs from the number of cpus configured.
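
The attached patch itself is not reproduced in this report. As a rough sketch of the kind of change described, assuming nothing about the actual clocktest.c code, the selection of the CPU count might look like the following. The function name cpus_to_test() is hypothetical. The warning is written to stdout here because the v7 harness in this report flags any output on stderr as an error; which stream the real patch uses is not stated.

```c
#include <stdio.h>
#include <unistd.h>

/* Return the number of CPUs the test should exercise: the online ones.
 * Prints a warning when the configured count differs.
 * Returns -1 if the online count cannot be determined. */
long cpus_to_test(void)
{
    long configured = sysconf(_SC_NPROCESSORS_CONF);
    long online = sysconf(_SC_NPROCESSORS_ONLN);

    if (online < 1)
        return -1;
    if (configured != online)
        printf("Warning: %ld cpus configured, but %ld cpus online\n",
               configured, online);
    return online;
}
```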
Comment 23 qcui 2010-08-24 06:58:20 EDT
Re-ran the core test on an s390x server with RHEL6.0-20100822.n.0 and v7-1.2-20. It failed with a new error: '"stress --cpu 12 --io 12 --vm 12 --vm-bytes 128M --timeout 10m" has output on stderr'.
Comment 24 qcui 2010-08-24 07:01:56 EDT
Created attachment 440620
output.log of core test on s390x RHEL6 with v7-1.2-20
Comment 25 Greg Nichols 2010-08-24 09:50:23 EDT
I'd like to keep this bug on the _SC_NPROCESSORS_*/affinity issue. From the above log, it looks as though you've verified this fix, as the stress portion of the test follows successful completion of the clock tests.

Bug 623787 will track s390x core/stress hangs and errors.
Comment 26 qcui 2010-08-25 01:43:49 EDT
Verified the clocktest in R20.el6.
Comment 30 errata-xmlrpc 2010-09-20 08:12:50 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0702.html
Comment 31 Jaromir Hradilek 2010-09-21 05:19:04 EDT
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
The V7-1.2-14 core test no longer fails on IBM System z Red Hat Enterprise Linux 6 and 64-bit PowerPC Red Hat Enterprise Linux 6.
