Bug 621041

Summary: core test failed on s390x RHEL6 - sched_setaffinity: Invalid argument
Product: [Retired] Red Hat Hardware Certification Program
Reporter: qcui
Component: Test Suite (tests)
Assignee: Greg Nichols <gnichols>
Status: CLOSED ERRATA
QA Contact: Qian Cai <qcai>
Severity: high
Docs Contact:
Priority: urgent
Version: 1.2
CC: kzak, nobody+295318, qcai, rlandry, ykun
Target Milestone: ---
Keywords: Regression
Target Release: ---
Hardware: s390x   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
The V7-1.2-14 core test no longer fails on IBM System z Red Hat Enterprise Linux 6 and 64-bit PowerPC Red Hat Enterprise Linux 6.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-09-20 12:12:50 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description                                              Flags
output.log of core test on ppc64 RHEL6                   none
output.log of core test on s390x RHEL6                   none
clocktest strace                                         none
sosreport for ibm-z10-15                                 none
clocktest.c patch to use _SC_NPROCESSORS_ONLN            none
output.log of core test on s390x RHEL6 with v7-1.2-20    none

Description qcui 2010-08-04 05:56:21 UTC
Created attachment 436438 [details]
output.log of core test on ppc64 RHEL6

Description of problem:
The V7-1.2-14 core test fails with 'Error:"./CORE2" has output on stderr' on both s390x RHEL6 and ppc64 RHEL6, but it runs successfully on s390x RHEL5.5 and ppc64 RHEL5.5.


Version-Release number of selected component (if applicable):
[root@ibm-js12-vios-01-lp3 ~]# uname -a
Linux ibm-js12-vios-01-lp3.rhts.eng.bos.redhat.com 2.6.32-54.el6.ppc64 #1 SMP Tue Jul 27 23:45:44 EDT 2010 ppc64 ppc64 ppc64 GNU/Linux
[root@ibm-js12-vios-01-lp3 ~]# v7 version
V7 version 1.2, release 14


How reproducible:
Every time


Steps to Reproduce:
1. Install v7-1.2-14
2. # v7 run --test core

  
Actual results:
Fail


Expected results:
Pass

Comment 1 qcui 2010-08-04 05:57:34 UTC
Created attachment 436439 [details]
output.log of core test on s390x RHEL6

Comment 2 Greg Nichols 2010-08-04 19:45:51 UTC
Could not reproduce this on RHEL6 snapshot 7, v7 1.2 R14, ppc64

Clock Info: ------------------------------------------
kernel: clocksource: timebase mult[7d0000] shift[22] registered
kernel: Switching to clocksource timebase

Clock Source per system log: timebase
Clock Source in /sys/devices/system/clocksource/clocksource*/current_clocksource: timebase

Running clock tests
Testing for clock jitter on 8 cpus
PASSED, largest jitter seen was 0.000290
clock direction test: start time 1280950497, stop time 1280950557, sleeptime 60, delta 0
PASSED



What RHEL6 build was used?

Comment 3 qcui 2010-08-05 02:40:36 UTC
(In reply to comment #2)
> Could not reproduce this on RHEL6 snapshot 7, v7 1.2 R14, ppc64
> 
> Clock Info: ------------------------------------------
> kernel: clocksource: timebase mult[7d0000] shift[22] registered
> kernel: Switching to clocksource timebase
> 
> Clock Source per system log: timebase
> Clock Source in
> /sys/devices/system/clocksource/clocksource*/current_clocksource: timebase
> 
> Running clock tests
> Testing for clock jitter on 8 cpus
> PASSED, largest jitter seen was 0.000290
> clock direction test: start time 1280950497, stop time 1280950557, sleeptime
> 60, delta 0
> PASSED
> 
> 
> 
> What RHEL6 build was used?    

RHEL6 snapshot 8

Comment 4 Greg Nichols 2010-08-05 13:56:18 UTC
Reproduced on s390x, RHEL6 snapshot 7, not only via v7's core test but also by running clocktest directly:

[root@ibm-z10-09 core]# ./clocktest
Testing for clock jitter on 2 cpus
sched_setaffinity: Invalid argument

Comment 5 Greg Nichols 2010-08-05 14:12:43 UTC
It seems something is amiss - the cpu mask coming back looks broken:

./clocktest
Testing for clock jitter on 2 cpus
cpumask = ffec2198
cpu = 0
cpumask = ffec2198
cpumask = ffec2198
cpu = 1
cpumask = ffec2198
sched_setaffinity: Invalid argument

Comment 6 Greg Nichols 2010-08-05 15:42:49 UTC
Taking ppc64 off the summary - the bug is #621348 for ppc64 - the "tree" rpm needs to be installed.

Comment 14 Greg Nichols 2010-08-10 14:44:59 UTC
Created attachment 437907 [details]
clocktest strace

Comment 15 Greg Nichols 2010-08-10 14:49:34 UTC
Created attachment 437910 [details]
sosreport for ibm-z10-15

Comment 16 Karel Zak 2010-08-11 09:09:20 UTC
sched_setaffinity() returns EINVAL. From the man page:

EINVAL The affinity bit mask mask contains no processors that are currently physically on the system and permitted to the process according to any restrictions that may be imposed by the "cpuset" mechanism described in cpuset(7).


Greg, how many CPUs does the machine have? It's necessary to distinguish between configured and online CPUs.

The sosreport (comment #15) contains only one cpu in proc/cpuinfo.

The sysconf(_SC_NPROCESSORS_CONF) call used in the test checks for "cpuN" directories in /sys/devices/system/cpu/. This means it returns the number of "present" cpus.

Please, check /sys/devices/system/cpu/online and /sys/devices/system/cpu/present.

I think it would be better to use _SC_NPROCESSORS_ONLN in the test.

Comment 18 Karel Zak 2010-08-11 14:41:43 UTC
(In reply to comment #17)
> My question in terms of hardware certification policy is: Is this change
> acceptable across all arches/systems?   Or, should we make this arch-specific
> to s390x, and even RHEL6+ specific?    

It's generic for all arches.

The difference between RHEL5 and RHEL6 is in how glibc implements
_SC_NPROCESSORS_CONF:

  - RHEL5 uses /proc/stat
  - RHEL6 uses /sys/devices/system/cpu/cpuN

The problem is that /proc/{stat,cpuinfo} contain online CPUs only, so RHEL5 glibc returns the same number for _SC_NPROCESSORS_CONF and _SC_NPROCESSORS_ONLN (fortunately, this glibc bug is fixed in RHEL6). The correct behaviour is to use /sys/devices/system/cpu to get the number of configured CPUs.

So the bug in your test was invisible on RHEL5.

See (RHEL5, 4 CPUs, 2nd CPU is offline):

# grep -c process /proc/cpuinfo 
3

# grep -c cpu[[:digit:]] /proc/stat
3

# ls -d  /sys/devices/system/cpu/cpu[0-9] | grep -c cpu[[:digit:]]
4

# rpm -q glibc
glibc-2.5-49.el5_5.4

# uname -a
Linux x86-64-5s-m1.ss.eng.bos.redhat.com 2.6.18-194.8.1.el5xen #1 SMP Wed Jun 23 11:01:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

Comment 19 Karel Zak 2010-08-11 14:45:44 UTC
And yes, independent of the RHEL version, the test should use _SC_NPROCESSORS_ONLN. It's not possible to call sched_setaffinity() for offline or unavailable CPU(s).

Comment 20 Greg Nichols 2010-08-11 17:22:42 UTC
Created attachment 438238 [details]
clocktest.c patch to use _SC_NPROCESSORS_ONLN

The patch also prints a "Warning:" if the number of cpus online differs from the number configured.

Comment 23 qcui 2010-08-24 10:58:20 UTC
Re-ran the core test on an s390x server with RHEL6.0-20100822.n.0 and v7-1.2-20. It failed with a new error: '"stress --cpu 12 --io 12 --vm 12 --vm-bytes 128M --timeout 10m" has output on stderr'.

Comment 24 qcui 2010-08-24 11:01:56 UTC
Created attachment 440620 [details]
output.log of core test on s390x RHEL6 with v7-1.2-20

Comment 25 Greg Nichols 2010-08-24 13:50:23 UTC
I'd like to keep this bug on the _SC_NPROCESSORS_*/affinity issue. From the above log, it looks as though you've verified this fix, as the stress portion of the test follows successful completion of the clock tests.

Bug 623787 will track s390x core/stress hangs and errors.

Comment 26 qcui 2010-08-25 05:43:49 UTC
Verified the clocktest in R20.el6.

Comment 30 errata-xmlrpc 2010-09-20 12:12:50 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0702.html

Comment 31 Jaromir Hradilek 2010-09-21 09:19:04 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
The V7-1.2-14 core test no longer fails on IBM System z Red Hat Enterprise Linux 6 and 64-bit PowerPC Red Hat Enterprise Linux 6.