Bug 621041
| Summary: | core test failed on s390x RHEL6 - sched_setaffinity: Invalid argument | | |
|---|---|---|---|
| Product: | [Retired] Red Hat Hardware Certification Program | Reporter: | qcui |
| Component: | Test Suite (tests) | Assignee: | Greg Nichols <gnichols> |
| Status: | CLOSED ERRATA | QA Contact: | Qian Cai <qcai> |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 1.2 | CC: | kzak, nobody+295318, qcai, rlandry, ykun |
| Target Milestone: | --- | Keywords: | Regression |
| Target Release: | --- | | |
| Hardware: | s390x | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | The V7-1.2-14 core test no longer fails on IBM System z Red Hat Enterprise Linux 6 and 64-bit PowerPC Red Hat Enterprise Linux 6. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2010-09-20 12:12:50 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Created attachment 436439 [details]
output.log of core test on s390x RHEL6
Could not reproduce this on RHEL6 snapshot 7, v7 1.2 R14, ppc64

Clock Info:
------------------------------------------
kernel: clocksource: timebase mult[7d0000] shift[22] registered
kernel: Switching to clocksource timebase

Clock Source per system log: timebase
Clock Source in /sys/devices/system/clocksource/clocksource*/current_clocksource: timebase

Running clock tests
Testing for clock jitter on 8 cpus
PASSED, largest jitter seen was 0.000290
clock direction test: start time 1280950497, stop time 1280950557, sleeptime 60, delta 0
PASSED

What RHEL6 build was used?

(In reply to comment #2)
> Could not reproduce this on RHEL6 snapshot 7, v7 1.2 R14, ppc64
>
> Clock Info: ------------------------------------------
> kernel: clocksource: timebase mult[7d0000] shift[22] registered
> kernel: Switching to clocksource timebase
>
> Clock Source per system log: timebase
> Clock Source in
> /sys/devices/system/clocksource/clocksource*/current_clocksource: timebase
>
> Running clock tests
> Testing for clock jitter on 8 cpus
> PASSED, largest jitter seen was 0.000290
> clock direction test: start time 1280950497, stop time 1280950557, sleeptime
> 60, delta 0
> PASSED
>
> What RHEL6 build was used?

RHEL6 snapshot 8

Reproduced on s390x, RHEL6 snapshot 7, not only via v7's core test, but also by running clocktest directly:

[root@ibm-z10-09 core]# ./clocktest
Testing for clock jitter on 2 cpus
sched_setaffinity: Invalid argument

It seems something is amiss - the cpu mask coming back looks broken:

./clocktest
Testing for clock jitter on 2 cpus
cpumask = ffec2198
cpu = 0
cpumask = ffec2198
cpumask = ffec2198
cpu = 1
cpumask = ffec2198
sched_setaffinity: Invalid argument

Taking ppc64 off the summary - the bug is #621348 for ppc64 - the "tree" rpm needs to be installed.

Created attachment 437907 [details]
clocktest strace
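The cpumask debug output quoted above prints a single hex value, which is hard to interpret. The following is only a sketch of a more readable alternative, not code from clocktest.c: the helper name `format_affinity` is hypothetical, and it assumes glibc's `sched_getaffinity()` and the `CPU_ISSET` macro, which require `_GNU_SOURCE`.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Hypothetical debug helper (not part of clocktest.c): write the CPU ids in
 * the calling process's affinity mask into buf as a space-separated list,
 * for example "0 1". Returns the number of CPU ids written, or -1 if the
 * buffer is empty or sched_getaffinity() fails. */
int format_affinity(char *buf, size_t len)
{
    cpu_set_t mask;
    size_t used = 0;
    int count = 0;

    if (len == 0 || sched_getaffinity(0, sizeof(mask), &mask) != 0)
        return -1;
    buf[0] = '\0';

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (!CPU_ISSET(cpu, &mask))
            continue;
        int n = snprintf(buf + used, len - used, "%s%d",
                         count ? " " : "", cpu);
        if (n < 0 || (size_t)n >= len - used) {
            buf[used] = '\0';   /* drop the partial entry and stop */
            break;
        }
        used += (size_t)n;
        count++;
    }
    return count;
}
```

A caller could print the result with something like `printf("cpumask: %s\n", buf)` before each `sched_setaffinity()` call, which makes it easier to see whether the mask names a CPU the process is actually allowed to run on.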
Created attachment 437910 [details]
sosreport for ibm-z10-15
The sched_setaffinity() returns EINVAL, man page:

EINVAL The affinity bit mask mask contains no processors that are currently physically on the system and permitted to the process according to any restrictions that may be imposed by the "cpuset" mechanism described in cpuset(7).

Greg, how many CPUs does the machine have? It's necessary to distinguish between configured and online CPUs. The sosreport (comment #15) contains only one cpu in proc/cpuinfo.

The sysconf(_SC_NPROCESSORS_CONF) call used in the test checks for "cpuN" directories in /sys/devices/system/cpu/. That means it returns the number of "present" cpus. Please check /sys/devices/system/cpu/online and /sys/devices/system/cpu/present.

I think it would be better to use _SC_NPROCESSORS_ONLN in the test.

(In reply to comment #17)
> My question in terms of hardware certification policy is: Is this change
> acceptable across all arches/systems? Or, should we make this arch-specific
> to s390x, and even RHEL6+ specific?

It's generic for all arches. The difference between RHEL5 and RHEL6 is in how glibc implements _SC_NPROCESSORS_CONF:

- RHEL5 uses /proc/stat
- RHEL6 uses /sys/devices/system/cpu/cpuN

The problem is that /proc/{stat,cpuinfo} contains on-line CPU(s) only. It means that RHEL5 glibc returns the same number for _SC_NPROCESSORS_CONF and _SC_NPROCESSORS_ONLN (fortunately this glibc bug is fixed in RHEL6). The correct behaviour is to use /sys/devices/system/cpu to get the number of configured CPUs.

So the bug in your test was invisible on RHEL5. See (RHEL5, 4 CPUs, 2nd CPU is offline):

# grep -c process /proc/cpuinfo
3
# grep -c cpu[[:digit:]] /proc/stat
3
# ls -d /sys/devices/system/cpu/cpu[0-9] | grep -c cpu[[:digit:]]
4
# rpm -q glibc
glibc-2.5-49.el5_5.4
# uname -a
Linux x86-64-5s-m1.ss.eng.bos.redhat.com 2.6.18-194.8.1.el5xen #1 SMP Wed Jun 23 11:01:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

And yes, independently of the RHEL version, the test should use _SC_NPROCESSORS_ONLN. It's not possible to call sched_setaffinity() for off-line or unavailable CPU(s).

Created attachment 438238 [details]
clocktest.c patch to use _SC_NPROCESSORS_ONLN
Also, prints a "Warning:" if the cpus online differ from the cpus configured.
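
The attached patch itself is not reproduced in this report. As a rough sketch of the approach described above, the online count can be preferred over the configured count, with a warning when the two differ. The helper names `cpus_to_test` and `pin_to_cpu` and the exact warning wording are assumptions made for illustration and are not taken from clocktest.c.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: number of CPUs a jitter test should iterate over.
 * Uses the online count and warns (illustrative wording) when it differs
 * from the configured count. Returns -1 if sysconf() reports no CPUs. */
long cpus_to_test(void)
{
    long configured = sysconf(_SC_NPROCESSORS_CONF);
    long online = sysconf(_SC_NPROCESSORS_ONLN);

    if (online < 1)
        return -1;
    if (online != configured)
        fprintf(stderr, "Warning: %ld cpus online, %ld cpus configured\n",
                online, configured);
    return online;
}

/* Hypothetical helper: pin the calling process to a single CPU.
 * Returns 0 on success, or -1 with errno set. sched_setaffinity() itself
 * reports EINVAL when the mask names no usable CPU. */
int pin_to_cpu(int cpu)
{
    cpu_set_t mask;

    if (cpu < 0 || cpu >= CPU_SETSIZE) {
        errno = EINVAL;         /* outside the range cpu_set_t can describe */
        return -1;
    }
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    return sched_setaffinity(0, sizeof(mask), &mask);
}
```

A caller would loop over `cpu = 0 .. cpus_to_test() - 1`, call `pin_to_cpu(cpu)`, and report failures with `perror("sched_setaffinity")`. One caveat with any count-based loop: online CPU ids need not be contiguous (with CPU 1 of 4 offline, the online ids are 0, 2 and 3), so iterating over the mask returned by sched_getaffinity() is a more defensive option.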
Re-ran the core test on an s390x server with RHEL6.0-20100822.n.0 and v7-1.2-20. It failed with a new error: '"stress --cpu 12 --io 12 --vm 12 --vm-bytes 128M --timeout 10m" has output on stderr'.

Created attachment 440620 [details]
output.log of core test on s390x RHEL6 with v7-1.2-20
I'd like to keep this bug on the SC_NPROCESSORS_/affinity issue. From the above log, it looks as though you've verified this fix, as the stress portion of the test follows successful completion of the clock tests. Bug 623787 will track s390x core/stress hangs and errors.

Verified the clocktest in R20.el6.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0702.html

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
The V7-1.2-14 core test no longer fails on IBM System z Red Hat Enterprise Linux 6 and 64-bit PowerPC Red Hat Enterprise Linux 6.
Created attachment 436438 [details]
output.log of core test on ppc64 RHEL6

Description of problem:
The V7-1.2-14 core test failed to run with 'Error:"./CORE2" has output on stderr' on both s390x RHEL6 and ppc64 RHEL6. However, it ran successfully on s390x RHEL5.5 and ppc64 RHEL5.5.

Version-Release number of selected component (if applicable):
[root@ibm-js12-vios-01-lp3 ~]# uname -a
Linux ibm-js12-vios-01-lp3.rhts.eng.bos.redhat.com 2.6.32-54.el6.ppc64 #1 SMP Tue Jul 27 23:45:44 EDT 2010 ppc64 ppc64 ppc64 GNU/Linux
[root@ibm-js12-vios-01-lp3 ~]# v7 version
V7 version 1.2, release 14

How reproducible:
Every time

Steps to Reproduce:
1. Install v7-1.2-14
2. # v7 run --test core

Actual results:
Fail

Expected results:
Pass