Bug 732998

Summary: test_io failure during build on PPC
Product: [Fedora] Fedora Reporter: Karsten Hopp <karsten>
Component: pythonAssignee: Dave Malcolm <dmalcolm>
Status: CLOSED CURRENTRELEASE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high Docs Contact:
Priority: high    
Version: 16CC: dmalcolm, ivazqueznet, jonathansteffan
Target Milestone: ---   
Target Release: ---   
Hardware: ppc64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-10-28 10:22:57 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 718269    
Attachments:
Description Flags
Log from test run, with verbosity turned up none

Description Karsten Hopp 2011-08-24 12:24:34 UTC
Description of problem:
python and python3 both fail during the test_io self check:

python-2.7.2-4.1.fc16:
test test_io failed -- multiple errors occurred; run in verbose mode for details
Re-running test 'test_io' in verbose mode
test_BufferedIOBase_destructor (test.test_io.CIOTest) ... ok
test_IOBase_destructor (test.test_io.CIOTest) ... ok
test_RawIOBase_destructor (test.test_io.CIOTest) ... ok
test_RawIOBase_read (test.test_io.CIOTest) ... ok
test_TextIOBase_destructor (test.test_io.CIOTest) ... ok
test_append_mode_tell (test.test_io.CIOTest) ... ok
test_array_writes (test.test_io.CIOTest) ... ok
test_buffered_file_io (test.test_io.CIOTest) ... ok
test_close_flushes (test.test_io.CIOTest) ... ok
test_closefd (test.test_io.CIOTest) ... ok
test_closefd_attr (test.test_io.CIOTest) ... ok
test_destructor (test.test_io.CIOTest) ... make: *** [test] Alarm clock

python3-3.2.1-2.1.fc16:
test test_io failed -- multiple errors occurred; run in verbose mode for details
[153/346/1] test_ioctl
test_ioctl skipped -- Neither the process group nor the session are attached to /dev/tty
[154/346/1] test_isinstance
[155/346/1] test_iter
[156/346/1] test_iterlen
[157/346/1] test_itertools
/var/tmp/rpm-tmp.gab4vH: line 32:  3533 Alarm clock             LD_LIBRARY_PATH=$ConfDir $ConfDir/python -m test.regrtest -x $EXCLUDED_TESTS

  
Actual results:
http://ppc.koji.fedoraproject.org/koji/taskinfo?taskID=278191
http://ppc.koji.fedoraproject.org/koji/taskinfo?taskID=278193

Comment 1 Dave Malcolm 2011-08-31 19:24:35 UTC
http://ppc.koji.fedoraproject.org/koji/taskinfo?taskID=278191
  python-2.7.2-4.1.fc16 failed on ppc64:
    http://ppc.koji.fedoraproject.org/koji/getfile?taskID=278200&name=build.log
  which shows:
    test_destructor (test.test_io.CIOTest) ... make: *** [test] Alarm clock
  which appears to be a timeout, due to an unhandled SIGALRM.

http://ppc.koji.fedoraproject.org/koji/taskinfo?taskID=278193
  python3-3.2.1-2.1.fc16 failed on ppc:
    http://ppc.koji.fedoraproject.org/koji/getfile?taskID=278201&name=build.log
  which shows:
    [152/346] test_io
    test test_io failed -- multiple errors occurred; run in verbose mode for details
    ... snip ..
    [157/346/1] test_itertools
    /var/tmp/rpm-tmp.gab4vH: line 32:  3533 Alarm clock             LD_LIBRARY_PATH=$ConfDir $ConfDir/python -m test.regrtest -x $EXCLUDED_TESTS
  which again appears to be a timeout; this is happening whilst test_itertools is running, not test_io.

So there appear to be two things going on here:
  (i) a genuine failure seen within test_io, on ppc, at least
  (ii) timeouts of the build (possibly a stuck test, but I'm not convinced about that)

I'm investigating (i), and I've added --verbose3 to the invocation in python3 in rawhide to better identify failing tests in future.

As for (ii), is there anything different about the ppc.koji.fedoraproject.org that might be setting up timeouts?

I'm able to reliably run test_io on python on the ppc64 test chroot given at 
  https://bugzilla.redhat.com/show_bug.cgi?id=698726#c7
which makes me think that perhaps there's some kind of external timer killing the build.

Comment 2 Dave Malcolm 2011-08-31 23:04:26 UTC
(In reply to comment #1)
> http://ppc.koji.fedoraproject.org/koji/taskinfo?taskID=278191
>   python-2.7.2-4.1.fc16 failed on ppc64:
>     http://ppc.koji.fedoraproject.org/koji/getfile?taskID=278200&name=build.log
>   which shows:
>     test_destructor (test.test_io.CIOTest) ... make: *** [test] Alarm clock
>   which appears to be a timeout, due to an unhandled SIGALRM.

Looks like I may have snipped too much here, a test _is_ failing with test_io for python, but it looks like the SIGALRM kills it.
test_io
/builddir/build/BUILD/Python-2.7.2/Lib/test/test_io.py:2613: DeprecationWarning: classic int division
  1/0
test test_io failed -- multiple errors occurred; run in verbose mode for details

I tried a scratch build of python3 for ppc with verbosity turned up:
  http://ppc.koji.fedoraproject.org/koji/taskinfo?taskID=280329

This shows 4 specific errors inside test_io: the 4 variants of check_interrupted_write_retry, which was introduced in:
  http://hg.python.org/cpython/rev/675b48c4c88a
in response to:
  http://bugs.python.org/issue10956

Those tests (and others in that file) are indeed wiring up handlers to SIGALRM, and calling
  signal.alarm()
which could explain why we're seeing the default glibc SIGALRM handler in the test logs ("Alarm clock")

I'm not sure exactly what the issue is, though.

Comment 3 Dave Malcolm 2011-08-31 23:40:28 UTC
I ran another scratch build, this time of python, with --verbose
$ koji -s http://ppc.koji.fedoraproject.org/kojihub --weburl=http://ppc.koji.fedoraproject.org/koji build dist-f16 --scratch python-2.7.2-8.with_verbose_tests.fc17.src.rpm 
Task info: http://ppc.koji.fedoraproject.org/koji/taskinfo?taskID=280359

This shows the same 4 tests failing as on the python3 scratch build in comment #2:
  http://ppc.koji.fedoraproject.org/koji/getfile?taskID=280360&name=build.log
which is followed some time later by a SIGALRM - apparently during test_iterlen, though I suspect that some of the SIGALRM handling from test_io is "leaking", and killing that test.

Comment 4 Dave Malcolm 2011-09-01 00:02:57 UTC
The specfile is testing the debug configuration, then would test the optimized build, but is erroring out on the failures on the debug configuration.

I tried a scratch build without the debug configuration, just the optimized one:
$ koji -s http://ppc.koji.fedoraproject.org/kojihub --weburl=http://ppc.koji.fedoraproject.org/koji build dist-f16 --scratch python-2.7.2-8.with_verbose_tests.without_debug_build.fc17.src.rpm 
Task info: http://ppc.koji.fedoraproject.org/koji/taskinfo?taskID=280362

The same 4 tests within test_io failed with an optimized build:
  http://ppc.koji.fedoraproject.org/koji/getfile?taskID=280364&name=build.log

[This leads me to be reassured that the problem is not due to the patch for bug 698726)

Comment 5 Dave Malcolm 2011-09-01 15:31:39 UTC
On the grounds that something else might be broken, I tried a scratch build with test_io disabled.

It successfully ran to completion on both ppc/ppc64:
  python-2.7.2-8.with_verbose_tests.without_debug_build.without_test_io.fc17.src.rpm
  Task info: http://ppc.koji.fedoraproject.org/koji/taskinfo?taskID=280365

I then re-enabled test_io, and added some debug logging.
  python-2.7.2-8.with_verbose_tests.without_debug_build.with_verbose_test_io.2.fc17.src.rpm
  Task info: http://ppc.koji.fedoraproject.org/koji/taskinfo?taskID=280371

It failed, logging some more information on the failures.

All four failing tests are variants of:
   check_interrupted_write_retry
with slightly different parameters, and with Python vs C implementations of the underlying io library.

This test case sets a callback to respond to SIGALRM, calls signal.alarm(1), and writes 1MB of data to a pipe.  AIUI, PIPE_MAX is 64k on Linux, so the pipe ought to fill, and the writing call ought to block, leading the SIGALRM to be delivered about 1 second later.

The test case is written to respond to the SIGALRM by triggering another SIGALRM 1 second later, which it responds to by starting a reader thread, which reads from the other end of the pipe.

However, what appears to be happening is that the first SIGALRM handler is never called; the write call runs to completion, and claims it wrote 1MB of data (this assumes that I wrote my logging correctly).

The test case then tracebacks when it waits for the reader thread to finish, because the reader thread was never started.

Comment 6 Dave Malcolm 2011-09-01 22:27:17 UTC
Created attachment 521111 [details]
Log from test run, with verbosity turned up

I tried another version, with a patch to turn up the verbosity, and to only join() if the other thread was started, which works OK (albeit very verbosely) on my local builds:
python-2.7.2-8.with_verbose_tests.without_debug_build.with_verbose_test_io.4.fc17.src.rpm
Task info: http://ppc.koji.fedoraproject.org/koji/taskinfo?taskID=280422

Am attaching the log from the above, showing that indeed no data has been read from the pipes.

Comment 7 Dave Malcolm 2011-09-01 22:27:47 UTC
In order to have an up-to-date ppc build, I'm applying a patch on ppc/ppc64 which disables those 4 test_io tests.

Successful scratch build:
  http://ppc.koji.fedoraproject.org/koji/taskinfo?taskID=280473

Committed to "master" for python Rawhide/f17 as:
  http://pkgs.fedoraproject.org/gitweb/?p=python.git;a=commitdiff;h=e252525e39cf5e0755986490d367ddcf79430875
  Building python-2.7.2-9.fc17 for dist-rawhide
  Task info: http://koji.fedoraproject.org/koji/taskinfo?taskID=3319638 (succeeded)

Committed to "f16" for python as:
  http://pkgs.fedoraproject.org/gitweb/?p=python.git;a=commit;h=1e06160ce665a8500d35e8f7e736793187157e52
  Building python-2.7.2-4.2.fc16 for f16-candidate
  Task info: http://koji.fedoraproject.org/koji/taskinfo?taskID=3319863

Comment 8 Dave Malcolm 2011-09-01 22:50:47 UTC
Committed to "master" for python3 Rawhide/f17 as:
  http://pkgs.fedoraproject.org/gitweb/?p=python3.git;a=commitdiff;h=28391cb8f7ca7456115eb2a853f0b57775e7d1eb
  Building python3-3.2.1-7.fc17 for dist-rawhide
  Task info: http://koji.fedoraproject.org/koji/taskinfo?taskID=3319885

Merged to "f16" for python3 as:
  http://pkgs.fedoraproject.org/gitweb/?p=python3.git;a=commit;h=72dcf3b6227ffeb687a202024d0d7749c1c5ea11
  Building python3-3.2.1-2.2.fc16 for f16-candidate
  Task info: http://koji.fedoraproject.org/koji/taskinfo?taskID=3319887

Comment 9 Karsten Hopp 2011-09-02 10:45:52 UTC
python-2.7.2-4.2.fc16 and python3-3.2.1-2.2.fc16 built in F16 on PPC.
http://ppc.koji.fedoraproject.org/mash/branched-20110902/ already includes these packages.

Thanks Dave !
I'll leave this bugzilla open just in case you're still looking into the test_io failures. Feel free to close it if you're sure that the testcases were borked and it wasn't an issue in python itself.

Comment 10 Dave Malcolm 2011-09-02 14:12:28 UTC
Marking as MODIFIED to note that we have up-to-date python and python3 builds with a workaround - albeit without a full analysis/fix of the issue.