Bug 773116 - mysql 5.5 fails to build on ARM
Summary: mysql 5.5 fails to build on ARM
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: mysql
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Tom Lane
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On: 741325
Blocks: ARMTracker
Reported: 2012-01-11 00:02 UTC by Peter Robinson
Modified: 2012-03-05 15:36 UTC

Fixed In Version: mysql-5.5.21-1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-03-05 15:36:05 UTC
Type: ---
Embargoed:


Attachments
Remove tests that fail on ARM due to lack of support for high resolution timers. (10.18 KB, patch)
2012-02-06 19:55 UTC, D. Marlin
patch for spec file (5.38 KB, patch)
2012-02-08 17:14 UTC, Honza Horak
patch for spec file (6.24 KB, patch)
2012-02-10 12:07 UTC, Honza Horak

Description Peter Robinson 2012-01-11 00:02:49 UTC
I've tried various releases of 5.5, the latest being 5.5.19

Full build logs:
http://arm.koji.fedoraproject.org/koji/taskinfo?taskID=253249

5.1 used to build OK.

Comment 1 Tom Lane 2012-01-11 02:49:28 UTC
Those regression-test symptoms look familiar ... I think what you've got there is lack of high-precision timer support.  I got similar failures on s390 until I patched my_rdtsc.h/.c to support that platform (see  mysql-s390-tsc.patch in our git tree).  I have neither the expertise nor the interest to develop a patch for ARM, but if you want to ...

Comment 2 Dan Horák 2012-01-11 18:01:26 UTC
IIRC I took the timestamp code for s390 from the kernel, I suppose an ARM implementation could be found there too.

Comment 3 Tom Lane 2012-01-17 16:13:11 UTC
A half-baked answer just to get the package built would be to disable those specific regression tests on ARM.  But that's a band-aid, not a fix, of course.

Comment 4 Honza Horak 2012-01-20 10:24:07 UTC
It seems there is no ARM kernel with performance counter support available yet; see bug#741325.

Comment 5 Honza Horak 2012-01-24 16:22:09 UTC
Failing tests:
mysql-test/suite/perfschema/t/func_file_io.test
mysql-test/suite/perfschema/t/func_mutex.test
"my_rdtsc" from unittest/mysys/CMakeLists.txt
"pfs" from storage/perfschema/unittest/CMakeLists.txt

Comment 6 D. Marlin 2012-02-06 19:53:21 UTC
I was able to complete a scratch build of mysql-5.5.20-1.fc17 for ARM by removing two of the regression tests (perfschema.func_file_io, perfschema.func_mutex):

  http://arm.koji.fedoraproject.org/koji/taskinfo?taskID=331753

I added a patch (attached) to the spec file to remove the tests for ARM only,

  +Patch100: mysql-arm-remove-tests.patch

        :

  +%ifarch %{arm}
  +%patch100 -p1
  +%endif


so this should not affect other archs.

Comment 7 D. Marlin 2012-02-06 19:55:25 UTC
Created attachment 559742
Remove tests that fail on ARM due to lack of support for high resolution timers.

Comment 8 Tom Lane 2012-02-06 21:13:42 UTC
That's a pretty unmaintainable patch, as it will break anytime upstream changes anything in either of those regression tests.  Honza Horak and I have been looking into better ways of dealing with tests that need to be disabled.  The mysql-tests infrastructure has support for selectively disabling tests by name, but we've not yet identified the cleanest place to patch in arch-specific disables.

Comment 9 Honza Horak 2012-02-07 16:01:09 UTC
I liked the following solution until it turned out that it wouldn't work in this case (the reason is described below).

The MySQL test structure can mark some tests as "experimental tests", whose failure can be considered expected (see mysql-test/collections/README.experimental). It's possible to filter these tests by platform (windows, solaris, linux; the value is obtained with perl -e 'print($^O)'), but that is useless here, because Fedora on ARM identifies its operating system as 'linux', so it's not possible to distinguish between architectures.

We could patch mysql-test-run.pl so that it can filter tests by architecture (which might be acceptable for upstream) in the same way as it filters them by platform, and then use something like:

> perfschema.func_file_io @armv7hl  # perf counter doesn't work on armv7hl
> perfschema.func_mutex @armv7hl    # perf counter doesn't work on armv7hl
> perfschema.func_file_io @armv5tel # perf counter doesn't work on armv5tel
> perfschema.func_mutex @armv5tel   # perf counter doesn't work on armv5tel

But there is one problem: even if all failures of these tests are considered expected, mysql-test-run will still end with return value 1 (FAIL). The only difference is that the failed tests will be marked [exp-fail] instead of [fail]. So we would have to change mysql-test-run's behaviour to get a working solution this way, and that is probably not acceptable for upstream at all.

Here is a solution that is not perfect, but that I see as acceptable for now:

mysql-test-run has a "--skip-test-list=" argument. It accepts a file as its value; the file would then contain:

> # Tests failing on arm because performance counter support is missing
> perfschema.func_file_io :       # performance counter doesn't work on arm
> perfschema.func_mutex :         # performance counter doesn't work on arm

The "--skip-test-list" argument can be wrapped in "%ifarch armv5tel armv7hl", so the call can look like:

>     perl ./mysql-test-run.pl \
>             --force \
>             --retry=0 \
>             --ssl \
>             --mysqld=--binlog-format=mixed \
>             --suite-timeout=720 \
> %ifarch armv5tel armv7hl
>             --skip-test-list=%{SOURCE11} \
> %endif
>             --testcase-timeout=30
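
For a manual test run outside rpmbuild, where %ifarch is not available, the same selection could be sketched in shell. This is only an illustration: the skip_list_args helper and the arm-skipped-tests.list file name are made up here, and matching the output of uname -m against arm* is an assumption about how the ARM builders report their machine type.

```shell
#!/bin/sh
# Hypothetical helper: print the extra mysql-test-run.pl argument for ARM
# machines, and nothing for any other machine type.
#   $1 = machine name as reported by `uname -m`
#   $2 = path to the skip-list file
skip_list_args() {
    case "$1" in
        arm*) printf '%s\n' "--skip-test-list=$2" ;;
        *)    : ;;
    esac
}

# Hypothetical file name; the spec file refers to the list only as %{SOURCE11}.
extra=$(skip_list_args "$(uname -m)" arm-skipped-tests.list)

# Only print the resulting command; actually running it needs a MySQL source tree.
# $extra is left unquoted on purpose so that an empty value adds no argument.
echo perl ./mysql-test-run.pl --force --retry=0 --ssl \
    --mysqld=--binlog-format=mixed --suite-timeout=720 \
    $extra --testcase-timeout=30
```

On a non-ARM machine the printed command carries no --skip-test-list argument, which mirrors what the %ifarch block does at build time.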

Comment 10 Peter Robinson 2012-02-07 16:09:46 UTC
What I don't get is that this used to compile on F-14 ARM, AFAICT, without changes. Why is it now a problem?

Comment 11 Tom Lane 2012-02-07 16:27:43 UTC
(In reply to comment #10)
> What I don't get is that this used to compile on F-14 ARM, AFAICT, without changes.

Really?  The only way I can see that it'd have passed is if you were checking mysql 5.0.  Or maybe these particular regression tests got added recently?

Comment 12 Honza Horak 2012-02-07 16:30:11 UTC
(In reply to comment #11)
> (In reply to comment #10)
> > What I don't get is that this used to compile on F-14 ARM, AFAICT, without changes.
> 
> Really?  The only way I can see that it'd have passed is if you were checking
> mysql 5.0.  Or maybe these particular regression tests got added recently?

I guess the tests were disabled altogether when compiling for F14.

Comment 13 Tom Lane 2012-02-07 16:34:01 UTC
oh, no, I take that back: the performance schema was added in mysql 5.5, not 5.1.  So the short answer is that mysql 5.1 did not need a cycle-level timer and 5.5 does.

Comment 14 D. Marlin 2012-02-07 17:18:10 UTC
We could use:

  %ifarch %{arm} ...


to include all ARM variants.


So is the plan to include the above patch until performance counter support is fully functional in the ARM kernels?

Comment 15 Tom Lane 2012-02-07 17:28:50 UTC
I've asked Honza to rework the patch a bit to satisfy some packaging concerns, but yeah, what we intend is to disable these two tests on ARM until something happens on the timer-support front.

Comment 16 Honza Horak 2012-02-08 17:14:58 UTC
Created attachment 560319
patch for spec file

Using this patch, the rh-skipped-tests.list file will be created during the build and installed as mysql-tests/rh-skipped-tests.list. mysql-disable-test.patch is replaced, and mysql-tests/README is adjusted with a recommendation to always use the "--skip-test-list=rh-skipped-tests.list" option when running mysql-test-run.

I've started a scratch build on ARM to see if it works, but I don't expect it to finish before I leave the office today; I hope it succeeds:
http://arm.koji.fedoraproject.org/koji/taskinfo?taskID=358236

A small note about --skip-test-list: it turned out that tests listed in a file passed to mysql-test-run via the "--skip-test-list" argument do not appear in the output at all (labelling them [skipped] would be better). I hadn't realized that before, but I don't see it as a big problem, because these tests are always skipped deliberately, either by a user passing the argument or during the build.

Comment 17 Honza Horak 2012-02-08 21:25:10 UTC
(In reply to comment #16)
> I've started a scratch build on ARM to see if it works, but I don't expect it
> to finish before I leave the office today; I hope it succeeds:
> http://arm.koji.fedoraproject.org/koji/taskinfo?taskID=358236

Hm, it didn't succeed. I'm going to fix it as soon as I can, on Friday at the latest.

Comment 18 D. Marlin 2012-02-09 20:03:59 UTC
I noticed that in the arm.koji log for the mysql-5.5.20-1.fc17 build:

  http://arm.koji.fedoraproject.org/koji/getfile?taskID=300308&name=build.log

There were three test failures:

  sys_vars.query_cache_size_basic_32
  perfschema.func_file_io
  perfschema.func_mutex

but I only disabled the two perfschema tests in my previous patch.

Should sys_vars.query_cache_size_basic_32 be disabled for the same reason, or is that a different issue?

Comment 19 Tom Lane 2012-02-09 23:32:55 UTC
(In reply to comment #18)
> Should sys_vars.query_cache_size_basic_32 be disabled for the same reason, or
> is that a different issue?

No time to look at it right now, but that would be a different issue --- I'm pretty certain that such a test would not depend on the performance counters.  Can you get us the specific expected-vs-actual diff for that test?

Comment 20 Honza Horak 2012-02-10 12:07:03 UTC
Created attachment 560885
patch for spec file

This is a new patch that I hope will work; let's see how the scratch build goes:
http://arm.koji.fedoraproject.org/koji/taskinfo?taskID=372616

I've made some changes proposed by David (test list generation was moved into %prep; the individual test lists, with comments, are stored in external files to keep the .spec file cleaner, etc.)
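
Written as plain shell, the list generation described here might look roughly like the following. The build_skip_list helper, the file names, and the sample list entries are all hypothetical; the actual implementation is in the attached patch and may differ.

```shell
#!/bin/sh
# Hypothetical helper: assemble one skip list from separate list files.
#   $1 = output file
#   $2 = list of tests skipped on all archs
#   $3 = list of tests skipped on ARM only
#   $4 = target architecture name
build_skip_list() {
    cat "$2" > "$1"
    case "$4" in
        arm*) cat "$3" >> "$1" ;;
    esac
}

# Demo with throwaway files; the entries and their comments are illustrative only.
workdir=$(mktemp -d)
cat > "$workdir/base.list" <<'EOF'
main.outfile_loaddata :         # example entry skipped on every arch
EOF
cat > "$workdir/arm.list" <<'EOF'
perfschema.func_file_io :       # rhbz#773116
perfschema.func_mutex :         # rhbz#773116
EOF

build_skip_list "$workdir/rh-skipped-tests.list" \
    "$workdir/base.list" "$workdir/arm.list" armv5tel
cat "$workdir/rh-skipped-tests.list"
```

Keeping the per-arch entries in their own file means the spec file only has to decide which files to concatenate, not carry the test names itself.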

Comment 21 Honza Horak 2012-02-10 12:14:53 UTC
(In reply to comment #19)
> (In reply to comment #18)
> > Should sys_vars.query_cache_size_basic_32 be disabled for the same reason, or
> > is that a different issue?
> 
> No time to look at it right now, but that would be a different issue --- I'm
> pretty certain that such a test would not depend on the performance counters. 
> Can you get us the specific expected-vs-actual diff for that test?

Regarding other "random" failures during the test run: I've seen a couple of them, but none was reproducible and each occurred only once (e.g. main.outfile_loaddata).

Sometimes the build fails even during buildroot preparation or while the configure script runs; maybe all these failures are just weird consequences of an unstable ARM environment. Only approx. 50% of builds in arm-koji either finish successfully or fail where I expect.

Comment 22 Honza Horak 2012-02-10 12:38:06 UTC
(In reply to comment #16)
> A small note about --skip-test-list: it turned out that tests listed in a file
> passed to mysql-test-run via the "--skip-test-list" argument do not appear in
> the output at all (labelling them [skipped] would be better). I hadn't realized
> that before, but I don't see it as a big problem, because these tests are
> always skipped deliberately, either by a user passing the argument or during
> the build.

I take this back: they are listed at the beginning of the output, just not in their original position:
perfschema.func_file_io   [ disabled ]  rhbz#773116 cycle counter does...
perfschema.func_mutex     [ disabled ]  rhbz#773116 cycle counter does...

(In reply to comment #21)
> Regarding other "random" failures during the test run: I've seen a couple of
> them, but none was reproducible and each occurred only once (e.g. main.outfile_loaddata).

Now I've realized that main.outfile_loaddata should be skipped by default on all archs, but the patch for skipping was wrong in that particular build.

Comment 23 D. Marlin 2012-02-10 20:58:34 UTC
(In reply to comment #19)
> (In reply to comment #18)
> > Should sys_vars.query_cache_size_basic_32 be disabled for the same reason, or
> > is that a different issue?
> 
> No time to look at it right now, but that would be a different issue --- I'm
> pretty certain that such a test would not depend on the performance counters. 
> Can you get us the specific expected-vs-actual diff for that test?

I have the failure in arm.koji now:

  http://arm.koji.fedoraproject.org/koji/taskinfo?taskID=372618

Where would I find the expected-vs-actual diff for that test?

Comment 24 Tom Lane 2012-02-10 21:23:02 UTC
(In reply to comment #23)
> Where would I find the expected-vs-actual diff for that test?

Hm, actually it looks like the diff is in the build.log, at least for Honza's test run:
http://arm.koji.fedoraproject.org/koji/getfile?taskID=372618&name=build.log

If I'm reading that stack trace right, something sent mysqld a signal 6 (SIGABRT).  Maybe it was an internal abort() call, but there's not one on the stack.  Any ideas?  Might the koji environment have done that if it thought the build was taking too long?

Comment 25 Tom Lane 2012-02-10 21:25:20 UTC
btw, I notice that build completed on armv7hl and failed only on armv5tel.  Any thoughts as to the significant difference?

Comment 26 D. Marlin 2012-02-10 22:23:03 UTC
There are several potential differences:
- different physical hardware 
  (v7 ran on Trim Slice (Tegra processor), v5 on Panda (OMAP processor))
- different kernels (required for Tegra and OMAP processors)
- different root file system (for v5tel vs. v7hl)

I have access to various boards and configurations, if there is an easy way to debug this test.  Please let me know what is needed, or if you would prefer, what you would need set up and we can provide access.

Note: My scratch build completed on both platforms, but I had disabled all three tests, so this is the only remaining test failure (all other tests passed).

  http://arm.koji.fedoraproject.org/koji/taskinfo?taskID=374288

Comment 27 Tom Lane 2012-02-10 22:34:06 UTC
(In reply to comment #26)
> Note: My scratch build completed on both platforms, but I had disabled all
> three tests, so this is the only remaining test failure (all other tests
> passed).

Hrmph.  That seems to eliminate my idea that maybe the build just ran overtime on the armv5tel builder.
Still, I see no plausible reason that it would be a mysql bug that a poll() call is interrupted by SIGABRT on one arm variant and not another.  I think you've got some kind of kernel bug to fight here.

I'm going to go ahead and push Honza's patch into git with just the cycle-counter-related tests disabled for now.  Do you need it now in f16/f17, or is rawhide sufficient for the arm project's purposes?

Comment 28 D. Marlin 2012-02-10 22:56:34 UTC
I think we need it in (at least) F17 and rawhide.  As for F16, I'll defer to Peter Robinson, as he reported it.

Comment 29 Tom Lane 2012-02-11 00:05:42 UTC
Hmm ... studying the gdb output a bit more closely, it seems that the test timed out; so the SIGABRT very likely got sent to mysqld by the mysql-test driver, which makes that thread's stack trace a red herring.  I now think that the interesting part of the output is

Thread 4 (LWP 1544):
#0  0x4057a770 in __lll_lock_wait_private () from /lib/libc.so.6
#1  0x40507f28 in get_free_list () from /lib/libc.so.6
#2  0x4050d388 in malloc () from /lib/libc.so.6
#3  0x0035498c in my_malloc (size=4294957792, my_flags=0) at /builddir/build/BUILD/mysql-5.5.20/mysys/my_malloc.c:38
#4  0x0013b470 in Query_cache::init_cache (this=0x84c180) at /builddir/build/BUILD/mysql-5.5.20/sql/sql_cache.cc:2191
#5  0x0013b834 in Query_cache::resize (this=0x84c180, query_cache_size_arg=4294966272) at /builddir/build/BUILD/mysql-5.5.20/sql/sql_cache.cc:1147
#6  0x001ff288 in fix_query_cache_size (self=<optimized out>, thd=<optimized out>, type=<optimized out>) at /builddir/build/BUILD/mysql-5.5.20/sql/sys_vars.cc:1847
#7  0x0010db58 in sys_var::update (this=<optimized out>, thd=0x23b8130, var=0x23ed4c0) at /builddir/build/BUILD/mysql-5.5.20/sql/set_var.cc:207
#8  0x0010e128 in set_var::update (this=<optimized out>, thd=<optimized out>) at /builddir/build/BUILD/mysql-5.5.20/sql/set_var.cc:674
#9  0x0010e7a0 in sql_set_variables (thd=0x23b8130, var_list=0x23b98cc) at /builddir/build/BUILD/mysql-5.5.20/sql/set_var.cc:578
#10 0x001649d0 in mysql_execute_command (thd=0x23b8130) at /builddir/build/BUILD/mysql-5.5.20/sql/sql_parse.cc:3178

which is evidently the thread that's actually trying to execute the commands from the test script.  Looking at this, and knowing that it timed out, we can presume that one of glibc's internal locks is stuck.  Now maybe that's mysql's fault, but given that we've not seen any such thing on other architectures, I'm still going to bet on an ARM-specific issue in glibc or perhaps the kernel.
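
As a side note on the numbers in frames #3 and #5, a quick arithmetic check shows how close the requested sizes are to 2^32. This is just a sketch to put the values in perspective; it assumes a shell with 64-bit arithmetic (such as bash), and it says nothing by itself about why the glibc lock is stuck.

```shell
#!/bin/bash
# Assumes 64-bit shell arithmetic (true for bash); a shell limited to
# 32-bit integers could not represent these values.
requested=4294957792     # size argument of my_malloc, frame #3
cache_size=4294966272    # query_cache_size_arg, frame #5
four_gib=4294967296      # 2^32

echo "query_cache_size_arg is $(( four_gib - cache_size )) bytes below 2^32"
echo "my_malloc size is $(( four_gib - requested )) bytes below 2^32"
```

Both values are within a few KiB of 4 GiB, so the test appears to be asking a 32-bit process for nearly its entire address space.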

Comment 30 D. Marlin 2012-02-12 05:08:25 UTC
I was trying to set up a system to debug this (Trim Slice, v5tel, rawhide mock chroot), but after three consecutive rpm builds in the mock chroot this test passes every time.

How did you set it up to catch a failure for debugging?

Comment 31 Honza Horak 2012-02-13 13:22:59 UTC
It seems it doesn't depend on architecture - I've seen the same failure with the same trace also on armv7hl build today:
https://arm.koji.fedoraproject.org/koji/taskinfo?taskID=384132

...and found an armv5tel build, made 5 days ago on armv7l.tegra, in which query_cache_size_basic_32 passed:
https://arm.koji.fedoraproject.org/koji/taskinfo?taskID=358241

While comparing the root.log of several builds (failed or passed) I found only one difference: /root/proc/filesystems was not mounted in the chroot on some builds, but I don't see any correlation between this and the build result.

(In reply to comment #30)
> How did you set it up to catch a failure for debugging?

I think there's only the trace in arm.koji's build log:
http://arm.koji.fedoraproject.org/koji/getfile?taskID=372618&name=build.log

Comment 32 D. Marlin 2012-03-03 14:50:37 UTC
We have a successful build on F17 for both v7hl and v5tel:

  http://arm.koji.fedoraproject.org/koji/buildinfo?buildID=57962

The change to disable perfschema.func_file_io and perfschema.func_mutex appears to be an effective workaround for F17+.  The correct solution will be to fully implement hardware performance monitoring on the ARM boards, but I think we can mark this as a workaround until then.

Comment 33 Honza Horak 2012-03-05 08:17:25 UTC
I was wondering why the query_cache_size_basic_32 test case hasn't failed, and it turned out that we're evidently not alone in seeing problems with it. It's been disabled by upstream since 5.5.21 with a reference to Bug#13535584, which is unfortunately inaccessible to ordinary non-Oracle people [1].

There's also another related upstream bug [2], since the test also fails while allocating a huge amount of cache memory in our case. This is a description from the upstream bug report [2]:

"Setting query_cache_size to larger values might fail depending on the memory pressure being put on the system. This can be seen on pushbuild as the test case query_cache_size_basic tries to allocate a +3GB query cache, which succeeds in some machines and fails in others."

[1] http://bugs.mysql.com/bug.php?id=13535584
[2] http://bugs.mysql.com/bug.php?id=36747

Comment 34 Tom Lane 2012-03-05 15:36:05 UTC
(In reply to comment #33)
> I was wondering why the query_cache_size_basic_32 test case hasn't failed, and
> it turned out that we're evidently not alone in seeing problems with it. It's
> been disabled by upstream since 5.5.21 with a reference to Bug#13535584, which
> is unfortunately inaccessible to ordinary non-Oracle people [1].

Fascinating.  Would love to know what symptom they are seeing, and whether it looks similar to ours or not.

> There's also another related upstream bug [2]

Those bugs are quite old and are shown as having been dealt with in 5.1-era releases.  In any case it doesn't appear to me that we're seeing an out-of-memory failure here (or if that is what causes this, it's a glibc bug not mysql's ...)

Anyway, we can safely say that 5.5.21 is going to build on ARM, because it's not going to try the questionable test.  So I'm going to close this bug.  If the problem reappears after Oracle does whatever they're gonna do to resolve their bug, we can open a new bug about that.

