Bug 480281
Summary: | LTP pselect01 Test Case Failure on S390(x) | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Qian Cai <qcai> | ||||||||||
Component: | kernel | Assignee: | Red Hat Kernel Manager <kernel-mgr> | ||||||||||
Status: | CLOSED NOTABUG | QA Contact: | Red Hat Kernel QE team <kernel-qe> | ||||||||||
Severity: | low | Docs Contact: | |||||||||||
Priority: | low | ||||||||||||
Version: | 5.3 | CC: | cward, drepper, jburke, jjarvis, onestero, phan, vmayatsk | ||||||||||
Target Milestone: | rc | ||||||||||||
Target Release: | --- | ||||||||||||
Hardware: | s390x | ||||||||||||
OS: | Linux | ||||||||||||
Whiteboard: | |||||||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||||||
Doc Text: | Story Points: | --- | |||||||||||
Clone Of: | Environment: | ||||||||||||
Last Closed: | 2010-05-18 06:44:37 UTC | Type: | --- | ||||||||||
Regression: | --- | Mount Type: | --- | ||||||||||
Documentation: | --- | CRM: | |||||||||||
Verified Versions: | Category: | --- | |||||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||||
Embargoed: | |||||||||||||
Bug Depends On: | |||||||||||||
Bug Blocks: | 593196 | ||||||||||||
Attachments: |
|
Description
Qian Cai
2009-01-16 06:39:03 UTC
Created attachment 329165 [details]
simpler reproducer
Confused. This: > gettimeofday({1232076656, 380467}, NULL) = 0 > pselect6(0, [], NULL, NULL, {3, 0}, {NULL, 8}) = 0 (Timeout) > gettimeofday({1232076660, 830618}, NULL) = 0 Looks indeed strange (but _might_ be explained by strace oddities, please use -T). But the test-case doesn't look right. > if(total_sec >= (end - start)) > printf("Sleep time was correct\n"); > else > printf("Sleep time was incorrect:%d != %d\n",total_sec,(end - start)); Why do you think "total_sec < (end - start)" is not correct? It is correct to sleep a bit more, and of course the task can be preempted before "end = time();" Another question, why do you use time() to measure the difference ? This doesn't look correct. If you run this program, int main(void) { unsigned start,end; for (;;) { start = time(NULL); end = time(NULL); printf("dt=%d\n", end-start); } } you will see it prints dt=1 sometimes. Of course, this doesn't mean we spent 1 sec between syscalls! This just means we should not use time() this way. Could you modify your test case and re-test? Also, - fd is not used, why do you open the file? - readfds is not used (note you are doing pselect(0, ...) (In reply to comment #2) > Confused. > > This: > > gettimeofday({1232076656, 380467}, NULL) = 0 > > pselect6(0, [], NULL, NULL, {3, 0}, {NULL, 8}) = 0 (Timeout) > > gettimeofday({1232076660, 830618}, NULL) = 0 > > Looks indeed strange (but _might_ be explained by strace > oddities, please use -T). Sorry for the late reply -- was on vacation in the last few days. Here is the result of using -T. ... gettimeofday({1233481609, 240249}, NULL) = 0 <0.000009> pselect6(0, [], NULL, NULL, {4, 0}, {NULL, 8}) = 0 (Timeout) <4.829942> gettimeofday({1233481614, 70340}, NULL) = 0 <0.000008> ... ... gettimeofday({1233481882, 941390}, NULL) = 0 <0.000008> pselect6(0, [], NULL, NULL, {4, 0}, {NULL, 8}) = 0 (Timeout) <5.128518> gettimeofday({1233481888, 70058}, NULL) = 0 <0.000009> ... As you can see from the above, pselect6() looks like indeed spend too much time there. > > But the test-case doesn't look right. > > > if(total_sec >= (end - start)) > > printf("Sleep time was correct\n"); > > else > > printf("Sleep time was incorrect:%d != %d\n",total_sec,(end - start)); > > Why do you think "total_sec < (end - start)" is not correct? > It is correct to sleep a bit more, and of course the task can > be preempted before "end = time();" > > Another question, why do you use time() to measure the difference ? > This doesn't look correct. If you run this program, > > int main(void) > { > unsigned start,end; > > for (;;) { > start = time(NULL); > end = time(NULL); > printf("dt=%d\n", end-start); > } > } > > you will see it prints dt=1 sometimes. Of course, this > doesn't mean we spent 1 sec between syscalls! This just > means we should not use time() this way. I agree the test code can be improved. I'll have a look at this. > > Could you modify your test case and re-test? > > Also, > > - fd is not used, why do you open the file? > > - readfds is not used (note you are doing pselect(0, ...) Thanks for the suggestion. I have updated the reproducer (in the attachment), and re-tested. The same thing happens as well. ... gettimeofday({1233482212, 213}, NULL) = 0 <0.000009> pselect6(0, NULL, NULL, NULL, {3, 0}, {NULL, 8}) = 0 (Timeout) <4.069725> gettimeofday({1233482216, 70089}, NULL) = 0 <0.000008> ... Created attachment 330540 [details]
updated simpler reproducer
The strange thing is that, the system is pretty idle, but the system call was taking so much time probably either executing or doing context-switch. pselect6(0, NULL, NULL, NULL, {3, 0}, {NULL, 8}) = 0 (Timeout) <4.069725> pselect6(0, [], NULL, NULL, {4, 0}, {NULL, 8}) = 0 (Timeout) <5.128518> Is this normal for s390x? Created attachment 341375 [details] another test-case > Is this normal for s390x? This doesn't look normal ;) could you please run this simple program and show the output? (without strace) Thanks Oleg! I have run it 10 times with 134.el5 kernel without ntpd. sleep: 1 1.007755 2 2.009930 3 3.009910 4 4.009915 5 5.009914 6 6.009994 7 7.009896 8 8.009944 9 9.009910 10 10.009962 select: 1 0.999963 2 1.999916 3 2.999955 4 3.999936 5 4.999949 6 6.150086 7 6.999791 8 8.409916 9 8.999930 10 9.999972 pselect: 1 0.999924 2 1.999915 3 2.999940 4 3.999933 5 4.999946 6 5.999968 7 6.999922 8 7.999921 9 8.999971 10 9.999906 --------------------- sleep: 1 1.003654 2 2.009784 3 3.009944 4 4.009975 5 5.009900 6 6.009912 7 7.009969 8 8.009881 9 9.009912 10 10.009885 select: 1 0.999983 2 1.999859 3 2.999954 4 4.279910 5 4.999957 6 5.999922 7 6.999952 8 7.999920 9 8.999904 10 9.999982 pselect: 1 1.000609 2 1.999220 3 2.999940 4 3.999930 5 4.999894 6 5.999921 7 6.999941 8 7.999922 9 8.999953 10 9.999886 ---------------------- sleep: 1 1.000256 2 2.009920 3 3.009915 4 4.009907 5 5.009924 6 6.009960 7 7.009925 8 8.009969 9 9.009872 10 10.009421 select: 1 0.999902 2 1.999962 3 2.999946 4 3.999956 5 4.999957 6 5.999988 7 6.999906 8 7.999951 9 8.999959 10 9.999990 pselect: 1 0.999938 2 2.000308 3 2.999589 4 3.999955 5 4.999927 6 5.999989 7 6.999942 8 7.999933 9 8.999938 10 9.999923 ------------------------- sleep: 1 1.006661 2 2.009893 3 3.009879 4 4.009969 5 5.009926 6 6.009917 7 7.009943 8 8.009928 9 9.009938 10 10.009949 select: 1 0.999914 2 1.999936 3 2.999965 4 3.999914 5 4.999937 6 5.999917 7 6.999956 8 7.999954 9 8.999951 10 9.999916 pselect: 1 0.999929 2 1.999935 3 2.999940 4 3.999957 5 4.999926 6 5.999908 7 6.999938 8 7.999962 9 8.999871 10 9.999994 ---------------------- sleep: 1 1.008979 2 2.009875 3 3.009977 4 4.009966 5 5.009923 6 6.010032 7 7.009867 8 8.010079 9 9.009842 10 10.009960 select: 1 0.999962 2 2.000001 3 2.999885 4 3.999969 5 4.999954 6 5.999930 7 6.999946 8 8.000032 9 8.999887 10 9.999959 pselect: 1 0.999950 2 1.999973 3 2.999947 4 4.000009 5 4.999977 6 5.999900 7 6.999990 8 7.999994 9 8.999894 10 10.000064 ---------------- sleep: 1 1.001548 2 2.009847 3 3.009907 4 4.009945 5 5.009973 6 6.009921 7 7.009890 8 8.009942 9 9.009933 10 10.009883 select: 1 0.999967 2 1.999918 3 2.999880 4 3.999958 5 4.999883 6 5.999928 7 6.999921 8 7.999907 9 8.999977 10 9.999911 pselect: 1 0.999907 2 1.999904 3 2.999923 4 3.999967 5 4.999896 6 5.999866 7 6.999990 8 7.999932 9 8.999873 10 9.999958 ----------------- sleep: 1 1.009735 2 2.009839 3 3.009945 4 4.009918 5 5.009940 6 6.009911 7 7.009951 8 8.009875 9 9.009927 10 10.009860 select: 1 0.999936 2 1.999927 3 2.999958 4 3.999889 5 4.999963 6 5.999929 7 6.999892 8 7.999900 9 8.999954 10 9.999908 pselect: 1 0.999917 2 1.999928 3 2.999923 4 3.999974 5 4.999926 6 5.999878 7 7.130058 8 7.999814 9 8.999980 10 9.999870 -------------- sleep: 1 1.003711 2 2.009868 3 3.009934 4 4.009882 5 5.009939 6 6.009939 7 7.009957 8 8.009875 9 9.009926 10 10.009860 select: 1 0.999919 2 2.000010 3 2.999804 4 3.999944 5 4.999890 6 5.999910 7 6.999923 8 7.999935 9 8.999960 10 9.999890 pselect: 1 0.999909 2 2.001257 3 2.998627 4 3.999897 5 4.999934 6 5.999938 7 6.999909 8 7.999927 9 8.999918 10 9.999912 ------------ sleep: 1 1.008587 2 2.009862 3 3.009930 4 4.009910 5 5.009981 6 6.009924 7 7.011863 8 8.008037 9 9.009917 10 10.009906 select: 1 0.999915 2 1.999899 3 3.010014 4 3.999923 5 4.999991 6 5.999858 7 6.999951 8 7.999952 9 8.999900 10 9.999959 pselect: 1 0.999996 2 1.999875 3 3.000013 4 3.999926 5 5.001687 6 5.998159 7 6.999941 8 7.999927 9 8.999951 10 9.999896 ----------------- sleep: 1 1.005401 2 2.009865 3 3.009958 4 4.009917 5 5.009995 6 6.009878 7 7.009933 8 8.009907 9 9.009970 10 10.009935 select: 1 0.999934 2 1.999931 3 2.999930 4 3.999929 5 5.002017 6 5.997879 7 6.999892 8 7.999937 9 8.999976 10 9.999937 pselect: 1 0.999890 2 1.999931 3 2.999977 4 3.999910 5 4.999964 6 5.999945 7 6.999950 8 8.399931 9 8.999890 10 9.999972 (In reply to comment #7) > > I have run it 10 times with 134.el5 kernel without ntpd. Thanks! And I am sorry for delays. I don't see the ~1 sec delays which you had with the previous test-case, but > select: > ... > 8 8.409916 this doesn't look good too. Note that nanosleep() (which uses hrtimers) behaves well. select/pselect use kernel/timer.c, and currently I suspect we have some s390 specific problems here. I think we can rule out scheduler. Otherwise, we should have delays with nanosleep too. Also, I do not think select() itself has problems. Will try to investigate further. Thanks Oleg Nesterov! I also hit this bug with the latest 4.8.z kernel 2.6.9-89.0.11.EL but on an ia64 machine. I had run the 'pselect01_64' come from LTP about 100 times and reproduce this bug one times. And the results of the test-case come from comment #6 is here, I have run it about 80 times. I will upload it. Created attachment 359508 [details] The outputs of test-case come from comment #6 on a ia64 machine. If ntpd will actualize time during sleep of pselect, pselect will fail. It is easy to check it by hands. (In reply to comment #12) > If ntpd will actualize time during sleep of pselect, pselect will fail. It is > easy to check it by hands. Before starting testing, the case stops ntpd: ... # ---------- Start Test ------------- DisableNTP TESTS='syscalls commands crashme dio fs fsx math mm nptl pty sched ipc' for t in $TESTS ; do CleanUp $t done for t in $TESTS ; do OUTPUTFILE=`mktemp /tmp/tmp.XXXXXX` RunTest $t done EnableNTP ... Reverse mirror and assistance request sent to IBM. ------- Comment From edpollar.com 2010-05-17 11:08 EDT------- reverse mirror of 480281 - LTP pselect01 Test Case Failure on S390(x) ------- Comment From gmuelas.com 2010-05-18 02:15 EDT------- Hello Red Hat, back in 2005 we found the a similar problem where LTC pselect was failing. At that time the problem and suggested solution was: <Help from Kevin Corry> It looks like the testcase is checking that the actual time slept is equal to the desired time, which may not always be valid. An alternative solution would be to check that the actual slept time is greater than or equal to the desired time. Then in 2008 was suggested that: <Mark A. Ver> My 2 cents ... If I understand the man page correctly, the main reason for the timeout is to keep the select/pselect call from blocking indefinitely. So just getting a return in a reasonable time means success. If we also want to test that the timeout is somewhat within the ballpark of expectations, we could give it a suitable leeway (like +/- 2 seconds). As far as I can tell the elapsed time is never more than a second off and unless this is being used for a real-time system it's not critical for the time to match. And the problem was finally close with this comment: Closing this Bug as this is not a real time system. Also, the failures has nothing to do with test case design flaw, but, completely at the mercy of how system behaves in different instances. ------- Comment From eberhard.pasch.com 2010-05-18 04:07 EDT------- LTP improved quite a bit by now. As the mirror didn't really work, can someone from the IBM PM team just paste the RH bugzilla please? My first question to RH without knowing any more details (as they are hidden) would be: does this testcase work on RHEL 5.5? |