Bug 480281

Summary: LTP pselect01 Test Case Failure on S390(x)
Product: Red Hat Enterprise Linux 5 Reporter: Qian Cai <qcai>
Component: kernelAssignee: Red Hat Kernel Manager <kernel-mgr>
Status: CLOSED NOTABUG QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: low Docs Contact:
Priority: low    
Version: 5.3CC: cward, drepper, jburke, jjarvis, onestero, phan, vmayatsk
Target Milestone: rc   
Target Release: ---   
Hardware: s390x   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-05-18 06:44:37 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 593196    
Attachments:
Description Flags
simpler reproducer
none
updated simpler reproducer
none
another test-case
none
The outputs of test-case come from comment #6 on a ia64 machine. none

Description Qian Cai 2009-01-16 06:39:03 UTC
Description of problem:
The test case is doing the following on s390(x).

start = time();
pselect(0,&readfds,NULL,NULL,(struct timespec *)&tv,NULL);
end = time();

Then, it compares whether,

tv.tv_sec = end - start

However, it fails the comparison sometimes. For example,

# strace ./pselect
...
write(1, "pselect01    0  INFO  :  Testing"..., 74pselect01    0  INFO  :  Testing basic pselect sanity,Sleeping for 3 secs
) = 74
gettimeofday({1232076656, 380467}, NULL) = 0
pselect6(0, [], NULL, NULL, {3, 0}, {NULL, 8}) = 0 (Timeout)
gettimeofday({1232076660, 830618}, NULL) = 0
write(1, "pselect01    4  FAIL  :  Sleep t"..., 57pselect01    4  FAIL  :  Sleep time was incorrect:3 != 4
) = 57
...

Here 3 != 1232076660 - 1232076656. I have also seen the failure on s390 with RHEL 4 kernel.

Version-Release number of selected component (if applicable):
kernel-2.6.18-128.el5
kernel-2.6.9-78.0.13.EL

How reproducible:
It is usually reproducible within 10 attempts.

Steps to Reproduce:
1. run the simpler reproducer 10 times.

Actual results:
Some failure like,

# ./pselect
open: Success
Basic pselect syscall testing....OK
Testing basic pselect sanity,Sleeping for 1 secs
Sleep time was correct
Testing basic pselect sanity,Sleeping for 2 secs
Sleep time was correct
Testing basic pselect sanity,Sleeping for 3 secs
Sleep time was correct
Testing basic pselect sanity,Sleeping for 4 secs
Sleep time was correct
Testing basic pselect sanity,Sleeping for 5 secs
Sleep time was correct
Testing basic pselect sanity,Sleeping for 6 secs
Sleep time was correct
Testing basic pselect sanity,Sleeping for 7 secs
Sleep time was correct
Testing basic pselect sanity,Sleeping for 8 secs
Sleep time was incorrect:8 != 9
Testing basic pselect sanity,Sleeping for 9 secs
Sleep time was correct
Testing basic pselect sanity,Sleeping for 10 secs
Sleep time was correct

Expected results:

# ./pselect
open: Success
Basic pselect syscall testing....OK
Testing basic pselect sanity,Sleeping for 1 secs
Sleep time was correct
Testing basic pselect sanity,Sleeping for 2 secs
Sleep time was correct
Testing basic pselect sanity,Sleeping for 3 secs
Sleep time was correct
Testing basic pselect sanity,Sleeping for 4 secs
Sleep time was correct
Testing basic pselect sanity,Sleeping for 5 secs
Sleep time was correct
Testing basic pselect sanity,Sleeping for 6 secs
Sleep time was correct
Testing basic pselect sanity,Sleeping for 7 secs
Sleep time was correct
Testing basic pselect sanity,Sleeping for 8 secs
Sleep time was correct
Testing basic pselect sanity,Sleeping for 9 secs
Sleep time was correct
Testing basic pselect sanity,Sleeping for 10 secs
Sleep time was correct

Comment 1 Qian Cai 2009-01-16 06:43:14 UTC
Created attachment 329165 [details]
simpler reproducer

Comment 2 Oleg Nesterov 2009-01-19 13:25:33 UTC
Confused.

This:
> gettimeofday({1232076656, 380467}, NULL) = 0
> pselect6(0, [], NULL, NULL, {3, 0}, {NULL, 8}) = 0 (Timeout)
> gettimeofday({1232076660, 830618}, NULL) = 0

Looks indeed strange (but _might_ be explained by strace
oddities, please use -T).

But the test-case doesn't look right.

>    if(total_sec >= (end - start))
>      printf("Sleep time was correct\n");
>    else
>      printf("Sleep time was incorrect:%d != %d\n",total_sec,(end - start));

Why do you think "total_sec < (end - start)" is not correct?
It is correct to sleep a bit more, and of course the task can
be preempted before "end = time();"

Another question, why do you use time() to measure the difference ?
This doesn't look correct. If you run this program,

int main(void)
{
        unsigned start,end;

        for (;;) {
                start = time(NULL);
                end = time(NULL);
                printf("dt=%d\n", end-start);
        }
}

you will see it prints dt=1 sometimes. Of course, this
doesn't mean we spent 1 sec between syscalls! This just
means we should not use time() this way.

Could you modify your test case and re-test?

Also,

     - fd is not used, why do you open the file?

     - readfds is not used (note you are doing pselect(0, ...)

Comment 3 Qian Cai 2009-02-01 10:05:09 UTC
(In reply to comment #2)
> Confused.
> 
> This:
> > gettimeofday({1232076656, 380467}, NULL) = 0
> > pselect6(0, [], NULL, NULL, {3, 0}, {NULL, 8}) = 0 (Timeout)
> > gettimeofday({1232076660, 830618}, NULL) = 0
> 
> Looks indeed strange (but _might_ be explained by strace
> oddities, please use -T).

Sorry for the late reply -- was on vacation in the last few days. Here is the result of using -T.

...
gettimeofday({1233481609, 240249}, NULL) = 0 <0.000009>
pselect6(0, [], NULL, NULL, {4, 0}, {NULL, 8}) = 0 (Timeout) <4.829942>
gettimeofday({1233481614, 70340}, NULL) = 0 <0.000008>
...

...
gettimeofday({1233481882, 941390}, NULL) = 0 <0.000008>
pselect6(0, [], NULL, NULL, {4, 0}, {NULL, 8}) = 0 (Timeout) <5.128518>
gettimeofday({1233481888, 70058}, NULL) = 0 <0.000009>
...

As you can see from the above, pselect6() looks like indeed spend too much time there.

> 
> But the test-case doesn't look right.
> 
> >    if(total_sec >= (end - start))
> >      printf("Sleep time was correct\n");
> >    else
> >      printf("Sleep time was incorrect:%d != %d\n",total_sec,(end - start));
> 
> Why do you think "total_sec < (end - start)" is not correct?
> It is correct to sleep a bit more, and of course the task can
> be preempted before "end = time();"
> 
> Another question, why do you use time() to measure the difference ?
> This doesn't look correct. If you run this program,
> 
> int main(void)
> {
>         unsigned start,end;
> 
>         for (;;) {
>                 start = time(NULL);
>                 end = time(NULL);
>                 printf("dt=%d\n", end-start);
>         }
> }
> 
> you will see it prints dt=1 sometimes. Of course, this
> doesn't mean we spent 1 sec between syscalls! This just
> means we should not use time() this way.

I agree the test code can be improved. I'll have a look at this.

> 
> Could you modify your test case and re-test?
> 
> Also,
> 
>      - fd is not used, why do you open the file?
> 
>      - readfds is not used (note you are doing pselect(0, ...)

Thanks for the suggestion. I have updated the reproducer (in the attachment), and re-tested. The same thing happens as well.

...
gettimeofday({1233482212, 213}, NULL)   = 0 <0.000009>
pselect6(0, NULL, NULL, NULL, {3, 0}, {NULL, 8}) = 0 (Timeout) <4.069725>
gettimeofday({1233482216, 70089}, NULL) = 0 <0.000008>
...

Comment 4 Qian Cai 2009-02-01 10:07:12 UTC
Created attachment 330540 [details]
updated simpler reproducer

Comment 5 Qian Cai 2009-03-16 16:25:47 UTC
The strange thing is that, the system is pretty idle, but the system call was taking so much time probably either executing or doing context-switch.
pselect6(0, NULL, NULL, NULL, {3, 0}, {NULL, 8}) = 0 (Timeout) <4.069725>
pselect6(0, [], NULL, NULL, {4, 0}, {NULL, 8}) = 0 (Timeout) <5.128518>

Is this normal for s390x?

Comment 6 Oleg Nesterov 2009-04-27 00:23:32 UTC
Created attachment 341375 [details]
another test-case

> Is this normal for s390x?

This doesn't look normal ;)

could you please run this simple program and show the output?
(without strace)

Comment 7 Qian Cai 2009-04-27 03:02:15 UTC
Thanks Oleg! I have run it 10 times with 134.el5 kernel without ntpd.

sleep:
1 1.007755
2 2.009930
3 3.009910
4 4.009915
5 5.009914
6 6.009994
7 7.009896
8 8.009944
9 9.009910
10 10.009962

select:
1 0.999963
2 1.999916
3 2.999955
4 3.999936
5 4.999949
6 6.150086
7 6.999791
8 8.409916
9 8.999930
10 9.999972

pselect:
1 0.999924
2 1.999915
3 2.999940
4 3.999933
5 4.999946
6 5.999968
7 6.999922
8 7.999921
9 8.999971
10 9.999906

---------------------

sleep:
1 1.003654
2 2.009784
3 3.009944
4 4.009975
5 5.009900
6 6.009912
7 7.009969
8 8.009881
9 9.009912
10 10.009885

select:
1 0.999983
2 1.999859
3 2.999954
4 4.279910
5 4.999957
6 5.999922
7 6.999952
8 7.999920
9 8.999904
10 9.999982

pselect:
1 1.000609
2 1.999220
3 2.999940
4 3.999930
5 4.999894
6 5.999921
7 6.999941
8 7.999922
9 8.999953
10 9.999886

----------------------

sleep:
1 1.000256
2 2.009920
3 3.009915
4 4.009907
5 5.009924
6 6.009960
7 7.009925
8 8.009969
9 9.009872
10 10.009421

select:
1 0.999902
2 1.999962
3 2.999946
4 3.999956
5 4.999957
6 5.999988
7 6.999906
8 7.999951
9 8.999959
10 9.999990

pselect:
1 0.999938
2 2.000308
3 2.999589
4 3.999955
5 4.999927
6 5.999989
7 6.999942
8 7.999933
9 8.999938
10 9.999923

-------------------------

sleep:
1 1.006661
2 2.009893
3 3.009879
4 4.009969
5 5.009926
6 6.009917
7 7.009943
8 8.009928
9 9.009938
10 10.009949

select:
1 0.999914
2 1.999936
3 2.999965
4 3.999914
5 4.999937
6 5.999917
7 6.999956
8 7.999954
9 8.999951
10 9.999916

pselect:
1 0.999929
2 1.999935
3 2.999940
4 3.999957
5 4.999926
6 5.999908
7 6.999938
8 7.999962
9 8.999871
10 9.999994

----------------------

sleep:
1 1.008979
2 2.009875
3 3.009977
4 4.009966
5 5.009923
6 6.010032
7 7.009867
8 8.010079
9 9.009842
10 10.009960

select:
1 0.999962
2 2.000001
3 2.999885
4 3.999969
5 4.999954
6 5.999930
7 6.999946
8 8.000032
9 8.999887
10 9.999959

pselect:
1 0.999950
2 1.999973
3 2.999947
4 4.000009
5 4.999977
6 5.999900
7 6.999990
8 7.999994
9 8.999894
10 10.000064

----------------

sleep:
1 1.001548
2 2.009847
3 3.009907
4 4.009945
5 5.009973
6 6.009921
7 7.009890
8 8.009942
9 9.009933
10 10.009883

select:
1 0.999967
2 1.999918
3 2.999880
4 3.999958
5 4.999883
6 5.999928
7 6.999921
8 7.999907
9 8.999977
10 9.999911

pselect:
1 0.999907
2 1.999904
3 2.999923
4 3.999967
5 4.999896
6 5.999866
7 6.999990
8 7.999932
9 8.999873
10 9.999958

-----------------

sleep:
1 1.009735
2 2.009839
3 3.009945
4 4.009918
5 5.009940
6 6.009911
7 7.009951
8 8.009875
9 9.009927
10 10.009860

select:
1 0.999936
2 1.999927
3 2.999958
4 3.999889
5 4.999963
6 5.999929
7 6.999892
8 7.999900
9 8.999954
10 9.999908

pselect:
1 0.999917
2 1.999928
3 2.999923
4 3.999974
5 4.999926
6 5.999878
7 7.130058
8 7.999814
9 8.999980
10 9.999870

--------------

sleep:
1 1.003711
2 2.009868
3 3.009934
4 4.009882
5 5.009939
6 6.009939
7 7.009957
8 8.009875
9 9.009926
10 10.009860

select:
1 0.999919
2 2.000010
3 2.999804
4 3.999944
5 4.999890
6 5.999910
7 6.999923
8 7.999935
9 8.999960
10 9.999890

pselect:
1 0.999909
2 2.001257
3 2.998627
4 3.999897
5 4.999934
6 5.999938
7 6.999909
8 7.999927
9 8.999918
10 9.999912

------------

sleep:
1 1.008587
2 2.009862
3 3.009930
4 4.009910
5 5.009981
6 6.009924
7 7.011863
8 8.008037
9 9.009917
10 10.009906

select:
1 0.999915
2 1.999899
3 3.010014
4 3.999923
5 4.999991
6 5.999858
7 6.999951
8 7.999952
9 8.999900
10 9.999959

pselect:
1 0.999996
2 1.999875
3 3.000013
4 3.999926
5 5.001687
6 5.998159
7 6.999941
8 7.999927
9 8.999951
10 9.999896

-----------------

sleep:
1 1.005401
2 2.009865
3 3.009958
4 4.009917
5 5.009995
6 6.009878
7 7.009933
8 8.009907
9 9.009970
10 10.009935

select:
1 0.999934
2 1.999931
3 2.999930
4 3.999929
5 5.002017
6 5.997879
7 6.999892
8 7.999937
9 8.999976
10 9.999937

pselect:
1 0.999890
2 1.999931
3 2.999977
4 3.999910
5 4.999964
6 5.999945
7 6.999950
8 8.399931
9 8.999890
10 9.999972

Comment 8 Oleg Nesterov 2009-04-28 18:22:17 UTC
(In reply to comment #7)
>
> I have run it 10 times with 134.el5 kernel without ntpd.

Thanks! And I am sorry for delays.

I don't see the ~1 sec delays which you had with the previous
test-case, but

> select:
> ...
> 8 8.409916

this doesn't look good too. Note that nanosleep() (which uses
hrtimers) behaves well. select/pselect use kernel/timer.c, and
currently I suspect we have some s390 specific problems here.

I think we can rule out scheduler. Otherwise, we should have
delays with nanosleep too. Also, I do not think select() itself
has problems.

Will try to investigate further.

Comment 9 Qian Cai 2009-05-01 03:48:53 UTC
Thanks Oleg Nesterov!

Comment 10 Han Pingtian 2009-09-02 10:56:08 UTC
I also hit this bug with the latest 4.8.z kernel 2.6.9-89.0.11.EL but on an ia64 machine. I had run the 'pselect01_64' come from LTP about 100 times and reproduce this bug one times. 

And the results of the test-case come from comment #6 is here, I have run it about 80 times. I will upload it.

Comment 11 Han Pingtian 2009-09-02 11:00:06 UTC
Created attachment 359508 [details]
The outputs of test-case come from comment #6 on a ia64 machine.

Comment 12 Vitaly Mayatskikh 2009-09-02 13:49:31 UTC
If ntpd will actualize time during sleep of pselect, pselect will fail. It is easy to check it by hands.

Comment 13 Han Pingtian 2009-09-03 01:57:27 UTC
(In reply to comment #12)
> If ntpd will actualize time during sleep of pselect, pselect will fail. It is
> easy to check it by hands.  

Before starting testing, the case stops ntpd:

...
# ---------- Start Test -------------

DisableNTP

TESTS='syscalls commands crashme dio fs fsx math mm nptl pty sched ipc'

for t in $TESTS ; do
    CleanUp $t
done

for t in $TESTS ; do
    OUTPUTFILE=`mktemp /tmp/tmp.XXXXXX`
    RunTest $t
done

EnableNTP
...

Comment 15 John Jarvis 2010-05-17 14:26:34 UTC
Reverse mirror and assistance request sent to IBM.

Comment 16 IBM Bug Proxy 2010-05-17 15:11:57 UTC
------- Comment From edpollar.com 2010-05-17 11:08 EDT-------
reverse mirror of 480281  - LTP pselect01 Test Case Failure on S390(x)

Comment 17 IBM Bug Proxy 2010-05-18 06:21:03 UTC
------- Comment From gmuelas.com 2010-05-18 02:15 EDT-------
Hello Red Hat,

back in 2005 we found the a similar problem where LTC pselect was failing. At that time the problem and suggested solution was:
<Help from Kevin Corry>
It looks like the testcase is checking that the actual time slept is equal to
the desired time, which may not always be valid.  An alternative solution would
be to check that the actual slept time is greater than or equal to the desired time.

Then in 2008 was suggested that:
<Mark A. Ver>
My 2 cents ...
If I understand the man page correctly, the main reason for the timeout is to
keep the select/pselect call from blocking indefinitely.  So just getting a
return in a reasonable time means success.  If we also want to test that the
timeout is somewhat within the ballpark of expectations, we could give it a
suitable leeway (like +/- 2 seconds).  As far as I can tell the elapsed time is
never more than a second off and unless this is being used for a real-time
system it's not critical for the time to match.

And the problem was finally close with this comment:
Closing this Bug as this is not a real time system. Also, the failures has
nothing to do with test case design flaw, but, completely at the mercy of how
system behaves in different instances.

Comment 18 IBM Bug Proxy 2010-05-18 08:11:43 UTC
------- Comment From eberhard.pasch.com 2010-05-18 04:07 EDT-------
LTP improved quite a bit by now. As the mirror didn't really work, can someone from the IBM PM team just paste the RH bugzilla please?

My first question to RH without knowing any more details (as they are hidden) would be: does this testcase work on RHEL 5.5?