I copy from the e-mail:

On 05/09/2011 10:48 AM, wang yanfu wrote:
> hi,
>
> There's one nfs-utils regression testcase imported in beaker[1], and
> it's always failed on RHEL5.6/5.7 ppc64 arch which should pass as
> expected, others arch are fine.
> failed job on ppc64:
> https://beaker.engineering.redhat.com/jobs/82560
> https://beaker.engineering.redhat.com/jobs/82090
>
> passed job on others arch:
> https://beaker.engineering.redhat.com/jobs/82053
> https://beaker.engineering.redhat.com/jobs/82059
> https://beaker.engineering.redhat.com/jobs/82064
> https://beaker.engineering.redhat.com/jobs/82058
>
> I found the issue is related to restart nfs daemon by checking the
> TESTOUT.log.
>
> It could pass if I reserve one ppc64 machine for RHEL5.6 and run it by
> manual.
>
> And I added sleep statement between start nfs service into script like
> below but also failed if run it by submitting job in beaker.
> <snip>
> rlPhaseStartSetup
> rlAssertRpm $PACKAGE
> rlServiceStart nfs
> rlPhaseEnd
>
> rlPhaseStartTest
> sleep 1
> rlRun "service nfs restart" 0 "Restarting nfs service (1)"
> </snip>
>
> Is there any special attention against ppc64+beakerlib or what else I'm
> missing?
>
> thanks in advance.
>
>
> 1. https://beaker.engineering.redhat.com/tasks/6101
>
>

I had problems with nfs daemon refusing to stop from a script running without a TTY. The workaround was to use screen. But that should not be a problem in this case, as scripts are running on /dev/console as in RHTS, but you could try it.

Cheers,
--
Marian

Actual results:
test case failed on rhel5.6/5.7 ppc64 arch.

Expected results:
test case pass on rhel5.6/5.7 ppc64 arch.

Additional info:
Below from customer - Fujitsu engineer's investigation:

Looks like nfsd cannot be killed for some reason when the test is automatically run in Beaker. I'm not sure what causes this and how to work around the issue in the test script. I tried "sleep 90" before restarting nfs service, but it didn't help.
I also ran the test on a RHEL6.1 ppc64 system, and the issue didn't occur there. This could be an OS issue, but I'm not sure at all.
I ran the test + /distribution/reserve, and even after logging in to the machine, root couldn't stop nfsd with the init.d script.

ps afxl
1 0 3384 1 18 0 0 0 svc_re S ? 0:00 [nfsd]
1 0 3385 1 18 0 0 0 svc_re S ? 0:00 [nfsd]
1 0 3386 1 18 0 0 0 svc_re S ? 0:00 [nfsd]
1 0 3387 1 18 0 0 0 svc_re S ? 0:00 [nfsd]
1 0 3388 1 18 0 0 0 svc_re S ? 0:00 [nfsd]
1 0 3389 1 19 0 0 0 svc_re S ? 0:00 [nfsd]
1 0 3390 1 19 0 0 0 svc_re S ? 0:00 [nfsd]
1 0 3391 1 19 0 0 0 svc_re S ? 0:00 [nfsd]

service nfs stop -> no effect
killall -2 nfsd -> no effect
killall nfsd -> no effect
killall -9 nfsd -> this will finally kill it
This seems like a kernel issue. nfsd sets SIGINT as its terminating signal, but the handler seems to be left in the state it was set to in userspace before the execve to rpc.nfsd. /etc/init.d/nfs uses SIGINT (-2) to stop nfsd.

--- cut runtest.sh ---
#!/bin/bash
rpc.nfsd 8
--- cut ---

--- cut runtest2.sh ---
#!/bin/bash
./runtest.sh &
--- cut ---

1. running ./runtest.sh will spawn nfsd, and it reacts to SIGINT properly by terminating -> GOOD

[root@ ~]# ./runtest.sh
[root@ ~]# ps afx | grep nfsd
6284 ? S< 0:00 \_ [nfsd4]
6294 pts/0 S+ 0:00 | \_ grep nfsd
6285 ? S 0:00 [nfsd]
6286 ? S 0:00 [nfsd]
6287 ? S 0:00 [nfsd]
6288 ? S 0:00 [nfsd]
6289 ? S 0:00 [nfsd]
6290 ? S 0:00 [nfsd]
6291 ? S 0:00 [nfsd]
6292 ? S 0:00 [nfsd]
[root@ ~]# killall -2 nfsd
[root@ ~]# ps afx | grep nfsd
6301 pts/0 S+ 0:00 | \_ grep nfsd

2. running runtest2.sh will also spawn nfsd, but SIGINT will no longer work -> BAD

[root@ ~]# ./runtest2.sh
[root@ ~]# ps afx | grep nfsd
6311 ? S< 0:00 \_ [nfsd4]
6321 pts/0 S+ 0:00 | \_ grep nfsd
6312 ? S 0:00 [nfsd]
6313 ? S 0:00 [nfsd]
6314 ? S 0:00 [nfsd]
6315 ? S 0:00 [nfsd]
6316 ? S 0:00 [nfsd]
6317 ? S 0:00 [nfsd]
6318 ? S 0:00 [nfsd]
6319 ? S 0:00 [nfsd]
[root@ ~]# killall -2 nfsd
[root@ ~]# ps afx | grep nfsd
6311 ? S< 0:00 \_ [nfsd4]
6324 pts/0 S+ 0:00 | \_ grep nfsd
6312 ? S 0:00 [nfsd]
6313 ? S 0:00 [nfsd]
6314 ? S 0:00 [nfsd]
6315 ? S 0:00 [nfsd]
6316 ? S 0:00 [nfsd]
6317 ? S 0:00 [nfsd]
6318 ? S 0:00 [nfsd]
6319 ? S 0:00 [nfsd]

Looking at ptrace, this is one of the differences before it does execve:

from 1.) rt_sigaction(SIGINT, {0x1, [], 0}, <unfinished ...>
from 2.) rt_sigaction(SIGINT, {SIG_DFL, [], 0}, {SIG_DFL, [], 0}, 8) = 0

I ran both runtest.sh and runtest2.sh while the system was reserved.
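Some background that may help when reading the reproducer above: a SIG_IGN disposition survives execve, and POSIX shells start asynchronous (&) commands with SIGINT and SIGQUIT ignored when job control is off, as it is in a script. In the rt_sigaction traces, a handler value of 0x1 is SIG_IGN on Linux. The following is a minimal sketch, not taken from the attached patch, for checking which SIGINT disposition a process inherited. It assumes a Linux /proc filesystem, and the function names sigint_from_mask and child_sigint_state are made up for this illustration.

```shell
#!/bin/bash
# Decode a SigIgn hex mask (as printed in /proc/<pid>/status) and report
# whether SIGINT (signal 2) is ignored. Signal N is bit N-1 of the mask,
# so SIGINT is the 0x2 bit of the lowest hex digit.
sigint_from_mask() {
    local last
    if [ -z "$1" ]; then
        echo unknown
        return 1
    fi
    last=${1#"${1%?}"}            # keep only the last hex digit
    if [ $(( 0x$last & 2 )) -ne 0 ]; then
        echo ignored
    else
        echo default
    fi
}

# Exec a short-lived child and report the SIGINT disposition it inherited.
# With "ignore" as $1, SIGINT is set to SIG_IGN before the exec; otherwise
# the child inherits whatever the calling shell already had.
child_sigint_state() {
    local mask
    mask=$(
        if [ "$1" = ignore ]; then trap '' INT; fi
        # sed replaces the subshell via exec, so it reads its own status.
        exec sed -n 's/^SigIgn:[[:space:]]*//p' /proc/self/status
    )
    sigint_from_mask "$mask"
}

# Example calls:
#   child_sigint_state ignore   # SIGINT set to SIG_IGN before the exec
#   child_sigint_state          # disposition inherited from the caller
```

Calling child_sigint_state with no argument from a script that was itself started once in the foreground and once with & is one way to compare the two cases from runtest.sh and runtest2.sh. This only shows what userspace passes across execve; it does not by itself explain why the problem was seen only on ppc64.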
Created attachment 498907
patch, which makes the problem go away

With this patch, the reproducer (runtest2.sh) no longer triggers the problem and the beaker job also gives PASS. But it doesn't explain why we see it only on ppc64.
Just ran into another issue on ppc64 (Bug 657566), and it seems the case is the same: el5 on ppc64 is using an old harness repo.
Keeping this open for tracking and reassigning to Jan. Thanks Jan!
Upon initial discussion with Jeff Layton, filing this bug for kernel: https://bugzilla.redhat.com/show_bug.cgi?id=705387
Can we close this now? The harness is now updated.
Updated harness no longer triggers the issue, so closing this one.
https://beaker.engineering.redhat.com/jobs/87345
https://beaker.engineering.redhat.com/jobs/87352