Bug 703686

Summary:

Restarting the NFS daemon fails on RHEL5.6/5.7 ppc64 when run as an automated job in Beaker

Product:

[Retired] Beaker

Reporter:

yanfu,wang <yanwang>

Component:

beah

Assignee:

Jan Stancek <jstancek>

Status:

CLOSED CURRENTRELEASE

Severity:

low

Priority:

unspecified

Version:

0.7

CC:

bpeck, dcallagh, eguan, jburke, jstancek, mcsontos, rmancy, stl

Hardware:

All

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2011-05-20 06:58:30 UTC

Bug Depends On:

705387

Attachments:

Description	Flags
patch, which makes the problem go away	none

Description yanfu,wang 2011-05-11 02:09:29 UTC

Copied from the e-mail thread:

On 05/09/2011 10:48 AM, wang yanfu wrote:
> hi,
>
> There's one nfs-utils regression test case imported into Beaker[1], and
> it always fails on RHEL5.6/5.7 ppc64, where it is expected to pass.
> Other arches are fine.
> Failed jobs on ppc64:
> https://beaker.engineering.redhat.com/jobs/82560
> https://beaker.engineering.redhat.com/jobs/82090
>
> Passed jobs on other arches:
> https://beaker.engineering.redhat.com/jobs/82053
> https://beaker.engineering.redhat.com/jobs/82059
> https://beaker.engineering.redhat.com/jobs/82064
> https://beaker.engineering.redhat.com/jobs/82058
>
> By checking TESTOUT.log, I found that the issue is related to
> restarting the nfs daemon.
>
> The test passes if I reserve a ppc64 machine with RHEL5.6 and run it
> manually.
>
> I also added a sleep statement between starting and restarting the nfs
> service, as shown below, but it still fails when submitted as a Beaker job.
> <snip>
>      rlPhaseStartSetup
>          rlAssertRpm $PACKAGE
>          rlServiceStart nfs
>      rlPhaseEnd
>
>      rlPhaseStartTest
>          sleep 1
>          rlRun "service nfs restart" 0 "Restarting nfs service (1)"
> </snip>
>
> Is there anything special about ppc64 + beakerlib, or something else
> that I'm missing?
>
> Thanks in advance.
>
>
> 1. https://beaker.engineering.redhat.com/tasks/6101
>
>

I had problems with the nfs daemon refusing to stop when run from a
script without a TTY. The workaround was to use screen.

That should not be a problem in this case, since scripts run on
/dev/console as in RHTS, but you could try it.

Cheers,

-- Marian
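
Whether a script is attached to a terminal can be checked from inside the script. A minimal sketch follows; the function name has_tty is made up for illustration, and looking only at stdin is an assumption about what matters for the nfs init script.

```shell
#!/bin/bash
# Hypothetical helper: succeeds when stdin (file descriptor 0) is a terminal.
has_tty() {
    [ -t 0 ]
}

if has_tty; then
    echo "stdin is a terminal: $(tty)"
else
    echo "stdin is not a terminal"
fi
```

Running the same test script once from a login shell and once under the harness, and comparing the two messages, would show whether the TTY situation really differs between the manual and automated runs.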

Actual results:
The test case fails on RHEL5.6/5.7 ppc64.

Expected results:
The test case passes on RHEL5.6/5.7 ppc64.

Additional info:
Below is the investigation by the customer (a Fujitsu engineer):
It looks like nfsd cannot be killed for some reason when the test is
run automatically in Beaker. I'm not sure what causes this or
how to work around the issue in the test script.

I tried "sleep 90" before restarting the nfs service, but it didn't help.
I also ran the test on a RHEL6.1 ppc64 system, and the issue didn't occur there.
This could be an OS issue, but I'm not sure.

Comment 1 Jan Stancek 2011-05-11 10:07:46 UTC

I ran the test together with /distribution/reserve. After logging in to the machine, even root still couldn't stop nfsd with the init.d script.
ps afxl
1     0  3384     1  18   0      0     0 svc_re S    ?          0:00 [nfsd]
1     0  3385     1  18   0      0     0 svc_re S    ?          0:00 [nfsd]
1     0  3386     1  18   0      0     0 svc_re S    ?          0:00 [nfsd]
1     0  3387     1  18   0      0     0 svc_re S    ?          0:00 [nfsd]
1     0  3388     1  18   0      0     0 svc_re S    ?          0:00 [nfsd]
1     0  3389     1  19   0      0     0 svc_re S    ?          0:00 [nfsd]
1     0  3390     1  19   0      0     0 svc_re S    ?          0:00 [nfsd]
1     0  3391     1  19   0      0     0 svc_re S    ?          0:00 [nfsd]

service nfs stop -> no effect
killall -2 nfsd -> no effect
killall nfsd -> no effect
killall -9 nfsd -> this will finally kill it
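
The manual sequence above (SIGINT, then SIGTERM, then SIGKILL) can be sketched as a helper for an ordinary PID. This is only an illustration: stop_with_escalation and is_alive are hypothetical names, the two-second wait per signal is an arbitrary choice, and the liveness check reads /proc/<pid>/status, so it is Linux-specific. The nfsd threads above were signalled by name with killall, not by PID.

```shell
#!/bin/bash
# Hypothetical helpers, not part of nfs-utils or beakerlib.

# Succeeds while the PID exists and is not a zombie (Linux /proc only).
is_alive() {
    [ -r "/proc/$1/status" ] || return 1
    state=
    while read -r key value rest; do
        if [ "$key" = "State:" ]; then state=$value; fi
    done < "/proc/$1/status"
    [ -n "$state" ] && [ "$state" != "Z" ]
}

# Send INT, TERM, KILL in turn; print the first signal after which the
# process was gone. Returns 1 if it survived all three.
stop_with_escalation() {
    pid=$1
    for sig in INT TERM KILL; do
        kill -s "$sig" "$pid" 2>/dev/null || true
        tries=0
        while [ "$tries" -lt 2 ]; do
            if ! is_alive "$pid"; then
                echo "$sig"
                return 0
            fi
            sleep 1
            tries=$((tries + 1))
        done
    done
    return 1
}

# Usage (placeholder PID): stop_with_escalation 12345
```

Printing the signal that finally worked makes it easy to record in a test log whether a service needed more than the signal its init script normally sends.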

Comment 2 Jan Stancek 2011-05-12 11:05:19 UTC

This seems like a kernel issue. nfsd sets SIGINT as a terminating signal, but the handler seems to be left in the state that was set in userspace before the execve of rpc.nfsd.

/etc/init.d/nfs uses SIGINT (-2) to stop nfsd.

--- cut runtest.sh ---
#!/bin/bash
rpc.nfsd 8
--- cut ---

--- cut runtest2.sh ---
#!/bin/bash
./runtest.sh &
--- cut ---

1. Running ./runtest.sh spawns nfsd, which reacts to SIGINT properly by terminating -> GOOD

[root@ ~]# ./runtest.sh 
[root@ ~]# ps afx | grep nfsd
 6284 ?        S<     0:00  \_ [nfsd4]
 6294 pts/0    S+     0:00  |       \_ grep nfsd
 6285 ?        S      0:00 [nfsd]
 6286 ?        S      0:00 [nfsd]
 6287 ?        S      0:00 [nfsd]
 6288 ?        S      0:00 [nfsd]
 6289 ?        S      0:00 [nfsd]
 6290 ?        S      0:00 [nfsd]
 6291 ?        S      0:00 [nfsd]
 6292 ?        S      0:00 [nfsd]

[root@ ~]# killall -2 nfsd
[root@ ~]# ps afx | grep nfsd
 6301 pts/0    S+     0:00  |       \_ grep nfsd

2. Running ./runtest2.sh also spawns nfsd, but SIGINT no longer works -> BAD
[root@ ~]# ./runtest2.sh 
[root@ ~]# ps afx | grep nfsd
 6311 ?        S<     0:00  \_ [nfsd4]
 6321 pts/0    S+     0:00  |       \_ grep nfsd
 6312 ?        S      0:00 [nfsd]
 6313 ?        S      0:00 [nfsd]
 6314 ?        S      0:00 [nfsd]
 6315 ?        S      0:00 [nfsd]
 6316 ?        S      0:00 [nfsd]
 6317 ?        S      0:00 [nfsd]
 6318 ?        S      0:00 [nfsd]
 6319 ?        S      0:00 [nfsd]

[root@ ~]# killall -2 nfsd
[root@ ~]# ps afx | grep nfsd
 6311 ?        S<     0:00  \_ [nfsd4]
 6324 pts/0    S+     0:00  |       \_ grep nfsd
 6312 ?        S      0:00 [nfsd]
 6313 ?        S      0:00 [nfsd]
 6314 ?        S      0:00 [nfsd]
 6315 ?        S      0:00 [nfsd]
 6316 ?        S      0:00 [nfsd]
 6317 ?        S      0:00 [nfsd]
 6318 ?        S      0:00 [nfsd]
 6319 ?        S      0:00 [nfsd]

Looking at the trace, this is one of the differences before the execve:
from 1.) rt_sigaction(SIGINT, {0x1, [], 0},  <unfinished ...>
from 2.) rt_sigaction(SIGINT, {SIG_DFL, [], 0}, {SIG_DFL, [], 0}, 8) = 0

I ran both runtest.sh and runtest2.sh while the system was reserved.
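
The difference between the two scripts is consistent with how shells start background jobs: when job control is off (as in a script), POSIX shells start asynchronous commands with SIGINT ignored, and an ignored disposition is kept across execve. A minimal, Linux-specific sketch for checking this on a given PID follows. The name sigint_ignored is hypothetical, and whether this fully explains the behaviour of the nfsd kernel threads is not established here.

```shell
#!/bin/bash
# Hypothetical helper: report whether a process ignores SIGINT, based on
# the SigIgn bitmask in /proc/<pid>/status (Linux only).
# Returns 0 if ignored, 1 if not ignored, 2 if the mask cannot be read.
sigint_ignored() {
    [ -r "/proc/$1/status" ] || return 2
    mask=
    while read -r key value rest; do
        if [ "$key" = "SigIgn:" ]; then mask=$value; fi
    done < "/proc/$1/status"
    # Bit (n-1) of the mask is set when signal n is ignored. SIGINT is
    # signal 2, so the lowest hex digit has its 0x2 bit set.
    case $mask in
        "") return 2 ;;
        *[2367abefABEF]) return 0 ;;
        *) return 1 ;;
    esac
}

# Start a background job from this (non-interactive) script and inspect it.
sleep 5 &
bg_pid=$!
sleep 1   # give the child time to set up its signals and exec
if sigint_ignored "$bg_pid"; then
    echo "background job $bg_pid ignores SIGINT"
else
    echo "background job $bg_pid does not ignore SIGINT"
fi
kill "$bg_pid" 2>/dev/null || true
```

Calling the same helper on a rpc.nfsd or harness process PID on the reserved system would show whether it was started with SIGINT already ignored.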

Comment 3 Jan Stancek 2011-05-14 08:31:39 UTC

Created attachment 498907 [details]
patch, which makes the problem go away

With this patch, the reproducer (runtest2.sh) no longer triggers the problem, and the Beaker job also gives PASS. However, it doesn't explain why we see it only on ppc64.

Comment 4 Marian Csontos 2011-05-17 14:10:53 UTC

Just ran into another issue on ppc64 (Bug 657566), and the cause seems to be the same: el5 on ppc64 is using an old harness repo.

Comment 5 Marian Csontos 2011-05-17 14:22:31 UTC

Keeping this open for tracking and reassigning to Jan.

Thanks Jan!

Comment 6 Jan Stancek 2011-05-17 14:57:55 UTC

After an initial discussion with Jeff Layton, I filed this bug against the kernel: https://bugzilla.redhat.com/show_bug.cgi?id=705387

Comment 7 Bill Peck 2011-05-19 16:07:58 UTC

Can we close this now? The harness is now updated.

Comment 8 Jan Stancek 2011-05-20 06:58:30 UTC

The updated harness no longer triggers the issue, so closing this one.
https://beaker.engineering.redhat.com/jobs/87345
https://beaker.engineering.redhat.com/jobs/87352