Bug 589904 - tests which crash the system will time out the watchdog
Status: CLOSED CURRENTRELEASE
Product: Beaker
Classification: Community
Component: lab controller
Version: 0.5
Hardware: All Linux
Priority: high Severity: high
Assigned To: Bill Peck
Keywords: Reopened
Depends On:
Blocks: 593663 593365
Reported: 2010-05-07 05:12 EDT by Han Pingtian
Modified: 2010-06-16 15:01 EDT

Doc Type: Bug Fix
Last Closed: 2010-06-16 15:01:36 EDT


Description Han Pingtian 2010-05-07 05:12:27 EDT
Description of problem:
We have some kdump test cases which trigger a system crash and then collect the vmcore. But when the system comes back, the testing does not continue. It hangs on the task that crashed the system. This is an example:

https://beaker.engineering.redhat.com/jobs/88

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:
Comment 1 Bill Peck 2010-05-07 08:52:24 EDT
I didn't look at every console, but the first one has this in its console log:


Waiting for network...
Starting beah-beaker-backend: -bash: line 0: echo: write error: No space left on device
[FAILED]
beah-beaker-backend running as process 

Waiting for network...
Starting beah-fwd-backend: -bash: line 0: echo: write error: No space left on device
[FAILED]
beah-fwd-backend running as process 

Starting beah-srv: -bash: line 0: echo: write error: No space left on device
[FAILED]
beah-srv running as process 

Starting RHTS testing: 
Red Hat Enterprise Linux Lean Client release 6.0 Beta (Santiago)
Kernel 2.6.32-24.el6.i686 on an i686


It seems the harness can't run if there is no disk space left on the system.

Can you verify this on a system with enough disk space?
Comment 2 Marian Csontos 2010-05-07 11:21:39 EDT
Looks like machines do not get rebooted on kernel panics.

Adding the following to /etc/sysctl.conf should help:

    kernel.panic = 20

Should it be the test's responsibility?
IMO it's the test that knows the panic is expected and should handle this...

What was RHTS's behavior?
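
A minimal sketch of how a test could take on that responsibility itself, assuming it runs as root and is allowed to edit /etc/sysctl.conf. The helper name ensure_sysctl_setting is hypothetical and not part of Beaker or beah. It only rewrites the file's text; the caller would still have to write the result back and apply it (for example with `sysctl -p`).

```python
def ensure_sysctl_setting(conf_text, key="kernel.panic", value="20"):
    """Return conf_text with `key = value` set exactly once.

    The first existing assignment of the key is replaced in place and
    any further duplicates are dropped. If the key is not assigned
    anywhere, the setting is appended at the end. Commented-out lines
    are left untouched.
    """
    wanted = f"{key} = {value}"
    out = []
    found = False
    for line in conf_text.splitlines():
        stripped = line.strip()
        is_comment = stripped.startswith(("#", ";"))
        name = stripped.split("=", 1)[0].strip()
        if not is_comment and "=" in stripped and name == key:
            if not found:
                out.append(wanted)
                found = True
            continue  # skip duplicate assignments of the same key
        out.append(line)
    if not found:
        out.append(wanted)
    return "\n".join(out) + "\n"
```

With kernel.panic set to a positive number N, the kernel reboots N seconds after a panic instead of staying halted.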
Comment 3 Han Pingtian 2010-05-09 23:20:23 EDT
(In reply to comment #2)
> Looks like machines do not get rebooted on kernel panics.
> 
> Adding following to /etc/sysctl.conf should help:
> 
>     kernel.panic = 20
> 
> Should it be test's responsibility?
> IMO it's the test who knows the panic is expected and should handle this...
> 
> What was the RHTS' behavior?    

The test works just fine in RHTS, and we didn't add 'kernel.panic = 20' to /etc/sysctl.conf there.
Comment 4 Han Pingtian 2010-05-10 03:31:28 EDT
(In reply to comment #1)
> I didn't look at every console but the first one has this in the console log.
> 
> 
> Waiting for network...
> Starting beah-beaker-backend: -bash: line 0: echo: write error: No space left
> on device
> [FAILED]
> beah-beaker-backend running as process 
> 
> Waiting for network...
> Starting beah-fwd-backend: -bash: line 0: echo: write error: No space left on
> device
> [FAILED]
> beah-fwd-backend running as process 
> 
> Starting beah-srv: -bash: line 0: echo: write error: No space left on device
> [FAILED]
> beah-srv running as process 
> 
> Starting RHTS testing: 
> Red Hat Enterprise Linux Lean Client release 6.0 Beta (Santiago)
> Kernel 2.6.32-24.el6.i686 on an i686
> 
> 
> Seems the harness can't run if there is no disk space left on the system.
> 
> Can you verify this on a system with enough disk space.    

I have run this several more times, and I think this is a bug in Beaker:

https://beaker.engineering.redhat.com/recipes/139
https://beaker.engineering.redhat.com/recipes/260
Comment 5 Marian Csontos 2010-05-10 03:50:28 EDT
RHTS was monitoring console output for kernel panics, reported these and rebooted the machine.

Bill, shall we "fix" it in the LC the same way RHTS did, or try to catch it in the harness?
Comment 6 Han Pingtian 2010-05-14 02:26:44 EDT
The test case crashes the system with sysrq-c. On RHTS, when the system comes back, the test case is run again, detects that the crash has already happened, and finishes the test.
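
The crash-and-resume pattern described here can be sketched roughly as below. This is only an illustration of the idea, not the actual kdump test: the marker file and the function names are hypothetical, and a real test would also check the collected vmcore in the second pass.

```python
import os


def trigger_crash():
    # Writing "c" to /proc/sysrq-trigger crashes the kernel.
    # This needs root and sysrq support enabled on the machine.
    with open("/proc/sysrq-trigger", "w") as f:
        f.write("c")


def run_crash_test(marker_path, crash=trigger_crash):
    """First pass: leave a marker and crash. Rerun after reboot: finish."""
    if os.path.exists(marker_path):
        # Second pass: the marker shows the crash already happened.
        os.remove(marker_path)
        return "finished"
    # First pass: make sure the marker is on disk before crashing,
    # otherwise it may not survive the reboot.
    with open(marker_path, "w") as f:
        f.write("crash requested\n")
        f.flush()
        os.fsync(f.fileno())
    crash()
    return "crash-requested"  # only reached if crash() did not crash
```

The `crash` parameter exists so the logic can be exercised with a harmless stand-in instead of really crashing the machine.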
Comment 7 David Kovalsky 2010-05-18 11:22:20 EDT
This is part of the Kernel QE SignOff Criteria.

Suggesting this as a blocker.
Comment 8 Bill Peck 2010-05-20 11:01:49 EDT
Marian,

In regard to comment 5, legacy RHTS only looked at the console when the watchdog expired, so it couldn't possibly reboot the machine.

Looking at the console logs, it starts the beah process but no testing is done.  I think further investigation of the harness is needed.
Comment 11 Marian Csontos 2010-05-25 18:22:02 EDT
The problem is in the stress test, which causes high load on the system (some 19+), so the harness and yum do not get many cycles for themselves. After I killed all the stress processes, it recovered.

Now I would be happy to hear advice on how to handle that... Would running the test process with lower priority be an acceptable solution?
Comment 12 Marian Csontos 2010-05-25 19:08:20 EDT
Update: the recovery was not complete: the socket's entry in the filesystem is not cleaned up on kernel panic.
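
One common way to cope with a leftover socket entry is to probe it before binding and remove it if nobody is listening. The sketch below assumes the harness listens on a Unix domain socket; bind_unix_socket is a hypothetical helper and is not taken from the beah sources.

```python
import errno
import os
import socket
import stat


def bind_unix_socket(path):
    """Bind a listening Unix socket at path, removing a stale entry first."""
    if os.path.exists(path):
        if not stat.S_ISSOCK(os.stat(path).st_mode):
            raise RuntimeError(f"{path} exists and is not a socket")
        probe = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            probe.connect(path)
        except OSError as exc:
            if exc.errno not in (errno.ECONNREFUSED, errno.ENOENT):
                raise
            # Nobody is listening: the entry was left behind, e.g. by a crash.
            try:
                os.unlink(path)
            except FileNotFoundError:
                pass
        else:
            # Something accepted the connection, so the socket is live.
            raise RuntimeError(f"{path} is in use by a live process")
        finally:
            probe.close()
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(path)
    server.listen(1)
    return server
```

Probing first avoids deleting the socket of a server that is still running, which a blind unlink before bind would do.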
Comment 13 Han Pingtian 2010-05-25 22:49:25 EDT
(In reply to comment #11)
> The problem is in stress test, which causes high load on the system (some 19+)
> and harness and yum do not get much cycles for themselves. After I killed all
> the stress processes it recovered.
> 
> Now I would be happy to hear an advice how to handle that... Would running the
> test process with lower priority be an acceptable solution?    

Hi Marian, what do you mean? I wasn't running a stress test. I was just running a test which panicked the system with "sysrq-c".
Comment 14 Marian Csontos 2010-05-26 03:12:37 EDT
What about these?
[1] is from reported J:88, [2] is my clone of the job.

[1] https://beaker.engineering.redhat.com/recipes/139#task528
[2] https://beaker.engineering.redhat.com/recipes/1819#task13518

As I said, the latter was stuck in yum trying to download the next task, which got next to no CPU cycles because everything was squeezed out by the running stress tasks... (Maybe these were not supposed to remain running.) After killing these procs it ran for a while, until it stopped with an error...

Anyway, I have a bug. I will update you once that's fixed...
Comment 15 Han Pingtian 2010-05-26 05:15:45 EDT
Sorry, I forgot that I was running stress along with the test. But it seems this bug isn't related to stress. Please have a look at these:
https://beaker.engineering.redhat.com/jobs/569 
https://beaker.engineering.redhat.com/jobs/565

They all crashed the system by various methods, without stress running, and they all timed out in the end.
Comment 16 Marian Csontos 2010-06-07 07:31:24 EDT
Partially fixed by: http://git.fedorahosted.org/git/?p=beah.git;a=commit;h=bcd69988a202a3fc68e3e2f31d104aa57012b80f
Release: beah-0.6.3-1
Verified: J:2272

Follow up: kdump works now, but two more things are required:

1. need to run tests with slightly lower priority - the stress test does not leave many CPU cycles for the harness.

2. need to auto-reboot the system after a kernel panic.
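
For item 1, a rough sketch of starting a task with lowered priority, assuming a POSIX system. run_with_lower_priority is a hypothetical name, and this is not how beah launches tasks; it only shows the mechanism (raising the child's niceness before it execs).

```python
import os
import subprocess


def run_with_lower_priority(argv, increment=10):
    """Run argv with its niceness raised by `increment` (lower priority).

    os.nice() is called in the child between fork and exec, so the
    calling process keeps its own priority.
    """
    return subprocess.run(
        argv,
        preexec_fn=lambda: os.nice(increment),
        capture_output=True,
        text=True,
    )
```

The niceness is inherited by everything the task spawns, so stress processes started by the test would also run at the lowered priority.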
Comment 17 Han Pingtian 2010-06-10 07:19:23 EDT
Looks like this bug isn't fixed yet:

https://beaker.engineering.redhat.com/jobs/2724
Comment 18 Marian Csontos 2010-06-10 10:28:39 EDT
Is it the same bug?

Have a look at this:

https://beaker.engineering.redhat.com/logs/2010/24/2724/5295/50186/176418///test_log--kernel-kdump-setup-bare-metal-client-reboot.log

And there is also a kernel panic reported during installation...
Comment 19 Bill Peck 2010-06-10 10:44:12 EDT
I fixed bz598892. We weren't catching kernel panics before this.

But new in Beaker is active monitoring for panics, not just when the watchdog expires.  I know we need to be able to turn this off, so I added a new node to recipes.

<recipe>
 <watchdog panic="ignore"/>
</recipe>

But the above is not working due to an implementation bug.  It's fixed in git and will go out in the next release.
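
To illustrate how a console monitor might honor that node, here is a rough sketch. It is not the lab controller's code: the function names and the panic pattern are assumptions, and only the `<watchdog panic="ignore"/>` element comes from the recipe snippet above.

```python
import re
import xml.etree.ElementTree as ET

# Assumed pattern; the real monitor may match other console messages.
PANIC_RE = re.compile(r"Kernel panic")


def panic_detection_enabled(recipe_xml):
    """Return False only if the recipe has <watchdog panic="ignore"/>."""
    recipe = ET.fromstring(recipe_xml)
    watchdog = recipe.find("watchdog")
    return watchdog is None or watchdog.get("panic") != "ignore"


def should_abort(recipe_xml, console_lines):
    """Abort on a panic in the console log unless the recipe opts out."""
    if not panic_detection_enabled(recipe_xml):
        return False
    return any(PANIC_RE.search(line) for line in console_lines)
```

A kdump recipe would carry the watchdog node so that the expected panic does not abort the run.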
Comment 20 Han Pingtian 2010-06-10 22:32:06 EDT
(In reply to comment #18)
> Is it the same bug?
> 
> Have a look at this:
> 
> https://beaker.engineering.redhat.com/logs/2010/24/2724/5295/50186/176418///test_log--kernel-kdump-setup-bare-metal-client-reboot.log
> 
> And there is also kernel panic reported during installation...    

I think this is the same bug. The test case has no problem when running on rhts ...
Comment 21 Bill Peck 2010-06-11 09:24:51 EDT
Han,

Did you read my comment?  It's aborting now because the lab controller detects a panic, which makes sense because you are testing kdump.  This problem will go away in the next update, when you can turn off the panic detection.
Comment 22 Han Pingtian 2010-06-12 02:40:20 EDT
Got it. Thanks. When will the next update take place, please?
Comment 23 Bill Peck 2010-06-16 15:01:36 EDT
Fixed in Current Release.
