Bug 692139 - Bad Data in Beaker test results
Summary: Bad Data in Beaker test results
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Beaker
Classification: Retired
Component: lab controller
Version: 0.7
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Bill Peck
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-03-30 14:16 UTC by PaulB
Modified: 2012-07-13 12:51 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-07-13 12:51:09 UTC



Description PaulB 2011-03-30 14:16:27 UTC
Description of problem:
Bad Data in Beaker test results.
Not sure where the responsibility is here:
It's my understanding that Conserver provides the data and the Lab Controller copies the data.

Seems the console logs were not cleared prior to my test run. 
The console logs in my test run contained data from a previous test run.
A PANIC was reported due to stale data in console logs.

This issue was seen in two different jobs:
https://beaker.engineering.redhat.com/jobs/67673
   intel-urbanna-01.lab.bos.redhat.com
and
https://beaker.engineering.redhat.com/jobs/67675
   dell-pet410-01.lab.bos.redhat.com


Looking at https://beaker.engineering.redhat.com/jobs/67673:
The test results report a PANIC on an "oops".
Digging into the console logs for the test, one sees the following:
 
logger: /usr/bin/rhts-test-runner.sh rhts-test-checkin 127.0.0.1:7093   intel-urbanna-01.lab.eng.bos.redhat.com 67367 /kernel/kdump/crash-sysrq-c 5400 1496111  
03/28/11 19:53:50  JobID:67367 Test:/kernel/kdump/crash-sysrq-c Response:1  
logger: /usr/bin/rhts-test-runner.sh rhts-test-update 127.0.0.1:7093 1496111 start  rh-tests-kernel-kdump-crash-sysrq-c  
03/28/11 19:53:50  testID:1496111 start:  
/mnt/tests/kernel/kdump/crash-sysrq-c /mnt/tests/kernel/kdump/crash-sysrq-c  
SysRq : Trigger a crash 
BUG: unable to handle kernel NULL pointer dereference at (null) 
IP: [<ffffffff81319816>] sysrq_handle_crash+0x16/0x20 
PGD 3548ab067 PUD 35df66067 PMD 0  
Oops: 0002 [#1] SMP 

*This test does not issue a crash-sysrq-c. 
*Seems the logs were not cleared prior to the testing
=========================================================

Investigating further...

These systems are on different serial consoles and use a standard serial port connection, not a management interface (e.g. iDRAC):
[pbunyan]$ console -i dell-pet410-01.lab.bos.redhat.com
dell-pet410-01.lab.bos.redhat.com:conserver-01.eng.bos.redhat.com,4391,40826:!:serial-l2e13.mgmt.lab.eng.bos.redhat.com,7030,telnet,27::up:rw:/var/consoles/pub/dell-pet410-01.lab.bos.redhat.com,log,act,brk,300,26:1:noautoup::ixon,ixoff,autoreinit,login::0:\n
[pbunyan]$ console -i intel-urbanna-01.lab.bos.redhat.com
intel-urbanna-01.lab.bos.redhat.com:conserver-01.eng.bos.redhat.com,4281,40817:!:serial-l1a.mgmt.lab.eng.bos.redhat.com,7025,telnet,15::up:rw:/var/consoles/pub/intel-urbanna-01.lab.bos.redhat.com,log,act,brk,300,14:1:noautoup::ixon,ixoff,autoreinit,login::0:\n
[pbunyan]$ 

Connecting to netdump-01.eng.bos.redhat.com later in the day and viewing the current console data for these systems, I see that the timestamps in the data are current.

I spoke with BillP and he investigated the issue. He saw nothing definitive. BillP confirmed there was an issue but things seemed to be working now.

Best,
-pbunyan

How reproducible:
 I cannot reproduce, though I have two instances of the issue.

  
Actual results:
 Bad Data in Beaker test results

Expected results:
 Data from current Job only in the logs.


Additional info:
 I have opened the following RT ticket:
https://engineering.redhat.com/rt3/Ticket/Display.html?id=106184

Comment 1 Marian Csontos 2011-03-31 06:37:52 UTC
Looks like the console was "suspended" (or buffered) for a while and you got an error from the previous incarnation:

  https://beaker.engineering.redhat.com/recipes/138193#task1496111

I have no idea how this happened...

Bill, does LC do any buffering, or is this more likely a conserver issue?
Should the LC try to connect and flush any captured console output before starting a new job?

Comment 2 Bill Peck 2011-03-31 13:58:47 UTC
When the scheduler schedules a machine to run, it makes a call to the lab controller's cobbler instance to clear the console logs *if* they exist. It looks like the lab controller didn't see any logs at the moment we called the command to clear them.

The lab controller watchdog process could clear the log when it starts monitoring it, but that would cause problems if the watchdog process were restarted while a recipe was running: you would lose all existing data from before the restart.

But maybe that is better?
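
For illustration only, a minimal Python sketch of the two clearing strategies discussed in this comment. It is not Beaker's actual code: the names clear_if_exists, Watchdog, stale_data_scenario and restart_scenario are hypothetical, and the log is modelled as a plain local file. The first scenario shows how "clear if it exists" misses output that lands after the clear call; the second shows how clearing on watchdog start drops earlier output when the watchdog is restarted mid-recipe.

```python
import os
import tempfile


def clear_if_exists(path):
    """Truncate the console log, but only if the file is already there."""
    if not os.path.exists(path):
        return False  # nothing to clear; output that arrives later is untouched
    open(path, "w").close()
    return True


class Watchdog:
    """Hypothetical monitor that reads console output appended to a log file."""

    def __init__(self, path, clear_on_start=False):
        self.path = path
        self.offset = 0
        if clear_on_start:
            clear_if_exists(path)

    def read_new(self):
        """Return whatever has been appended since the last call."""
        if not os.path.exists(self.path):
            return ""
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            data = f.read()
            self.offset = f.tell()
        return data.decode("utf-8", "replace")


def append(path, text):
    """Append text to the log, creating the file if needed."""
    with open(path, "a") as f:
        f.write(text)


def stale_data_scenario(workdir):
    """The clear call runs before the stale output lands in the log file."""
    log = os.path.join(workdir, "console-a.log")
    cleared = clear_if_exists(log)           # file absent, so nothing is cleared
    append(log, "Oops from previous job\n")  # buffered output arrives afterwards
    append(log, "current job output\n")
    return cleared, Watchdog(log).read_new()


def restart_scenario(workdir):
    """A watchdog that clears on start is restarted in the middle of a recipe."""
    log = os.path.join(workdir, "console-b.log")
    append(log, "early output of running recipe\n")
    restarted = Watchdog(log, clear_on_start=True)  # restart truncates the log
    append(log, "later output\n")
    return restarted.read_new()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(stale_data_scenario(d))
        print(restart_scenario(d))
```

In the first scenario the stale line is still reported to the new job; in the second, the early output of the running recipe is gone after the restart, which is the trade-off raised above.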

Comment 3 Marian Csontos 2011-03-31 14:19:56 UTC
> How reproducible:
>  I cannot reproduce, though I have two instances of the issue.

I am afraid this may be more common but would go unnoticed unless there was a panic stuck in the queue, as there was in your case.

I would look for the issue in the console.log of a few jobs that started at about the same time.
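
One way to do that check, sketched in Python under assumptions: the console.log text is already in hand, and the "JobID:<n>" field seen in the rhts-test-checkin lines of the description identifies the job that produced the output. The function name foreign_job_ids is hypothetical and not part of Beaker.

```python
import re

# Matches the "JobID:67367" style field seen in the console excerpt.
JOB_ID_RE = re.compile(r"\bJobID:(\d+)")


def foreign_job_ids(console_text, expected_job_id):
    """Return the job IDs found in the log that differ from the expected one."""
    found = {int(job_id) for job_id in JOB_ID_RE.findall(console_text)}
    return sorted(found - {int(expected_job_id)})


if __name__ == "__main__":
    sample = (
        "03/28/11 19:53:50  JobID:67367 "
        "Test:/kernel/kdump/crash-sysrq-c Response:1\n"
    )
    # The log was attached to job 67673, so 67367 is flagged as foreign.
    print(foreign_job_ids(sample, 67673))
```

A non-empty result for a job's console.log would suggest that output from another run leaked into it.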

Comment 4 Jeff Burke 2012-07-13 12:51:09 UTC
This issue is resolved. The existing issue is being tracked in a new BZ: Bug 837300 - [Beaker] Console Logs are not being cleared properly between automated jobs
https://bugzilla.redhat.com/show_bug.cgi?id=837300

