Bug 991236 - more aggressive hardware testing in bkr machine-test
Status: NEW
Product: Beaker
Classification: Community
Component: general
Version: 0.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Assigned To: beaker-dev-list
QA Contact: tools-bugs
Keywords: FutureFeature
Depends On:
Blocks: 994970
 
Reported: 2013-08-01 19:17 EDT by Dan Callaghan
Modified: 2016-05-26 09:12 EDT

See Also:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug


Description Dan Callaghan 2013-08-01 19:17:06 EDT
At present, bkr machine-test schedules only /distribution/install in its jobs, which is a good way of making sure the system can boot and install a distro. But there are many hardware problems that this will never find, so it's not a good way to test a flaky system.

We could write a new task which performs actual hardware tests, such as:
* check SMART data on all disks
* perform SMART self-tests on all disks
* run bad block checking on all disks
* run a memory tester?
* run some kinds of CPU self-tests?
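As a rough illustration of the first two items, the sketch below only builds the smartctl command lines such a task might run for one disk; it does not execute them. The function name `smart_commands` and the device path are assumptions made for this example, not part of Beaker.

```python
def smart_commands(device):
    """Return smartctl invocations for one disk (built here, not executed)."""
    return [
        ["smartctl", "-H", device],              # overall health assessment
        ["smartctl", "-A", device],              # vendor-specific attribute table
        ["smartctl", "-t", "short", device],     # start a short self-test
        ["smartctl", "-l", "selftest", device],  # read the self-test log afterwards
    ]


# Example device path; a real task would enumerate the disks itself.
for cmd in smart_commands("/dev/sda"):
    print(" ".join(cmd))
```

A real task would also need to wait for the self-test to finish before reading the log, and to decide which attribute values count as a failure.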

The bkr machine-test command could have an option --aggressive which adds this task when it schedules a job.
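A minimal sketch of how such an option could select the tasks for a job. This is not the actual bkr client code: the parser, the function names, and the task name `/distribution/hwtest` are all hypothetical; only `/distribution/install` comes from the description above.

```python
import argparse

INSTALL_TASK = "/distribution/install"   # task named in the description
HARDWARE_TASK = "/distribution/hwtest"   # hypothetical name, no such task exists yet


def build_parser():
    """Stand-in parser; the real bkr client has its own option handling."""
    parser = argparse.ArgumentParser(prog="machine-test-sketch")
    parser.add_argument("--aggressive", action="store_true",
                        help="also schedule the hardware test task")
    return parser


def tasks_for_job(aggressive):
    """Return the list of task names to schedule in one job."""
    tasks = [INSTALL_TASK]
    if aggressive:
        tasks.append(HARDWARE_TASK)
    return tasks


args = build_parser().parse_args(["--aggressive"])
print(tasks_for_job(args.aggressive))
```

Keeping the default task list unchanged means existing users of the command would see no difference unless they pass the new option.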
Comment 1 Jiri Jaburek 2013-08-02 05:20:02 EDT
Some points that might help:

* SMART is available on only a fairly small number of machines,
  because the machines use either
  - SCSI/SAS drives (no SMART at all)
  - an additional layer (HW RAID) between the drives and the OS

* bad block checking can be done using the "badblocks" utility
  - make sure to do write testing
  - make sure to specify a larger "N blocks at a time" value, for speed
  - make sure to use `-t' to specify at least one pseudorandom pass;
    a normal check DOES NOT detect silent offset pointer corruption (!!)

* memory testing via memtest86+ could be hard to do automatically, but
  a tool called "memtester" [1] can do it while the system is running
  - it doesn't test all memory, just what it can lock, but it's still useful

* CPU stress testing can be done
  - using "cpuburn" (burnMMX, ...) running over some period of time
    (at least 30min)
  - using a Prime95 equivalent for Linux, the "mprime" CLI tool, which can
    also perform stress tests with verification of result correctness;
    however, it uses rather arch-specific instructions (AVX on Intel),
    so it might not be a relevant test method

[1] http://pyropus.ca/software/memtester/
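To make the badblocks and memtester points concrete, here is a small sketch that builds the command lines without running them. The function names and the default block counts and sizes are illustrative assumptions, not tuned recommendations. Note that `badblocks -w` is destructive and overwrites the whole device.

```python
def badblocks_command(device, blocks_at_once=16384, block_size=4096):
    """Destructive write test with one pseudorandom pattern pass."""
    return [
        "badblocks",
        "-w",                       # write-mode test, erases the device
        "-s", "-v",                 # show progress, be verbose
        "-b", str(block_size),      # block size in bytes
        "-c", str(blocks_at_once),  # blocks tested at a time (default is 64)
        "-t", "random",             # pseudorandom test pattern
        device,
    ]


def memtester_command(megabytes, loops=1):
    """Test the given amount of lockable memory for a number of loops."""
    return ["memtester", f"{megabytes}M", str(loops)]


print(" ".join(badblocks_command("/dev/sdX")))
print(" ".join(memtester_command(1024)))
```

How much memory memtester can actually lock depends on the running system, so the size passed in would have to be chosen at run time.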


All of this would need to be done from an initramfs, as HDD testing would effectively overwrite/erase everything. A few approaches come to mind, but all of them would need all the tools, along with a Beaker-related result uploader, in the initramfs anyway:

* using anaconda and %pre section
* using RHEL-based (dracut) initramfs
* using a completely custom (glibc-based) kernel + initramfs pair
  - might be a bit more complex to be architecture-independent
  - not as complex as it seems, I've built several in the past


.. just my $0.02 ..
