Bug 991236

Summary: [RFE] more aggressive hardware testing in bkr machine-test
Product: [Retired] Beaker
Reporter: Dan Callaghan <dcallagh>
Component: general
Assignee: beaker-dev-list
Status: CLOSED WONTFIX
QA Contact: tools-bugs <tools-bugs>
Severity: unspecified
Priority: unspecified
Version: 0.13
CC: azelinka, cbouchar, fedora, jjaburek, qwan, tools-bugs
Target Milestone: ---
Keywords: FutureFeature
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Enhancement
Last Closed: 2020-11-19 21:56:36 UTC
Type: Bug
Bug Blocks: 994970    

Description Dan Callaghan 2013-08-01 23:17:06 UTC
At present, bkr machine-test only schedules /distribution/install in its jobs, which is a good way of making sure the system can boot and install a distro. But there are many hardware problems which this will never find, so it is not a good way to test a flaky system.

We could write a new task which performs actual hardware tests, such as:
* check SMART data on all disks
* perform SMART self-tests on all disks
* run bad block checking on all disks
* run a memory tester?
* run some kinds of CPU self-tests?

The bkr machine-test command could have an option --aggressive which adds this task when it schedules a job.
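
As a rough illustration of the idea above, here is a minimal sketch of how job generation might append an extra task when an aggressive mode is requested. It is not the actual bkr machine-test implementation. The task name /distribution/hardware-test and the function build_machine_test_job are hypothetical, and the XML layout is a simplified approximation of Beaker job XML that has not been checked against the schema.

```python
import xml.etree.ElementTree as ET

INSTALL_TASK = "/distribution/install"
# Hypothetical task name: it stands in for the proposed hardware-testing
# task, which does not exist in the bug report.
HARDWARE_TASK = "/distribution/hardware-test"


def build_machine_test_job(fqdn, distro_name, aggressive=False):
    """Return simplified job XML (as a string) for testing one system."""
    job = ET.Element("job")
    ET.SubElement(job, "whiteboard").text = "machine-test for %s" % fqdn
    recipe_set = ET.SubElement(job, "recipeSet")
    recipe = ET.SubElement(recipe_set, "recipe")

    # Pin the recipe to the requested distro and to the system under test.
    distro = ET.SubElement(recipe, "distroRequires")
    ET.SubElement(distro, "distro_name", op="=", value=distro_name)
    host = ET.SubElement(recipe, "hostRequires")
    ET.SubElement(host, "hostname", op="=", value=fqdn)

    # The install task always runs; the hardware task is opt-in.
    task_names = [INSTALL_TASK]
    if aggressive:
        task_names.append(HARDWARE_TASK)
    for name in task_names:
        ET.SubElement(recipe, "task", name=name, role="STANDALONE")
    return ET.tostring(job, encoding="unicode")
```

Calling build_machine_test_job("host.example.com", "RHEL-6.4", aggressive=True) would yield a recipe with both tasks, while the default keeps today's behaviour of scheduling only /distribution/install.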

Comment 1 Jiri Jaburek 2013-08-02 09:20:02 UTC
Some points that might help:

* SMART is available on only a fairly small number of machines,
  because the machines use either
  - SCSI/SAS drives (no SMART at all)
  - an additional layer (HW RAID) between the drives and the OS

* bad block checking can be done using the "badblocks" utility
  - make sure to do write testing
  - make sure to specify a larger "blocks at a time" value (`-c'),
    for speed reasons
  - make sure to use `-t' to specify at least one pseudorandom pass,
    a normal check DOES NOT detect silent offset pointer corruption (!!)

* memory testing via memtest86+ could be hard to do automatically,
  a tool called "memtester" [1] can do it while the system is running
  - it doesn't test all memory, only what it can lock, but it is still useful

* CPU stress testing can be done
  - using "cpuburn" (burnMMX, ...) running over some period of time
    (at least 30min)
  - using a Prime95 equivalent for Linux, "mprime" CLI tool, which can
    also perform stress tests with verification of result correctness,
    however it uses rather arch-specific instructions (AVX on Intel),
    which might not be a relevant test method
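
To make the badblocks advice above concrete, the following is a small sketch that only assembles a command line. It does not run anything. The helper name badblocks_command and its default values are assumptions chosen for illustration. The flags themselves (-w for the destructive write test, -c for blocks tested at a time, -b for block size, -t for a test pattern, -s and -v for progress and verbosity) are standard badblocks options.

```python
def badblocks_command(device, blocks_at_once=4096, block_size=4096,
                      patterns=("random",)):
    """Build a destructive badblocks invocation as an argv list.

    WARNING: -w overwrites every block on the device. The caller must make
    sure the device holds nothing that needs to survive.
    """
    if "random" not in patterns:
        # Per the advice above, insist on at least one pseudorandom pass.
        raise ValueError("at least one 'random' pattern is required")
    cmd = [
        "badblocks",
        "-w",                        # write-mode (destructive) test
        "-s", "-v",                  # show progress, be verbose
        "-b", str(block_size),       # block size in bytes
        "-c", str(blocks_at_once),   # blocks tested at a time (default is 64)
    ]
    for pattern in patterns:
        cmd += ["-t", pattern]       # one -t per test pattern
    cmd.append(device)
    return cmd
```

The returned list could be passed to subprocess.run() by a task once it has confirmed that the target disk is safe to erase.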

[1] http://pyropus.ca/software/memtester/
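
Since memtester can only test the memory it manages to lock, a task would have to decide how much to request. The sketch below shows one possible approach: take a fraction of what /proc/meminfo reports as available. The helper names, the 0.8 default fraction, and the fallback from MemAvailable to MemFree on older kernels are all assumptions and not something specified in this bug.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def memtester_command(meminfo_text, fraction=0.8, loops=1):
    """Build a memtester argv list sized from the available memory."""
    info = parse_meminfo(meminfo_text)
    # Older kernels do not report MemAvailable, so fall back to MemFree.
    if "MemAvailable" in info:
        available_kb = info["MemAvailable"]
    elif "MemFree" in info:
        available_kb = info["MemFree"]
    else:
        raise ValueError("no MemAvailable or MemFree entry found")
    size_mb = int(available_kb * fraction) // 1024
    # memtester takes the amount of memory, then the number of loops.
    return ["memtester", "%dM" % size_mb, str(loops)]
```

A caller would read /proc/meminfo on the running system and pass its contents in. Keeping the parsing separate from the file access makes the sizing logic easy to check without real hardware.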


All of this would need to be done from an initramfs, as HDD testing would effectively overwrite/erase everything. A few approaches come to mind, but all of them would need all the tools, along with a Beaker-related result uploader, in the initramfs anyway:

* using Anaconda and a %pre section
* using a RHEL-based (dracut) initramfs
* using a completely custom (glibc-based) kernel + initramfs pair
  - might be a bit more complex to be architecture-independent
  - not as complex as it seems, I've built several in the past


.. just my $0.02 ..