Bug 1965024

Summary:	[DR] backup and restore should perform consistency checks on etcd snapshots
Product:	OpenShift Container Platform	Reporter:	Sam Batschelet <sbatsche>
Component:	Etcd	Assignee:	Sam Batschelet <sbatsche>
Status:	CLOSED ERRATA	QA Contact:	ge liu <geliu>
Severity:	high	Docs Contact:
Priority:	high
Version:	4.6	CC:	wlewis
Target Milestone:	---
Target Release:	4.8.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Enhancement
Doc Text:	Feature: Validate the status of the etcd snapshot after backup and before restore. Reason: Previously, the backup procedure did not verify that the snapshot it took was complete, and the restore procedure did not verify that the snapshot being restored was valid and not corrupted. Validating the status of the backup is a worthwhile enhancement. Result: If there is corruption on disk during backup or restore, the error is clearly reported to the admin.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-07-27 23:10:12 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1976287

Description Sam Batschelet 2021-05-26 14:52:18 UTC

Description of problem: Today backup and restore operators assume that the backup state is valid. While this assumption is often true, there is a chance of corruption on disk during backup or during storage before restore.

To mitigate this risk, we should use the hash output of `etcdctl snapshot status` and persist it as part of the backup resources. Docs should state that this information should be stored separately, away from the release (similar to encryption keys).

It should then be possible to ensure the hashes match on restore.
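
A minimal sketch of the persist-then-compare idea, for illustration only. It swaps in a SHA-256 digest of the snapshot file for the hash reported by `etcdctl snapshot status`, so that it runs without etcd installed. The function names and the JSON manifest layout are hypothetical and are not part of the backup or restore scripts.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_checksum(snapshot: Path, manifest: Path) -> None:
    """At backup time: persist the snapshot digest in a separate manifest.

    The manifest format here is an assumption made for this sketch.
    """
    entry = {"file": snapshot.name, "sha256": file_sha256(snapshot)}
    manifest.write_text(json.dumps(entry))


def verify_checksum(snapshot: Path, manifest: Path) -> bool:
    """Before restore: recompute the digest and compare with the manifest."""
    expected = json.loads(manifest.read_text())["sha256"]
    return file_sha256(snapshot) == expected
```

With a check like this, swapping in a different backup file under the same name would make `verify_checksum` return False, provided the manifest is kept somewhere the swapped file cannot also overwrite.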

Version-Release number of selected component (if applicable):


How reproducible: 


Steps to Reproduce:
1. Run a DR backup.
2. Swap out the etcd state file with one from another backup, using the same name.
3. Restore.

Actual results: Restore will happily use any backup as long as the file name is as expected.


Expected results: Validation of backup consistency (etcdctl snapshot status) should be run against the snapshot during backup and before restore. The hash from the backup is persisted and validated during restore.


Additional info:

Comment 4 Suresh Kolichala 2021-06-11 15:40:41 UTC

For 4.8, we decided not to store the checksum during the backup. That means there is no checksum to check against during the restore.

The current PR only makes sure that the backup database is not corrupted, by running the status check against the database. So, for testing purposes:

1. Run a DR backup.
2. Corrupt the etcd db file (on Linux, use truncate to remove the last few blocks of the database file).
3. Attempt to restore.
4. The restore attempt should fail with a validation error.
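
The corruption step and the status check can be sketched roughly as follows. This is a test helper written for illustration, not the code from the PR: the function names are hypothetical, and it is an assumption that `etcdctl snapshot status` exits non-zero for a truncated file in the etcd version under test. Run it only against a copy of a snapshot.

```python
import os
import shutil
import subprocess


def truncate_tail(path: str, nbytes: int) -> int:
    """Cut the last nbytes off a file to simulate on-disk corruption.

    Returns the new file size in bytes.
    """
    size = os.path.getsize(path)
    if not 0 < nbytes < size:
        raise ValueError("nbytes must be positive and smaller than the file size")
    os.truncate(path, size - nbytes)
    return size - nbytes


def snapshot_status_ok(path: str) -> bool:
    """Run `etcdctl snapshot status` on the file and report whether it succeeded.

    Assumes etcdctl is on PATH; treating a non-zero exit code as
    "snapshot invalid" is an assumption of this sketch.
    """
    etcdctl = shutil.which("etcdctl")
    if etcdctl is None:
        raise FileNotFoundError("etcdctl not found on PATH")
    result = subprocess.run(
        [etcdctl, "snapshot", "status", path],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

For example, calling `truncate_tail` with 8192 on a copied snapshot and then `snapshot_status_ok` on the same file would be one way to reproduce step 2 before attempting the restore.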

Comment 9 errata-xmlrpc 2021-07-27 23:10:12 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438