1965024 – [DR] backup and restore should perform consistency checks on etcd snapshots

Summary: [DR] backup and restore should perform consistency checks on etcd snapshots

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Etcd
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Sam Batschelet
QA Contact:	ge liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1976287

Reported:	2021-05-26 14:52 UTC by Sam Batschelet
Modified:	2021-07-27 23:10 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	Enhancement
Doc Text:	Feature: Validate the status of the etcd snapshot after backup and before restore. Reason: Previously, the backup procedure did not verify that the snapshot it took was complete, and the restore procedure did not verify that the snapshot being restored was valid and not corrupted. Validating the status of the backup closes this gap. Result: If there is on-disk corruption during backup or restore, the error is clearly reported to the admin.
Clone Of:
Environment:
Last Closed:	2021-07-27 23:10:12 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-etcd-operator pull 603	0	None	open	Bug 1965024: Validate the status of the etcd snapshot during backup and restore	2021-06-03 16:48:11 UTC
Red Hat Product Errata	RHSA-2021:2438	0	None	None	None	2021-07-27 23:10:40 UTC

Description Sam Batschelet 2021-05-26 14:52:18 UTC

Description of problem: Today backup and restore operators assume that the backup state is valid. While this assumption is often true, there is a chance of on-disk corruption during backup or during storage before restore.

To mitigate this risk we should take the hash from the output of `etcdctl snapshot status` and persist it as part of the backup resources. The docs should state that this information should be stored separately, away from the release (similar to encryption keys).

It should then be possible to ensure the hashes match on restore.
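
A minimal sketch of the persist-and-compare idea described above, written in Python for illustration and not taken from the operator's code. It assumes `etcdctl` is on PATH and that its JSON output includes a `hash` field; the helper names `snapshot_hash`, `record_hash`, and `verify_hash`, and the idea of a separate hash file, are hypothetical.

```python
import json
import subprocess
from pathlib import Path
from typing import Callable


def snapshot_hash(snapshot: Path) -> int:
    """Return the hash that `etcdctl snapshot status` reports for a snapshot."""
    # Assumption: etcdctl is on PATH and its JSON output has a "hash" field.
    # check=True makes a failing status check raise CalledProcessError.
    out = subprocess.run(
        ["etcdctl", "snapshot", "status", str(snapshot), "--write-out=json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return int(json.loads(out)["hash"])


def record_hash(
    snapshot: Path,
    hash_file: Path,
    get_hash: Callable[[Path], int] = snapshot_hash,
) -> None:
    """At backup time: persist the snapshot hash in a separate file."""
    hash_file.write_text(f"{get_hash(snapshot)}\n")


def verify_hash(
    snapshot: Path,
    hash_file: Path,
    get_hash: Callable[[Path], int] = snapshot_hash,
) -> bool:
    """Before restore: recompute the hash and compare it with the stored one."""
    expected = int(hash_file.read_text().strip())
    return get_hash(snapshot) == expected
```

The `get_hash` parameter exists only so the comparison logic can be exercised without a running `etcdctl`; a restore script would abort when `verify_hash` returns `False`.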

Version-Release number of selected component (if applicable):


How reproducible: 


Steps to Reproduce:
1. Run a DR backup.
2. Swap out the etcd state file with one from another backup, using the same name.
3. Restore.

Actual results: restore will happily use any backup as long as the name is as expected.


Expected results: validation of backup consistency (etcdctl snapshot status) should be run against the snapshot during backup and before restore. The hash from the backup is persisted and validated during restore.


Additional info:

Comment 4 Suresh Kolichala 2021-06-11 15:40:41 UTC

For 4.8, we decided not to store the checksum during the backup. That means there is no checksum to check against during the restore.

The current PR only makes sure that the backup database is not corrupted, by running the status check against the database. So, for testing purposes:

1. Run a DR backup.
2. Corrupt the etcd db file (on Linux, use truncate to cut off the last few blocks of the database file).
3. Attempt to restore.
4. The restore attempt should fail with a validation error.
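
The corruption in step 2 can be sketched as follows. This is an illustrative Python helper, not part of the PR; the name `truncate_tail` and the default block count and block size are assumptions chosen for the example.

```python
import os
from pathlib import Path


def truncate_tail(db_file: Path, blocks: int = 4, block_size: int = 4096) -> int:
    """Cut the last `blocks` blocks off a file and return its new size.

    The defaults (4 blocks of 4096 bytes) are arbitrary illustration values.
    """
    size = db_file.stat().st_size
    # Never truncate to a negative length if the file is smaller than the cut.
    new_size = max(0, size - blocks * block_size)
    os.truncate(db_file, new_size)
    return new_size
```

Run it only against a copy of a backup's db file that is meant for testing, then attempt the restore from that backup.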

Comment 9 errata-xmlrpc 2021-07-27 23:10:12 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
