
Bug 1732416

Summary:	ETCD Cluster is down because node is failing
Product:	OpenShift Container Platform	Reporter:	Jose Ortiz Padilla <jortizpa>
Component:	Etcd	Assignee:	Sam Batschelet <sbatsche>
Status:	CLOSED DUPLICATE	QA Contact:	ge liu <geliu>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	3.11.0	CC:	aprajapa, mfojtik, skolicha
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2019-08-07 12:53:00 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Comment 2 Sam Batschelet 2019-07-23 12:00:05 UTC

> 2019-07-23 03:23:11.130648 C | etcdserver: open wal error: wal: file not found

This error is literal; the general causes are:

1.) on-disk data corruption
2.) an invalid restore process

> 2019-07-23 03:23:10.527528 W | etcdmain: found invalid file/dir snapshot_07-18-19.db under data dir /var/lib/etcd/ (Ignore this if you are upgrading etcd)

Did they have an issue on 7/18? In general, this issue is caused by not properly restoring a failed member/cluster.
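
A quick way to spot leftovers like the snapshot file named in the warning above is to list everything at the top level of the data dir that etcd would not expect there. This is only a minimal sketch: the set of expected names ("member", "proxy") is an assumption inferred from the warning, not something confirmed in this bug, and the demo runs against a throwaway directory rather than a real /var/lib/etcd/.

```python
import os
import tempfile

# Assumption: etcd only expects these top-level names inside its data dir.
EXPECTED_ENTRIES = {"member", "proxy"}


def unexpected_entries(data_dir):
    """Return the sorted names under data_dir that are not expected etcd entries."""
    return sorted(
        name for name in os.listdir(data_dir) if name not in EXPECTED_ENTRIES
    )


if __name__ == "__main__":
    # Demo on a temporary directory that mimics the layout from the log line.
    with tempfile.TemporaryDirectory() as demo_dir:
        os.mkdir(os.path.join(demo_dir, "member"))
        open(os.path.join(demo_dir, "snapshot_07-18-19.db"), "w").close()
        for name in unexpected_entries(demo_dir):
            print("unexpected entry in data dir:", name)
```

Anything it reports is a candidate to archive elsewhere before touching the member.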

# remediation
If this is the only member acting this way I would remove it[1] and then add it back[2]. Note: removing a member means destroying the contents of its data-dir, after safely archiving anything you want to keep.

> 2019-07-23 03:23:11.130302 I | etcdserver: data dir = /var/lib/etcd/
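
The remove-and-re-add sequence can be sketched as an ordered list of commands. This is a rough outline under stated assumptions, not a tested procedure for this cluster: the member ID, member name, peer URL and archive path are hypothetical placeholders, the etcdctl subcommands assume the v3 API (ETCDCTL_API=3), and the endpoint and certificate flags a real cluster needs are omitted. The function only builds the commands, it does not run them.

```python
import shlex


def replacement_plan(member_id, name, peer_url,
                     data_dir="/var/lib/etcd/",
                     archive="/root/etcd-data-archive.tar.gz"):
    """Build (but do not run) the commands to replace one failed member.

    All argument values are supplied by the operator; the archive path
    default is a hypothetical placeholder.
    """
    return [
        # Run from a healthy member: find the ID of the failing member.
        ["etcdctl", "member", "list"],
        # Run from a healthy member: drop the failing member from the cluster.
        ["etcdctl", "member", "remove", member_id],
        # Run on the failed node: archive the old data dir before clearing it.
        ["tar", "-czf", archive, "-C", data_dir, "."],
        # Run from a healthy member: register the node as a new member.
        ["etcdctl", "member", "add", name, "--peer-urls=" + peer_url],
    ]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    plan = replacement_plan("8211f1d0f64f3269", "master-1",
                            "https://10.0.0.1:2380")
    for cmd in plan:
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Clearing the data dir and restarting etcd on the node are left to the operator, since those steps depend on how etcd is deployed on the host.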


[1] https://github.com/etcd-io/etcd/blob/77e1c37787fb08d8faf29ecc4a3114f62e2fff68/Documentation/op-guide/runtime-configuration.md#replace-a-failed-machine
[2] https://github.com/etcd-io/etcd/blob/77e1c37787fb08d8faf29ecc4a3114f62e2fff68/Documentation/op-guide/runtime-configuration.md#add-a-new-member

Comment 9 Sam Batschelet 2019-08-07 12:53:00 UTC

After restoring the cluster this issue will be resolved, per https://access.redhat.com/solutions/3885101. 3.3.11 is not a valid version for v3.11.

Regarding 3.3.11

> 2019-07-23 03:23:11.130648 C | etcdserver: open wal error: wal: file not found

The reason for this issue is a gRPC-go bug, which will be resolved upstream in mid-August when etcd 3.3.14[1] is released.

[1] https://github.com/etcd-io/etcd/blob/master/CHANGELOG-3.3.md

*** This bug has been marked as a duplicate of bug 1672344 ***