
Bug 1732416

Summary: ETCD Cluster is down because node is failing
Product: OpenShift Container Platform
Reporter: Jose Ortiz Padilla <jortizpa>
Component: Etcd
Assignee: Sam Batschelet <sbatsche>
Status: CLOSED DUPLICATE
QA Contact: ge liu <geliu>
Severity: unspecified
Priority: unspecified
Version: 3.11.0
CC: aprajapa, mfojtik, skolicha
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-08-07 12:53:00 UTC
Type: Bug

Comment 2 Sam Batschelet 2019-07-23 12:00:05 UTC
> 2019-07-23 03:23:11.130648 C | etcdserver: open wal error: wal: file not found

This error is literal: etcd could not find the WAL file it expected. Common causes are:

1.) on-disk data corruption
2.) an invalid restore process
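
To narrow down which cause applies, it can help to look at what is actually in the WAL directory. The following is a minimal Python sketch, not an official tool. It assumes the WAL segments live under <data-dir>/member/wal and are named <seq>-<index>.wal with two 16-digit hex fields; both the layout and the helper names (list_wal_segments, wal_problems) are assumptions for illustration and do not come from this bug.

```python
import os
import re

# Assumption: WAL segments are named "<seq>-<index>.wal", each field 16 hex digits.
WAL_NAME = re.compile(r"^([0-9a-f]{16})-([0-9a-f]{16})\.wal$")


def list_wal_segments(data_dir):
    """Return sorted (seq, index, filename) tuples found under <data_dir>/member/wal."""
    wal_dir = os.path.join(data_dir, "member", "wal")
    if not os.path.isdir(wal_dir):
        return []
    segments = []
    for name in os.listdir(wal_dir):
        match = WAL_NAME.match(name)
        if match:
            segments.append((int(match.group(1), 16), int(match.group(2), 16), name))
    return sorted(segments)


def wal_problems(data_dir):
    """Report a missing WAL or gaps in the segment sequence numbers."""
    segments = list_wal_segments(data_dir)
    if not segments:
        return ["no WAL segments found"]
    problems = []
    seqs = [seg[0] for seg in segments]
    for prev, cur in zip(seqs, seqs[1:]):
        if cur != prev + 1:
            problems.append("sequence gap between %016x and %016x" % (prev, cur))
    return problems
```

An empty result from wal_problems("/var/lib/etcd/") only means the file names look contiguous. It does not rule out corruption inside a segment, or a snapshot that points past the available WAL.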

> 2019-07-23 03:23:10.527528 W | etcdmain: found invalid file/dir snapshot_07-18-19.db under data dir /var/lib/etcd/ (Ignore this if you are upgrading etcd)

Did they have an issue on 7/18? In general, this issue is caused by not properly restoring a failed member/cluster.
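
A quick way to spot leftovers like the snapshot file in the warning above is to list everything in the data dir that etcd does not expect there. This is a rough Python sketch. The set of expected entries ("member" and "proxy") is my assumption about what etcd accepts directly under the data dir, and unexpected_entries is a hypothetical helper name.

```python
import os

# Assumption: etcd only expects these entries directly under its data dir.
EXPECTED_ENTRIES = {"member", "proxy"}


def unexpected_entries(data_dir):
    """Return the sorted names in data_dir that are not in EXPECTED_ENTRIES."""
    return sorted(name for name in os.listdir(data_dir) if name not in EXPECTED_ENTRIES)
```

For the node in this bug, unexpected_entries("/var/lib/etcd/") would be expected to include snapshot_07-18-19.db, which suggests that a restore was attempted from inside the data dir.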

# remediation
If this is the only member acting this way, I would remove it[1] and then add it back[2]. Note: removing a member means destroying the contents of its data-dir, so safely archive anything you want to keep first.

> 2019-07-23 03:23:11.130302 I | etcdserver: data dir = /var/lib/etcd/
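
The order of operations can be sketched as follows. This Python snippet only builds and prints the commands; it does not run anything. It assumes the v3 etcdctl syntax used in the linked docs (ETCDCTL_API=3) and leaves out the endpoint and certificate flags a real cluster needs. The member ID, member name, peer URL, and archive path are hypothetical placeholders, and etcd is assumed to be stopped on the failed node before its data dir is archived and cleared.

```python
import shlex


def replace_member_plan(member_id, member_name, peer_url, data_dir, archive_path):
    """Build, but do not run, the commands for removing and re-adding one member."""
    return [
        # 1. On a healthy member: remove the failed member from the cluster.
        ["etcdctl", "member", "remove", member_id],
        # 2. On the failed node: archive the old data dir before clearing it.
        ["tar", "-czf", archive_path, "-C", data_dir, "."],
        # 3. On a healthy member: register the node again as a new member.
        ["etcdctl", "member", "add", member_name, "--peer-urls=" + peer_url],
    ]


# Hypothetical values, for illustration only.
plan = replace_member_plan(
    "abc123", "etcd-node1", "https://10.0.0.1:2380",
    "/var/lib/etcd/", "/root/etcd-backup.tgz",
)
for cmd in plan:
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Clearing the data dir after step 2 and restarting etcd with the settings printed by "member add" are left to the operator, since those steps depend on how etcd is deployed on the node.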


[1] https://github.com/etcd-io/etcd/blob/77e1c37787fb08d8faf29ecc4a3114f62e2fff68/Documentation/op-guide/runtime-configuration.md#replace-a-failed-machine
[2] https://github.com/etcd-io/etcd/blob/77e1c37787fb08d8faf29ecc4a3114f62e2fff68/Documentation/op-guide/runtime-configuration.md#add-a-new-member

Comment 9 Sam Batschelet 2019-08-07 12:53:00 UTC
Per https://access.redhat.com/solutions/3885101 this issue will be resolved after restoring the cluster. Note that etcd 3.3.11 is not a valid version for OpenShift v3.11.

Regarding 3.3.11

> 2019-07-23 03:23:11.130648 C | etcdserver: open wal error: wal: file not found

The reason for this issue is a gRPC-go bug, which will be resolved upstream in mid-August when etcd 3.3.14[1] is released.

[1] https://github.com/etcd-io/etcd/blob/master/CHANGELOG-3.3.md

*** This bug has been marked as a duplicate of bug 1672344 ***