Bug 1732416 - ETCD Cluster is down because node is failing
Summary: ETCD Cluster is down because node is failing
Keywords:
Status: CLOSED DUPLICATE of bug 1672344
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-07-23 10:43 UTC by Jose Ortiz Padilla
Modified: 2019-08-07 12:53 UTC

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-08-07 12:53:00 UTC
Target Upstream Version:
Embargoed:


Comment 2 Sam Batschelet 2019-07-23 12:00:05 UTC
> 2019-07-23 03:23:11.130648 C | etcdserver: open wal error: wal: file not found

This error is literal: etcd could not find the WAL file it expected. Common causes are:

1.) on-disk data corruption
2.) an invalid restore process

> 2019-07-23 03:23:10.527528 W | etcdmain: found invalid file/dir snapshot_07-18-19.db under data dir /var/lib/etcd/ (Ignore this if you are upgrading etcd)

Did they have an issue on 7/18? In general, this situation is caused by not properly restoring a failed member or cluster.
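To check both symptoms at once, a small script can list what sits directly under the data dir and whether any WAL files are present. This is a minimal sketch, not an official tool. It assumes the usual etcd 3.x layout, where the data dir holds only a `member` directory (or `proxy` for a proxy) and the WAL files live in `member/wal` with a `.wal` suffix. The function name `inspect_data_dir` is hypothetical.

```python
import os

# Assumption: etcd 3.x expects only "member" (or "proxy") directly under the
# data dir; any other entry triggers the "found invalid file/dir" warning.
EXPECTED_ENTRIES = {"member", "proxy"}


def inspect_data_dir(data_dir):
    """Return (unexpected_entries, wal_files) for an etcd data dir."""
    unexpected = sorted(
        name for name in os.listdir(data_dir) if name not in EXPECTED_ENTRIES
    )

    # Assumption: WAL segments are stored as <data_dir>/member/wal/*.wal.
    wal_dir = os.path.join(data_dir, "member", "wal")
    wal_files = []
    if os.path.isdir(wal_dir):
        wal_files = sorted(
            name for name in os.listdir(wal_dir) if name.endswith(".wal")
        )
    return unexpected, wal_files
```

Calling `inspect_data_dir("/var/lib/etcd/")` on the affected node should report `snapshot_07-18-19.db` as unexpected. An empty WAL list would be one way to hit "wal: file not found", although a non-empty list does not rule the error out.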

# remediation
If this is the only member acting this way, I would remove it[1] and then add it back[2]. Note: removing a member also means destroying the contents of its data-dir, so safely archive anything you want to keep first.
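The remove-and-re-add sequence can be written down as a plan before anything is run. The sketch below only builds and prints the steps and does not execute them. It assumes the `etcdctl` v3 API (`ETCDCTL_API=3`), whose `member list`, `member remove <id>` and `member add <name> --peer-urls=<url>` subcommands are the ones used in the linked runtime-configuration guide. Endpoint and TLS flags are left out, and the function names and all argument values are hypothetical placeholders.

```python
import shlex


def replacement_plan(member_id, member_name, peer_url, data_dir="/var/lib/etcd/"):
    """Return ordered (description, argv) steps; argv is None for manual steps."""
    return [
        ("confirm the ID of the failing member",
         ["etcdctl", "member", "list"]),
        ("remove the failing member from the cluster",
         ["etcdctl", "member", "remove", member_id]),
        # Manual step: nothing here touches the disk on your behalf.
        (f"on the failed host: stop etcd, archive {data_dir}, then empty it",
         None),
        ("register the member again",
         ["etcdctl", "member", "add", member_name, f"--peer-urls={peer_url}"]),
        ("start etcd on the host with initial-cluster-state set to existing",
         None),
    ]


def render(plan):
    """Format the plan as commented shell lines for review."""
    lines = []
    for description, argv in plan:
        lines.append(f"# {description}")
        if argv is not None:
            lines.append(shlex.join(argv))
    return "\n".join(lines)
```

For example, `print(render(replacement_plan("<member-id>", "<member-name>", "https://<host>:2380")))` produces a checklist that can be reviewed before any command is typed. Keeping the manual steps in the same list makes it harder to forget the archive step.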

> 2019-07-23 03:23:11.130302 I | etcdserver: data dir = /var/lib/etcd/
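The log lines quoted in this comment share one shape: timestamp, a one-letter level, `|`, package, message. A short filter can pull the critical and warning lines out of a longer log. This is a sketch that assumes this line format and assumes the level letters include C (critical), E (error), W (warning) and I (info). The names `parse_line` and `problem_lines` are hypothetical.

```python
import re

# Assumed format: "<date> <time> <LEVEL> | <package>: <message>"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<level>[A-Z]) \| (?P<pkg>[\w./-]+): (?P<msg>.*)$"
)


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    # Drop a leading "> " quote marker and surrounding whitespace, if present.
    match = LOG_LINE.match(line.lstrip("> ").strip())
    return match.groupdict() if match else None


def problem_lines(lines, levels=("C", "E", "W")):
    """Return parsed entries whose level is in `levels`, in input order."""
    parsed = (parse_line(line) for line in lines)
    return [entry for entry in parsed if entry and entry["level"] in levels]
```

Run over the three lines quoted here, this keeps the `etcdmain` warning and the `etcdserver` critical error and drops the informational `data dir` line.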


[1] https://github.com/etcd-io/etcd/blob/77e1c37787fb08d8faf29ecc4a3114f62e2fff68/Documentation/op-guide/runtime-configuration.md#replace-a-failed-machine
[2] https://github.com/etcd-io/etcd/blob/77e1c37787fb08d8faf29ecc4a3114f62e2fff68/Documentation/op-guide/runtime-configuration.md#add-a-new-member

Comment 9 Sam Batschelet 2019-08-07 12:53:00 UTC
After the cluster is restored per https://access.redhat.com/solutions/3885101, this issue will be resolved. 3.3.11 is not a valid version for v3.11.

Regarding 3.3.11

> 2019-07-23 03:23:11.130648 C | etcdserver: open wal error: wal: file not found

The reason for this issue is a gRPC-go bug, which will be resolved upstream in mid-August when etcd 3.3.14[1] is released.

[1] https://github.com/etcd-io/etcd/blob/master/CHANGELOG-3.3.md

*** This bug has been marked as a duplicate of bug 1672344 ***

