Bug 1570183
Summary: | etcd 3.1 failing to send heartbeats on time | | |
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Eric Jones <erjones> |
Component: | Master | Assignee: | Michal Fojtik <mfojtik> |
Status: | CLOSED DUPLICATE | QA Contact: | Wang Haoran <haowang> |
Severity: | high | Docs Contact: | |
Priority: | high | | |
Version: | 3.4.1 | CC: | aos-bugs, jokerman, mmccomas, rhowe, rkharwar, sgaikwad |
Target Milestone: | --- | Keywords: | Unconfirmed |
Target Release: | --- | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2018-04-26 21:14:26 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Description
Eric Jones
2018-04-20 20:16:51 UTC
Additional pieces of information that could come into play:

Each of these masters is a virtual machine with 8 vCPUs and 32 GB RAM, and we are using local storage, which I believe is backed by HDD.

As for impact, we have been working through an etcd issue on Red Hat Case # 02048341. As a workaround, we have been running a script that detects when the etcd cluster is unhealthy, then restarts etcd and alerts us. The etcd restart alert is what tipped us off that something was wrong. We expect to have a fix for etcd today and plan to apply it across clusters and turn off our script. However, we think there is an "unhealthy etcd cluster / master server overloaded" issue going on separately from what was described in that case, and once the alert script is turned off, we may no longer be able to detect this condition.

The customer is running a script that appears to query etcdctl. This might be unnecessarily increasing the load on the etcd cluster.

Based on information from [0] and [1], and on the etcd metrics data collected, this does not appear to be a disk issue, which leaves CPU or network as the likely cause. That said, I will upload the sosreports, with SAR data, shortly to assist in the investigation.

[0] https://github.com/coreos/etcd/blob/master/Documentation/faq.md#what-does-the-etcd-warning-failed-to-send-out-heartbeat-on-time-mean
[1] https://github.com/coreos/etcd/blob/master/Documentation/metrics.md#disk

*** This bug has been marked as a duplicate of bug 1415839 ***
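For anyone repeating the disk check referenced in [1], here is a minimal Python sketch of one way to read a latency percentile out of a saved etcd `/metrics` dump. It is not the script used in this case. The function name, the sample numbers, and the choice of the 99th percentile are all illustrative assumptions; the metric name `etcd_disk_wal_fsync_duration_seconds` follows the etcd metrics documentation, but confirm it against the output of the etcd version in use.

```python
import re

# Matches Prometheus text-format histogram bucket lines, for example:
#   etcd_disk_wal_fsync_duration_seconds_bucket{le="0.008"} 995
_BUCKET_RE = re.compile(r'^(\w+)_bucket\{[^}]*le="([^"]+)"[^}]*\}\s+(\S+)$')


def bucket_quantile_upper_bound(metrics_text, metric, q):
    """Return the smallest bucket upper bound (seconds) covering quantile q.

    Hypothetical helper: it reports a bucket boundary, not an interpolated
    value, so the true quantile is at or below the returned number.
    Returns None if the metric is missing or has no observations.
    """
    buckets = []
    for line in metrics_text.splitlines():
        m = _BUCKET_RE.match(line.strip())
        if m and m.group(1) == metric:
            # float() accepts "+Inf", which Prometheus uses for the last bucket.
            buckets.append((float(m.group(2)), float(m.group(3))))
    if not buckets:
        return None
    buckets.sort()
    total = buckets[-1][1]  # bucket counts are cumulative
    if total == 0:
        return None
    for upper_bound, cumulative_count in buckets:
        if cumulative_count >= q * total:
            return upper_bound
    return buckets[-1][0]


# Made-up sample data for illustration only; not taken from this case.
SAMPLE = """\
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 100
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.002"} 600
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.004"} 950
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.008"} 995
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.016"} 1000
etcd_disk_wal_fsync_duration_seconds_bucket{le="+Inf"} 1000
"""

p99 = bucket_quantile_upper_bound(
    SAMPLE, "etcd_disk_wal_fsync_duration_seconds", 0.99
)
print("p99 WAL fsync upper bound (s):", p99)
```

Running the same function against a metrics dump saved once per interval puts less load on the cluster than repeated etcdctl polling, and a consistently low result supports looking at CPU or network rather than disk.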