Bug 1570183 - etcd 3.1 failing to send heartbeats on time
Summary: etcd 3.1 failing to send heartbeats on time
Keywords:
Status: CLOSED DUPLICATE of bug 1415839
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 3.4.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Michal Fojtik
QA Contact: Wang Haoran
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-04-20 20:16 UTC by Eric Jones
Modified: 2018-04-26 21:14 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-26 21:14:26 UTC
Target Upstream Version:



Description Eric Jones 2018-04-20 20:16:51 UTC
Description of problem:
The customer is seeing many of the error messages shown in [0] across many of their clusters, and is concerned that they are related to the periodic "unhealthy" messages they are also seeing.

[0]
Apr 18 04:05:31 njrarltapp00173 etcd[1580]: start to snapshot (applied: 7247221236, lastsnap: 7247211235)
Apr 18 04:05:36 njrarltapp00173 etcd[1580]: failed to send out heartbeat on time (exceeded the 500ms timeout for 40.390554ms)
Apr 18 04:05:36 njrarltapp00173 etcd[1580]: server is likely overloaded
Apr 18 04:05:36 njrarltapp00173 etcd[1580]: failed to send out heartbeat on time (exceeded the 500ms timeout for 40.44649ms)
Apr 18 04:05:36 njrarltapp00173 etcd[1580]: server is likely overloaded
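
To gauge how often and by how much heartbeats overrun, warnings like those in [0] can be tallied per host. The following is a minimal Python sketch, assuming journal lines in the format quoted above and single-unit Go-style durations (for example "40.39ms"). The names HEARTBEAT_RE, duration_ms, and summarize are illustrative only and are not part of etcd or of any tooling used on this case.

```python
import re

# Matches the etcd warning as it appears in the journal excerpt above, e.g.
# "Apr 18 04:05:36 host etcd[1580]: failed to send out heartbeat on time
#  (exceeded the 500ms timeout for 40.390554ms)"
HEARTBEAT_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+(?P<host>\S+)\s+etcd\[\d+\]:\s+"
    r"failed to send out heartbeat on time "
    r"\(exceeded the (?P<timeout>\S+) timeout for (?P<over>\S+)\)"
)

# Conversion factors to milliseconds for the units this sketch understands.
_UNIT_MS = {"ns": 1e-6, "us": 1e-3, "µs": 1e-3, "ms": 1.0, "s": 1000.0}


def duration_ms(text):
    """Convert a single-unit duration such as '40.39ms' to milliseconds.

    Assumption: compound durations (e.g. '1m2s') do not occur; they raise.
    """
    m = re.fullmatch(r"([\d.]+)(ns|us|µs|ms|s)", text)
    if m is None:
        raise ValueError("unsupported duration: %r" % text)
    return float(m.group(1)) * _UNIT_MS[m.group(2)]


def summarize(lines):
    """Return {host: (warning_count, worst_overrun_ms)} for heartbeat warnings."""
    stats = {}
    for line in lines:
        m = HEARTBEAT_RE.match(line)
        if m is None:
            continue  # snapshot messages, "server is likely overloaded", etc.
        over = duration_ms(m.group("over"))
        count, worst = stats.get(m.group("host"), (0, 0.0))
        stats[m.group("host")] = (count + 1, max(worst, over))
    return stats
```

Feeding it the output of journalctl for the etcd unit (one line per list element) gives a per-host count and worst overrun, which can then be lined up against SAR data for the same time window.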


Version-Release number of selected component (if applicable):
atomic-openshift-3.4.1.44.38-1.git.0.d04b8d5.el7.x86_64
etcd-3.1.7-1.el7.x86_64

How reproducible:
Working with (rkharwar@redhat.com) to reproduce this in our standing UPS Reproducer

Additional info:
Attaching etcd metrics as well as master and etcd logs shortly

Similar to [1] but different versions of OCP and etcd

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1507590

Comment 4 Eric Jones 2018-04-20 20:30:35 UTC
Additional pieces of information that could come into play:

Each of these masters is a virtual machine with 8 vCPUs and 32 GB RAM, using local storage that I believe is backed by HDD.  As for impact, we have been working through an etcd issue on Red Hat Case # 02048341, and as a workaround we have been running a script that detects when the etcd cluster is unhealthy, restarts etcd, and alerts us.  The etcd restart alert is what tipped us off that something was wrong.  We expect to have a fix for etcd today and plan to apply it across clusters and turn off our script.  However, we think there is an "unhealthy etcd cluster / master server overloaded" issue going on separately from what was described in that case, and once the script is turned off we may no longer be able to detect this condition.

The customer is running a script that appears to query etcd through etcdctl. This might be unnecessarily increasing the load on the etcd cluster.

Comment 5 Eric Jones 2018-04-23 21:20:03 UTC
Based on the information in [0] and [1] and the etcd metrics data collected, this does not appear to be a disk issue, which leaves CPU or network as the likely cause.

That said, I will upload the sosreports shortly, including SAR data to assist the investigation.

[0] https://github.com/coreos/etcd/blob/master/Documentation/faq.md#what-does-the-etcd-warning-failed-to-send-out-heartbeat-on-time-mean
[1] https://github.com/coreos/etcd/blob/master/Documentation/metrics.md#disk
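
For reference, the disk check described above can be sketched roughly as follows. This is a minimal Python example, assuming the metrics were saved as Prometheus text output from the etcd /metrics endpoint and that the WAL fsync histogram is named etcd_disk_wal_fsync_duration_seconds. The 10 ms threshold is the p99 guideline as I recall it from the FAQ in [0] and should be treated as an assumption. The function name and the sample numbers are hypothetical and are not taken from the customer data.

```python
import re

# One histogram bucket line in Prometheus text format, e.g.
# etcd_disk_wal_fsync_duration_seconds_bucket{le="0.004"} 995
BUCKET_RE = re.compile(
    r'^(?P<name>\w+)_bucket\{[^}]*le="(?P<le>[^"]+)"[^}]*\}\s+(?P<count>\S+)$'
)


def p99_upper_bound(metrics_text, name):
    """Return the smallest bucket bound (seconds) covering 99% of observations.

    This is an upper bound on p99, not an interpolated estimate.
    Returns None if the histogram is missing or empty.
    """
    buckets = []
    for line in metrics_text.splitlines():
        m = BUCKET_RE.match(line.strip())
        if m and m.group("name") == name:
            buckets.append((float(m.group("le")), float(m.group("count"))))
    if not buckets:
        return None
    buckets.sort()  # the +Inf bucket sorts last and holds the total count
    total = buckets[-1][1]
    if total == 0:
        return None
    for bound, cumulative in buckets:
        if cumulative >= 0.99 * total:
            return bound
    return None


# Made-up sample for illustration only; not data from this case.
SAMPLE = """\
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 10
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.002"} 900
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.004"} 995
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.008"} 1000
etcd_disk_wal_fsync_duration_seconds_bucket{le="+Inf"} 1000
"""

WAL_FSYNC = "etcd_disk_wal_fsync_duration_seconds"
ASSUMED_P99_LIMIT_S = 0.010  # assumed guideline, see the note above

bound = p99_upper_bound(SAMPLE, WAL_FSYNC)
disk_looks_ok = bound is not None and bound < ASSUMED_P99_LIMIT_S
```

If the bound stays under the limit on every member, the disk is an unlikely cause of the heartbeat warnings, which matches the conclusion above.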

Comment 10 Ryan Howe 2018-04-26 21:14:26 UTC

*** This bug has been marked as a duplicate of bug 1415839 ***

