Description of problem:

Customer is seeing lots of error messages [0] in many of their clusters, and is concerned this is related to the periodic "unhealthy" messages they are seeing.

[0]
Apr 18 04:05:31 njrarltapp00173 etcd[1580]: start to snapshot (applied: 7247221236, lastsnap: 7247211235)
Apr 18 04:05:36 njrarltapp00173 etcd[1580]: failed to send out heartbeat on time (exceeded the 500ms timeout for 40.390554ms)
Apr 18 04:05:36 njrarltapp00173 etcd[1580]: server is likely overloaded
Apr 18 04:05:36 njrarltapp00173 etcd[1580]: failed to send out heartbeat on time (exceeded the 500ms timeout for 40.44649ms)
Apr 18 04:05:36 njrarltapp00173 etcd[1580]: server is likely overloaded

Version-Release number of selected component (if applicable):
atomic-openshift-3.4.1.44.38-1.git.0.d04b8d5.el7.x86_64
etcd-3.1.7-1.el7.x86_64

How reproducible:
Working with (rkharwar) to reproduce this in our standing UPS reproducer.

Additional info:
Attaching etcd metrics as well as master and etcd logs shortly.

Similar to [1], but with different versions of OCP and etcd.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1507590
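For triage it helps to pull the overshoot out of each heartbeat warning above; a minimal sketch of doing that over journal output (the regex and function name are my own, not anything shipped with etcd):

```python
import re

# Matches etcd's heartbeat warning, e.g.
# "failed to send out heartbeat on time (exceeded the 500ms timeout for 40.390554ms)"
HEARTBEAT_RE = re.compile(
    r"exceeded the (?P<timeout>[\d.]+)ms timeout for (?P<over>[\d.]+)ms"
)

def heartbeat_overshoots(journal_lines):
    """Return the overshoot in milliseconds of each heartbeat warning found."""
    overshoots = []
    for line in journal_lines:
        m = HEARTBEAT_RE.search(line)
        if m:
            overshoots.append(float(m.group("over")))
    return overshoots

lines = [
    "Apr 18 04:05:36 njrarltapp00173 etcd[1580]: failed to send out heartbeat on time (exceeded the 500ms timeout for 40.390554ms)",
    "Apr 18 04:05:36 njrarltapp00173 etcd[1580]: server is likely overloaded",
    "Apr 18 04:05:36 njrarltapp00173 etcd[1580]: failed to send out heartbeat on time (exceeded the 500ms timeout for 40.44649ms)",
]
print(heartbeat_overshoots(lines))  # [40.390554, 40.44649]
```

A ~40ms overshoot on a 500ms heartbeat interval is small; per the etcd FAQ these warnings usually point at slow disk or CPU starvation rather than a hard failure.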
Additional pieces of information that could come into play: each of these masters is a virtual machine with 8 vCPUs and 32 GB RAM, and we are using local storage, which I believe is backed by HDDs. As for impact: we have been working through an etcd issue on Red Hat Case #02048341, and as a workaround we have been running a script that detects when the etcd cluster is unhealthy, restarts etcd, and alerts us. The etcd-restart alert is what tipped us off that something was wrong. We expect a fix for that etcd issue today and plan to apply it across clusters and turn off our script. However, we think there is a separate "unhealthy etcd cluster / master server overloaded" issue going on, distinct from what was described in that case, and by turning off the alert script we may no longer be able to detect this condition.

The customer is running a script that appears to query etcdctl; this might unnecessarily increase the load on the etcd cluster.
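For reference, the customer's watchdog is presumably of roughly this shape; a hypothetical sketch (not their actual script) that parses `etcdctl cluster-health` output, with illustrative sample output:

```python
def cluster_is_healthy(cluster_health_output):
    """Parse `etcdctl cluster-health` (etcd v2 API tool) output.

    The tool prints one "member ... is healthy/unhealthy" line per member
    and a final "cluster is healthy" / "cluster is unhealthy" summary line;
    we key off the summary line only.
    """
    for line in cluster_health_output.splitlines():
        if line.startswith("cluster is "):
            return line.strip() == "cluster is healthy"
    return False  # no summary line at all: treat as unhealthy

# Illustrative output from a healthy 3-member cluster (member IDs and
# endpoints are made up):
sample = """\
member 6e3bd23ae5f1eae0 is healthy: got healthy result from https://10.0.0.1:2379
member 924e2e83e93f2560 is healthy: got healthy result from https://10.0.0.2:2379
member a8266ecf031671f3 is healthy: got healthy result from https://10.0.0.3:2379
cluster is healthy"""

print(cluster_is_healthy(sample))  # True
```

In the customer's setup something like this presumably runs in a loop, restarting etcd and alerting on False; note that each poll issues real requests against the cluster, which is why a tight polling interval could itself add load.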
Based on the information from [0] and [1], and on the etcd metrics data collected, this does not appear to be a disk issue, which makes a CPU or network problem the more likely cause. That said, uploading the sosreports, including SAR data, shortly to assist in the investigation.

[0] https://github.com/coreos/etcd/blob/master/Documentation/faq.md#what-does-the-etcd-warning-failed-to-send-out-heartbeat-on-time-mean
[1] https://github.com/coreos/etcd/blob/master/Documentation/metrics.md#disk
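To show the kind of disk check this conclusion rests on: the metric called out in [1] is the `etcd_disk_wal_fsync_duration_seconds` histogram, whose high percentiles should stay in the low-millisecond range on healthy storage. A rough sketch of bounding the p99 from the cumulative histogram buckets scraped from /metrics (the bucket values below are illustrative, not from this cluster):

```python
def p99_upper_bound(buckets):
    """Given (upper_bound_seconds, cumulative_count) histogram buckets in
    ascending order, ending with the +Inf bucket, return the smallest
    bucket bound covering >= 99% of observations -- an upper bound on
    the true 99th percentile."""
    total = buckets[-1][1]  # the +Inf bucket holds the total count
    for bound, count in buckets:
        if count >= 0.99 * total:
            return bound
    return float("inf")

# Illustrative etcd_disk_wal_fsync_duration_seconds_bucket samples,
# (le_seconds, cumulative_count):
sample_buckets = [
    (0.001, 880_000),
    (0.002, 950_000),
    (0.004, 990_500),
    (0.008, 999_900),
    (0.016, 1_000_000),
    (float("inf"), 1_000_000),
]
print(p99_upper_bound(sample_buckets))  # 0.004
```

With numbers like these, 99% of WAL fsyncs complete within 4ms, well inside the 500ms heartbeat budget, which is the shape of evidence that points the investigation away from disk and toward CPU or network.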
*** This bug has been marked as a duplicate of bug 1415839 ***