Bug 1398083
Field | Value
---|---
Summary | Manually adding a new etcd node sometimes failed when other existing etcd services are running
Product | OpenShift Container Platform
Component | Node
Version | 3.3.0
Status | CLOSED NOTABUG
Severity | low
Priority | medium
Reporter | Kenjiro Nakayama <knakayam>
Assignee | Avesh Agarwal <avagarwa>
QA Contact | DeShuai Ma <dma>
CC | aos-bugs, avagarwa, decarr, jokerman, mmccomas
Hardware | Unspecified
OS | Unspecified
Type | Bug
Last Closed | 2016-12-14 21:21:12 UTC
Description
Kenjiro Nakayama
2016-11-24 05:30:56 UTC
Sure Derek. Kenjiro, can you also attach the leader logs from when the issue happens?

Kenjiro, after reading the 2 articles, it seems what you said is true:
1. The first article says: "Take backup of etcd and transfer contents to "NEW_ETCD" Skip this step if version is lower than etcd-2.3.7-4 and etcd database size is smaller than 700mb."
2. The second article says: "If etcd is running on more than one host, stop it on each host:"
So for the backup step, it seems to be recommended to stop the etcd services on the other hosts. Could you please verify what the database size was when the issue happened?

Based on the logs, it seems etcd is colocated with a node. In that case this issue could happen, and it seems related to this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1389736. https://bugzilla.redhat.com/show_bug.cgi?id=1389736#c39 has more details.

The error in the logs here:
Nov 21 10:46:31 master04 etcd: publish error: etcdserver: request timed out
seems similar to the ones in https://bugzilla.redhat.com/show_bug.cgi?id=1389736.

So far my understanding, based on the logs, is this: etcd is colocated with a node, which is not recommended according to https://bugzilla.redhat.com/show_bug.cgi?id=1389736, and this colocation could trigger the issue. Also, as it happened only once at the customer site and is not 100% reproducible, it does not seem like a blocker.

Hi Avesh, I understand that the issue probably hit bug#1389736. However, does adding new nodes (with a database larger than 700mb) ultimately require stopping the other existing etcd services? The article [1] mentions "Do not start the etcd service" in the first section, but that appears to mean "don't start the *new* etcd service yet". Could you please confirm it?
[1] https://access.redhat.com/articles/2650151

Kenjiro, based on the article, for the backup step, if the db size is more than 700mb and the version is 2.3.7-4, it says to stop the etcd service on the other hosts. Yes, the new etcd service should be started only after the steps up to 7 are done.

Hi Avesh, thank you for your clarification. I think we can close this ticket now. Thanks.
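
Note: the skip condition quoted from the first article in this thread ("Skip this step if version is lower than etcd-2.3.7-4 and etcd database size is smaller than 700mb") can be written down as a small check. This is only a sketch of that one quoted sentence, not of the full procedure in the article. The function names, the way the version string is parsed, and the use of megabytes as the unit are assumptions made for illustration; none of them come from the article or from etcd itself.

```python
import re

# Thresholds taken from the sentence quoted in this thread.
THRESHOLD_VERSION = (2, 3, 7, 4)   # etcd-2.3.7-4
THRESHOLD_DB_SIZE_MB = 700


def parse_etcd_version(version):
    """Turn 'etcd-2.3.7-4' or '2.3.7-4' into (2, 3, 7, 4).

    Hypothetical helper: it simply collects every run of digits, so pass
    only the version-release part (no dist suffix such as '.el7').
    """
    return tuple(int(part) for part in re.findall(r"\d+", version))


def can_skip_backup_step(version, db_size_mb):
    """Return True when the quoted skip condition holds.

    Both parts must be true: the version is lower than etcd-2.3.7-4
    AND the database is smaller than 700 MB.
    """
    return (parse_etcd_version(version) < THRESHOLD_VERSION
            and db_size_mb < THRESHOLD_DB_SIZE_MB)
```

For example, `can_skip_backup_step("2.2.5-2", 300)` is true, while a 900 MB database or version 2.3.7-4 makes the check false, in which case the backup step applies.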