Bug 1466638
| Summary: | ETCD data lost after migration from containerized etcd to system container etcd | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Gaoyun Pei <gpei> |
| Component: | Installer | Assignee: | Giuseppe Scrivano <gscrivan> |
| Status: | CLOSED ERRATA | QA Contact: | Gaoyun Pei <gpei> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 3.6.0 | CC: | aos-bugs, jokerman, mmccomas, sdodson, trankin |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-08-10 05:28:56 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||

Patch proposed here: https://github.com/openshift/openshift-ansible/pull/4668

Verified this bug with openshift-ansible-3.6.135-1.git.0.5533fe3.el7.noarch:

1. Set up a containerized OCP 3.6 environment.
2. Add `openshift_use_etcd_system_container=true` to the Ansible inventory file and re-run the byo/config.yml playbook.

The playbook finished successfully. After the re-run, the cluster is working well:

- previous project data still exists
- the node is available and the old pod is running
- the etcd service is running via a system container

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716
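
The verification above checks by hand that previous project data still exists after the playbook re-run. A minimal Python sketch of scripting that check, assuming the `oc get project` (or `oc get node`) output is captured as text before and after re-running byo/config.yml; `names_from_oc_output` and `missing_after_migration` are hypothetical helpers, not part of openshift-ansible or `oc`:

```python
"""Diff the NAME column of `oc get ...` output captured before and after a migration.

Hypothetical helper: nothing like this ships with openshift-ansible or oc;
it only scripts the manual "previous project data still exists" check.
"""
from __future__ import annotations


def names_from_oc_output(text: str) -> set[str]:
    """Return the first column of every data row in `oc get` tabular output."""
    names = set()
    for line in text.strip().splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and the empty-result message.
        if not fields or fields[0] == "NAME" or line.strip().startswith("No resources found"):
            continue
        names.add(fields[0])
    return names


def missing_after_migration(before: str, after: str) -> list[str]:
    """Sorted names present before the playbook re-run but absent afterwards."""
    return sorted(names_from_oc_output(before) - names_from_oc_output(after))


if __name__ == "__main__":
    # Project listings copied from the description of this bug.
    before = """NAME DISPLAY NAME STATUS
default Active
install-test Active
kube-public Active
kube-system Active
logging Active
management-infra Active
openshift Active
openshift-infra Active
test111 Active"""
    after = """NAME DISPLAY NAME STATUS
default Active
kube-public Active
kube-system Active
management-infra Active
openshift Active
openshift-infra Active"""
    print(missing_after_migration(before, after))  # ['install-test', 'logging', 'test111']
```

Fed the two `oc get project` listings from the description of this bug, the sketch reports install-test, logging and test111 as missing, which is the data loss the bug describes. Comparing only the NAME column keeps the check independent of the AGE and STATUS columns, which legitimately change across a re-run.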

Description of problem:
When migrating etcd from a previous containerized installation to a system container, the installer failed at the "Wait for Node Registration" task because the nodes were not found.

Version-Release number of selected component (if applicable):
openshift-ansible-3.6.126.4-1.git.0.d25d828.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Set up a containerized OCP 3.6 environment and make sure it is working well:

[root@qe-gpei-etcd-sc-2-master-1 ~]# oc get node
NAME STATUS AGE VERSION
qe-gpei-etcd-sc-2-master-1 Ready,SchedulingDisabled 13m v1.6.1+5115d708d7
qe-gpei-etcd-sc-2-node-registry-router-1 Ready 10m v1.6.1+5115d708d7

[root@qe-gpei-etcd-sc-2-master-1 ~]# oc get pod
NAME READY STATUS RESTARTS AGE
docker-registry-3-3w0x4 1/1 Running 0 7m
registry-console-1-m22cg 1/1 Running 0 8m
router-1-c3s21 1/1 Running 0 10m

[root@qe-gpei-etcd-sc-2-master-1 ~]# oc get project
NAME DISPLAY NAME STATUS
default Active
install-test Active
kube-public Active
kube-system Active
logging Active
management-infra Active
openshift Active
openshift-infra Active
test111 Active

2. Add `openshift_use_etcd_system_container=true` to the Ansible inventory file and re-run the byo/config.yml playbook.

Actual results:

TASK [openshift_manage_node : Wait for Node Registration] **********************
...
FAILED - RETRYING: TASK: openshift_manage_node : Wait for Node Registration (1 retries left).
fatal: [qe-gpei-etcd-sc-2-node-registry-router-1.0630-1gj.qe.rhcloud.com -> qe-gpei-etcd-sc-2-master-1.0630-1gj.qe.rhcloud.com]: FAILED! => {
    "attempts": 50,
    "changed": false,
    "failed": true,
    "results": {
        "cmd": "/usr/local/bin/oc get node qe-gpei-etcd-sc-2-node-registry-router-1 -o json -n default",
        "results": [
            {}
        ],
        "returncode": 0,
        "stderr": "Error from server (NotFound): nodes \"qe-gpei-etcd-sc-2-node-registry-router-1\" not found\n",
        "stdout": ""
    },
    "state": "list"
}
fatal: [qe-gpei-etcd-sc-2-master-1.0630-1gj.qe.rhcloud.com -> qe-gpei-etcd-sc-2-master-1.0630-1gj.qe.rhcloud.com]: FAILED! => {
    "attempts": 50,
    "changed": false,
    "failed": true,
    "results": {
        "cmd": "/usr/local/bin/oc get node qe-gpei-etcd-sc-2-master-1 -o json -n default",
        "results": [
            {}
        ],
        "returncode": 0,
        "stderr": "Error from server (NotFound): nodes \"qe-gpei-etcd-sc-2-master-1\" not found\n",
        "stdout": ""
    },
    "state": "list"
}
to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/config.retry

On the master host:

[root@qe-gpei-etcd-sc-2-master-1 ~]# oc get node
No resources found.
[root@qe-gpei-etcd-sc-2-master-1 ~]# oc get pod
No resources found.
[root@qe-gpei-etcd-sc-2-master-1 ~]# oc get project
NAME DISPLAY NAME STATUS
default Active
kube-public Active
kube-system Active
management-infra Active
openshift Active
openshift-infra Active

On the etcd host:

[root@qe-gpei-etcd-sc-2-etcd-1 ~]# ls -R /var/lib/etcd/member/
/var/lib/etcd/member/:
snap wal

/var/lib/etcd/member/snap:
db

/var/lib/etcd/member/wal:
0000000000000000-0000000000000000.wal
[root@qe-gpei-etcd-sc-2-etcd-1 ~]#
[root@qe-gpei-etcd-sc-2-etcd-1 ~]# ls -R /var/lib/etcd/etcd.etcd/etcd.etcd/member/
/var/lib/etcd/etcd.etcd/etcd.etcd/member/:
snap wal

/var/lib/etcd/etcd.etcd/etcd.etcd/member/snap:
db

/var/lib/etcd/etcd.etcd/etcd.etcd/member/wal:
0000000000000000-0000000000000000.wal 0.tmp

Expected results:

Additional info:
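
The etcd host listing shows two separate member directories, /var/lib/etcd/member and /var/lib/etcd/etcd.etcd/etcd.etcd/member. That pattern suggests (an inference, not something this report states) that the system-container etcd came up with a new data directory instead of reusing the existing one. A hypothetical Python diagnostic, assuming the snap/db and wal/*.wal layout shown in the listing, that finds every member directory under a root and summarises what it holds:

```python
"""Summarise every etcd `member` directory found under a data root.

Hypothetical diagnostic, not part of openshift-ansible. It assumes the layout
shown in this report: <dir>/member/snap/db and <dir>/member/wal/*.wal.
"""
from __future__ import annotations

import os
import sys
from pathlib import Path


def find_member_dirs(root: str) -> dict[str, dict[str, object]]:
    """Map each '.../member' directory under root to its db size and WAL files."""
    found = {}
    for dirpath, dirnames, _filenames in os.walk(root):
        if os.path.basename(dirpath) != "member":
            continue
        member = Path(dirpath)
        db = member / "snap" / "db"
        wal_dir = member / "wal"
        wal_files = sorted(p.name for p in wal_dir.glob("*.wal")) if wal_dir.is_dir() else []
        found[str(member)] = {
            # None means no snap/db file exists in this member directory.
            "db_bytes": db.stat().st_size if db.is_file() else None,
            "wal_files": wal_files,
        }
        # Prune in place so os.walk does not descend into member/snap and member/wal.
        dirnames.clear()
    return found


if __name__ == "__main__":
    # Default to the etcd data root used in this report.
    root = sys.argv[1] if len(sys.argv) > 1 else "/var/lib/etcd"
    for path, info in sorted(find_member_dirs(root).items()):
        print(path, info)
```

Run against /var/lib/etcd on the etcd host, it would list both member directories. The report does not show file sizes, so comparing the `db_bytes` values is what would likely reveal which directory still holds the pre-migration data. Temporary files such as 0.tmp are deliberately excluded because only `*.wal` files are matched.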