
Bug 1466638

Summary: ETCD data lost after migration from containerized etcd to system container etcd
Product: OpenShift Container Platform
Reporter: Gaoyun Pei <gpei>
Component: Installer
Assignee: Giuseppe Scrivano <gscrivan>
Status: CLOSED ERRATA
QA Contact: Gaoyun Pei <gpei>
Severity: medium
Docs Contact:
Priority: medium
Version: 3.6.0
CC: aos-bugs, jokerman, mmccomas, sdodson, trankin
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-10 05:28:56 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Gaoyun Pei 2017-06-30 06:37:06 UTC
Description of problem:
When migrating etcd from a previous containerized installation to a system container, the installer failed at the "Wait for Node Registration" task because the nodes were not found.

Version-Release number of selected component (if applicable):
openshift-ansible-3.6.126.4-1.git.0.d25d828.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Set up a containerized ocp-3.6 environment and make sure it is working well
[root@qe-gpei-etcd-sc-2-master-1 ~]# oc get node
NAME                                       STATUS                     AGE       VERSION
qe-gpei-etcd-sc-2-master-1                 Ready,SchedulingDisabled   13m       v1.6.1+5115d708d7
qe-gpei-etcd-sc-2-node-registry-router-1   Ready                      10m       v1.6.1+5115d708d7
[root@qe-gpei-etcd-sc-2-master-1 ~]# oc get pod
NAME                       READY     STATUS    RESTARTS   AGE
docker-registry-3-3w0x4    1/1       Running   0          7m
registry-console-1-m22cg   1/1       Running   0          8m
router-1-c3s21             1/1       Running   0          10m
[root@qe-gpei-etcd-sc-2-master-1 ~]# oc get project
NAME               DISPLAY NAME   STATUS
default                           Active
install-test                      Active
kube-public                       Active
kube-system                       Active
logging                           Active
management-infra                  Active
openshift                         Active
openshift-infra                   Active
test111                           Active

2. Add openshift_use_etcd_system_container=true to the Ansible inventory file, then re-run the byo/config.yml playbook


Actual results:
TASK [openshift_manage_node : Wait for Node Registration] **********************
...
FAILED - RETRYING: TASK: openshift_manage_node : Wait for Node Registration (1 retries left).
fatal: [qe-gpei-etcd-sc-2-node-registry-router-1.0630-1gj.qe.rhcloud.com -> qe-gpei-etcd-sc-2-master-1.0630-1gj.qe.rhcloud.com]: FAILED! => {
    "attempts": 50, 
    "changed": false, 
    "failed": true, 
    "results": {
        "cmd": "/usr/local/bin/oc get node qe-gpei-etcd-sc-2-node-registry-router-1 -o json -n default", 
        "results": [
            {}
        ], 
        "returncode": 0, 
        "stderr": "Error from server (NotFound): nodes \"qe-gpei-etcd-sc-2-node-registry-router-1\" not found\n", 
        "stdout": ""
    }, 
    "state": "list"
}
fatal: [qe-gpei-etcd-sc-2-master-1.0630-1gj.qe.rhcloud.com -> qe-gpei-etcd-sc-2-master-1.0630-1gj.qe.rhcloud.com]: FAILED! => {
    "attempts": 50, 
    "changed": false, 
    "failed": true, 
    "results": {
        "cmd": "/usr/local/bin/oc get node qe-gpei-etcd-sc-2-master-1 -o json -n default", 
        "results": [
            {}
        ], 
        "returncode": 0, 
        "stderr": "Error from server (NotFound): nodes \"qe-gpei-etcd-sc-2-master-1\" not found\n", 
        "stdout": ""
    }, 
    "state": "list"
}
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/config.retry
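
One way to read the failure output above: the oc command exits with returncode 0, but "results" holds only an empty object and stderr reports NotFound. The sketch below is a hypothetical helper (node_found is not part of openshift-ansible, and this is not the role's actual retry condition) that classifies such a result, using values copied from the output above.

```python
def node_found(task_result):
    """Return True only if the result carries node data and no NotFound error.

    Hypothetical helper for illustration; the field names follow the
    task output shown in this report.
    """
    inner = task_result.get("results", {})
    items = inner.get("results", [])
    stderr = inner.get("stderr", "")
    # An empty dict in the list means oc returned no node object.
    has_data = any(bool(item) for item in items)
    return has_data and "NotFound" not in stderr


# Values copied from the failed task output for the master node.
failed_result = {
    "attempts": 50,
    "changed": False,
    "failed": True,
    "results": {
        "cmd": "/usr/local/bin/oc get node qe-gpei-etcd-sc-2-master-1 -o json -n default",
        "results": [{}],
        "returncode": 0,
        "stderr": 'Error from server (NotFound): nodes "qe-gpei-etcd-sc-2-master-1" not found\n',
        "stdout": "",
    },
    "state": "list",
}

print(node_found(failed_result))
```

The point of the sketch is that returncode alone is not a reliable signal here; the empty results list and the stderr text are what indicate the node is missing.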


On master host:
[root@qe-gpei-etcd-sc-2-master-1 ~]# oc get node
No resources found.
[root@qe-gpei-etcd-sc-2-master-1 ~]# oc get pod
No resources found.
[root@qe-gpei-etcd-sc-2-master-1 ~]# oc get project
NAME               DISPLAY NAME   STATUS
default                           Active
kube-public                       Active
kube-system                       Active
management-infra                  Active
openshift                         Active
openshift-infra                   Active
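
Comparing the two "oc get project" listings makes the data loss explicit. A minimal sketch, with the project names copied from the outputs in this report (missing_projects is an illustrative helper name):

```python
# Project names copied from `oc get project` before the playbook re-run.
projects_before = {
    "default", "install-test", "kube-public", "kube-system", "logging",
    "management-infra", "openshift", "openshift-infra", "test111",
}

# Project names copied from `oc get project` after the playbook re-run.
projects_after = {
    "default", "kube-public", "kube-system",
    "management-infra", "openshift", "openshift-infra",
}


def missing_projects(before, after):
    """Return the sorted list of projects present before but absent after."""
    return sorted(set(before) - set(after))


print(missing_projects(projects_before, projects_after))
# prints ['install-test', 'logging', 'test111']
```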


On etcd host:
[root@qe-gpei-etcd-sc-2-etcd-1 ~]# ls -R /var/lib/etcd/member/
/var/lib/etcd/member/:
snap  wal

/var/lib/etcd/member/snap:
db

/var/lib/etcd/member/wal:
0000000000000000-0000000000000000.wal

[root@qe-gpei-etcd-sc-2-etcd-1 ~]# 
[root@qe-gpei-etcd-sc-2-etcd-1 ~]# ls -R /var/lib/etcd/etcd.etcd/etcd.etcd/member/
/var/lib/etcd/etcd.etcd/etcd.etcd/member/:
snap  wal

/var/lib/etcd/etcd.etcd/etcd.etcd/member/snap:
db

/var/lib/etcd/etcd.etcd/etcd.etcd/member/wal:
0000000000000000-0000000000000000.wal  0.tmp
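
The two listings above show separate member directories under /var/lib/etcd, one at the top level and one nested under etcd.etcd/etcd.etcd. A minimal sketch for spotting that situation on a host, assuming the same layout as shown here (find_member_dirs is an illustrative helper, not part of etcd or openshift-ansible):

```python
import os


def find_member_dirs(root):
    """Return sorted paths of directories named 'member' that hold both 'snap' and 'wal'."""
    found = []
    for dirpath, dirnames, _filenames in os.walk(root):
        if os.path.basename(dirpath) == "member" and {"snap", "wal"} <= set(dirnames):
            found.append(dirpath)
    return sorted(found)


if __name__ == "__main__":
    # Path taken from the listings in this report.
    member_dirs = find_member_dirs("/var/lib/etcd")
    for path in member_dirs:
        print(path)
    if len(member_dirs) > 1:
        print("warning: more than one etcd member directory found")
```

More than one match suggests that etcd may have been started against a different data directory than the one holding the existing data.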


Expected results:


Additional info:

Comment 1 Giuseppe Scrivano 2017-07-03 19:52:56 UTC
patch proposed here:

https://github.com/openshift/openshift-ansible/pull/4668

Comment 3 Gaoyun Pei 2017-07-06 08:52:36 UTC
Verified this bug with openshift-ansible-3.6.135-1.git.0.5533fe3.el7.noarch

1. Set up a containerized ocp-3.6 environment

2. Add openshift_use_etcd_system_container=true to the Ansible inventory file, then re-run the byo/config.yml playbook

The playbook finished successfully. After the re-run, the cluster is working well:
the previous project data still exists
the node is available and the old pod is running
the etcd service is running via system container

Comment 5 errata-xmlrpc 2017-08-10 05:28:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716