Description of problem:

The playbook fails with:

-------
TASK [etcd_server_certificates : Sign and create the peer crt] *****************
changed: [atom0011.example.com -> atom0010.example.com]

TASK [etcd_server_certificates : file] *****************************************
fatal: [atom0011.example.com -> atom0010.example.com] FAILED! => {
    "changed": false,
    "dest": "/etc/etcd/generated_certs/etcd-atom0011.example.com/ca.crt",
    "failed": true,
    "gid": 0,
    "group": "root",
    "mode": "0644",
    "owner": "root",
    "secontext": "unconfined_u:object_r:etc_t:s0",
    "size": 1895,
    "src": "/etc/etcd/ca/ca.crt",
    "state": "file",
    "uid": 0
}

MSG:

Cannot link, file exists at destination
-------

This leaves the etcd cluster in a down/broken state.

etcd nodes:
atom0010
atom0011
atom0015

Version-Release number of selected component (if applicable):
openshift-ansible-roles-3.6.173.0.21-2.git.0.44a4038.el7.noarch
Atomic Host 7.4.1

How reproducible:
Unconfirmed
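For context, the failing task creates a hard link with Ansible's file module, which refuses to link over an existing regular file unless force: yes is set. The sketch below shows that failing pattern only in rough shape; the paths, loop, and variable names are assumptions, not the actual openshift-ansible task:

-------
# Minimal sketch of the failing pattern; names and paths are illustrative.
- name: Hard-link the CA cert into each per-host generated_certs dir
  file:
    src: /etc/etcd/ca/ca.crt
    dest: "/etc/etcd/generated_certs/etcd-{{ item }}/ca.crt"
    state: hard
  delegate_to: "{{ etcd_ca_host }}"   # assumed variable for the CA host
  with_items: "{{ groups['etcd'] }}"
  # Without force: yes, re-running against an existing generated_certs
  # directory fails with "Cannot link, file exists at destination".
-------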
Can you get `ls -la /etc/etcd/generated_certs/etcd-atom0011.example.com/` for us? I imagine they've manually created some symlinks in there?
I think that's what happened; the customer restored from a snapshot and the issue is no longer present. Closing with INSUFFICIENT DATA. Sorry for the inconvenience!
After discussion with Andrew, we believe this is a real issue, so re-opening it. The problem occurs when the generated_certs directory already exists for the hosts we're scaling up; this particular customer will work around that by removing those directories prior to the v2 to v3 migration (a sketch of that workaround follows below).
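For reference, the workaround can be expressed as a small play run against the CA host before the migration. This is a sketch under assumptions taken from the failure output above (the generated_certs directories live on the first etcd host), not a supported procedure:

-------
# Pre-migration cleanup sketch; paths assumed from the error output.
- hosts: etcd[0]
  tasks:
    - name: Find previously generated per-host certificate directories
      find:
        paths: /etc/etcd/generated_certs
        file_type: directory
        patterns: 'etcd-*'
      register: stale_cert_dirs

    - name: Remove them so the scaleup tasks can regenerate cleanly
      file:
        path: "{{ item.path }}"
        state: absent
      with_items: "{{ stale_cert_dirs.files }}"
-------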
We've had no other reported cases of this, and the customer in this case was able to isolate the root cause to local modifications they'd made to their certificate file hierarchy that would be unlikely to occur in other scenarios.
I reviewed the case, and there's no indication that the problem encountered there matches the behavior described in this bug, other than that the etcd migration failed. If I were to guess why the migration failed in that case, the only thing I can come up with is that there's a proxy configured for the user ansible is running as; please see https://bugzilla.redhat.com/show_bug.cgi?id=1515667 for more details on that. Since I believe this particular bug is related to localized problems with the environment in which it was originally observed, and that customer has remedied those, I'm closing this again.
Hit the same error with the playbook; reopening the bug.

openshift-ansible-3.6.173.0.48-1.git.0.1609d30.el7.noarch
ansible-2.4.0.0-5.el7.noarch

ansible 2.4.0.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /bin/ansible
  python version = 2.7.5 (default, May 3 2017, 07:55:04) [GCC 4.8.5 20150623 (Red Hat 4.8.5-14)]

fatal: [osm2.abc.com -> osm1.abc.com]: FAILED! => {
    "changed": false,
    "dest": "/etc/etcd/generated_certs/etcd-osm2.abc.com/ca.crt",
    "failed": true,
    "gid": 0,
    "group": "root",
    "mode": "0600",
    "owner": "root",
    "secontext": "unconfined_u:object_r:etc_t:s0",
    "size": 1895,
    "src": "/etc/etcd/ca/ca.crt",
    "state": "file",
    "uid": 0
}

MSG:

Cannot link, file exists at destination
We intend to fix this by replacing the call to the existing etcd scaleup playbook with only the specific tasks required to scale the cluster back up during a v2 to v3 migration. The existing scaleup playbook generates certificates and performs several other tasks that are not necessary in this scenario.
This should be fixed by https://github.com/openshift/openshift-ansible/pull/7226. The scaleup playbook will no longer be called, so this task won't be executed during the migration.
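For illustration, the general shape of that change might look like the sketch below; the playbook path and task file name are hypothetical, so refer to the PR above for the actual diff:

-------
# Before (roughly): the migration imported the full scaleup playbook,
# which pulls in the etcd_server_certificates role and its hard-link task.
#   - include: ../openshift-etcd/scaleup.yml   # hypothetical path
#
# After (roughly): the migration runs only the member re-add steps,
# so no certificate generation happens at all.
- name: Scale etcd back up during the v2 to v3 migration
  hosts: etcd
  tasks:
    - name: Rejoin existing cluster members
      include_role:
        name: etcd
        tasks_from: migrate_member_readd   # hypothetical task file
-------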
The fix is available in openshift-ansible-3.6.173.0.104-1-4-g76aa5371e.
The fix for this issue is not yet released; sorry for the noise.
The PR is not in the latest 3.6 build (openshift-ansible-3.6.173.0.104-1-4-g76aa5371e).
In openshift-ansible-3.6.173.0.105-1
Fixed. Verified with openshift-ansible-3.6.173.0.110-1.git.0.ca81843.el7.noarch on Red Hat Enterprise Linux Atomic Host 7.4.1; no errors found during migration.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1106