Bug 1391608 - Upgrade Playbook from 3.3.0.35 to 3.3.1.3 failed on checking embedded etcd on multi-master/etcd environment
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Upgrade
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 3.3.1
Assigned To: Devan Goodwin
QA Contact: Anping Li
: Unconfirmed
Depends On:
Blocks:
Reported: 2016-11-03 11:46 EDT by Eric Jones
Modified: 2016-11-15 14:11 EST

See Also:
Fixed In Version: openshift-ansible-3.3.50-1.git.0.5bdbeaa.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-11-15 14:11:02 EST
Type: Bug
Regression: ---


Attachments
Ansible hosts and ansible logs (80.57 KB, text/plain)
2016-11-14 00:48 EST, Anping Li


External Trackers
Tracker: Red Hat Product Errata
ID: RHSA-2016:2778
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: atomic-openshift-utils security and bug fix update
Last Updated: 2016-11-15 19:08:29 EST

Description Eric Jones 2016-11-03 11:46:27 EDT
While running the upgrade playbook on a multi-master, multi-etcd environment, the playbook failed at the "Check current embedded etcd disk usage" task [0], even though etcd in this environment should not be considered embedded.

Attaching hosts file shortly


[0]
TASK [Check current embedded etcd disk usage] **********************************
fatal: [<IP>]: FAILED! => {
    "failed": true
}

MSG:

the field 'args' has an invalid value, which appears to include a variable that is undefined. The error was: 'dict object' has no attribute 'etcd_data_dir'

The error appears to have been in '/usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/upgrades/upgrade_control_plane.yml': line 47, column 5, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

  # TODO: replace shell module with command and update later checks
  - name: Check current embedded etcd disk usage
changed: [<IP>]
changed: [<IP>
Comment 3 Devan Goodwin 2016-11-04 12:08:07 EDT
I think I have found a reproducer: if I install a cluster, remove /etc/ansible/facts.d/openshift.fact on each master, and then re-run a 3_3 upgrade, it fails with exactly this error.

It appears openshift_master_etcd_hosts is not being set during upgrade, but the error is hidden if the cached fact is still present from the original config.yml run at cluster setup.

It looks as though their fact cache was removed, or the cached value disappeared in some other way. Working on a fix now.
Comment 4 Devan Goodwin 2016-11-04 13:28:13 EDT
If the customer uses the config.yml playbook (the one used for installation) for ongoing maintenance, re-running it appears to regenerate the facts cache, after which the upgrade should complete. However, my understanding is that customers seldom use this playbook for ongoing maintenance.
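
A pre-upgrade check along the following lines might show whether the cached facts file discussed above is still present on a master. This is only a sketch: the path comes from the earlier comment, while the function name fact_cache_status and the assumption that the file holds JSON are illustrative and not taken from openshift-ansible.

```python
import json
from pathlib import Path

# Path taken from the comments above; treating the file as JSON is an assumption.
DEFAULT_FACT_FILE = Path("/etc/ansible/facts.d/openshift.fact")


def fact_cache_status(path=DEFAULT_FACT_FILE):
    """Hypothetical helper: classify a local facts file.

    Returns "missing" if the file does not exist, "unreadable" if it
    cannot be read or parsed as JSON, and "present" otherwise.
    """
    path = Path(path)
    if not path.is_file():
        return "missing"
    try:
        json.loads(path.read_text())
    except (OSError, ValueError):
        # json.JSONDecodeError and UnicodeDecodeError are ValueError subclasses.
        return "unreadable"
    return "present"
```

Calling fact_cache_status() on each master before the upgrade would flag hosts where the cache is gone; on such hosts, re-running config.yml as described above is the suggested way to regenerate it.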
Comment 5 Devan Goodwin 2016-11-04 15:14:09 EDT
Proposed fix:

https://github.com/openshift/openshift-ansible/pull/2730

Steps to reproduce for QE:

ansible masters -i ./hosts -a "rm /etc/ansible/facts.d/openshift.fact"
Comment 6 Devan Goodwin 2016-11-09 12:26:03 EST
I'm not 100% sure how the customer hit this, but I believe the above step is the best way to reproduce this bug.

The problem likely cannot affect embedded etcd deployments, or deployments with etcd on entirely separate hosts. I believe it will only trigger when etcd is colocated on the masters.

We found that the master facts were not being fully loaded and therefore defaulted to embedded etcd being true. That causes the etcd fact loading to fail on a missing file, because etcd is not actually embedded.

Fix: https://github.com/openshift/openshift-ansible/pull/2730

I have tested on containerized co-located etcd, rpm embedded etcd, rpm separate etcd hosts, and rpm co-located etcd.
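
The failure mode described in this comment can be modeled roughly as below. This is a simplified sketch and not the openshift-ansible code: the function name treat_etcd_as_embedded, the embedded_etcd key, and the order of the checks are assumptions chosen to mirror the description (no etcd host list passed during upgrade, a cached value masking the problem, and a default of "embedded" when neither is available).

```python
def treat_etcd_as_embedded(etcd_hosts, cached_facts):
    """Hypothetical model of how a master decides whether etcd is embedded.

    etcd_hosts   -- list of etcd hosts passed to the play, or None if unset
    cached_facts -- dict of facts left over from an earlier config.yml run
    """
    # An explicit host list means etcd runs as its own service
    # (co-located on the masters or on separate hosts).
    if etcd_hosts:
        return False
    # Without a host list, a cached value from installation still
    # gives the right answer and hides the missing variable.
    if "embedded_etcd" in cached_facts:
        return cached_facts["embedded_etcd"]
    # With neither, the default is "embedded", which is wrong for a
    # cluster whose etcd is co-located on the masters.
    return True


# Upgrade without the host list, cache intact: correct result (False).
masked = treat_etcd_as_embedded(None, {"embedded_etcd": False})
# Upgrade without the host list, cache removed: wrong default (True).
broken = treat_etcd_as_embedded(None, {})
# Host list supplied: correct regardless of the cache (False).
fixed = treat_etcd_as_embedded(["master1.example.com"], {})
```

In this model the second call corresponds to the reported failure: etcd is treated as embedded, so the upgrade goes on to the embedded etcd checks. The third call reflects the general idea of supplying the host information during upgrade, under the assumptions stated above.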
Comment 8 Anping Li 2016-11-14 00:48 EST
Created attachment 1220260
Ansible hosts and ansible logs

Upgrade failed.

AnsibleUndefinedVariable: 'dict object' has no attribute 'debug_level'
fatal: [openshift-190.lab.eng.nay.redhat.com]: FAILED! => {
    "changed": false,
    "failed": true
}

MSG:

AnsibleUndefinedVariable: 'dict object' has no attribute 'debug_level'
        to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_3/upgrade.retry
Comment 11 Anping Li 2016-11-15 02:12:06 EST
It works well on atomic-openshift-utils-3.4.25-1
Comment 12 errata-xmlrpc 2016-11-15 14:11:02 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:2778
