Aleks, could you check the contents of /etc/ansible/facts.d/openshift.fact and paste its contents somewhere? Would you be able to re-try with this change applied manually? I think it might fix it:

$ git diff
diff --git a/playbooks/common/openshift-cluster/upgrades/pre.yml b/playbooks/common/openshift-cluster/upgrades/pre.yml
index 5c8803e..7b0066b 100644
--- a/playbooks/common/openshift-cluster/upgrades/pre.yml
+++ b/playbooks/common/openshift-cluster/upgrades/pre.yml
@@ -232,10 +232,23 @@
 ###############################################################################
 # Backup etcd
 ###############################################################################
+# If facts cache were for some reason deleted, this fact may not be set, and if not set
+# it will always default to true. This causes problems for the etcd data dir fact detection
+# so we must first make sure this is set correctly before attempting the backup.
+- name: Set master embedded_etcd fact
+  hosts: oo_masters_to_config
+  roles:
+  - openshift_facts
+  tasks:
+  - openshift_facts:
+      role: master
+      local_facts:
+        embedded_etcd: "{{ groups.oo_etcd_to_config | length == 0 }}"
+
 - name: Backup etcd
   hosts: etcd_hosts_to_backup
   vars:
-    embedded_etcd: "{{ hostvars[groups.oo_first_master.0].openshift.master.embedded_etcd }}"
+    embedded_etcd: "{{ groups.oo_etcd_to_config | default([]) | length == 0 }}"
     timestamp: "{{ lookup('pipe', 'date +%Y%m%d%H%M%S') }}"
   roles:
   - openshift_facts
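For context, the "groups.oo_etcd_to_config | length == 0" test keys off whether the inventory declares dedicated etcd hosts; to my understanding openshift-ansible derives oo_etcd_to_config from the [etcd] inventory group. A minimal sketch of an inventory where embedded_etcd would correctly evaluate to false (hostnames hypothetical):

[OSEv3:children]
masters
etcd
nodes

[masters]
master1.example.com

[etcd]
etcd1.example.com
etcd2.example.com
etcd3.example.com

[nodes]
node1.example.com

With an empty or absent [etcd] group the expression evaluates to true and the backup logic treats etcd as embedded in the master.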
[root@itsrv1554 ~]# downloads/jq-linux64 < /etc/ansible/facts.d/openshift.fact
{
  "node": {},
  "docker": {
    "blocked_registries": [],
    "hosted_registry_network": "172.30.0.0/16",
    "insecure_registries": [],
    "additional_registries": []
  },
  "master": {
    "ha": true
  },
  "common": {
    "generate_no_proxy_hosts": true,
    "cluster_id": "default",
    "is_containerized": false,
    "deployment_type": "openshift-enterprise"
  }
}

I have added this snippet to the playbook, but the etcd backup runs just fine?! Tonight I will run the update again and update the ticket. Is anyone available on the weekend? This is a priority 1 ticket because our production is affected by this issue.
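Note that the "master" section above only carries "ha"; the facts the upgrade expects are not in the cache. A quick way to check whether a given fact key exists anywhere in the file (my suggestion, assuming jq 1.5+ builtins; swap in "debug_level" to test that key) is:

downloads/jq-linux64 '[.. | objects | has("embedded_etcd")] | any' < /etc/ansible/facts.d/openshift.fact

This prints true or false; against the fact file above it would print false, matching the missing fact.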
Slight correction to the above patch, an additional line for the debug level: https://gist.github.com/dgoodwin/d06c8f89f5d78349166f4f26509b20b0
Aleks: The additional line I just added to the gist linked in comment #3 sets debug_level as well. embedded_etcd was one way this surfaced; QE later found a problem specifically with debug_level when they fully removed the fact cache prior to upgrade.

I believe this is related to problems that arise when the fact cache is either deleted between cluster setup and upgrade, or simply doesn't contain something we expect it to (perhaps because the version originally installed was older; we haven't yet pinned down how that is possible). In any case, the real problem is that the upgrade assumed certain facts were set without explicitly making sure they were, so it would only pass if the cache was present and contained the expected values. The above change ensures the relevant master facts are set, specifically the two that caused failures in upgrade.

I will be around at times and monitoring my email. Please just make sure to disregard my patch in comment #1 and use the gist in comment #3.

Proposed PR: https://github.com/openshift/openshift-ansible/pull/2826
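In case the gist link rots, this is roughly the shape of the play it describes. Treat it as a sketch: the debug_level default of 2 and the exact variable names are assumptions on my part, so prefer the gist and the PR if they differ.

- name: Set master facts needed by upgrade
  hosts: oo_masters_to_config
  roles:
  - openshift_facts
  tasks:
  - openshift_facts:
      role: master
      local_facts:
        # Recompute instead of trusting the possibly-missing fact cache.
        embedded_etcd: "{{ groups.oo_etcd_to_config | default([]) | length == 0 }}"
        # Assumed default of 2; the gist may source this differently.
        debug_level: "{{ openshift_master_debug_level | default(2) }}"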
I can reproduce the original issue reliably: install an HA cluster with the latest 3.2 installer (openshift-ansible-3.2.42-1), then on each master edit /etc/ansible/facts.d/openshift.fact and remove all 3 of the "debug_level" keys. Re-running the 3.2 upgrade then fails:

TASK [Create the master api service env file] **********************************
Friday 18 November 2016 13:54:07 -0400 (0:00:01.011) 0:06:12.699 *******
fatal: [ec2-75-101-226-223.compute-1.amazonaws.com]: FAILED! => {"changed": false, "failed": true, "msg": "AnsibleUndefinedVariable: 'dict object' has no attribute 'debug_level'"}
fatal: [ec2-107-20-5-117.compute-1.amazonaws.com]: FAILED! => {"changed": false, "failed": true, "msg": "AnsibleUndefinedVariable: 'dict object' has no attribute 'debug_level'"}
fatal: [ec2-54-165-253-95.compute-1.amazonaws.com]: FAILED! => {"changed": false, "failed": true, "msg": "AnsibleUndefinedVariable: 'dict object' has no attribute 'debug_level'"}

Re-trying with the fix from comment #3 and the pull request referenced above, the upgrade completes successfully. I am very confident this will fix the customer's issue; it was likely caused by installing with an older 3.2 release from before these facts existed. I will be submitting a separate PR against master to significantly reduce the complexity in how these variables are used.
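For anyone else reproducing this, stripping the debug_level keys can be scripted rather than edited by hand. A sketch, assuming jq 1.5+ for the del(.. | .key?) idiom; back up the fact file first:

# On each master: save a copy, then remove every debug_level key recursively.
cp /etc/ansible/facts.d/openshift.fact /root/openshift.fact.bak
jq 'del(.. | .debug_level?)' /root/openshift.fact.bak > /etc/ansible/facts.d/openshift.fact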
WOW, unbelievable. The update now ran successfully 8-O. Finally, after ~2 1/2 months, the bug was found and fixed.

Best regards
Aleks
The Ansible playbook works. Verified with ansible-2.2.0.0-1.el7.noarch and openshift-ansible-3.2.43-1.git.0.fe29bec.el7.noarch.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:2814