Description of problem:
Encountered this error during large cluster node upgrade:

fatal: [starter-us-west-2-node-compute-07d37]: FAILED! => {"changed": false, "failed": true, "msg": {"cmd": "/usr/bin/repoquery --plugins --quiet --pkgnarrow=repos --queryformat=%{version}|%{release}|%{arch}|%{repo}|%{version}-%{release} --config=/tmp/tmphbrdnv atomic-openshift-excluder", "package_found": false, "results": {}, "returncode": 1, "stderr": "rhel-7-server-rpms: Check uncompressed DB failed\n", "stdout": ""}}

Version-Release number of selected component (if applicable):
3.6.173.0.5

How reproducible:
Low

Steps to Reproduce:
1. Run node upgrade on large cluster & hope

Additional info:
After encountering this error, I ran repoquery from the same node and it did not report an error.
This bug belongs to a general class of problems that call for retrying operations involving rpmdb, yum, and repoquery. Luke is proposing a retry pattern that we could apply as a general solution to this problem: https://github.com/openshift/openshift-ansible/pull/5125
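For illustration only, here is a minimal sketch of the kind of retry pattern being discussed. It is not the code from the PR above. The function name run_with_retries and its parameters (retries, delay, runner) are hypothetical, and it assumes a transient failure shows up as a non-zero return code, as in the report above.

```python
import subprocess
import time


def run_with_retries(cmd, retries=3, delay=1.0, runner=subprocess.run):
    """Run cmd up to `retries` times, stopping at the first zero exit code.

    Hypothetical helper for illustration; `runner` is injectable so the
    retry logic can be exercised without calling a real binary.
    Returns the result object of the last attempt.
    """
    result = None
    for attempt in range(1, retries + 1):
        result = runner(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
        if result.returncode == 0:
            break  # success, no further attempts needed
        if attempt < retries:
            time.sleep(delay)  # wait before the next attempt
    return result
```

A caller could wrap the failing query as run_with_retries(["/usr/bin/repoquery", "--plugins", "--quiet", "--pkgnarrow=repos", "atomic-openshift-excluder"]) and inspect returncode and stderr on the returned object. Since rerunning repoquery by hand on the same node succeeded, a short delay between attempts seems a reasonable starting point, though the right retry count and delay are a judgment call.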
That PR is waiting on some package spec issues to be worked out in https://github.com/openshift/openshift-ansible/pull/4264 so that the new action_plugin path can be added (otherwise tasks would be totally broken under the RPM).
With https://github.com/openshift/openshift-ansible/pull/5125 stalled at the moment, I disentangled the repoquery fixes and made a new PR: https://github.com/openshift/openshift-ansible/pull/5401
This bug is very hard to reproduce in QE's cluster, so QE can only verify it via code review and by making sure no regression is introduced. Re-tested with openshift-ansible-3.7.0-0.126.4.git.0.3fc2b9b.el7.noarch: the PR is merged and does not introduce any regression. However, retries have not been added for repoquery_cmd in playbooks/common/openshift-cluster/upgrades/docker/upgrade_check.yml.
You are right, I missed that one because I wasn't looking in the playbooks for tasks. Thank you for catching this.
https://github.com/openshift/openshift-ansible/pull/5401 merged to fix this further.
My apologies. The follow-on PR was https://github.com/openshift/openshift-ansible/pull/5464
Verified this bug with openshift-ansible-3.7.0-0.127.0.git.0.b9941e4.el7.noarch: PASS. The PR is merged and no regression was found.
I don't think this change needs to be documented, as it only partially addresses the issue. I'd rather say something about it once the yum retries PR merges.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188