Description of problem:

Version-Release number of the following components:
openshift-ansible v3.6.173.0.5
ansible 2.2.3.0

How reproducible:
Rare

Steps to Reproduce:
1. Ran upgrade on large HA cluster (>100 nodes). Occurred on one.

Actual results:
Using module file /usr/lib/python2.7/site-packages/ansible/modules/core/packaging/os/yum.py
<54.174.162.89> ESTABLISH SSH CONNECTION FOR USER: root
<54.174.162.89> SSH: EXEC ssh -o ControlMaster=auto -o ControlPersist=600s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/home/opsmedic/.ansible/cp/ansible-ssh-%h-%p-%r 54.174.162.89 '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
fatal: [starter-us-east-1-node-compute-8bcb6]: FAILED! => {
    "changed": false,
    "failed": true,
    "invocation": {
        "module_args": {
            "conf_file": null,
            "disable_gpg_check": false,
            "disablerepo": null,
            "enablerepo": null,
            "exclude": null,
            "install_repoquery": true,
            "list": null,
            "name": [
                "rQ4"
            ],
            "state": "latest",
            "update_cache": false,
            "validate_certs": true
        }
    },
    "msg": "Traceback (most recent call last):\n File \"/usr/bin/yum\", line 29, in <module>\n yummain.user_main(sys.argv[1:], exit_code=True)\n File \"/usr/share/yum-cli/yummain.py\", line 370, in user_main\n errcode = main(args)\n File \"/usr/share/yum-cli/yummain.py\", line 179, in main\n result, resultmsgs = base.doCommands()\n File \"/usr/share/yum-cli/cli.py\", line 573, in doCommands\n return self.yum_cli_commands[self.basecmd].doCommand(self, self.basecmd, self.extcmds)\n File \"/usr/share/yum-cli/yumcommands.py\", line 1626, in doCommand\n ypl = base.returnPkgLists(extcmds, repoid=repoid)\n File \"/usr/share/yum-cli/cli.py\", line 1400, in returnPkgLists\n ignore_case=True, repoid=repoid)\n File \"/usr/lib/python2.7/site-packages/yum/__init__.py\", line 3005, in doPackageLists\n for (n,a,e,v,r) in self.up.getUpdatesList():\n File \"/usr/lib/python2.7/site-packages/yum/__init__.py\", line 1093, in <lambda>\n up = property(fget=lambda self: self._getUpdates(),\n File \"/usr/lib/python2.7/site-packages/yum/__init__.py\", line 838, in _getUpdates\n self._up = rpmUtils.updates.Updates(self.rpmdb.simplePkgList(), self.pkgSack.simplePkgList())\n File \"/usr/lib/python2.7/site-packages/yum/__init__.py\", line 1074, in <lambda>\n pkgSack = property(fget=lambda self: self._getSacks(),\n File \"/usr/lib/python2.7/site-packages/yum/__init__.py\", line 778, in _getSacks\n self.repos.populateSack(which=repos)\n File \"/usr/lib/python2.7/site-packages/yum/repos.py\", line 386, in populateSack\n sack.populate(repo, mdtype, callback, cacheonly)\n File \"/usr/lib/python2.7/site-packages/yum/yumRepo.py\", line 242, in populate\n mydbtype)\n File \"/usr/lib/python2.7/site-packages/yum/yumRepo.py\", line 287, in _check_uncompressed_db_gen\n cached=repo.cache)\n File \"/usr/lib/python2.7/site-packages/yum/misc.py\", line 1165, in repo_gen_decompress\n return decompress(filename, dest=dest, check_timestamps=True)\n File \"/usr/lib/python2.7/site-packages/yum/misc.py\", line 1152, in decompress\n os.utime(out, (fi.st_mtime, fi.st_mtime))\nOSError: [Errno 2] No such file or directory: '/var/cache/yum/x86_64/7Server/rhel-7-server-rpms/gen/primary_db.sqlite'\n",
    "rc": 1,
    "results": []
}
We need to add failure tolerance to all node operations rather than just the drain and upgrade phases.
We've added retries around yum transactions.
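For readers who want the general shape of "retries around a transaction", here is a minimal Python sketch. It is not the change made in the pull request linked in this thread; the function name `run_with_retries`, its parameters, and the default values are all assumptions chosen for illustration. It reruns a command until it exits 0 or the attempt budget is used up.

```python
import subprocess
import time


def run_with_retries(cmd, retries=3, delay=5.0,
                     run=subprocess.call, sleep=time.sleep):
    """Run cmd until it exits 0 or `retries` attempts have been made.

    Hypothetical helper for illustration only.
    Returns a tuple (return_code, attempts_made).
    `run` and `sleep` are injectable so the logic can be exercised
    without touching the real package manager.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    rc = None
    for attempt in range(1, retries + 1):
        rc = run(cmd)
        if rc == 0:
            # Success: stop immediately and report how many tries it took.
            return rc, attempt
        if attempt < retries:
            # Transient failure assumed: wait before the next attempt.
            sleep(delay)
    # Every attempt failed: hand back the last non-zero return code.
    return rc, retries
```

A call such as `run_with_retries(["yum", "-y", "install", "some-package"])` (package name hypothetical) would make up to three attempts, five seconds apart. This only helps with failures that are transient, which a rare, single-node failure like the one above appears to be; a persistent error is still reported once the attempts are exhausted.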
https://github.com/openshift/openshift-ansible/pull/5401
Verified in https://bugzilla.redhat.com/show_bug.cgi?id=1482551#c8; result: PASS.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188