Description of problem:

The advanced installer has failed for me a couple of times while I was monitoring progress from another terminal with something like:

  watch -n 3 'for prj in default logging openshift-infra ; do ssh -t master01 oc get pods -n $prj ; done 2>/dev/null'

The run fails with the following error:

...

TASK [openshift_cli_facts : openshift_facts] ***********************************
ok: [master01.example.com]

TASK [openshift_cli : Install clients] *****************************************
skipping: [master01.example.com]

TASK [openshift_cli : Pull CLI Image] ******************************************
ok: [master01.example.com]

TASK [openshift_cli : Copy client binaries/symlinks out of CLI image for use on the host] ***
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: IOError: [Errno 26] Text file busy: '/usr/local/bin/oc'
fatal: [master01.example.com]: FAILED! => {
    "changed": false,
    "failed": true,
    "module_stderr": "Traceback (most recent call last):\n File \"/tmp/ansible_Gg_jsw/ansible_module_openshift_container_binary_sync.py\", line 136, in <module>\n main()\n File \"/tmp/ansible_Gg_jsw/ansible_module_openshift_container_binary_sync.py\", line 127, in main\n binary_syncer.sync()\n File \"/tmp/ansible_Gg_jsw/ansible_module_openshift_container_binary_sync.py\", line 75, in sync\n self._sync_binary('oc')\n File \"/tmp/ansible_Gg_jsw/ansible_module_openshift_container_binary_sync.py\", line 107, in _sync_binary\n shutil.move(src_path, dest_path)\n File \"/usr/lib64/python2.7/shutil.py\", line 301, in move\n copy2(src, real_dst)\n File \"/usr/lib64/python2.7/shutil.py\", line 130, in copy2\n copyfile(src, dst)\n File \"/usr/lib64/python2.7/shutil.py\", line 83, in copyfile\n with open(dst, 'wb') as fdst:\nIOError: [Errno 26] Text file busy: '/usr/local/bin/oc'\n",
    "module_stdout": ""
}

MSG:

MODULE FAILURE

This has not been observed with RHEL installations. It does not seem an unreasonable requirement to be able to monitor the installation while it is ongoing. I have not investigated whether this can also happen during upgrades if there is some casual use of the oc command at that time.

Version-Release number of selected component (if applicable): 3.4.1.5
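For what it's worth, the traceback shows the underlying failure: shutil.move() falls back to copyfile(), which opens /usr/local/bin/oc for writing, and on Linux that returns ETXTBSY whenever the file is currently being executed (here, by the oc get pods calls above). As a minimal sketch only (illustrative paths, not the actual openshift_container_binary_sync code), writing the new binary to a temp file in the same directory and renaming it into place avoids the error, since rename(2) replaces the directory entry without touching the running executable's text segment:

# Hypothetical sketch, not the actual installer module: replace a
# possibly-running binary atomically instead of opening it for writing.
import os
import shutil
import tempfile

def replace_binary(src_path, dest_path):
    """Copy src_path over dest_path without hitting ETXTBSY.

    shutil.move()/copyfile() open dest_path with open(dst, 'wb'), which fails
    with [Errno 26] Text file busy if something (e.g. 'oc get pods') is
    executing that file. Writing to a temp file in the same directory and
    then os.rename()-ing it into place sidesteps that, because rename(2)
    swaps the directory entry rather than rewriting the inode in use.
    """
    dest_dir = os.path.dirname(dest_path)
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix='.oc-sync-')
    os.close(fd)
    try:
        shutil.copy2(src_path, tmp_path)
        os.chmod(tmp_path, 0o755)
        os.rename(tmp_path, dest_path)   # atomic within the same filesystem
    except Exception:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise

# e.g. replace_binary('/tmp/cli-image/oc', '/usr/local/bin/oc')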
It's a race condition that can occur on any containerized install where someone is also using the oc binary while we attempt to update it. We'd need to lock the file and retry, but they could also be running a long-running oc command, so there's still potential for failure. The workaround is simply to leave the hosts alone while the installer is running, so setting priority to low until we start seeing cases attached to this BZ.
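To illustrate why a retry only narrows the window rather than closing it: a short-lived oc invocation releases the binary between attempts, but a long-running one outlasts any reasonable retry budget. A rough sketch of such a retry (the function name and limits are made up for illustration, not proposed installer code):

# Hypothetical retry around the binary sync; a long-running 'oc'
# invocation on the host can still outlast every attempt.
import errno
import time

def sync_with_retry(sync_fn, attempts=5, delay=2.0):
    for attempt in range(1, attempts + 1):
        try:
            return sync_fn()
        except IOError as err:
            if err.errno != errno.ETXTBSY or attempt == attempts:
                raise            # give up: someone is still executing the binary
            time.sleep(delay)    # back off and hope the oc invocation finished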
Thinking through this a little bit, I can think of a few workarounds, but none that would be very elegant. Here is an example: we could provide a temporary oc for use when there is a need to run oc commands on the target hosts (assuming oc already exists on the host). The problem with this is that the operator would need to use the temporary oc and not the official one; if they did use the official one, the error would occur just as noted. I agree with Scott's workaround of noting not to use oc on the target hosts. If others end up hitting this, we can figure out some kind of workable path.
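For completeness, the temporary-client idea amounts to something as small as the sketch below: stage a copy under a different name before the run and point operators at it. The paths are illustrative only, and it still fails the moment anyone invokes the official /usr/local/bin/oc.

# Hypothetical sketch of staging a throwaway client for operators to use
# while the installer replaces the official one; paths are illustrative.
import shutil

OFFICIAL = '/usr/local/bin/oc'       # path the installer rewrites (from the traceback)
TEMPORARY = '/usr/local/bin/oc-tmp'  # copy operators would invoke instead

shutil.copy2(OFFICIAL, TEMPORARY)
print('During the install, run %s instead of %s' % (TEMPORARY, OFFICIAL))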
Just ran into this on a containerized install on RHEL as well, so it's not limited to Atomic Host.
There appear to be no active cases related to this bug. As such, we're closing it in order to focus on bugs that are still tied to active customer cases. Please re-open this bug if you feel it was closed in error or if a new active case is attached.