Bug 1423363

Summary: Installation fails with text file busy when replacing an oc binary that is in use
Product: OpenShift Container Platform    Reporter: Marko Myllynen <myllynen>
Component: Installer                     Assignee: Steve Milner <smilner>
Status: CLOSED WONTFIX                   QA Contact: Johnny Liu <jialiu>
Severity: low                            Docs Contact:
Priority: low
Version: 3.4.0                           CC: aos-bugs, jokerman, jswensso, mmccomas
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:                        Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:                                Environment:
Last Closed: 2019-01-31 15:56:20 UTC     Type: Bug
Regression: ---                          Mount Type: ---
Documentation: ---                       CRM:
Verified Versions:                       Category: ---
oVirt Team: ---                          RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---                     Target Upstream Version:
Embargoed:

Description Marko Myllynen 2017-02-17 06:03:50 UTC
Description of problem:
The advanced installer has failed for me a couple of times with the following error while I was running something like "watch -n 3 'for prj in default logging openshift-infra ; do ssh -t master01 oc get pods -n $prj ; done 2>/dev/null'" to monitor the progress:

...
TASK [openshift_cli_facts : openshift_facts] ***********************************
ok: [master01.example.com]

TASK [openshift_cli : Install clients] *****************************************
skipping: [master01.example.com]

TASK [openshift_cli : Pull CLI Image] ******************************************
ok: [master01.example.com]

TASK [openshift_cli : Copy client binaries/symlinks out of CLI image for use on the host] ***
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: IOError: [Errno 26] Text file busy: '/usr/local/bin/oc'
fatal: [master01.example.com]: FAILED! => {
    "changed": false, 
    "failed": true, 
    "module_stderr": "Traceback (most recent call last):\n  File \"/tmp/ansible_Gg_jsw/ansible_module_openshift_container_binary_sync.py\", line 136, in <module>\n    main()\n  File \"/tmp/ansible_Gg_jsw/ansible_module_openshift_container_binary_sync.py\", line 127, in main\n    binary_syncer.sync()\n  File \"/tmp/ansible_Gg_jsw/ansible_module_openshift_container_binary_sync.py\", line 75, in sync\n    self._sync_binary('oc')\n  File \"/tmp/ansible_Gg_jsw/ansible_module_openshift_container_binary_sync.py\", line 107, in _sync_binary\n    shutil.move(src_path, dest_path)\n  File \"/usr/lib64/python2.7/shutil.py\", line 301, in move\n    copy2(src, real_dst)\n  File \"/usr/lib64/python2.7/shutil.py\", line 130, in copy2\n    copyfile(src, dst)\n  File \"/usr/lib64/python2.7/shutil.py\", line 83, in copyfile\n    with open(dst, 'wb') as fdst:\nIOError: [Errno 26] Text file busy: '/usr/local/bin/oc'\n", 
    "module_stdout": ""
}

MSG:

MODULE FAILURE

This has not been observed with RHEL installations.

It doesn't seem an unreasonable requirement to be able to monitor the installation while it's ongoing. I haven't investigated whether this can also happen during upgrades if there is some casual use of the oc command at that time.
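
For reference, the traceback shows shutil.move() falling back to a copy (presumably because the extracted binary sits on a different filesystem than /usr/local/bin), and shutil.copyfile() opens the destination in place for writing; on Linux that open fails with ETXTBSY whenever the binary is currently being executed, while renaming a new file over it does not. A minimal sketch of the difference (paths are hypothetical, this is not the installer module's code):

# Illustration only; paths are hypothetical and this is not the installer's code.
import errno
import os
import shutil
import tempfile

DEST = '/usr/local/bin/oc'        # binary that may currently be executing
NEW = '/tmp/oc-from-cli-image'    # freshly extracted binary (hypothetical path)

# shutil.copyfile() boils down to open(dst, 'wb'); this is the call that fails
# with "[Errno 26] Text file busy" while something is executing DEST:
try:
    with open(DEST, 'wb') as fdst, open(NEW, 'rb') as fsrc:
        shutil.copyfileobj(fsrc, fdst)
except (IOError, OSError) as err:
    if err.errno == errno.ETXTBSY:
        print('in-place overwrite failed: %s' % err)

# Copying to a temporary file on the same filesystem and renaming it over the
# destination replaces the directory entry instead of writing into the busy
# file, so ETXTBSY does not apply:
fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(DEST))
os.close(fd)
shutil.copy2(NEW, tmp_path)
os.chmod(tmp_path, 0o755)
os.rename(tmp_path, DEST)         # succeeds even if DEST is being executed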

Version-Release number of selected component (if applicable):
3.4.1.5

Comment 1 Scott Dodson 2017-05-30 15:56:37 UTC
It's a race condition that can occur on any containerized install where someone is also using the oc binary when we attempt to update it. We'd need to lock the file and retry, but they could also be running a long-running oc command, so there's still potential for failure.

The workaround is simply to leave the hosts alone while the installer is running, so setting priority to low until we start seeing cases attached to this BZ.
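
To sketch what a retry might look like (a hypothetical helper, not a proposed patch): retrying on ETXTBSY would ride out a short-lived oc invocation, but as noted above a long-running oc command would keep the binary busy past any reasonable number of attempts.

import errno
import shutil
import time

def move_with_retry(src, dest, attempts=5, delay=2):
    # Retry the replacement a few times when the destination binary is
    # momentarily busy (errno 26, ETXTBSY); re-raise anything else.
    for attempt in range(attempts):
        try:
            shutil.move(src, dest)
            return
        except (IOError, OSError) as err:
            if err.errno != errno.ETXTBSY or attempt == attempts - 1:
                raise
            time.sleep(delay)  # give a short-lived oc invocation time to exit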

Comment 5 Steve Milner 2017-05-30 21:31:47 UTC
Thinking through this a little bit, I can think of a few workarounds, but none that would be very elegant. Here is an example: we could provide a temporary oc for use if there is a need to run oc commands on the target hosts (assuming oc exists on the host). The problem with this is that the operator would need to use the temporary oc and not the official one; if they did use the official one, the error would occur just as noted.

I agree with Scott's workaround noting not to use oc on the target hosts. If others end up hitting this we can figure some kind of workable path.

Comment 7 Johan Swensson 2017-09-15 11:45:24 UTC
Just ran into this on a containerized install on RHEL as well, so it's not limited to Atomic hosts.

Comment 8 Scott Dodson 2019-01-31 15:56:20 UTC
There appear to be no active cases related to this bug. As such we're closing this bug in order to focus on bugs that are still tied to active customer cases. Please re-open this bug if you feel it was closed in error or a new active case is attached.