Bug 1328480 - The kubeconfig file being updated by the oc command becomes corrupt when used simultaneously
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: oc
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Juan Vallejo
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On:
Blocks: OSOPS_V3
 
Reported: 2016-04-19 13:32 UTC by Matt Woodson
Modified: 2024-01-31 01:09 UTC
CC List: 24 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-04 20:48:46 UTC
Target Upstream Version:
Embargoed:



Description Matt Woodson 2016-04-19 13:32:50 UTC
Description of problem:

The OpenShift Operations team has written many checks that use a kubeconfig file to authenticate to OpenShift. We have noticed that this file becomes corrupt when multiple processes use the same kubeconfig file.

In our view, the kubeconfig is an authentication file. The product treats it as authentication plus state. It seems unnecessary for the authentication file to also keep the CLI's state.

We believe the oc commands that change the namespace/project are the ones that tend to trigger writes.

Having the oc command be able to corrupt this file is just problematic.

Version-Release number of selected component (if applicable):

atomic-openshift-3.1.1.6-5

How reproducible:


Steps to Reproduce:
1. Run multiple operations against the same kubeconfig file that perform project/namespace changes

Actual results:

kubeconfig file becomes corrupt.

Expected results:

kubeconfig file does not become corrupt.

Additional info:

I think there are multiple ways to approach this: file locking, not writing to the kubeconfig file at all, or breaking the state-tracking functionality out of the file.

We can obviously work around this by copying the kubeconfig file before running commands, but we wanted to raise awareness of the issue.
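
A minimal sketch of that copy-before-use workaround, assuming the shared kubeconfig lives at ~/.kube/config and the check only needs read access (paths and the project name are hypothetical):

#!/bin/bash
# Work on a private copy so concurrent checks never write to the shared file.
SRC=~/.kube/config
TMP=$(mktemp /tmp/kubeconfig.XXXXXX)
trap 'rm -f "$TMP"' EXIT

cp "$SRC" "$TMP"

# Every oc call uses the private copy; nothing touches the shared file.
oc --config="$TMP" get pods --namespace=ops-checks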

Comment 1 Sten Turpin 2016-04-19 14:28:15 UTC
The specific corruption usually causes the last 7 characters to be repeated on a new line, which causes a YAML parser error: 

    client-key-data: <long encrypted string>bqDKntOVbSDnYEZLg46RnQh43tj2jMzc=
j2jMzc=

Error loading config file "/root/kubconfig.bad": yaml: line 40: could not find expected ':'

Comment 2 Fabiano Franz 2016-04-19 14:58:52 UTC
Related issue upstream: https://github.com/kubernetes/kubernetes/issues/23964

Comment 3 Fabiano Franz 2016-07-19 17:52:19 UTC
This is now fixed in Origin master (since the latest Kube rebase*): the config loader now implements file locking on writes to prevent the kubeconfig from getting corrupted.

* https://github.com/openshift/origin/commit/24d4c5fc8130aee81503b525f365dc8563583ebf
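
For callers who cannot pick up that fix yet, the same idea can be applied externally by serializing every config-writing oc call through flock(1). This is only an illustrative sketch of the locking approach (lock file path and context/project names are hypothetical), not the upstream fix, which lives in the Go config loader linked above:

# Serialize the writers: every oc call that may rewrite ~/.kube/config
# takes an exclusive lock on a shared lock file first.
LOCK=/tmp/kubeconfig.lock
flock --exclusive "$LOCK" oc config use-context my-ctx-1
flock --exclusive "$LOCK" oc project my-project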

Comment 4 Xingxing Xia 2016-07-20 08:56:32 UTC
Hi all, could you please tell me the steps you used to corrupt the file with the old oc version? Thanks!

To verify the bug, I first need to know the steps to reproduce it with the old oc version, and then test with the latest version. If the old oc version reproduces it but the latest version does not, that would prove the bug is fixed.

However, I could not reproduce it with the steps below. I tried crontab (given that users xxia and xxia2 have different projects):
$ crontab -e
3,4,5 * * * *    /home/tester/oc/versions/v3.1.1.6/oc login <master> -u xxia  --insecure-skip-tls-verify -p redhat
3,4,5 * * * *    /home/tester/oc/versions/v3.1.1.6/oc login <master> -u xxia2 --insecure-skip-tls-verify -p redhat

And tried this:
$ crontab -e
20 * * * *    oc new-project xxia-proj2 --server=https://<master>:8443 --client-certificate='/home/tester/oc/xxia/xxia.crt' --client-key='/home/tester/oc/xxia/xxia.key' --certificate-authority='/home/tester/oc/xxia/ca.crt'
20 * * * *    oc new-project xxia2-proj2 --server=https://<master>:8443 --client-certificate='/home/tester/oc/xxia2/xxia2.crt' --client-key='/home/tester/oc/xxia2/xxia2.key' --certificate-authority='/home/tester/oc/xxia2/ca.crt'

Neither crontab setting corrupted the kubeconfig file; the file is still good.

Comment 5 Matt Woodson 2016-07-20 13:56:25 UTC
I don't believe we ever tracked down exactly what was causing the corruption.

Here is what we had been doing when we were seeing the corruption. We have since moved to a different model where we do not use the shared file directly but copy it before it is used, so we are working around the issue.

What we would typically do is write scripts (usually in Python and bash) that use the config file to make authenticated web calls, via curl or Python libraries. We would then set these checks up on cron timers, running anywhere from every couple of minutes to once a day, and all of the checks would use the same file. Eventually, our checks would start failing because we were unable to load and parse the kubeconfig file; it was no longer valid YAML.
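
A rough sketch of the kind of authenticated check being described, assuming a token-based kubeconfig and a private copy of the file (the server URL, paths, and the API endpoint chosen here are just for illustration):

#!/bin/bash
# Read-only cron check; it works from its own copy of the kubeconfig so it
# never writes to the shared file.
CFG=/var/tmp/check-kubeconfig
MASTER=https://master.example.com:8443

# Pull the bearer token out of the config copy, then hit the API with curl.
TOKEN=$(oc whoami -t --config="$CFG")
curl -sk -H "Authorization: Bearer $TOKEN" "$MASTER/api/v1/namespaces" >/dev/null \
  || echo "CRITICAL: API check failed"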

Again, we didn't track down the exact calls that were causing the corruption; we only noticed that the checks stopped working.

Hopefully the upstream fix will prevent this in the future.

Comment 6 Fabiano Franz 2016-07-20 15:27:37 UTC
I didn't try this, but you would need to have 2+ processes running at the same time, each calling an `oc` command that modifies the config file. Here's one idea that might reproduce it:

1. Make sure you have at least 2 contexts in your ~/.kube/config.
2. In bash, run an infinite loop that calls something like 'oc config use-context my-ctx-1'. Put it in the background.
3. In bash, run another infinite loop that calls something like 'oc config use-context my-ctx-2'. Put it in the background.

You will have two commands modifying the config file concurrently and will probably get the file corrupted at some point.
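
A minimal sketch of that reproduction idea, assuming the two contexts my-ctx-1 and my-ctx-2 already exist in ~/.kube/config:

# Two background loops fighting over the same ~/.kube/config.
while true; do oc config use-context my-ctx-1; done &
while true; do oc config use-context my-ctx-2; done &

# Keep an eye on the file; stop the loops with 'kill %1 %2' once it breaks.
watch -n 2 'oc config view > /dev/null || echo "config is broken"'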

Comment 7 Xingxing Xia 2016-07-22 11:26:42 UTC
I wrote a shell script:
$ cat config-race.sh
#! /bin/bash
[ $# != 3 ] && echo "usage: $0 </directory/of/oc> <project> <interval>" && exit 0
VER_DIR=$1    # /home/tester/oc/versions/v3.1.1.6 or /bin
PROJ=$2       # xxia-proj
TIME=$3       # 1.1

[ -e /home/tester/oc/is_broken ] || touch /home/tester/oc/is_broken
while true
do
  grep -q broken /home/tester/oc/is_broken && echo "broken probed" && exit -1
  if ! $VER_DIR/oc project $PROJ; then
     echo "broken" > /home/tester/oc/is_broken && echo "broken" && cat /home/tester/.kube/config
     exit -1
  fi
  sleep $TIME
done

Then I tested 2 versions of oc, v3.1.1.6 and v3.3.0.8, placed under /home/tester/oc/versions/v3.1.1.6 and /bin respectively.

The corruption happens quickly with v3.1.1.6. With v3.3.0.8 the file may survive longer, but it eventually becomes corrupt as well. See the following comment:

Comment 10 XiaochuanWang 2016-10-08 03:23:55 UTC
Since this is reported against OCP, I am not sure whether the Target Release version or the OCP version should be used for verification. According to comment 9, it is still reproducible on v3.1.1.6 and v3.3.0.8.

Comment 11 Fabiano Franz 2017-01-20 19:07:07 UTC
Please check this using v3.5. Although I did see some errors when running the test using the script proposed in Comment 7, I could not get the file to become corrupt. The file is still valid and can be used, prints correctly in 'oc config view', etc. Tks!

Comment 12 Troy Dawson 2017-01-20 23:03:19 UTC
This should be fixed in OCP v3.5.0.7 or newer.

Comment 13 Xingxing Xia 2017-01-22 10:24:38 UTC
(In reply to Fabiano Franz from comment #11)
> Please check this using v3.5. Although I did see some errors when running
> the test using the script proposed in Comment 7, I could not get the file to
> become corrupt. The file is still valid and can be used, prints correctly in
> 'oc config view', etc. Tks!

Most of the time, the file is still valid. But sometimes, after more tests, the file becomes invalid (checked with oc/openshift v3.5.0.7).
Below are the steps. To make things simpler, use the simpler script below instead of the one in comment 7:
$ cat new-script.sh
#! /bin/bash
oc login https://<your master>:8443 --insecure-skip-tls-verify -u xxia -p redhat
while true
do
  oc project $1
  [ ! $? == 0 ] && oc config view
  sleep $2
done

Then, in different terminals, run the following commands respectively:
./new-script.sh xxia-proj1 1.1
./new-script.sh xxia-proj2 1.2
./new-script.sh xxia-proj3 1.3
./new-script.sh xxia-proj4 1.4

Observe the outputs. Sometimes, some terminal will output:
Now using project "xxia-proj4" on server "https://<master>:8443".
...
Already on project "xxia-proj4" on server "https://<master>:8443".
...
error: Missing or incomplete configuration info.  Please login or point to an existing, complete config file:

  1. Via the command-line flag --config
  2. Via the KUBECONFIG environment variable
  3. In your home directory as ~/.kube/config

To view or setup config directly use the 'config' command.
See 'oc project -h' for help and examples.
apiVersion: v1
clusters: []
contexts: []
current-context: ""
kind: Config
preferences: {}
users: []



Sometimes, some terminal will output (notice "clusters: []"):
apiVersion: v1
clusters: []
contexts:
- context:
    cluster: <master>:8443
    namespace: xxia-proj1
    user: xxia/<master>:8443
  name: xxia-proj1/<master>:8443/xxia
current-context: xxia-proj1/<master>:8443/xxia
kind: Config
preferences: {}
users:
- name: xxia/<master>:8443
  user:
    token: FYzbs7eKaGlhpxyZeOLwIjyyXnxP3mNq4cyutHFceuA

Comment 14 Fabiano Franz 2017-01-23 17:44:38 UTC
OK, thanks for the confirmation. I'm lowering the severity since, based on comments 7 and 13 and my own tests, it's much better now than when initially reported, although it is still happening, so we'll keep the bug open. The kubeconfig file was not originally designed for concurrent write access, but it can now handle a fair amount of concurrent access, so let's consider this low severity.

Comment 15 Joe Stewart 2017-08-14 14:36:58 UTC
We are having a fairly frequent occurrence of this problem. Any update towards a resolution?


$ oc version
oc v3.5.5.15
kubernetes v1.5.2+43a9be4
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://xxxxxxxxxxxxxxxxxxxxxxxxx:8443
openshift v3.5.5.15
kubernetes v1.5.2+43a9be4
$

Comment 16 Fabiano Franz 2017-08-18 19:00:58 UTC
We have been recommending that people don't use the kubeconfig file for concurrent writes. 

The major reason is that kubeconfig keeps state (e.g. the current context and auth token), and using the same file from multiple threads can lead to unpredictable states, such as one thread switching the context while another thread runs commands expecting to still be on the previous context, from before the switch happened.

You can avoid this by not calling commands that perform context or auth changes (like 'oc login' or 'oc config set*') and by providing flags that explicitly set context and auth on each command call (like --context, --namespace, --token, --config, etc.).
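
For example, a read-only check in that style might look like the following (the server URL, token variable, and project name are placeholders):

# No login and no project switch: everything the command needs is passed
# explicitly, so the kubeconfig file is never rewritten.
oc get pods \
  --server=https://master.example.com:8443 \
  --token="$CHECK_TOKEN" \
  --namespace=ops-checks \
  --insecure-skip-tls-verify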

