Description of problem:
The OpenShift Operations team has written many checks that use a kubeconfig file to authenticate to OpenShift. We have noticed that this file becomes corrupt when multiple processes use the same kubeconfig file. In our view, the kubeconfig is an authentication file; the product seems to treat it as authentication + state. It seems unnecessary for the authentication file to also track the state of the CLI. We believe the oc commands that change the namespace/project are what tend to trigger the writes. Having the oc command able to corrupt this file is simply problematic.

Version-Release number of selected component (if applicable):
atomic-openshift-3.1.1.6-5

How reproducible:

Steps to Reproduce:
1. Run multiple operations concurrently with the same kubeconfig file that perform project/namespace changes.

Actual results:
The kubeconfig file becomes corrupt.

Expected results:
The kubeconfig file does not become corrupt.

Additional info:
I think there are multiple ways to approach this: file locking, not writing to the kubeconfig file, or breaking out the functionality. We can obviously work around this by copying the kubeconfig file before running commands, but wanted to raise awareness of this.
The specific corruption usually causes the last 7 characters to be repeated on a new line, which causes a YAML parser error:

client-key-data: <long encrypted string>bqDKntOVbSDnYEZLg46RnQh43tj2jMzc=
j2jMzc=

Error loading config file "/root/kubconfig.bad": yaml: line 40: could not find expected ':'
Related issue upstream: https://github.com/kubernetes/kubernetes/issues/23964
This is now fixed in Origin master (since the latest Kube rebase*): the config loader now implements file locking on writes to prevent the kubeconfig from getting corrupted.

* https://github.com/openshift/origin/commit/24d4c5fc8130aee81503b525f365dc8563583ebf
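For illustration, here is a minimal shell analogy of the locking approach taken in the commit above (the actual fix is in Go inside the config loader; the file names, the locked_write helper, and the use of flock(1) here are all illustrative, not taken from the commit):

```shell
#!/bin/bash
# Sketch: every writer takes an exclusive lock on a sidecar lock file before
# touching the shared config, so concurrent writers can no longer interleave
# partial writes. CFG, LOCK, and locked_write are hypothetical names.
CFG=$(mktemp)
LOCK="$CFG.lock"

locked_write() {
  # flock(1) holds an exclusive lock for the duration of the command,
  # serializing the write that would otherwise race.
  flock "$LOCK" -c "printf 'current-context: %s\n' '$1' > '$CFG'"
}

# Two concurrent writers, like two oc processes switching contexts.
for i in $(seq 1 25); do locked_write ctx-a; done &
for i in $(seq 1 25); do locked_write ctx-b; done &
wait

cat "$CFG"   # one intact line; never a torn or duplicated tail
```

With the lock in place, the file always contains one complete record from the last writer, instead of the truncated/duplicated tail shown in the corruption example above.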
Hi all, could you please tell me the steps you used to make the file corrupt with the old oc version? Thanks!

To verify the bug, I first need to know the steps to reproduce it with the old oc version, and then test with the latest version. If the old oc version reproduces it but the latest version does not, that would prove the bug is fixed. However, I could not reproduce it with the steps below.

I tried crontab (given that users xxia and xxia2 have different projects):

$ crontab -e
3,4,5 * * * * /home/tester/oc/versions/v3.1.1.6/oc login <master> -u xxia --insecure-skip-tls-verify -p redhat
3,4,5 * * * * /home/tester/oc/versions/v3.1.1.6/oc login <master> -u xxia2 --insecure-skip-tls-verify -p redhat

And tried this:

$ crontab -e
20 * * * * oc new-project xxia-proj2 --server=https://<master>:8443 --client-certificate='/home/tester/oc/xxia/xxia.crt' --client-key='/home/tester/oc/xxia/xxia.key' --certificate-authority='/home/tester/oc/xxia/ca.crt'
20 * * * * oc new-project xxia2-proj2 --server=https://<master>:8443 --client-certificate='/home/tester/oc/xxia2/xxia2.crt' --client-key='/home/tester/oc/xxia2/xxia2.key' --certificate-authority='/home/tester/oc/xxia2/ca.crt'

Neither crontab setup corrupted the kubeconfig file; the file is still good.
I don't believe we ever tracked down exactly what causes the corruption. Here is what we were doing when we saw it. We have since moved to a different model where we do not use the shared file directly, but copy it before each use, so we are working around the issue.

What we would typically do is write scripts (usually in Python and bash) that use the config file to make authenticated web calls, via curl or libraries within Python. We would then schedule these checks via cron, running anywhere from every couple of minutes to once a day. All of the checks used the same file. Eventually, our checks would start failing because we were unable to load and parse the kubeconfig file; it was no longer valid YAML. Again, we didn't track down the exact calls that were causing the corruption, but we did notice it stopped working. Hopefully the upstream fix will prevent this in the future.
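The copy-before-use workaround can be sketched as follows; the run_check wrapper and the example oc invocation are hypothetical names for illustration, not the team's actual tooling:

```shell
#!/bin/bash
# Sketch of the workaround: each check clones the shared kubeconfig into a
# private temp file and points every command at the clone, so no check ever
# writes to (or reads a half-written) shared file.
run_check() {
  local shared=$1 private
  private=$(mktemp) || return 1
  cp "$shared" "$private" || { rm -f "$private"; return 1; }
  # A real check would now use only the private copy, e.g.:
  #   oc --config="$private" get pods -n some-namespace
  echo "check ran against private copy $private"
  rm -f "$private"
}
```

Because every cron job gets its own copy, a write from one job can no longer corrupt the file another job is reading.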
I didn't try this, but you would need 2+ processes running at the same time, each calling an `oc` command that modifies the config file. Here's one idea to possibly reproduce it:
1. Make sure you have at least two contexts in your ~/.kube/config.
2. In bash, run an infinite loop that calls something like 'oc config use-context my-ctx-1'. Put it in the background.
3. In bash, run another infinite loop that calls something like 'oc config use-context my-ctx-2'. Put it in the background.
You will then have two commands modifying the config file concurrently, and would probably get the file corrupted at some point.
I wrote a shell script:

$ cat config-race.sh
#!/bin/bash
[ $# != 3 ] && echo "usage: $0 </directory/of/oc> <project> <interval>" && exit 0
VER_DIR=$1   # /home/tester/oc/versions/v3.1.1.6 or /bin
PROJ=$2      # xxia-proj
TIME=$3      # 1.1
[ -e /home/tester/oc/is_broken ] || touch /home/tester/oc/is_broken
while true
do
    grep -q broken /home/tester/oc/is_broken && echo "broken probed" && exit 1
    if ! $VER_DIR/oc project $PROJ; then
        echo "broken" > /home/tester/oc/is_broken && echo "broken" && cat /home/tester/.kube/config
        exit 1
    fi
    sleep $TIME
done

Then I tested two versions of oc, v3.1.1.6 and v3.3.0.8, placed under /home/tester/oc/versions/v3.1.1.6 and /bin respectively. The corruption happens quickly with v3.1.1.6. With v3.3.0.8 the file may survive longer, but eventually becomes corrupt as well. See the following comment:
Since this is reported on OCP, I'm not sure whether the Target Release version or the OCP version should be used for verification. According to comment 9, it is still reproducible on v3.1.1.6 and v3.3.0.8.
Please check this using v3.5. Although I did see some errors when running the test using the script proposed in Comment 7, I could not get the file to become corrupt. The file is still valid and can be used, prints correctly in 'oc config view', etc. Tks!
This should be fixed in OCP v3.5.0.7 or newer.
(In reply to Fabiano Franz from comment #11)
> Please check this using v3.5. Although I did see some errors when running
> the test using the script proposed in Comment 7, I could not get the file to
> become corrupt. The file is still valid and can be used, prints correctly in
> 'oc config view', etc. Tks!

Most of the time, the file is still valid. But sometimes, after more tests, the file is invalid (checked with oc/openshift v3.5.0.7). Below are the steps. To make things simpler, use the following simpler script instead of the one in comment 7:

$ cat new-script.sh
#!/bin/bash
oc login https://<your master>:8443 --insecure-skip-tls-verify -u xxia -p redhat
while true
do
    oc project $1
    [ $? != 0 ] && oc config view
    sleep $2
done

Then, in different terminals, run the following commands respectively:
./new-script.sh xxia-proj1 1.1
./new-script.sh xxia-proj2 1.2
./new-script.sh xxia-proj3 1.3
./new-script.sh xxia-proj4 1.4

Observe the outputs. Sometimes a terminal will output:

Now using project "xxia-proj4" on server "https://<master>:8443".
...
Already on project "xxia-proj4" on server "https://<master>:8443".
...
error: Missing or incomplete configuration info.  Please login or point to an existing, complete config file:

  1. Via the command-line flag --config
  2. Via the KUBECONFIG environment variable
  3. In your home directory as ~/.kube/config

To view or setup config directly use the 'config' command.
See 'oc project -h' for help and examples.
apiVersion: v1
clusters: []
contexts: []
current-context: ""
kind: Config
preferences: {}
users: []

Sometimes a terminal will output (notice "clusters: []"):

apiVersion: v1
clusters: []
contexts:
- context:
    cluster: <master>:8443
    namespace: xxia-proj1
    user: xxia/<master>:8443
  name: xxia-proj1/<master>:8443/xxia
current-context: xxia-proj1/<master>:8443/xxia
kind: Config
preferences: {}
users:
- name: xxia/<master>:8443
  user:
    token: FYzbs7eKaGlhpxyZeOLwIjyyXnxP3mNq4cyutHFceuA
Ok, thanks for the confirmation. I'm lowering the severity since, based on comments 7 and 13 and my own tests, it's much better now than when initially reported, although still happening, so we'll keep the bug open. The kubeconfig file was not initially designed for concurrent write access, but it can now handle a fair amount of concurrent access, so let's consider this low severity.
We are having a fairly frequent occurrence of this problem. Any update towards a resolution?

$ oc version
oc v3.5.5.15
kubernetes v1.5.2+43a9be4
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://xxxxxxxxxxxxxxxxxxxxxxxxx:8443
openshift v3.5.5.15
kubernetes v1.5.2+43a9be4
$
We have been recommending that people not use the kubeconfig file for concurrent writes. The major reason is that kubeconfig keeps state (e.g. the current context and auth token), and sharing the same file across multiple threads can lead to unpredictable states, like one thread switching the context while another thread runs commands expecting it was still on the previous context, before the switch happened. You avoid this by not calling commands that perform context or auth changes (like 'oc login' or 'oc config set*') and by providing flags that explicitly set context and auth on each command call (like --context, --namespace, --token, --config, etc.).
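A minimal sketch of this stateless pattern; the oc_ro wrapper name and the config, context, and namespace values are placeholders, and the echo makes it a dry run for illustration (drop the echo to actually invoke oc):

```shell
#!/bin/bash
# Stateless usage sketch: every call pins config, context, and namespace via
# flags, so nothing depends on (or rewrites) the current-context stored in
# the kubeconfig file. All values here are illustrative placeholders.
OC_CONFIG=/root/kubeconfig      # shared file, now treated as read-only
OC_CONTEXT=my-context
OC_NAMESPACE=my-namespace

oc_ro() {
  # 'echo' turns this into a dry run; remove it to execute for real.
  echo oc --config="$OC_CONFIG" \
          --context="$OC_CONTEXT" \
          --namespace="$OC_NAMESPACE" \
          "$@"
}

oc_ro get pods
```

Since no invocation ever mutates the file, any number of cron jobs can share the same kubeconfig safely.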