Description of problem:
Builds are failing in starter-us-east-1 because of a delay in creating the dockercfg token used for builds.

Version-Release number of selected component (if applicable):
atomic-openshift-3.5.5.19-1.git.0.aef7e02.el7.x86_64
etcd-3.1.9-1.el7.x86_64

How reproducible:
20 out of 20 test builds in starter-us-east-1 hit this issue.

Steps to Reproduce:
1. Create a new project and any app that does a build. For example:
   oc new-project <name>; oc new-app cakephp-mysql-persistent -n <name>

Actual results:
The deployer and builder tokens are not present in time for the first build attempt to succeed. Subsequent builds succeed after the token has been created.

Expected results:
The first build should succeed.

Additional info:
Most of the time, this error was encountered:
  No API token found for service account "deployer", retry after the token is automatically created and added to the service account
Occasionally, this error would appear too:
  Error creating: No API token found for service account "builder", retry after the token is automatically created and added to the service account

Also, 'oc get sa' shows the docker token missing from the build sa in the project. After some time, the docker token shows up and builds will succeed. See attached logs for details.
Can we identify when this started happening and what changes might have happened around that time? Did we upgrade any versions? Make config changes? Anything like that?
If new-app is run immediately after a project is created, it races with service account token generation. It has been this way since 3.0 (c.f. https://bugzilla.redhat.com/show_bug.cgi?id=1318917#c5). The time it takes for the service accounts and tokens to be provisioned has been greatly improved, but the fundamental race still exists. @stefanie - how long after running `oc new-project` are all three service accounts and all nine secrets present in the namespace?
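For reference, a minimal way to measure that is sketched below (assumptions: the project name token-timing-test is just a placeholder, and "ready" means the three service accounts plus nine secrets asked about above):

start=$(date +%s)
oc new-project token-timing-test
# poll until all service accounts and secrets exist, then report how long it took
until [ "$(oc get sa -n token-timing-test --no-headers 2>/dev/null | wc -l)" -ge 3 ] && \
      [ "$(oc get secrets -n token-timing-test --no-headers 2>/dev/null | wc -l)" -ge 9 ]; do
  sleep 1
done
echo "service accounts and secrets present after $(( $(date +%s) - start ))s"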
I was on-call this weekend and got paged about it. That's the first incident that I'm aware of. But according to Zabbix, there has been an increasing number of app-create failures since June 2nd. Overall, it just appears to be getting worse with time. I'm also seeing an increasing amount of IOPS used over time on this Zabbix graph. The last upgrade was etcd on June 9th; atomic-openshift was upgraded on May 25th. I'll attach graphs and test how long it takes secrets to appear next...
Created attachment 1289114 [details] Zabbix IOPS graph
Created attachment 1289115 [details] Zabbix app-create graph
Created attachment 1289116 [details] Zabbix app-creates since June 1
I was keeping track of the time by watching the age of the rc's in affected projects. Some of the rc's were 4 minutes old before the dockercfg secrets were created.
Is this a cluster that is otherwise healthy, with controllers running as expected?
> I wasn't able to find any instances of 'service' and 'token' on the same line. Here's the full output of 'oc get --raw /metrics'.

Oops, those metrics are served on the controller metrics port, which is 8444 by default. Something like this:

oc get --server=https://<master>:8444 --raw /metrics 2>&1 | grep serviceaccount_tokens
*** Bug 1447386 has been marked as a duplicate of this bug. ***
In the controllers unit file, what is the --listen address passed to the controllers? That would dictate where the metrics are available. Also, you would need to check the metrics against the controller that currently holds the active lease.
Thanks, I'm on the active controller master now.

[root@starter-us-east-1-master-25064 ~]# grep OPTIONS /etc/sysconfig/atomic-openshift-master-controllers
OPTIONS=--loglevel=2 --listen=https://0.0.0.0:8444

I can see you were right about the port it listens on. Using that, the command to grab metrics is working:

[root@starter-us-east-1-master-25064 ~]# oc get --server=https://ip-172-31-54-162.ec2.internal:8444 --raw /metrics 2>&1 | grep serviceaccount_tokens
# HELP serviceaccount_tokens_secret_adds Total number of adds handled by workqueue: serviceaccount_tokens_secret
# TYPE serviceaccount_tokens_secret_adds counter
serviceaccount_tokens_secret_adds 3.86200587e+08
# HELP serviceaccount_tokens_secret_depth Current depth of workqueue: serviceaccount_tokens_secret
# TYPE serviceaccount_tokens_secret_depth gauge
serviceaccount_tokens_secret_depth 0
# HELP serviceaccount_tokens_secret_queue_latency How long an item stays in workqueueserviceaccount_tokens_secret before being requested.
# TYPE serviceaccount_tokens_secret_queue_latency summary
serviceaccount_tokens_secret_queue_latency{quantile="0.5"} 306044
serviceaccount_tokens_secret_queue_latency{quantile="0.9"} 538312
serviceaccount_tokens_secret_queue_latency{quantile="0.99"} 682085
serviceaccount_tokens_secret_queue_latency_sum 1.26759309437759e+14
serviceaccount_tokens_secret_queue_latency_count 3.86200587e+08
# HELP serviceaccount_tokens_secret_retries Total number of retries handled by workqueue: serviceaccount_tokens_secret
# TYPE serviceaccount_tokens_secret_retries counter
serviceaccount_tokens_secret_retries 0
# HELP serviceaccount_tokens_secret_work_duration How long processing an item from workqueueserviceaccount_tokens_secret takes.
# TYPE serviceaccount_tokens_secret_work_duration summary
serviceaccount_tokens_secret_work_duration{quantile="0.5"} 11
serviceaccount_tokens_secret_work_duration{quantile="0.9"} 85
serviceaccount_tokens_secret_work_duration{quantile="0.99"} 286
serviceaccount_tokens_secret_work_duration_sum 1.2379943135e+10
serviceaccount_tokens_secret_work_duration_count 3.86200587e+08
# HELP serviceaccount_tokens_service_adds Total number of adds handled by workqueue: serviceaccount_tokens_service
# TYPE serviceaccount_tokens_service_adds counter
serviceaccount_tokens_service_adds 51794
# HELP serviceaccount_tokens_service_depth Current depth of workqueue: serviceaccount_tokens_service
# TYPE serviceaccount_tokens_service_depth gauge
serviceaccount_tokens_service_depth 0
# HELP serviceaccount_tokens_service_queue_latency How long an item stays in workqueueserviceaccount_tokens_service before being requested.
# TYPE serviceaccount_tokens_service_queue_latency summary
serviceaccount_tokens_service_queue_latency{quantile="0.5"} 10
serviceaccount_tokens_service_queue_latency{quantile="0.9"} 100645
serviceaccount_tokens_service_queue_latency{quantile="0.99"} 106200
serviceaccount_tokens_service_queue_latency_sum 1.300959845899e+12
serviceaccount_tokens_service_queue_latency_count 51794
# HELP serviceaccount_tokens_service_retries Total number of retries handled by workqueue: serviceaccount_tokens_service
# TYPE serviceaccount_tokens_service_retries counter
serviceaccount_tokens_service_retries 18
# HELP serviceaccount_tokens_service_work_duration How long processing an item from workqueueserviceaccount_tokens_service takes.
# TYPE serviceaccount_tokens_service_work_duration summary
serviceaccount_tokens_service_work_duration{quantile="0.5"} 59733
serviceaccount_tokens_service_work_duration{quantile="0.9"} 135333
serviceaccount_tokens_service_work_duration{quantile="0.99"} 144715
serviceaccount_tokens_service_work_duration_sum 1.532152388e+09
serviceaccount_tokens_service_work_duration_count 51794
[root@starter-us-east-1-master-25064 ~]#
The logs are so full of stuff (think a gig per hour) that it's near impossible to find any needles in the haystack. What sort of things should I grep for? 'Just looking' is not really possible. We know this etcd is taking a beating. We know that when it snapshots every 20 seconds it can take up to 1 sec to respond to GETs. But I don't see much indication of things literally failing, just being slow.
Created attachment 1292450 [details] Logs from the suggested /metrics command while creating a project
Analysis shows secret watch events are not being delivered:

1. namespace is created
2. immediately, service accounts are created, so the ns watch event was received
3. immediately, populated tokens (from the token controller) and unpopulated tokens (from the dockercfg controller) are created, so sa watch events are received
4. then, there is a long, long silence. Metrics show the queue for the token controller sitting at 0 almost the whole time, so we're not backed up. Eventually, the secrets appear.

Verified there is an issue with watch event delivery (see the timestamped sketch below):

1. manually watched service accounts and secrets:
   oc get secrets -w
   oc get sa -w
2. deleted service accounts and secrets:
   oc delete secrets --all
   oc delete sa --all
3. verified service accounts and secrets got recreated
4. service account events were delivered; secret events were not.
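A timestamped version of that check, as a rough sketch (assumptions: a throwaway project named watch-delivery-test is acceptable, and gawk is available for strftime):

NS=watch-delivery-test
oc new-project "$NS"
# watch both resource types for 5 minutes, prefixing each event with a timestamp
timeout 300 oc get secrets -n "$NS" -w --no-headers | awk '{ print strftime("%T"), "SECRET", $0; fflush() }' &
timeout 300 oc get sa      -n "$NS" -w --no-headers | awk '{ print strftime("%T"), "SA    ", $0; fflush() }' &
sleep 10
# force the controllers to recreate everything
oc delete secrets --all -n "$NS"
oc delete sa --all -n "$NS"
wait
# SA events should show up within seconds of the delete; if SECRET events lag by
# minutes, secret watch delivery is the problem.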
Created attachment 1292476 [details] watching etcd while creating a new project
Looking at those last logs for default-token-*:

create token 7zj2d WITH data
create token rgs4g WITHOUT data
compareAndDelete token 7zj2d
create token 69bt0 WITH data
*wait a minute*
compareAndSwap rgs4g so that it has data

Similar pattern for builder and deployer. Very odd.
That's expected behavior in 3.5… the two token-creating controllers both update the service account immediately, and the one creating the populated token reacts to a conflict by deleting and requeuing. This got improved in 1.7, but it is unrelated to the watch delivery issue, which is what is causing the unpopulated token to remain unpopulated for 1-4 minutes.
Transferring to David while I'm out. Eric, I'd want to pin down whether the issue was in etcd or in the watch cache. If a list of secrets takes ~1 minute from etcd, it seems possible that the event window could expire before the lister/watcher that feeds the watch cache could establish a watch. That might make the watch cache devolve into full lists once a minute, which could explain both the load on the server and the extreme lag or absence of secret watch events.
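One way to start pinning that down is to time the same list from both places (a sketch: the etcdctl TLS flags are the ones used elsewhere in this bug, it assumes the v2 keyspace under /kubernetes.io/secrets, and ls only walks keys rather than values, so treat the etcd number as a lower bound):

# list served through the apiserver (and its watch cache, when enabled)
time oc get secrets --all-namespaces -o name > /dev/null

# the same keys straight from etcd, bypassing the apiserver
time etcdctl --endpoints=https://172.31.54.162:2379 \
  --ca-file=/etc/origin/master/master.etcd-ca.crt \
  --cert-file=/etc/origin/master/master.etcd-client.crt \
  --key-file=/etc/origin/master/master.etcd-client.key \
  ls --recursive /kubernetes.io/secrets > /dev/null

If the raw etcd walk already takes on the order of a minute, that would support the theory that the event window expires before the watch cache can re-establish its watch.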
Created attachment 1292688 [details] journalctl --since='-60m' -b -u atomic-openshift-master-api | grep 'ended with:'
Created attachment 1292690 [details]
journalctl --since='-60m' -b -u systemd-journald -u atomic-openshift-master-api | grep -E '(ended with:)|(Suppress)' | grep master

Just to make it clear: we might have lost many, many more of these messages.
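Those suppressed messages come from journald rate limiting, so the size of the gap can at least be bounded, and the limits raised if we want the full firehose. A sketch (the RateLimit values below are placeholders, not a recommendation):

# how much journald dropped, and what the current limits are
journalctl -b -u systemd-journald | grep -i suppress
grep -E '^#?RateLimit' /etc/systemd/journald.conf

# to keep more of the burst, raise the limits in /etc/systemd/journald.conf
# (placeholder values) and restart journald:
#   RateLimitInterval=30s
#   RateLimitBurst=100000
systemctl restart systemd-journald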
Created attachment 1292694 [details] cat /etc/origin/master/master-config.yaml
Jun 28 19:21:40 ip-172-31-54-162.ec2.internal atomic-openshift-master-controllers[26071]: W0628 19:21:40.177584 26071 reflector.go:319] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/serviceaccount/admission.go:119: watch of *api.Secret ended with: too old resource version: 743222542 (743230476)
Jun 28 19:22:59 ip-172-31-54-162.ec2.internal atomic-openshift-master-controllers[26071]: W0628 19:22:59.120664 26071 reflector.go:319] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/serviceaccount/admission.go:119: watch of *api.Secret ended with: too old resource version: 743237322 (743243970)
Jun 28 19:24:14 ip-172-31-54-162.ec2.internal atomic-openshift-master-controllers[26071]: W0628 19:24:14.734873 26071 reflector.go:319] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/serviceaccount/admission.go:119: watch of *api.Secret ended with: too old resource version: 743250581 (743256981)
Jun 28 19:25:35 ip-172-31-54-162.ec2.internal atomic-openshift-master-controllers[26071]: W0628 19:25:35.053245 26071 reflector.go:319] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/serviceaccount/admission.go:119: watch of *api.Secret ended with: too old resource version: 743263795 (743271318)
Jun 28 19:26:54 ip-172-31-54-162.ec2.internal atomic-openshift-master-controllers[26071]: W0628 19:26:54.888254 26071 reflector.go:319] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/serviceaccount/admission.go:119: watch of *api.Secret ended with: too old resource version: 743278042 (743284459)
Jun 28 19:28:14 ip-172-31-54-162.ec2.internal atomic-openshift-master-controllers[26071]: W0628 19:28:14.683790 26071 reflector.go:319] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/serviceaccount/admission.go:119: watch of *api.Secret ended with: too old resource version: 743291479 (743298150)
Jun 28 19:29:32 ip-172-31-54-162.ec2.internal atomic-openshift-master-controllers[26071]: W0628 19:29:32.212156 26071 reflector.go:319] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/serviceaccount/admission.go:119: watch of *api.Secret ended with: too old resource version: 743304981 (743311696)
Jun 28 19:30:53 ip-172-31-54-162.ec2.internal atomic-openshift-master-controllers[26071]: W0628 19:30:53.243054 26071 reflector.go:319] github.com/openshift/origin/vendor/k8s.io/kubernetes/plugin/pkg/admission/serviceaccount/admission.go:119: watch of *api.Secret ended with: too old resource version: 743319014 (743326934)
# time oc get secrets --all-namespaces | wc -l
107814

real    2m32.015s
user    1m51.743s
sys     0m25.743s
comment #33 is, I believe, from an INACTIVE controller
Created attachment 1292695 [details] journalctl --since='-60m' -u atomic-openshift-master-controllers | grep 'reflector.go'
etcd is not disk starved, but it is clearly on fire:

# time etcdctl --endpoints=https://172.31.54.162:2379 --ca-file=/etc/origin/master/master.etcd-ca.crt --cert-file=/etc/origin/master/master.etcd-client.crt --key-file=/etc/origin/master/master.etcd-client.key get --quorum /kubernetes.io/secrets/eparis/default-token-n3hh7 > /dev/null
2017-06-28 22:56:16.077912 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated

real    0m0.057s
user    0m0.038s
sys     0m0.014s

# time etcdctl --endpoints=https://172.31.54.162:2379 --ca-file=/etc/origin/master/master.etcd-ca.crt --cert-file=/etc/origin/master/master.etcd-client.crt --key-file=/etc/origin/master/master.etcd-client.key get --quorum /kubernetes.io/secrets/eparis/default-token-n3hh7 > /dev/null

real    0m2.818s
user    0m0.038s
sys     0m0.024s
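To see whether those multi-second quorum reads line up with the ~20 second snapshot cadence mentioned earlier, a crude per-second sampler (same flags and key as above) can be left running for a few minutes; this is just a sketch:

# print a timestamp and the quorum-read latency once per second
while true; do
  t0=$(date +%s.%N)
  etcdctl --endpoints=https://172.31.54.162:2379 \
    --ca-file=/etc/origin/master/master.etcd-ca.crt \
    --cert-file=/etc/origin/master/master.etcd-client.crt \
    --key-file=/etc/origin/master/master.etcd-client.key \
    get --quorum /kubernetes.io/secrets/eparis/default-token-n3hh7 > /dev/null 2>&1
  echo "$(date +%T) $(echo "$(date +%s.%N) - $t0" | bc)s"
  sleep 1
done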
Alright, got a theory:

1. at some point the storage cache etcd watch expires
2. storage cache terminates all API secret watches
3. storage cache holds all future API secret watches until the relist is complete
4. storage cache reflector does a relist - this takes one to two minutes
5. storage cache accepts the API secret watches, but has no data for them
6. storage cache reflector does a watch, but gets rejected on a RV too old (remember its list RV is old)
7. go back to step 2

Try increasing the number of RVs available in etcd.
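The comment doesn't say which knob that is. One candidate on the API server side is enlarging the secrets watch cache so watches can be re-established against an older resource version; in master-config.yaml that would look roughly like the fragment below. This is a sketch only: the size is a placeholder, the watch-cache-sizes argument has to be supported at this kube level, and it only helps if the limiting window is the apiserver cache rather than etcd's own event history.

kubernetesMasterConfig:
  apiServerArguments:
    watch-cache-sizes:
    - secrets#10000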
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days