Bug 1377739 - 'error pruning blob' during pruning (connection refused)
Summary: 'error pruning blob' during pruning (connection refused)
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Image Registry
Version: 3.2.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Maciej Szulik
QA Contact: Wei Sun
URL:
Whiteboard:
Duplicates: 1387762 (view as bug list)
Depends On:
Blocks: OSOPS_V3
 
Reported: 2016-09-20 13:50 UTC by Joel Diaz
Modified: 2017-02-06 11:20 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-02-06 11:20:41 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
oadm prune output (633.73 KB, application/x-bzip)
2016-09-20 13:50 UTC, Joel Diaz
master-api service log during pruning (31.66 KB, text/x-vhdl)
2016-11-02 17:46 UTC, Joel Diaz
etcd --debug output during pruning run (4.80 KB, application/x-xz)
2016-11-03 21:29 UTC, Joel Diaz

Description Joel Diaz 2016-09-20 13:50:17 UTC
Created attachment 1202899 [details]
oadm prune output

Description of problem:
During scheduled image pruning, OpenShift will eventually start displaying error messages like:

error pruning blob "sha256:0a3b55df4c37298c9add0149b85541ee71407d51bc887320dff007679b122ec6": error sending request: dial tcp 172.30.72.234:5000: connection refused

Version-Release number of selected component (if applicable):
[root@cloudbees-master-59c70 ~]# rpm -qa | grep atomic-open
atomic-openshift-3.2.1.15-1.git.8.c402626.el7.x86_64
atomic-openshift-clients-3.2.1.15-1.git.8.c402626.el7.x86_64
atomic-openshift-sdn-ovs-3.2.1.15-1.git.8.c402626.el7.x86_64
atomic-openshift-node-3.2.1.15-1.git.8.c402626.el7.x86_64
tuned-profiles-atomic-openshift-node-3.2.1.15-1.git.8.c402626.el7.x86_64
atomic-openshift-master-3.2.1.15-1.git.8.c402626.el7.x86_64

How reproducible:
100% of the time


Steps to Reproduce:
1. start pruning with: oadm prune images --keep-younger-than 24h --keep-tag-revisions 5 --token <TOKEN HERE> --confirm
2. wait for messages to start to appear
3. oadm eventually exits (exit code '1')

Actual results:
Many, many messages of the type:
error pruning blob "sha256:0a3b55df4c37298c9add0149b85541ee71407d51bc887320dff007679b122ec6": error sending request: dial tcp 172.30.72.234:5000: connection refused
error pruning blob "sha256:73a875b0845b3220bf6bdebb244cb7faf79c1212d6568e4e37714f6c69153bc8": error sending request: dial tcp 172.30.72.234:5000: connection refused
error pruning blob "sha256:97226030a384d44f936cfa5844ceda62f04da46d46e6bf651b898dc9ca4546dd": error sending request: dial tcp 172.30.72.234:5000: connection refused
error pruning blob "sha256:d9011083e33465170d666faa7ca14f2d925fed74355c4adec9c2e9b049b54b3f": error sending request: dial tcp 172.30.72.234:5000: connection refused
error pruning blob "sha256:92f36f4e29a90f13cc315dba9e352d9328bb96a6138b81f830202d124c8f19f6": error sending request: dial tcp 172.30.72.234:5000: connection refused

Expected results:
Clean exit code of '0'.

Additional info:

Comment 1 Michal Fojtik 2016-10-10 09:39:09 UTC
I think we will need to see the Docker Registry logs for this one (it seems the Docker Registry API is refusing connections).

Comment 2 Maciej Szulik 2016-10-10 11:06:48 UTC
This looks like errors connecting to the registry. Can you also provide the output of this command with at least --loglevel=4? Please also make sure you're able to reach the registry IP (172.30.72.234:5000) from the machine you're invoking oadm prune on. You can also explicitly specify that IP (if needed) using the --registry-url flag.
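
For example, something along these lines should do it (a rough sketch; the registry IP is taken from the error message, /v2/ is the standard Docker Registry API root, and you'd use http instead of https if the registry is not secured with TLS):

# verify the registry is reachable from the host running oadm prune
curl -kv https://172.30.72.234:5000/v2/

# rerun pruning with more verbose output, optionally pointing at the registry explicitly
oadm prune images --keep-younger-than 24h --keep-tag-revisions 5 \
  --registry-url=172.30.72.234:5000 --token <TOKEN HERE> --loglevel=4 --confirm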

Comment 3 Michal Fojtik 2016-11-02 08:41:54 UTC
Agreed, this seems to be a registry connectivity problem. Setting the target to the upcoming release, since this is not a regression or blocker, but rather a networking problem.

Comment 4 Joel Diaz 2016-11-02 14:31:17 UTC
Sorry for the slow response. I went to gather the additional logs, and the cluster is no longer showing the error messages originally reported.

What I'm seeing now are messages of the types:

I1102 14:14:53.981931   36482 imagepruner.go:430] Unable to find image "sha256:325435f552c40e73c39cc5335f8908a48b4356d4d79fb669a162f8d9fa6b7487" in the graph

...

Deleting references from image streams to images ...
error updating image stream project-vera-dev/vera-nodejs to remove references to image sha256:09fff08beb3c549b4ae28555eba7b446b05c6426f9d081807df1161c05ec5830: client: response is invalid json. The endpoint is probably not valid etcd cluster endpoint.

...

Deleting registry repository layer links ...
W1102 14:15:13.972756   36482 imagepruner.go:922] Unable to prune layer https://172.30.166.216:5000/v2/project-vera-dev/vera-nodejs/blobs/sha256:a119d0871c9ba84f58deb74c712a7b837473642a70b67b908d0ad65b171dec1f, returned 404 Not Found

...

Deleting registry repository manifest data ...
W1102 14:20:20.651825   36482 imagepruner.go:922] Unable to prune layer https://172.30.166.216:5000/v2/project-vera-dev/vera-nodejs/manifests/sha256:09fff08beb3c549b4ae28555eba7b446b05c6426f9d081807df1161c05ec5830, returned 404 Not Found

...

error pruning image from stream: client: response is invalid json. The endpoint is probably not valid etcd cluster endpoint.


Should the above issues/messages be opened as a separate bugzilla?

Comment 5 Maciej Szulik 2016-11-02 15:53:16 UTC
The 404s on layer pruning are OK: we try to remove some layers several times because they might belong to several images, so don't worry about those log messages.

What worries me is:

error updating image stream project-vera-dev/vera-nodejs to remove references to image sha256:09fff08beb3c549b4ae28555eba7b446b05c6426f9d081807df1161c05ec5830: client: response is invalid json. The endpoint is probably not valid etcd cluster endpoint.

Can you provide me with logs from the master from that time? Pruning does not connect to etcd directly, only through the API server, so something is going wrong in the master and I need to know what.
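
In case it helps, something like this run on the master after the pruning run should capture the relevant window (a sketch; on a single-master install the service may be called atomic-openshift-master instead, and the master log level is set via the OPTIONS line in the corresponding /etc/sysconfig file):

# dump the master API service journal covering the pruning run
journalctl -u atomic-openshift-master-api --since "30 minutes ago" > master-api-during-prune.log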

Comment 6 Joel Diaz 2016-11-02 17:46:00 UTC
Created attachment 1216657 [details]
master-api service log during pruning

Attached is the log output from the master-api service during pruning.

Comment 7 Timothy St. Clair 2016-11-03 16:04:05 UTC
Need the etcd logs to verify.

Comment 8 Joel Diaz 2016-11-03 19:37:43 UTC
I assume you want more than the default logging output. Here's all that came out during a pruning run:

Nov 03 19:27:26 ip-172-20-250-161.ap-southeast-1.compute.internal etcd[1459]: purged file /var/lib/etcd/member/wal/00000000000024a6-00000000024ae724.wal successfully

Considering that this is an active cluster, how do you enable more verbose logging?

Comment 9 Timothy St. Clair 2016-11-03 19:56:52 UTC
You can add '--debug' to the arguments in the systemd unit file.

Also, if you have any metrics gathering on the etcd node, that would be useful. We've seen instances where pruning a large number of images with excessive layers can cause issues if not planned properly from a memory-sizing and HA perspective.
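
For example (a sketch; the exact layout depends on how etcd was installed — on RHEL, setting ETCD_DEBUG in /etc/etcd/etcd.conf is the environment-variable equivalent of passing --debug on the command line):

# enable etcd debug logging, restart, and watch the journal during the next pruning run
echo 'ETCD_DEBUG="true"' >> /etc/etcd/etcd.conf
systemctl restart etcd
journalctl -u etcd -f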

Comment 10 Joel Diaz 2016-11-03 21:29:26 UTC
Created attachment 1217154 [details]
etcd --debug output during pruning run

Comment 11 Timothy St. Clair 2016-11-04 20:19:25 UTC
I see no indication of a problem in the etcd logs, but it does look like there is some pathological hot looping.  

[snip] 

Nov 03 21:18:42 ip-172-20-250-161.ap-southeast-1.compute.internal etcd[108036]: [GET] /v2/keys/kubernetes.io/serviceaccounts/openshift-infra/autopruner?quorum=false&recursive=false&sorted=false remote:172.20.250.160:47142
Nov 03 21:18:42 ip-172-20-250-161.ap-southeast-1.compute.internal etcd[108036]: [GET] /v2/keys/kubernetes.io/secrets/openshift-infra/autopruner-token-nbjhc?quorum=false&recursive=false&sorted=false remote:172.20.250.160:47142
Nov 03 21:18:42 ip-172-20-250-161.ap-southeast-1.compute.internal etcd[108036]: [GET] /v2/keys/kubernetes.io/serviceaccounts/openshift-infra/autopruner?quorum=false&recursive=false&sorted=false remote:172.20.250.160:47142
Nov 03 21:18:42 ip-172-20-250-161.ap-southeast-1.compute.internal etcd[108036]: [GET] /v2/keys/kubernetes.io/secrets/openshift-infra/autopruner-token-nbjhc?quorum=false&recursive=false&sorted=false remote:172.20.250.160:47142
Nov 03 21:18:42 ip-172-20-250-161.ap-southeast-1.compute.internal etcd[108036]: [GET] /v2/keys/kubernetes.io/serviceaccounts/openshift-infra/autopruner?quorum=false&recursive=false&sorted=false remote:172.20.250.160:47142
Nov 03 21:18:42 ip-172-20-250-161.ap-southeast-1.compute.internal etcd[108036]: [GET] /v2/keys/kubernetes.io/secrets/openshift-infra/autopruner-token-nbjhc?quorum=false&recursive=false&sorted=false remote:172.20.250.160:47142
Nov 03 21:18:42 ip-172-20-250-161.ap-southeast-1.compute.internal etcd[108036]: [GET] /v2/keys/kubernetes.io/serviceaccounts/openshift-infra/autopruner?quorum=false&recursive=false&sorted=false remote:172.20.250.160:47142
Nov 03 21:18:42 ip-172-20-250-161.ap-southeast-1.compute.internal etcd[108036]: [GET] /v2/keys/kubernetes.io/secrets/openshift-infra/autopruner-token-nbjhc?quorum=false&recursive=false&sorted=false remote:172.20.250.160:47142
Nov 03 21:18:42 ip-172-20-250-161.ap-southeast-1.compute.internal etcd[108036]: [GET] /v2/keys/kubernetes.io/serviceaccounts/openshift-infra/autopruner?quorum=false&recursive=false&sorted=false remote:172.20.250.160:47142
Nov 03 21:18:42 ip-172-20-250-161.ap-southeast-1.compute.internal etcd[108036]: [GET] /v2/keys/kubernetes.io/secrets/openshift-infra/autopruner-token-nbjhc?quorum=false&recursive=false&sorted=false remote:172.20.250.160:47142
Nov 03 21:18:42 ip-172-20-250-161.ap-southeast-1.compute.internal etcd[108036]: [GET] /v2/keys/kubernetes.io/serviceaccounts/openshift-infra/autopruner?quorum=false&recursive=false&sorted=false remote:172.20.250.160:47142 
... REPEAT. 
[snip]

Comment 12 Maciej Szulik 2016-12-01 11:45:56 UTC
*** Bug 1387762 has been marked as a duplicate of this bug. ***

Comment 13 Maciej Szulik 2017-01-03 14:41:24 UTC
Joel, sorry for the long delay. After going through the bug again, it looks like etcd is for some reason overloaded by requests for the autopruner service account in the openshift-infra namespace. Could you, by any chance, verify who's using that SA? It looks like someone is trying to GET it a lot in a short period of time.
Most importantly, can you provide me once again with logs from the master (level 7, preferably), from etcd, and the output of the pruning command, all from exactly the same operation, so that it's easier for me to correlate the events? I need to track down what might be causing these problems.
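
On the SA question, something like this would be a starting point (a rough sketch; exact output will vary with your setup):

# list the autopruner service account and its token secrets, then look for unexpected consumers of that token
oc describe sa autopruner -n openshift-infra
oc get secrets -n openshift-infra | grep autopruner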

Comment 14 Joel Diaz 2017-02-03 19:42:54 UTC
Maciej, the cluster has since been upgraded to 3.3, and I just checked: pruning now works. So I can't reproduce the issue any more.

