Bug 1495139

Summary: device or resource busy error info in prometheus container logs after running for an hour
Product: OpenShift Container Platform
Component: Hawkular
Version: 3.7.0
Target Release: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Status: CLOSED ERRATA
Reporter: Junqi Zhao <juzhao>
Assignee: Zohar Gal-Or <zgalor>
QA Contact: Junqi Zhao <juzhao>
Docs Contact:
CC: aos-bugs, ccoleman, fsimonce, jcantril, mwringe, pweil, theute, zgalor
Target Milestone: ---
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-11-28 22:12:28 UTC
Type: Bug
Attachments:
device or resource busy error info in prometheus container logs (attachment 1330465)

Description Junqi Zhao 2017-09-25 09:36:30 UTC
Created attachment 1330465 [details]
device or resource busy error info in prometheus container logs

Description of problem:
Deploy Prometheus and check the prometheus container logs after a few hours: "device or resource busy" errors show up in the logs. See the attached log file for more details.
# oc logs prometheus-3994606287-x31fv -c prometheus
ts=2017-09-25T09:08:11.876143078Z caller=db.go:288 msg="reloading blocks failed" err="read meta information data/01BTW0JRGM2RY10HTHX2XRBCCH: open data/01BTW0JRGM2RY10HTHX2XRBCCH/meta.json: no such file or directory"
ts=2017-09-25T09:08:41.315337314Z caller=compact.go:327 msg="compact blocks" blocks=[01BTW0JZ463KRJF9D9TJ7NA9A2]
ts=2017-09-25T09:08:41.956814166Z caller=db.go:283 msg="compaction failed" err="delete compacted head block: remove data/01BTW0JZ463KRJF9D9TJ7NA9A2/wal/.nfs000000000203e94300000002: device or resource busy"
ts=2017-09-25T09:08:41.980123205Z caller=db.go:288 msg="reloading blocks failed" err="read meta information data/01BTW0JRGM2RY10HTHX2XRBCCH: open data/01BTW0JRGM2RY10HTHX2XRBCCH/meta.json: no such file or directory"

Although these errors appear, Prometheus functionality is not affected and the prometheus pods keep running:
# oc get po
NAME                          READY     STATUS    RESTARTS   AGE
prometheus-3994606287-j5605   5/5       Running   0          1h
prometheus-3994606287-x31fv   5/5       Running   0          1h
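For context, the .nfs* file in the failed remove is an NFS silly-rename artifact: the NFS client renames a file that is deleted while a process still holds it open, and the leftover entry then prevents Prometheus from removing the compacted WAL directory. A diagnostic sketch for spotting such leftovers from inside the container, assuming the TSDB data directory is mounted at /prometheus in this image (adjust the path to the actual mount in the pod spec):

# oc exec prometheus-3994606287-x31fv -c prometheus -- find /prometheus -name '.nfs*'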


Version-Release number of selected component (if applicable):
Images
prometheus:v3.7.0-7
prometheus-alertmanager:v3.7.0-7
prometheus-alert-buffer:v3.7.0-7
oauth-proxy:v3.7.0-5

How reproducible:
Always

Steps to Reproduce:
1. Deploy prometheus via ansible (an example invocation is sketched after these steps)
2. Check logs
# oc logs ${POD} -c prometheus
3.
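For step 1, a typical invocation looks like the following; the playbook path is an assumption based on the openshift-ansible 3.7 layout and may differ in other versions:

# ansible-playbook -i ${INVENTORY_FILE} playbooks/byo/openshift-cluster/openshift-prometheus.yml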

Actual results:
"device or resource busy" errors appear in the prometheus container logs after a few hours.

Expected results:
There should be no such errors in the pod logs.

Additional info:
[OSEv3:children]
masters
etcd
nfs

[masters]
${MASTER_URL} openshift_public_hostname=${MASTER_URL}

[etcd]
${ETCD} openshift_public_hostname=${ETCD}

[nfs]
${NFS} openshift_public_hostname=${NFS}


[OSEv3:vars]
ansible_ssh_user=root
ansible_ssh_private_key_file="~/libra.pem"
deployment_type=openshift-enterprise
openshift_docker_additional_registries=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888


# prometheus
openshift_prometheus_state=present
openshift_prometheus_namespace=prometheus

openshift_prometheus_replicas=2
openshift_prometheus_node_selector={'role': 'node'}

openshift_prometheus_image_proxy=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/oauth-proxy
openshift_prometheus_image_prometheus=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/prometheus
openshift_prometheus_image_alertmanager=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/prometheus-alertmanager
openshift_prometheus_image_alertbuffer=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/prometheus-alert-buffer
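With the [nfs] group above, the Prometheus data ends up on an NFS-backed PV (confirmed in comment 6). A quick way to check which volume actually backs the pods after deployment (claim and volume names vary per cluster):

# oc get pvc -n prometheus
# oc describe pv ${PV_NAME}   # PV name taken from the VOLUME column above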

Comment 1 Jeff Cantrill 2017-09-26 12:55:02 UTC
@Clayton, Paul suggested we should tag you in on this one.

Comment 2 Zohar Gal-Or 2017-09-26 14:10:16 UTC
If I am not mistaken, 2 replicas are a problem for Prometheus, because both use the same PV and might collide.
I think this is not, and will not be, supported.
Clayton, keep me honest here.

Comment 3 Junqi Zhao 2017-09-27 07:56:49 UTC
The same error still appears after a few hours even with openshift_prometheus_replicas=1.

Comment 4 Federico Simoncelli 2017-09-27 09:16:46 UTC
(In reply to Junqi Zhao from comment #0)
> # oc logs prometheus-3994606287-x31fv -c prometheus
> ts=2017-09-25T09:08:11.876143078Z caller=db.go:288 msg="reloading blocks
> failed" err="read meta information data/01BTW0JRGM2RY10HTHX2XRBCCH: open
> data/01BTW0JRGM2RY10HTHX2XRBCCH/meta.json: no such file or directory"
> ts=2017-09-25T09:08:41.315337314Z caller=compact.go:327 msg="compact blocks"
> blocks=[01BTW0JZ463KRJF9D9TJ7NA9A2]
> ts=2017-09-25T09:08:41.956814166Z caller=db.go:283 msg="compaction failed"
> err="delete compacted head block: remove
> data/01BTW0JZ463KRJF9D9TJ7NA9A2/wal/.nfs000000000203e94300000002: device or
> resource busy"
> ts=2017-09-25T09:08:41.980123205Z caller=db.go:288 msg="reloading blocks
> failed" err="read meta information data/01BTW0JRGM2RY10HTHX2XRBCCH: open
> data/01BTW0JRGM2RY10HTHX2XRBCCH/meta.json: no such file or directory"

(In reply to Junqi Zhao from comment #3)
> It has the same error even set openshift_prometheus_replicas=1 after a few
> hours


Thomas, the errors reported in comment 0 seem to be Prometheus issues; should we reassign this to your team?

Comment 5 Paul Gier 2017-09-29 02:49:19 UTC
@Junqi Is the data volume mapped to an NFS share?  If so, could you try using local storage instead?  The prometheus devs recommend not using a network file system.

Comment 6 Junqi Zhao 2017-09-29 08:00:56 UTC
(In reply to Paul Gier from comment #5)
> @Junqi Is the data volume mapped to an NFS share?  
  Yes, an NFS PV is used.

> If so, could you try
> using local storage instead?  The prometheus devs recommend not using a
> network file system.
How can we use local storage? For now we are blocked by [1] if we don't use an NFS PV.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1495446

Comment 7 Thomas Heute 2017-10-04 12:29:12 UTC
1 - Right, we can't "scale" Prometheus, so it needs a maxReplicas of 1 AFAIK.
2 - Indeed, NFS should not be used for Prometheus. It should be hostPath on the same node, I think (with a node selector?).
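For illustration, a hostPath-backed PV along those lines could look roughly like the sketch below; the name, size, and path here are made up, and the pod still has to be scheduled onto the node that holds the directory (hence the node selector):

# oc create -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-local            # illustrative name
spec:
  capacity:
    storage: 10Gi                   # illustrative size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /var/lib/prometheus-data  # must exist on the node the pod runs on
EOF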

Comment 8 Zohar Gal-Or 2017-10-09 06:50:29 UTC
NFS can be used for Prometheus, but it has a few bugs, and it is now being refactored.
https://github.com/openshift/openshift-ansible/pull/5459

Comment 9 Thomas Heute 2017-10-09 12:37:30 UTC
What Paul and I meant is that NFS is not recommended; speed would likely become an issue quickly.

Comment 13 Junqi Zhao 2017-11-01 05:24:10 UTC
There is no error info in the prometheus container logs.

env:
# rpm -qa | grep openshift-ansible
openshift-ansible-callback-plugins-3.7.0-0.184.0.git.0.d407445.el7.noarch
openshift-ansible-3.7.0-0.184.0.git.0.d407445.el7.noarch
openshift-ansible-filter-plugins-3.7.0-0.184.0.git.0.d407445.el7.noarch
openshift-ansible-playbooks-3.7.0-0.184.0.git.0.d407445.el7.noarch
openshift-ansible-docs-3.7.0-0.184.0.git.0.d407445.el7.noarch
openshift-ansible-lookup-plugins-3.7.0-0.184.0.git.0.d407445.el7.noarch
openshift-ansible-roles-3.7.0-0.184.0.git.0.d407445.el7.noarch


# openshift version
openshift v3.7.0-0.184.0
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8

Images:
oauth-proxy/images/v3.7.0-54
prometheus/images/v3.7.0-54
prometheus-alertmanager/images/v3.7.0-54
prometheus-alert-buffer/images/v3.7.0-51
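A quick way to re-check for the error after letting the pods run for a while is to grep the logs for the original message, for example:

# oc logs ${POD} -c prometheus | grep -i 'device or resource busy'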

Comment 17 errata-xmlrpc 2017-11-28 22:12:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188