Created attachment 1330465 [details]
device or resource busy error info in prometheus container logs

Description of problem:
Deploy Prometheus and check the prometheus container logs after a few hours; "device or resource busy" errors appear in the logs. See the attached log file for more details.

# oc logs prometheus-3994606287-x31fv -c prometheus
ts=2017-09-25T09:08:11.876143078Z caller=db.go:288 msg="reloading blocks failed" err="read meta information data/01BTW0JRGM2RY10HTHX2XRBCCH: open data/01BTW0JRGM2RY10HTHX2XRBCCH/meta.json: no such file or directory"
ts=2017-09-25T09:08:41.315337314Z caller=compact.go:327 msg="compact blocks" blocks=[01BTW0JZ463KRJF9D9TJ7NA9A2]
ts=2017-09-25T09:08:41.956814166Z caller=db.go:283 msg="compaction failed" err="delete compacted head block: remove data/01BTW0JZ463KRJF9D9TJ7NA9A2/wal/.nfs000000000203e94300000002: device or resource busy"
ts=2017-09-25T09:08:41.980123205Z caller=db.go:288 msg="reloading blocks failed" err="read meta information data/01BTW0JRGM2RY10HTHX2XRBCCH: open data/01BTW0JRGM2RY10HTHX2XRBCCH/meta.json: no such file or directory"

Although these errors appear, they do not affect Prometheus functionality; the Prometheus pods remain running:

# oc get po
NAME                          READY     STATUS    RESTARTS   AGE
prometheus-3994606287-j5605   5/5       Running   0          1h
prometheus-3994606287-x31fv   5/5       Running   0          1h

Version-Release number of selected component (if applicable):
Images:
prometheus:v3.7.0-7
prometheus-alertmanager:v3.7.0-7
prometheus-alert-buffer:v3.7.0-7
oauth-proxy:v3.7.0-5

How reproducible:
Always

Steps to Reproduce:
1. Deploy Prometheus via ansible
2. Check the logs
   # oc logs ${POD} -c prometheus

Actual results:
"device or resource busy" errors appear in the prometheus container logs after a few hours.

Expected results:
There should be no errors in the pod logs.

Additional info (ansible inventory):
[OSEv3:children]
masters
etcd
nfs

[masters]
${MASTER_URL} openshift_public_hostname=${MASTER_URL}

[etcd]
${ETCD} openshift_public_hostname=${ETCD}

[nfs]
${NFS} openshift_public_hostname=${NFS}

[OSEv3:vars]
ansible_ssh_user=root
ansible_ssh_private_key_file="~/libra.pem"
deployment_type=openshift-enterprise
openshift_docker_additional_registries=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888

# prometheus
openshift_prometheus_state=present
openshift_prometheus_namespace=prometheus
openshift_prometheus_replicas=2
openshift_prometheus_node_selector={'role': 'node'}
openshift_prometheus_image_proxy=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/oauth-proxy
openshift_prometheus_image_prometheus=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/prometheus
openshift_prometheus_image_alertmanager=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/prometheus-alertmanager
openshift_prometheus_image_alertbuffer=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/prometheus-alert-buffer
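Note: the .nfs000000000203e94300000002 file in the compaction error is the kind of leftover an NFS client creates when a file is removed while a process still holds it open ("silly rename"), which is why deleting the compacted WAL directory fails with "device or resource busy". One way to see whether such files are lingering is to list the WAL directories inside the container; this assumes the image provides a shell and the data volume is mounted at /prometheus (both are assumptions, adjust the pod name and path as needed):

# oc exec prometheus-3994606287-x31fv -c prometheus -- sh -c 'ls -la /prometheus/*/wal/'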
@Clayton, Paul suggested we should tag you in on this one.
If I am not mistaken, 2 replicas are a problem for Prometheus, because both use the same PV and might collide. I think this is not and will not be supported. Clayton, keep me honest here.
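For reference, running a single replica is a one-variable change to the inventory from comment 0 (a minimal tweak; the rest of the inventory stays the same):

# prometheus
openshift_prometheus_replicas=1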
The same error appears after a few hours even with openshift_prometheus_replicas=1.
(In reply to Junqi Zhao from comment #0)
> # oc logs prometheus-3994606287-x31fv -c prometheus
> ts=2017-09-25T09:08:11.876143078Z caller=db.go:288 msg="reloading blocks failed" err="read meta information data/01BTW0JRGM2RY10HTHX2XRBCCH: open data/01BTW0JRGM2RY10HTHX2XRBCCH/meta.json: no such file or directory"
> ts=2017-09-25T09:08:41.315337314Z caller=compact.go:327 msg="compact blocks" blocks=[01BTW0JZ463KRJF9D9TJ7NA9A2]
> ts=2017-09-25T09:08:41.956814166Z caller=db.go:283 msg="compaction failed" err="delete compacted head block: remove data/01BTW0JZ463KRJF9D9TJ7NA9A2/wal/.nfs000000000203e94300000002: device or resource busy"
> ts=2017-09-25T09:08:41.980123205Z caller=db.go:288 msg="reloading blocks failed" err="read meta information data/01BTW0JRGM2RY10HTHX2XRBCCH: open data/01BTW0JRGM2RY10HTHX2XRBCCH/meta.json: no such file or directory"

(In reply to Junqi Zhao from comment #3)
> The same error appears after a few hours even with openshift_prometheus_replicas=1.

Thomas, the errors reported in comment 0 look like Prometheus issues; should we reassign this to your team?
@Junqi Is the data volume mapped to an NFS share? If so, could you try using local storage instead? The prometheus devs recommend not using a network file system.
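A quick way to confirm what backs the data volume is to inspect the bound PV; the PV name below is a placeholder, use the one actually bound in the prometheus namespace. If the Source section of the describe output shows an NFS server and path, the volume is NFS-backed:

# oc get pvc -n prometheus
# oc describe pv <pv-name>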
(In reply to Paul Gier from comment #5)
> @Junqi Is the data volume mapped to an NFS share?
Yes, an NFS PV is used.

> If so, could you try using local storage instead? The prometheus devs
> recommend not using a network file system.
How do we use local storage? Also, if we don't use an NFS PV, we are currently blocked by [1].

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1495446
1 - Right, we can't "scale" Prometheus, so it needs a maxReplicas of 1 AFAIK.
2 - Indeed, NFS should not be used for Prometheus. It should be hostPath on the same node, I think. (With a node selector?)
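Not the supported openshift-ansible flow, just a rough sketch of what hostPath plus node pinning could look like; the PV name, path, size, and node value below are hypothetical:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-data
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /var/lib/prometheus-data    # directory must already exist on the chosen node

The prometheus pod spec would then need a nodeSelector pinning it to the node that owns that directory, e.g.:

  nodeSelector:
    kubernetes.io/hostname: <node-with-the-hostPath-dir>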
NFS can be used for Prometheus, but it has a few bugs, and it is now being refactored. https://github.com/openshift/openshift-ansible/pull/5459
What Paul and I meant is that NFS is not recommended; speed would likely become an issue quickly.
There is no longer any error info in the prometheus container logs.

env:
# rpm -qa | grep openshift-ansible
openshift-ansible-callback-plugins-3.7.0-0.184.0.git.0.d407445.el7.noarch
openshift-ansible-3.7.0-0.184.0.git.0.d407445.el7.noarch
openshift-ansible-filter-plugins-3.7.0-0.184.0.git.0.d407445.el7.noarch
openshift-ansible-playbooks-3.7.0-0.184.0.git.0.d407445.el7.noarch
openshift-ansible-docs-3.7.0-0.184.0.git.0.d407445.el7.noarch
openshift-ansible-lookup-plugins-3.7.0-0.184.0.git.0.d407445.el7.noarch
openshift-ansible-roles-3.7.0-0.184.0.git.0.d407445.el7.noarch

# openshift version
openshift v3.7.0-0.184.0
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8

Images:
oauth-proxy/images/v3.7.0-54
prometheus/images/v3.7.0-54
prometheus-alertmanager/images/v3.7.0-54
prometheus-alert-buffer/images/v3.7.0-51
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188