Bug 1495139
Summary: | device or resource busy error info in prometheus container logs after running for an hour
---|---
Product: | OpenShift Container Platform
Reporter: | Junqi Zhao <juzhao>
Component: | Hawkular
Assignee: | Zohar Gal-Or <zgalor>
Status: | CLOSED ERRATA
QA Contact: | Junqi Zhao <juzhao>
Severity: | medium
Docs Contact: |
Priority: | medium
Version: | 3.7.0
CC: | aos-bugs, ccoleman, fsimonce, jcantril, mwringe, pweil, theute, zgalor
Target Milestone: | ---
Target Release: | 3.7.0
Hardware: | Unspecified
OS: | Unspecified
Whiteboard: |
Fixed In Version: |
Doc Type: | If docs needed, set a value
Doc Text: |
Story Points: | ---
Clone Of: |
Environment: |
Last Closed: | 2017-11-28 22:12:28 UTC
Type: | Bug
Regression: | ---
Mount Type: | ---
Documentation: | ---
CRM: |
Verified Versions: |
Category: | ---
oVirt Team: | ---
RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | ---
Target Upstream Version: |
Embargoed: |
Attachments: |
@Clayton, Paul suggested we should tag you in on this one. If I am not mistaken, 2 replicas is a problem for Prometheus, because both are using the same PV and might collide. I think this is not and will not be supported. Clayton, keep me honest here.

The same error appears even with openshift_prometheus_replicas=1 set, after a few hours.

(In reply to Junqi Zhao from comment #0)
> # oc logs prometheus-3994606287-x31fv -c prometheus
> ts=2017-09-25T09:08:11.876143078Z caller=db.go:288 msg="reloading blocks failed" err="read meta information data/01BTW0JRGM2RY10HTHX2XRBCCH: open data/01BTW0JRGM2RY10HTHX2XRBCCH/meta.json: no such file or directory"
> ts=2017-09-25T09:08:41.315337314Z caller=compact.go:327 msg="compact blocks" blocks=[01BTW0JZ463KRJF9D9TJ7NA9A2]
> ts=2017-09-25T09:08:41.956814166Z caller=db.go:283 msg="compaction failed" err="delete compacted head block: remove data/01BTW0JZ463KRJF9D9TJ7NA9A2/wal/.nfs000000000203e94300000002: device or resource busy"
> ts=2017-09-25T09:08:41.980123205Z caller=db.go:288 msg="reloading blocks failed" err="read meta information data/01BTW0JRGM2RY10HTHX2XRBCCH: open data/01BTW0JRGM2RY10HTHX2XRBCCH/meta.json: no such file or directory"

(In reply to Junqi Zhao from comment #3)
> The same error appears even with openshift_prometheus_replicas=1 set, after a few hours.

Thomas, the errors reported in comment 0 seem to be Prometheus issues; should we reassign this to your team?

@Junqi Is the data volume mapped to an NFS share? If so, could you try using local storage instead? The Prometheus devs recommend not using a network file system.

(In reply to Paul Gier from comment #5)
> @Junqi Is the data volume mapped to an NFS share?

Yes, an NFS PV was used.

> If so, could you try using local storage instead? The Prometheus devs recommend not using a network file system.

How can we use local storage? Right now deployment is blocked by [1] if we don't use an NFS PV.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1495446

1 - Right, we can't "scale" Prometheus, so it needs a maxReplicas of 1 AFAIK.
2 - Indeed, NFS should not be used for Prometheus. It should be hostPath on the same node, I think (with a node selector?).

NFS can be used for Prometheus, but it has a few bugs, and it is now being refactored: https://github.com/openshift/openshift-ansible/pull/5459

What Paul and I meant is that NFS is not recommended; speed would likely quickly become an issue.

There is no error info in the prometheus container logs.

env:
# rpm -qa | grep openshift-ansible
openshift-ansible-callback-plugins-3.7.0-0.184.0.git.0.d407445.el7.noarch
openshift-ansible-3.7.0-0.184.0.git.0.d407445.el7.noarch
openshift-ansible-filter-plugins-3.7.0-0.184.0.git.0.d407445.el7.noarch
openshift-ansible-playbooks-3.7.0-0.184.0.git.0.d407445.el7.noarch
openshift-ansible-docs-3.7.0-0.184.0.git.0.d407445.el7.noarch
openshift-ansible-lookup-plugins-3.7.0-0.184.0.git.0.d407445.el7.noarch
openshift-ansible-roles-3.7.0-0.184.0.git.0.d407445.el7.noarch

# openshift version
openshift v3.7.0-0.184.0
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8

Images:
oauth-proxy/images/v3.7.0-54
prometheus/images/v3.7.0-54
prometheus-alertmanager/images/v3.7.0-54
prometheus-alert-buffer/images/v3.7.0-51

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188
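The comments above suggest pinning Prometheus to a single node and using hostPath storage instead of an NFS PV. As a rough sketch of that idea only (not something from this bug report; the node name, label, PV name, capacity, and path are all illustrative assumptions), one could pre-create a hostPath PersistentVolume on a labeled node:

# Hypothetical sketch: node name, label, PV name, size, and path are assumptions.
# Label a node so Prometheus and its data stay together.
oc label node node1.example.com prometheus-storage=true

# Create a hostPath PV on that node for the Prometheus data volume.
cat <<'EOF' | oc create -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-local-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /var/lib/prometheus-data
EOF

With openshift_prometheus_replicas=1 and openshift_prometheus_node_selector pointing at the same label, the pod and its data would stay on one node, avoiding the NFS client behaviour seen in the logs.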
Created attachment 1330465 [details]
device or resource busy error info in prometheus container logs

Description of problem:
Deploy Prometheus and check the prometheus container logs after a few hours; "device or resource busy" errors show up in the logs. More info is in the attached log file.

# oc logs prometheus-3994606287-x31fv -c prometheus
ts=2017-09-25T09:08:11.876143078Z caller=db.go:288 msg="reloading blocks failed" err="read meta information data/01BTW0JRGM2RY10HTHX2XRBCCH: open data/01BTW0JRGM2RY10HTHX2XRBCCH/meta.json: no such file or directory"
ts=2017-09-25T09:08:41.315337314Z caller=compact.go:327 msg="compact blocks" blocks=[01BTW0JZ463KRJF9D9TJ7NA9A2]
ts=2017-09-25T09:08:41.956814166Z caller=db.go:283 msg="compaction failed" err="delete compacted head block: remove data/01BTW0JZ463KRJF9D9TJ7NA9A2/wal/.nfs000000000203e94300000002: device or resource busy"
ts=2017-09-25T09:08:41.980123205Z caller=db.go:288 msg="reloading blocks failed" err="read meta information data/01BTW0JRGM2RY10HTHX2XRBCCH: open data/01BTW0JRGM2RY10HTHX2XRBCCH/meta.json: no such file or directory"

Although these errors are logged, Prometheus functionality is not affected and the prometheus pods stay running:

# oc get po
NAME                          READY     STATUS    RESTARTS   AGE
prometheus-3994606287-j5605   5/5       Running   0          1h
prometheus-3994606287-x31fv   5/5       Running   0          1h

Version-Release number of selected component (if applicable):
Images
prometheus:v3.7.0-7
prometheus-alertmanager:v3.7.0-7
prometheus-alert-buffer:v3.7.0-7
oauth-proxy:v3.7.0-5

How reproducible:
Always

Steps to Reproduce:
1. Deploy Prometheus via ansible
2. Check the logs: # oc logs ${POD} -c prometheus
3.

Actual results:
"device or resource busy" errors appear in the prometheus container logs after a few hours.

Expected results:
There should be no errors in the pod logs.

Additional info:
Ansible inventory used for the deployment:

[OSEv3:children]
masters
etcd
nfs

[masters]
${MASTER_URL} openshift_public_hostname=${MASTER_URL}

[etcd]
${ETCD} openshift_public_hostname=${ETCD}

[nfs]
${NFS} openshift_public_hostname=${NFS}

[OSEv3:vars]
ansible_ssh_user=root
ansible_ssh_private_key_file="~/libra.pem"
deployment_type=openshift-enterprise
openshift_docker_additional_registries=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888

# prometheus
openshift_prometheus_state=present
openshift_prometheus_namespace=prometheus
openshift_prometheus_replicas=2
openshift_prometheus_node_selector={'role': 'node'}
openshift_prometheus_image_proxy=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/oauth-proxy
openshift_prometheus_image_prometheus=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/prometheus
openshift_prometheus_image_alertmanager=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/prometheus-alertmanager
openshift_prometheus_image_alertbuffer=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/prometheus-alert-buffer
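Background on the error itself: the .nfs000000000203e94300000002 name in the compaction failure is the NFS client's "silly rename" placeholder for a file that was deleted while a process still had it open, and removing such a file (or its parent directory) fails with "device or resource busy". As a hypothetical check, not taken from this bug report, one could look for these leftovers from inside the pod; the /prometheus mount path is an assumption, and the image is assumed to ship a shell and standard coreutils:

# Hypothetical check: list NFS silly-rename leftovers in the data volume.
# Pod name comes from the log excerpt above; the /prometheus path may differ.
oc exec prometheus-3994606287-x31fv -c prometheus -- \
  sh -c 'ls -laR /prometheus | grep "\.nfs"'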