Description of problem:
Alert manager shows an error that targets are down. The pods are running, but Prometheus shows these errors:

level=error ts=2019-02-08T13:18:57.999039334Z caller=wal.go:713 component=tsdb msg="sync failed" err="flush buffer: write /prometheus/wal/001769: transport endpoint is not connected"
level=warn ts=2019-02-08T13:19:00.166748462Z caller=manager.go:402 component="rule manager" group=general.rules msg="rule sample appending failed" err="WAL log samples: log series: write /prometheus/wal/001769: transport endpoint is not connected"

Full logs will be attached. Prometheus has enough space in the storage.

Version-Release number of selected component (if applicable):
OpenShift Container Platform 3.11.43

How reproducible:
n/a

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
> level=error ts=2019-02-05T17:04:42.756339407Z caller=nflog.go:360 component=nflog msg="Running maintenance failed" err="open /alertmanager/nflog.xxx: read-only file system"
> level=info ts=2019-02-05T17:04:42.757529077Z caller=silence.go:291 component=silences msg="Running maintenance failed" err="open /alertmanager/silences.xxx: read-only file system"

Can you verify that the Alertmanager file system is writable?

> WAL log samples: log series: write /prometheus/wal/xxx: transport endpoint is not connected

This also looks like a file system issue. What is the underlying storage provider? Are there any other components in your OpenShift cluster facing similar issues?

Adding Krasi, as he is our TSDB expert. Would you mind taking a look as well?
Just checked the relevant TSDB code and it is definitely a file system issue.
Thanks Krasi! We would need more info from your side then, @Vladislav.
Vladislav, would you mind execing into the Alertmanager container via `kubectl exec` [1] and trying to create something inside the `/alertmanager` directory?

[1] https://kubernetes.io/docs/tasks/debug-application-cluster/get-shell-running-container/
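For reference, the probe amounts to creating and removing a file in the volume directory. A minimal sketch as a shell function (`probe_writable` is just an illustrative name, and the pod name in the comment assumes the default openshift-monitoring layout):

```shell
# probe_writable DIR: try to create and remove a file in DIR, report the result.
probe_writable() {
  if touch "$1/.write-probe" 2>/dev/null; then
    rm -f "$1/.write-probe"
    echo "writable"
  else
    echo "not writable"
  fi
}

# Inside the cluster, the same check would run in the container, e.g.:
#   kubectl exec alertmanager-main-0 -n openshift-monitoring -- \
#     sh -c 'touch /alertmanager/.write-probe && rm /alertmanager/.write-probe && echo writable'
probe_writable /tmp
```

A "not writable" result would match the "read-only file system" errors in the Alertmanager log above.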
(In reply to minden from comment #6)
> Would you mind execing into the Alertmanager container Vladislav via
> `kubectl exec` [1] and try to create something inside the `/alertmanager`
> directory?
>
> [1] https://kubernetes.io/docs/tasks/debug-application-cluster/get-shell-running-container/

This was already done and they can create files in the directory.
Maybe it is running as a different user that doesn't have permissions on this folder.
It would make no sense: if you do `oc rsh <pod>` you are connected as the same user that runs the process, and we tested that this user can create files on the storage.
All logs indicate an issue with the file system, so we should dig deeper in that direction.
I just remembered that if there was an issue with the file system, Prometheus might need a restart before it can write to the same fs again, even after that problem has been resolved. It might be the same for the Alertmanager.

The error is different, but might be related.
https://github.com/prometheus/prometheus/issues/3283
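For what it's worth, the restart itself would just be deleting the pods so the StatefulSet recreates them and the WAL is re-opened on the volume. A dry-run sketch that only prints the commands (pod names and namespace assume the default 3.11 cluster-monitoring layout):

```shell
# Print (not execute) the commands to restart the Prometheus replicas; the
# StatefulSet recreates each pod after deletion.
restart_cmds() {
  for pod in prometheus-k8s-0 prometheus-k8s-1; do
    echo "oc delete pod -n openshift-monitoring $pod"
  done
}
restart_cmds
```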
btw "transport endpoint is not connected" is an error that comes from the file system, and a quick google search shows that this is probably some intermittent fault with glusterfs or the sshfs mounts, or whichever file system is used for the storage.
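For context, that message is the kernel's text for errno ENOTCONN, which FUSE-backed mounts such as glusterfs or sshfs surface for every I/O once the daemon serving the mount has gone away. A quick way to confirm the mapping (assuming a Linux box with python3 available):

```shell
# Print the errno number and message for ENOTCONN (107 on Linux).
python3 -c 'import errno, os; print(errno.ENOTCONN, os.strerror(errno.ENOTCONN))'
```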
(In reply to Krasi from comment #16)
> I just remembered that if there was an issue with the file-system Prometheus
> might need a restart before it can write to the same fs again even when that
> problem has been resolved. It might be the same for the Alertmanager.
>
> The error is different, but might be related.
> https://github.com/prometheus/prometheus/issues/3283

Thx for the issue. It might be the same issue here. The filesystem usage might have grown, but when it decreased, Prometheus was not able to recover. Do we have a fix for that?

(In reply to Krasi from comment #17)
> btw "transport endpoint is not connected" is an error that comes from the
> file system and a quick google search shows that this is probably some
> intermittent fault with glusterfs or the sshfs mounts or whichever
> file-system is used for the storage.

Where do you see the issue - "transport endpoint is not connected"?

Thx
just adding that the prometheus in the monitoring stack is in version 3.11.X
Unfortunately no fix yet. Someone started a PR for this, but so far it is not going anywhere:
https://github.com/prometheus/tsdb/pull/247
> 15:37:02, 2019-02-15 message: API server is erroring for 100% of requests.
> 15:37:02, 2019-02-15 message: API server is erroring for 100% of requests.

This Prometheus alert relates to the Kubernetes API (a.k.a. the OpenShift API); you can find more details here: https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/runbook.md

> To me it is strange we got the alert only for prometheus-0 and not for both of them (both of them are at 23% of usage): I suppose they have the same data for HA reasons: am I right?

They should have the same data. Reasons for them not to be aligned could be, e.g., that one stopped scraping targets for some time due to some lower-level issue, or that it was not able to discover targets due to connection issues to the API server (see above). This might also relate to https://bugzilla.redhat.com/show_bug.cgi?id=1674270

> Where do you see the issue - "transport endpoint is not connected"?

It is part of your initial comment (https://bugzilla.redhat.com/show_bug.cgi?id=1674378#c0).

@Vladislav, given that this bug report contains a lot of status updates, would you mind summarizing the current state of the cluster / monitoring stack?
> What is the TTL of the alert?

You can find the alert definition here [1]. As soon as it no longer has an error rate of 10% over the last 10 minutes, it stops firing.

> Also the API shows only one alert - so practically you don't know which master api node has problems.

Yes, the alert does not differentiate between the instances. That would need to be done manually via the Prometheus query UI. You can take a look at the query in [1] and remove `instance` and `pod` from the `without` statement to look at individual instances.

> just one more question; we had 2 alerts instead of 3 that started at the same time; does that mean that one controller/API was still working?

No, I am surprised that you got two. The alert is not per instance, but for all instances. For next steps see the comment above.

Let me know if this helps.

[1] https://github.com/coreos/prometheus-operator/blob/master/contrib/kube-prometheus/manifests/prometheus-rules.yaml#L767
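To illustrate the `without` mechanics (a generic sketch only, not the shipped rule; the metric name and label set are assumptions, see [1] above for the actual expression): aggregating `without (instance, pod)` collapses all API servers into one series, so there is one alert; keeping those labels as grouping labels yields one series, and hence one alert, per instance.

```
# Aggregated over all instances - one series, one alert (sketch):
sum(rate(apiserver_request_count{code=~"5.."}[10m])) without (instance, pod)

# Per-instance view - keep instance/pod as grouping labels instead:
sum(rate(apiserver_request_count{code=~"5.."}[10m])) by (instance, pod)
```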
Hello Max, thx for the links. I have one last request: we see that the two Alertmanager pods are running on the same node. This might be an issue when the node crashes, as 2 of the 3 pods would be unavailable. We could spread them with anti-affinity rules; however, it is not possible to change that in 3.11 as the operator reverts the changes. So the question is: are there any anti-affinity rules already in place? If not, I will request it in an RFE. Thank you
Yes, Alertmanager already has anti-affinity rules configured:

```
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: alertmanager
            operator: In
            values:
            - main
        namespaces:
        - openshift-monitoring
        topologyKey: kubernetes.io/hostname
      weight: 100
```

What I can imagine is that maybe for some reason there was no other eligible node available at the original scheduling time, so it didn't make a hard decision. The idea of the monitoring stack is that in any way possible we try to continue to function rather than a hard stop. If you just kick the pods and there are multiple eligible nodes, then you should see the expected behavior.
Vladislav, Is this now resolved for you? -Lucas
Hi, just adding the current bug we see:

> If there was an issue with the file-system Prometheus might need a restart before it can write to the same fs again even when that problem has been resolved.

Prometheus should be able to recover automatically when the issue is resolved.

Thank you
Yes, I agree completely and it will probably be fixed eventually, but it is not high priority so I wouldn't expect it to be fixed anytime soon. I assume this is caused by some sort of weird kernel syscall limitation, but I haven't looked into it yet.

There was some interest from a contributor to provide a fix, but after a few reminders it is not going anywhere, so chances are I or one of the other maintainers will have to provide a fix at some point.

https://github.com/prometheus/tsdb/pull/247
Actually, I fixed this upstream and it is already included in the Prometheus 2.10 release. Here is the relevant PR:
https://github.com/prometheus/tsdb/pull/582
We've already bumped Prometheus to 2.10 for OpenShift 4.2, so we'll be releasing the fix with that. Unfortunately backporting is not really possible as the codebase has changed so significantly since the version shipped in 3.11.
Prometheus version is 2.10 now; we don't see the error "transport endpoint is not connected".

payload: 4.2.0-0.nightly-2019-06-24-160709
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922