Bug 1674378 - [BUG] prometheus and alertmanager - all targets are not connected and in status down
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.2.0
Assignee: Frederic Branczyk
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-02-11 08:16 UTC by Vladislav Walek
Modified: 2020-02-27 13:10 UTC
CC List: 9 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:27:41 UTC
Target Upstream Version:
Embargoed:


Attachments


Links:
Red Hat Product Errata RHBA-2019:2922 (Last Updated: 2019-10-16 06:27:56 UTC)

Description Vladislav Walek 2019-02-11 08:16:57 UTC
Description of problem:

Alertmanager reports that targets are down. The pods are running, but Prometheus logs these errors:
level=error ts=2019-02-08T13:18:57.999039334Z caller=wal.go:713 component=tsdb msg="sync failed" err="flush buffer: write /prometheus/wal/001769: transport endpoint is not connected"
level=warn ts=2019-02-08T13:19:00.166748462Z caller=manager.go:402 component="rule manager" group=general.rules msg="rule sample appending failed" err="WAL log samples: log series: write /prometheus/wal/001769: transport endpoint is not connected"

Full logs will be attached.
Prometheus has enough free space on its storage volume.
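
A check of this kind can be run from inside the pod; a minimal sketch, assuming the default openshift-monitoring namespace and the usual pod/container names:

```
# Check free space and the WAL directory from inside the Prometheus pod.
# Pod and container names below are the usual defaults and may differ.
oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- df -h /prometheus
oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- ls -l /prometheus/wal
```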

Version-Release number of selected component (if applicable):
OpenShift Container Platform 3.11.43

How reproducible:
n/a

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 minden 2019-02-11 11:38:42 UTC
> level=error ts=2019-02-05T17:04:42.756339407Z caller=nflog.go:360 component=nflog msg="Running maintenance failed" err="open /alertmanager/nflog.xxx: read-only file system"
> level=info ts=2019-02-05T17:04:42.757529077Z caller=silence.go:291 component=silences msg="Running maintenance failed" err="open /alertmanager/silences.xxx: read-only file system"

Can you verify that the Alertmanager file system is writable?

> WAL log samples: log series: write /prometheus/wal/xxx: transport endpoint is not connected

This also looks like a file system issue.

What is the underlying storage provider? Are there any other components in your OpenShift cluster facing similar issues?

Adding Krasi, as he is our TSDB expert. Krasi, would you mind taking a look as well?

Comment 3 Krasi 2019-02-11 21:08:28 UTC
Just checked the relevant TSDB code and it is definitely a file system issue.

Comment 4 minden 2019-02-12 11:54:18 UTC
Thanks Krasi!

We would need more info from your side then, @Vladislav.

Comment 6 minden 2019-02-14 13:17:00 UTC
Vladislav, would you mind execing into the Alertmanager container via `kubectl exec` [1] and trying to create something inside the `/alertmanager` directory?

[1] https://kubernetes.io/docs/tasks/debug-application-cluster/get-shell-running-container/
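
A minimal sketch of that check, assuming the default openshift-monitoring namespace and the usual Alertmanager pod/container names:

```
# Try to create and remove a file on the Alertmanager storage volume.
# Namespace, pod and container names are the usual defaults and may differ.
kubectl -n openshift-monitoring exec alertmanager-main-0 -c alertmanager -- \
  sh -c 'touch /alertmanager/write-test && rm /alertmanager/write-test && echo writable'
```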

Comment 7 Vladislav Walek 2019-02-14 16:30:44 UTC
(In reply to minden from comment #6)
> Vladislav, would you mind execing into the Alertmanager container via
> `kubectl exec` [1] and trying to create something inside the `/alertmanager`
> directory?
> 
> [1] https://kubernetes.io/docs/tasks/debug-application-cluster/get-shell-running-container/

This was already done and they can create files in the directory.

Comment 13 Krasi 2019-02-19 10:51:25 UTC
Maybe it is running as a different user that doesn't have permissions on that directory.

Comment 14 Vladislav Walek 2019-02-19 12:45:05 UTC
That would make no sense: if you do `oc rsh <pod>`, you connect as the same user that runs the process,
and we tested that this user can create files on the storage.
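
A sketch of such a check (pod and container names below are the usual openshift-monitoring defaults and may differ):

```
# Show the effective user inside the Prometheus pod and confirm its storage
# is writable by that user.
oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- id
oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- \
  sh -c 'touch /prometheus/write-test && rm /prometheus/write-test && echo writable'
```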

Comment 15 Krasi 2019-02-19 13:14:17 UTC
All logs indicate an issue with the file system, so we should dig deeper in that direction.

Comment 16 Krasi 2019-02-19 13:21:22 UTC
I just remembered that if there was an issue with the file system, Prometheus might need a restart before it can write to the same filesystem again, even after that problem has been resolved. The same might apply to Alertmanager.

The error in that issue is different, but it might be related:
https://github.com/prometheus/prometheus/issues/3283
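
A minimal sketch of such a restart, assuming the default openshift-monitoring pod names; the stateful set recreates the deleted pods:

```
# Restart the Prometheus pods so they remount the storage cleanly; the
# stateful set recreates them. Pod names are the usual defaults and may
# differ. The same can be done for the Alertmanager pods if needed.
oc -n openshift-monitoring delete pod prometheus-k8s-0 prometheus-k8s-1
```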

Comment 17 Krasi 2019-02-19 13:30:09 UTC
btw "transport endpoint is not connected" is an error that comes from the file system and a quick google search shows that this is probably some intermittent fault with glusterfs or the sshfs mounts or whichever file-system is used for the storage.

Comment 18 Vladislav Walek 2019-02-19 13:53:50 UTC
(In reply to Krasi from comment #16)
> I just remembered that if there was an issue with the file system, Prometheus
> might need a restart before it can write to the same filesystem again, even
> after that problem has been resolved. The same might apply to Alertmanager.
> 
> The error is different, but might be related.
> https://github.com/prometheus/prometheus/issues/3283

Thanks for the issue link. It might be the same thing here: the filesystem might have filled up, and when the usage went down again Prometheus was not able to recover.
Do we have a fix for that?

(In reply to Krasi from comment #17)
> btw "transport endpoint is not connected" is an error that comes from the
> file system and a quick google search shows that this is probably some
> intermittent fault with glusterfs or the sshfs mounts or whichever
> file-system is used for the storage.

Where do you see the "transport endpoint is not connected" error?

Thanks

Comment 19 Vladislav Walek 2019-02-19 13:56:38 UTC
Just adding that the Prometheus in the monitoring stack is the one shipped with OpenShift 3.11.x.

Comment 20 Krasi 2019-02-19 21:38:12 UTC
Unfortunately there is no fix yet. Someone started a PR for this, but so far it is not going anywhere:
https://github.com/prometheus/tsdb/pull/247

Comment 21 minden 2019-02-20 10:43:58 UTC
> 15:37:02, 2019-02-15 message:	API server is erroring for 100% of requests.
> 15:37:02, 2019-02-15 message:	API server is erroring for 100% of requests.

This Prometheus alert relates to the Kubernetes API (i.e. the OpenShift API); you can find more details here: https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/runbook.md


> To me it is strange that we got the alert only for prometheus-0 and not for both of them (both of them are at 23% of usage). I suppose they have the same data for HA reasons: am I right?

They should have the same data. Reasons for them not to be aligned could be, for example, that one of them stopped scraping targets for some time due to a lower-level issue, or that it could not discover targets due to connection issues with the API server (see above). This might also relate to https://bugzilla.redhat.com/show_bug.cgi?id=1674270


> Where do you see the "transport endpoint is not connected" error?

It is part of your initial comment (https://bugzilla.redhat.com/show_bug.cgi?id=1674378#c0).


@Vladislav, given that this bug report contains a lot of status updates, would you mind summarizing the current state of the cluster / monitoring stack?

Comment 23 minden 2019-03-15 13:56:59 UTC
> What is the TTL of the alert?

You can find the alert definition here [1]. As soon as the error rate is no longer above 10% over the last 10 minutes, it stops firing.

> Also the API shows only one alert - so practically you don't know which master API node has problems.

Yes, the alert does not differentiate between the instances; that would need to be done manually via the Prometheus Query UI. You can take a look at the query in [1] and remove `instance` and `pod` from the `without` clause to look at individual instances; see the sketch below.
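
For illustration, a simplified per-instance variant of the query can also be run against the Prometheus HTTP API. This is a sketch: it uses `by (instance)` instead of adjusting the `without` clause, assumes a port-forward to the Prometheus container, and the metric name follows the 3.11-era rule, so it may differ in other versions:

```
# Forward the Prometheus web port locally, then query the API-server error
# rate broken down by instance. Metric and label names may differ by version.
oc -n openshift-monitoring port-forward prometheus-k8s-0 9090:9090 &
sleep 2
curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode \
  'query=sum(rate(apiserver_request_count{code=~"5.."}[5m])) by (instance) / sum(rate(apiserver_request_count[5m])) by (instance) * 100'
```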

> just one more question: we had 2 alerts instead of 3 that started at the same time; does that mean that one controller/API was still working?

No, I am surprised that you got two. The alert is not per instance but covers all instances. For next steps, see the comment above.

Let me know if this helps.


[1] https://github.com/coreos/prometheus-operator/blob/master/contrib/kube-prometheus/manifests/prometheus-rules.yaml#L767

Comment 24 Vladislav Walek 2019-03-20 20:40:23 UTC
Hello Max,
 thanks for the links.

I have one last request: we see that two of the Alertmanager pods are running on the same node. This could be a problem when that node crashes, because 2 of the 3 pods would then be unavailable.
We could spread them with anti-affinity rules; however, it is not possible to change that in 3.11, as the operator reverts the changes.

So the question is: are there any anti-affinity rules already in place? If not, I will request them in an RFE.
Thank you

Comment 25 Frederic Branczyk 2019-03-21 14:40:11 UTC
Yes, Alertmanager already has anti-affinity rules configured:

```
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: alertmanager
              operator: In
              values:
              - main
          namespaces:
          - openshift-monitoring
          topologyKey: kubernetes.io/hostname
        weight: 100
```

What I can imagine is that for some reason there was no other eligible node available at the original scheduling time, so the scheduler did not enforce a hard separation (the rule above is only "preferred", not "required"). The idea of the monitoring stack is that we try to continue to function in any way possible rather than come to a hard stop. If you just kick the pods while there are multiple eligible nodes, then you should see the expected behavior.
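
A sketch of kicking the pods, assuming the default pod names; the stateful set recreates them and the scheduler re-evaluates the anti-affinity preference:

```
# Check where the Alertmanager pods currently run, then delete them so the
# scheduler re-places them. Pod names are the usual defaults and may differ.
oc -n openshift-monitoring get pods -l alertmanager=main -o wide
oc -n openshift-monitoring delete pod alertmanager-main-0 alertmanager-main-1 alertmanager-main-2
```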

Comment 26 lserven 2019-03-27 16:51:46 UTC
Vladislav,
Is this now resolved for you?

-Lucas

Comment 28 Vladislav Walek 2019-03-28 16:41:30 UTC
Hi,

Just adding the current issue we still see:

>> If there was an issue with the file system, Prometheus might need a restart before it can write to the same filesystem again, even after that problem has been resolved.

Prometheus should be able to recover automatically when the issue is resolved.

Thank you

Comment 29 Krasi 2019-03-29 13:41:55 UTC
Yes, I agree completely, and it will probably be fixed eventually, but it is not high priority, so I wouldn't expect it to be fixed anytime soon.

I assume this is caused by some sort of odd kernel syscall limitation, but I haven't looked into it yet.

There was some interest from a contributor in providing a fix, but after a few reminders it is not going anywhere, so chances are that I or one of the other maintainers will have to provide a fix at some point:
https://github.com/prometheus/tsdb/pull/247

Comment 31 Krasi 2019-06-03 20:40:18 UTC
Actually, I fixed this upstream, and the fix is already included in the Prometheus 2.10 release.

Here is the relevant PR: https://github.com/prometheus/tsdb/pull/582

Comment 32 Frederic Branczyk 2019-06-04 09:54:26 UTC
We've already bumped Prometheus to 2.10 for OpenShift 4.2, so we'll be releasing the fix with that. Unfortunately backporting is not really possible as the codebase has changed so significantly since the version shipped in 3.11.

Comment 34 Junqi Zhao 2019-06-25 05:39:32 UTC
The Prometheus version is 2.10 now, and we no longer see the "transport endpoint is not connected" error.

payload: 4.2.0-0.nightly-2019-06-24-160709

Comment 36 errata-xmlrpc 2019-10-16 06:27:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

