Created attachment 1396843 [details] inventory file for ansible playbook install Description of problem: Ansible playbook for prometheus with settings as specified in "Openshift Container Platform 3.7 Installation and Configuration" (URL https://access.redhat.com/documentation/en-us/openshift_container_platform/3.7/pdf/installation_and_configuration/OpenShift_Container_Platform-3.7-Installation_and_Configuration-en-US.pdf ) results in prometheus installation with 3 out of 5 kubernetes-service-endpoints DOWN (alerts-proxy,alert-buffer & alertmanger endpoints are DOWN) These endpoint are working (see attached image for more details) kubernetes-apserver(1/1 up) kubernetes-cadvisor(2/2 up) kubernetes-controllers(1/1 up) kubernetes-nodes(2/2 up) kubernetes-service-endpoints (2/5 up) Version-Release number of the following components: rpm -q openshift-ansible openshift-ansible-3.7.23-1.git.0.bc406aa.el7.noarch rpm -q ansible ansible-2.4.2.0-1.el7.noarch ansible --version ansible 2.4.2.0 config file = /etc/ansible/ansible.cfg configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules'] ansible python module location = /usr/lib/python2.7/site-packages/ansible executable location = /usr/bin/ansible python version = 2.7.5 (default, May 3 2017, 07:55:04) [GCC 4.8.5 20150623 (Red Hat 4.8.5-14)] How reproducible: 100% Steps to Reproduce: 1.ansible-playbook -i /root/scripts/inventory.et9 /usr/share/ansible/openshift-ansible/playbooks/byo/config.yml note: see attached inventory file, /root/scripts/inventory.et9 2. login to openshift webui as user with cluster-admin role (e.g. oc policy add-role-to-user cluster-admin <user-name>) 3. Look at logs for pod prometheus Actual results: Please include the entire output from the last TASK line through the end of output if an error is generated prometheus pod Logs show: 2018/02/16 05:50:00 provider.go:476: Performing OAuth discovery against https://172.30.0.1/.well-known/oauth-authorization-server 2018/02/16 05:50:00 provider.go:522: 200 GET https://172.30.0.1/.well-known/oauth-authorization-server { "issuer": "https://et9.et.eng.bos.redhat.com:8443 ", "authorization_endpoint": "https://et9.com:8443/oauth/authorize ", "token_endpoint": "https://et9.et.eng.bos.redhat.com:8443/oauth/token ", "scopes_supported": [ "user:check-access", "user:full", "user:info", "user:list-projects", "user:list-scoped-projects" ], "response_types_supported": [ "code", "token" ], "grant_types_supported": [ "authorization_code", "implicit" ], "code_challenge_methods_supported": [ "plain", "S256" ] } 2018/02/16 05:50:00 provider.go:265: Delegation of authentication and authorization to OpenShift is enabled for bearer tokens and client certificates. 2018/02/16 05:50:00 oauthproxy.go:161: mapping path "/" => upstream "http://localhost:9090 " 2018/02/16 05:50:00 oauthproxy.go:184: compiled skip-auth-regex => "^/metrics" 2018/02/16 05:50:00 oauthproxy.go:190: OAuthProxy configured for Client ID: system:serviceaccount:openshift-metrics:prometheus 2018/02/16 05:50:00 oauthproxy.go:200: Cookie settings: name:_oauth_proxy secure(https):true httponly:true expiry:168h0m0s domain:<default> refresh:disabled 2018/02/16 05:50:00 http.go:96: HTTPS: listening on [::]:8443 2018/02/16 05:53:36 oauthproxy.go:657: 10.129.0.1:34596 Cookie Signature not valid 2018/02/16 05:53:36 oauthproxy.go:657: 10.129.0.1:34596 Cookie Signature not valid Expected results: Additional info: Please attach logs from ansible-playbook with the -vvv flag
The issue with two of the service endpoints being down seems to be that prometheus is automatically discovering the containerPorts defined in the stateful set, and it probably shouldn't be trying to scrape those since they are also discovered via the exposed services. The alertbuffer scrape is failing because the /metrics path requires authentication, and it should probably be set up to skip auth similar to what the prom-proxy and alertmanager-proxy are doing. PR for openshift-ansible: https://github.com/openshift/openshift-ansible/pull/7356 PR for origin: https://github.com/openshift/origin/pull/18802 Would you mind opening a separate issue for the oauth proxy errors? That may need to be assigned to the security team.
origin and openshift-ansible PRs have been merged to master, so the fix will be in 3.10
Created attachment 1415043 [details] prometheus pod logs
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1816