Bug 1546033

Summary: Prometheus ansible playbook install results in oauthproxy errors and 3 out of 5 kubernetes-service-endpoints DOWN
Product: OpenShift Container Platform
Reporter: Diane Feddema <dfeddema>
Component: Installer
Assignee: Paul Gier <pgier>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.7.1
CC: aos-bugs, jokerman, mmccomas
Target Milestone: ---
Target Release: 3.10.0
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text: undefined
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-07-30 19:09:51 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  inventory file for ansible playbook install (flags: none)
  prometheus pod logs (flags: none)

Description Diane Feddema 2018-02-16 07:07:26 UTC
Created attachment 1396843 [details]
inventory file for ansible playbook install

Description of problem:
Running the Ansible playbook for Prometheus with the settings specified in "OpenShift Container Platform 3.7 Installation and Configuration" (https://access.redhat.com/documentation/en-us/openshift_container_platform/3.7/pdf/installation_and_configuration/OpenShift_Container_Platform-3.7-Installation_and_Configuration-en-US.pdf)

results in a Prometheus installation with
3 out of 5 kubernetes-service-endpoints DOWN
(the alerts-proxy, alert-buffer, and alertmanager endpoints are DOWN).

These endpoints are working (see attached image for more details):
kubernetes-apiserver (1/1 up)
kubernetes-cadvisor (2/2 up)
kubernetes-controllers (1/1 up)
kubernetes-nodes (2/2 up)
kubernetes-service-endpoints (2/5 up)
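The 2/5 status above can also be checked from the command line by filtering the Prometheus targets API for unhealthy targets. A minimal sketch, assuming `jq` is installed; the route hostname in the comment is a placeholder, not a value from this cluster:

```shell
# Against a live cluster (hostname is a placeholder):
#   TOKEN=$(oc whoami -t)
#   curl -sk -H "Authorization: Bearer $TOKEN" \
#     https://prometheus-openshift-metrics.example.com/api/v1/targets \
#     | jq -r '.data.activeTargets[] | select(.health == "down") | .scrapeUrl'
#
# The same jq filter demonstrated on a small sample payload:
jq -r '.data.activeTargets[] | select(.health == "down") | .scrapeUrl' <<'EOF'
{"data":{"activeTargets":[
  {"health":"up","scrapeUrl":"https://10.129.0.7:9100/metrics"},
  {"health":"down","scrapeUrl":"https://10.129.0.8:9443/metrics"}
]}}
EOF
# prints https://10.129.0.8:9443/metrics
```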


Version-Release number of the following components:
rpm -q openshift-ansible
openshift-ansible-3.7.23-1.git.0.bc406aa.el7.noarch

rpm -q ansible
ansible-2.4.2.0-1.el7.noarch

ansible --version
ansible 2.4.2.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, May  3 2017, 07:55:04) [GCC 4.8.5 20150623 (Red Hat 4.8.5-14)]

How reproducible:
100%
Steps to Reproduce:
1. ansible-playbook -i /root/scripts/inventory.et9 /usr/share/ansible/openshift-ansible/playbooks/byo/config.yml

note: see attached inventory file, /root/scripts/inventory.et9

2. Log in to the OpenShift web UI as a user with the cluster-admin role
(e.g. oc policy add-role-to-user cluster-admin <user-name>)
3. Look at the logs for the prometheus pod

Actual results:

Prometheus pod logs show:
2018/02/16 05:50:00 provider.go:476: Performing OAuth discovery against https://172.30.0.1/.well-known/oauth-authorization-server
2018/02/16 05:50:00 provider.go:522: 200 GET https://172.30.0.1/.well-known/oauth-authorization-server {
  "issuer": "https://et9.et.eng.bos.redhat.com:8443",
  "authorization_endpoint": "https://et9.com:8443/oauth/authorize",
  "token_endpoint": "https://et9.et.eng.bos.redhat.com:8443/oauth/token",
  "scopes_supported": [
    "user:check-access",
    "user:full",
    "user:info",
    "user:list-projects",
    "user:list-scoped-projects"
  ],
  "response_types_supported": [
    "code",
    "token"
  ],
  "grant_types_supported": [
    "authorization_code",
    "implicit"
  ],
  "code_challenge_methods_supported": [
    "plain",
    "S256"
  ]
}
2018/02/16 05:50:00 provider.go:265: Delegation of authentication and authorization to OpenShift is enabled for bearer tokens and client certificates.
2018/02/16 05:50:00 oauthproxy.go:161: mapping path "/" => upstream "http://localhost:9090"
2018/02/16 05:50:00 oauthproxy.go:184: compiled skip-auth-regex => "^/metrics"
2018/02/16 05:50:00 oauthproxy.go:190: OAuthProxy configured for  Client ID: system:serviceaccount:openshift-metrics:prometheus
2018/02/16 05:50:00 oauthproxy.go:200: Cookie settings: name:_oauth_proxy secure(https):true httponly:true expiry:168h0m0s domain:<default> refresh:disabled
2018/02/16 05:50:00 http.go:96: HTTPS: listening on [::]:8443
2018/02/16 05:53:36 oauthproxy.go:657: 10.129.0.1:34596 Cookie Signature not valid
2018/02/16 05:53:36 oauthproxy.go:657: 10.129.0.1:34596 Cookie Signature not valid
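The repeated auth failures can be counted straight from the proxy logs. A minimal sketch; the pod and container names in the comment are assumptions based on the 3.7 Prometheus template, not values confirmed by this report:

```shell
# Against a live cluster (pod/container names are assumptions):
#   oc logs prometheus-0 -c prom-proxy -n openshift-metrics \
#     | grep -c 'Cookie Signature not valid'
#
# The same filter demonstrated on the two lines quoted in this report:
grep -c 'Cookie Signature not valid' <<'EOF'
2018/02/16 05:53:36 oauthproxy.go:657: 10.129.0.1:34596 Cookie Signature not valid
2018/02/16 05:53:36 oauthproxy.go:657: 10.129.0.1:34596 Cookie Signature not valid
EOF
# prints 2
```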
Expected results:
All 5 kubernetes-service-endpoints targets are UP.
Additional info:

Comment 1 Paul Gier 2018-03-02 03:48:24 UTC
The issue with two of the service endpoints being down seems to be that prometheus is automatically discovering the containerPorts defined in the stateful set, and it probably shouldn't be trying to scrape those since they are also discovered via the exposed services.

The alertbuffer scrape is failing because the /metrics path requires authentication, and it should probably be set up to skip auth similar to what the prom-proxy and alertmanager-proxy are doing.
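The skip-auth change described above can be sketched as an extra argument on the alert-buffer's oauth-proxy sidecar. This is an illustrative fragment only, not the actual diff (see the linked PRs for that); the container name, image tag, and port values here are assumptions, while `-skip-auth-regex` is a standard openshift/oauth-proxy flag:

```yaml
# Hypothetical sidecar fragment: let Prometheus scrape /metrics
# without a bearer token, as prom-proxy and alertmanager-proxy do.
- name: alerts-proxy
  image: openshift/oauth-proxy:v1.0.0
  args:
  - -provider=openshift
  - -https-address=:9443
  - -upstream=http://localhost:9099
  - -skip-auth-regex=^/metrics   # allow unauthenticated scrapes of /metrics
```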

PR for openshift-ansible: https://github.com/openshift/openshift-ansible/pull/7356
PR for origin: https://github.com/openshift/origin/pull/18802

Would you mind opening a separate issue for the oauth proxy errors?  That may need to be assigned to the security team.

Comment 2 Paul Gier 2018-03-21 20:05:40 UTC
The origin and openshift-ansible PRs have been merged to master, so the fix will be in 3.10.

Comment 4 Junqi Zhao 2018-03-30 09:16:19 UTC
Created attachment 1415043 [details]
prometheus pod logs

Comment 6 errata-xmlrpc 2018-07-30 19:09:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816