Bug 1743485 - [OSP15] TLS Everywhere jobs timeout
Summary: [OSP15] TLS Everywhere jobs timeout
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 15.0 (Stein)
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: rc
: 15.0 (Stein)
Assignee: RHOS Maint
QA Contact: Pavan
URL:
Whiteboard:
: 1743523 1747957 (view as bug list)
Depends On:
Blocks: 1624486 1624488 1665522 1713752
TreeView+ depends on / blocked
 
Reported: 2019-08-20 06:05 UTC by Michele Baldessari
Modified: 2019-09-27 10:44 UTC (History)
23 users (show)

Fixed In Version: openstack-tripleo-heat-templates-10.6.1-0.20190823114746.30390bc.el8ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1744758 (view as bug list)
Environment:
Last Closed: 2019-09-21 11:24:25 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
permissive audit.log (5.29 MB, text/plain)
2019-08-20 12:25 UTC, Michele Baldessari
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2019:2811 0 None None None 2019-09-21 11:25:00 UTC

Description Michele Baldessari 2019-08-20 06:05:43 UTC
Description of problem:
I have observed this last night while testing TLS everywhere (RHOS_TRUNK-15.0-RHEL-8-20190815.n.1 puddle)

It seems to me that at a certain point a getcert command (on the haproxy service cert) just sits there hanging. I see many getcerts being successful:
Aug 19 21:13:52 controller-0.redhat.local puppet-user[25453]: Debug: Executing: '/usr/bin/getcert list'
Aug 19 21:13:52 controller-0.redhat.local puppet-user[25453]: Debug: Issuing getcert command with args: ["request", "-I", "mysql", "-f", "/etc/pki/tls/certs/mysql.crt", "-c", "IPA", "-N", "CN=controller-0.internalapi.redhat.local", "-K", "mysql/controller-0.internalapi.redhat.local", "-D", "overcloud.internalapi.redhat.local", "-D", "controller-0.internalapi.redhat.local", "-w", "-k", "/etc/pki/tls/private/mysql.key"]
Aug 19 21:13:52 controller-0.redhat.local puppet-user[25453]: Debug: Executing: '/usr/bin/getcert request -I mysql -f /etc/pki/tls/certs/mysql.crt -c IPA -N CN=controller-0.internalapi.redhat.local -K mysql/controller-0.internalapi.redhat.local -D overcloud.internalapi.redhat.local -D controller-0.internalapi.redhat.local -w -k /etc/pki/tls/private/mysql.key'
Aug 19 21:13:52 controller-0.redhat.local puppet-user[25453]: Debug: Executing: '/usr/bin/getcert list -i mysql'
Aug 19 21:13:52 controller-0.redhat.local puppet-user[25453]: Debug: Issuing getcert command with args: ["request", "-I", "rabbitmq", "-f", "/etc/pki/tls/certs/rabbitmq.crt", "-c", "IPA", "-N", "CN=controller-0.internalapi.redhat.local", "-K", "rabbitmq/controller-0.internalapi.redhat.local", "-D", "controller-0.internalapi.redhat.local", "-C", "/usr/bin/certmonger-rabbitmq-refresh.sh", "-w", "-k", "/etc/pki/tls/private/rabbitmq.key"]
Aug 19 21:13:52 controller-0.redhat.local puppet-user[25453]: Debug: Executing: '/usr/bin/getcert request -I rabbitmq -f /etc/pki/tls/certs/rabbitmq.crt -c IPA -N CN=controller-0.internalapi.redhat.local -K rabbitmq/controller-0.internalapi.redhat.local -D controller-0.internalapi.redhat.local -C /usr/bin/certmonger-rabbitmq-refresh.sh -w -k /etc/pki/tls/private/rabbitmq.key'
Aug 19 21:13:53 controller-0.redhat.local puppet-user[25453]: Debug: Executing: '/usr/bin/getcert list -i rabbitmq'
Aug 19 21:13:53 controller-0.redhat.local puppet-user[25453]: Debug: Issuing getcert command with args: ["request", "-I", "redis", "-f", "/etc/pki/tls/certs/redis.crt", "-c", "IPA", "-N", "CN=controller-0.internalapi.redhat.local", "-K", "redis/controller-0.internalapi.redhat.local", "-D", "overcloud.internalapi.redhat.local", "-D", "controller-0.internalapi.redhat.local", "-C", "/usr/bin/certmonger-redis-refresh.sh", "-w", "-k", "/etc/pki/tls/private/redis.key"]
Aug 19 21:13:53 controller-0.redhat.local puppet-user[25453]: Debug: Executing: '/usr/bin/getcert request -I redis -f /etc/pki/tls/certs/redis.crt -c IPA -N CN=controller-0.internalapi.redhat.local -K redis/controller-0.internalapi.redhat.local -D overcloud.internalapi.redhat.local -D controller-0.internalapi.redhat.local -C /usr/bin/certmonger-redis-refresh.sh -w -k /etc/pki/tls/private/redis.key'
Aug 19 21:13:54 controller-0.redhat.local puppet-user[25453]: Debug: Executing: '/usr/bin/getcert list -i redis'
Aug 19 21:13:54 controller-0.redhat.local puppet-user[25453]: Debug: Issuing getcert command with args: ["request", "-I", "novnc-proxy", "-f", "/etc/pki/tls/certs/novnc_proxy.crt", "-c", "IPA", "-N", "CN=controller-0.internalapi.redhat.local", "-K", "novnc-proxy/controller-0.internalapi.redhat.local", "-D", "controller-0.internalapi.redhat.local", "-w", "-k", "/etc/pki/tls/private/novnc_proxy.key"]
Aug 19 21:13:54 controller-0.redhat.local puppet-user[25453]: Debug: Executing: '/usr/bin/getcert request -I novnc-proxy -f /etc/pki/tls/certs/novnc_proxy.crt -c IPA -N CN=controller-0.internalapi.redhat.local -K novnc-proxy/controller-0.internalapi.redhat.local -D controller-0.internalapi.redhat.local -w -k /etc/pki/tls/private/novnc_proxy.key'
Aug 19 21:13:54 controller-0.redhat.local puppet-user[25453]: Debug: Executing: '/usr/bin/getcert list -i novnc-proxy'
Aug 19 21:13:54 controller-0.redhat.local puppet-user[25453]: Debug: Issuing getcert command with args: ["request", "-I", "httpd-ctlplane", "-f", "/etc/pki/tls/certs/httpd/httpd-ctlplane.crt", "-c", "IPA", "-N", "CN=controller-0.ctlplane.redhat.local", "-K", "HTTP/controller-0.ctlplane.redhat.local", "-D", "controller-0.ctlplane.redhat.local", "-C", "pkill -USR1 httpd", "-w", "-k", "/etc/pki/tls/private/httpd/httpd-ctlplane.key"]
Aug 19 21:13:54 controller-0.redhat.local puppet-user[25453]: Debug: Executing: '/usr/bin/getcert request -I httpd-ctlplane -f /etc/pki/tls/certs/httpd/httpd-ctlplane.crt -c IPA -N CN=controller-0.ctlplane.redhat.local -K HTTP/controller-0.ctlplane.redhat.local -D controller-0.ctlplane.redhat.local -C pkill -USR1 httpd -w -k /etc/pki/tls/private/httpd/httpd-ctlplane.key'
Aug 19 21:13:55 controller-0.redhat.local puppet-user[25453]: Debug: Executing: '/usr/bin/getcert list -i httpd-ctlplane'
Aug 19 21:13:55 controller-0.redhat.local puppet-user[25453]: Debug: Issuing getcert command with args: ["request", "-I", "httpd-external", "-f", "/etc/pki/tls/certs/httpd/httpd-external.crt", "-c", "IPA", "-N", "CN=controller-0.external.redhat.local", "-K", "HTTP/controller-0.external.redhat.local", "-D", "controller-0.external.redhat.local", "-C", "pkill -USR1 httpd", "-w", "-k", "/etc/pki/tls/private/httpd/httpd-external.key"]
Aug 19 21:13:55 controller-0.redhat.local puppet-user[25453]: Debug: Executing: '/usr/bin/getcert request -I httpd-external -f /etc/pki/tls/certs/httpd/httpd-external.crt -c IPA -N CN=controller-0.external.redhat.local -K HTTP/controller-0.external.redhat.local -D controller-0.external.redhat.local -C pkill -USR1 httpd -w -k /etc/pki/tls/private/httpd/httpd-external.key'
Aug 19 21:13:56 controller-0.redhat.local puppet-user[25453]: Debug: Executing: '/usr/bin/getcert list -i httpd-external'
Aug 19 21:13:56 controller-0.redhat.local puppet-user[25453]: Debug: Issuing getcert command with args: ["request", "-I", "httpd-internal_api", "-f", "/etc/pki/tls/certs/httpd/httpd-internal_api.crt", "-c", "IPA", "-N", "CN=controller-0.internalapi.redhat.local", "-K", "HTTP/controller-0.internalapi.redhat.local", "-D", "controller-0.internalapi.redhat.local", "-C", "pkill -USR1 httpd", "-w", "-k", "/etc/pki/tls/private/httpd/httpd-internal_api.key"]
Aug 19 21:13:56 controller-0.redhat.local puppet-user[25453]: Debug: Executing: '/usr/bin/getcert request -I httpd-internal_api -f /etc/pki/tls/certs/httpd/httpd-internal_api.crt -c IPA -N CN=controller-0.internalapi.redhat.local -K HTTP/controller-0.internalapi.redhat.local -D controller-0.internalapi.redhat.local -C pkill -USR1 httpd -w -k /etc/pki/tls/private/httpd/httpd-internal_api.key'
Aug 19 21:13:56 controller-0.redhat.local puppet-user[25453]: Debug: Executing: '/usr/bin/getcert list -i httpd-internal_api'
Aug 19 21:13:56 controller-0.redhat.local puppet-user[25453]: Debug: Issuing getcert command with args: ["request", "-I", "httpd-storage", "-f", "/etc/pki/tls/certs/httpd/httpd-storage.crt", "-c", "IPA", "-N", "CN=controller-0.storage.redhat.local", "-K", "HTTP/controller-0.storage.redhat.local", "-D", "controller-0.storage.redhat.local", "-C", "pkill -USR1 httpd", "-w", "-k", "/etc/pki/tls/private/httpd/httpd-storage.key"]
Aug 19 21:13:56 controller-0.redhat.local puppet-user[25453]: Debug: Executing: '/usr/bin/getcert request -I httpd-storage -f /etc/pki/tls/certs/httpd/httpd-storage.crt -c IPA -N CN=controller-0.storage.redhat.local -K HTTP/controller-0.storage.redhat.local -D controller-0.storage.redhat.local -C pkill -USR1 httpd -w -k /etc/pki/tls/private/httpd/httpd-storage.key'
Aug 19 21:13:57 controller-0.redhat.local puppet-user[25453]: Debug: Executing: '/usr/bin/getcert list -i httpd-storage'
Aug 19 21:13:57 controller-0.redhat.local puppet-user[25453]: Debug: Issuing getcert command with args: ["request", "-I", "httpd-storage_mgmt", "-f", "/etc/pki/tls/certs/httpd/httpd-storage_mgmt.crt", "-c", "IPA", "-N", "CN=controller-0.storagemgmt.redhat.local", "-K", "HTTP/controller-0.storagemgmt.redhat.local", "-D", "controller-0.storagemgmt.redhat.local", "-C", "pkill -USR1 httpd", "-w", "-k", "/etc/pki/tls/private/httpd/httpd-storage_mgmt.key"]
Aug 19 21:13:57 controller-0.redhat.local puppet-user[25453]: Debug: Executing: '/usr/bin/getcert request -I httpd-storage_mgmt -f /etc/pki/tls/certs/httpd/httpd-storage_mgmt.crt -c IPA -N CN=controller-0.storagemgmt.redhat.local -K HTTP/controller-0.storagemgmt.redhat.local -D controller-0.storagemgmt.redhat.local -C pkill -USR1 httpd -w -k /etc/pki/tls/private/httpd/httpd-storage_mgmt.key'
Aug 19 21:13:58 controller-0.redhat.local puppet-user[25453]: Debug: Executing: '/usr/bin/getcert list -i httpd-storage_mgmt'
Aug 19 21:13:58 controller-0.redhat.local puppet-user[25453]: Debug: Issuing getcert command with args: ["request", "-I", "libvirt-vnc-client-cert", "-f", "/etc/pki/libvirt-vnc/client-cert.pem", "-c", "IPA", "-N", "CN=controller-0.internalapi.redhat.local", "-K", "libvirt-vnc/controller-0.internalapi.redhat.local", "-D", "controller-0.internalapi.redhat.local", "-C", "systemctl reload libvirtd", "-w", "-k", "/etc/pki/libvirt-vnc/client-key.pem", "-F", "/etc/ipa/ca.crt"]
Aug 19 21:13:58 controller-0.redhat.local puppet-user[25453]: Debug: Executing: '/usr/bin/getcert request -I libvirt-vnc-client-cert -f /etc/pki/libvirt-vnc/client-cert.pem -c IPA -N CN=controller-0.internalapi.redhat.local -K libvirt-vnc/controller-0.internalapi.redhat.local -D controller-0.internalapi.redhat.local -C systemctl reload libvirtd -w -k /etc/pki/libvirt-vnc/client-key.pem -F /etc/ipa/ca.crt'
Aug 19 21:13:58 controller-0.redhat.local puppet-user[25453]: Warning: Could not get certificate: Execution of '/usr/bin/getcert request -I libvirt-vnc-client-cert -f /etc/pki/libvirt-vnc/client-cert.pem -c IPA -N CN=controller-0.internalapi.redhat.local -K libvirt-vnc/controller-0.internalapi.redhat.local -D controller-0.internalapi.redhat.local -C systemctl reload libvirtd -w -k /etc/pki/libvirt-vnc/client-key.pem -F /etc/ipa/ca.crt' returned 4: New signing request "libvirt-vnc-client-cert" added.
Aug 19 21:13:58 controller-0.redhat.local puppet-user[25453]: Debug: Executing: '/usr/bin/getcert list -i libvirt-vnc-client-cert'
Aug 19 21:13:59 controller-0.redhat.local puppet-user[25453]: Debug: Issuing getcert command with args: ["request", "-I", "haproxy-ctlplane-cert", "-f", "/etc/pki/tls/certs/haproxy/overcloud-haproxy-ctlplane.crt", "-c", "IPA", "-N", "CN=controller-0.ctlplane.redhat.local", "-K", "haproxy/controller-0.ctlplane.redhat.local", "-D", "overcloud.ctlplane.redhat.local", "-D", "controller-0.ctlplane.redhat.local", "-U", "id-kp-clientAuth", "-U", "id-kp-serverAuth", "-C", "/usr/bin/certmonger-haproxy-refresh.sh reload ctlplane", "-w", "-k", "/etc/pki/tls/private/haproxy/overcloud-haproxy-ctlplane.key"]
Aug 19 21:13:59 controller-0.redhat.local puppet-user[25453]: Debug: Executing: '/usr/bin/getcert request -I haproxy-ctlplane-cert -f /etc/pki/tls/certs/haproxy/overcloud-haproxy-ctlplane.crt -c IPA -N CN=controller-0.ctlplane.redhat.local -K haproxy/controller-0.ctlplane.redhat.local -D overcloud.ctlplane.redhat.local -D controller-0.ctlplane.redhat.local -U id-kp-clientAuth -U id-kp-serverAuth -C /usr/bin/certmonger-haproxy-refresh.sh reload ctlplane -w -k /etc/pki/tls/private/haproxy/overcloud-haproxy-ctlplane.key'

And the haproxy one is the last one and then puppet just hangs forever and the jobs timeout:
[root@controller-0 ~]# journalctl -t puppet-user |tail -n5
Aug 19 21:13:59 controller-0.redhat.local puppet-user[25453]: Debug: /Stage[main]/Tripleo::Profile::Base::Certmonger_user/Tripleo::Certmonger::Libvirt_vnc[libvirt-vnc-client-cert]/File[/etc/pki/libvirt-vnc/client-key.pem]: The container Tripleo::Certmonger::Libvirt_vnc[libvirt-vnc-client-cert] will propagate my refresh event
Aug 19 21:13:59 controller-0.redhat.local puppet-user[25453]: Debug: Tripleo::Certmonger::Libvirt_vnc[libvirt-vnc-client-cert]: The container Class[Tripleo::Profile::Base::Certmonger_user] will propagate my refresh event
Aug 19 21:13:59 controller-0.redhat.local puppet-user[25453]: Notice: /Stage[main]/Tripleo::Profile::Base::Certmonger_user/Tripleo::Certmonger::Haproxy[haproxy-ctlplane]/Certmonger_certificate[haproxy-ctlplane-cert]/ensure: created
Aug 19 21:13:59 controller-0.redhat.local puppet-user[25453]: Debug: Issuing getcert command with args: ["request", "-I", "haproxy-ctlplane-cert", "-f", "/etc/pki/tls/certs/haproxy/overcloud-haproxy-ctlplane.crt", "-c", "IPA", "-N", "CN=controller-0.ctlplane.redhat.local", "-K", "haproxy/controller-0.ctlplane.redhat.local", "-D", "overcloud.ctlplane.redhat.local", "-D", "controller-0.ctlplane.redhat.local", "-U", "id-kp-clientAuth", "-U", "id-kp-serverAuth", "-C", "/usr/bin/certmonger-haproxy-refresh.sh reload ctlplane", "-w", "-k", "/etc/pki/tls/private/haproxy/overcloud-haproxy-ctlplane.key"]

- This is the last puppet line which hangs and nothing happens afterwards:
Aug 19 21:13:59 controller-0.redhat.local puppet-user[25453]: Debug: Executing: '/usr/bin/getcert request -I haproxy-ctlplane-cert -f /etc/pki/tls/certs/haproxy/overcloud-haproxy-ctlplane.crt -c IPA -N CN=controller-0.ctlplane.redhat.local -K haproxy/controller-0.ctlplane.redhat.local -D overcloud.ctlplane.redhat.local -D controller-0.ctlplane.redhat.local -U id-kp-clientAuth -U id-kp-serverAuth -C /usr/bin/certmonger-haproxy-refresh.sh reload ctlplane -w -k /etc/pki/tls/private/haproxy/overcloud-haproxy-ctlplane.key'

After that the deployment times out with:
TASK [Run puppet host configuration for step 1] ********************************
Monday 19 August 2019  21:12:46 +0000 (0:00:00.313)       0:07:45.479 *********
Overcloud configuration failed.

Ansible timed out at 5180 seconds.

Comment 2 Michele Baldessari 2019-08-20 07:49:52 UTC
*** Bug 1743523 has been marked as a duplicate of this bug. ***

Comment 6 Michele Baldessari 2019-08-20 12:25:13 UTC
Created attachment 1606069 [details]
permissive audit.log

Comment 8 Julie Pichon 2019-08-20 14:36:32 UTC
I am seeing a lot of pkill commands on many different targets. That reminds me of bug 1703242 and the design issues discussed upstream in https://specs.openstack.org/openstack/tripleo-specs/specs/train/certificate-management.html#problem-description, though that other bug was around renewing the certificates as opposed to a new deployment.

The THT file you link to (https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/apache/apache-baremetal-puppet.j2.yaml#L118) seems different than the file mentioned on that bug (certmonger-haproxy-refresh.sh) although you do have it in comment 3. How did you narrow down to these two denials in A) for pkill?

Regarding B), the two denials related to ca.crt mention system_u:object_r:etc_t:s0 as a target so we can probably ignore the unconfined file in /var/lib. I would have expected the file to be of type cert_t rather the more generic etc_t, but I suppose we can allow certmonger to write to etc_t for now.

There are a few other denials related to certmonger trying to access hiera files like /etc/puppet/hiera.yaml and /etc/puppet/hieradata/controller_extraconfig.json. Do you reckon these are unlikely to be causing this particular issue at this point?

Comment 9 Julie Pichon 2019-08-20 15:24:20 UTC
For better visibility on how many contexts are touched by certmonger in this audit file:

#============= certmonger_t ==============
allow certmonger_t NetworkManager_t:dir { getattr search };
allow certmonger_t NetworkManager_t:file { open read };
allow certmonger_t auditd_t:dir { getattr search };
allow certmonger_t auditd_t:file { open read };
allow certmonger_t chronyd_t:dir { getattr search };
allow certmonger_t chronyd_t:file { open read };
allow certmonger_t cluster_t:dir { getattr search };
allow certmonger_t cluster_t:file { open read };
allow certmonger_t container_runtime_exec_t:file { execute execute_no_trans getattr open read };
allow certmonger_t container_share_t:blk_file { create unlink };
allow certmonger_t container_share_t:dir { add_name getattr ioctl mounton open read remove_name setattr write };
allow certmonger_t container_share_t:file { getattr lock open read write };
allow certmonger_t container_var_lib_t:dir getattr;
allow certmonger_t container_var_lib_t:file { getattr lock open read write };
allow certmonger_t container_var_run_t:dir setattr;
allow certmonger_t container_var_run_t:file { getattr lock open read write };
allow certmonger_t crond_t:dir { getattr search };
allow certmonger_t crond_t:file { open read };
allow certmonger_t etc_t:file write;                   # <---- related to ca.crt, cf. comments above
allow certmonger_t fs_t:filesystem unmount;
allow certmonger_t getty_t:dir { getattr search };
allow certmonger_t getty_t:file { open read };
allow certmonger_t gssproxy_t:dir { getattr search };
allow certmonger_t gssproxy_t:file { open read };
allow certmonger_t iptables_exec_t:file { execute execute_no_trans getattr open read };
allow certmonger_t irqbalance_t:dir { getattr search };
allow certmonger_t irqbalance_t:file { open read };
allow certmonger_t kernel_t:dir { getattr search };
allow certmonger_t kernel_t:file { open read };
allow certmonger_t oddjob_t:dir { getattr search };
allow certmonger_t oddjob_t:file { open read };
allow certmonger_t openvswitch_t:dir { getattr search };
allow certmonger_t openvswitch_t:file { open read };
allow certmonger_t policykit_t:dir { getattr search };
allow certmonger_t policykit_t:file { open read };
allow certmonger_t puppet_etc_t:file { getattr ioctl open read };
allow certmonger_t rhsmcertd_t:dir { getattr search };
allow certmonger_t rhsmcertd_t:file { open read };
allow certmonger_t rngd_t:dir { getattr search };
allow certmonger_t rngd_t:file { open read };
allow certmonger_t rpcbind_t:dir { getattr search };
allow certmonger_t rpcbind_t:file { open read };
allow certmonger_t self:capability { mknod sys_resource };
allow certmonger_t self:netlink_netfilter_socket { bind create getattr };
allow certmonger_t self:process setrlimit;
allow certmonger_t sshd_t:dir { getattr search };
allow certmonger_t sshd_t:file { open read };
allow certmonger_t sssd_t:dir { getattr search };
allow certmonger_t sssd_t:file { open read };
allow certmonger_t sysctl_net_t:dir search;
allow certmonger_t sysctl_net_t:file { open read };
allow certmonger_t syslogd_t:dir { getattr search };
allow certmonger_t syslogd_t:file { open read };
allow certmonger_t system_dbusd_t:dir { getattr search };
allow certmonger_t system_dbusd_t:file { open read };
allow certmonger_t systemd_logind_t:dir { getattr search };
allow certmonger_t systemd_logind_t:file { open read };
allow certmonger_t systemd_machined_t:dir { getattr search };
allow certmonger_t systemd_machined_t:file { open read };
allow certmonger_t tuned_t:dir { getattr search };
allow certmonger_t tuned_t:file { open read };
allow certmonger_t udev_t:dir { getattr search };
allow certmonger_t udev_t:file { open read };
allow certmonger_t unconfined_t:dir { getattr search };
allow certmonger_t unconfined_t:file { open read };
allow certmonger_t virt_qemu_ga_t:dir { getattr search };
allow certmonger_t virt_qemu_ga_t:file { open read };
allow certmonger_t virtd_t:dir { getattr search };
allow certmonger_t virtd_t:file { open read };
allow certmonger_t virtd_unit_file_t:service reload;

On the plus side, the majority appear to be read-only permissions...?

Comment 10 Grzegorz Grasza 2019-08-20 15:26:36 UTC
On B), I can confirm that the issue accessing ca.crt would manifest itself in that way (a timeout on getcert request), a similar timeout occurred in https://bugs.launchpad.net/tripleo/+bug/1830734 . Option "-w" in getcert doesn't timeout.

Comment 11 Michele Baldessari 2019-08-20 15:32:49 UTC
(In reply to Julie Pichon from comment #8)
> I am seeing a lot of pkill commands on many different targets. That reminds
> me of bug 1703242 and the design issues discussed upstream in
> https://specs.openstack.org/openstack/tripleo-specs/specs/train/certificate-
> management.html#problem-description, though that other bug was around
> renewing the certificates as opposed to a new deployment.
> 
> The THT file you link to
> (https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/
> apache/apache-baremetal-puppet.j2.yaml#L118) seems different than the file
> mentioned on that bug (certmonger-haproxy-refresh.sh) although you do have
> it in comment 3. How did you narrow down to these two denials in A) for
> pkill?

Oh so I blindly took the last two pkill denials under the (maybe wrong?) assumption
that we had a generic 'allow certmonger to run pkill' kind of issue.

In terms of code we have pkill (or kill in the case of 3) in the following places:
1) https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/apache/apache-baremetal-puppet.j2.yaml#L118
2) https://github.com/openstack/puppet-tripleo/blob/master/files/certmonger-redis-refresh.sh#L14
3) https://github.com/openstack/puppet-tripleo/blob/master/files/certmonger-haproxy-refresh.sh#L46 (this one uses kill and not pkill)

> Regarding B), the two denials related to ca.crt mention
> system_u:object_r:etc_t:s0 as a target so we can probably ignore the
> unconfined file in /var/lib. I would have expected the file to be of type
> cert_t rather the more generic etc_t, but I suppose we can allow certmonger
> to write to etc_t for now.

Agree on the /var/lib

> There are a few other denials related to certmonger trying to access hiera
> files like /etc/puppet/hiera.yaml and
> /etc/puppet/hieradata/controller_extraconfig.json. Do you reckon these are
> unlikely to be causing this particular issue at this point?

I think I missed those tbh. They do look relevant although am not entirely sure.
Want to throw a patch my way for fixing B) and we see if the deploy works and then
we investigate A) and the hiera denials a bit more in detail?

Comment 12 Nathan Kinder 2019-08-20 16:00:33 UTC
The denied write to etc_t is likely due to the use of the '-F' option to 'getcert request' seen in this snippet of the provided debug output:

Aug 19 21:13:58 controller-0.redhat.local puppet-user[25453]: Warning: Could not get certificate: Execution of '/usr/bin/getcert request -I libvirt-vnc-client-cert -f /etc/pki/libvirt-vnc/client-cert.pem -c IPA -N CN=controller-0.internalapi.redhat.local -K libvirt-vnc/controller-0.internalapi.redhat.local -D controller-0.internalapi.redhat.local -C systemctl reload libvirtd -w -k /etc/pki/libvirt-vnc/client-key.pem -F /etc/ipa/ca.crt' returned 4: New signing request "libvirt-vnc-client-cert" added.

The '-F' option will attempt to download and save the CA certificate into the location provided.  There was a recent change to make libvirt-vnc use the CA cert from /etc/ipa/ca.crt, as it would previously store its own copy.  Perhaps there is a remnant that uses '-F' that needs to be cleaned up.  We definitely don't want to have certmonger attempt to overwrite /etc/ipa/ca.crt, as this is obtained when the node is enrolled with IdM.  If we find and clean this up, the write AVC should go away.  I am not sure if that is related to the timeout.

Comment 13 Nathan Kinder 2019-08-20 16:08:25 UTC
This problem may have started happening when this change merged:

  https://review.opendev.org/#/c/673749/

As mentioned in my previous comment, there is likely existing code for using 'getcert request ... -F', the the change in path to the CA cert is causing troubles due to the use of this option.

Comment 14 Julie Pichon 2019-08-20 16:12:33 UTC
Useful comments, thank you - dumping some additional information after looking a bit more at the non-read-only denials:

container_runtime_exec_t (e.g. execute, execute_no_trans):
- /usr/bin/certmonger-rabbitmq-refresh.sh
- /usr/bin/certmonger-haproxy-refresh.sh reload ctlplane
- podman exec  cp /var/lib/kolla/config_files/src-tls/etc/pki/tls/certs/haproxy/overcloud-haproxy-ctlplane.pem /etc/pki/tls/certs/
- podman exec  cp /var/lib/kolla/config_files/src-tls/etc/pki/tls/certs/haproxy/overcloud-haproxy-ctlplane.pem /etc/pki/tls/certs/

I suspect we will need these? Though not entirely sure why the podman exec is running with certmonger_t as source context.

container_share_t starts to looks more suspicious, even though it starts from the same two scripts:
- /usr/bin/certmonger-rabbitmq-refresh.sh
- /usr/bin/certmonger-haproxy-refresh.sh reload ctlplane

Lots of mentions of the container storage overlays, not sure that we want certmonger to be able to mess with these:
type=AVC msg=audit(20/08/19 11:22:32.425:7865) : avc:  denied  { ioctl } for  pid=26655 comm=podman path=/var/lib/containers/storage/overlay dev="vda2" ino=92275029 ioctlcmd=0x581f scontext=system_u:system_r:certmonger_t:s0 tcontext=unconfined_u:object_r:container_share_t:s0 tclass=dir permissive=1 
type=AVC msg=audit(20/08/19 11:22:32.425:7866) : avc:  denied  { unlink } for  pid=26655 comm=podman name=backingFsBlockDev dev="vda2" ino=92275030 scontext=system_u:system_r:certmonger_t:s0 tcontext=unconfined_u:object_r:container_share_t:s0 tclass=blk_file permissive=1 
type=AVC msg=audit(20/08/19 11:22:32.425:7866) : avc:  denied  { remove_name } for  pid=26655 comm=podman name=backingFsBlockDev dev="vda2" ino=92275030 scontext=system_u:system_r:certmonger_t:s0 tcontext=unconfined_u:object_r:container_share_t:s0 tclass=dir permissive=1 
type=AVC msg=audit(20/08/19 11:22:32.425:7866) : avc:  denied  { write } for  pid=26655 comm=podman name=overlay dev="vda2" ino=92275029 scontext=system_u:system_r:certmonger_t:s0 tcontext=unconfined_u:object_r:container_share_t:s0 tclass=dir permissive=1 
type=AVC msg=audit(20/08/19 11:22:32.425:7868) : avc:  denied  { read write } for  pid=26655 comm=podman name=layers.lock dev="vda2" ino=46138520 scontext=system_u:system_r:certmonger_t:s0 tcontext=unconfined_u:object_r:container_share_t:s0 tclass=file permissive=1 

Similarly with container_var_lib_t:
type=AVC msg=audit(20/08/19 11:22:32.423:7856) : avc:  denied  { lock } for  pid=26655 comm=podman path=/var/lib/containers/storage/libpod/bolt_state.db dev="vda2" ino=85984286 scontext=system_u:system_r:certmonger_t:s0 tcontext=unconfined_u:object_r:container_var_lib_t:s0 tclass=file permissive=1 
type=AVC msg=audit(20/08/19 11:22:32.423:7855) : avc:  denied  { read write } for  pid=26655 comm=podman name=bolt_state.db dev="vda2" ino=85984286 scontext=system_u:system_r:certmonger_t:s0 tcontext=unconfined_u:object_r:container_var_lib_t:s0 tclass=file permissive=1 

And again for container_var_run_t:
type=AVC msg=audit(20/08/19 11:22:32.432:7876) : avc:  denied  { write } for  pid=26655 comm=podman name=alive.lck dev="tmpfs" ino=90484 scontext=system_u:system_r:certmonger_t:s0 tcontext=unconfined_u:object_r:container_var_run_t:s0 tclass=file permissive=1 
type=AVC msg=audit(20/08/19 11:22:39.634:7971) : avc:  denied  { lock } for  pid=26953 comm=podman path=/run/libpod/alive.lck dev="tmpfs" ino=90484 scontext=system_u:system_r:certmonger_t:s0 tcontext=unconfined_u:object_r:container_var_run_t:s0 tclass=file permissive=1 

fs_t:filesystem unmount AVCs seems to originate from running "podman ps --format={{.Names}}"?

iptables_exec_t (execute, execute_no_trans)
- /usr/sbin/iptables --version coming from "comm=podman path=/usr/sbin/xtables-nft-multi"?

Again something related to podman... The netlink_netfilter_socket AVCs are all related to iptables as well.

self:capability, setrlimit, again, the script and some podman commands that look suspicious:
 /usr/bin/certmonger-rabbitmq-refresh.sh 
type=AVC msg=audit(20/08/19 11:22:32.422:7853) : avc:  denied  { setrlimit } for  pid=26655 comm=podman scontext=system_u:system_r:certmonge
r_t:s0 tcontext=system_u:system_r:certmonger_t:s0 tclass=process permissive=1 
type=AVC msg=audit(20/08/19 11:22:32.422:7853) : avc:  denied  { sys_resource } for  pid=26655 comm=podman capability=sys_resource  scontext=system_u:system_r:certmonger_t:s0 tcontext=system_u:system_r:certmonger_t:s0 tclass=capability permissive=1 
type=AVC msg=audit(20/08/19 11:22:32.425:7867) : avc:  denied  { create } for  pid=26655 comm=podman name=backingFsBlockDev scontext=system_
u:system_r:certmonger_t:s0 tcontext=system_u:object_r:container_share_t:s0 tclass=blk_file permissive=1 
type=AVC msg=audit(20/08/19 11:22:32.425:7867) : avc:  denied  { add_name } for  pid=26655 comm=podman name=backingFsBlockDev scontext=syste
m_u:system_r:certmonger_t:s0 tcontext=unconfined_u:object_r:container_share_t:s0 tclass=dir permissive=1 
type=AVC msg=audit(20/08/19 11:22:32.425:7867) : avc:  denied  { mknod } for  pid=26655 comm=podman capability=mknod  scontext=system_u:system_r:certmonger_t:s0 tcontext=system_u:system_r:certmonger_t:s0 tclass=capability permissive=1 

"podman exec  cp /var/lib/kolla/config_files/src-tls/etc/pki/tls/certs/rabbitmq.crt /etc/pki/tls/certs/rabbitmq.crt" leads to "type=AVC msg=audit(20/08/19 11:22:32.585:7879) : avc:  denied  { sys_resource } for  pid=26669 comm=podman capability=sys_resource  scontext=system_u:system_r:certmonger_t:s0 tcontext=system_u:system_r:certmonger_t:s0 tclass=capability permissive=1"

There is only the one virtd denial, not sure what it's related to although certmonger_t is the source context again:
type=USER_AVC msg=audit(20/08/19 11:22:38.440:7949) : pid=1 uid=root auid=unset ses=unset subj=system_u:system_r:init_t:s0 msg='avc:  denied  { reload } for auid=unset uid=root gid=root path=/usr/lib/systemd/system/libvirtd.service cmdline="" scontext=system_u:system_r:certmonger_t:s0 tcontext=system_u:object_r:virtd_unit_file_t:s0 tclass=service permissive=1  exe=/usr/lib/systemd/systemd sauid=root hostname=? addr=? terminal=?'

Comment 15 Nathan Kinder 2019-08-20 16:19:32 UTC
Assuming my theory outlined in comment#12 is correct, these snippets of code are where the option is specified/used for the VNC cert (there may be other locations with the same issue):

https://github.com/openstack/puppet-tripleo/blob/6f790d624198eeb9219b26848c05f0edafd09dab/manifests/certmonger/libvirt_vnc.pp#L43-L45

https://github.com/openstack/tripleo-heat-templates/blob/456c8da28cd94dad380da34d5cfd54682dbf9549/deployment/nova/nova-libvirt-container-puppet.yaml#L466-L472

Comment 16 Julie Pichon 2019-08-20 18:17:56 UTC
(In reply to Michele Baldessari from comment #11)
> Want to throw a patch my way for fixing B) and we see if the deploy works
> and then we investigate A) and the hiera denials a bit more in detail?

Nathan's comments suggest that this probably should be fixed as a code change somewhere instead of a selinux rule, but I'm not sure how to easily test with a new rule if we wanted to try it. Adding the rule manually won't be enough pre-deploy, since the denials were seen on a controller. If I create a test rpm, would the controllers/overcloud nodes be able to pick it up during the deploy also, rather than whatever is in the compose?

Comment 17 Harry Rybacki 2019-08-20 19:06:56 UTC
We suspect that our recently merged workaround[1] for an underlying bug in certmonger[2] is the source of this. We have submitted a revert[3] for each release while we investigate further.

[1] - https://review.opendev.org/#/c/673887/
[2] - https://bugzilla.redhat.com/show_bug.cgi?id=1710632
[3] - https://review.opendev.org/#/c/677548/

Comment 18 Julie Pichon 2019-08-21 11:37:59 UTC
Thank you for the update. Is this expected to resolve the main "hanging" issue, and so we can handle the other AVC denials normally as non-blockers?

Comment 19 Julie Pichon 2019-08-21 16:40:33 UTC
There were thoughts that maybe it's hiera trying to parse everything causing all the read-only denials, but from what I can tell the hiera part should be resolved with "allow certmonger_t puppet_etc_t:file { getattr ioctl open read }" as the files appear to be labelled as expected. Matching proctitle to the read-only AVCs, this is what I see:

proctitle=pkill -o -HUP stunnel (getattr/open/read/search denials)
- auditd_t
- chronyd_t
- cluster_t
- crond_t
- getty_t
- gssproxy_t
- irqbalance_t
- kernel_t
- NetworkManager_t
- oddjob_t
- openvswitch_t
- policykit_t
- rhsmcertd_t
- rngd_t
- rpcbind_t
- sshd_t
- sssd_t
- syslogd_t
- system_dbusd_t
- systemd_logind_t
- systemd_machined_t
- tuned_t
- udev_t
- unconfined_t:s0-s0:c0.c1023
- virt_qemu_ga_t
- virtd_t

proctitle=pkill -USR1 httpd (open/search/getattr/read denials)
- chronyd_t
- gssproxy_t
- irqbalance_t
- kernel_t
- policykit_t
- rhsmcertd_t
- system_dbusd_t

I didn't see any other kill commands.

The "podman ps" command at https://github.com/openstack/puppet-tripleo/blob/master/files/certmonger-haproxy-refresh.sh#L36 seems responsible for many of the complicated denials in comment 13.

Comment 22 Ade Lee 2019-08-22 15:33:10 UTC
So, just as a data point.

I ran the install with the revert in place and was able to get a successful install.
I'm trying again on a different machine.

At the same time, though, I noticed that I first needed to manually delete the hosts from IPA from the previous run.
This appears to be because the novajoin_notifier is not receiving notifications from nova.  It is supposed to receive notifications
on server delete on a novajoin_notifications queue.

Not sure if this is selinux related - or basically another bug.  Still investigating ..

Comment 23 melanie witt 2019-08-22 19:22:24 UTC
Update: Ade and I poked around on the deployment and found that the notifications driver config setting: [oslo_messaging_notifications]/driver is set to 'noop' which will cause services (like nova) not to emit any notifications. The setting should be [oslo_messaging_notifications]/driver=messagingv2 in order for services to emit notifications.

Alex Schultz let us know that this is related to the following recent change to upstream tripleo:

https://review.opendev.org/674639

which will result in setting [oslo_messaging_notifications]/driver=noop when telemetry is not enabled. The ability to set driver=noop was added in an older change:

https://review.opendev.org/481726

which environments/disable-telemetry.yaml makes use of:

https://github.com/openstack/tripleo-heat-templates/blob/stable/stein/environments/disable-telemetry.yaml

The recent tripleo change was backported to stein:

https://review.opendev.org/675114

and we picked it up in OSP 15 because we're pulling from upstream.

A fix is being investigated from the tripleo side in order to enable notifications when novajoin is present in a deployment.

Comment 24 Harry Rybacki 2019-08-23 17:13:57 UTC
Updating component to THT. We have located the revert in build openstack-tripleo-heat-templates-10.6.1-0.20190823114746.30390bc.el8ost. 

Moving RHBZ to MODIFIED. Will follow up with Rel/Del for switch to ON_QA.

Comment 25 Julie Pichon 2019-08-23 17:34:55 UTC
For reference (hit some issue with the bugzilla 'clone' functionality), opened bug 1745142 to deal with the remaining AVCs.

Comment 33 Waldemar Znoinski 2019-09-04 14:17:29 UTC
*** Bug 1747957 has been marked as a duplicate of this bug. ***

Comment 37 errata-xmlrpc 2019-09-21 11:24:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:2811


Note You need to log in before you can comment on or make changes to this bug.