Bug 1612960 - Openstack containers are needlessly restarted during overcloud redeploy/update
Summary: Openstack containers are needlessly restarted during overcloud redeploy/update
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: z3
Target Release: 13.0 (Queens)
Assignee: Michele Baldessari
QA Contact: Gurenko Alex
URL:
Whiteboard:
Duplicates: 1642178 (view as bug list)
Depends On:
Blocks:
 
Reported: 2018-08-06 15:52 UTC by Marian Krcmarik
Modified: 2018-11-13 22:28 UTC
CC: 14 users

Fixed In Version: openstack-tripleo-heat-templates-8.0.4-26.el7ost
Doc Type: Bug Fix
Doc Text:
Previously, a redeployment could needlessly restart a number of containers even when no configuration had changed, because too many irrelevant files were included in the md5 calculation of the config files. With this release, a redeploy with no configuration change no longer triggers spurious container restarts.
Clone Of:
Environment:
Last Closed: 2018-11-13 22:27:47 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1786065 0 None None None 2018-08-08 15:48:49 UTC
OpenStack gerrit 599292 0 None MERGED Make redeploy idempotent 2021-01-27 07:32:38 UTC
Red Hat Product Errata RHBA-2018:3587 0 None None None 2018-11-13 22:28:27 UTC

Description Marian Krcmarik 2018-08-06 15:52:10 UTC
Description of problem:
Some of the OpenStack containers are needlessly restarted during an overcloud redeploy/update even though there is no configuration change.

I am attaching a diff between two files: the file named "before" contains a list of all running containers with their start times before the overcloud redeploy, and the file named "after" contains the same list once the overcloud was redeployed/updated.
$ diff before after 
7,10c7,10
< ceilometer_agent_notification: 2018-08-06 10:42:45 +0000 UTC
< cinder_api: 2018-08-06 10:42:43 +0000 UTC
< cinder_api_cron: 2018-08-06 10:42:49 +0000 UTC
< cinder_scheduler: 2018-08-06 10:42:47 +0000 UTC
---
> ceilometer_agent_notification: 2018-08-06 15:24:57 +0000 UTC
> cinder_api: 2018-08-06 15:24:56 +0000 UTC
> cinder_api_cron: 2018-08-06 15:25:00 +0000 UTC
> cinder_scheduler: 2018-08-06 15:24:59 +0000 UTC
18,21c18,21
< heat_api: 2018-08-06 10:42:43 +0000 UTC
< heat_api_cfn: 2018-08-06 10:42:50 +0000 UTC
< heat_api_cron: 2018-08-06 10:42:50 +0000 UTC
< heat_engine: 2018-08-06 10:42:47 +0000 UTC
---
> heat_api: 2018-08-06 15:24:56 +0000 UTC
> heat_api_cfn: 2018-08-06 15:25:02 +0000 UTC
> heat_api_cron: 2018-08-06 15:25:01 +0000 UTC
> heat_engine: 2018-08-06 15:24:58 +0000 UTC
24,26c24,26
< keystone: 2018-08-06 10:40:02 +0000 UTC
< keystone_cron: 2018-08-06 10:40:03 +0000 UTC
< logrotate_crond: 2018-08-06 10:42:51 +0000 UTC
---
> keystone: 2018-08-06 15:21:44 +0000 UTC
> keystone_cron: 2018-08-06 15:21:45 +0000 UTC
> logrotate_crond: 2018-08-06 15:25:03 +0000 UTC
33,37c33,37
< nova_api: 2018-08-06 10:42:53 +0000 UTC
< nova_api_cron: 2018-08-06 10:42:44 +0000 UTC
< nova_conductor: 2018-08-06 10:42:48 +0000 UTC
< nova_consoleauth: 2018-08-06 10:42:44 +0000 UTC
< nova_metadata: 2018-08-06 10:42:53 +0000 UTC
---
> nova_api: 2018-08-06 15:25:04 +0000 UTC
> nova_api_cron: 2018-08-06 15:24:56 +0000 UTC
> nova_conductor: 2018-08-06 15:25:00 +0000 UTC
> nova_consoleauth: 2018-08-06 15:24:57 +0000 UTC
> nova_metadata: 2018-08-06 15:25:05 +0000 UTC
39,41c39,41
< nova_scheduler: 2018-08-06 10:42:43 +0000 UTC
< nova_vnc_proxy: 2018-08-06 10:42:46 +0000 UTC
< panko_api: 2018-08-06 10:42:54 +0000 UTC
---
> nova_scheduler: 2018-08-06 15:24:55 +0000 UTC
> nova_vnc_proxy: 2018-08-06 15:24:57 +0000 UTC
> panko_api: 2018-08-06 15:25:06 +0000 UTC
65,68c65,68
< ceilometer_agent_notification: 2018-08-06 10:42:50 +0000 UTC
< cinder_api: 2018-08-06 10:42:49 +0000 UTC
< cinder_api_cron: 2018-08-06 10:42:55 +0000 UTC
< cinder_scheduler: 2018-08-06 10:42:53 +0000 UTC
---
> ceilometer_agent_notification: 2018-08-06 15:24:55 +0000 UTC
> cinder_api: 2018-08-06 15:24:53 +0000 UTC
> cinder_api_cron: 2018-08-06 15:24:59 +0000 UTC
> cinder_scheduler: 2018-08-06 15:24:57 +0000 UTC
76,79c76,79
< heat_api: 2018-08-06 10:42:49 +0000 UTC
< heat_api_cfn: 2018-08-06 10:42:56 +0000 UTC
< heat_api_cron: 2018-08-06 10:42:55 +0000 UTC
< heat_engine: 2018-08-06 10:42:52 +0000 UTC
---
> heat_api: 2018-08-06 15:24:54 +0000 UTC
> heat_api_cfn: 2018-08-06 15:25:01 +0000 UTC
> heat_api_cron: 2018-08-06 15:25:00 +0000 UTC
> heat_engine: 2018-08-06 15:24:57 +0000 UTC
82,84c82,84
< keystone: 2018-08-06 10:40:09 +0000 UTC
< keystone_cron: 2018-08-06 10:40:10 +0000 UTC
< logrotate_crond: 2018-08-06 10:42:57 +0000 UTC
---
> keystone: 2018-08-06 15:21:41 +0000 UTC
> keystone_cron: 2018-08-06 15:21:42 +0000 UTC
> logrotate_crond: 2018-08-06 15:25:01 +0000 UTC
91,95c91,95
< nova_api: 2018-08-06 10:42:59 +0000 UTC
< nova_api_cron: 2018-08-06 10:42:49 +0000 UTC
< nova_conductor: 2018-08-06 10:42:54 +0000 UTC
< nova_consoleauth: 2018-08-06 10:42:50 +0000 UTC
< nova_metadata: 2018-08-06 10:42:59 +0000 UTC
---
> nova_api: 2018-08-06 15:25:02 +0000 UTC
> nova_api_cron: 2018-08-06 15:24:54 +0000 UTC
> nova_conductor: 2018-08-06 15:24:58 +0000 UTC
> nova_consoleauth: 2018-08-06 15:24:55 +0000 UTC
> nova_metadata: 2018-08-06 15:25:03 +0000 UTC
97,99c97,99
< nova_scheduler: 2018-08-06 10:42:48 +0000 UTC
< nova_vnc_proxy: 2018-08-06 10:42:51 +0000 UTC
< panko_api: 2018-08-06 10:43:00 +0000 UTC
---
> nova_scheduler: 2018-08-06 15:24:53 +0000 UTC
> nova_vnc_proxy: 2018-08-06 15:24:56 +0000 UTC
> panko_api: 2018-08-06 15:25:04 +0000 UTC
123,126c123,126
< ceilometer_agent_notification: 2018-08-06 10:43:10 +0000 UTC
< cinder_api: 2018-08-06 10:43:09 +0000 UTC
< cinder_api_cron: 2018-08-06 10:43:14 +0000 UTC
< cinder_scheduler: 2018-08-06 10:43:12 +0000 UTC
---
> ceilometer_agent_notification: 2018-08-06 15:24:56 +0000 UTC
> cinder_api: 2018-08-06 15:24:54 +0000 UTC
> cinder_api_cron: 2018-08-06 15:25:01 +0000 UTC
> cinder_scheduler: 2018-08-06 15:24:58 +0000 UTC
134,137c134,137
< heat_api: 2018-08-06 10:43:09 +0000 UTC
< heat_api_cfn: 2018-08-06 10:43:16 +0000 UTC
< heat_api_cron: 2018-08-06 10:43:15 +0000 UTC
< heat_engine: 2018-08-06 10:43:11 +0000 UTC
---
> heat_api: 2018-08-06 15:24:54 +0000 UTC
> heat_api_cfn: 2018-08-06 15:25:03 +0000 UTC
> heat_api_cron: 2018-08-06 15:25:02 +0000 UTC
> heat_engine: 2018-08-06 15:24:57 +0000 UTC
140,142c140,142
< keystone: 2018-08-06 10:40:11 +0000 UTC
< keystone_cron: 2018-08-06 10:40:15 +0000 UTC
< logrotate_crond: 2018-08-06 10:43:17 +0000 UTC
---
> keystone: 2018-08-06 15:21:58 +0000 UTC
> keystone_cron: 2018-08-06 15:22:02 +0000 UTC
> logrotate_crond: 2018-08-06 15:25:04 +0000 UTC
149,153c149,153
< nova_api: 2018-08-06 10:43:18 +0000 UTC
< nova_api_cron: 2018-08-06 10:43:09 +0000 UTC
< nova_conductor: 2018-08-06 10:43:13 +0000 UTC
< nova_consoleauth: 2018-08-06 10:43:10 +0000 UTC
< nova_metadata: 2018-08-06 10:43:19 +0000 UTC
---
> nova_api: 2018-08-06 15:25:05 +0000 UTC
> nova_api_cron: 2018-08-06 15:24:55 +0000 UTC
> nova_conductor: 2018-08-06 15:25:00 +0000 UTC
> nova_consoleauth: 2018-08-06 15:24:55 +0000 UTC
> nova_metadata: 2018-08-06 15:25:06 +0000 UTC
155,158c155,158
< nova_scheduler: 2018-08-06 10:43:08 +0000 UTC
< nova_vnc_proxy: 2018-08-06 10:43:11 +0000 UTC
< openstack-cinder-volume-docker-0: 2018-08-06 10:44:50 +0000 UTC
< panko_api: 2018-08-06 10:43:19 +0000 UTC
---
> nova_scheduler: 2018-08-06 15:24:54 +0000 UTC
> nova_vnc_proxy: 2018-08-06 15:24:57 +0000 UTC
> openstack-cinder-volume-docker-0: 2018-08-06 15:26:33 +0000 UTC
> panko_api: 2018-08-06 15:25:06 +0000 UTC
179c179
< logrotate_crond: 2018-08-06 10:42:55 +0000 UTC
---
> logrotate_crond: 2018-08-06 15:24:28 +0000 UTC
181,184c181,184
< nova_compute: 2018-08-06 10:42:55 +0000 UTC
< nova_libvirt: 2018-08-06 10:39:57 +0000 UTC
< nova_migration_target: 2018-08-06 10:42:55 +0000 UTC
< nova_virtlogd: 2018-08-06 10:39:57 +0000 UTC
---
> nova_compute: 2018-08-06 15:24:28 +0000 UTC
> nova_libvirt: 2018-08-06 15:21:31 +0000 UTC
> nova_migration_target: 2018-08-06 15:24:28 +0000 UTC
> nova_virtlogd: 2018-08-06 15:21:30 +0000 UTC
188c188
< logrotate_crond: 2018-08-06 10:42:49 +0000 UTC
---
> logrotate_crond: 2018-08-06 15:24:20 +0000 UTC
190,193c190,193
< nova_compute: 2018-08-06 10:42:48 +0000 UTC
< nova_libvirt: 2018-08-06 10:39:52 +0000 UTC
< nova_migration_target: 2018-08-06 10:42:48 +0000 UTC
< nova_virtlogd: 2018-08-06 10:39:52 +0000 UTC
---
> nova_compute: 2018-08-06 15:24:20 +0000 UTC
> nova_libvirt: 2018-08-06 15:21:22 +0000 UTC
> nova_migration_target: 2018-08-06 15:24:20 +0000 UTC
> nova_virtlogd: 2018-08-06 15:21:22 +0000 UTC

Version-Release number of selected component (if applicable):
OSP13 - rhosp-director-images-13.0-20180803.1.el7ost.noarch

How reproducible:
always

Steps to Reproduce:
1. Deploy overcloud
2. Redeploy the same overcloud with no configuration change

Actual results:
Some OpenStack containers are restarted needlessly, even though there is no configuration change.

Expected results:
No containers are restarted, because no configuration change was made.

Additional info:

Comment 1 Emilien Macchi 2018-08-06 17:52:10 UTC
Containers restart when they become unhealthy; docker will restart them. Could you provide some logs so we can see whether the containers are unhealthy before their restarts? That way we can figure out whether it is something to do with the applications themselves.

Comment 2 Michele Baldessari 2018-08-07 09:34:05 UTC
I have observed this behavior as well (after Marian mentioned it to me).
So the reproducer used is the following (run after a successful deployment):
#!/bin/bash
tripleo-ansible-inventory --static-yaml-inventory inv.yaml
ansible -f1 -i inv.yaml  -m shell --become -a "docker ps --format=\"{{ '{{' }}.Names{{ '}}' }}: {{ '{{' }}.CreatedAt{{ '}}' }}\" | sort" overcloud > before
./overcloud_deploy.sh
ansible -f1 -i inv.yaml  -m shell --become -a "docker ps --format=\"{{ '{{' }}.Names{{ '}}' }}: {{ '{{' }}.CreatedAt{{ '}}' }}\" | sort" overcloud > after
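Note: the {{ '{{' }} / {{ '}}' }} sequences are just Jinja escapes so that ansible does not try to template docker's Go format string itself; the command that effectively runs on each node is simply:

docker ps --format "{{.Names}}: {{.CreatedAt}}" | sort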

So I took one of the affected containers (keystone) and ran docker inspect before and after:
docker inspect keystone > keystone-before.txt
...redeploy with no changes...
docker inspect keystone > keystone-after.txt

diffing the two text files brings up at least one point that needs more analysis:
-                "TRIPLEO_CONFIG_HASH=6802e3cfc6fd2a7257269bd40acfd34c",
+                "TRIPLEO_CONFIG_HASH=e1378deafc578eb34b5ce93b167a360f",

(There are a bunch of bind-mount changes, but those seem to be due only to the non-guaranteed ordering.)

Comment 3 Michele Baldessari 2018-08-07 12:35:26 UTC
Ok so one of the reasons that *some* containers restart is that the cron file config changes at every run. The reason for this is the following snippet:
[root@controller-0 usr]# cat /var/lib/config-data/puppet-generated/keystone/var/spool/cron/keystone
# HEADER: This file was autogenerated at 2018-08-07 11:44:57 +0000 by puppet.
# HEADER: While it can still be managed manually, it is definitely not recommended.
# HEADER: Note particularly that the comments starting with 'Puppet Name' should
# HEADER: not be deleted, as doing so could cause duplicate cron jobs.
# Puppet Name: keystone-manage token_flush
PATH=/bin:/usr/bin:/usr/sbin SHELL=/bin/sh
1 * * * * keystone-manage token_flush >>/var/log/keystone/keystone-tokenflush.log 2>&1

The #{Time.now} call in the HEADER lines triggers an md5 change in the cron file, which is part of the keystone volumes over which we calculate the md5 hash, so we restart the container. From a quick look this seems hardcoded into puppet (which we ship inside the container), and we will probably need to find a way to exclude '^#.*HEADER.*' lines from the md5 calculation (or find another way).

Looking at the puppet code, the 'parsedfile' provider has this behaviour as well, so if we use that provider anywhere we will be affected there too.
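
A quick local check (not part of any fix) to confirm that only the timestamped HEADER lines are what changes in this file:

F=/var/lib/config-data/puppet-generated/keystone/var/spool/cron/keystone
md5sum "$F"                           # changes after every redeploy (timestamp in the HEADER)
sed '/^#.*HEADER.*/d' "$F" | md5sum   # stays identical across redeploys if nothing else changed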

Comment 4 Michele Baldessari 2018-08-08 14:44:17 UTC
So we really have three separate issues here. Below is the list of all the containers
that may restart needlessly (at least what I have observed in my tests):
A) cron category:
ceilometer_agent_notification
cinder_api
cinder_api_cron
cinder_scheduler
heat_api
heat_api_cfn
heat_api_cron
heat_engine
keystone
keystone_cron
logrotate_crond
nova_api
nova_api_cron
nova_conductor
nova_consoleauth
nova_metadata
nova_scheduler
nova_vnc_proxy
openstack-cinder-volume-docker-0
panko_api

These end up being restarted because in the config volume for the container there is
a cron file and cron files are generated with a timestamp inside:
$ cat /var/lib/config-data/puppet-generated/keystone/var/spool/cron/keystone
# HEADER: This file was autogenerated at 2018-08-07 11:44:57 +0000 by puppet.
# HEADER: While it can still be managed manually, it is definitely not recommended.
# HEADER: Note particularly that the comments starting with 'Puppet Name' should
# HEADER: not be deleted, as doing so could cause duplicate cron jobs.
# Puppet Name: keystone-manage token_flush
PATH=/bin:/usr/bin:/usr/sbin SHELL=/bin/sh
1 * * * * keystone-manage token_flush >>/var/log/keystone/keystone-tokenflush.log 2>&1

The timestamp is unfortunately hard coded into puppet in both the cron provider and the parsedfile
provider:
https://github.com/puppetlabs/puppet/blob/master/lib/puppet/provider/cron/crontab.rb#L127
https://github.com/puppetlabs/puppet/blob/master/lib/puppet/provider/parsedfile.rb#L104

One possible fix would be to change the tar command in docker-puppet.py to be something like
this:
tar -c -f - /var/lib/config-data/${NAME} --mtime='1970-01-01' | tar xO | sed '/#.*HEADER.*/d'

Note that the 'tar xO' is needed in order to force text output and not use the
native tar format (if you don't the sed won't work).

B) swift category:
swift_account_auditor
swift_account_reaper
swift_account_replicator
swift_account_server
swift_container_auditor
swift_container_replicator
swift_container_server
swift_container_updater
swift_object_auditor
swift_object_expirer
swift_object_replicator
swift_object_server
swift_object_updater
swift_proxy
swift_rsync

So the swift containers restart because when recalculating the md5 over the
/var/lib/config-data/puppet-generated/swift folder we also include:
B.1) /etc/swift/backups/... — a folder which over time collects backups of the ring files
B.2) /etc/swift/*.gz — the *.gz files seem to change over time

I see two potential fixes here:
B.3) We simply exclude rings and their backups from the tar commands:
EXCL="--exclude='*/etc/swift/backups/*' --exclude='*/etc/swift/*.gz'"
tar -c -f - /var/lib/config-data/${NAME} --mtime='1970-01-01' $EXCL | tar xO | sed '/^#.*HEADER.*/d'

B.4) We stop storing the rings under /etc/swift and store them somewhere else?

I lack the necessary swift knowledge to add anything too useful here.

C) libvirt category:
nova_compute
nova_libvirt
nova_migration_target
nova_virtlogd

This one seems to be due to the fact that the /etc/libvirt/passwd.db file contains a timestamp, or is somehow
just different after a redeploy:
[root@compute-1 nova_libvirt]# git diff cb2441bb1caf7572ccfd870561dcc29d7819ba04..0c7441f30926b111603ce4d4b60c6000fe49d290 .
diff --git a/puppet-generated/nova_libvirt/etc/libvirt/passwd.db b/puppet-generated/nova_libvirt/etc/libvirt/passwd.db
index 66e2024..6ffa492 100644
Binary files a/puppet-generated/nova_libvirt/etc/libvirt/passwd.db and b/puppet-generated/nova_libvirt/etc/libvirt/passwd.db differ

(I have not exactly understood why yet because in my quick test without tls a simple 'saslpasswd2 -d -a libvirt -u overcloud migration' did not change the md5 for me, but maybe I am missing something)

Excluding this file from the tar command is probably a bad idea because if you do indeed change the password of the migration user then you probably do want the libvirtd container to restart (actually you likely want only that one to restart, as opposed to all nova* containers on the compute which is what happens now)

Just to make sure that the above theories are all correct I used the following tar commands:
EXCL="--exclude='*/etc/swift/backups/*' --exclude='*/etc/swift/*.gz' --exclude '*/etc/libvirt/passwd.db'"
tar -c -f - /var/lib/config-data/${NAME} --mtime='1970-01-01' $EXCL | tar xO | sed '/^#.*HEADER.*/d' | md5sum | awk '{print $1}' > /var/lib/config-data/${NAME}.md5sum
tar -c -f - /var/lib/config-data/puppet-generated/${NAME} --mtime='1970-01-01' $EXCL | tar xO | sed '/^#.*HEADER.*/d' | md5sum | awk '{print $1}' > /var/lib/config-data/puppet-generated/${NAME}.md5sum

And I observed no spurious restarts.

Comment 5 Michele Baldessari 2018-08-08 16:02:22 UTC
(In reply to Michele Baldessari from comment #4)
> So the swift containers restart because when recalculating the md5 over the
> /var/lib/config-data/puppet-generated/swift folder we also include:
> B.1) /etc/swift/backups/... which is a folder which over time collects
> backup of the ringfiles
> B.2) /etc/swift/*.gz it seems that the *.gz files seem to change over time
> 
> I see two potential fixes here:
> B.3) We simply exclude rings and their backups from the tar commands:
> EXCL="--exclude='*/etc/swift/backups/*' --exclude='*/etc/swift/*.gz'"
> tar -c -f - /var/lib/config-data/${NAME} --mtime='1970-01-01' $EXCL | tar xO
> | sed '/^#.*HEADER.*/d'
> 
> B.4) We stop storing the rings under /etc/swift and we store them in some
> other place ?

Spoke to Giulio and the rings (either the backup ones or the ones in /etc/swift/*.ring.gz) do not need to trigger any container restart when they change, so option B.3 should be viable.

Comment 6 Martin Schuppert 2018-08-09 09:34:33 UTC
(In reply to Michele Baldessari from comment #4)
> 
> C) libvirt category:
> nova_compute
> nova_libvirt
> nova_migration_target
> nova_virtlogd
> 
> This one seems to be due to the fact that the /etc/libvirt/passwd.db file
> contains a timestamp or somehow
> is just different after a redeploy:
> [root@compute-1 nova_libvirt]# git diff
> cb2441bb1caf7572ccfd870561dcc29d7819ba04..
> 0c7441f30926b111603ce4d4b60c6000fe49d290 .
> diff --git a/puppet-generated/nova_libvirt/etc/libvirt/passwd.db
> b/puppet-generated/nova_libvirt/etc/libvirt/passwd.db
> index 66e2024..6ffa492 100644
> Binary files a/puppet-generated/nova_libvirt/etc/libvirt/passwd.db and
> b/puppet-generated/nova_libvirt/etc/libvirt/passwd.db differ
> 
> (I have not exactly understood why yet because in my quick test without tls
> a simple 'saslpasswd2 -d -a libvirt -u overcloud migration' did not change
> the md5 for me, but maybe I am missing something)
> 

checking md5sum and stat from before and after an overcloud deploy

* before overcloud deploy

[root@compute-0 libvirt]# cat 1
-rw-r-----. 1 root root 12288 Aug  8 16:22 passwd.db
71ede9ac578c3e22efe1d0d2d68cb1ff  passwd.db
  File: ‘passwd.db’
  Size: 12288           Blocks: 24         IO Block: 4096   regular file
Device: fc02h/64514d    Inode: 14733371    Links: 1
Access: (0640/-rw-r-----)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:virt_etc_t:s0
Access: 2018-08-09 06:17:59.725717427 +0000
Modify: 2018-08-08 16:22:49.161938000 +0000
Change: 2018-08-08 16:29:45.598883277 +0000
 Birth: -


* after overcloud deploy

[root@compute-0 libvirt]# cat 2
-rw-r-----. 1 root root 12288 Aug  9 06:41 passwd.db
80b03ead0f5b13ea05ab6462f2ef7762  passwd.db
  File: ‘passwd.db’
  Size: 12288           Blocks: 24         IO Block: 4096   regular file
Device: fc02h/64514d    Inode: 14758991    Links: 1
Access: (0640/-rw-r-----)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:virt_etc_t:s0
Access: 2018-08-09 06:52:22.338206274 +0000
Modify: 2018-08-09 06:41:50.306298000 +0000
Change: 2018-08-09 06:48:46.525213436 +0000
 Birth: -

     93     if !empty($tls_password) {
     94       $libvirt_sasl_command = "echo \"\${TLS_PASSWORD}\" | saslpasswd2 -p -a libvirt -u overcloud migration"
     95       $libvirt_auth_ensure = present
     96       $libvirt_auth_conf = "
     97 [credentials-overcloud]
     98 authname=migration@overcloud
     99 password=${tls_password}
    100 
    101 [auth-libvirt-default]
    102 credentials=overcloud
    103 "
    104     }
    105     else {
    106       $libvirt_sasl_command = 'saslpasswd2 -d -a libvirt -u overcloud migration'
    107       $libvirt_auth_ensure = absent
    108       $libvirt_auth_conf = ''
    109     }
    110 
    111     exec{ 'set libvirt sasl credentials':
    112       environment => ["TLS_PASSWORD=${tls_password}"],
    113       command     => $libvirt_sasl_command,
    114       path        => ['/sbin', '/usr/sbin', '/bin', '/usr/bin'],
    115       require     => File['/etc/sasl2/libvirt.conf'],
    116       tag         => ['libvirt_tls_password']
    117     }

When running the same $libvirt_sasl_command manually with the same pwd, the md5sum/inode does not change.

[root@controller-0 libvirt]# md5sum passwd.db && stat passwd.db 
9424d45a095a94d5c1d5a6d1510a4d53  passwd.db
  File: ‘passwd.db’
  Size: 12288           Blocks: 16         IO Block: 4096   regular file
Device: fc02h/64514d    Inode: 14876433    Links: 1
Access: (0640/-rw-r-----)  Uid: (    0/    root)   Gid: (    0/    root)
Context: unconfined_u:object_r:virt_etc_t:s0
Access: 2018-08-09 07:09:10.992618579 +0000
Modify: 2018-08-09 07:09:10.991618574 +0000
Change: 2018-08-09 07:09:10.991618574 +0000
 Birth: -

[root@controller-0 libvirt]# echo "1234567" | saslpasswd2 -p -a libvirt -u overcloud migration
[root@controller-0 libvirt]# md5sum passwd.db && stat passwd.db 
9424d45a095a94d5c1d5a6d1510a4d53  passwd.db
  File: ‘passwd.db’
  Size: 12288           Blocks: 16         IO Block: 4096   regular file
Device: fc02h/64514d    Inode: 14876433    Links: 1
Access: (0640/-rw-r-----)  Uid: (    0/    root)   Gid: (    0/    root)
Context: unconfined_u:object_r:virt_etc_t:s0
Access: 2018-08-09 09:25:10.490826254 +0000
Modify: 2018-08-09 09:25:10.489826249 +0000
Change: 2018-08-09 09:25:10.489826249 +0000
 Birth: -

Comment 7 Martin Schuppert 2018-08-09 14:19:50 UTC
(In reply to Martin Schuppert from comment #6)
> (In reply to Michele Baldessari from comment #4)
> > 
> > C) libvirt category:
> > nova_compute
> > nova_libvirt
> > nova_migration_target
> > nova_virtlogd
> > 
> > This one seems to be due to the fact that the /etc/libvirt/passwd.db file
> > contains a timestamp or somehow
> > is just different after a redeploy:
> > [root@compute-1 nova_libvirt]# git diff
> > cb2441bb1caf7572ccfd870561dcc29d7819ba04..
> > 0c7441f30926b111603ce4d4b60c6000fe49d290 .
> > diff --git a/puppet-generated/nova_libvirt/etc/libvirt/passwd.db
> > b/puppet-generated/nova_libvirt/etc/libvirt/passwd.db
> > index 66e2024..6ffa492 100644
> > Binary files a/puppet-generated/nova_libvirt/etc/libvirt/passwd.db and
> > b/puppet-generated/nova_libvirt/etc/libvirt/passwd.db differ
> > 
> > (I have not exactly understood why yet because in my quick test without tls
> > a simple 'saslpasswd2 -d -a libvirt -u overcloud migration' did not change
> > the md5 for me, but maybe I am missing something)
> > 
> 
> checking md5sum and stat from before and after an overcloud deploy
> 
> * before overcloud deploy
> 
> [root@compute-0 libvirt]# cat 1
> -rw-r-----. 1 root root 12288 Aug  8 16:22 passwd.db
> 71ede9ac578c3e22efe1d0d2d68cb1ff  passwd.db
>   File: ‘passwd.db’
>   Size: 12288           Blocks: 24         IO Block: 4096   regular file
> Device: fc02h/64514d    Inode: 14733371    Links: 1
> Access: (0640/-rw-r-----)  Uid: (    0/    root)   Gid: (    0/    root)
> Context: system_u:object_r:virt_etc_t:s0
> Access: 2018-08-09 06:17:59.725717427 +0000
> Modify: 2018-08-08 16:22:49.161938000 +0000
> Change: 2018-08-08 16:29:45.598883277 +0000
>  Birth: -
> 
> 
> * after overcloud deploy
> 
> [root@compute-0 libvirt]# cat 2
> -rw-r-----. 1 root root 12288 Aug  9 06:41 passwd.db
> 80b03ead0f5b13ea05ab6462f2ef7762  passwd.db
>   File: ‘passwd.db’
>   Size: 12288           Blocks: 24         IO Block: 4096   regular file
> Device: fc02h/64514d    Inode: 14758991    Links: 1
> Access: (0640/-rw-r-----)  Uid: (    0/    root)   Gid: (    0/    root)
> Context: system_u:object_r:virt_etc_t:s0
> Access: 2018-08-09 06:52:22.338206274 +0000
> Modify: 2018-08-09 06:41:50.306298000 +0000
> Change: 2018-08-09 06:48:46.525213436 +0000
>  Birth: -
> 
>      93     if !empty($tls_password) {
>      94       $libvirt_sasl_command = "echo \"\${TLS_PASSWORD}\" |
> saslpasswd2 -p -a libvirt -u overcloud migration"
>      95       $libvirt_auth_ensure = present
>      96       $libvirt_auth_conf = "
>      97 [credentials-overcloud]
>      98 authname=migration@overcloud
>      99 password=${tls_password}
>     100 
>     101 [auth-libvirt-default]
>     102 credentials=overcloud
>     103 "
>     104     }
>     105     else {
>     106       $libvirt_sasl_command = 'saslpasswd2 -d -a libvirt -u
> overcloud migration'
>     107       $libvirt_auth_ensure = absent
>     108       $libvirt_auth_conf = ''
>     109     }
>     110 
>     111     exec{ 'set libvirt sasl credentials':
>     112       environment => ["TLS_PASSWORD=${tls_password}"],
>     113       command     => $libvirt_sasl_command,
>     114       path        => ['/sbin', '/usr/sbin', '/bin', '/usr/bin'],
>     115       require     => File['/etc/sasl2/libvirt.conf'],
>     116       tag         => ['libvirt_tls_password']
>     117     }
> 
> When running the same $libvirt_sasl_command manually with the same pwd, the
> md5sum/inode does not change.
> 
> [root@controller-0 libvirt]# md5sum passwd.db && stat passwd.db 
> 9424d45a095a94d5c1d5a6d1510a4d53  passwd.db
>   File: ‘passwd.db’
>   Size: 12288           Blocks: 16         IO Block: 4096   regular file
> Device: fc02h/64514d    Inode: 14876433    Links: 1
> Access: (0640/-rw-r-----)  Uid: (    0/    root)   Gid: (    0/    root)
> Context: unconfined_u:object_r:virt_etc_t:s0
> Access: 2018-08-09 07:09:10.992618579 +0000
> Modify: 2018-08-09 07:09:10.991618574 +0000
> Change: 2018-08-09 07:09:10.991618574 +0000
>  Birth: -
> 
> [root@controller-0 libvirt]# echo "1234567" | saslpasswd2 -p -a libvirt -u
> overcloud migration
> [root@controller-0 libvirt]# md5sum passwd.db && stat passwd.db 
> 9424d45a095a94d5c1d5a6d1510a4d53  passwd.db
>   File: ‘passwd.db’
>   Size: 12288           Blocks: 16         IO Block: 4096   regular file
> Device: fc02h/64514d    Inode: 14876433    Links: 1
> Access: (0640/-rw-r-----)  Uid: (    0/    root)   Gid: (    0/    root)
> Context: unconfined_u:object_r:virt_etc_t:s0
> Access: 2018-08-09 09:25:10.490826254 +0000
> Modify: 2018-08-09 09:25:10.489826249 +0000
> Change: 2018-08-09 09:25:10.489826249 +0000
>  Birth: -

The issue is that in the config container the passwd.db gets created by the saslpasswd2 commands that get triggered:

Aug  9 04:47:20 compute-0 puppet-user[10]: (/Stage[main]/Tripleo::Profile::Base::Nova::Libvirt/File[/etc/sasl2/libvirt.conf]/content) content changed '{md5}09c4fa846e8e27bfa3ab3325900d63ea' to '{md5}2f138c0278e1b666ec77a6d8ba3054a1'
Aug  9 04:47:20 compute-0 journal: Notice: /Stage[main]/Tripleo::Profile::Base::Nova::Libvirt/File[/etc/sasl2/libvirt.conf]/content: content changed '{md5}09c4fa846e8e27bfa3ab3325900d63ea' to '{md5}2f138c0278e1b666ec77a6d8ba3054a1'
Aug  9 04:47:20 compute-0 saslpasswd2: error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
Aug  9 04:47:20 compute-0 saslpasswd2: error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
Aug  9 04:47:20 compute-0 saslpasswd2: error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
Aug  9 04:47:20 compute-0 puppet-user[10]: (/Stage[main]/Tripleo::Profile::Base::Nova::Libvirt/Exec[set libvirt sasl credentials]/returns) executed successfully
Aug  9 04:47:20 compute-0 journal: Notice: /Stage[main]/Tripleo::Profile::Base::Nova::Libvirt/Exec[set libvirt sasl credentials]/returns: executed successfully
Aug  9 04:47:20 compute-0 puppet-user[10]: (/Stage[main]/Tripleo::Profile::Base::Nova::Libvirt/File[/etc/libvirt/auth.conf]/ensure) defined content as '{md5}4095b54c2cd67ed91cb56856c63babb9'
Aug  9 04:47:20 compute-0 journal: Notice: /Stage[main]/Tripleo::Profile::Base::Nova::Libvirt/File[/etc/libvirt/auth.conf]/ensure: defined content as '{md5}4095b54c2cd67ed91cb56856c63babb9'
Aug  9 04:47:20 compute-0 puppet-user[10]: (/Stage[main]/Tripleo::Profile::Base::Nova::Migration::Target/File[/etc/nova/migration/authorized_keys]/content) content changed '{md5}dff145cb4e519333c0096aae8de2e77c' to '{md5}e927faa6270b0bb9684bad3f346c4295'
Aug  9 04:47:20 compute-0 journal: Notice: /Stage[main]/Tripleo::Profile::Base::Nova::Migration::Target/File[/etc/nova/migration/authorized_keys]/content: content changed '{md5}dff145cb4e519333c0096aae8de2e77c' to '{md5}e927faa6270b0bb9684bad3f346c4295'

Also, if we do not have TLS enabled and there is no passwd.db, 'saslpasswd2 -d' creates one.
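
In other words the sasldb gets rebuilt from scratch inside the config container on every run. As a rough illustration (hypothetical temp paths, not the actual deploy flow), two freshly created databases with identical entries will most likely still differ byte-for-byte because of per-file Berkeley DB metadata, which is enough to change the md5:

rm -f /tmp/a.db /tmp/b.db
echo "12345678" | saslpasswd2 -p -f /tmp/a.db -a libvirt -u overcloud migration
echo "12345678" | saslpasswd2 -p -f /tmp/b.db -a libvirt -u overcloud migration
md5sum /tmp/a.db /tmp/b.db   # expected to differ despite identical entries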

Comment 8 Michele Baldessari 2018-08-10 13:29:24 UTC
Martin tested whether we actually need to restart the libvirt containers when passwd.db changes, and the answer is no:
(overcloud) [stack@undercloud-0 ~]$ openstack server list --long                                                                                                                                                                                                                          
+--------------------------------------+------+--------+------------+-------------+---------------------+------------+--------------------------------------+-------------+--------------------------------------+-------------------+------------------------+------------+
| ID                                   | Name | Status | Task State | Power State | Networks            | Image Name | Image ID                             | Flavor Name | Flavor ID                            | Availability Zone | Host                   | Properties |
+--------------------------------------+------+--------+------------+-------------+---------------------+------------+--------------------------------------+-------------+--------------------------------------+-------------------+------------------------+------------+
| 93f012fe-bf61-49b0-af2e-25b56cfd6083 | test | ACTIVE | None       | Running     | private=192.168.0.3 | cirros     | 823525a2-dd89-4d5a-9e70-183132bf2d6c | m1.small    | aed75444-94e3-445d-a30f-6881f8dd72dd | nova              | compute-0.redhat.local |            |
+--------------------------------------+------+--------+------------+-------------+---------------------+------------+--------------------------------------+-------------+--------------------------------------+-------------------+------------------------+------------+

1) change pwd on both computes
 
()[root@compute-1 libvirt]# cat auth.conf
 
[credentials-overcloud]
authname=migration@overcloud
password=hAhyktxV9hr8gE2JbCdC2gDjD                                                                                                                                  
 
[auth-libvirt-default]                                                                                                                                              
credentials=overcloud                                                                                                                                               
                                                                                                                                                                    
()[root@compute-0 libvirt]# sasldblistusers2 -f passwd.db                                                                                                           
migration@overcloud: userPassword                                                                                                                                   
                                                                                                                                                                    
()[root@compute-1 libvirt]# echo "12345678" | saslpasswd2 -p -a libvirt -u overcloud migration                                                                       
                                                                                                                                                                    
()[root@compute-0 libvirt]# cat auth.conf                                                                                                                           
                                                                                                                                                                    
[credentials-overcloud]                                                                                                                                             
authname=migration@overcloud                                                                                                                                        
password=12345678                                                                                                                                                   
                                                                                                                                                                    
[auth-libvirt-default]                                                                                                                                              
credentials=overcloud                                                                                                                                               
                                                                                                                                                                    
(overcloud) [stack@undercloud-0 ~]$ nova host-evacuate-live compute-0.redhat.local                                                                                   
                                                                                                                                                                    
(overcloud) [stack@undercloud-0 ~]$ openstack server list --long                                                                                                    
+--------------------------------------+------+--------+------------+-------------+---------------------+------------+--------------------------------------+-------------+--------------------------------------+-------------------+------------------------+------------+
| ID                                   | Name | Status | Task State | Power State | Networks            | Image Name | Image ID                             | Flavor Name | Flavor ID                            | Availability Zone | Host                   | Properties |
+--------------------------------------+------+--------+------------+-------------+---------------------+------------+--------------------------------------+-------------+--------------------------------------+-------------------+------------------------+------------+
| 93f012fe-bf61-49b0-af2e-25b56cfd6083 | test | ACTIVE | None       | Running     | private=192.168.0.3 | cirros     | 823525a2-dd89-4d5a-9e70-183132bf2d6c | m1.small    | aed75444-94e3-445d-a30f-6881f8dd72dd | nova              | compute-1.redhat.local |            |
+--------------------------------------+------+--------+------------+-------------+---------------------+------------+--------------------------------------+-------------+--------------------------------------+-------------------+------------------------+------------+

2) make it fail with different pwd in auth.conf and passwd.db on src and dst compute                                                                                                   
                                                                                                                                                                    
()[root@compute-0 libvirt]# echo "abcdefgh" | saslpasswd2 -p -a libvirt -u overcloud migration                                                                      
                                                                                                                   
(overcloud) [stack@undercloud-0 ~]$ openstack server list --long                                                                                                    
+--------------------------------------+------+-----------+------------+-------------+---------------------+------------+--------------------------------------+-------------+--------------------------------------+-------------------+------------------------+------------+
| ID                                   | Name | Status    | Task State | Power State | Networks            | Image Name | Image ID                             | Flavor Name | Flavor ID                            | Availability Zone | Host                   | Properties |
+--------------------------------------+------+-----------+------------+-------------+---------------------+------------+--------------------------------------+-------------+--------------------------------------+-------------------+------------------------+------------+
| 93f012fe-bf61-49b0-af2e-25b56cfd6083 | test | MIGRATING | migrating  | Running     | private=192.168.0.3 | cirros     | 823525a2-dd89-4d5a-9e70-183132bf2d6c | m1.small    | aed75444-94e3-445d-a30f-6881f8dd72dd | nova              | compute-1.redhat.local |            |
+--------------------------------------+------+-----------+------------+-------------+---------------------+------------+--------------------------------------+-------------+--------------------------------------+-------------------+------------------------+------------+
(overcloud) [stack@undercloud-0 ~]$ openstack server list --long                                                                                                    
+--------------------------------------+------+--------+------------+-------------+---------------------+------------+--------------------------------------+-------------+--------------------------------------+-------------------+------------------------+------------+
| ID                                   | Name | Status | Task State | Power State | Networks            | Image Name | Image ID                             | Flavor Name | Flavor ID                            | Availability Zone | Host                   | Properties |
+--------------------------------------+------+--------+------------+-------------+---------------------+------------+--------------------------------------+-------------+--------------------------------------+-------------------+------------------------+------------+
| 93f012fe-bf61-49b0-af2e-25b56cfd6083 | test | ACTIVE | None       | Running     | private=192.168.0.3 | cirros     | 823525a2-dd89-4d5a-9e70-183132bf2d6c | m1.small    | aed75444-94e3-445d-a30f-6881f8dd72dd | nova              | compute-1.redhat.local |            |
+--------------------------------------+------+--------+------------+-------------+---------------------+------------+--------------------------------------+-------------+--------------------------------------+-------------------+------------------------+------------+
 
2018-08-10 06:26:33.664 1 ERROR nova.virt.libvirt.driver [req-30ee5623-83b2-40e5-81c3-0dd76bf15a38 62d292c6128940eb8d78887f612845bc 291cc1a1c7724531b2cb894047e05fac - default default] [instance: 93f012fe-bf61-49b0-af2e-25b56cfd6083] Live Migration failure: operation failed: Failed 
to connect to remote libvirt URI qemu+tls://compute-0.internalapi.redhat.local/system: authentication failed: authentication failed: libvirtError: operation failed: Failed to connect to remote libvirt URI qemu+tls://compute-0.internalapi.redhat.local/system: authentication failed: 
authentication failed
2018-08-10 06:26:33.697 1 ERROR nova.virt.libvirt.driver [req-30ee5623-83b2-40e5-81c3-0dd76bf15a38 62d292c6128940eb8d78887f612845bc 291cc1a1c7724531b2cb894047e05fac - default default] [instance: 93f012fe-bf61-49b0-af2e-25b56cfd6083] Migration operation has aborted

Comment 9 Ollie Walsh 2018-08-10 13:59:54 UTC
I'm a bit concerned by this approach. Won't we just hit the same problem again in future unless, somehow, we can cover this in CI?

Ultimately I think the root cause of this issue is that the existing state is not available when running puppet apply, resulting in non-existing files being recreated. I think https://review.openstack.org/562876 is the simple, and most correct, solution.

Comment 10 Michele Baldessari 2018-08-10 15:53:35 UTC
(In reply to Ollie Walsh from comment #9)
> I'm a bit concerned by this approach. Won't we just hit the same problem
> again in future unless, somehow, we can cover this in CI?

I share your concern in general. I dislike the idea of somehow maintaining an exclude list (especially in docker-puppet.py). And yes, we should somehow cover this in CI. I could not come up with a better-contained fix so far.

> Ultimately I think the root cause of this issue is that the existing state
> is not available when running puppet apply, resulting non-existing files
> being recreated. I think https://review.openstack.org/562876 is the simple,
> and most correct solution.

That is one part of the story, but not all of it? Unless I am misunderstanding things, even if you mapped /etc/libvirt/passwd.db inside the docker-puppet container before running it, you'd still get an md5 mismatch because this Berkeley DB file contains a timestamp which gets changed as soon as you re-enforce state on that file.

It also would not fix the swift stuff, as those files get pulled in via the undercloud and get updated, changing the md5 anyway.

This would fix the cron/parsedfile thing though, which is currently rather ugly to deal with via tar. I kind of like your approach in https://review.openstack.org/562876 for that. I also understand Jiri's concern in that review.
I don't know... https://review.openstack.org/#/c/590008/ now works in my tests but makes anyone's eyes bleed tbh, so I am more than happy to hear better ideas.

If we could restore your https://review.openstack.org/562876 approach, we'd at least remove the sed hack in my review. We'd still have some excludes here and there to sort out, though.

Comment 11 Martin Schuppert 2018-08-10 16:10:20 UTC
(In reply to Michele Baldessari from comment #10)
> (In reply to Ollie Walsh from comment #9)
> > Ultimately I think the root cause of this issue is that the existing state
> > is not available when running puppet apply, resulting non-existing files
> > being recreated. I think https://review.openstack.org/562876 is the simple,
> > and most correct solution.
> 
> That is one part of the story, but not all of it? Unless I am
> misunderstanding things, even if you mapped /etc/libvirt/passwd.db inside
> the docker-puppet container before running it, you'd still get and md5
> mismatch because this Berkeley DB file contains a timestamp which gets
> changed as soon as you reinforce state on that file.
> 

There doesn't seem to be a timestamp in the passwd.db:

[root@compute-0 libvirt]# echo "12345678" | saslpasswd2 -p -a libvirt -u overcloud migration

[root@compute-0 libvirt]# md5sum passwd.db 
1f584b05074002da163a31997c1498b9  passwd.db

[root@compute-0 libvirt]# echo "12345678" | saslpasswd2 -p -a libvirt -u overcloud migration

[root@compute-0 libvirt]# md5sum passwd.db 
1f584b05074002da163a31997c1498b9  passwd.db

[root@compute-0 libvirt]# db_dump -p ./passwd.db                                                                                                                                                                                                                                          
VERSION=3
format=print
type=hash
db_pagesize=4096
HEADER=END
 migration\00overcloud\00userPassword
 12345678
DATA=END

As long as we have the same db file it seems we are ok.

Comment 12 Michele Baldessari 2018-08-10 16:16:44 UTC
(In reply to Martin Schuppert from comment #11)
> (In reply to Michele Baldessari from comment #10)
> > (In reply to Ollie Walsh from comment #9)
> > > Ultimately I think the root cause of this issue is that the existing state
> > > is not available when running puppet apply, resulting non-existing files
> > > being recreated. I think https://review.openstack.org/562876 is the simple,
> > > and most correct solution.
> > 
> > That is one part of the story, but not all of it? Unless I am
> > misunderstanding things, even if you mapped /etc/libvirt/passwd.db inside
> > the docker-puppet container before running it, you'd still get and md5
> > mismatch because this Berkeley DB file contains a timestamp which gets
> > changed as soon as you reinforce state on that file.
> > 
> 
> I doesn't seem to be a timestamp in the passwd.db:
> 
> [root@compute-0 libvirt]# echo "12345678" | saslpasswd2 -p -a libvirt -u
> overcloud migration
> 
> [root@compute-0 libvirt]# md5sum passwd.db 
> 1f584b05074002da163a31997c1498b9  passwd.db
> 
> [root@compute-0 libvirt]# echo "12345678" | saslpasswd2 -p -a libvirt -u
> overcloud migration
> 
> [root@compute-0 libvirt]# md5sum passwd.db 
> 1f584b05074002da163a31997c1498b9  passwd.db
> 
> [root@compute-0 libvirt]# db_dump -p ./passwd.db                            
> 
> VERSION=3
> format=print
> type=hash
> db_pagesize=4096
> HEADER=END
>  migration\00overcloud\00userPassword
>  12345678
> DATA=END
> 
> As long as we have the same db file it seems we are ok.

Ack, then that would be covered, indeed.

Comment 13 Jiri Stransky 2018-08-13 09:51:09 UTC
I wonder if we could make an exclude list per service? Just as we pass puppet_tags [1] when we want Puppet to run more than the default tags, we could also pass exclude_files to exclude a list of files from the puppet-generated directory.

The issue with timestamped header in cron files is a bit less convenient though. I think it's a bug in Puppet (it should be idempotent IMO), and it looks like it's not possible to switch the header off easily. That one we might need to deal with in docker-puppet.py directly as Michele proposed. Alternatively we could try to subclass the crontab provider and override the header [2]. For context, IIUC this particular issue also would not be solved by the data carry-over patch [3], so we'd need to hack around it either way.

I still think we should try to avoid doing [3] if at all possible. Once we start carrying config state this way, we will no longer be able to guarantee that e.g. a fresh deployment and an upgraded deployment have the same config files. We'd open ourselves up to a new class of bugs and lose one of the benefits of containerization.

According to my current understanding of the issue, per-service exclude lists and special handling for `# HEADER` bits would get my vote. (And perhaps we can convince Puppeteers to at least remove the timestamp from the headers in the long run, allowing us to drop the hack?)

[1] https://github.com/openstack/tripleo-heat-templates/blob/bf7a452de07d2d71ad97b25e8fc46ead42f0f634/docker/services/nova-libvirt.yaml#L217
[2] https://github.com/puppetlabs/puppet/blob/1fe4d9584e03f4366fb4be7ecafb4fc8fb833dfd/lib/puppet/provider/cron/crontab.rb#L126
[3] https://review.openstack.org/562876

Comment 14 Ollie Walsh 2018-08-13 10:15:12 UTC
Looks like the puppet parsedfile provider has the same issue. I had expected it to leave the header unchanged if there are no updates but it looks like it writes it out every time (just glancing at the source - https://github.com/puppetlabs/puppet/blob/master/lib/puppet/provider/parsedfile.rb#L380).

Comment 15 Michele Baldessari 2018-08-16 16:05:48 UTC
(In reply to Jiri Stransky from comment #13)
> I wonder if we could make an exclude list per-service? Like we pass
> puppet_tags [1] when we want Puppet to run more than default tags, we could
> also pass exclude_files to exclude a list of files from puppet-generated
> directory.

Yeah, that is probably one of the best options at this stage. Do we all agree here that this is the way forward? Or do we want the quick exclude in first, since it is likely more self-contained, while we work in parallel on a more generic exclude list? Maybe the former for OSP13 and the latter for OSP14?
 
> The issue with timestamped header in cron files is a bit less convenient
> though. I think it's a bug in Puppet (it should be idempotent IMO), and it
> looks like it's not possible to switch the header off easily. That one we
> might need to deal with in docker-puppet.py directly as Michele proposed.
> Alternatively we could try to subclass the crontab provider and override the
> header [2]. For context, IIUC this particular issue also would not be solved
> by the data carry-over patch [3], so we'd need to hack around it either way.

Probably not about [3] (see below).
I kinda like option [2] (I had not thought about it). So these subclasses of the crontab/parsedfile providers would live in puppet-tripleo, correct? I guess we should first try our luck upstream.

> I still think we should try to avoid doing [3] it if at all possible. Once
> we start carrying config state this way, we will no longer be able to
> guarantee that e.g. fresh deployment and upgraded deployment have the same
> config files. We'd open ourselves up to a new class of bugs and lose one of
> the benefits of containerization.

Yeah I think your concern here is very valid and convincing
 
> According to my current understanding of the issue, per-service exclude
> lists and special handling for `# HEADER` bits would get my vote. (And
> perhaps we can convince Puppeteers to at remove the timestamp from headers
> in the long run, allowing us to drop the hack?)

So here is the thing I tried, and it kind of surprised me: the HEADER with the timestamp does not re-trigger the cron resource update every time:
[root@rhel8a ~]# cat > cron-test.pp

cron { 'helloworld':
   command => "/puppet/pls.sh",
   user    => root,
   hour    => '*',
   minute  => '*/5',
   require => File['/puppet/pls.sh']
}
[root@rhel8a ~]# vim cron-test.pp
[root@rhel8a ~]# puppet apply^C
[root@rhel8a ~]# cat cron-test.pp

cron { 'helloworld':
   command => "echo foo",
   user    => root,
   hour    => '*',
   minute  => '*/5',
}
[root@rhel8a ~]# puppet apply --detailed-exitcodes cron-test.pp
Notice: Compiled catalog for rhel8a.internal in environment production in 0.08 seconds
Notice: /Stage[main]/Main/Cron[helloworld]/ensure: created
Notice: Applied catalog in 0.06 seconds
[root@rhel8a ~]# cat /var/spool/cron/root
# HEADER: This file was autogenerated at 2018-08-16 11:49:45 -0400 by puppet.
# HEADER: While it can still be managed manually, it is definitely not recommended.
# HEADER: Note particularly that the comments starting with 'Puppet Name' should
# HEADER: not be deleted, as doing so could cause duplicate cron jobs.
# Puppet Name: helloworld
*/5 * * * * echo foo
[root@rhel8a ~]# puppet apply --detailed-exitcodes cron-test.pp
Notice: Compiled catalog for rhel8a.internal in environment production in 0.08 seconds
Notice: Applied catalog in 0.05 seconds

So the way I see it, puppet is idempotent here (and although I think we can all agree that the timestamp is a bit lame, the case for getting a patch upstream might be a bit weaker, since idempotency seems to work just fine; not sure how it happens, as I did not investigate too much).
I.e. [3] would indeed solve this particular issue, I think.

> 
> [1]
> https://github.com/openstack/tripleo-heat-templates/blob/
> bf7a452de07d2d71ad97b25e8fc46ead42f0f634/docker/services/nova-libvirt.
> yaml#L217
> [2]
> https://github.com/puppetlabs/puppet/blob/
> 1fe4d9584e03f4366fb4be7ecafb4fc8fb833dfd/lib/puppet/provider/cron/crontab.
> rb#L126
> [3] https://review.openstack.org/562876

Comment 16 Ollie Walsh 2018-08-16 16:42:55 UTC
How about the reverse? I.e. we have an include list of the files to preserve between puppet runs.

We could also catch any issues, I think: run puppet twice, once with no existing state preserved (except the include list) and again with all state preserved. If there are any differences, fail the deployment (in CI at least).
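
In CI terms something along these lines could work (a sketch only; run_config_step is a made-up placeholder for whatever regenerates /var/lib/config-data/puppet-generated):

run_config_step                                   # pass 1: only the include list preserved
cp -a /var/lib/config-data/puppet-generated /tmp/pass1
run_config_step                                   # pass 2: all existing state preserved
diff -r /tmp/pass1 /var/lib/config-data/puppet-generated \
  || { echo "puppet config generation is not idempotent"; exit 1; }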

Comment 17 Martin André 2018-08-28 06:51:43 UTC
(In reply to Michele Baldessari from comment #15)
> (In reply to Jiri Stransky from comment #13)
> > I wonder if we could make an exclude list per-service? Like we pass
> > puppet_tags [1] when we want Puppet to run more than default tags, we could
> > also pass exclude_files to exclude a list of files from puppet-generated
> > directory.
> 
> Yeah that is probably one of the best options at this stage. Do we all agree
> here that this is the way forward? Or do we want the quick exclude in first
> since it is likely more self-contained and we also work in parallel for a
> more generic exclude list? Maybe the former for OSP13 and the latter for
> OSP14?

I like the proposed approach.
 
> > The issue with timestamped header in cron files is a bit less convenient
> > though. I think it's a bug in Puppet (it should be idempotent IMO), and it
> > looks like it's not possible to switch the header off easily. That one we
> > might need to deal with in docker-puppet.py directly as Michele proposed.
> > Alternatively we could try to subclass the crontab provider and override the
> > header [2]. For context, IIUC this particular issue also would not be solved
> > by the data carry-over patch [3], so we'd need to hack around it either way.
> 
> Probably not about [3] (see below).
> I kinda like option [2] (I had not thought about it). So these subclasses of
> crontab/parsefiles would live in puppet-tripleo, correct? I guess we should
> first try our luck upstream.

Indeed, monkey patching the modules that inject a header with timestamps would be one option, but honestly I do not find it much cleaner than the sed you've implemented in docker-puppet.py. We can legitimately (?) assume the injected header format is consistent across puppet modules and catch them all with the sed, while monkey patching would force us to work on a case-by-case basis.

Comment 18 Jiri Stransky 2018-08-28 09:15:43 UTC
(sorry for the delay, back from PTO now)

+1 for sed for the header removal. The implementation is already proposed, and it catches everything rather than having to patch each affected Puppet type. Michele's implementation only affects the checksumming; we don't edit the files which then end up mounted into the containers, so it's pretty safe I think.


Regarding a list to keep vs. a list to exclude, it's a tough call I think. A list to keep will probably be nicer in the end, but it will require more work to get done; if we go that way we should probably involve all DFGs so that we can cover all services. I think this could be doable (and quite nice? :) ) if:

* When the parameter ("keep_files"?) isn't specified, we keep all files. That way we allow gradual migration from current approach to the new one.

* We probably should allow regexes as the keep_files entries, rather than simple strings, to make sure the keep_files parameter has sufficient power to cover all use cases.

I like it but it requires more effort :)
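
On the checksumming side this could look very roughly like the following (just a sketch with made-up names; keep_files/KEEP_FILES_REGEX are hypothetical and nothing like this is implemented):

NAME=keystone
KEEP_FILES_REGEX='.*\.conf$|.*/cron/.*'   # hypothetical per-service keep list
find /var/lib/config-data/puppet-generated/${NAME} -type f \
  | grep -E "${KEEP_FILES_REGEX}" | sort | xargs -r md5sum | md5sum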

Comment 19 Michele Baldessari 2018-08-30 07:15:31 UTC
Thanks for all the feedback. I'll do one more round of testing of the WIP review and then propose it so that we can bring it to OSP13. I'll probably need some help to do the cleaner approach with a whitelist/blacklist for OSP14/15.

Comment 22 Gurenko Alex 2018-10-21 12:18:05 UTC
Verified on puddle 2018-10-02.1

No containers restart on stack update. I used the same script to verify:

[stack@undercloud-0 ~]$ diff after before
[stack@undercloud-0 ~]$

Comment 24 Jiri Stransky 2018-10-29 15:35:07 UTC
*** Bug 1642178 has been marked as a duplicate of this bug. ***

Comment 28 errata-xmlrpc 2018-11-13 22:27:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3587

