This environment was deployed as OSP 10.

i) We ran a controller replacement procedure (SFDC 02199264): https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html/director_installation_and_usage/sect-scaling_the_overcloud#sect-Replacing_Controller_Nodes
ii) We then fixed an issue in the roles .yaml file and redeployed (SFDC 02192627).

https://bugzilla.redhat.com/show_bug.cgi?id=1595315#c11
~~~
Alan Bishop 2018-06-28 08:34:46 EDT

Further analysis reveals this problem will only occur in an artificial environment in which the upgrade to OSP-10 occurs immediately after the initial deployment of OSP-9. If the OSP-9 deployment runs for a day, cinder's db_purge cron job will run, and that logs to cinder-manage.log. This has the side effect of creating the cinder-manage.log file with the correct 'cinder' ownership prior to the OSP-10 upgrade. So, when upgrading to OSP-10, the log file will already have been created with the proper ownership that will allow the OSP-10 db_sync operation to succeed.

With this in mind, I'm lowering the BZ priority and severity. To fully address the problem, puppet-tripleo needs to be fixed to stop running db_sync on non-bootstrap nodes. I've filed a bug upstream, and will soon submit a patch to fix it.
~~~

Wouldn't the correct solution be to make sure that /var/log/cinder/cinder-manage.log is owned by cinder:cinder? E.g., add an instruction to the puppet manifest that sets the ownership of files in /var/log/cinder to cinder:cinder when this condition is not met.

Thanks,
Andreas
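The manifest change suggested above could look roughly like the following shell equivalent. This is only a hedged sketch: the real fix would live in puppet, run as root, and target /var/log/cinder with owner cinder:cinder; this demo uses a temp directory and the current user so it runs unprivileged.

~~~shell
# Sketch of the suggested fix: normalize ownership of everything under
# the cinder log directory before cinder-manage runs.
# Assumptions: real target is /var/log/cinder, real owner is cinder:cinder.
set -eu

logdir="$(mktemp -d)"                     # stand-in for /var/log/cinder
touch "$logdir/cinder-manage.log" "$logdir/api.log"

want_owner="$(id -un)"                    # real fix: want_owner=cinder
for f in "$logdir"/*; do
  if [ "$(stat -c '%U' "$f")" != "$want_owner" ]; then
    chown "$want_owner" "$f"              # real fix: chown cinder:cinder "$f"
  fi
done

stat -c '%U' "$logdir/cinder-manage.log"
~~~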
Actually, we ran cinder-manage manually. Here's how to reproduce this:

~~~
[heat-admin@overcloud-controller-0 ~]$ sudo -i
[root@overcloud-controller-0 ~]# ls /var/log/cinder/
api.log               scheduler.log              volume.log              volume.log-20180922.gz
api.log-20180908.gz   scheduler.log-20180911.gz  volume.log-20180916.gz
cinder-manage.log     scheduler.log-20180916.gz  volume.log-20180918.gz
cinder-rowsflush.log  scheduler.log-20180921.gz  volume.log-20180920.gz
[root@overcloud-controller-0 ~]# ls /var/log/cinder/ -alh
total 8.0M
drwxr-x---.  2 cinder root   4.0K Sep 22 03:35 .
drwxr-xr-x. 41 root   root   4.0K Oct 14 03:35 ..
-rw-r--r--.  1 cinder cinder 244K Oct 16 18:44 api.log
-rw-r--r--.  1 cinder cinder 841K Sep  7 22:02 api.log-20180908.gz
-rw-r--r--.  1 cinder cinder 417K Oct 17 00:01 cinder-manage.log
-rw-r--r--.  1 cinder cinder 416K Oct 17 00:01 cinder-rowsflush.log
-rw-r--r--.  1 cinder cinder 1.7M Oct 16 18:44 scheduler.log
-rw-r--r--.  1 cinder cinder 749K Sep 11 03:17 scheduler.log-20180911.gz
-rw-r--r--.  1 cinder cinder 860K Sep 16 03:46 scheduler.log-20180916.gz
-rw-r--r--.  1 cinder cinder 854K Sep 21 03:29 scheduler.log-20180921.gz
-rw-r--r--.  1 cinder cinder  14K Oct 16 18:44 volume.log
-rw-r--r--.  1 cinder cinder 526K Sep 16 03:46 volume.log-20180916.gz
-rw-r--r--.  1 cinder cinder 524K Sep 18 03:45 volume.log-20180918.gz
-rw-r--r--.  1 cinder cinder 523K Sep 20 03:20 volume.log-20180920.gz
-rw-r--r--.  1 cinder cinder 446K Sep 21 20:46 volume.log-20180922.gz
[root@overcloud-controller-0 ~]# cinder-manage db sync
Option "logdir" from group "DEFAULT" is deprecated. Use option "log-dir" from group "DEFAULT".
[root@overcloud-controller-0 ~]# ls /var/log/cinder/ -alh
total 8.0M
drwxr-x---.  2 cinder root   4.0K Sep 22 03:35 .
drwxr-xr-x. 41 root   root   4.0K Oct 14 03:35 ..
-rw-r--r--.  1 cinder cinder 244K Oct 16 18:44 api.log
-rw-r--r--.  1 cinder cinder 841K Sep  7 22:02 api.log-20180908.gz
-rw-r--r--.  1 cinder cinder 417K Oct 17 00:01 cinder-manage.log
-rw-r--r--.  1 cinder cinder 416K Oct 17 00:01 cinder-rowsflush.log
-rw-r--r--.  1 cinder cinder 1.7M Oct 16 18:44 scheduler.log
-rw-r--r--.  1 cinder cinder 749K Sep 11 03:17 scheduler.log-20180911.gz
-rw-r--r--.  1 cinder cinder 860K Sep 16 03:46 scheduler.log-20180916.gz
-rw-r--r--.  1 cinder cinder 854K Sep 21 03:29 scheduler.log-20180921.gz
-rw-r--r--.  1 cinder cinder  14K Oct 16 18:44 volume.log
-rw-r--r--.  1 cinder cinder 526K Sep 16 03:46 volume.log-20180916.gz
-rw-r--r--.  1 cinder cinder 524K Sep 18 03:45 volume.log-20180918.gz
-rw-r--r--.  1 cinder cinder 523K Sep 20 03:20 volume.log-20180920.gz
-rw-r--r--.  1 cinder cinder 446K Sep 21 20:46 volume.log-20180922.gz
[root@overcloud-controller-0 ~]# rm -f /var/log/cinder/cinder-manage.log
[root@overcloud-controller-0 ~]# cinder-manage db sync
Option "logdir" from group "DEFAULT" is deprecated. Use option "log-dir" from group "DEFAULT".
[root@overcloud-controller-0 ~]# ls /var/log/cinder/ -alh
total 7.6M
drwxr-x---.  2 cinder root   4.0K Oct 17 22:36 .
drwxr-xr-x. 41 root   root   4.0K Oct 14 03:35 ..
-rw-r--r--.  1 cinder cinder 244K Oct 16 18:44 api.log
-rw-r--r--.  1 cinder cinder 841K Sep  7 22:02 api.log-20180908.gz
-rw-r--r--.  1 root   root      0 Oct 17 22:36 cinder-manage.log
-rw-r--r--.  1 cinder cinder 416K Oct 17 00:01 cinder-rowsflush.log
-rw-r--r--.  1 cinder cinder 1.7M Oct 16 18:44 scheduler.log
-rw-r--r--.  1 cinder cinder 749K Sep 11 03:17 scheduler.log-20180911.gz
-rw-r--r--.  1 cinder cinder 860K Sep 16 03:46 scheduler.log-20180916.gz
-rw-r--r--.  1 cinder cinder 854K Sep 21 03:29 scheduler.log-20180921.gz
-rw-r--r--.  1 cinder cinder  14K Oct 16 18:44 volume.log
-rw-r--r--.  1 cinder cinder 526K Sep 16 03:46 volume.log-20180916.gz
-rw-r--r--.  1 cinder cinder 524K Sep 18 03:45 volume.log-20180918.gz
-rw-r--r--.  1 cinder cinder 523K Sep 20 03:20 volume.log-20180920.gz
-rw-r--r--.  1 cinder cinder 446K Sep 21 20:46 volume.log-20180922.gz
[root@overcloud-controller-0 ~]#
~~~
Removing triaged keyword until I understand what's happening.
I finally tracked down where (and how) cinder-manage is invoked in a way that causes the cinder-manage.log file to be owned by root, which in turn causes puppet's db sync to fail.

There are three places, [1], [2] and [3], where cinder-manage is run as root. The command always opens its cinder-manage.log file, and if the file doesn't already exist then it is created with root as the owner. I believe [1] is the source of the zero-size cinder-manage.log files.

[1] https://github.com/openstack/tripleo-heat-templates/blob/newton-eol/extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh#L69
[2] https://github.com/openstack/tripleo-heat-templates/blob/newton-eol/extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh#L72
[3] https://github.com/openstack/tripleo-heat-templates/blob/newton-eol/extraconfig/tasks/major_upgrade_controller_pacemaker_3.sh#L59

One thing that would help is invoking these with "sudo -u cinder", but there's a deeper problem. All of the above instances are supposed to run only on the bootstrap node, and the general pattern uses this logic:

    if [[ -n $(is_bootstrap_node) ]]; then...

The "-n" tests whether the following string has a non-zero length. Unfortunately, the is_bootstrap_node function emits a debug message [4], and so the "-n" test will always be true. In other words, the code always thinks it's running on the bootstrap node.

[4] https://github.com/openstack/tripleo-heat-templates/blob/newton-eol/extraconfig/tasks/pacemaker_common_functions.sh#L37

I'd like DFG:Upgrades to take a look at this. Although the pacemaker upgrade scripts are not used in later releases, the is_bootstrap_node issue identified here is present in upstream master.
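The general shell pitfall behind this concern can be illustrated in isolation: if a function prints *anything* to stdout, `-n "$(fn)"` is true regardless of what the function decided. The names below are hypothetical stand-ins, not the actual tripleo code.

~~~shell
# Sketch of the suspected pitfall: a logger that writes to stdout makes
# the command substitution non-empty, so the -n test passes everywhere.
log_debug() { echo "DEBUG: $*"; }       # hypothetical logger on stdout

is_bootstrap_node_buggy() {
  log_debug "checking bootstrap node"   # unconditional debug output
  if [ "node-a" = "node-b" ]; then      # stand-in for the hiera/facter check
    echo "true"
  fi
}

if [ -n "$(is_bootstrap_node_buggy)" ]; then
  result="ran-bootstrap-steps"          # reached even on non-bootstrap nodes
else
  result="skipped"
fi
echo "$result"
~~~

(As discussed later in this thread, the real function only logs inside the if branch, so this failure mode does not actually occur there.)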
So in this particular case, my problem was different. I was troubleshooting, and cinder-manage.log had not yet been created for some reason. I think that *my troubleshooting* created that file when I ran `cinder-manage db sync` as root. Perhaps it would be useful for puppet to make sure that the owner/group/permissions on the log files are correct before calling cinder-manage.
Hi Andreas, I'm wondering why we don't close this bug: it seems the command was run as the root user, and you already have the correct package (one version newer than the one provided in https://bugzilla.redhat.com/show_bug.cgi?id=1595315). Is there still something to be done here?
Because I think that Director needs to be smart enough to deal with somebody running `cinder-manage db sync` as root. Feel free to close it if you don't agree, but my take is that given the sequence:

*) log file doesn't exist
*) admin logs in, runs `cinder-manage db sync`
*) Director is executed

this should still work and not fail.
I don't think there's much that any service (or the Director) can do in the face of an admin running commands as root. My concern is when the Director itself does that, which it appears to be doing at [1].

[1] https://github.com/openstack/tripleo-heat-templates/blob/newton-eol/extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh#L69

If the cinder-manage.log file is created there for the first time, then it will be owned by root, which is a problem. Normally, the log file will have already been created by the cinder user long before an upgrade takes place. But this should happen *only* on the bootstrap node. That leads to my biggest concern: the debug_log at [2] will make is_bootstrap_node behave as if it's true on non-bootstrap nodes.

[2] https://github.com/openstack/tripleo-heat-templates/blob/newton-eol/extraconfig/tasks/pacemaker_common_functions.sh#L37
Director can simply chown the log file, can't it?

I had exactly the opposite discussion on another ticket, where Director (or rather puppet) unconditionally restarted iptables.service when it was stopped (by an admin), leading to a complete flush of iptables rules and an outage. When I said that I didn't agree with that behavior of the product (I still don't think that Director should simply restart iptables.service when it's down), the argument as I understood it was that Director expects a state and brings the system into that state, no matter what. Following that logic, Director should in this case chown the log file to the correct user if it's owned by root.

In the end, I'm indifferent to this. It's highly unlikely that customers run into this issue, and if somebody hits it, this bug exists to document it, so I'm good with whatever you decide to do :-)
(In reply to Andreas Karis from comment #12)
> Director can simply chown the log file, can't it?

Sure it could, but this is one of thousands of potential conditions that might arise when something runs as root when it really shouldn't. Yes, a chown could address this particular instance, but where does it all end? I'm trying to get to the root cause here, which I think is a combination of these factors:

1) The Director's upgrade scripts need to take care when running things as root
2) The is_bootstrap_node function may be allowing things to run on non-bootstrap nodes
Hi, lots of things here.

>> Director can simply chown the log file, can't it?
>
> Sure it could, but this is one of thousands of potential conditions that might arise when
> something runs as root when it really shouldn't. Yes, a chown could address this particular
> instance, but where does it all end?

I tend to agree with Alan here. Furthermore, once https://review.openstack.org/#/c/579234/1/manifests/profile/base/cinder/api.pp is available in OSP 10 (z9, with version >= puppet-tripleo-5.6.8-13.el7ost), it will reduce the likelihood of this happening, as the db sync will run only on the bootstrap node. Furthermore, we can fix this cheaply during the upgrade; see below.

Now for points 1 and 2 in Alan's recap of the situation.

> 1) The Director's upgrade scripts need to take care when running things as root

Agree; at the very least we have to make sure that we're not the ones doing this, and it's a cheap change. Let's harden the situation there. We can start with a patch in OSP 10 and then make sure it's OK upstream in newer versions. We could as well add a lit

>> This leads to my biggest concern, and that is the debug_log at [2] will the
>> is_bootstrap_node behave as if it's True on non-bootstrap nodes.
>
> 2) The is_bootstrap_node function may be allowing things to run on non-bootstrap nodes

Well, this I don't get. is_bootstrap_node is always checked for non-empty output:

    -n $(is_bootstrap_node)

The function won't emit anything on a non-bootstrap node, log_debug or not, as the call to log_debug is within the "if/fi":

~~~
function is_bootstrap_node {
    if [ "$(hiera -c /etc/puppet/hiera.yaml bootstrap_nodeid | tr '[:upper:]' '[:lower:]')" = "$(facter hostname | tr '[:upper:]' '[:lower:]')" ]; then
        # -> something on the stdout
        log_debug "Node is bootstrap"
        echo "true"
    fi
    # -> nothing on the stdout
}
~~~

So I'm not sure what I'm missing here?
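This point can be checked with a quick stand-in: hiera and facter are stubbed with plain variables below, and log_debug is an assumed logger, but the control flow mirrors the function quoted above. On a non-bootstrap node the function prints nothing, so `-n "$(is_bootstrap_node)"` is false.

~~~shell
# Stand-in for the pacemaker_common_functions.sh logic: log_debug is only
# reached inside the if, so a non-bootstrap node produces no stdout at all.
log_debug() { echo "DEBUG: $*"; }

BOOTSTRAP_NODEID="controller-0"   # stand-in for: hiera bootstrap_nodeid
THIS_HOSTNAME="controller-1"      # stand-in for: facter hostname

is_bootstrap_node() {
  if [ "$BOOTSTRAP_NODEID" = "$THIS_HOSTNAME" ]; then
    log_debug "Node is bootstrap"
    echo "true"
  fi
}

out="$(is_bootstrap_node)"
if [ -n "$out" ]; then decision="bootstrap"; else decision="not-bootstrap"; fi
echo "$decision"
~~~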
In the meantime I'll propose a patch to prevent the OSP 10 upgrade script from running cinder-manage as root, and to make sure the log file has the correct permissions.
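The actual patch isn't shown here; one plausible shape for the hardening (an assumption on my part, not the submitted change) is to route cinder-manage through a wrapper that runs it as the service user, so a freshly created cinder-manage.log is owned by cinder rather than root.

~~~shell
# Hypothetical hardening sketch. The real call would be:
#   sudo -u cinder cinder-manage db sync
# sudo needs root, so this demo only prints the command it would run.
run_as_service_user() {
  svc_user="$1"; shift
  if [ "$(id -un)" = "$svc_user" ]; then
    "$@"                                    # already the right user
  else
    echo "would run: sudo -u $svc_user $*"  # real script would exec via sudo
  fi
}

msg="$(run_as_service_user cinder cinder-manage db sync)"
echo "$msg"
~~~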
(In reply to Sofer Athlan-Guyot from comment #14)
> <snip>
> Well, this I don't get.
>
> is_bootstrap_node is always checked for non-empty output
>
>     -n $(is_bootstrap_node)
>
> the function is_bootstrap_node won't emit anything on non bootstrap node,
> log_debug or not, as the call to log_debug is within the "if/fi":
>
> function is_bootstrap_node {
>     if [ "$(hiera -c /etc/puppet/hiera.yaml bootstrap_nodeid | tr '[:upper:]' '[:lower:]')" = "$(facter hostname | tr '[:upper:]' '[:lower:]')" ]; then
>         # -> something on the stdout
>         log_debug "Node is bootstrap"
>         echo "true"
>     fi
>     # -> nothing on the stdout
> }
>
> So not sure what I'm missing here?

You're missing nothing, and I'm feeling foolish. You are absolutely correct: the debug message only appears when the node *is* a bootstrap node, and it will not appear on non-bootstrap nodes. Sorry for the confusion.