Description of problem:
When switching from graphite-web-0.X.X to graphite-web-1.X.X, tendrl-server needs to migrate the Graphite data. A complete migration requires some extra steps, which should be performed by the tendrl-upgrade script.

Version-Release number of selected component (if applicable):
tendrl-monitoring-integration-1.6.3-16.el7rhgs.noarch

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:
Grafana should display all monitoring data after updating to a new version of graphite-web.

Additional info:
PR is under review: https://github.com/Tendrl/monitoring-integration/pull/583
Providing QA ack. Note that the upgrade script should not break RHGSWA if run again after migration.
The migration process from graphite-web 0.9.15 to 1.1.4 (between RHGS 3.4.2 and 3.4.3) seems quite unclear.

The tendrl-upgrade script performs the following two commands:
# django-admin migrate --fake dashboard --settings=graphite.settings --run-syncdb
# django-admin migrate --fake-initial --settings=graphite.settings --run-syncdb

It is questionable whether this is the correct migration process, because the Graphite documentation mentions a slightly different command in its Upgrading section[1]. Unfortunately, that command does not seem to work correctly.

Also, when I compared a dump of /var/lib/graphite-web/graphite.db from a freshly installed cluster with a dump from a cluster upgraded from the previous version, there were some differences which suggest that the migration was not completed correctly. For example, the following two tables are completely missing from the upgraded database: "dashboard_template" and "dashboard_template_owners".

The description of the '--fake' argument is also quite worrying (from the command line help, or the documentation[2]):
  --fake  Mark migrations as run without actually running them.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
On the other hand, we did not find any obvious issue on the updated cluster: all Grafana dashboards seem to show correct data.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

So my question is whether we can accept/approve this approach without a deep understanding of the migration process, with the risk that something might be broken (and that we might even miss it during our testing).

Because of the ambiguity, I'm moving this Bug back to ASSIGNED. If the migration process is approved without any change, please switch it back to ON_QA.

>> ASSIGNED

[1] https://graphite.readthedocs.io/en/latest/releases/1_0_0.html#upgrading
[2] https://docs.djangoproject.com/en/2.1/ref/django-admin/#django-admin-migrate
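The comparison described above can be approximated with a short script. This is only a sketch, assuming both graphite.db files are SQLite databases that have been copied to one place; the function names are hypothetical and not part of tendrl or Graphite.

```python
import sqlite3


def table_names(db_path):
    """Return the set of table names defined in an SQLite database file."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
    finally:
        conn.close()
    return {name for (name,) in rows}


def missing_tables(fresh_db, upgraded_db):
    """List tables present in the fresh database but absent from the upgraded one."""
    return sorted(table_names(fresh_db) - table_names(upgraded_db))
```

Calling missing_tables() with the path of the database from the fresh installation and the path of the database from the upgraded cluster would list tables such as the two named above, if they are indeed absent after the upgrade. It compares table names only, not column definitions or data.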
Just a suggestion for consideration; I might be missing some important facts. Based on an examination of the content of graphite.db, there seems to be no really relevant data that needs to be preserved between the old and new versions. What about simply deleting the database file during the upgrade process and initializing it from scratch, the same way as during a fresh installation?
Actually, the fake command just marks the migration as done without actually migrating the database schema. If we use the same initialization command as on a fresh installation, it fails with an error that the table already exists. I have raised an upstream issue about this in the graphite-web repo. They closed the issue with a comment to the effect that migrating is not possible and that a new database should be created: https://github.com/graphite-project/graphite-web/issues/2389 I also tried several different approaches, but I still could not find a way to migrate to the new schema.
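The "table already exists" failure can be reproduced at the SQLite level, independently of Django. The following is a minimal sketch using an in-memory database; the table name and columns are illustrative assumptions, not the real Graphite schema.

```python
import sqlite3


def create_twice():
    """Create a table, then try to create it again, as a repeated initialization would."""
    conn = sqlite3.connect(":memory:")
    # Illustrative schema only; not copied from graphite-web.
    ddl = "CREATE TABLE dashboard_dashboard (name TEXT PRIMARY KEY, state TEXT)"
    try:
        conn.execute(ddl)  # first initialization succeeds
        try:
            conn.execute(ddl)  # second run hits the existing table
        except sqlite3.OperationalError as exc:
            return str(exc)
        return None
    finally:
        conn.close()
```

create_twice() returns the error text raised by SQLite for the second CREATE TABLE, which is why running a plain initialization against an existing graphite.db cannot succeed without either faking the migration or removing the file first.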
Is it OK to delete the Graphite DB and recreate it?
I've tried the scenario of deleting /var/lib/graphite-web/graphite.db, and there is one possible issue we have to take care of: if graphite.db is deleted while the httpd service is running, it might be recreated without correct initialization. tendrl-ansible then skips the initialization step, because the db file already exists. In other words, the httpd service has to be stopped while the graphite.db file is deleted and reinitialized.

We also have to consider that the other task of the tendrl-upgrade script (clearing Grafana dashboards) requires the httpd service to be running.

The following suggestion is not a really clean and nice solution, but the other approaches seem to bring more problems than this one. I think that the tendrl-upgrade script should do all the required steps:
1) stop httpd service
2) delete graphite.db
3) initialize the graphite.db
4) start httpd service

Or do you see any better option?
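The four steps proposed above could be sketched roughly as follows. This is not the code from the tendrl-upgrade script: the function name is hypothetical, and the initialization command is an assumption (the migrate command from this thread without the --fake options), so the real fresh-installation command may differ. Setting file ownership afterwards is left out.

```python
import os
import subprocess

DB_PATH = "/var/lib/graphite-web/graphite.db"  # path taken from this thread

# Assumption: stands in for whatever command the fresh installation uses.
INIT_CMD = ["django-admin", "migrate", "--settings=graphite.settings", "--run-syncdb"]


def reinitialize_graphite_db(db_path=DB_PATH, run=subprocess.check_call):
    """Recreate graphite.db while httpd is stopped (hypothetical helper)."""
    # 1) Stop httpd so it cannot recreate an uninitialized database file.
    run(["systemctl", "stop", "httpd"])
    # 2) Delete the old database; tolerate a missing file so a re-run is harmless.
    if os.path.exists(db_path):
        os.remove(db_path)
    # 3) Initialize a fresh database.
    run(INIT_CMD)
    # 4) Start httpd again. If any earlier step raised, httpd stays stopped.
    run(["systemctl", "start", "httpd"])
```

The run parameter exists only so that the sequence can be exercised with a stand-in instead of real system commands. Leaving httpd stopped on failure is a deliberate choice in this sketch: starting it against a missing or half-initialized file would reproduce the problem described above.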
PR: https://github.com/Tendrl/monitoring-integration/pull/590
Testing of this BZ should include the use case described in BZ 1665030, because httpd is now restarted in the update script.
I've tested the scenario of updating from RHGS WA 3.4.2 to RHGS WA 3.4.3.

Updating from:
  Red Hat Enterprise Linux Server release 7.6 (Maipo)
  carbon-selinux-1.5.4-2.el7rhgs.noarch
  grafana-4.6.4-1.el7rhgs.x86_64
  graphite-web-0.9.15-1.el7rhgs.noarch
  python-carbon-0.9.15-2.1.el7rhgs.noarch
  python-django-1.6.11-7.el7rhgs.noarch
  python-django-bash-completion-1.6.11-7.el7rhgs.noarch
  python-django-tagging-0.3.1-11.1.el7rhgs.noarch
  tendrl-ansible-1.6.3-10.el7rhgs.noarch
  tendrl-api-1.6.3-8.el7rhgs.noarch
  tendrl-api-httpd-1.6.3-8.el7rhgs.noarch
  tendrl-commons-1.6.3-13.el7rhgs.noarch
  tendrl-grafana-plugins-1.6.3-16.el7rhgs.noarch
  tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-monitoring-integration-1.6.3-16.el7rhgs.noarch
  tendrl-node-agent-1.6.3-11.el7rhgs.noarch
  tendrl-notifier-1.6.3-4.el7rhgs.noarch
  tendrl-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-ui-1.6.3-14.el7rhgs.noarch

Updating to:
  Red Hat Enterprise Linux Server release 7.6 (Maipo)
  carbon-selinux-1.5.4-3.el7rhgs.noarch
  grafana-4.6.4-1.el7rhgs.x86_64
  graphite-web-1.1.4-1.el7rhgs.noarch
  python2-django-1.11.15-3.el7rhgs.noarch
  python-carbon-1.1.4-1.el7rhgs.noarch
  python-django-bash-completion-1.11.15-3.el7rhgs.noarch
  python-django-tagging-0.4.6-1.el7rhgs.noarch
  tendrl-ansible-1.6.3-11.el7rhgs.noarch
  tendrl-api-1.6.3-8.el7rhgs.noarch
  tendrl-api-httpd-1.6.3-8.el7rhgs.noarch
  tendrl-commons-1.6.3-14.el7rhgs.noarch
  tendrl-grafana-plugins-1.6.3-20.el7rhgs.noarch
  tendrl-grafana-selinux-1.5.4-3.el7rhgs.noarch
  tendrl-monitoring-integration-1.6.3-20.el7rhgs.noarch
  tendrl-node-agent-1.6.3-13.el7rhgs.noarch
  tendrl-notifier-1.6.3-4.el7rhgs.noarch
  tendrl-selinux-1.5.4-3.el7rhgs.noarch
  tendrl-ui-1.6.3-14.el7rhgs.noarch

The tendrl-upgrade script correctly performs all the steps required for migration to the new Graphite (stop all related services, remove the old database, initialize a new database, set proper ownership, and start the previously stopped services). After the whole update process is finished, the Grafana dashboards show proper data.
For full verification of this bug, it is also necessary to validate the scenario from Bug 1665030, as Martin mentioned in the previous comment.
I have tested the scenario of updating from RHGS WA 3.4.1 to RHGS WA 3.4.3.

Updating from:
  carbon-selinux-1.5.4-2.el7rhgs.noarch
  grafana-4.3.2-3.el7rhgs.x86_64
  graphite-web-0.9.15-1.el7rhgs.noarch
  python-django-1.6.11-7.el7rhgs.noarch
  python-django-bash-completion-1.6.11-7.el7rhgs.noarch
  python-django-tagging-0.3.1-11.1.el7rhgs.noarch
  tendrl-ansible-1.6.3-8.el7rhgs.noarch
  tendrl-api-1.6.3-7.el7rhgs.noarch
  tendrl-api-httpd-1.6.3-7.el7rhgs.noarch
  tendrl-commons-1.6.3-13.el7rhgs.noarch
  tendrl-grafana-plugins-1.6.3-14.el7rhgs.noarch
  tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-monitoring-integration-1.6.3-14.el7rhgs.noarch
  tendrl-node-agent-1.6.3-10.el7rhgs.noarch
  tendrl-notifier-1.6.3-4.el7rhgs.noarch
  tendrl-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-ui-1.6.3-11.el7rhgs.noarch

The tendrl-upgrade script correctly performed all the steps required for migration to the new Graphite, and all dashboards are showing data. Links from Tendrl point to the correct Grafana dashboards.
Verifying based on comment 15 and comment 16. >> VERIFIED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2019:0265