Description of problem:
In some cases rolling_update fails, e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1366808. In such cases we'd like to be able to recommend a corrective action and re-run rolling_update. Right now the task that restarts OSDs runs regardless of whether the service version changed. This causes unnecessary load on the cluster AND increases the probability that timing-dependent issues, like the dmcrypt race above, prevent the cluster from reaching active+clean.

Version-Release number of selected component (if applicable):

How reproducible: 100%

Steps to Reproduce:
1. Create a cluster using ceph version A
2. Upgrade one OSD host to ceph version B
3. Run rolling_update

Actual results: all OSDs were restarted

Expected results: only services whose version has changed should be restarted
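The expected behaviour amounts to a version comparison before restarting. A minimal sketch in Python (the function names and parsing here are hypothetical illustrations, not ceph-ansible's actual implementation):

```python
# Hypothetical sketch of the requested behaviour: restart a daemon only
# when the newly installed package version differs from the version the
# daemon is currently running. Parsing assumes `ceph -v` output such as:
#   "ceph version 10.2.7-20redhat1xenial (8b2e41c0...)"

def parse_ceph_version(ceph_v_output: str) -> str:
    """Extract the version string from `ceph -v` style output."""
    return ceph_v_output.split()[2]

def needs_restart(running: str, installed: str) -> bool:
    """True only when the installed version differs from the running one."""
    return parse_ceph_version(running) != parse_ceph_version(installed)

# Example: a host that was not upgraded should not have its OSDs restarted.
old = "ceph version 10.2.5-28redhat1xenial (033f137cde8573cfc5a4662b4ed6a63b8a8d1464)"
new = "ceph version 10.2.7-20redhat1xenial (8b2e41c074ec6b5053c9838b5e21239ba5d63443)"

print(needs_restart(old, new))  # upgraded host: True
print(needs_restart(old, old))  # unchanged host: False
```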
Upstream seems to have another approach now: it first stops services, applies the roles, and then makes sure the process is started. This makes sense, since the package upgrade plus the role will start the daemon already. There are no more restart calls.
Sebastien, is this bug fixed as of v2.1.9?
Yes Ken.
Hi Ken, Seb, has any change gone into this fix? From comment #2 there is a new upstream approach, so could you let us know what the expected behaviour is now? Thanks
What do you mean? There is nothing new to this BZ; IMHO this was fixed a couple of months ago.
Hi,
As per my understanding of the new upstream approach, ONLY services that undergo a version change will be stopped and started later, and other OSD services are untouched. Is this the actual case? Is my understanding correct? I got confused about whether this particular fix followed the above-mentioned upstream approach or just skipped restarting services which didn't get upgraded.

Regards,
Vasishta
Your understanding is correct, only selected services will be updated and thus restarted.
Created attachment 1279679 [details]
Terminal log of rolling update

Hi all,
I could observe that OSD services were stopped and started even though there was no version change. The attached file contains the full terminal log. (All nodes are Ubuntu, and only the repos on node 106 were not replaced.)

TASK [stop ceph osds with systemd] *********************************************
changed: [magna106] => (item=2)
changed: [magna106] => (item=5)

TASK [start ceph osds with systemd] ********************************************
ok: [magna106] => (item=2)
ok: [magna106] => (item=5)

$ for i in {068,071,106};do ssh magna$i 'ceph -v';done
ceph version 10.2.7-20redhat1xenial (8b2e41c074ec6b5053c9838b5e21239ba5d63443)
ceph version 10.2.7-20redhat1xenial (8b2e41c074ec6b5053c9838b5e21239ba5d63443)
ceph version 10.2.5-28redhat1xenial (033f137cde8573cfc5a4662b4ed6a63b8a8d1464)

Changing the status back to ASSIGNED, as the service was stopped and started regardless of version change. Please let me know if there are any concerns.

Regards,
Vasishta
Hi,
Again tried a dry run of rolling update and observed that services were stopped and started even though there was no version change. Moving back to ASSIGNED state as mentioned in Comment 12.

$ ps aux |grep ceph
ceph     31780  0.2  0.1 891124 41224 ?   Ssl  10:55  0:15 /usr/bin/ceph-osd -f --cluster temp --id 2 --setuser ceph --setgroup ceph
ceph     31920  0.2  0.1 890048 41488 ?   Ssl  10:55  0:15 /usr/bin/ceph-osd -f --cluster temp --id 5 --setuser ceph --setgroup ceph
ubuntu   32486  0.0  0.0  16572  2148 pts/1 S+  12:29  0:00 grep --color=auto ceph

ubuntu@magna106:~$ ps aux |grep ceph
ubuntu   32832  0.0  0.0  16572  2208 pts/1 S+  12:31  0:00 grep --color=auto ceph

$ ps aux| grep ceph
ceph     34537  0.5  0.1 883068 36892 ?   Ssl  12:31  0:00 /usr/bin/ceph-osd -f --cluster temp --id 2 --setuser ceph --setgroup ceph
ceph     34675  0.5  0.1 884052 36908 ?   Ssl  12:31  0:00 /usr/bin/ceph-osd -f --cluster temp --id 5 --setuser ceph --setgroup ceph
ubuntu   34965  0.0  0.0  16572  2196 pts/1 S+  12:33  0:00 grep --color=auto ceph

Regards,
Vasishta
I'd say that if you run rolling_update, you're expecting to get a new version; even if you don't, it will just assume there is a new version available. I'd like to close this as "won't fix" since the behavior is expected. Do not run the playbook if there is nothing to update. Does that sound reasonable?
Discussed at program meeting; does not meet blocker criteria, moving to next release.