Created attachment 1405395 [details]
File contains contents of OSD journald log

Description of problem:
Tried adding a new OSD node to a containerized cluster upgraded from 3.0 live to RC 3.0.z1. The OSDs are shut down continuously after receiving signal Interrupt from PID: 0.

Version-Release number of selected component (if applicable):
Container - rhceph:3-3

How reproducible:
Always (2/2)

Steps to Reproduce:
1. Upgrade a containerized cluster from 3.0 live to rhceph:3-3
2. Add a new OSD node using the limit option

Actual results:
OSDs (dmcrypt enabled + collocated journal) are restarted repeatedly:

ceph-osd-run.sh[22803]: 2018-03-07 14:30:12.947659 7ff421d11d00 -1 osd.9 2164 log_to_monitors {default=true}
ceph-osd-run.sh[22803]: 2018-03-07 14:35:07.200026 7ff3faa6a700 -1 Fail to open '/proc/0/cmdline' error = (2) No such file or directory
ceph-osd-run.sh[22803]: 2018-03-07 14:35:07.200044 7ff3faa6a700 -1 received signal: Interrupt from PID: 0 task name: <unknown> UID: 0
ceph-osd-run.sh[22803]: 2018-03-07 14:35:07.200048 7ff3faa6a700 -1 osd.9 2229 *** Got signal Interrupt ***
ceph-osd-run.sh[22803]: 2018-03-07 14:35:07.200062 7ff3faa6a700 -1 osd.9 2229 shutdown

Expected results:
OSD services must be up and running

Additional info:
When this issue was reproduced on another cluster, the services stabilized after two shutdown/restart cycles. The limit option was used when the playbook was initiated to add the new OSD node. The playbook eventually failed while trying to restart the OSDs, reporting "PGs were not reported as active+clean", "It is possible that the cluster has less OSDs than the replica configuration", "Will refuse to continue".
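For reference, a run that matches the steps above might look like the following sketch. This is an assumption, not taken from the attached logs: the inventory path and the `osd4` host pattern are placeholders, and `site-docker.yml` is ceph-ansible's usual playbook for containerized deployments.

```shell
# Hypothetical ceph-ansible invocation adding a single new OSD node to a
# containerized cluster; inventory path and "osd4" are placeholders.
ansible-playbook -i /etc/ansible/hosts site-docker.yml --limit osd4
```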
Created attachment 1405398 [details] File contains contents of ansible-playbook log
I'm looking into the machine at the moment
Thanks for letting me know, Ken. I'm still unsure what the real error is; still investigating.
After a day of hard debugging, it turns out the machine running these new OSDs has bad firewall rules. That was sneaky: the OSDs appeared up when started and down when stopped, so it was hard to believe the firewall was involved. After disabling firewalld, everything works. Please check your node setup (including firewall rules) before running a deployment; we can't afford to lose time on issues like this. This is not the first time either, so please be more careful going forward. Thanks.
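Rather than disabling firewalld outright, a quick sanity check on a new OSD node could look like the sketch below. It assumes firewalld is in use and that the OSDs listen on Ceph's default port range (6800-7300/tcp); adjust the range if your cluster overrides `ms_bind_port_min`/`ms_bind_port_max`.

```shell
# Show the zones, services, and ports firewalld currently allows on this node.
firewall-cmd --list-all

# Permanently open the default Ceph OSD port range, then apply the change.
firewall-cmd --permanent --add-port=6800-7300/tcp
firewall-cmd --reload
```

Opening the required ports keeps the host protected while letting OSD heartbeats and replication traffic through, which avoids the "up on start, down afterwards" symptom seen here.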