.For Red Hat Ceph Storage deployments running within containers, adding a new OSD will cause the new OSD daemon to continuously restart
Adding a new OSD to an existing Ceph Storage Cluster running within a container causes the new OSD daemon to restart every 5 minutes. As a result, the storage cluster will not achieve a `HEALTH_OK` state. Currently, there is no workaround for this issue. This does not affect already running OSD daemons.
Created attachment 1405395
File contains contents of OSD journald log
Description of problem:
Tried adding a new OSD to a containerized cluster upgraded from 3.0 live to RC 3.0.z1; the OSDs are continuously getting shut down after receiving signal Interrupt from PID: 0.
Version-Release number of selected component (if applicable):
Container - rhceph:3-3
How reproducible:
Always (2/2)
Steps to Reproduce:
1. Upgrade a containerized cluster from 3.0 live to rhceph:3-3
2. Add a new OSD node with the `--limit` option (see the command sketch below)
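For reference, a minimal sketch of the commands involved, assuming a standard ceph-ansible 3.x containerized layout; the playbook paths and the inventory host name (osd-node-new) are assumptions, not taken from this report:

  # Sketch only -- run from the ceph-ansible directory
  cd /usr/share/ceph-ansible

  # Step 1: upgrade the containerized cluster to the rhceph:3-3 image
  ansible-playbook infrastructure-playbooks/rolling_update.yml

  # Step 2: add the new OSD node, limiting the play to that host only
  ansible-playbook site-docker.yml --limit osd-node-new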
Actual results:
OSDs (dmcrypt enabled + collocated journal) are getting restarted:
ceph-osd-run.sh[22803]: 2018-03-07 14:30:12.947659 7ff421d11d00 -1 osd.9 2164 log_to_monitors {default=true}
ceph-osd-run.sh[22803]: 2018-03-07 14:35:07.200026 7ff3faa6a700 -1 Fail to open '/proc/0/cmdline' error = (2) No such file or directory
ceph-osd-run.sh[22803]: 2018-03-07 14:35:07.200044 7ff3faa6a700 -1 received signal: Interrupt from PID: 0 task name: <unknown> UID: 0
ceph-osd-run.sh[22803]: 2018-03-07 14:35:07.200048 7ff3faa6a700 -1 osd.9 2229 *** Got signal Interrupt ***
ceph-osd-run.sh[22803]: 2018-03-07 14:35:07.200062 7ff3faa6a700 -1 osd.9 2229 shutdown
Expected results:
OSD services must be up and running
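For reference, two standard checks to confirm the OSD state (osd.9 is taken from the log above; the systemd unit naming can differ in containerized deployments, where units are often keyed by device rather than OSD id):

  # The new OSD should be reported as "up" in the OSD map
  ceph osd tree

  # Follow the unit log of the affected OSD to watch for restarts
  journalctl -u ceph-osd@9 -f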
Additional info:
When this issue was reproduced on another cluster, the services stabilized after two shutdowns and restarts.
The `--limit` option was used when the playbook was initiated to add the new OSD node.
The playbook eventually failed while trying to restart the OSDs, reporting:
"PGs were not reported as active+clean", "It is possible that the cluster has less OSDs than the replica configuration", "Will refuse to continue"
After one day of hard debugging, it turns out that the machine where these new OSDs are running had bad firewall rules. That was sneaky, because the OSDs appeared up when they were started and down when stopped, so it was hard to believe the firewall was involved.
After disabling firewalld, everything works.
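Rather than disabling firewalld entirely, opening the Ceph ports on the OSD node should be enough; a sketch using the stock firewalld service definitions (the zone is an assumption):

  # Inspect the currently active rules on the OSD node
  firewall-cmd --list-all

  # Permanently allow Ceph OSD traffic (ports 6800-7300/tcp)
  firewall-cmd --zone=public --add-service=ceph --permanent
  firewall-cmd --reload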
Please check your machine setup (including firewall rules) before running a deployment. We can't afford to waste time on issues like this. This is not the first time either, so please be more careful from now on.
Thanks.