@sam do you have a name in mind? My usual contact for kernel-related stuff is Ilya :-)
@skinjo would it be possible to get access to a cluster where the problem can be reproduced? Even if it is not frequent, having one way to reproduce it would be enough to make progress. As it is, we're no closer to getting it resolved: it only confirms that the problem exists and that we have no theory to explain why.
Loic, do you know if the original instance also used a raid controller in JBOD mode, and if so which model? Perhaps there is some firmware bug common to the two cases.
@josh
> do you know if the original instance also used a raid controller in JBOD mode, and if so which model? Perhaps there is some firmware bug common to the two cases.
I don't know.
@Shinobu KINJO In order to correlate the problem with previous occurrences, it would be useful to get the kernel log (dmesg) as well as a log of the load average an hour before and after the crash. I'm exploring ideas and will update you when / if I figure something out ;-)
@Loic, understood. I'm taking this issue seriously because I tried to figure out exactly what was going on when this was reported: http://tracker.ceph.com/issues/12100 Please let me know at any time if your team requires more. Rgds, Shinobu
@Shinobu Sam suggested that you activate debug with:
  debug osd = 0/20
  debug filestore = 0/20
  debug ms = 0/1
  debug journal = 0/20
which we call "in memory logging". What it means is that your logs won't grow. The OSD will be marginally slower because it will log things into memory, but it only keeps the last 10,000 messages. Should a crash happen, it will dump these messages to the logs and they will likely contain the information we need to better understand what is going on.
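To illustrate why the logs do not grow with settings like 0/20, here is a minimal Python sketch of the idea. It is not Ceph's implementation: the class `InMemoryLog`, the helper `parse_debug_setting`, and the reading of "0/20" as "file log level / in-memory level" are assumptions made for the illustration. Messages up to the memory level go into a bounded buffer, and only a crash dump writes them out.

```python
from collections import deque


def parse_debug_setting(value):
    """Split a value such as '0/20' into (log_level, memory_level).

    Assumption for this sketch: the first number is the level written to
    the log file, the second the level kept in memory.
    """
    log_level, memory_level = (int(part) for part in value.split("/"))
    return log_level, memory_level


class InMemoryLog:
    """Hypothetical illustration of "in memory logging" (not Ceph code)."""

    def __init__(self, log_level, memory_level, capacity=10000):
        self.log_level = log_level
        self.memory_level = memory_level
        # A bounded deque silently drops the oldest entry when it is full.
        self.recent = deque(maxlen=capacity)
        # Stands in for the on-disk log file.
        self.written = []

    def log(self, level, message):
        # Keep the message in memory if it is within the memory level.
        if level <= self.memory_level:
            self.recent.append((level, message))
        # Write it immediately only if it is within the file log level.
        if level <= self.log_level:
            self.written.append(message)

    def dump_recent(self):
        """What a crash handler would do: flush buffered messages to the log."""
        self.written.extend(message for _, message in self.recent)
        self.recent.clear()
```

With `InMemoryLog(*parse_debug_setting("0/20"))`, a level-10 message costs only a buffer append during normal operation and reaches the log only if `dump_recent()` is called.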
@Loic, I will suggest that again. But since the cluster is in production, the customer is worried about performance degradation. So it will probably take some time to get them to collect a core. Thank you for paying attention to this. Rgds, Shinobu
@Shinobu for the record Ilya figured out the root cause of the problem ( http://tracker.ceph.com/issues/9570#note-37 ) and suggested a workaround (upgrade the kernel so it is newer than 3.13).
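For anyone checking whether a node already satisfies the "newer than 3.13" workaround, a small Python sketch follows. The function names are hypothetical, and it compares only the major.minor part of the release string, so it cannot tell whether a distribution kernel at 3.13 or older carries a backported fix.

```python
import re


def parse_kernel_version(release):
    """Return (major, minor) from a release string such as '3.13.0-24-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def is_newer_than(release, threshold=(3, 13)):
    """True if the kernel's major.minor version is strictly above the threshold."""
    return parse_kernel_version(release) > threshold
```

On a Linux host the release string can be obtained with `platform.release()` from the standard library (the same value `uname -r` prints) and passed to `is_newer_than`.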
@Loic, thank you and your team (especially Ilya!) for the analysis and confirmation. Rgds, Shinobu
This seems to have been a kernel bug.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days