1301817 – journal: Unexpected aio error on several nodes

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Ceph Storage
Classification:	Red Hat Storage
Component:	RADOS
Sub Component:
Version:	1.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	1.3.4
Assignee:	Loic Dachary
QA Contact:	ceph-qe-bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-01-26 03:55 UTC by Shinobu KINJO
Modified:	2023-09-14 03:16 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-07-19 01:24:16 UTC
Embargoed:
Dependent Products:

Comment 7 Loic Dachary 2016-01-27 15:23:24 UTC

@sam do you have a name in mind? My usual contact for kernel-related stuff is Ilya :-)

Comment 8 Loic Dachary 2016-01-27 15:25:21 UTC

@skinjo would it be possible to get access to a cluster where the problem can be reproduced? Even if it is not frequent, a single way to reproduce it would be enough to make progress. As it is, we're no closer to getting it resolved: the report only confirms that the problem exists and that we have no theory to explain why.

Comment 9 Josh Durgin 2016-01-27 16:47:44 UTC

Loic, do you know if the original instance also used a raid controller in JBOD mode, and if so which model? Perhaps there is some firmware bug common to the two cases.

Comment 26 Loic Dachary 2016-01-28 15:15:54 UTC

@josh 

> do you know if the original instance also used a raid controller in JBOD mode, and if so which model? Perhaps there is some firmware bug common to the two cases.

I don't know.

Comment 29 Loic Dachary 2016-01-29 09:24:00 UTC

@Shinobu KINJO

In order to correlate the problem with previous occurrences, it would be useful to get the kernel log (dmesg) as well as a log of the load average an hour before and after the crash. 

I'm exploring ideas and will update you when / if I figure something out ;-)
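
For the load-average log requested above, something along these lines could be left running on the affected node. This is only a sketch: the function names, the output format and the one-minute interval are assumptions chosen for illustration, and the kernel log would still have to be saved separately (for example with `dmesg > dmesg.txt`).

```python
import os
import time


def format_load_sample(timestamp, loadavg):
    """Render one log line: UTC timestamp followed by the 1/5/15-minute load averages."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(timestamp))
    one, five, fifteen = loadavg
    return f"{stamp} UTC load1={one:.2f} load5={five:.2f} load15={fifteen:.2f}"


def record_load(path, interval_seconds=60, samples=None):
    """Append load-average samples to `path`; runs until interrupted when samples is None."""
    taken = 0
    with open(path, "a") as log_file:
        while samples is None or taken < samples:
            # os.getloadavg() is available on Unix-like systems only.
            log_file.write(format_load_sample(time.time(), os.getloadavg()) + "\n")
            log_file.flush()  # keep the file current in case the node goes down
            taken += 1
            if samples is None or taken < samples:
                time.sleep(interval_seconds)
```

Calling `record_load("loadavg.log")` appends one line per minute, so the hour before and after a crash can be cut out of the file afterwards by timestamp.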

Comment 30 Shinobu KINJO 2016-01-29 10:50:56 UTC

@Loic,

I understood.

I'm taking this issue seriously because I tried to figure out exactly what was going on when this was reported:

  http://tracker.ceph.com/issues/12100

Please let me know at any time if your team requires more.

Rgds,
Shinobu

Comment 31 Loic Dachary 2016-01-29 15:26:43 UTC

@Shinobu Sam suggested that you activate debug with:

debug osd = 0/20
debug filestore = 0/20
debug ms = 0/1
debug journal = 0/20

which we call "in memory logging". What it means is that your logs won't grow. The OSD will be marginally slower because it will log things into memory, but it only keeps the last 10,000 messages. Should a crash happen, it will dump these messages to the logs, and they will likely contain the information we need to better understand what is going on.
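
To make the meaning of a setting such as `0/20` concrete, here is a toy Python model of the idea. It is not Ceph code: the class and function names are invented for illustration, and it assumes the usual reading of the two numbers as "level written to the log file / level kept in memory".

```python
from collections import deque


def parse_debug_setting(value):
    """Split a 'file_level/memory_level' string such as '0/20' into two ints."""
    file_level, memory_level = (int(part) for part in value.split("/"))
    return file_level, memory_level


class InMemoryLog:
    """Toy model: low-level messages go to the file, the rest sit in a ring buffer."""

    def __init__(self, file_level, memory_level, max_recent=10000):
        self.file_level = file_level
        self.memory_level = memory_level
        # Bounded buffer: once full, the oldest entry is discarded on each append.
        self.recent = deque(maxlen=max_recent)
        self.written = []  # stands in for the on-disk log file

    def log(self, level, message):
        if level <= self.file_level:
            self.written.append(message)
        elif level <= self.memory_level:
            self.recent.append(message)

    def dump_recent(self):
        """What happens on a crash: the buffered messages are flushed to the log."""
        self.written.extend(self.recent)
        self.recent.clear()
```

With `InMemoryLog(*parse_debug_setting("0/20"))`, a level-10 message never reaches the file during normal operation, which is why the logs do not grow, but it is still there to be dumped if the daemon crashes.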

Comment 32 Shinobu KINJO 2016-01-29 21:49:06 UTC

@Loic,

I will suggest that again.

But since the cluster is in production, the customer is worried about performance degradation. So it will probably take some time to get a core from them.

Thank you for paying attention to this.

Rgds,
Shinobu

Comment 33 Loic Dachary 2016-02-02 05:05:59 UTC

@Shinobu for the record, Ilya figured out the root cause of the problem ( http://tracker.ceph.com/issues/9570#note-37 ) and suggested a workaround: upgrade to a kernel newer than 3.13.
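
As a rough way to see which nodes the workaround applies to, the kernel release string can be compared against 3.13. The helper names below are hypothetical, and the check only looks at the version number: distribution kernels often carry backported fixes, so the result is a hint, not proof that a node is or is not affected.

```python
import re


def kernel_version_tuple(release):
    """Extract (major, minor) from a release string such as '3.13.0-24-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def is_newer_than(release, baseline=(3, 13)):
    """True when the release's (major, minor) is strictly greater than the baseline."""
    return kernel_version_tuple(release) > baseline
```

On a given node the release string is what `uname -r` prints, or `platform.release()` from Python's standard library.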

Comment 34 Shinobu KINJO 2016-02-02 05:44:19 UTC

@Loic,

Thank you and your team (especially Ilya!) for the analysis and confirmation.

Rgds,
Shinobu

Comment 37 Josh Durgin 2017-07-19 01:24:16 UTC

This seems to have been a kernel bug.

Comment 38 Red Hat Bugzilla 2023-09-14 03:16:46 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
