From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20020830

Description of problem:

Problem:

msgstr:
[snip] blau und
gelb ist [snip]

msgmerge'd into:
[snip] blau undgelb ist [snip]

A lot of German translations have missing spaces between words. I believe the initial cause of this to be msgmerge, since it appends the next line to the previous line without checking whether the previous line ends in a space. A lot of translators are not aware that this space at the end of the line is *absolutely* required.

I believe msgmerge should do a sanity-check, or is there any particular reason that it doesn't? An option to not add a space might be ok, but I believe that by default msgmerge should not re-arrange lines, but leave them as is.

Cheers,
Bernd

Version-Release number of selected component (if applicable):

How reproducible: Always

Steps to Reproduce:
1. Create a multi-line string where some lines leave quite some space at the end of the line, e.g. only have 40 characters. Do not end the line in either "\n" or " ".
2. Perform a msgmerge on the file containing this entry.

Actual Results: msgmerge re-formats the lines and introduces a more suitable line-break. In so doing, it falsely merges the last word of the first line with the first word of the second line into a single string.

Expected Results: The entry should have been left as is, or, if a new line-break was introduced, there should be a space in between the last word of the first line and the first word of the second line.

Additional info:
Created attachment 89959 [details] Testcase: pot, old po, merged po for an entry with this problem The first row is from the pot-file. msgid on the right, no msgstr available (as seen on left). The second row is from the old po-file. Correct msgstr is on the left. The third row is the result after msgmerge (using default options). On the left side now is the msgstr in which this problem occurs, i.e. "zusagen" instead of "zu sagen", "derGemeinschaft" rather than "der Gemeinschaft", and "könnten,und" where it should read "könnten, und"
Created attachment 89960 [details] pot-file of the previous testcase
Created attachment 89961 [details] old po-file of the previous testcase
Created attachment 89962 [details] new merged po-file of the previous testcase
The original idea of multiple lines is easy reading. People would type in a word in between lines (very similar to the behaviour of editors with wrapping on), as you know the message is on one line if there isn't any \n. I don't think it is sane to change the behaviour or to force people to add a space at the end of the line.

On the other hand, in your test case that needed a space between "," and "u", how will it show in the software? Possibly it will show without the space as well.
> The original idea of multiple lines is easy reading. People would type in
> a word in between lines (very similar to the behaviour of editors with wrapping on),
> as you know the message is on one line if there isn't any \n.

Yes, I know. But most tools for po-files do not wrap automatically and you have to wrap manually, by pressing [Enter].

> I don't think it is sane to change the behaviour or to force people to add a
> space at the end of the line.

That's what I meant. Therefore, msgmerge should check whether there's a space at the end in such cases. And if there isn't, msgmerge should add a space automatically before appending the lines into one, which is what causes the problem iff spaces are missing.

> On the other hand, in your test case that needed a space between "," and "u",
> how will it show in the software? Possibly it will show without the space as well.

Yes, that's right, it will show without the space. Not just in this case, but in all three for the given testcase. "zusagen" is not a German word, neither is "derGemeinschaft". This is how it would appear in the software, and that is wrong. msgmerge does that by re-wrapping the lines and ignoring that users might not have added a space at the end of a line. As such, I believe msgmerge should do a sanity-check, and if there is no space at the end of the line, add one.
This is NOT a msgmerge bug.

The format is *simple*, no exceptions: a list of strings (with possible \foo escape sequences) concatenated together.

If you "fix" msgmerge not to reformat the entries, gettext will *still* join the strings, causing the same output. The fact that there are newlines outside the " " quotes around the strings has *no* influence on the translated string contents, and neither does the way the string is broken into parts:

"a"
" "
"b"

"a b"

"a "
"b"

all generate the same string in the .mo file.

Use a reasonable .po editor (emacs, kbabel, whatever), fix your .po files and be happy.
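For illustration, the concatenation rule described above can be sketched in a few lines of Python. This is not gettext code: `join_po_fragments` is a made-up helper, and `ast.literal_eval` is used only as a rough stand-in for PO escape handling, which is an assumption that holds for simple strings like these.

```python
import ast


def join_po_fragments(fragments):
    """Concatenate quoted PO string fragments into one string.

    Each fragment is the text of one source line including its double
    quotes. Whatever sits between fragments (newlines, indentation) is
    simply not part of the result.
    """
    return "".join(ast.literal_eval(fragment) for fragment in fragments)


# The three layouts from the comment above all yield the same string.
layouts = [
    ['"a"', '" "', '"b"'],
    ['"a b"'],
    ['"a "', '"b"'],
]
for layout in layouts:
    print(repr(join_po_fragments(layout)))

# Without an explicit space inside the quotes, the words fuse.
print(repr(join_po_fragments(['"a"', '"b"'])))
```

The last call is the case the report is about: the line break between the two fragments contributes nothing, so the result is the fused string.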
> This is NOT a msgmerge bug.
> The format is *simple*, no exceptions: list of strings (with possible
> \foo escape sequences) concatenated together.

Technically, it might not be. But it sure is a problem! Do you know just how many of these wrong entries I have fixed in German translations?

> If you "fix" msgmerge not to reformat the entries, gettext
> will *still* join the strings, causing the same output. The fact
> that there are newlines outside the " " quotes around the strings has
> *no* influence on the translated string contents, neither does the
> way string is broken into parts:

Not reformatting was only an *example*, and one I like less than reformatting and simply adding a space at the end of the line if there isn't one, or at least providing an option that enables this. In that case, you at least fix all such potential problems the first time you run a msgmerge on the file. And that's what I said. And there won't be any problem with gettext after that either.

> "a"
> " "
> "b"
>
> "a b"
>
> "a "
> "b"
>
> all generate the same string in the .mo file.

Yes, I know. But what about:

"a"
"b"

This will generate "ab", not "a b". Now you sure can tell me about the *simple* format, but then you completely ignore common practice. If you start a new line while writing some text, for example, do you mean that a new word begins, or do you intend the next line to be read as part of that very same line, i.e. the last string of the first line and the first string of the second line forming a single word?

Example:

Since this example doesn't fit entirely in my box, I really do
prefer to use multiple lines, even though I'd like to have it
in one line later on in my output.

This example will be msgmerge'd into:

Since this example doesn't fit entirely in my box, I really doprefer to use multiple lines, even though I'd like to have itin one line later on in my output.

If that is the desired output for you, you do have a point. Then I'd like to make the mere suggestion of doing a sanity-check, at least in msgmerge and simply adding a space if there isn't one.

> Use a reasonable .po editor (emacs, kbabel, whatever), fix your .po files
> and be happy.

I *do* use kbabel! But kbabel doesn't tell me, hey, you've forgotten a space at the end of the line there, which is *absolutely* required, because if you don't have one there, msgmerge will scramble your translation next time 'round! ;-)

Cheers,
Bernd
A couple of points for us to brainstorm:

- msgmerge is not a compulsory step for converting .po to .mo, hence we could not rely on msgmerge alone to do a sanity check and add a space at the end of the line; we would also have to update msgfmt, etc. for that behaviour.
- A 3rd party script may try to break the line just by counting the width. So if such scripts break one word across two lines, their users may complain if we change the behaviour.
- Ben has pointed out that an application like kbabel will automatically add a space to the previous line when you break into a new line (on KBabel 0.9.6). This is absolutely a good feature for the client to handle.

I will try to get approval from upstream if you still prefer to handle it in the gettext tools instead of at the translation application level.
> - msgmerge is not a compulsory step for converting .po to .mo, hence we > could not rely on msgmerge to do a sanity check and add a space at the > end of line, but also update msgfmt, etc for that behaviour. In my case, I'll never ever convert anything to .mo. I simply create a new pot and then do a msgmerge to fill the entries. If a space was missing, as addressed, my previous translation breaks. While this is easily avoidable, it is annoying if you happen to have forgotten the required space (in practice, this happens, as seen in some of the german translations I took over). How this could be achieved otherwise, I don't know? > - 3rd party script may try to break the line with just counting the width. > So if they are breaking one word into two lines, then they may complain if > we change the behaviour. What exactly do you mean here? If they don't use msgmerge, they don't have any problem. And if they use the output, I don't see any problem either. Or do you mean that they want to use po-files and write words over the edge, e.g. [snip]and today is a re ally nice day? I'd prefer the same behaviour as in markup-languages, if you do a line-break, a space is added if not placed explicitly. > - Ben has pointed out that application like kbabel will automatically add > a space on previous line when you break into new line. This is a absolutely > a good feature for client to handle. (on KBabel 0.9.6) I agree. KBabel therefore complies to the behaviour we know from markup-languages. But it doesn't enforce it. If you happen to do some changes and end up with no space at the end, you might not realise it until your translation is broken. And if you do not use kbabel, then it is even more likely that you forget a space now and then. I just think it is wrong that forgetting a simple space at the end of a line can break your translation. > I will try to get a approval from upstream if you still prefer handle it > on gettext tools instead of translation applications level. 
Just see what their take on this is. If you've worked a lot with markup-languages and know that a line-break actually implies a space, then you are likely to expect that from some tool too. And if it breaks your translation, because you didn't explicitly add a space at the end of the line, then it simply is annoying. And if you keep adding spaces in previous translations and know it could have been easily avoided, you are annoyed too. I'm happy to keep the default as it is, but an option would be really good. And while I am all for doing it on the application-level, I am also for doing it on the gettext tools, because otherwise you are dependent on your application. And what tool would like to make itself dependent on some application? But if you want to resolve it to NOTABUG, that's fine with me too. I'm sure I find another merge-tool, or I simply write my own. :-) Cheers, Bernd
> "a"
> "b"
> This will generate "ab", not "a b".

Yes, and that's *right*.

> Since this example doesn't fit entirely in my box, I really doprefer to use
> multiple lines, even though I'd like to have itin one line later on in my
> output.
> If that is the desired output for you, you do have a point.

That is not the desired output when using clear text or *ML, but .po is not *ML. The " " marks are part of the syntax and only the marks are there to delimit what is the string. No additional rules. Changing .po to be more *ML-like now after years of use is a very bad idea IMHO.

> I *do* use kbabel! But kbabel doesn't tell me, hey, you've forgotten a space at
> the end of the line there, which is *absolutely* required, because if you don't
> have one there, msgmerge will scramble your translation next time 'round! ;-)

> If a space was missing, as
> addressed, my previous translation breaks.

Your translation is *already* broken (will be wrong when compiled) in those cases.

> Then I'd like to make the mere suggestion of doing a sanity-check, at least in
> msgmerge and simply adding a space if there isn't one.

I guess this bug can be laid to rest with mention of something like this:

--------------- addspaces.awk
/msgid/ { in_str = 0; }
/msgstr/ { in_str = 1; }
{
    if (in_str != 0 && $0 ~ /".* "$/)
        $0 = gensub(/"(.*) "$/, "\"\\1\"", "g");
    print $0;
}
---------------

Then hand-check the results.
> > "a" > > "b" > > This will generate "ab", not "a b". > Yes, and that's *right*. That's nothing I disagree with, from a technical viewpoint that is. > > Since this example doesn't fit entirely in my box, I really doprefer to use > > multiple lines, even though I'd like to have itin one line later on in my > > output. > > If that is the desired output for you, you do have a point. > That is not the desired output when using clear text or *ML, but > .po is not *ML. The " " marks are part of the syntax and only the marks > are there to delimit what is the string. No additional rules. No, it's not. But given how many of these problems I've fixed in previous translations, I can sure tell you that a lot of translators don't seem to be aware of the po-syntax and simply assume that po is simply clear text. Let's redirect my critic then away from the tools and to whoever is responsible for not explaining translators the exact syntax of po-files. Given that some translators might get the impression, that all they are meant to do is to translate some clear text, we should make an effort to really tell them that they are not simply translating clear text, but that they are translating po-strings, which have a given syntax-requirement. > Changing .po to be more *ML-like now after years of use is > a very bad idea IMHO. This point is taken. But the *ML-like was simply an example, an example for clear text in an editor that doesn't support line-wrapping, like editors you write your po-files with. Even if one simply writes clear text in an email, for example, one often puts a line-break -- without space at the end of the line -- to start a new word, simply in the next line, so that the line doesn't get too long. If translators behave in the same -- usual -- way writing their po-files, their translation might break. 
While I agree with you that changing the default-behaviour of a tool after years of use is a bad idea, let's just think about what else we can do, so that this won't happen anymore, on a more global level. There are really only two options (well, three to be exact). 1) Make sure that every translator is aware, that they are not simply translating clear text, but po-strings, with given syntax requirements, whereas some of these are not enforced by the common tools. The other two are obvious, 2) have some software pick up potential problems, or 3) waste a lot of time fixing broken translations afterwards. A matter of choice I guess, but I, personally, surely don't choose option 3). And I don't choose option 1) for the reason that I believe that option 2) is much easier to achieve than option 1). > > If a space was missing, as addressed, my previous translation breaks. > Your translation is *already* broken (will be wrong in when compiled) > in those cases. True. But it isn't lost for the translators, since it still appears correct in your editor and can easily be auto-fixed, i.e. in adding a space at the end. Once a msgmerge is performed, these fixes must be done manually, because these really are broken. As such, from a translators perspective, working on clear text, the translations aren't really broken until you do a msgmerge. Or do you expect all translators to know that these texts will be broken in the output? If you do, did you do all necessary steps to ensure they do? If you ensure me, that every translator but me (since I do know) is aware that this space is required, then I'm happy to not do anything further about this issue and resolve it myself to NOTABUG. :-) > > Then I'd like to make the mere suggestion of doing a sanity-check, at > > least in msgmerge and simply adding a space if there isn't one. 
> I guess this bug can be laid to rest with mention of something like this:
> --------------- addspaces.awk
> /msgid/ { in_str = 0; }
> /msgstr/ { in_str = 1; }
> {
>     if (in_str != 0 && $0 ~ /".* "$/)
>         $0 = gensub(/"(.*) "$/, "\"\\1\"", "g");
>     print $0;
> }
> ---------------
> Then hand-check the results.

If you believe this will solve the problem, and every translator will now be aware that this space is required, then we can safely lay this issue (which I completely agree is not a bug technically) to rest. But I even doubt that all translators know about awk, or any kind of programming.

Cheers,
Bernd
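As a rough sketch of the kind of sanity-check discussed in this thread, the following Python reports places in msgstr entries where two consecutive fragments meet without a space or "\n" between them. It only reports and never modifies anything, since a boundary without a space may be intentional (a line may legitimately be split inside a word). The name `find_suspect_joins` and the simple line-based parsing are assumptions made for this illustration; this is not how msgmerge or msgfmt work internally.

```python
import re

# One quoted fragment per line, optionally preceded by the msgstr keyword.
FRAGMENT = re.compile(r'^\s*(?:msgstr(?:\[\d+\])?\s+)?"(.*)"\s*$')


def find_suspect_joins(po_lines):
    """Return (line_number, left, right) for each msgstr fragment boundary
    where `left` does not end in a space or \\n escape and `right` does not
    start with a space."""
    suspects = []
    in_msgstr = False
    prev = None  # raw text of the previous fragment in the current msgstr
    for number, line in enumerate(po_lines, start=1):
        stripped = line.strip()
        if stripped.startswith("msgstr"):
            in_msgstr, prev = True, None
        elif not stripped.startswith('"'):
            # msgid, comment or blank line: the current msgstr has ended.
            in_msgstr, prev = False, None
        if not in_msgstr:
            continue
        match = FRAGMENT.match(line)
        if match is None:
            prev = None
            continue
        text = match.group(1)
        if prev and text and not prev.endswith((" ", "\\n")) and not text.startswith(" "):
            suspects.append((number, prev, text))
        prev = text
    return suspects


entry = [
    'msgid "blue and yellow is"',
    'msgstr ""',
    '"blau und"',
    '"gelb ist"',
]
for number, left, right in find_suspect_joins(entry):
    print(f"line {number}: {left!r} + {right!r} will be joined without a space")
```

Every hit still needs a human decision, which is why the sketch prints a report instead of rewriting the file.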
Just in case I left it too vague how I would fix the problem if I had a say, here's what I'd do.

> Changing .po to be more *ML-like now after years of use is
> a very bad idea IMHO.

And we wouldn't want that. But if you simply do not reformat the lines in msgmerge -- even though you change the default behaviour -- this, as you said yourself, doesn't have any negative effect, since, and I cite:

''
"a"
" "
"b"

"a b"

"a "
"b"

all generate the same string in the .mo file.
''

On the other hand, not reformatting the lines brings the advantage that missing spaces at the end of the line do not cause the translation to break, i.e. end up in a state where a manual fix is required. While these entries are still broken in the software, they can still be fixed automatically at any time, if we come across such problems.

What now is your argument against this? Why is it so essential for you that msgmerge reformats the lines? What disadvantage does this cause for you? And do you think this is a bigger disadvantage than the one I face?

Anyone else? Any other options? Please *do* fire away!

Thank you,
Bernd
Three comments:

- As was already said, the PO file format has a certain specification for 8 years now, and msgmerge is obeying this specification. Basically, white space outside "..." doesn't count.
- Translators should test their translations before submitting them. The first step of this test is to call msgfmt to get a .mo file and install this .mo file at the appropriate place.
- A good editor for PO files (Emacs PO mode does this; I don't know whether KBabel does) should:
  1. Display the translation in a way that displays newlines as newlines and not \n, does not depend on extra whitespace and line breaks outside strings in the PO file, and show where the string ends.
  2. Let the translator see where there are spaces at the end of line.
  3. If it does automatic line wrapping, let the translator see where there are linebreaks (\n in PO mode syntax).
> - As was already said, the PO file format has a certain specification for 8
> years now, and msgmerge is obeying this specification. Basically, white
> space outside "..." doesn't count.

That's true. But if you adapted msgmerge as suggested in the previous comment, it would still obey the very same specification that it did in the last 8 years. This change doesn't mean it suddenly doesn't obey this specification anymore. But it does solve one problem that seems to occur in practice.

> - Translators should test their translations before submitting them. The
> first step of this test is to call msgfmt to get a .mo file and install this
> .mo file at the appropriate place.

That's a nice idea. But how do you ensure that every translator really does that?

> - A good editor for PO files (Emacs PO mode does this; I don't know whether
> KBabel does) should:
> 1. Display the translation in a way that displays newlines as newlines and
> not \n, does not depend on extra whitespace and line breaks outside
> strings in the PO file, and show where the string ends.

What do you mean by extra whitespace and line breaks outside strings in the PO-file? We are only talking about strings in the po-file, aren't we? What do you mean by outside? I'm not even talking about anything but the po-file really.

> 2. Let the translator see where there are spaces at the end of line.
> 3. If it does automatic line wrapping, let the translator see where there
> are linebreaks (\n in PO mode syntax).

But do we want to dictate to translators what capabilities their editor needs to have? What if I want to edit the po-file directly? In an ordinary text-editor? Do you simply want to tell me that I shouldn't do that? And that if I am not willing or able to create and test the .mo, I shouldn't do any translations in the first place?

Thanks,
Bernd
> But do we want to dictate translators what capabilities their editor needs to
> have? What if I want to edit the po-file directly? In an ordinary text-editor?
> Do you simply want to tell me that I shouldn't do that?

You can use whatever tools you want, but it is your responsibility to ensure the output is right.
> You can use whatever tools you want, but it is your reponsibility to > ensure the output is right. Yes, and I do that. But it seems to be true in practice, that not every translator does that. It's nice to say that everybody has to ensure their output is right, but it doesn't change that it doesn't always apply in practice. Sometimes people don't even have the time in practice. And then, a simple mistake like this, can not only show wrong in the output, no, once you performed a msgmerge, it is even broken in the source. As a result, you have to spend additional time (which in some cases means additional money) to fix manually, what otherwise could have been fixed automatically. This could have been avoided easily, simply through msgmerge not changing the msgstr-entries, but simply leave them as is and copy them to the new file as is. Why does msgmerge need to do a reformatting of lines? Some translators might even layout their texts nicely, so that they break behind a comma, or a period, but msgmerge simply ignores that and does its own layout. Some people might even find that annoying. Software shouldn't change things that it wasn't told to change by default. For such things, it can always provide an option. Just because that's how it was done the last 8 years, doesn't mean we have to do it this way for the rest of time. Especially not if more and more non-software text gets converted into po and therefore more and more translators might use it. Maybe even in a completely different domain. I will repeat my question. Why is it so essential, that msgmerge reformats the msgstr-entry in a po-file, rather than using it as it is? If you can give me a reasonable answer to this very question, you might just convince me of your point. Thank you, Bernd
FYI, here's what I've received this morning: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=84724 And this is only one, currently at the end of a very long list, not including the ones I found. Cheers, Bernd
I still can't see *any* indication why the proposed change would help anything.

* The translation is broken in both cases
* Manual intervention is needed in both cases
* Not reformatting means the files still spell-check fine, reformatting could catch (some of) these errors.
> I still can't see *any* indication why the proposed change would help
> anything.

???

> * The translation is broken in both cases

Depends on your perspective. It is not broken from a clear-text point of view, but only from a po-string point of view. And yes, I am only addressing po-files here, not mo-files. I'm not primarily concerned with mo-files, and a lot of my po-files aren't software-files, therefore no mo-file is ever to be created.

> * Manual intervention is needed in both cases

Yes. But in one case the manual intervention is simply running a script that automatically fixes all these problems, which does not require an understanding of the language the translation is in, while in the second case, somebody who understands the language has to go manually through every entry in the po-file to ensure it is still valid. One manual intervention takes a few seconds, the other can take up to several hours. IMO, that's a big difference.

> * Not reformatting means the files still spell-check fine, reformatting could
> catch (some of) these errors.

Ok, let's get into this issue. Does msgmerge do a spell-check that works only if it reformats the line? Will msgmerge automatically notify you of incorrect spellings? Or what exactly do you refer to here? Do you have a concrete example?

Thank you,
Bernd
> > * The translation is broken in both cases
> Depends on your perspective. It is not broken from a clear-text point of view,
> but only from a po-string point of view. And yes, I am only addressing po-files
> here, not mo-files. I'm not primarily concerned with mo-files and a lot of my
> po-files aren't software-files, therefore no mo-file is ever to be created.

If you are creating "po" files which consider the sequence < " new-line " > to be white space, you are not using the .po format.

> > * Manual intervention is needed in both cases
> Yes. But in one case the manual intervention is simply running a script that
> automatically fixes all these problems, which does not require an understanding
> of the language in which the translation is in, while in the second case,
> somebody who understands the language has to go manually through every entry in
> the po-file to ensure it is still valid.

No. A "smart .po editor" is perfectly allowed to break the lines in the middle of a word, in which case your "script that automatically fixes all the problems" introduces new errors.

And if a "script that automatically fixes all the problems" has to be run, why shouldn't it be run before you even start using the .po file in the first place?

> > * Not reformatting means the files still spell-check fine, reformatting could
> > catch (some of) these errors.
>
> Ok, let's get into this issue. Does msgmerge do a spell-check that works only if
> it reformats the line? Will msgmerge automatically notify you of incorrect
> spellings? Or what exactly do you refer to here? Do you have a concrete example?

Nothing that involved. Merely that using my simple spell check script (which does have the bug about " \n " ;-),

msgstr "wrong"
"file"

passes ok, but

msgstr "wrongfi"
"le"

does not.

After msgmerge does its job, you have a high probability that a spell check will reveal at least some instances of this problem. (This might not be that useful in German, where almost any word combination is grammatically correct, or correct enough for the spellchecker not to complain.)

Yes, once we are discussing "not-really-po" files, this all makes some sense. But IMHO, it still should be solved by using the *strict* .po format, thus allowing full interoperability, not by trying to expand the number of utilities that support a particular dialect.
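The spell-check point above can be made concrete with a small Python sketch. Everything here is hypothetical: `KNOWN_WORDS` stands in for a real dictionary, and the two check functions only mimic a naive line-based script versus a check on the concatenated string; neither is the script mentioned in the comment.

```python
import ast

# Hypothetical stand-in for a real dictionary; only these two words are "known".
KNOWN_WORDS = {"wrong", "file"}


def unknown_words(text):
    """Return the whitespace-separated words of text not found in KNOWN_WORDS."""
    return [word for word in text.split() if word.lower() not in KNOWN_WORDS]


def check_per_fragment(fragments):
    """Spell-check each quoted fragment on its own, like a naive line-based script."""
    return [word for fragment in fragments
            for word in unknown_words(ast.literal_eval(fragment))]


def check_joined(fragments):
    """Spell-check the concatenated string, i.e. what a reformatted entry contains."""
    return unknown_words("".join(ast.literal_eval(fragment) for fragment in fragments))


missing_space = ['"wrong"', '"file"']
print(check_per_fragment(missing_space))  # [] : each piece looks like a word
print(check_joined(missing_space))        # ['wrongfile'] : the fused word is flagged
```

In this toy setup the missing space is invisible while the fragments are checked separately and only shows up once the fragments are joined, which is the effect attributed to reformatting above.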
> If you are creating "po" files which consider the sequence < " new-line " >
> to be white space, you are not using the .po format.

But I don't. I am only using .po format.

> No. A "smart .po editor" is perfectly allowed to break the lines in the
> middle of a word, in which case your "script that automatically fixes
> all the problems" introduces new errors.

It is not to be applied to all po-files, and such a script doesn't exist yet either. The point was that the file remains valid if you look at it as clear text. And if you ask me how I know which files it can be applied to automatically and which it can't: well, some files have an X-Generator entry. In fact, all po-files I ever came across (and these are many, and the ones that concern me) have an X-Generator entry, i.e.

"X-Generator: KBabel 1.0beta2\n"

A "smart .po editor" uses this feature! ;-)

> And if you need a "script that automatically fixes all the problems"
> has to be run, why shouldn't it be run before you even start using the .po
> file in the first place?

Because the engineers don't wait for me with a change in po-strings. They make their changes and create a revision. And that's the way it should be. Sure, you could enforce running a script on commit, but I still believe it is much easier and more sane if msgmerge simply leaves the entries as is and doesn't reformat them. Because then you could check the entries any time in the future, without having to worry about checking all the entries, but simply the ones with a space missing at the end. And you don't use a script that might wrongfully fix entries it shouldn't fix either (good argument, by the way, but it only further proves my point).

> Nothing that involved. Merely that using my simple spell check script
> (which does have the bug about " \n " ;-),
> msgstr "wrong"
> "file"
> passes ok, but
> msgstr "wrongfi"
> "le"
> does not.

Are you now being funny? :-)

> After msgmerge does its job, you have a high probability that a spell
> check will reveal at least some instances of this problem.

Well, why not add an option to msgmerge to allow such reformatting? I would. But this doesn't change that I still believe the default shouldn't change the entries, but leave them as is. For your spell-checking problem, alternatively, you could fix your script! ;-)

> (This might not be that useful in German, where almost any word combination
> is grammatically correct, or correct enough for the spellchecker not to
> complain).

It still does complain though. But I don't have a problem spell-checking anyway, since I do it in kbabel. :-)

> Yes, once we are discussing "not-really-po" files, this all makes some sense.

But I don't. I am discussing nothing but po-files and the disadvantages you face in practice with their strict format, i.e. no whitespace after \n and a space at the end of the line, else it no longer correlates with clear text. While I am completely for not changing the format, I am also for msgmerge not changing the format -- of the po-entries. I *do* talk about po-files, but also about the problem we seem to run into constantly with translators understanding po as clear text. And my suggestion of msgmerge not changing the entries is the best way to ensure that such small mistakes do not completely mess up the translation source.

> But IMHO, it still should be solved by using the *strict* .po format, thus
> allowing full interoperability, not by trying to expand the number of
> utilities that support a particular dialect.

Me too. As said, I am not for changing the format. I have long departed from one of my initial thoughts, that of adding a space if msgmerge really needs to do a reformatting. But I do strongly believe that msgmerge shouldn't change any format either, not even that of po-entries.

Cheers,
Bernd
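The X-Generator heuristic mentioned in this comment could look roughly like the Python below. The helper name `po_generator` and the regular expression are assumptions for illustration; the sketch only reads the header value and leaves the decision whether an automatic fix is safe for files from a given editor to a human.

```python
import re


def po_generator(po_text):
    """Return the value of the X-Generator header line, or None if absent.

    Expects the header as it appears in a PO file, i.e. a quoted line such
    as "X-Generator: KBabel 1.0beta2\\n" ending in a literal backslash-n.
    """
    match = re.search(r'^"X-Generator:\s*(.*?)\\n"\s*$', po_text, flags=re.MULTILINE)
    return match.group(1) if match else None


header = (
    'msgid ""\n'
    'msgstr ""\n'
    '"Content-Type: text/plain; charset=UTF-8\\n"\n'
    '"X-Generator: KBabel 1.0beta2\\n"\n'
)
print(po_generator(header))  # KBabel 1.0beta2
```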