Discussion:
How to force tex2lyx to read unicode (from within Lyx)?
stefano franchi
2012-02-13 22:41:47 UTC
I have been helping Eric Weir with his Scrivener-->LaTeX-->LyX import
and we have narrowed down the problem to LyX not importing a
Unicode-encoded file as Unicode.
So tex2lyx is the culprit. Wasn't this solved some time ago, however? I
mean: automatic recognition of the imported file encoding?

Try out the enclosed minimal lyx file.

1. Exporting to LaTeX and reimporting into LyX from
File>>Import>>LaTeX (plain) produces garbage characters for the dashes.

2. However, calling >tex2lyx -e UTF8 from the command line produces
the correct file.

Is this a bug? Or a feature?

Cheers,

Stefano
--
__________________________________________________
Stefano Franchi
Associate Research Professor
Department of Hispanic Studies            Ph:   +1 (979) 845-2125
Texas A&M University                          Fax:  +1 (979) 845-6421
College Station, Texas, USA

***@tamu.edu
http://stefano.cleinias.org
Richard Heck
2012-02-13 23:47:57 UTC
Post by stefano franchi
I have been helping Eric Weir with his Scrivener-->LaTeX-->LyX import
and we have narrowed down the problem to LyX not importing a
Unicode-encoded file as Unicode.
I'm glad to hear someone was able to help with this. Thanks for your
efforts on behalf of the LyX community.
Post by stefano franchi
So tex2lyx is the culprit. Wasn't this solved some time ago, however? I
mean: automatic recognition of the imported file encoding?
Well, that's a complicated issue. They may have tried to do a better
job, but recognizing the file encoding completely reliably is not in
general possible. There's some famous example of this.
Post by stefano franchi
Try out the enclosed minimal LyX file.
1. Exporting to LaTeX and reimporting into LyX from
File>>Import>>LaTeX (plain) produces garbage characters for the dashes.
2. However, calling >tex2lyx -e UTF8 from the command line produces
the correct file.
Still, it's surprising we fail in this particular case, so I'd guess it
is indeed a bug. I'll cross-post to devel.

Richard
Jean-Marc Lasgouttes
2012-02-14 14:05:17 UTC
Post by Richard Heck
Post by stefano franchi
Try out the enclosed minimal LyX file.
1. Exporting to LaTeX and reimporting into LyX from
File>>Import>>LaTeX (plain) produces garbage characters for the dashes.
2. However, calling >tex2lyx -e UTF8 from the command line produces
the correct file.
Still, it's surprising we fail in this particular case, so I'd guess it
is indeed a bug. I'll cross-post to devel.
tex2lyx is not really the culprit. The file you attached is broken: it
does not use xetex/luatex, but it does use "Unicode (XeTeX) (utf8)" as
encoding. As a result, the LaTeX export of the file does not specify any
encoding, and tex2lyx is not able to guess it.

Even better, try to view the file: the encoding will be bogus,
without any intervention of tex2lyx.

What happens in this case is that the encoding is set to latin1 (utf8 is
a better default in some sense, but if you guess it wrong, you can get
conversion errors, see bug #7509).

Unfortunately, it seems that the encodings utf8 and utf8x (handled by
plain latex) do not understand some of your hyphens, and produce a
document that does not compile.

So your best bet is probably to produce a file that will be imported as
XeTeX/LuaTeX. This will happen when some characteristic packages are
recognized, which will happen after the following patch is applied.

Georg, Juergen, I'd like some feedback on the soundness of the patch. I
know next to nil about xetex, and I do not know what is the current
state of the art wrt tex2lyx.

JMarc
Jürgen Spitzmüller
2012-02-14 14:17:26 UTC
Post by Jean-Marc Lasgouttes
Georg, Juergen, I'd like some feedback on the soundness of the patch. I
know next to nil about xetex, and I do not know what is the current
state of the art wrt tex2lyx.
The latter I don't know either, but the patch looks sane. We set the encoding
to utf8 internally as well if \use_non_tex_fonts is true.

Jürgen
stefano franchi
2012-02-14 17:00:57 UTC
On Tue, Feb 14, 2012 at 8:05 AM, Jean-Marc Lasgouttes
Post by Richard Heck
Post by stefano franchi
Try out the enclosed minimal LyX file.
1. Exporting to LaTeX and reimporting into LyX from
File>>Import>>LaTeX (plain) produces garbage characters for the dashes.
2. However, calling >tex2lyx -e UTF8 from the command line produces
the correct file.
Still, it's surprising we fail in this particular case, so I'd guess it
is indeed a bug. I'll cross-post to devel.
tex2lyx is not really the culprit. The file you attached is broken: it does
not use xetex/luatex, but it does use "Unicode (XeTeX) (utf8)" as encoding.
As a result, the LaTeX export of the file does not specify any encoding, and
tex2lyx is not able to guess it.
Sorry, but I don't understand what you mean by "broken" here. If I set
the encoding to pure Unicode (isn't that what "Unicode (XeTeX) (utf8)"
means?) shouldn't that be enough to specify that the file is
Unicode-encoded? That seems obvious to me (which of course may only
reflect my ignorance of LyX code). If that's not true then I do not
understand what the meaning of the
Document>>Settings>>Language>>Encoding value is.

Cheers,

Stefano
Guenter Milde
2012-02-15 09:51:58 UTC
Post by stefano franchi
On Tue, Feb 14, 2012 at 8:05 AM, Jean-Marc Lasgouttes
Post by Jean-Marc Lasgouttes
tex2lyx is not really the culprit. The file you attached is broken: it
does not use xetex/luatex, but it does use "Unicode (XeTeX) (utf8)" as
encoding. As a result, the LaTeX export of the file does not specify
any encoding, and tex2lyx is not able to guess it.
Sorry, but I don't understand what you mean by "broken" here. If I set
the encoding to pure Unicode (isn't that what "Unicode (XeTeX) (utf8)"
means?) shouldn't that be enough to specify that the file is
Unicode-encoded?
The setting "Unicode (XeTeX) (utf8)" specifies that the LyX-generated LaTeX
file should be utf-8 encoded Unicode

* without "forced" substitutions (for characters that the partial utf8
support of 8-bit LaTeX does not understand or handles wrongly), and

* without calling the "inputenc" package.

This means that in the exported file, utf-8 encoding is used but the *.tex
file itself does not specify which encoding it uses.

You can consider this an "expert setting": it allows you to circumvent
limitations but usually requires additional custom preamble code to
produce valid LaTeX files.

As with the use of ERT, LyX does not guarantee that the result works.
Post by stefano franchi
That seems obvious to me (which of course may only reflect my ignorance
of LyX code). If that's not true then I do not understand what the
meaning of the Document>>Settings>>Language>>Encoding value is.
It seems that tex2lyx relies on the optional argument in the

\usepackage[<encoding>]{inputenc}

line to determine the *.tex file encoding.
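
As a rough illustration of that lookup, here is a minimal Python sketch. It is not the tex2lyx code (which is C++); the helper name declared_encoding is made up for this example, and it only looks at plain inputenc lines.

```python
import re

# Matches \usepackage[<encoding>]{inputenc} and captures <encoding>.
# Hypothetical helper for illustration only, not part of tex2lyx.
_INPUTENC_RE = re.compile(r"\\usepackage\s*\[([^\]]*)\]\s*\{inputenc\}")


def declared_encoding(tex_source):
    """Return the encoding named in the inputenc line, or None if absent."""
    for line in tex_source.splitlines():
        # Drop everything after an unescaped % (a LaTeX comment).
        code = re.split(r"(?<!\\)%", line, maxsplit=1)[0]
        match = _INPUTENC_RE.search(code)
        if match:
            return match.group(1).strip()
    return None
```

When the function returns None, the file gives no hint about its encoding, which is the situation with the exported file discussed in this thread.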

IMV, if this does not give a result, it should try utf8 first:

* if the file is pure ASCII, everything is fine
* if the file is utf8 encoded, fine too
* if another encoding is used, an error occurs: try again with the second
guess.
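
The three cases above could be sketched roughly as follows in Python. This is only an illustration of the idea, not how tex2lyx is implemented; the function name and the choice of latin1 as the second guess are assumptions (latin1 is simply the fallback mentioned earlier in the thread).

```python
def decode_with_fallback(raw, second_guess="latin1"):
    """Decode bytes as utf8 first; fall back to second_guess on failure.

    Returns a (text, encoding_used) pair.
    """
    try:
        # Pure ASCII and valid utf8 input both succeed here.
        return raw.decode("utf-8"), "utf8"
    except UnicodeDecodeError:
        # Text in another 8-bit encoding usually fails utf8 decoding,
        # so retry with the second guess.
        return raw.decode(second_guess), second_guess
```

Note that a latin1 fallback never fails, because every byte maps to some character, so a wrong second guess would go unnoticed.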

Günter
Jean-Marc Lasgouttes
2012-02-15 10:35:10 UTC
Post by Guenter Milde
IMV, if this does not give a result, it should try utf8 first:
* if the file is pure ASCII, everything is fine
* if the file is utf8 encoded, fine too
* if another encoding is used, an error occurs: try again with the second
guess.
That is not a bad solution, but probably not so easy to implement with
our current code.

I have a patch that seems to work with XeTeX, I do not know anything
about luatex (what shall we do with it?)

JMarc
Jürgen Spitzmüller
2012-02-15 10:38:24 UTC
Post by Jean-Marc Lasgouttes
I have a patch that seems to work with XeTeX, I do not know anything
about luatex (what shall we do with it?)
I think your patch will work for LuaTeX as well.

Jürgen
Jean-Marc Lasgouttes
2012-02-15 11:11:06 UTC
Post by Jürgen Spitzmüller
Post by Jean-Marc Lasgouttes
I have a patch that seems to work with XeTeX, I do not know anything
about luatex (what shall we do with it?)
I think your patch will work for LuaTeX as well.
I thought LuaTeX could use encodings other than utf8. Is utf8 the
default? And does it use the same packages as XeTeX?

I am very confused by these different engines...

JMarc
Jürgen Spitzmüller
2012-02-15 11:25:41 UTC
Post by Jean-Marc Lasgouttes
I thought LuaTeX could use encodings other than utf8. Is utf8 the
default? And does it use the same packages as XeTeX?
LuaTeX can use other encodings, but not with non-tex fonts. So if
\use_non_tex_fonts is true (this is what you want to check for, right?), the
encoding of the file must be utf8.

If other encodings are used, LuaTeX uses the "luainputenc" package, which has
the same syntax as inputenc, i.e.

\usepackage[<enc>]{luainputenc}

Maybe we have to care about that.

The font packages are the same (basically fontspec).

Jürgen
Guenter Milde
2012-02-15 16:26:06 UTC
Post by Jürgen Spitzmüller
Post by Jean-Marc Lasgouttes
I thought LuaTeX could use encodings other than utf8. Is utf8 the
default? And does it use the same packages as XeTeX?
LuaTeX can use other encodings, but not with non-tex fonts.
Are you sure? I'd think that with "luainputenc" there would be no limitation
of the input encoding to utf8. Of course, with non-tex fonts the *font
encoding* is Unicode.
Post by Jürgen Spitzmüller
So if
\use_non_tex_fonts is true (this is what you want to check for,
right?), the encoding of the file must be utf8.
This might be a LyX limitation. However this is about tex2lyx, so we would
check for \usepackage{fontenc} or \usepackage{xunicode}.

I propose to also check for
Post by Jürgen Spitzmüller
\usepackage[<enc>]{luainputenc}
Günter
Georg Baum
2012-02-15 20:03:04 UTC
Post by Jürgen Spitzmüller
LuaTeX can use other encodings, but not with non-tex fonts. So if
\use_non_tex_fonts is true (this is what you want to check for, right?),
the encoding of the file must be utf8.
No, it is the other way round: If a xetex package is detected,
use_non_tex_fonts is set to true, and the encoding is set to utf8.
Post by Jürgen Spitzmüller
If other encodings are used, LuaTeX uses the "luainputenc" package, which
has the same syntax as inputenc, i.e.
\usepackage[<enc>]{luainputenc}
Maybe we have to care about that.
luainputenc is already handled. What is missing for luatex is to set
use_non_tex_fonts to true and the encoding to utf8 if luainputenc is not
used. Unfortunately this is not so easy as with xetex.


Georg
Guenter Milde
2012-02-16 09:35:19 UTC
Post by Georg Baum
Post by Jürgen Spitzmüller
If other encodings are used, LuaTeX uses the "luainputenc" package, which
has the same syntax as inputenc, i.e.
\usepackage[<enc>]{luainputenc}
Maybe we have to care about that.
luainputenc is already handled.
Fine.
Post by Georg Baum
What is missing for luatex is to set
use_non_tex_fonts to true and the encoding to utf8 if luainputenc is not
used. Unfortunately this is not so easy as with xetex.
"use_non_tex_fonts" is actually a toggle to use the "fontspec" package
(coupled with the use of polyglossia if the language-package
setting is set to auto, if I remember right).

The "fontspec" package can be used with the XeTeX and LuaTeX engines but not
with TeX and eTeX.

This means that tex2lyx should look for

\usepackage{fontspec}

and set "use_non_tex_fonts" accordingly.

(IMV, the name for this setting should also be changed:
- use_non_tex_fonts
+ use_fontspec
this would prevent much confusion.)

The default encoding (if neither "inputenc" nor "luainputenc" is found) may
be chosen according to use_non_tex_fonts:

True
XeTeX or LuaTeX engines must be used to compile

--> assume their default encoding (utf8) as file encoding.

False
any tex engine can be used to compile

--> assume either the 8-bit tex default (ASCII) or utf8 as file encoding.

Actually, utf8 should be a safe bet in any case when no "inputenc" or
"luainputenc" is found, as ASCII is a subset of utf8 and no tex engine
understands 8-bit encodings (latin-1 ...) without inputenc or a similar
package.

UTF8 decoding errors should lead to a message: "unknown encoding, please
specify the file encoding". If tex2lyx is called from the LyX-GUI, this
error should lead to a pop-up dialogue where you can specify an encoding.
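
Put together, the rules proposed in this message could look roughly like the Python sketch below. All names here (guess_import_settings, UnknownEncoding) are hypothetical and chosen for illustration; this is not the tex2lyx implementation, and for brevity it does not skip commented-out \usepackage lines.

```python
import re

# Hypothetical names throughout; this mirrors the proposal above,
# not the code that tex2lyx actually ships.
_ENC_RE = re.compile(r"\\usepackage\s*\[([^\]]*)\]\s*\{(?:lua)?inputenc\}")
_FONTSPEC_RE = re.compile(r"\\usepackage\s*(?:\[[^\]]*\])?\s*\{fontspec\}")


class UnknownEncoding(Exception):
    """Raised when no encoding is declared and utf8 decoding fails."""


def guess_import_settings(raw):
    """Return (use_non_tex_fonts, encoding) for the bytes of a .tex file."""
    # latin1 maps every byte to a character, so it is safe for scanning
    # the ASCII-only \usepackage lines before the real encoding is known.
    probe = raw.decode("latin1")
    use_non_tex_fonts = _FONTSPEC_RE.search(probe) is not None
    declared = _ENC_RE.search(probe)
    if declared:
        # An explicit inputenc/luainputenc option always wins.
        return use_non_tex_fonts, declared.group(1).strip()
    try:
        # No declaration: assume utf8, which also covers pure ASCII.
        raw.decode("utf-8")
    except UnicodeDecodeError:
        raise UnknownEncoding(
            "unknown encoding, please specify the file encoding"
        ) from None
    return use_non_tex_fonts, "utf8"
```

A GUI caller could catch UnknownEncoding and open a dialogue asking for the encoding, while a command-line caller could print the message and exit.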

Günter
Eric Weir
2012-02-16 12:25:41 UTC
Post by Guenter Milde
- use_non_tex_fonts
+ use_fontspec
this would prevent much confusion.)
While I now know what it refers to, as a new user I would otherwise have found the new description mystifying. It would seem that it's possible to be clear both to advanced users/developers and new users.

Regards,
------------------------------------------------------------------------------------------
Eric Weir
Decatur, GA
***@bellsouth.net

"Style is truth."

- Ray Bradbury
Jürgen Spitzmüller
2012-02-16 12:50:11 UTC
Post by Eric Weir
Post by Guenter Milde
- use_non_tex_fonts
+ use_fontspec
this would prevent much confusion.)
While I now know what it refers to, as a new user I would otherwise have
found the new description mystifying. It would seem that it's possible to be
clear both to advanced users/developers and new users.
This is an internal value only. But I also think we should keep it as it is,
i.e. abstract and not bound to some specific package which may change in the
future.

Jürgen
Georg Baum
2012-02-15 19:58:42 UTC
Post by Jean-Marc Lasgouttes
Georg, Juergen, I'd like some feedback on the soundness of the patch. I
know next to nil about xetex, and I do not know what is the current
state of the art wrt tex2lyx.
Unfortunately I know next to nothing about use_non_tex_fonts. After reading
the code and user guide a bit I see that using xetex is only possible if
use_non_tex_fonts is true, so your patch looks fine.

Georg
Guenter Milde
2012-02-16 09:36:48 UTC
Post by Georg Baum
Post by Jean-Marc Lasgouttes
Georg, Juergen, I'd like some feedback on the soundness of the patch. I
know next to nil about xetex, and I do not know what is the current
state of the art wrt tex2lyx.
Unfortunately I know next to nothing about use_non_tex_fonts. After reading
the code and user guide a bit I see that using xetex is only possible if
use_non_tex_fonts is true, so your patch looks fine.
It is the other way round: using xetex or luatex is also possible without
use_non_tex_fonts (which actually means use_fontspec). OTOH, with
use_non_tex_fonts True, one of xetex or luatex must be used.


Günter
Georg Baum
2012-02-19 12:46:05 UTC
Post by Guenter Milde
Post by Georg Baum
Post by Jean-Marc Lasgouttes
Georg, Juergen, I'd like some feedback on the soundness of the patch. I
know next to nil about xetex, and I do not know what is the current
state of the art wrt tex2lyx.
Unfortunately I know next to nothing about use_non_tex_fonts. After
reading the code and user guide a bit I see that using xetex is only
possible if use_non_tex_fonts is true, so your patch looks fine.
It is the other way round: using xetex or luatex is also possible without
use_non_tex_fonts (which actually means use_fontspec). OTOH, with
use_non_tex_fonts True, one of xetex or luatex must be used.
For luatex I agree, but for xetex I don't believe that you are right. Do you
have an example .lyx file, where use_non_tex_fonts is false, but xetex is
still used? If you are right, the patch introduces a regression.


Georg
Guenter Milde
2012-02-27 21:18:16 UTC
Post by Georg Baum
Post by Guenter Milde
Post by Georg Baum
Unfortunately I know next to nothing about use_non_tex_fonts. After
reading the code and user guide a bit I see that using xetex is only
possible if use_non_tex_fonts is true, so your patch looks fine.
It is the other way round: using xetex or luatex is also possible without
use_non_tex_fonts (which actually means use_fontspec). OTOH, with
use_non_tex_fonts True, one of xetex or luatex must be used.
For luatex I agree, but for xetex I don't believe that you are right. Do you
have an example .lyx file, where use_non_tex_fonts is false, but xetex is
still used? If you are right, the patch introduces a regression.
Just take a "normal" file (with use_non_tex_fonts False) and select one
of

File>Export>LaTeX (XeTeX)
File>Export>PDF (XeTeX)
View>Other Formats>PDF (XeTeX)

The exported LaTeX with my somewhat outdated LyX-svn is

%% LyX 2.1.0svn created this file. For more info, see http://www.lyx.org/.
%% Do not edit unless you really know what you are doing.
\documentclass[english]{article}
\usepackage[T1]{fontenc}
\usepackage{babel}
\usepackage{xunicode}
\begin{document}
Test
\end{document}

and compiles fine with XeTeX.

Günter
