How can LaTeX read utf8?
As described in the TeXbook, TeX reads files byte by byte, regardless of the particular format -- as I understand, this is just how INITEX is set up.
I also understand that LaTeX is just a collection of macros built on top of INITEX, described in most distributions of TeX by the file 'latex.ltx'.
The above two things are at odds with my understanding of LaTeX's ability to read utf8. I was under the impression that reading the input byte by byte (and thus, for instance, only being able to access numbers from 0 to 255 using \char or something similar) was baked into TeX, and thus would exist in all variants built on top of it.
Thus, how is LaTeX able to do this?
unicode
asked 8 hours ago by extremeaxe5
4
The TeXbook describes the so-called Knuth-TeX engine, as well as a collection of macros frequently called "PlainTeX". Are you aware of newer engines called pdfTeX, XeTeX, and LuaTeX?
– Mico
8 hours ago
2
You could be asking one of (at least) two questions here. Are you wondering how e.g. XeTeX (natively UTF-8) can be derived from Knuth's TeX (8-bit)? Or are you wondering how 8-bit TeX engines deal with UTF-8 input (conversion of 'raw' bytes to code points to output)?
– Joseph Wright♦
2 hours ago
1 Answer
If you want to know how the 8-bit engines handle utf8 input you can use \tracingmacros:

    \documentclass{article}
    \begin{document}
    {\tracingmacros=1 ä }
    \end{document}

which gives

    Ã->\UTFviii@two@octets Ã
    \UTFviii@two@octets #1#2->\expandafter \UTFviii@defined \csname u8:#1\string #2
    \endcsname
    #1<-Ã
    #2<-¤
    \UTFviii@defined #1->\ifx #1\relax \if \relax \expandafter \UTFviii@checkseq \s
    tring #1\relax \relax \UTFviii@undefined@err {#1}\else \PackageError {inputenc}
    {Invalid UTF-8 byte sequence}\UTFviii@invalid@help \fi \else \expandafter #1\fi
    #1<-\u8:ä
    \u8:ä ->\IeC {\"a}

That means that the first byte of the ä (the Ã) is an active char, a command which picks up the next byte and then calls \u8:ä, which in turn calls \"a. In this way (pdf)latex can handle quite a lot of utf8 input, but it has problems with e.g. "char + combining accent", as there is no sensible way for the combining accent to go back and add an accent to the preceding char.
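
The mechanism can be imitated with a toy example that uses only ASCII input. This is a rough sketch, not the inputenc code: the character `!` stands in for the active leading byte, and the `toy:` prefix is a made-up name playing the role of `u8:`.

```latex
\documentclass{article}
\begin{document}
\begingroup
  % '!' stands in for the active leading byte (hypothetical toy setup).
  \catcode`\!=\active
  % Grab the next token and build a control sequence name from it,
  % loosely mirroring what \UTFviii@two@octets does with the second byte.
  \def!#1{\csname toy:\string#1\endcsname}
  % Hypothetical lookup entry, playing the role of \u8:ä.
  \expandafter\def\csname toy:a\endcsname{\"a}
  The input !a is typeset like \"a.
\endgroup
\end{document}
```

Here `!a` expands to the control sequence named `toy:a`, which in turn expands to `\"a`, so both spellings in the sentence should produce the same glyph. The group keeps the changed category code of `!` local.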
answered 1 hour ago by Ulrike Fischer
Just a comment to point out that the log file containing the trace was viewed in an editor using some 8-bit encoding, presumably iso-latin-1 (like Emacs does for me), not in UTF-8... so Ã is only one byte. This is tacit in your answer...
– jfbu
1 hour ago
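
To see where the two bytes in the trace come from, the UTF-8 encoding of ä (U+00E4) can be worked out by hand. The following is only an illustrative sketch using TeX's integer registers; the register names `\codepoint`, `\leadbyte` and `\trailbyte` are made up for this example. For a two-byte sequence, the leading byte is 0xC0 plus the code point divided by 64, and the trailing byte is 0x80 plus the remainder.

```latex
\documentclass{article}
\begin{document}
\newcount\codepoint \newcount\leadbyte \newcount\trailbyte
\codepoint="E4                   % U+00E4, i.e. 228
\leadbyte=\codepoint
\divide\leadbyte by 64           % \divide truncates: 228/64 -> 3
\trailbyte=\leadbyte
\multiply\trailbyte by -64       % -192
\advance\trailbyte by \codepoint % 228 - 192 = 36
\advance\leadbyte by "C0         % 3 + 192 = 195 = 0xC3
\advance\trailbyte by "80        % 36 + 128 = 164 = 0xA4
Lead byte: \the\leadbyte, trailing byte: \the\trailbyte.
\end{document}
```

This typesets 195 and 164, i.e. the bytes 0xC3 and 0xA4, which an editor using iso-latin-1 displays as Ã and ¤, matching the `#1<-Ã` and `#2<-¤` lines of the trace.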