<?xml version="1.0"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
                  "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
  <!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'">
  <!ENTITY version SYSTEM "version.xml">
]>
<chapter id="shaping-concepts">
  <title>Shaping concepts</title>
  <section id="text-shaping-concepts">
    <title>Text shaping</title>
    <para>
      Text shaping is the process of transforming a sequence of Unicode
      codepoints that represent individual characters (letters,
      diacritics, tone marks, numbers, symbols, etc.) into the
      orthographically and linguistically correct two-dimensional layout
      of glyph shapes taken from a specified font.
    </para>
    <para>
      For some writing systems (or <emphasis>scripts</emphasis>) and
      languages, the process is simple, requiring the shaper to do
      little more than advance the horizontal position forward by the
      correct amount for each successive glyph.
    </para>
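    <para>
      The simple case can be illustrated with a minimal sketch. It
      assumes a left-to-right run, one glyph per character, and a
      hypothetical table of advance widths
      (<literal>ADVANCES</literal>) standing in for the metrics that a
      real font would supply. It is not how any particular shaping
      engine is implemented.
    </para>
    <programlisting language="python"><![CDATA[def layout_simple_run(text, advances, default_advance=500):
    """Place each glyph at the current pen position, then advance the pen."""
    pen_x = 0
    positions = []
    for ch in text:
        positions.append((ch, pen_x))
        # Move the pen forward by this glyph's advance width.
        pen_x += advances.get(ch, default_advance)
    return positions, pen_x


# Hypothetical advance widths, in font units (not taken from a real font).
ADVANCES = {"c": 500, "a": 550, "t": 300}

positions, total_width = layout_simple_run("cat", ADVANCES)]]></programlisting>
    <para>
      Here <literal>positions</literal> pairs each character with its
      horizontal offset, and <literal>total_width</literal> is the
      final pen position after the last glyph.
    </para>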
    <para>
      For <emphasis>complex scripts</emphasis>, however, any
      combination of several shaping operations may be required, and
      the rules for how and when they are applied vary from script to
      script. HarfBuzz and other shaping engines implement these rules.
    </para>
    <para>
      The exact rules and necessary operations for a particular script
      constitute a shaping <emphasis>model</emphasis>. OpenType
      specifies a set of shaping models that covers all of
      Unicode. Other shaping models are available, however, including
      Graphite and Apple Advanced Typography (AAT).
    </para>
  </section>

  <section id="complex-scripts">
    <title>Complex scripts</title>
    <para>
      In text-shaping terminology, scripts are generally classified as
      either <emphasis>complex</emphasis> or <emphasis>non-complex</emphasis>.
    </para>
    <para>
      Complex scripts are those for which transforming the input
      sequence into the final layout requires some combination of
      operations, such as context-dependent substitutions,
      context-dependent mark positioning, glyph-to-glyph joining,
      glyph reordering, or glyph stacking.
    </para>
    <para>
      In some complex scripts, the shaping rules require that a text
      run be divided into syllables before the operations can be
      applied. Other complex scripts may apply shaping operations over
      entire words or over the entire text run, with no subdivision
      required.
    </para>
    <para>
      Non-complex scripts, by definition, do not require these
      operations. However, correctly shaping a text run in a
      non-complex script may still involve Unicode normalization,
      ligature substitutions, mark positioning, kerning, and applying
      other font features. The key difference is that a text run in a
      non-complex script can be processed sequentially and in the same
      order as the input sequence of Unicode codepoints, without
      requiring an analysis stage.
    </para>
  </section>

  <section id="shaping-operations">
    <title>Shaping operations</title>
    <para>
      Shaping a complex-script text run involves transforming the
      input sequence of Unicode codepoints with some combination of
      operations that is specified in the shaping model for the
      script.
    </para>
    <para>
      The specific conditions that trigger a given operation for a
      text run vary from script to script, as do the order in which
      the operations are performed and the codepoints that are
      affected. However, the same general set of shaping operations is
      common to all of the complex-script shaping models.
    </para>

    <itemizedlist>
      <listitem>
        <para>
          A <emphasis>reordering</emphasis> operation moves a glyph
          from its original ("logical") position in the sequence to
          some other ("visual") position.
        </para>
        <para>
          The shaping model for a given complex script might involve
          more than one reordering step.
        </para>
      </listitem>

      <listitem>
        <para>
          A <emphasis>joining</emphasis> operation replaces a glyph
          with an alternate form that is designed to connect with one
          or more of the adjacent glyphs in the sequence.
        </para>
      </listitem>

      <listitem>
        <para>
          A contextual <emphasis>substitution</emphasis> operation
          replaces either a single glyph or a subsequence of several
          glyphs with an alternate glyph. This substitution is
          performed when the original glyph or subsequence of glyphs
          occurs in a specified position with respect to the
          surrounding sequence. For example, one substitution might be
          performed only when the target glyph is the first glyph in
          the sequence, while another substitution is performed only
          when a different target glyph occurs immediately after a
          particular string pattern.
        </para>
        <para>
          The shaping model for a given complex script might involve
          multiple contextual-substitution operations, each applied
          to different target glyphs and patterns and each performed
          in a separate step.
        </para>
      </listitem>

      <listitem>
        <para>
          A contextual <emphasis>positioning</emphasis> operation
          adjusts the horizontal and/or vertical position of a
          glyph. This adjustment is performed when the glyph
          occurs in a specified position with respect to the
          surrounding sequence.
        </para>
        <para>
          Many contextual positioning operations are used to place
          <emphasis>mark</emphasis> glyphs (such as diacritics, vowel
          signs, and tone markers) with respect to
          <emphasis>base</emphasis> glyphs. However, some complex
          scripts may use contextual positioning operations to
          correctly place base glyphs as well, such as
          when the script uses <emphasis>stacking</emphasis> characters.
        </para>
      </listitem>

    </itemizedlist>
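    <para>
      As an illustration of the reordering operation described above,
      the following toy sketch moves a pre-base vowel sign in front of
      the consonant it follows in logical order. It works on codepoints
      rather than glyphs, knows about a single Devanagari vowel sign,
      and moves the sign past only one preceding character; a real
      shaping model reorders relative to the whole syllable. The names
      <literal>PRE_BASE_MARKS</literal> and
      <function>reorder_pre_base</function> are hypothetical.
    </para>
    <programlisting language="python"><![CDATA[# DEVANAGARI VOWEL SIGN I is written to the left of its consonant,
# although it follows the consonant in logical (input) order.
PRE_BASE_MARKS = {"\u093F"}


def reorder_pre_base(codepoints):
    """Return the sequence in a simplified visual order."""
    visual = []
    for ch in codepoints:
        if ch in PRE_BASE_MARKS and visual:
            # Insert the mark before the most recently emitted character.
            visual.insert(len(visual) - 1, ch)
        else:
            visual.append(ch)
    return visual


logical = ["\u0915", "\u093F"]        # KA, VOWEL SIGN I
visual = reorder_pre_base(logical)    # VOWEL SIGN I, KA]]></programlisting>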
  </section>

  <section id="unicode-character-categories">
    <title>Unicode character categories</title>
    <para>
      Shaping models are typically specified with respect to how
      scripts are defined in the Unicode standard.
    </para>
    <para>
      Every codepoint in the Unicode Character Database (UCD) is
      assigned a <emphasis>Unicode General Category</emphasis> (UGC),
      which provides the most fundamental information about the
      codepoint: whether the codepoint represents a
      <emphasis>Letter</emphasis>, a <emphasis>Mark</emphasis>, a
      <emphasis>Number</emphasis>, <emphasis>Punctuation</emphasis>, a
      <emphasis>Symbol</emphasis>, a <emphasis>Separator</emphasis>,
      or something else (<emphasis>Other</emphasis>).
    </para>
    <para>
      These UGC properties are "Major" categories. Each codepoint is
      further assigned to a "minor" category within its Major
      category, such as "Letter, uppercase" (<literal>Lu</literal>) or
      "Letter, modifier" (<literal>Lm</literal>).
    </para>
    <para>
      Shaping models are concerned primarily with Letter and Mark
      codepoints. The minor categories of Mark codepoints are
      particularly important for shaping. Marks can be nonspacing
      (<literal>Mn</literal>), spacing combining
      (<literal>Mc</literal>), or enclosing (<literal>Me</literal>).
    </para>
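    <para>
      The General Category of a codepoint can be inspected with any
      Unicode-aware library. The sketch below uses Python's standard
      <literal>unicodedata</literal> module; the helper
      <function>major_category</function> is a hypothetical
      convenience that returns the first letter of the two-letter
      category code.
    </para>
    <programlisting language="python"><![CDATA[import unicodedata


def major_category(ch):
    """Return the Major category letter (L, M, N, P, S, Z, or C)."""
    return unicodedata.category(ch)[0]


samples = [
    "A",        # LATIN CAPITAL LETTER A
    "\u02B0",   # MODIFIER LETTER SMALL H
    "\u0301",   # COMBINING ACUTE ACCENT
    "\u093E",   # DEVANAGARI VOWEL SIGN AA
    "\u20DD",   # COMBINING ENCLOSING CIRCLE
]

for ch in samples:
    # Print the codepoint, its two-letter category, and its Major category.
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), major_category(ch))]]></programlisting>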
    <para>
      In addition to the UGC property, codepoints in the Indic and
      Southeast Asian scripts are also assigned
      <emphasis>Unicode Indic Syllabic Category</emphasis> (UISC) and
      <emphasis>Unicode Indic Positional Category</emphasis> (UIPC)
      properties that provide more detailed information needed for
      shaping.
    </para>
    <para>
      The UISC property sub-categorizes Letters and Marks according to
      common script-shaping behaviors. For example, UISC distinguishes
      between consonant letters, vowel letters, and vowel marks. The
      UIPC property sub-categorizes Mark codepoints by the relative visual
      position that they occupy (above, below, right, left, or in
      multiple positions).
    </para>
    <para>
      Some complex scripts require that the text run be split into
      syllables. What constitutes a valid syllable in these
      scripts is specified in regular expressions, formed from the
      Letter and Mark codepoints, that take the UISC and UIPC
      properties into account.
    </para>
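    <para>
      The idea of matching syllables with a regular expression over
      character classes can be sketched as follows. Everything here is
      a simplification: the class table
      <literal>SYLLABIC_CLASS</literal> is a small hand-written
      stand-in for the UISC data, and the pattern is a toy that covers
      only consonant clusters with an optional vowel sign. The
      syllable expressions used by real shaping models are
      considerably more detailed.
    </para>
    <programlisting language="python"><![CDATA[import re

# Hypothetical, hand-written classes for a few Devanagari codepoints:
# C = consonant, H = halant (virama), M = dependent vowel sign.
SYLLABIC_CLASS = {
    "\u0915": "C",  # KA
    "\u0924": "C",  # TA
    "\u0930": "C",  # RA
    "\u0937": "C",  # SSA
    "\u094D": "H",  # SIGN VIRAMA
    "\u093E": "M",  # VOWEL SIGN AA
    "\u093F": "M",  # VOWEL SIGN I
}

# Toy syllable: zero or more consonant+halant pairs, a final
# consonant, and an optional vowel sign.
SYLLABLE = re.compile(r"(?:CH)*CM?")


def split_syllables(text):
    """Split text into toy syllables; unmatched codepoints are skipped."""
    # One class letter per codepoint, so match offsets index into text.
    classes = "".join(SYLLABIC_CLASS.get(ch, "X") for ch in text)
    return [text[m.start():m.end()] for m in SYLLABLE.finditer(classes)]


# KA, VIRAMA, SSA, TA, VIRAMA, RA, VOWEL SIGN I
syllables = split_syllables("\u0915\u094D\u0937\u0924\u094D\u0930\u093F")]]></programlisting>
    <para>
      With the toy pattern above, the sample string splits into two
      syllables: KA, VIRAMA, SSA followed by TA, VIRAMA, RA, VOWEL
      SIGN I.
    </para>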

  </section>

  <section id="text-runs">
    <title>Text runs</title>
    <para>
      Real-world text usually contains codepoints from a mixture of
      different Unicode scripts (including punctuation, numbers, symbols,
      white-space characters, and other codepoints that do not belong
      to any script). Real-world text may also be marked up with
      formatting that changes font properties (including the font,
      font style, and font size).
    </para>
    <para>
      For shaping purposes, all real-world text streams must first be
      segmented into runs that have a uniform set of properties.
    </para>
    <para>
      In particular, shaping models always assume that every codepoint
      in a text run has the same <emphasis>direction</emphasis>,
      <emphasis>script</emphasis> tag, and
      <emphasis>language</emphasis> tag.
    </para>
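    <para>
      A rough sketch of this segmentation step is shown below. It is
      only an approximation: the script of each codepoint is guessed
      from a tiny hypothetical range table
      (<literal>SCRIPT_RANGES</literal>) rather than from the Unicode
      Script property, codepoints with no script of their own simply
      inherit the script of the preceding text, and direction is
      derived from the script alone. Language and font properties,
      which cannot be inferred from the codepoints, are left out.
    </para>
    <programlisting language="python"><![CDATA[from itertools import groupby

# Hypothetical, deliberately incomplete table: (first, last, script tag).
SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latn"),
    (0x0061, 0x007A, "Latn"),
    (0x0590, 0x05FF, "Hebr"),
    (0x0600, 0x06FF, "Arab"),
]
RTL_SCRIPTS = {"Hebr", "Arab"}
COMMON = "Zyyy"  # tag used here for codepoints with no script of their own


def script_of(ch):
    cp = ord(ch)
    for first, last, tag in SCRIPT_RANGES:
        if first <= cp <= last:
            return tag
    return COMMON


def segment_runs(text):
    """Split text into (substring, script, direction) runs."""
    resolved = []
    current = COMMON
    for ch in text:
        tag = script_of(ch)
        if tag != COMMON:
            current = tag
        # Spaces, digits, punctuation, etc. inherit the preceding script.
        resolved.append(current)

    runs = []
    start = 0
    for tag, group in groupby(resolved):
        length = len(list(group))
        direction = "rtl" if tag in RTL_SCRIPTS else "ltr"
        runs.append((text[start:start + length], tag, direction))
        start += length
    return runs


runs = segment_runs("abc \u05E9\u05DC\u05D5\u05DD")]]></programlisting>
    <para>
      Each resulting run has a single script tag and direction, so it
      could be handed to a shaper as one unit.
    </para>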
  </section>

  <section id="opentype-shaping-models">
    <title>OpenType shaping models</title>
    <para>
      OpenType provides the following shaping models:
    </para>

    <itemizedlist>
      <listitem>
        <para>
          The <emphasis>default</emphasis> shaping model handles all
          non-complex scripts, and may also be used as a fallback for
          handling unrecognized scripts.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Indic</emphasis> shaping model handles the Indic
          scripts Bengali, Devanagari, Gujarati, Gurmukhi, Kannada,
          Malayalam, Oriya, Tamil, Telugu, and Sinhala.
        </para>
        <para>
          The Indic shaping model was revised significantly in
          2005. To denote the change, a new set of <emphasis>script
          tags</emphasis> was assigned for Bengali, Devanagari,
          Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, and
          Telugu. For example, the revised Devanagari tag is
          <literal>dev2</literal>, where the original tag was
          <literal>deva</literal>. For the sake of clarity, the term
          "Indic2" is sometimes used to refer to the current, revised
          shaping model.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Arabic</emphasis> shaping model supports
          Arabic, Mongolian, N'Ko, Syriac, and several other connected
          or cursive scripts.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Thai/Lao</emphasis> shaping model supports
          the Thai and Lao scripts.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Khmer</emphasis> shaping model supports the
          Khmer script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Myanmar</emphasis> shaping model supports the
          Myanmar (or Burmese) script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Tibetan</emphasis> shaping model supports the
          Tibetan script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Hangul</emphasis> shaping model supports the
          Hangul script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Hebrew</emphasis> shaping model supports the
          Hebrew script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Universal Shaping Engine</emphasis> (USE)
          shaping model supports complex scripts not covered by one of
          the script-specific shaping models above, including
          Javanese, Balinese, Buginese, Batak, Chakma, Lepcha, Modi,
          Phags-pa, Tagalog, Siddham, Sundanese, Tai Le, Tai Tham, Tai
          Viet, and many others.
        </para>
      </listitem>

      <listitem>
        <para>
          Text runs that do not fall under one of the above shaping
          models may still require processing by a shaping engine. Of
          particular note is <emphasis>Emoji</emphasis> shaping, which
          may involve variation-selector sequences and glyph
          substitution. Emoji shaping is handled by the default
          shaping model.
        </para>
      </listitem>

    </itemizedlist>
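    <para>
      The list above amounts to a lookup from a run's script to a
      shaping model, with the default model as the fallback. The
      sketch below expresses that lookup using ISO 15924 script codes.
      The model names, the table, and
      <function>select_shaping_model</function> are hypothetical
      illustrations of the list, not the data structures of any
      particular shaping engine, and the table includes only the
      scripts named above.
    </para>
    <programlisting language="python"><![CDATA[# Script codes grouped by the shaping model listed for them above.
MODEL_SCRIPTS = {
    "indic": ["Beng", "Deva", "Gujr", "Guru", "Knda",
              "Mlym", "Orya", "Taml", "Telu", "Sinh"],
    "arabic": ["Arab", "Mong", "Nkoo", "Syrc"],
    "thai-lao": ["Thai", "Laoo"],
    "khmer": ["Khmr"],
    "myanmar": ["Mymr"],
    "tibetan": ["Tibt"],
    "hangul": ["Hang"],
    "hebrew": ["Hebr"],
    "use": ["Java", "Bali", "Bugi", "Batk", "Cakm", "Lepc", "Modi",
            "Phag", "Tglg", "Sidd", "Sund", "Tale", "Lana", "Tavt"],
}

# Invert the table so that each script code maps to one model name.
MODEL_BY_SCRIPT = {
    script: model
    for model, scripts in MODEL_SCRIPTS.items()
    for script in scripts
}


def select_shaping_model(script):
    """Return the model for a script code, falling back to the default."""
    return MODEL_BY_SCRIPT.get(script, "default")


model = select_shaping_model("Deva")]]></programlisting>
    <para>
      A script that is not in the table, such as Latin
      (<literal>Latn</literal>), falls through to the default model.
    </para>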

  </section>

  <section id="graphite-shaping">
    <title>Graphite shaping</title>
    <para>
      In contrast to OpenType shaping, Graphite shaping does not
      specify a predefined set of shaping models or a set of supported
      scripts.
    </para>
    <para>
      Instead, each Graphite font contains a complete set of rules that
      implement the required shaping model for the intended
      script. These rules include finite-state machines that match
      sequences of codepoints to the shaping operations to perform.
    </para>
    <para>
      Graphite shaping can perform the same shaping operations used in
      OpenType shaping, as well as other functions that have not been
      defined for OpenType shaping.
    </para>
  </section>

  <section id="aat-shaping">
    <title>AAT shaping</title>
    <para>
      In contrast to OpenType shaping, AAT shaping does not specify a
      predefined set of shaping models or a set of supported scripts.
    </para>
    <para>
      Instead, each AAT font includes a complete set of rules that
      implement the desired shaping model for the intended
      script. These rules include finite-state machines that match
      glyph sequences to the shaping operations to perform.
    </para>
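    <para>
      To give a feel for rules expressed as a finite-state machine
      over glyphs, the toy sketch below walks a list of glyph names
      and replaces an <literal>f</literal> followed by an
      <literal>i</literal> with a single ligature glyph. The states,
      the transition table, and the glyph names are all hypothetical,
      and this is not the binary state-table format that a font
      actually stores.
    </para>
    <programlisting language="python"><![CDATA[# (current state, incoming glyph) -> next state.
# Any pair that is not listed resets the machine to "start".
TRANSITIONS = {
    ("start", "f"): "saw_f",
    ("saw_f", "f"): "saw_f",
    ("saw_f", "i"): "ligate",
}


def apply_ligature_machine(glyphs):
    """Run the toy state machine over a sequence of glyph names."""
    out = []
    state = "start"
    for glyph in glyphs:
        state = TRANSITIONS.get((state, glyph), "start")
        if state == "ligate":
            # Action: merge the pending "f" and this "i" into one glyph.
            out[-1] = "f_i"
            state = "start"
        else:
            out.append(glyph)
    return out


shaped = apply_ligature_machine(["f", "i", "n"])]]></programlisting>
    <para>
      The machine never looks at codepoints: its input and output are
      both glyph sequences.
    </para>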
    <para>
      Notably, AAT shaping rules are expressed for glyphs in the font,
      not for Unicode codepoints. AAT shaping can perform the same
      shaping operations used in OpenType shaping, as well as other
      functions that have not been defined for OpenType shaping.
    </para>
  </section>
</chapter>