Blame - docs/usermanual-shaping-concepts.xml - platform/external/harfbuzz_ng

				1	<?xml version="1.0"?>
				2	<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
				3	"http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
				4	<!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'">
				5	<!ENTITY version SYSTEM "version.xml">
				6	]>
				7	<chapter id="shaping-concepts">
				8	<title>Shaping concepts</title>
				9	<section id="text-shaping-concepts">
				10	<title>Text shaping</title>
				11	<para>
				12	Text shaping is the process of transforming a sequence of Unicode
				13	codepoints that represent individual characters (letters,
				14	diacritics, tone marks, numbers, symbols, etc.) into the
				15	orthographically and linguistically correct two-dimensional layout
				16	of glyph shapes taken from a specified font.
				17	</para>
				18	<para>
				19	For some writing systems (or <emphasis>scripts</emphasis>) and
				20	languages, the process is simple, requiring the shaper to do
				21	little more than advance the horizontal position forward by the
				22	correct amount for each successive glyph.
				23	</para>
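<para>
As a rough illustration of this simple case, the following Python
sketch places each glyph by accumulating advance widths. The
advance values and the function name are invented for the example;
a real shaper reads the advances from the font.
</para>
<programlisting language="python"><![CDATA[
```python
# Hypothetical advance widths in font units; real values come from the font.
ADVANCES = {"H": 722, "i": 278, "!": 333}


def layout_simple(text, advances=ADVANCES):
    """Return ((character, x_position) pairs, total advance) for a simple run."""
    pen_x = 0
    positions = []
    for ch in text:
        # Each glyph is placed at the current pen position ...
        positions.append((ch, pen_x))
        # ... and the pen then moves forward by that glyph's advance.
        pen_x += advances[ch]
    return positions, pen_x
```
]]></programlisting>
<para>
The sketch maps each character to exactly one glyph and ignores
kerning and marks, which is the simplification that complex scripts
cannot make.
</para>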
				24	<para>
				25	For <emphasis>complex scripts</emphasis>, however, any combination of
				26	several shaping operations may be required, and the rules for how
				27	and when they are applied vary from script to script. HarfBuzz and
				28	other shaping engines implement these rules.
				29	</para>
				30	<para>
				31	The exact rules and necessary operations for a particular script
				32	constitute a shaping <emphasis>model</emphasis>. OpenType
				33	specifies a set of shaping models that covers all of
				34	Unicode. Other shaping models are available, however, including
				35	Graphite and Apple Advanced Typography (AAT).
				36	</para>
				37	</section>
				38
				39	<section id="complex-scripts">
				40	<title>Complex scripts</title>
				41	<para>
				42	In text-shaping terminology, scripts are generally classified as
				43	either <emphasis>complex</emphasis> or <emphasis>non-complex</emphasis>.
				44	</para>
				45	<para>
				46	Complex scripts are those for which transforming the input
				47	sequence into the final layout requires some combination of
				48	operations—such as context-dependent substitutions,
				49	context-dependent mark positioning, glyph-to-glyph joining,
				50	glyph reordering, or glyph stacking.
				51	</para>
				52	<para>
				53	In some complex scripts, the shaping rules require that a text
				54	run be divided into syllables before the operations can be
				55	applied. Other complex scripts may apply shaping operations over
				56	entire words or over the entire text run, with no subdivision
				57	required.
				58	</para>
				59	<para>
				60	Non-complex scripts, by definition, do not require these
				61	operations. However, correctly shaping a text run in a
				62	non-complex script may still involve Unicode normalization,
				63	ligature substitutions, mark positioning, kerning, and applying
				64	other font features. The key difference is that a text run in a
				65	non-complex script can be processed sequentially and in the same
				66	order as the input sequence of Unicode codepoints, without
				67	requiring an analysis stage.
				68	</para>
				69	</section>
				70
				71	<section id="shaping-operations">
				72	<title>Shaping operations</title>
				73	<para>
				74	Shaping a complex-script text run involves transforming the
				75	input sequence of Unicode codepoints with some combination of
				76	operations that is specified in the shaping model for the
				77	script.
				78	</para>
				79	<para>
				80	The specific conditions that trigger a given operation for a
				81	text run vary from script to script, as do the order in which the
				82	operations are performed and which codepoints are
				83	affected. However, the same general set of shaping operations is
				84	common to all of the complex-script shaping models.
				85	</para>
				86
				87	<itemizedlist>
				88	<listitem>
				89	<para>
				90	A <emphasis>reordering</emphasis> operation moves a glyph
				91	from its original ("logical") position in the sequence to
				92	some other ("visual") position.
				93	</para>
				94	<para>
				95	The shaping model for a given complex script might involve
				96	more than one reordering step.
				97	</para>
				98	</listitem>
				99
				100	<listitem>
				101	<para>
				102	A <emphasis>joining</emphasis> operation replaces a glyph
				103	with an alternate form that is designed to connect with one
				104	or more of the adjacent glyphs in the sequence.
				105	</para>
				106	</listitem>
				107
				108	<listitem>
				109	<para>
				110	A contextual <emphasis>substitution</emphasis> operation
				111	replaces either a single glyph or a subsequence of several
				112	glyphs with an alternate glyph. This substitution is
				113	performed when the original glyph or subsequence of glyphs
				114	occurs in a specified position with respect to the
				115	surrounding sequence. For example, one substitution might be
				116	performed only when the target glyph is the first glyph in
				117	the sequence, while another substitution is performed only
				118	when a different target glyph occurs immediately after a
				119	particular string pattern.
				120	</para>
				121	<para>
				122	The shaping model for a given complex script might involve
				123	multiple contextual-substitution operations, each applying
				124	to different target glyphs and patterns and each
				125	performed in a separate step.
				126	</para>
				127	</listitem>
				128
				129	<listitem>
				130	<para>
				131	A contextual <emphasis>positioning</emphasis> operation
				132	adjusts the horizontal and/or vertical position of a
				133	glyph. This adjustment is performed when the glyph
				134	occurs in a specified position with respect to the
				135	surrounding sequence.
				136	</para>
				137	<para>
				138	Many contextual positioning operations are used to place
				139	<emphasis>mark</emphasis> glyphs (such as diacritics, vowel
				140	signs, and tone markers) with respect to
				141	<emphasis>base</emphasis> glyphs. However, some complex
				142	scripts may use contextual positioning operations to
				143	correctly place base glyphs as well, such as
				144	when the script uses <emphasis>stacking</emphasis> characters.
				145	</para>
				146	</listitem>
				147
				148	</itemizedlist>
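<para>
Two of the operations listed above, reordering and joining, can be
sketched in a few lines of Python. This is a toy model under stated
assumptions: it works on codepoints rather than on font glyphs, the
two lookup tables are hand-written and cover only a handful of
characters, and the function names are invented for the example.
</para>
<programlisting language="python"><![CDATA[
```python
# Toy data: real engines take these properties from Unicode data and the font.
PRE_BASE_MARKS = {"\u093F"}  # DEVANAGARI VOWEL SIGN I, drawn before its base
JOINING_TYPE = {
    "\u0628": "D",  # ARABIC LETTER BEH, dual-joining
    "\u0644": "D",  # ARABIC LETTER LAM, dual-joining
    "\u0627": "R",  # ARABIC LETTER ALEF, right-joining
    "\u062F": "R",  # ARABIC LETTER DAL, right-joining
}


def reorder_pre_base(logical):
    """Move each pre-base mark in front of the item that precedes it."""
    visual = []
    for ch in logical:
        if ch in PRE_BASE_MARKS and visual:
            visual.insert(len(visual) - 1, ch)
        else:
            visual.append(ch)
    return visual


def joining_forms(letters):
    """Pick a positional form name for each letter, in logical order."""
    letters = list(letters)
    previous = [None] + letters[:-1]
    following = letters[1:] + [None]
    forms = []
    for ch, prev_ch, next_ch in zip(letters, previous, following):
        this_type = JOINING_TYPE.get(ch)
        # A letter connects backward only if the previous letter joins onward.
        joins_prev = JOINING_TYPE.get(prev_ch) == "D" and this_type in ("D", "R")
        # A letter connects forward only if it is itself dual-joining.
        joins_next = this_type == "D" and JOINING_TYPE.get(next_ch) in ("D", "R")
        if joins_prev and joins_next:
            forms.append("medi")
        elif joins_next:
            forms.append("init")
        elif joins_prev:
            forms.append("fina")
        else:
            forms.append("isol")
    return forms
```
]]></programlisting>
<para>
The reordering function ignores conjuncts, so it is only correct
when the mark follows a single consonant. A real shaping model
reorders relative to the base of the whole syllable.
</para>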
				149	</section>
				150
				151	<section id="unicode-character-categories">
				152	<title>Unicode character categories</title>
				153	<para>
				154	Shaping models are typically specified with respect to how
				155	scripts are defined in the Unicode standard.
				156	</para>
				157	<para>
				158	Every codepoint in the Unicode Character Database (UCD) is
				159	assigned a <emphasis>Unicode General Category</emphasis> (UGC),
				160	which provides the most fundamental information about the
				161	codepoint: whether the codepoint represents a
				162	<emphasis>Letter</emphasis>, a <emphasis>Mark</emphasis>, a
				163	<emphasis>Number</emphasis>, <emphasis>Punctuation</emphasis>, a
				164	<emphasis>Symbol</emphasis>, a <emphasis>Separator</emphasis>,
				165	or something else (<emphasis>Other</emphasis>).
				166	</para>
				167	<para>
				168	These UGC properties are "Major" categories. Each codepoint is
				169	further assigned to a "minor" category within its Major
				170	category, such as "Letter, uppercase" (<literal>Lu</literal>) or
				171	"Letter, modifier" (<literal>Lm</literal>).
				172	</para>
				173	<para>
				174	Shaping models are concerned primarily with Letter and Mark
				175	codepoints. The minor categories of Mark codepoints are
				176	particularly important for shaping. Marks can be nonspacing
				177	(<literal>Mn</literal>), spacing combining
				178	(<literal>Mc</literal>), or enclosing (<literal>Me</literal>).
				179	</para>
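<para>
The General Category of a codepoint can be inspected with the
<literal>unicodedata</literal> module in the Python standard
library. The sketch below is only an illustration: the
<literal>MAJOR_NAMES</literal> table and the
<literal>describe</literal> helper are names made up for this
example.
</para>
<programlisting language="python"><![CDATA[
```python
import unicodedata

# First letter of the two-letter category code -> Major category name.
MAJOR_NAMES = {
    "L": "Letter",
    "M": "Mark",
    "N": "Number",
    "P": "Punctuation",
    "S": "Symbol",
    "Z": "Separator",
    "C": "Other",
}


def describe(ch):
    """Return (two-letter category code, Major category name) for a character."""
    category = unicodedata.category(ch)
    return category, MAJOR_NAMES[category[0]]


if __name__ == "__main__":
    for sample in ("A", "\u02B0", "\u0301", "\u093E", "\u20DD"):
        print("U+%04X" % ord(sample), describe(sample))
```
]]></programlisting>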
				180	<para>
				181	In addition to the UGC property, codepoints in the Indic and
				182	Southeast Asian scripts are also assigned
				183	<emphasis>Unicode Indic Syllabic Category</emphasis> (UISC) and
				184	<emphasis>Unicode Indic Positional Category</emphasis> (UIPC)
				185	properties that provide more detailed information needed for
				186	shaping.
				187	</para>
				188	<para>
				189	The UISC property sub-categorizes Letters and Marks according to
				190	common script-shaping behaviors. For example, UISC distinguishes
				191	between consonant letters, vowel letters, and vowel marks. The
				192	UIPC property sub-categorizes Mark codepoints by the relative visual
				193	position that they occupy (above, below, right, left, or in
				194	multiple positions).
				195	</para>
				196	<para>
				197	Some complex scripts require that the text run be split into
				198	syllables. What constitutes a valid syllable in these
				199	scripts is specified in regular expressions, formed from the
				200	Letter and Mark codepoints, that take the UISC and UIPC
				201	properties into account.
				202	</para>
				203
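<para>
The idea of describing syllables with a regular expression over
character classes can be sketched as follows. The class table is a
hand-written stand-in for the UISC data and covers only a few
Devanagari codepoints, and the pattern is far simpler than the one
in any real shaping model; both are assumptions made for the
example.
</para>
<programlisting language="python"><![CDATA[
```python
import re

# Hand-written stand-in for UISC data (a few Devanagari codepoints only).
# C = consonant, H = virama, M = dependent vowel sign.
CHAR_CLASS = {
    "\u0915": "C",  # KA
    "\u0924": "C",  # TA
    "\u0937": "C",  # SSA
    "\u094D": "H",  # SIGN VIRAMA
    "\u093E": "M",  # VOWEL SIGN AA
    "\u093F": "M",  # VOWEL SIGN I
}

# Toy syllable: any number of consonant+virama pairs, a consonant, and an
# optional vowel sign. Anything else becomes a one-character cluster.
SYLLABLE = re.compile(r"(?:CH)*CM?|.")


def split_syllables(text):
    """Split text into clusters by matching the pattern against class letters."""
    classes = "".join(CHAR_CLASS.get(ch, "X") for ch in text)
    return [text[m.start():m.end()] for m in SYLLABLE.finditer(classes)]
```
]]></programlisting>
<para>
Because every character maps to exactly one class letter, the
match offsets in the class string can be used directly as offsets
into the original text.
</para>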
				204	</section>
				205
				206	<section id="text-runs">
				207	<title>Text runs</title>
				208	<para>
				209	Real-world text usually contains codepoints from a mixture of
				210	different Unicode scripts, along with punctuation, numbers, symbols,
				211	white-space characters, and other codepoints that do not belong
				212	to any single script. Real-world text may also be marked up with
				213	formatting that changes font properties (including the font,
				214	font style, and font size).
				215	</para>
				216	<para>
				217	For shaping purposes, all real-world text streams must first be
				218	segmented into runs that have a uniform set of properties.
				219	</para>
				220	<para>
				221	In particular, shaping models always assume that every codepoint
				222	in a text run has the same <emphasis>direction</emphasis>,
				223	<emphasis>script</emphasis> tag, and
				224	<emphasis>language</emphasis> tag.
				225	</para>
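<para>
A minimal sketch of segmenting by script is shown below. The Python
standard library does not expose the Unicode script property, so
the sketch guesses the script from the first word of each
character's Unicode name. That heuristic, the
<literal>KNOWN_SCRIPTS</literal> set, and the function names are
assumptions made for the example. Direction, language, and font
changes are ignored here, although a real segmenter must split on
those as well.
</para>
<programlisting language="python"><![CDATA[
```python
import unicodedata

# Scripts this toy recognizes; every other character is treated as "COMMON".
KNOWN_SCRIPTS = {"LATIN", "GREEK", "CYRILLIC", "HEBREW", "ARABIC", "DEVANAGARI"}


def guess_script(ch):
    """Heuristic: use the first word of the Unicode character name."""
    first_word = unicodedata.name(ch, "").split(" ")[0]
    return first_word if first_word in KNOWN_SCRIPTS else "COMMON"


def split_runs(text):
    """Split text into (script, substring) runs.

    Characters with no script of their own (spaces, digits, punctuation)
    are attached to the run that is already open.
    """
    runs = []  # each entry is a mutable [script, text] pair
    for ch in text:
        script = guess_script(ch)
        if runs and (script == "COMMON" or runs[-1][0] in ("COMMON", script)):
            if runs[-1][0] == "COMMON":
                # A run that began with script-less characters adopts the
                # script of the first real letter that follows them.
                runs[-1][0] = script
            runs[-1][1] += ch
        else:
            runs.append([script, ch])
    return [tuple(run) for run in runs]
```
]]></programlisting>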
				226	</section>
				227
				228	<section id="opentype-shaping-models">
				229	<title>OpenType shaping models</title>
				230	<para>
				231	OpenType provides the following shaping models:
				232	</para>
				233
				234	<itemizedlist>
				235	<listitem>
				236	<para>
				237	The <emphasis>default</emphasis> shaping model handles all
				238	non-complex scripts, and may also be used as a fallback for
				239	handling unrecognized scripts.
				240	</para>
				241	</listitem>
				242
				243	<listitem>
				244	<para>
				245	The <emphasis>Indic</emphasis> shaping model handles the Indic
				246	scripts Bengali, Devanagari, Gujarati, Gurmukhi, Kannada,
				247	Malayalam, Oriya, Tamil, Telugu, and Sinhala.
				248	</para>
				249	<para>
				250	The Indic shaping model was revised significantly in
				251	2005. To denote the change, a new set of <emphasis>script
				252	tags</emphasis> was assigned for Bengali, Devanagari,
				253	Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, and
				254	Telugu. For the sake of clarity, the term "Indic2" is
				255	sometimes used to refer to the current, revised shaping
				256	model.
				257	</para>
				258	</listitem>
				259
				260	<listitem>
				261	<para>
				262	The <emphasis>Arabic</emphasis> shaping model supports
				263	Arabic, Mongolian, N'Ko, Syriac, and several other connected
				264	or cursive scripts.
				265	</para>
				266	</listitem>
				267
				268	<listitem>
				269	<para>
				270	The <emphasis>Thai/Lao</emphasis> shaping model supports
				271	the Thai and Lao scripts.
				272	</para>
				273	</listitem>
				274
				275	<listitem>
				276	<para>
				277	The <emphasis>Khmer</emphasis> shaping model supports the
				278	Khmer script.
				279	</para>
				280	</listitem>
				281
				282	<listitem>
				283	<para>
				284	The <emphasis>Myanmar</emphasis> shaping model supports the
				285	Myanmar (or Burmese) script.
				286	</para>
				287	</listitem>
				288
				289	<listitem>
				290	<para>
				291	The <emphasis>Tibetan</emphasis> shaping model supports the
				292	Tibetan script.
				293	</para>
				294	</listitem>
				295
				296	<listitem>
				297	<para>
				298	The <emphasis>Hangul</emphasis> shaping model supports the
				299	Hangul script.
				300	</para>
				301	</listitem>
				302
				303	<listitem>
				304	<para>
				305	The <emphasis>Hebrew</emphasis> shaping model supports the
				306	Hebrew script.
				307	</para>
				308	</listitem>
				309
				310	<listitem>
				311	<para>
				312	The <emphasis>Universal Shaping Engine</emphasis> (USE)
				313	shaping model supports complex scripts not covered by one of
				314	the script-specific shaping models above, including
				315	Javanese, Balinese, Buginese, Batak, Chakma, Lepcha, Modi,
				316	Phags-pa, Tagalog, Siddham, Sundanese, Tai Le, Tai Tham, Tai
				317	Viet, and many others.
				318	</para>
				319	</listitem>
				320
				321	<listitem>
				322	<para>
				323	Text runs that do not fall under one of the above shaping
				324	models may still require processing by a shaping engine. Of
				325	particular note is <emphasis>Emoji</emphasis> shaping, which
				326	may involve variation-selector sequences and glyph
				327	substitution. Emoji shaping is handled by the default
				328	shaping model.
				329	</para>
				330	</listitem>
				331
				332	</itemizedlist>
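<para>
The list above can be read as a lookup from script to shaping
model, with the default model as the fallback. The sketch below
encodes a subset of it in Python. The table and function names are
invented for the example, and the entries are only those scripts
named in this section. The second table lists the original and
revised OpenType script tags for the nine scripts that received new
tags.
</para>
<programlisting language="python"><![CDATA[
```python
# Script name -> shaping model, for a subset of the scripts named above.
SHAPING_MODEL = {
    "Bengali": "Indic", "Devanagari": "Indic", "Sinhala": "Indic",
    "Tamil": "Indic", "Telugu": "Indic",
    "Arabic": "Arabic", "Mongolian": "Arabic", "N'Ko": "Arabic",
    "Syriac": "Arabic",
    "Thai": "Thai/Lao", "Lao": "Thai/Lao",
    "Khmer": "Khmer", "Myanmar": "Myanmar", "Tibetan": "Tibetan",
    "Hangul": "Hangul", "Hebrew": "Hebrew",
    "Javanese": "USE", "Balinese": "USE", "Tai Tham": "USE",
}

# Script name -> (original tag, revised "Indic2" tag).
INDIC_SCRIPT_TAGS = {
    "Bengali": ("beng", "bng2"),
    "Devanagari": ("deva", "dev2"),
    "Gujarati": ("gujr", "gjr2"),
    "Gurmukhi": ("guru", "gur2"),
    "Kannada": ("knda", "knd2"),
    "Malayalam": ("mlym", "mlm2"),
    "Oriya": ("orya", "ory2"),
    "Tamil": ("taml", "tml2"),
    "Telugu": ("telu", "tel2"),
}


def select_model(script):
    """Return the shaping model for a script, falling back to the default."""
    return SHAPING_MODEL.get(script, "default")
```
]]></programlisting>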
				333
				334	</section>
				335
				336	<section id="graphite-shaping">
				337	<title>Graphite shaping</title>
				338	<para>
				339	In contrast to OpenType shaping, Graphite shaping does not
				340	specify a predefined set of shaping models or a set of supported
				341	scripts.
				342	</para>
				343	<para>
				344	Instead, each Graphite font contains a complete set of rules that
				345	implement the required shaping model for the intended
				346	script. These rules include finite-state machines to match
				347	sequences of codepoints to the shaping operations to perform.
				348	</para>
				349	<para>
				350	Graphite shaping can perform the same shaping operations used in
				351	OpenType shaping, as well as other functions that have not been
				352	defined for OpenType shaping.
				353	</para>
				354	</section>
				355
				356	<section id="aat-shaping">
				357	<title>AAT shaping</title>
				358	<para>
				359	In contrast to OpenType shaping, AAT shaping does not specify a
				360	predefined set of shaping models or a set of supported scripts.
				361	</para>
				362	<para>
				363	Instead, each AAT font includes a complete set of rules that
				364	implement the desired shaping model for the intended
				365	script. These rules include finite-state machines that match glyph
				366	sequences to the shaping operations to perform.
				367	</para>
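<para>
The general idea of a finite-state machine that matches a glyph
sequence and triggers an operation can be sketched as follows. This
is not the AAT table format: the states, the glyph names, and the
single <literal>ligate_fi</literal> action are all invented for the
example.
</para>
<programlisting language="python"><![CDATA[
```python
# (state, glyph) -> (next_state, action). Unlisted pairs reset to "start".
TRANSITIONS = {
    ("start", "f"): ("saw_f", None),
    ("saw_f", "f"): ("saw_f", None),
    ("saw_f", "i"): ("start", "ligate_fi"),
}


def run_machine(glyphs):
    """Walk the glyph sequence and replace each f + i pair with one glyph."""
    out = []
    state = "start"
    for glyph in glyphs:
        state, action = TRANSITIONS.get((state, glyph), ("start", None))
        if action == "ligate_fi":
            # The pending "f" is the last output glyph; swap in the ligature.
            out[-1] = "f_i"
        else:
            out.append(glyph)
    return out
```
]]></programlisting>
<para>
Note that the machine operates on glyph names, not on Unicode
codepoints, which mirrors the distinction drawn in this section.
</para>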
				368	<para>
				369	Notably, AAT shaping rules are expressed for glyphs in the font,
				370	not for Unicode codepoints. AAT shaping can perform the same
				371	shaping operations used in OpenType shaping, as well as other
				372	functions that have not been defined for OpenType shaping.
				373	</para>
				374	</section>
				375	</chapter>