Blame - README - platform/external/tagsoup

Narayan Kamath	70dce01	2013-10-21 12:26:25 +0100	[diff] [blame]	1	TagSoup - Just Keep On Truckin'
				2
				3	Introduction
				4
				5	This is the home page of TagSoup, a SAX-compliant parser written in
				6	Java that, instead of parsing well-formed or valid XML, parses HTML as
				7	it is found in the wild: [1]poor, nasty and brutish, though quite often
				8	far from short. TagSoup is designed for people who have to process this
				9	stuff using some semblance of a rational application design. By
				10	providing a SAX interface, it allows standard XML tools to be applied
				11	to even the worst HTML. TagSoup also includes a command-line processor
				12	that reads HTML files and can generate either clean HTML or well-formed
				13	XML that is a close approximation to XHTML.
				14
				15	This is also the README file packaged with TagSoup.
				16
				17	TagSoup is free and Open Source software. As of version 1.2, it is
				18	licensed under the [2]Apache License, Version 2.0, which allows
				19	proprietary re-use as well as use with GPL 3.0 or GPL 2.0-or-later
				20	projects. (If anyone needs a GPL 2.0 license for a GPL 2.0-only
				21	project, feel free to ask.)
				22
				23	Warning: TagSoup will not build on stock Java 5.x or 6.x!
				24
				25	Due to a bug in the versions of Xalan shipped with Java 5.x and 6.x,
				26	TagSoup will not build out of the box. You need to retrieve [3]Saxon
				27	6.5.5, which does not have the bug. Unpack the zipfile in an empty
				28	directory and copy the saxon.jar and saxon-xml-apis.jar files to
				29	$ANT_HOME/lib. The Ant build process for TagSoup will then notice that
				30	Saxon is available and use it instead.
				31
				32	TagSoup 1.2 released
				33
				34	There are a great many changes, most of them fixes for long-standing
				35	bugs, in this release. Only the most important are listed here; for the
				36	rest, see the CHANGES file in the source distribution. Very special
				37	thanks to Jojo Dijamco, whose intensive efforts at debugging made this
				38	release a usable upgrade rather than a useless mass of undetected bugs.
				39	* As noted above, I have changed the license to Apache 2.0.
				40	* The default content model for bogons (unknown elements) is now ANY
				41	rather than EMPTY. This is a breaking change, which I have made
				42	only because there was so much demand for it. It can be undone on
				43	the command line with the --emptybogons switch, or programmatically
				44	with parser.setFeature(Parser.emptyBogonsFeature, true).
				45	* The processing of entity references in attribute values has finally
				46	been fixed to do what browsers do. That is, a reference is only
				47	recognized if it is properly terminated by a semicolon; otherwise
				48	it is treated as plain text. This means that URIs like
				49	foo?cdown=32&cup=42 are no longer seen as containing an instance of
				50	the ∪ character (whose name happens to be cup).
				51	* Several new switches have been added:
				52	+ --doctype-system and --doctype-public force a DOCTYPE
				53	declaration to be output and allow setting the system and
				54	public identifiers.
				55	+ --standalone and --version allow control of the XML
				56	declaration that is output. (Note that TagSoup's XML output is
				57	always version 1.0, even if you use --version=1.1.)
				58	+ --norootbogons causes unknown elements not to be allowed as
				59	the document root element. Instead, they are made children of
				60	the default root element (the html element for HTML).
				61	* The TagSoup core now supports character entities with values above
				62	U+FFFF. As a consequence, the HTML schema now supports all 2,210
				63	standard character entities from the [4]2007-12-14 draft of XML
				64	Entity Definitions for Characters, except the 94 which require more
				65	than one Unicode character to represent.
				66	* The SAX events startPrefixMapping and endPrefixMapping are now
				67	being reported for all cases of foreign elements and attributes.
				68	* All bugs around newline processing on Windows should now be gone.
				69	* A number of content models have been loosened to allow elements to
				70	appear in new and non-standard (but commonly found) places. In
				71	particular, tables are now allowed inside paragraphs, against the
				72	letter of the W3C specification.
				73	* Since the span element is intended for fine control of appearance
				74	using CSS, it should never have been a restartable element. This
				75	very long-standing bug has now been fixed.
				76	* The following non-standard elements are now at least partly
				77	supported: bgsound, blink, canvas, comment, listing, marquee, nobr,
				78	rbc, rb, rp, rtc, rt, ruby, wbr, xmp.
				79	* In HTML output mode, boolean attributes like checked are now output
				80	as such, rather than in XML style as checked="checked".
				81	* Runs of < characters such as << and <<< are now handled correctly
				82	in text rather than being transformed into extremely bogus
				83	start-tags.
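
The entity-reference rule described above (a reference counts only when a semicolon terminates it) can be sketched as follows. This is a minimal illustration, not TagSoup's own code: the class name AttributeEntities, the method expand, and the four-entry table are assumptions made for the example, and numeric character references are not handled. The Afr entry shows a value above U+FFFF, which Java stores as a surrogate pair.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of TagSoup: expands named entity references
// in an attribute value only when they are terminated by a semicolon.
final class AttributeEntities {
    // Tiny illustrative table; the real HTML schema defines far more names.
    private static final Map<String, Integer> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("amp", 0x26);
        ENTITIES.put("lt", 0x3C);
        ENTITIES.put("cup", 0x222A);   // the union sign
        ENTITIES.put("Afr", 0x1D504);  // a character above U+FFFF
    }

    static String expand(String value) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < value.length()) {
            char c = value.charAt(i);
            if (c == '&') {
                int semi = value.indexOf(';', i + 1);
                if (semi > i + 1) {
                    Integer codePoint = ENTITIES.get(value.substring(i + 1, semi));
                    if (codePoint != null) {
                        // appendCodePoint emits a surrogate pair when needed.
                        out.appendCodePoint(codePoint);
                        i = semi + 1;
                        continue;
                    }
                }
            }
            // Not a properly terminated, known reference: keep as plain text.
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("foo?cdown=32&cup=42"));
        System.out.println(expand("A &cup; B"));
    }
}
```

With this sketch, the URI foo?cdown=32&cup=42 passes through unchanged, while "A &cup; B" has its reference replaced by the union sign.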
				84
				85	[5]Download the TagSoup 1.2 jar file here. It's about 87K long.
				86	[6]Download the full TagSoup 1.2 source here. If you don't have zip,
				87	you can use jar to unpack it.
				88	[7]Download the current CHANGES file here.
				89
				90	TagSoup 1.1 released
				91
				92	TagSoup 1.1 adds Tatu Saloranta's JAXP support for TagSoup. To use
				93	TagSoup within the JAXP framework (which is not something I necessarily
				94	recommend, but it is part of the Java XML platform), you can create a
				95	SAXParser by calling
				96	org.ccil.cowan.tagsoup.jaxp.SAXParserImpl.newInstance(). You can also
				97	set the system property javax.xml.parsers.SAXParserFactory to
				98	org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl, but be aware that doing
				99	this will cause all JAXP-based XML parsing to go through TagSoup, which
				100	is a Bad Thing if your application also reads XML documents.
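
One way to limit the effect of the system property is to set it only around the code that needs it and restore the previous value afterwards. The sketch below assumes nothing about TagSoup beyond the factory class name given above; the class ScopedSaxFactory and the method withFactory are hypothetical names. The property is still JVM-wide while the block runs, so other threads that create a JAXP parser in that window are affected too.

```java
import java.util.function.Supplier;

// Hypothetical helper: sets the JAXP factory property for the duration of
// one task, then restores whatever value was there before.
final class ScopedSaxFactory {
    static final String KEY = "javax.xml.parsers.SAXParserFactory";
    static final String TAGSOUP_FACTORY =
        "org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl";

    static <T> T withFactory(String factoryClass, Supplier<T> work) {
        String previous = System.getProperty(KEY);
        System.setProperty(KEY, factoryClass);
        try {
            return work.get();
        } finally {
            // Restore the earlier setting even if the task throws.
            if (previous == null) {
                System.clearProperty(KEY);
            } else {
                System.setProperty(KEY, previous);
            }
        }
    }
}
```

A caller would pass TAGSOUP_FACTORY and a task that obtains its SAXParser through JAXP and parses the HTML. Parsing done outside the call goes back to the previously configured factory.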
				101
				102	What TagSoup does
				103
				104	TagSoup is designed as a parser, not a whole application; it isn't
				105	intended to permanently clean up bad HTML, as [8]HTML Tidy does, only
				106	to parse it on the fly. Therefore, it does not convert presentation
				107	HTML to CSS or anything similar. It does guarantee well-structured
				108	results: tags will wind up properly nested, default attributes will
				109	appear appropriately, and so on.
				110
				111	The semantics of TagSoup are as far as practical those of actual HTML
				112	browsers. In particular, never, never will it throw any sort of syntax
				113	error: the TagSoup motto is [9]"Just Keep On Truckin'". But there's
				114	much, much more. For example, if the first tag is LI, it will supply
				115	the application with enclosing HTML, BODY, and UL tags. Why UL? Because
				116	that's what browsers assume in this situation. For the same reason,
				117	overlapping tags are correctly restarted whenever possible: text like:
				118	This is <B>bold, <I>bold italic, </b>italic, </i>normal text
				119
				120	gets correctly rewritten as:
				121	This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text.
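
The restart behaviour shown above can be approximated with a stack of open elements. The following is a simplified sketch, not TagSoup's actual algorithm: the class name RestartSketch, the set of restartable elements, and the attribute-free tag syntax are all assumptions for illustration, and it reopens elements immediately after the end-tag instead of waiting for further content.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of restarting overlapping elements.
final class RestartSketch {
    // Assumed set of restartable elements, chosen for this example only.
    private static final Set<String> RESTARTABLE =
        new HashSet<>(Arrays.asList("b", "i", "u", "em", "strong"));

    // Matches an attribute-free start-tag or end-tag, or a run of text.
    private static final Pattern TOKEN =
        Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)>|([^<]+)");

    static String repair(String input) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();  // innermost element first
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            if (m.group(3) != null) {             // text
                out.append(m.group(3));
                continue;
            }
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (m.group(1).isEmpty()) {           // start-tag
                open.push(name);
                out.append('<').append(name).append('>');
            } else if (open.contains(name)) {     // end-tag of an open element
                List<String> toRestart = new ArrayList<>();
                String popped;
                // Close everything nested inside the element being ended.
                while (!(popped = open.pop()).equals(name)) {
                    out.append("</").append(popped).append('>');
                    if (RESTARTABLE.contains(popped)) {
                        toRestart.add(0, popped); // keep outermost first
                    }
                }
                out.append("</").append(name).append('>');
                // Reopen the restartable elements that were cut short.
                for (String r : toRestart) {
                    open.push(r);
                    out.append('<').append(r).append('>');
                }
            }
            // An end-tag with no matching open element is dropped.
        }
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(repair(
            "This is <B>bold, <I>bold italic, </b>italic, </i>normal text"));
    }
}
```

Run on the example input above, the sketch produces the rewritten form shown above, with the italic element closed before the bold end-tag and reopened after it.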
				122
				123	By intention, TagSoup is small and fast. It does not depend on the
				124	existence of any framework other than SAX, and should be able to work
				125	with any framework that can accept SAX parsers. In particular, [10]XOM
				126	is known to work.
				127
				128	You can replace the low-level HTML scanner with one based on Sean
				129	McGrath's [11]PYX format (very close to James Clark's ESIS format). You
				130	can also supply an AutoDetector that peeks at the incoming byte stream
				131	and guesses a character encoding for it. Otherwise, the platform
				132	default is used. If you need an autodetector of character sets,
				133	consider trying to adapt the [12]Mozilla one; if you succeed, let me
				134	know.
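
As a rough idea of what a very small autodetector might do, the sketch below peeks at the first bytes of the stream, picks a Unicode encoding when a byte order mark is present, and otherwise falls back to the platform default. It is a standalone illustration: the class name BomSniffer is hypothetical, the method name autoDetectingReader is assumed to mirror TagSoup's AutoDetector interface, and wiring it into the parser through the auto-detector property is left out. Real-world HTML usually needs far more than BOM sniffing.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical BOM-based encoding guesser.
final class BomSniffer {
    static Reader autoDetectingReader(InputStream in) {
        try {
            PushbackInputStream pb = new PushbackInputStream(in, 3);
            byte[] head = new byte[3];
            int n = 0;
            while (n < 3) {                      // read up to three bytes
                int r = pb.read(head, n, 3 - n);
                if (r < 0) break;
                n += r;
            }
            Charset cs = Charset.defaultCharset();
            int skip = 0;                        // BOM bytes to discard
            if (n >= 3 && (head[0] & 0xFF) == 0xEF
                    && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
                cs = StandardCharsets.UTF_8;
                skip = 3;
            } else if (n >= 2 && (head[0] & 0xFF) == 0xFE
                    && (head[1] & 0xFF) == 0xFF) {
                cs = StandardCharsets.UTF_16BE;
                skip = 2;
            } else if (n >= 2 && (head[0] & 0xFF) == 0xFF
                    && (head[1] & 0xFF) == 0xFE) {
                cs = StandardCharsets.UTF_16LE;
                skip = 2;
            }
            // Put back everything that was not part of a BOM.
            if (n > skip) pb.unread(head, skip, n - skip);
            return new InputStreamReader(pb, cs);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Small helper for trying the sketch out.
    static String readAll(Reader r) {
        try {
            StringBuilder sb = new StringBuilder();
            for (int c; (c = r.read()) != -1; ) sb.append((char) c);
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```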
				135
				136	Note: TagSoup in Java 1.1
				137
				138	If you go through the TagSoup source and replace all references to
				139	HashMap with Hashtable and recompile, TagSoup will work fine in Java
				140	1.1 VMs. Thanks to Thorbjørn Vinne for this discovery.
				141
				142	The TSaxon XSLT-for-HTML processor
				143
				144	[13]I am also distributing [14]TSaxon, a repackaging of version 6.5.5
				145	of Michael Kay's Saxon XSLT version 1.0 implementation that includes
				146	TagSoup. TSaxon is a drop-in replacement for Saxon, and can be used to
				147	process either HTML or XML documents with XSLT stylesheets.
				148
				149	TagSoup as a stand-alone program
				150
				151	It is possible to run TagSoup as a program by saying java -jar
				152	tagsoup-1.2.jar [option ...] [file ...]. Files mentioned on the command
				153	line will be parsed individually. If no files are specified, the
				154	standard input is read.
				155
				156	The following options are understood:
				157
				158	--files
				159	Output into individual files, with html extensions changed to
				160	xhtml. Otherwise, all output is sent to the standard output.
				161
				162	--html
				163	Output is in clean HTML: the XML declaration is suppressed, as
				164	are end-tags for the known empty elements.
				165
				166	--omit-xml-declaration
				167	The XML declaration is suppressed.
				168
				169	--method=html
				170	End-tags for the known empty HTML elements are suppressed.
				171
				172	--doctype-system=systemid
				173	Forces the output of a DOCTYPE declaration with the specified
				174	systemid.
				175
				176	--doctype-public=publicid
				177	Forces the output of a DOCTYPE declaration with the specified
				178	publicid.
				179
				180	--version=version
				181	Sets the version string in the XML declaration.
				182
				183	--standalone=[yes|no]
				184	Sets the standalone declaration to yes or no.
				185
				186	--pyx
				187	Output is in PYX format.
				188
				189	--pyxin
				190	Input is in PYXoid format (need not be well-formed).
				191
				192	--nons
				193	Namespaces are suppressed. Normally, all elements are in the
				194	XHTML 1.x namespace, and all attributes are in no namespace.
				195
				196	--nobogons
				197	Bogons (unknown elements) are suppressed.
				198
				199	--nodefaults
				200	Default attribute values are suppressed.
				201
				202	--nocolons
				203	Explicit colons in element and attribute names are changed to
				204	underscores.
				205
				206	--norestart
				207	Normally restartable elements are not restarted.
				208
				209	--ignorable
				210	Whitespace in elements with element-only content is output.
				211
				212	--emptybogons
				213	Bogons are given a content model of EMPTY rather than ANY.
				214
				215	--any
				216	Bogons are given a content model of ANY rather than EMPTY
				217	(default).
				218
				219	--norootbogons
				220	Don't allow bogons to be root elements; make them subordinate to
				221	the root.
				222
				223	--lexical
				224	Pass through HTML comments and DOCTYPE declarations. Has no
				225	effect when output is in PYX format.
				226
				227	--reuse
				228	Reuse a single instance of TagSoup parser throughout. Normally,
				229	a new one is instantiated for each input file.
				230
				231	--nocdata
				232	Change the content models of the script and style elements to
				233	treat them as ordinary #PCDATA (text-only) elements, as in
				234	XHTML, rather than with the special CDATA content model.
				235
				236	--encoding=encoding
				237	Specify the input encoding. The default is the Java platform
				238	default.
				239
				240	--output-encoding=encoding
				241	Specify the output encoding. The default is the Java platform
				242	default.
				243
				244	--help
				245	Print help.
				246
				247	--version
				248	Print the version number.
				249
				250	SAX features and properties
				251
				252	TagSoup supports the following SAX features in addition to the standard
				253	ones:
				254
				255	http://www.ccil.org/~cowan/tagsoup/features/ignore-bogons
				256	A value of "true" indicates that the parser will ignore unknown
				257	elements.
				258
				259	http://www.ccil.org/~cowan/tagsoup/features/bogons-empty
				260	A value of "true" indicates that the parser will give unknown
				261	elements a content model of EMPTY; a value of "false", a content
				262	model of ANY.
				263
				264	http://www.ccil.org/~cowan/tagsoup/features/root-bogons
				265	A value of "true" indicates that the parser will allow unknown
				266	elements to be the root of the output document.
				267
				268	http://www.ccil.org/~cowan/tagsoup/features/default-attributes
				269	A value of "true" indicates that the parser will return default
				270	attribute values for missing attributes that have default
				271	values.
				272
				273	http://www.ccil.org/~cowan/tagsoup/features/translate-colons
				274	A value of "true" indicates that the parser will translate
				275	colons into underscores in names.
				276
				277	http://www.ccil.org/~cowan/tagsoup/features/restart-elements
				278	A value of "true" indicates that the parser will attempt to
				279	restart the restartable elements.
				280
				281	http://www.ccil.org/~cowan/tagsoup/features/ignorable-whitespace
				282	A value of "true" indicates that the parser will transmit
				283	whitespace in element-only content via the SAX
				284	ignorableWhitespace callback. Normally this is not done, because
				285	HTML is an SGML application and SGML suppresses such whitespace.
				286
				287	http://www.ccil.org/~cowan/tagsoup/features/cdata-elements
				288	A value of "true" indicates that the parser will process the
				289	script and style elements (or any elements with type='cdata' in
				290	the TSSL schema) as SGML CDATA elements (that is, no markup is
				291	recognized except the matching end-tag).
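
Since all of the feature names above share one URI prefix, a caller can build them from their short names and apply several at once through the standard SAX setFeature call. The sketch below is one possible arrangement, not an API that TagSoup provides: the class TagSoupFeatureUris and its methods are hypothetical. It reports the features a given XMLReader rejects instead of failing, so the same code can be pointed at a parser that does not know them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

// Hypothetical helper for working with the feature URIs listed above.
final class TagSoupFeatureUris {
    static final String PREFIX = "http://www.ccil.org/~cowan/tagsoup/features/";

    // Builds the full feature URI from a short name such as "bogons-empty".
    static String uri(String shortName) {
        return PREFIX + shortName;
    }

    // Applies each short-name/value pair and returns the URIs that the
    // reader did not recognize or support.
    static List<String> apply(XMLReader reader, Map<String, Boolean> features) {
        List<String> rejected = new ArrayList<>();
        for (Map.Entry<String, Boolean> e : features.entrySet()) {
            String featureUri = uri(e.getKey());
            try {
                reader.setFeature(featureUri, e.getValue());
            } catch (SAXNotRecognizedException | SAXNotSupportedException ex) {
                rejected.add(featureUri);
            }
        }
        return rejected;
    }
}
```

For example, passing a TagSoup parser instance as the reader together with a map containing "bogons-empty" set to true would request the EMPTY content model for unknown elements.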
				292
				293	TagSoup supports the following SAX properties in addition to the
				294	standard ones:
				295
				296	http://www.ccil.org/~cowan/tagsoup/properties/scanner
				297	Specifies the Scanner object this parser uses.
				298
				299	http://www.ccil.org/~cowan/tagsoup/properties/schema
				300	Specifies the Schema object this parser uses.
				301
				302	http://www.ccil.org/~cowan/tagsoup/properties/auto-detector
				303	Specifies the AutoDetector (for encoding detection) this parser
				304	uses.
				305
				306	More information
				307
				308	I gave a presentation (a nocturne, so it's not on the schedule) at
				309	[15]Extreme Markup Languages 2004 about TagSoup, updated from the one
				310	presented in 2002 at the New York City XML SIG and at XML 2002. This is
				311	the main high-level documentation about how TagSoup works. Formats:
				312	[16]OpenDocument [17]Powerpoint [18]PDF.
				313
				314	I also had people add [19]"evil" HTML to a large poster so that I could
				315	[20]clean it up; View Source is probably more useful than ordinary
				316	browsing. The original instructions were:
				317
				318	SOUPE DE BALISES (BE EVIL)!
				319	Ecritez une balise ouvrante (sans attributs)
				320	ou fermante HTML ici, s.v.p.
				321
				322	There is a [21]tagsoup-friends mailing list hosted at [22]Yahoo Groups.
				323	You can [23]join via the Web, or by sending a blank email to
				324	[24]tagsoup-friends-subscribe@yahoogroups.com. The [25]archives are
				325	open to all.
				326
				327	Online TagSoup processing for publicly accessible HTML documents is now
				328	[26]available courtesy of Leigh Dodds.
				329
				330	References
				331
				332	1. http://oregonstate.edu/instruct/phl302/texts/hobbes/leviathan-c.html
				333	2. http://opensource.org/licenses/apache2.0.php
				334	3. http://prdownloads.sourceforge.net/saxon/saxon6-5-5.zip
				335	4. http://www.w3.org/TR/2007/WD-xml-entity-names-20071214
				336	5. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup-1.2.jar
				337	6. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup-1.2-src.zip
				338	7. http://home.ccil.org/~cowan/XML/tagsoup/CHANGES
				339	8. http://tidy.sf.net/
				340	9. http://www.crumbmuseum.com/truckin.html
				341	10. http://www.cafeconleche.org/XOM
				342	11. http://gnosis.cx/publish/programming/xml_matters_17.html
				343	12. http://jchardet.sourceforge.net/
				344	13. http://www.ccil.org/~cowan
				345	14. http://home.ccil.org/~cowan/XML/tagsoup/tsaxon
				346	15. http://www.extrememarkup.com/extreme/2004
				347	16. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.odp
				348	17. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.ppt
				349	18. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.pdf
				350	19. http://home.ccil.org/~cowan/XML/tagsoup/extreme.html
				351	20. http://home.ccil.org/~cowan/XML/tagsoup/extreme.xhtml
				352	21. http://groups.yahoo.com/group/tagsoup-friends
				353	22. http://groups.yahoo.com/
				354	23. http://groups.yahoo.com/group/tagsoup-friends/join
				355	24. mailto:tagsoup-friends-subscribe@yahoogroups.com
				356	25. http://groups.yahoo.com/group/tagsoup-friends/messages
				357	26. http://xmlarmyknife.org/docs/xhtml/tagsoup/