Changes from 1.2 to 1.2.1 ========================= Match DOCTYPE case-blind Extend PushbackReader's size for oddball cases like & followed by CR Leo Sutic's 2x-4x speedup by precompiling HTMLScanner table Changes from 1.1.3 to 1.2 ========================= Changed license to Apache 2.0 Bogon default model is now ANY, not EMPTY Support new DOCTYPE output switches --doctype-system and --doctype-public Support new XML declaration output switches --standalone and --version New --norootbogons switch makes bogons children of the root Don't resolve entity references in attribute values unless semicolon-terminated Support character entities above U+FFFF Add character entities from the 2007-12-14 draft of xml-entity-names Call SAX events startPrefixMapping and endPrefixMapping to report prefixes Clean up newline processing, shrinking html.stml considerably Allow link elements in the body as well as the head, to avoid excess bodies Allow tables inside paragraphs Allow cells and forms in thead and tfoot elements without intervening tr element The span element is no longer restartable Support non-standard elements bgsound, blink, canvas, comment, listing, marquee, nobr, ruby, rbc, rtc, rb, rt, rp, wbr, xmp In HTML mode, boolean attributes like checked are output in minimized form Correctly handle runs of less-than characters Suppress all but the first DOCTYPE declaration Modify PI targets containing colons to have underscores instead The case of element tags is now canonicalized to the schema PI targets are no longer forced to lower case Changes from 1.1.2 to 1.1.3 =========================== Allow Parser.set* methods to accept null Allow setting the LexicalHandler feature to be null in both cases means "use default behavior" Changes from 1.1.1 to 1.1.2 =========================== Setting CDATAElementsFeature didn't really set CDATAElements instance variable Changes from 1.1 to 1.1.1 ========================= Removed lexical handler calls to startCDATA/endCDATA from CDATA element handling Added lexical handler calls to startCDATA/endCDATA from CDATA section handling Added CDATAElementsFeature, the programmatic equivalent of the --nocdata switch Changes from 1.0.5 to 1.1 ========================= Add Tatu Saloranta's JAXP support package Changes from 1.0.4 to 1.0.5 =========================== Major repairs to comment scanning Skip leading BOM Comment out debugging code in PYXWriter Allow &#X as well as &#x Add net.sf.saxon to list of supported XSLT engines Changes from 1.0.4 to 1.0.3 =========================== Certain options were mutually exclusive that should not have been Blocked XML declaration from specifying an encoding of "" --method=html was not doing the right thing Changes from 1.0.3 to 1.0.2 =========================== Fixed build file to use Java target version 1.4 Fixed --version switch to print the right thing Changes from 1.0.1 to 1.0.2 =========================== Version attribute default value removed from html element Leading and trailing hyphens now trimmed properly from comments Added --output-encoding switch to control encoding If output encoding is Unicode, don't generate character references Whitespace compressed and junk stripped from public identifiers Changes from 1.0 to 1.0.1 ========================= Added ignorableWhitespaceFeature and --ignorable to report ignorable whitespace Patch due to David Pashley Insert spaces to break up -- in comments Change bogus chars in publicids to spaces --lexical switch now outputs DOCTYPE if there is one Remove unnecessary blank line after XML declaration Changes from 1.0rc9 to 1.0 ========================== Added feature to control restartability Patch due to Nikita Zhuk Added corresponding --norestart switch in CommandLine Made translate-colons feature actually work Changes from 1.0rc8 to 1.0rc9 ============================= If there is a publicid but no systemid, set systemid to "" Changes from 1.0rc7 to 1.0rc8 ============================= Fixed paper-bag bug (source didn't match binary in release) Changes from 1.0rc6 to 1.0rc7 ============================= LexicalHandler now gets DOCTYPE information (publicid and systemid) Patch due to Mike Bremford HTMLScanner now reports more useful debug output when not commented out Patch due to Mike Bremford Change "<memberOfAny>" to exclude "<root>" pseudo-element This prevents "script" from being output as a root The shared HTMLParser object has been eliminated Changes from 1.0rc5 to 1.0rc6 ============================= If namespaceFeature is false, uri and localname are passed as empty strings The namespacePrefixesFeature is now always false Command line switch --nons no longer affects namespacePrefixesFeature Command line switch --html now implies --nons XMLWriter is now told directly to use the schema's URI as default namespace XMLWriter now takes the element name from the qname if localname is empty Changes from 1.0rc4 to 1.0rc5 ============================= The --nodefault switch now removes only default attributes, not all of them Added --nocolons switch and translate-colons feature to convert ":" in names to "_" (thus suppressing namespaces other than the basic one) The root element can be unknown without problem Empty <script/> and <style/> tags now work Added all standard SAX2 features to feature hashtable Reimplemented namespacePrefixes feature (broken since 1.0rc3) Changes from 1.0rc3 to 1.0rc4 ============================= Remove trailing ? from processing instructions (in case the input is XHTML) Added Javadocs for all SAX standard and TagSoup-specific features and properties Fixed termination conditions for entity/character references Fixed EOF-pushback bug that was generating bogus 񥔵 references Added Parser feature and --nodefaults switch to ignore default attribute values Added support for SAX Locator Updated AFL license to version 3.0 Scanner buffer size increases as needed, allowing large attribute values Look for various XSLT implementations as available (still fails in raw 5.0) Clean up handling of XML empty tags and SGML minimized end-tags Support proper options and help message internally Use Hashtable in CommandLine class instead of HashMap Do proper buffering of InputStream and Reader Clean up content model of noframes element Removed htmlMode in XMLWriter Added support for XSLT output options METHOD=html and OMIT_XML_DECLARATION=yes Command line option --html sets both of these Wrote simple validator for TSSL schemas (tssl/tssl-validator.xslt) Removed various validity problems in html.tssl When processing a start-tag, don't restart elements that aren't in the new element's content model Remove bogus double param in tssl.xslt Changes from 1.0rc2 to 1.0rc3 ============================= Convert CR and CRLF to LF in comments and PIs Force empty elements to close immediately Match close tags of CDATA elements more precisely (but case-blind) Process switches on the command line Man page available Changes from 1.0rc1 to 1.0rc2 ============================= Isolated & and &# now don't crash parser TagSoup no longer depends on /dev/stdin existing Refactored Parser class, removing main method to new CommandLine class Changes to content models of form, button, table, and tr elements in html.tssl '</scr' + 'ipt>' in a script element no longer terminates it Introduced "uncloseability" of form and table elements "pyxin" property specifies that input is in PYX format Correctly cope with unexpected characters around colons, also with multiple colons Correctly output comments with "--" in them (by adding a space) Changes from 0.10.2 to 1.0rc1 ============================= Script can now appear anywhere Switch -nocdata correctly implemented Eliminated useless M_n constants in Schema Introduced <memberofAny> and <isRoot> as alternatives to <memberOf> in TSSL Allow prefixes in element names Attributes are now normalized Expanded public API for Element and ElementType Javadoc improved Changes from 0.10.1 to 0.10.2 ============================= Removed misfeature whereby > terminated a tag even inside quotes Added licensing language to XSLT scripts, RELAX NG schemas Removed long-standing mishandling of entity references in attributes Cleaned up logic for converting junky strings to proper XML Names Correctly handle empty tag that has no whitespace or attributes Restore correct 0.9.3 handling of an apparent end-tag in a CDATA element Added script element to content model of head element Changes from 0.9.7 to 0.10.1 (there is no 0.10.0): ================================================== Convert to XSLT configuration exclusively; Perl code and tab-separated tables are gone Remove xmlns:* attributes Append "_" to attribute names ending in ":" Don't prepend "_" to an attribute name starting in "_" Handle namespace prefixes in attributes: "xml" prefix is handled correctly other prefixes are mapped to "urn:x-prefix:foo" Ignore XML declarations -Dnocdata=true turns off F_CDATA on script and style elements Fixed off-by-one errors in character references that made them uninterpreted Start-tags ending in a minimized attribute are no longer being dropped XML empty tags are now supported (though slashes are still allowed in unquoted attribute values) Changes from 0.9.6 to 0.9.7: ============================ Upgraded AFL to version 2.1 Passed through newlines in character content (very old bug) Changes from 0.9.5 to 0.9.6: ============================ Script element can appear directly in body ">" terminates a start-tag even inside a quoted attribute, to protect against unbalanced quotes "_" is prepended to attributes that don't begin with a letter Remove "xmlns" attributes from the input All standard features can now be set (although there is no effect from doing so) New "bogons-empty" feature can be set to false to give bogons content model of ANY rather than EMPTY; -Dany switch sets this feature to false TSSL now has an explicit group element to declare an element group STML is a new XML format for modeling state-table changes License updated to AFL 2.1 Changes from 0.9.4 to 0.9.5: ============================ S in the statetable now means \r and \n and \t as well as space (as was always intended; brain fart!) Ins and del elements are now allowed everywhere TSSL now correctly supports attributes that are legal on all elements Changes from 0.9.3 to 0.9.4: ============================ Fixed paper-bag bug that revealed attribute type BOOLEAN to applications. Obsolete ABSTRACT removed in favor of README. Improved implementation of CDATA restart after bogus end-tag. Allowed hyphen, underscore, and period in names as well as colon. First cut at TagSoup Schema Language -- doesn't do anything yet. Support CDATA sections on input. Don't generate built-in entities within CDATA elements. Changes from 0.9.2 to 0.9.3: ============================ Convenience main program "tagsoup" in bin directory. Begin to integrate tests. Introduced BOOLEAN type (currently just converted to NMTOKEN). Features that actually work are now named constants in Parser. Double root elements are really gone now. ID attributes weren't being removed from restarted elements. Fixed a bug that made unknown elements disappear in some cases. Parser is now safely reusable. PYXWriter and XMLWriter now implement LexicalHandler. Parser reports comments, startCDATA, and endCDATA events to a LexicalHandler. ScanHandler methods now throw only SAXException, not also IOException. -Dlexical=true switch sets the ContentHandler as a LexicalHandler as well (XMLWriter prints comments, ignores CDATA sections; PYXWriter ignores all). -Dreuse=true switch reuses a single Parser object (no great speed gain). We now disallow an a element as the child of another a element. An empty input is now treated as zero-length character content. HTMLWriter is gone in favor of an extended XMLWriter with get/setHTMLMode methods. CDATA elements only terminaate with matching end-tags (thanks to Sebastien Bardoux). Changes from 0.9.1 to 0.9.2: ============================ No longer inserts bogus ; after unknown entity reference without ;. Consecutive entity references now work correctly. Setting namespaces and namespace-prefixes methods now works. -Dnons=true option turns off namespace and prefix. New feature http://www.ccil.org/~cowan/tagsoup/features/ignore-bogons" suppresses unknown start-tags (any end-tag will be automatically ignored). -Dnobogons=true option turns ignore-bogons on. Suppress unknown and/or empty initial start-tag always (prevents double root element). Schema now allows style as an inline element, like script. Schema now allows tr as a child of table to avoid problems with embedded tables. Clear Parser instance variables to make Parsers properly reusable. Changes from 0.9 to 0.9.1: ========================== Incorporated patch for -jar support by Joseph Walton. Incorporated patch for Megginson XMLWriter support by Joseph Walton. Changed existing XMLWriter to HTMLWriter. Rewrote Parsermain for better features, removed Tester class. -Dnewline=true removed, now implied by -DHTML=true. -Dfiles=true now used to generate separate outputs (old Tester behavior) with extension xhtml (removing any old extension). Fixed nasty bug in HTMLScanner that was failing to fix unusual entities. Don't attempt to smash whitespace to spaces any more. Changes from 0.8 to 0.9: ======================== Ant-ified by Martin Rademacher. Don't suppress colons in element names. Entity problems fixed (I hope). Can now set namespace and namespace-prefixes features (without effect). Properly templatize HTMLModels.java. Attributes are no longer in the HTML namespace.