Extract positional XML parser into common and fix encoding issues
The XML DOM parser used by the lint CLI driver (which tracks
positions) is needed outside of lint, so pull it out of the lint/cli
project, and refactor it such that it does not directly reference the
lint Position APIs (but can utilize them when subclassed in lint).
In addition, handle non-UTF-8 file encodings. XML files can be encoded
in other character sets, and can specify this via the encoding
attribute in the XML prologue. Until now, the CLI lint runner would
just read the XML file contents in using the default encoding and
parse this. Now there's a new utility method which takes a byte[] and
infers the desired encoding and uses that to convert the byte[] into a
string using the correct encoding. (We can't just pass an InputStream
and let the SAX parser handle this on its own because the XML parser
needs to access the character stream in order to assign correct node
offsets.) This code now also handles the byte order mark more
cleanly.
There are some new unit tests too to check the new encoding, BOM and
offset handling.
Change-Id: Ib0badbbe72172e3408c6d5af2413be51280a7724
31 files changed