Parsing XML from a stream.
I have to extract a bit of XML out of a stream, and it's bugging me. I've done it before in C++, but it felt really dirty (though it works solidly), and I'm looking for either a simpler or a more elegant way to do it. I don't mind spending a little extra CPU time or memory to gain robustness, but I want something simple to understand.
Here is the problem in detail:
The XML must be UTF-8 or a single-byte (8-bit) ANSI code page. I can switch the entire system to UTF-8 or to the default ANSI code page, but I won't support both. For now I want to try UTF-8 so I can support more inputs.
The XML starts with a declaration that includes the encoding (if it doesn't, we assume UTF-8/ANSI anyway).
I may not get the entire XML content at once (I have a thread that is triggered on new data).
I may get multiple AND partial XML data AND sometimes garbage:
<?xml ...><root>...</root> %#$!%$^ <?xml ...><root>...</root><?xml ...><root>...</root><?xml ...><root>...
Notice how it cuts off at the end.
The solution I'm thinking of:
Use regular expressions (my brain will hurt) to extract the complete documents, then do a string search for the last closing root tag </root> and reset the stream to start just after it, keeping whatever is left over. Finally, examine the leftover to see whether it contains a start tag or a partial tag. If it is bad data, delete it up to the point where good data might begin.
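To make that idea concrete, here is a rough sketch in Python that uses plain byte searches instead of regular expressions. It is only an illustration under several assumptions: every document begins with an `<?xml` declaration, the root element is literally named `root`, and roots are not nested. The class name `XmlStreamExtractor` and its `feed` method are made up for this example. It works on raw bytes, so nothing is decoded until a whole document has been cut out.

```python
START = b"<?xml"    # assumption: every document starts with a declaration
END = b"</root>"    # assumption: the root element is literally named "root"


class XmlStreamExtractor:
    """Hypothetical helper: accumulates raw bytes and cuts out complete documents."""

    def __init__(self):
        self._buf = bytearray()

    def feed(self, chunk):
        """Append new bytes; return the complete documents found so far (as bytes)."""
        self._buf += chunk
        docs = []
        while True:
            start = self._buf.find(START)
            if start < 0:
                # No declaration in the buffer: everything is garbage, except a
                # tail that could be the beginning of "<?xml" (e.g. "<?x").
                keep = self._partial_start_len()
                del self._buf[:len(self._buf) - keep]
                break
            del self._buf[:start]            # drop garbage before the declaration
            end = self._buf.find(END)
            if end < 0:
                break                        # document incomplete; wait for more data
            # If another declaration appears before the closing tag, the first
            # document was cut off; skip ahead to the most recent declaration.
            last_start = self._buf.rfind(START, 0, end)
            if last_start > 0:
                del self._buf[:last_start]
                continue
            stop = end + len(END)
            docs.append(bytes(self._buf[:stop]))
            del self._buf[:stop]
        return docs

    def _partial_start_len(self):
        """Length of the longest proper prefix of START that ends the buffer."""
        for n in range(min(len(START) - 1, len(self._buf)), 0, -1):
            if self._buf.endswith(START[:n]):
                return n
        return 0
```

Each call to `feed` would come from the thread that is triggered on new data. The returned byte strings can then be handed to a real XML parser, which reads the encoding from the declaration itself.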
I previously did this using memcpy and a block of RAM, with index counters to keep track of everything. This required a lot of memory shuffling and worked decently enough, but it only handled single-byte encodings and of course required copying memory blocks (I guess I could use byte arrays as the memory blocks). Now how do I go about moving around data that may end halfway through a character, and keep track of that?
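On the half-character worry, two things seem relevant. First, in UTF-8 every byte of a multi-byte character has its high bit set, so an ASCII marker such as `<?xml` or `</root>` can never match inside a character. Searching the raw bytes for those markers is therefore safe even when a chunk ends mid-character. Second, if the text has to be decoded as it arrives, an incremental decoder can hold back an incomplete trailing sequence until the rest shows up. A minimal sketch in Python using the standard `codecs` module follows; the function name `decode_chunks` is made up for illustration.

```python
import codecs


def decode_chunks(chunks, encoding="utf-8"):
    """Decode byte chunks in order, tolerating characters split across chunks."""
    # errors="replace" turns invalid bytes into U+FFFD instead of raising.
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    parts = []
    for chunk in chunks:
        # An incomplete trailing byte sequence is buffered inside the decoder
        # and emitted once the following chunk completes it.
        parts.append(decoder.decode(chunk))
    # final=True flushes the decoder; a dangling partial character is replaced.
    parts.append(decoder.decode(b"", final=True))
    return parts


data = "<root>é</root>".encode("utf-8")
# Split in the middle of the two-byte "é" (bytes 0xC3 0xA9).
parts = decode_chunks([data[:7], data[7:]])
```

Here the first chunk decodes to just `<root>`, and the `é` only appears with the second chunk, so no index bookkeeping for partial characters is needed.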

