Parsing XML from a stream.

Hi,

I have to extract a bit of XML out of a stream, and it's bugging me. I've done it before in C++, but it felt really dirty (it works solidly though), and I'm looking for either a simpler OR more elegant way to do it. I don't mind spending a little extra system resources, trading speed or space for robustness, but I want something simple to understand.

Here is the problem in detail:

The XML must be UTF-8 or 8-bit single-byte ANSI. I can switch the entire system to UTF-8 or to default ANSI, but I won't do both. For now I want to try to do it UTF-8 style so I can support more inputs.

The XML consists of the declaration, which includes the encoding type (if it doesn't, we assume UTF-8/ANSI anyway).

I may not get the entire XML content at once (I have a thread that is triggered on new data).

I may get multiple AND partial XML data AND sometimes garbage:

<?xml ...><root>...</root> %#$!%$^ <?xml ...><root>...</root><?xml ...><root>...</root><?xml ...><root>...

Notice how it cuts off at the end.

The solution I'm thinking of:

Use regular expressions (my brain will hurt) to extract the complete documents, then do a string search to find the last root ending tag </root> and reset the stream to start again from there, plus whatever is left over. Finally, examine what is left over to see if it contains the start tag OR a partial tag. If it is bad data, delete it up to the point of possible good data.

I previously did this using memcpy and a block of RAM, with index counters to keep track of everything. This required a lot of memory shuffling and worked decently enough, but it only worked on single-byte data and of course required copying memory blocks (I guess I could use byte arrays as memory blocks). Now how do I go about moving around data that may contain half of a character, and keeping track of that?
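
One possible way to handle the half-character problem in .NET, as a minimal sketch: System.Text.Decoder is stateful, so if a chunk ends in the middle of a multi-byte UTF-8 sequence, the leftover bytes are held back and completed by the next call. The class name and the sample bytes below are made up for illustration and are not from the original setup.

```csharp
using System;
using System.Text;

class PartialUtf8Demo
{
    static void Main()
    {
        // "é" is two bytes in UTF-8 (0xC3 0xA9); here it is split across two chunks.
        byte[] chunk1 = { (byte)'<', (byte)'a', (byte)'>', 0xC3 };
        byte[] chunk2 = { 0xA9, (byte)'<', (byte)'/', (byte)'a', (byte)'>' };

        // The decoder remembers incomplete byte sequences between calls.
        Decoder decoder = Encoding.UTF8.GetDecoder();
        StringBuilder text = new StringBuilder();

        foreach (byte[] chunk in new[] { chunk1, chunk2 })
        {
            // GetCharCount takes the decoder's pending bytes into account.
            char[] chars = new char[decoder.GetCharCount(chunk, 0, chunk.Length)];
            int n = decoder.GetChars(chunk, 0, chunk.Length, chars, 0);
            text.Append(chars, 0, n);
        }

        // After both chunks, text holds the fully decoded string "<a>é</a>".
        Console.WriteLine(text);
    }
}
```

With this approach the buffering and searching can be done on decoded text (string or StringBuilder) rather than on raw bytes, so no index ever lands in the middle of a character.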
[2033 byte] By [liRetro] at [2007-11-11 8:00:33]
# 1 Re: Parsing XML from a stream.
See if this helps: http://www.xml.com/pub/a/2002/05/22/parsing.html . My understanding is that .NET's XmlReader class supports streaming automatically.
Phil Weber at 2007-11-11 21:48:11 >
# 2 Re: Parsing XML from a stream.
Unfortunately the system being used does not seem able to handle multiple documents being continuously streamed in, nor does it say anything about handling garbage before or after valid data. After reading in good data I have to take any remainder (might be partial), copy it to the beginning of the stream and reset the stream to the beginning.

This is required because the stream could be data coming off a dirt-slow 300 baud connection, or even an erratic hand-typed message.

Edit: finally figured out the expression to match my tags, and THEN I can check if it is valid xml =>

string pat = String.Format("({0}[^?>]*\\?>)(\\s*{1}[^>]*>)(.*?)({2})", XML_TAG, XML_7START, XML_7STOP);

Where XML_TAG is the declaration, and the rest are the root element start and stop tags (data is guaranteed to be inside, or the document is invalid). Yay :(
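
For anyone following along, here is a rough sketch of how that pattern could drive the buffer handling described earlier: append new text, pull out every complete match, then keep only the tail that might still become a document. The values of XML_TAG, XML_7START and XML_7STOP are assumptions (they are not shown above), and the class name, Feed method and RegexOptions.Singleline choice are hypothetical additions for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

class XmlFrameExtractor
{
    // Assumed values; the actual constants are not shown in the post.
    static readonly string XML_TAG = Regex.Escape("<?xml");
    static readonly string XML_7START = Regex.Escape("<root");
    static readonly string XML_7STOP = Regex.Escape("</root>");

    static readonly Regex DocPattern = new Regex(
        String.Format("({0}[^?>]*\\?>)(\\s*{1}[^>]*>)(.*?)({2})",
                      XML_TAG, XML_7START, XML_7STOP),
        RegexOptions.Singleline); // assumption: let '.' match line breaks too

    private readonly StringBuilder buffer = new StringBuilder();

    // Append newly received text and return every complete document seen so far.
    public List<string> Feed(string incoming)
    {
        buffer.Append(incoming);
        string text = buffer.ToString();
        List<string> docs = new List<string>();
        int consumed = 0;

        foreach (Match m in DocPattern.Matches(text))
        {
            docs.Add(m.Value);               // still needs a real XML validity check
            consumed = m.Index + m.Length;   // end of the last complete document
        }

        string rest = text.Substring(consumed);
        int start = rest.IndexOf("<?xml", StringComparison.Ordinal);
        if (start < 0)
        {
            // No declaration yet: keep only a trailing fragment ("<", "<?", ...)
            // that could still grow into "<?xml"; everything else is garbage.
            start = rest.Length;
            for (int len = Math.Min(4, rest.Length); len > 0; len--)
            {
                string tail = rest.Substring(rest.Length - len);
                if ("<?xml".StartsWith(tail, StringComparison.Ordinal))
                {
                    start = rest.Length - len;
                    break;
                }
            }
        }

        buffer.Length = 0;                   // reset the buffer...
        buffer.Append(rest, start, rest.Length - start); // ...to the useful remainder
        return docs;
    }

    static void Main()
    {
        XmlFrameExtractor x = new XmlFrameExtractor();
        // One complete document, some garbage, then a document cut off mid-tag.
        foreach (string doc in x.Feed(
            "<?xml version=\"1.0\"?><root>a</root> %#$! <?xml version=\"1.0\"?><ro"))
            Console.WriteLine(doc);
        // The rest of the second document arrives later.
        foreach (string doc in x.Feed("ot>b</root>"))
            Console.WriteLine(doc);
    }
}
```

In Main, the second document arrives in two pieces and is only returned by the second Feed call. One known limitation of this sketch: if a document is truncated and never finished, the lazy (.*?) can stretch from its declaration to the closing tag of a later document, so each match should still be run through a real XML parser before being trusted.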
liRetro at 2007-11-11 21:49:17 >