
How to read SGML files using Java

I've got a text categorisation test collection called Reuters-21578 for my Information Retrieval project. It is distributed in 22 files. Each of the first 21 files (reut2-000.sgm through reut2-020.sgm) contains 1000 documents, while the last (reut2-021.sgm) contains 578 documents. The files are in SGML format. Each of the 22 files begins with a document type declaration line:
<!DOCTYPE lewis SYSTEM "lewis.dtd">
The DTD file lewis.dtd is included in the distribution. Following the document type declaration line are individual Reuters articles marked up with SGML tags.

My question is how to write a Java program that reads those 21578 documents, or transforms them into 21578 separate text files.
[724 byte] By [WXY595] at [2007-11-11 7:45:22]
# 1 Re: How to read SGML files using Java
Would you please help me? Thank you!
WXY595 at 2007-11-11 22:37:31
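
One possible approach, offered only as a rough sketch: since the goal is just to split the collection into plain text files, a full SGML parser may not be needed, and a regular expression over each file could be enough. The code below assumes that every article is wrapped in <REUTERS ...> ... </REUTERS> and carries a NEWID="n" attribute (please check this against the README and lewis.dtd in your copy of the distribution). The class name ReutersSplitter, the split helper and the "out" directory are made-up names for illustration. Files are read as ISO-8859-1 so that a stray non-ASCII byte cannot abort decoding.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReutersSplitter {
    // Assumption: each article is enclosed in <REUTERS ...> ... </REUTERS>.
    private static final Pattern ARTICLE =
            Pattern.compile("<REUTERS\\b([^>]*)>(.*?)</REUTERS>", Pattern.DOTALL);
    // Assumption: the opening tag carries a numeric NEWID attribute.
    private static final Pattern NEWID = Pattern.compile("NEWID=\"(\\d+)\"");
    // Matches any remaining markup tag inside an article.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    /** Splits one .sgm file's content into {id, plainText} pairs. */
    static List<String[]> split(String sgml) {
        List<String[]> docs = new ArrayList<>();
        Matcher m = ARTICLE.matcher(sgml);
        while (m.find()) {
            Matcher id = NEWID.matcher(m.group(1));
            // Fall back to a running counter if no NEWID is found.
            String name = id.find() ? id.group(1) : "doc" + (docs.size() + 1);
            // Drop the inner tags, then decode the two most common entities.
            String text = TAG.matcher(m.group(2)).replaceAll(" ");
            text = text.replace("&lt;", "<").replace("&amp;", "&").trim();
            docs.add(new String[] {name, text});
        }
        return docs;
    }

    public static void main(String[] args) throws IOException {
        Path in = Paths.get(args.length > 0 ? args[0] : ".");
        Path out = Paths.get(args.length > 1 ? args[1] : "out");
        Files.createDirectories(out);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(in, "*.sgm")) {
            for (Path f : files) {
                String sgml = new String(Files.readAllBytes(f), StandardCharsets.ISO_8859_1);
                for (String[] d : split(sgml)) {
                    // One output file per article, named after its id.
                    Files.write(out.resolve(d[0] + ".txt"),
                            d[1].getBytes(StandardCharsets.UTF_8));
                }
            }
        }
    }
}
```

Run it as "java ReutersSplitter <input-dir> <output-dir>". This keeps all text inside each article (title, dateline, topics and body run together); if only the body is wanted, the same regex idea could be applied once more to the article content. Numeric character references such as &#3; are left untouched in this sketch.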