TCLUG Development Archive

[Fwd: HTML Parsing question]



This is in reference to the HTML parsing thread that occurred a week ago.

-- 
Perry Hoekstra
Talent Software Services
dutchman@mn.uswest.net

"I don't see much sense in that," said Rabbit.
"No," said Pooh humbly, "there isn't. But there was going to be when I
began it. It's just that something happened to it along the way."


It looks like there is a Java port of tidy with source. One route
for me might be something like this:

1) Get the URL from which I want to extract text.
2) Use one of the previously mentioned / suggested tools for extracting
text from a well-formed HTML document.
3) If no errors -> done
   else
   catch the error, but rather than throwing possibly useful
content away, run tidy against the problematic page and try to extract content
again.
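
The steps above could be sketched roughly as follows. This is only an
illustration under some assumptions: the class name FallbackExtractor and its
"repair" hook are hypothetical, the strict pass uses the JDK's own XML parser
as a stand-in for whichever extraction tool is chosen, and the repair hook is
where a call into the Java port of Tidy would presumably go. Fetching the URL
(step 1) is left out, and the input should carry no DOCTYPE, since the parser
might otherwise try to fetch the DTD.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.function.UnaryOperator;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: strict extraction first, repair and retry on failure.
final class FallbackExtractor {
    // Assumption: stands in for the clean-up step (for example, running Tidy).
    private final UnaryOperator<String> repair;

    FallbackExtractor(UnaryOperator<String> repair) {
        this.repair = repair;
    }

    // Step 2: extract text, failing if the markup is not well formed.
    static String strictExtract(String markup) throws SAXException {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(markup)));
            return doc.getDocumentElement().getTextContent().trim();
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Step 3: if the strict pass fails, repair the page and try once more
    // instead of throwing the content away.
    String extract(String markup) {
        try {
            return strictExtract(markup);
        } catch (SAXException firstFailure) {
            try {
                return strictExtract(repair.apply(markup));
            } catch (SAXException secondFailure) {
                throw new IllegalArgumentException(
                    "markup still not well formed after repair", secondFailure);
            }
        }
    }
}
```

Keeping the repair step as a pluggable function means it only runs for pages
that fail the strict pass, so well-formed pages do not pay for the clean-up.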

Too bad Tidy doesn't have a save to text feature. 

Thanks to the folk(s) who took the trouble to port it to Java. 

Spencer


 

Dennis Sosnoski <dms@sosnoski.com> writes:

> If all you want to do is convert a set of static pages Tidy is definitely the
> way to go. It's probably too costly to run a separate program to convert data
> you retrieve on the fly, at least if you're running much volume of this type of
> query - it puts your performance back into the CGI realm.
> 
> Looking into the Tidy program is a great way to find out how to resolve some of
> the more thorny recovery cases, though.
> 
>   - Dennis
> 
> Kay Michael wrote:
> > 
> > > It's got decent recovery from most of the bad HTML I've tried
> > > it with. Some cases are tough, though - attribute values with a leading
> > > quote and no trailing quote, for instance.
> > 
> > Why not use Dave Raggett's HTML Tidy program, available from the W3C site?
> > 
> > Mike Kay
> > 
> > ---------------------------------------------------------------
> > java-xml-interest Commands
> > To: majordomo@cybercom.net
> > Body: subscribe java-xml-interest
> > Body: unsubscribe java-xml-interest