Conversation
Added a delimiter option to as_text to provide for a user-supplied delimiter. This allows users to better parse text from a website. POD has been modified to reflect this change.
|
I'm not convinced that this is a good idea (even without the I think you'd be better off telling people to do something like: with the |
|
While I am not certain that my suggestion is the best possibility, I think a new user's expectation for as_text is that it won't remove spaces between words, even if those spaces are handled by the style sheets. For example, if you

Obviously we can't parse style sheets, so that's problematic. That as_text does sometimes combine words makes it hard to do things like word counts or parsing. My suggestion would add an option that largely deals with the issue, given that tags, inline or otherwise, usually wrap whole words.

You suggest the following solution.

I see what you are saying, and for complex parsing, I'd probably do something like you suggest if I was using the current API, differentiating between paragraphs and divs and inline tags. But it would be more complex if you are looking at a whole document, or multiple heterogeneous documents, because you have to consider anything that could contain text.

Maybe instead I should add an as_delimited_text method, which allows you to pass a delimiter and a subroutine to make decisions about the insertion of a delimiter, or if no subroutine is passed, adds the delimiter as my current function does?

As things currently stand though, it's not obvious to a module user that they should use look_down to deal with as_text combining words. |
You seem to have omitted an example at the end of your first paragraph. |
|
Sorry, that should have had https://gist.github.com/gryftir/2b15c2d62c70d8431e4c, where you can see that you get things like GoogleSearch instead of Google Search. |
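The GoogleSearch case can be reproduced outside HTML-Tree. The following is a minimal sketch using Python's standard html.parser. It is not the HTML::Element implementation and not the patch in this pull request. The names TextCollector and as_delimited_text, and the sample markup, are assumptions made purely for illustration. It shows how joining text nodes with an empty string fuses adjacent words, and how a caller-supplied delimiter keeps them apart.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: records every non-blank text node it sees."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Skip whitespace-only nodes; keep the trimmed text of the rest.
        text = data.strip()
        if text:
            self.chunks.append(text)


def as_delimited_text(html, delimiter=""):
    """Hypothetical function: join all text nodes with `delimiter`.

    An empty delimiter imitates the word-fusing behaviour discussed in
    this thread; it does not claim to match as_text in every detail.
    """
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return delimiter.join(collector.chunks)


# Two adjacent inline elements with no whitespace between them in the markup.
snippet = "<div><a href='/'>Google</a><a href='/search'>Search</a></div>"

joined = as_delimited_text(snippet)       # text nodes run together
spaced = as_delimited_text(snippet, " ")  # delimiter keeps words apart

print(joined)
print(spaced)
```

The sketch inserts the delimiter between every pair of text nodes, which is the simple case. A callback that decides per element whether to insert the delimiter, as floated earlier in the thread, would need the collector to track the enclosing tag as well.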
|
You might want to close this and re-submit at https://github.com/kentfredric/HTML-Tree/pulls |
added delimiter option to as_text to provide for a user supplied
delimiter. This allows users to better parse text from a website.
POD has been modified to reflect this change.