Fixed unicode handling, Python 3 support, Request as network backend, better content root extraction and other awesome features #248
Open
Lol4t0 wants to merge 53 commits into grangier:develop
…d install bs3 under py3
* As STOP_WORDS are stored in unicode, word candidates should also be kept in unicode so that they can be compared against the dictionary correctly.
* In some languages, short stop words are joined to the next word in the sentence with a non-breaking space. To recognize those stop words, the tokenizer should support nbsp. Russian is an example.

So this fixes grangier#223
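The tokenizing change described in this commit message could look roughly like the sketch below. It is only an illustration: the `STOP_WORDS` contents and the function names `candidate_words` and `count_stop_words` are made up here and are not taken from the goose code base, and punctuation handling is left out.

```python
import re

# Hypothetical stop-word set for illustration; goose ships its own
# per-language lists.
STOP_WORDS = {"в", "на", "и", "не"}

# Split on any whitespace run; U+00A0 (no-break space) is listed
# explicitly to make the intent visible.
_SPLIT_RE = re.compile(r"[\s\u00a0]+")


def candidate_words(text):
    """Return lower-cased unicode tokens, splitting on whitespace and NBSP."""
    if isinstance(text, bytes):
        # Keep candidates in unicode so they compare equal to STOP_WORDS.
        text = text.decode("utf-8")
    return [word for word in _SPLIT_RE.split(text.lower()) if word]


def count_stop_words(text):
    """Count how many tokens of `text` are in STOP_WORDS."""
    return sum(1 for word in candidate_words(text) if word in STOP_WORDS)
```

With this, a short stop word glued to the next word by a no-break space (as in `"в\u00a0Москве"`) is still seen as its own token.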
See https://github.com/vetal4444/python-goose/tree/python_3
grangier#220
Conflicts: goose/text.py
Python 3 support
Html fetching is now done with requests. Using requests allows writing high-level code that encapsulates the network and HTML levels (decoding gzip, etc)
1.0.28:
* Move to requests as network library
Some special tags can be false positives, so we have to process them all to select the best top node
Requests uses the headers-preferred content encoding, but for HTML the tags-preferred content encoding is a better choice
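A minimal sketch of what "tags-preferred" could mean in practice: look for a charset declared in a `<meta>` tag first, and fall back to the header-derived encoding only if none is found. The names `pick_encoding` and `decode_html`, the 4096-byte scan window, and the `utf-8` default are assumptions made for this example, not the PR's actual implementation.

```python
import codecs
import re

# Matches <meta charset="..."> as well as
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
_META_CHARSET_RE = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?\s*([\w.:-]+)", re.IGNORECASE
)


def pick_encoding(raw_html, header_encoding=None, default="utf-8"):
    """Prefer the charset declared in the HTML over the HTTP header one."""
    match = _META_CHARSET_RE.search(raw_html[:4096])
    if match:
        name = match.group(1).decode("ascii", "replace")
        try:
            codecs.lookup(name)  # ignore charsets Python does not know
            return name
        except LookupError:
            pass
    return header_encoding or default


def decode_html(raw_html, header_encoding=None):
    """Decode raw bytes using the tag-preferred encoding."""
    encoding = pick_encoding(raw_html, header_encoding)
    return raw_html.decode(encoding, errors="replace")
```

With requests, `raw_html` would typically be `response.content` and `header_encoding` would be `response.encoding`.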
Moving to requests as the http library made the test mocks, which relied on urllib mocking, incorrect. This commit fixes the tests by using the mock_requests library for mocking instead of the urllib one.
It is not clear why it was there in the first place, as valid html does not contain such a header. Again, this is not connected to the test itself.
This brings automatic cookie handling, keep-alive connections, and maybe some other features
After moving to the requests http backend, cookies are handled correctly. Test url http://www.nytimes.com/2013/08/18/world/middleeast/pressure-by-us-failed-to-sway-egypts-leaders.html?hp checked and working
Python 3.4, Python 3.5 added
* Requests is used for images. The same http session is used for all requests.
* Analyze all possible text root nodes and select the best one; do not stop at the first text root node candidate
* Improve text selection filters
Config parameter is `known_context_patterns`
Default:
{
    'known_context_patterns': [
        {'attr': 'class', 'value': 'short-story'},
        {'attr': 'itemprop', 'value': 'articleBody'},
        {'attr': 'class', 'value': 'post-content'},
        {'attr': 'class', 'value': 'g-content'},
        {'tag': 'article'},
    ]
}
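To illustrate how such patterns might be applied, here is a rough sketch that checks every node against the default patterns and then picks the best match instead of stopping at the first one. Everything beyond the pattern list is assumed for the example: nodes are plain `(tag, attrs, text)` tuples rather than parsed elements, `class` values are matched per token, and "best" is simply the longest text, which is not necessarily how goose scores candidates.

```python
KNOWN_CONTEXT_PATTERNS = [
    {'attr': 'class', 'value': 'short-story'},
    {'attr': 'itemprop', 'value': 'articleBody'},
    {'attr': 'class', 'value': 'post-content'},
    {'attr': 'class', 'value': 'g-content'},
    {'tag': 'article'},
]


def matches(pattern, tag, attrs):
    """Return True if a node (tag name + attribute dict) fits the pattern."""
    if 'tag' in pattern and pattern['tag'] != tag:
        return False
    if 'attr' in pattern:
        # Attributes such as class may hold several space-separated tokens.
        tokens = attrs.get(pattern['attr'], '').split()
        if pattern['value'] not in tokens:
            return False
    return True


def best_root(nodes, patterns=KNOWN_CONTEXT_PATTERNS):
    """Pick the matching node with the most text, or None if none match.

    `nodes` is an iterable of (tag, attrs, text) tuples; scoring by text
    length is a placeholder assumption for this sketch.
    """
    candidates = [
        node for node in nodes
        if any(matches(p, node[0], node[1]) for p in patterns)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda node: len(node[2]))
```

Collecting all candidates before choosing is what guards against a false-positive tag that merely appears earlier in the document.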
When performing network requests, use the request timeout provided by the goose configuration
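A minimal sketch of passing a configured timeout to every request made through one shared session. The function name `fetch_html`, its parameters, and the default values are hypothetical; only `requests.Session`, `Session.get(..., timeout=...)`, `raise_for_status()` and `response.content` are real requests API.

```python
def fetch_html(url, session=None, timeout=10.0, user_agent="example-agent/0.1"):
    """Fetch `url` and return the raw body bytes.

    `timeout` stands in for the value taken from the goose configuration;
    the parameter name and default are assumptions of this sketch.
    """
    if session is None:
        # Imported lazily so the sketch can be exercised with a stub session.
        import requests  # third-party
        session = requests.Session()
    response = session.get(
        url, timeout=timeout, headers={"User-Agent": user_agent}
    )
    response.raise_for_status()
    return response.content
```

Reusing the same `session` object across the article and image requests is what gives cookie handling and keep-alive connections.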
support
added 2 commits
January 26, 2016 17:17
Swallowing errors makes it difficult to understand whether something went wrong with network, goose, or target resource. So strict mode (now default) is introduced. With this mode goose will raise Exception instead of returning empty responses.
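The difference between the two modes can be sketched as follows. The names `FetchError` and `read_body` and the `strict` keyword are invented for this illustration; the PR only states that strict mode is the default and raises instead of returning empty responses.

```python
class FetchError(Exception):
    """Hypothetical error type used only in this sketch."""


def read_body(fetch, url, strict=True):
    """Call `fetch(url)`; on failure either raise or return an empty body.

    `fetch` is any callable that returns bytes, for example a function
    that wraps a requests session.
    """
    try:
        return fetch(url)
    except Exception as exc:
        if strict:
            # Surface the failure so the caller can tell what went wrong.
            raise FetchError("could not fetch %s: %s" % (url, exc)) from exc
        # Non-strict behaviour: swallow the error, return nothing.
        return b""
```

In strict mode the original exception stays attached as `__cause__`, so network problems can still be told apart from problems with the target resource.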
@grangier please merge this, Python 3 compatibility would be great to have
@grangier +1 on merging this PR. Python3 support is really needed.
@grainger Please merge, we are no longer using python2x
FYI, I've produced a pypi package. I appreciate all the work that @grangier has done, but I really needed goose to work on python3. If you'd like to fix any bugs, tests, etc I'm more than happy to put in time to look at pull requests and merge them. Thank you.