Open
41 commits
bebb765
Load plugins early for improved flexibility (fixes chfoo/wpull#383)
JustAnotherArchivist Jan 15, 2018
dd60250
Fix URLRecord not defining all fields on initialisation
JustAnotherArchivist Jan 16, 2018
526d44c
Initial implementation of URL prioritisation
JustAnotherArchivist Jan 20, 2018
d233570
Test suite for prioritisation code
JustAnotherArchivist Jan 20, 2018
7add6ad
Fix pipeline retrieving an item from the ItemSource prematurely
JustAnotherArchivist Jan 20, 2018
42d9659
Replace the poison pill generic object with a custom class
JustAnotherArchivist Jan 20, 2018
3ec379b
Simplify ItemQueue.put_item_coro
JustAnotherArchivist Jan 20, 2018
748cd4f
Add docstrings to the pipeline module
JustAnotherArchivist Jan 20, 2018
07ce7d2
Fix pipeline bug introduced in 2e42b7d9 leading to links discovered o…
JustAnotherArchivist Jan 20, 2018
c275fcc
Apply --concurrent argument to pipeline series (fixes chfoo/wpull#339)
JustAnotherArchivist Jan 20, 2018
7341341
URL prioritisation plugin interface get_priority
JustAnotherArchivist Jan 22, 2018
00410a6
Tests for get_priority hook
JustAnotherArchivist Jan 22, 2018
6e2057a
Update docs
JustAnotherArchivist Jan 22, 2018
c14f2d4
Pin html5lib version to 0.9999999 (seven nines) (fixes chfoo/wpull#332)
JustAnotherArchivist Jan 22, 2018
65d0809
Docs: Add description of prioritisation
JustAnotherArchivist Jan 27, 2018
16624e8
Add --warc-split-meta
JustAnotherArchivist Jan 28, 2018
d327eec
Remove flattening consecutive slashes in URL paths (fixes chfoo/wpull…
JustAnotherArchivist Feb 22, 2018
4e41272
Fix test broken by 44ef3690
JustAnotherArchivist Feb 22, 2018
9b9fc0a
Typo
JustAnotherArchivist Feb 22, 2018
9d7e2d2
Handle backslashes in the path of special URLs like forward slashes (…
JustAnotherArchivist Feb 22, 2018
c279743
Handle email.utils.parsedate returning None when parsing fails (fixes…
JustAnotherArchivist Feb 22, 2018
2bd10f7
Fix changelog
JustAnotherArchivist Feb 22, 2018
b4641ac
Strip tab and newline characters from scraped URLs (fixes chfoo/wpull…
JustAnotherArchivist Feb 22, 2018
79b21f9
Handle empty ports correctly (fixes chfoo/wpull#340)
JustAnotherArchivist Feb 22, 2018
d3264df
Update changelog to reflect the new situation regarding the official …
JustAnotherArchivist Oct 10, 2018
1a3ef99
Document --warc-split-meta
JustAnotherArchivist Oct 10, 2018
de601e8
Print some version information on CI
JustAnotherArchivist Oct 18, 2018
5b14cf7
Work around the change in tornadoweb/tornado@84bb2e28
JustAnotherArchivist Oct 18, 2018
b71a874
Use snake_case for variables in prioritisation argument handling
JustAnotherArchivist Nov 3, 2018
748ac7e
Spaces according to PEP8
JustAnotherArchivist Nov 3, 2018
edce0d6
Wrap docstrings
JustAnotherArchivist Nov 3, 2018
ab4b110
Wrap long lines
JustAnotherArchivist Nov 3, 2018
3c45bc7
Remove stray prints
JustAnotherArchivist Nov 3, 2018
0fcbae5
Fix long lines in prioritisation plugin
JustAnotherArchivist Nov 3, 2018
1d5d30d
Pass the URLInfo and URLRecord objects directly into the get_priority…
JustAnotherArchivist Nov 3, 2018
66d2d29
s/priorisation/prioritisation/
JustAnotherArchivist Nov 3, 2018
90f3b14
A few minor corrections
JustAnotherArchivist Nov 3, 2018
d0a98eb
Fix test broken by a100e5bd
JustAnotherArchivist Nov 3, 2018
47a7c66
Fix order of todo/error with priorities
JustAnotherArchivist Nov 3, 2018
37dfdf9
Fix performance regression due to missing index
JustAnotherArchivist May 11, 2019
6530b28
Pin html5lib and SQLAlchemy to working versions in setup.py
JustAnotherArchivist Jan 16, 2023
9 changes: 9 additions & 0 deletions .drone.yml
@@ -8,6 +8,9 @@ steps:
commands:
- pip install -r requirements.txt
- pip install nose coverage warcat youtube-dl
- python --version
- pip --version
- pip freeze
- pip install . --no-dependencies
- nosetests --with-coverage --cover-package=wpull --cover-branches

@@ -21,6 +24,9 @@ steps:
commands:
- pip install -r requirements.txt
- pip install nose coverage warcat youtube-dl
- python --version
- pip --version
- pip freeze
- pip install . --no-dependencies
- nosetests --with-coverage --cover-package=wpull --cover-branches
depends_on:
@@ -36,6 +42,9 @@ steps:
commands:
- pip install -r requirements.txt
- pip install nose coverage warcat youtube-dl
- python --version
- pip --version
- pip freeze
- pip install . --no-dependencies
- nosetests --with-coverage --cover-package=wpull --cover-branches
depends_on:
10 changes: 10 additions & 0 deletions doc/api/urlprioritiser.rst
@@ -0,0 +1,10 @@
.. This document was automatically generated.
DO NOT EDIT!

:mod:`urlprioritiser` Module
============================

.. automodule:: wpull.urlprioritiser
:members:
:show-inheritance:
:undoc-members:
45 changes: 45 additions & 0 deletions doc/changelog.rst
@@ -10,6 +10,51 @@ Summary of notable changes.
Unreleased
==========

* Fixed: Plugins are now loaded early again, restoring behaviour in version 1.x and allowing plugins to override wpull components again.
* Fixed: `--concurrent` option had no effect since version 2.0.
* Fixed: wpull no longer flattens consecutive slashes in the path component of URLs, in line with RFC 3986, related standards, and behaviour of other software (wget and browsers).
* Fixed: Backslashes in URL paths are now treated like forward slashes in accordance with the URL Standard and behaviour of browsers.
* Fixed: ASCII tab and newline characters are now stripped from URLs, as required by the URL Standard.
* Fixed: Empty ports in URLs are now handled correctly, i.e. as if no colon appeared in the host.
* Added: URL prioritisation through `--priority-*` options and a `get_priority` hook.
* Added: Meta WARC splitting whenever data WARCs are split using the `--warc-split-meta` option.
* Changed: `wpull.pipeline.pipeline.ItemSource` now deals in item generation coroutines rather than items directly.
* Changed: `wpull.database.base.BaseURLTable` and its subclasses now have to take care of returning the URLs in the expected order; the two arguments to their `check_out` method have been removed.
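
The four URL-handling fixes above can be pictured with a small stand-alone sketch. It is an illustration only, not wpull's parser: the function name `clean_scraped_url` and the `SPECIAL_SCHEMES` set are made up for this example, and the logic is a simplification of the rules the changelog refers to.

```python
# Schemes the URL Standard treats as "special" (assumed subset, for illustration).
SPECIAL_SCHEMES = {"http", "https", "ftp", "ws", "wss", "file"}


def clean_scraped_url(url: str) -> str:
    """Hypothetical helper; a simplified sketch, not wpull's implementation."""
    # Strip ASCII tab and newline characters anywhere in the URL.
    for char in "\t\n\r":
        url = url.replace(char, "")

    scheme, separator, rest = url.partition("://")
    if not separator or scheme.lower() not in SPECIAL_SCHEMES:
        return url  # Leave relative and non-special URLs untouched.

    # Split off query and fragment so only authority and path are modified.
    cut = min(
        (i for i in (rest.find("?"), rest.find("#")) if i != -1),
        default=len(rest),
    )
    head, tail = rest[:cut], rest[cut:]

    # Backslashes behave like forward slashes in special URLs.
    head = head.replace("\\", "/")

    authority, slash, path = head.partition("/")
    # An empty port ("host:") is treated as if the colon were absent.
    if authority.endswith(":"):
        authority = authority[:-1]

    # Consecutive slashes in the path are deliberately left as they are.
    return scheme + "://" + authority + slash + path + tail
```

For instance, `clean_scraped_url("http://example.com:/a\\b//c\n")` gives `http://example.com/a/b//c`: the newline and the empty port are gone, the backslash became a slash, and the double slash is preserved.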


Backwards incompatibility
+++++++++++++++++++++++++

Plugins are now loaded very early in the initialisation again (as in version 1.x). This means that the various components of wpull
are not yet initialised at the time `WpullPlugin.activate()` is executed. Plugins written for versions 2.0 through 2.0.3 which use
instances directly will need to be changed to instead replace the relevant class in `WpullPlugin.app_session.factory.class_map`.
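
As a rough sketch of what that migration looks like, the following uses stand-in classes rather than wpull itself. Only the names `class_map` and `activate` come from the text above; `Factory`, `DefaultWaiter`, `SlowWaiter` and `SlowWaiterPlugin` are hypothetical and exist only to show why the class, not the instance, has to be replaced when plugins run before components are built.

```python
class Factory:
    """Stand-in for a factory that maps names to classes and builds lazily."""

    def __init__(self, class_map):
        self.class_map = dict(class_map)

    def new(self, name, *args, **kwargs):
        # The class is looked up at instantiation time, so earlier changes
        # to class_map take effect here.
        return self.class_map[name](*args, **kwargs)


class DefaultWaiter:
    def wait_time(self):
        return 1.0


class SlowWaiter(DefaultWaiter):
    def wait_time(self):
        return 5.0


class SlowWaiterPlugin:
    """Hypothetical plugin; a real one would subclass WpullPlugin."""

    def __init__(self, factory):
        self.factory = factory

    def activate(self):
        # No Waiter instance exists yet at this point, so swap the class.
        self.factory.class_map["Waiter"] = SlowWaiter


factory = Factory({"Waiter": DefaultWaiter})
SlowWaiterPlugin(factory).activate()  # plugins are activated first ...
waiter = factory.new("Waiter")        # ... components are built afterwards
```

A plugin that instead tried to patch an existing `Waiter` instance inside `activate()` would find nothing to patch, because the instance is only created later.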

Any direct usage of the `ItemSource`, `Worker`, or `Producer` classes in `wpull.pipeline.pipeline` has to be adapted to handle item
generation coroutines instead of items. The `Pipeline` and `PipelineSeries` classes in the same module are not affected.

The `*URLTable.check_out` method no longer receives status and level filter arguments. It is now the URLTable's duty to return URLs
in the correct order. For the default implementation, this means first sorted by highest priority, then todos before errors.
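
The ordering described for the default implementation can be sketched as a sort key. This is an in-memory illustration under assumptions: the `Record` tuple, the status strings and `check_out_order` are hypothetical names, and the real table is backed by a database rather than a Python list.

```python
from collections import namedtuple

# Hypothetical record shape, for illustration only.
Record = namedtuple("Record", ["url", "status", "priority"])

# Lower rank is returned first: todos before errors.
STATUS_RANK = {"todo": 0, "error": 1}


def check_out_order(records):
    """Sort by highest priority first, then todo before error."""
    return sorted(records, key=lambda r: (-r.priority, STATUS_RANK[r.status]))


records = [
    Record("http://example.com/a", "error", 5),
    Record("http://example.com/b", "todo", 0),
    Record("http://example.com/c", "todo", 5),
    Record("http://example.com/d", "error", 0),
]
ordered = [r.url for r in check_out_order(records)]
```

Here the two priority-5 URLs come first (`/c` as a todo, then `/a` as an error), followed by the priority-0 URLs in the same todo-then-error order.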


2.0.3 (2017-05-15)
==================

* Removed: HTTP CONNECT support
* Fixed: `ValueError` crash from URL parsing

Note: This version was only available on the fork by falconkirtaran until 2018-10-10.


2.0.2 (2017-01-12)
==================

* Fixed: Deadlock when wpull is finished.
* Fixed: `AttributeError` crash on some SSL connections.
* Fixed: `--warc-max-size` option had no effect since version 2.0.
* Fixed: `AttributeError` crash from asyncio reading from a closed connection.

Note: This version was only available on the fork by falconkirtaran until 2018-10-10.


2.0.1 (2016-06-21)
==================
3 changes: 3 additions & 0 deletions doc/differences.rst
@@ -60,6 +60,7 @@ Missing in Wget
* ``--no-use-internal-ca-certs``
* ``--warc-append``
* ``--warc-move``: Move WARC files out of the way for resuming a crashed crawl.
* ``--warc-split-meta``
* ``--page-requisites-level``: Prevent infinite downloading of misconfigured server resources such as HTML served under an image.
* ``--sitemaps``: Discover more URLs.
* ``--hostnames``: Wget simply matches the endings when using ``--domains`` instead of matching each part of the hostname.
@@ -69,6 +70,8 @@ Missing in Wget
* ``--proxy-server``
* ``--proxy-server-address``
* ``--proxy-server-port``
* ``--priority-regex``: Control in which order the URLs are retrieved.
* ``--priority-domain``
* ``--phantomjs``
* ``--phantomjs-exe``
* ``--phantomjs-max-time``
6 changes: 6 additions & 0 deletions doc/scripting.rst
@@ -13,6 +13,12 @@ load it with ``--plugin-script`` option.

The plugin interface provides two types of callbacks: hooks and events.

Callbacks receive a number of internal objects from wpull, e.g.
:py:class:`wpull.url.URLInfo` and
:py:class:`wpull.pipeline.session.ItemSession`. It is strongly
recommended not to modify these objects in any way in the callbacks.
Doing so results in undefined behaviour.


Hook
++++
3 changes: 3 additions & 0 deletions doc/scripting_interfaces_include.rst
@@ -13,6 +13,9 @@
:py:attr:`PluginFunctions.finishing_statistics <wpull.application.plugin.PluginFunctions.finishing_statistics>`
event Interface: :py:meth:`StatsStopTask.plugin_finishing_statistics <wpull.application.tasks.stats.StatsStopTask.plugin_finishing_statistics>`

:py:attr:`PluginFunctions.get_priority <wpull.application.plugin.PluginFunctions.get_priority>`
hook Interface: :py:meth:`URLPrioritiser.plugin_get_priority <wpull.urlprioritiser.URLPrioritiser.plugin_get_priority>`

:py:attr:`PluginFunctions.get_urls <wpull.application.plugin.PluginFunctions.get_urls>`
event Interface: :py:meth:`ProcessingRule.plugin_get_urls <wpull.processor.rule.ProcessingRule.plugin_get_urls>`

11 changes: 11 additions & 0 deletions doc/usage.rst
@@ -70,6 +70,17 @@ example, limit the recursion depth.
move the files ``--warc-move``.


Prioritisation
==============

With the ``--priority-regex`` and ``--priority-domain`` options, you can
control in which order the URLs in the queue are downloaded. These options
can be specified multiple times and are used in the given order. Each URL
is checked against the list of priority rules, and the first matching rule
sets the priority for the URL. The default priority (if no rule matches)
is zero. The higher a URL's priority value is, the sooner it is processed.
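
The first-match rule described above can be sketched as follows. This is a minimal illustration, not the code in `wpull.urlprioritiser`: the function name `first_matching_priority`, the `(pattern, priority)` pair format and the use of `re.search` are assumptions made for the example.

```python
import re


def first_matching_priority(url, rules, default=0):
    """Return the priority of the first rule whose pattern matches the URL.

    `rules` is an ordered list of (regex pattern, priority) pairs; this
    shape is assumed for illustration.
    """
    for pattern, priority in rules:
        if re.search(pattern, url):
            return priority
    # No rule matched: fall back to the default priority of zero.
    return default


# Rules are checked in the order given, so the CSS rule wins for CSS files
# even though the domain rule would also match them.
rules = [
    (r"\.css$", 10),
    (r"^https?://example\.com/", 5),
]

css = first_matching_priority("http://example.com/style.css", rules)
page = first_matching_priority("http://example.com/page", rules)
other = first_matching_priority("http://other.example/", rules)
```

With these rules the stylesheet gets priority 10, the other page on the same domain gets 5, and the unrelated URL keeps the default of 0, so the stylesheet would be fetched first.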


Proxied Services
================

2 changes: 1 addition & 1 deletion requirements-sphinx.txt
@@ -1,7 +1,7 @@
# Should be the same for requirements.txt:
chardet>=2.0.1,<=2.3
dnspython3==1.12
html5lib>=0.999,<1.0
html5lib>=0.999,<=0.9999999
# lxml>=3.1.0,<=3.5 # except for this because it requires building C libs
namedlist>=1.3,<=1.7
psutil>=2.0,<=4.2
2 changes: 1 addition & 1 deletion requirements.txt
@@ -1,7 +1,7 @@
# Absolutely known to work versions only:
chardet>=2.0.1,<=2.3
dnspython3==1.12
html5lib>=0.999,<1.0
html5lib>=0.999,<=0.9999999
lxml>=3.1.0,<=3.5
namedlist>=1.3,<=1.7
psutil>=2.0,<=4.2
4 changes: 2 additions & 2 deletions setup.py
@@ -101,9 +101,9 @@ def get_version():
setup_kwargs['install_requires'] = [
'chardet',
'dnspython3',
'html5lib',
'html5lib <= 0.9999999',
'namedlist',
'sqlalchemy',
'sqlalchemy < 1.3.0',
'tornado',
'yapsy',
]
8 changes: 6 additions & 2 deletions wpull/application/builder.py
@@ -17,7 +17,7 @@
from wpull.application.tasks.plugin import PluginSetupTask
from wpull.application.tasks.resmon import ResmonSetupTask, ResmonSleepTask
from wpull.application.tasks.rule import URLFiltersSetupTask, \
URLFiltersPostURLImportSetupTask
URLFiltersPostURLImportSetupTask, URLPrioritiserSetupTask
from wpull.application.tasks.sslcontext import SSLContextTask
from wpull.application.tasks.stats import StatsStartTask, StatsStopTask
from wpull.application.tasks.warc import WARCRecorderSetupTask, \
@@ -65,6 +65,7 @@
from wpull.stats import Statistics
from wpull.url import URLInfo
from wpull.urlfilter import DemuxURLFilter
from wpull.urlprioritiser import URLPrioritiser
from wpull.urlrewrite import URLRewriter
from wpull.waiter import LinearWaiter
from wpull.warc.recorder import WARCRecorder
@@ -123,6 +124,7 @@ def __init__(self, args, unit_test=False):
'SitemapScraper': SitemapScraper,
'Statistics': Statistics,
'URLInfo': URLInfo,
'URLPrioritiser': URLPrioritiser,
'URLTable': URLTableHookWrapper,
'URLTableImplementation': SQLURLTable,
'URLRewriter': URLRewriter,
@@ -160,13 +162,15 @@ def _build_pipelines(self) -> PipelineSeries:
AppSource(app_session),
[
LoggingSetupTask(),
PluginSetupTask(),
DatabaseSetupTask(),
ParserSetupTask(),
WARCVisitsTask(),
SSLContextTask(),
ResmonSetupTask(),
StatsStartTask(),
URLFiltersSetupTask(),
URLPrioritiserSetupTask(),
NetworkSetupTask(),
ClientSetupTask(),
WARCRecorderSetupTask(),
@@ -175,7 +179,6 @@ def _build_pipelines(self) -> PipelineSeries:
ProxyServerSetupTask(),
CoprocessorSetupTask(),
LinkConversionSetupTask(),
PluginSetupTask(),
InputURLTask(),
URLFiltersPostURLImportSetupTask(),
])
@@ -226,6 +229,7 @@ def _build_pipelines(self) -> PipelineSeries:
download_stop_pipeline, conversion_pipeline, app_stop_pipeline
))
pipeline_series.concurrency_pipelines.add(download_pipeline)
pipeline_series.concurrency = self._args.concurrent

return pipeline_series

30 changes: 25 additions & 5 deletions wpull/application/hook.py
@@ -11,7 +11,7 @@

import asyncio

from typing import Optional
from typing import Optional, Iterable

from wpull.application.plugin import WpullPlugin, PluginFunctionCategory
from wpull.backport.logging import BraceMessage as __
@@ -30,9 +30,10 @@ class HookAlreadyConnectedError(ValueError):

class HookDispatcher(collections.abc.Mapping):
'''Dynamic callback hook system.'''
def __init__(self, event_dispatcher_transclusion: Optional['EventDispatcher']=None):
def __init__(self, plugins: Optional[Iterable[WpullPlugin]] = [], event_dispatcher_transclusion: Optional['EventDispatcher'] = None):
super().__init__()
self._callbacks = {}
self._plugins = plugins
self._event_dispatcher = event_dispatcher_transclusion

def __getitem__(self, key):
@@ -54,6 +55,11 @@ def register(self, name: str):
if self._event_dispatcher is not None:
self._event_dispatcher.register(name)

for plugin in self._plugins:
for func, f_name, f_category in plugin.get_plugin_functions():
if f_category == PluginFunctionCategory.hook and f_name == name:
self.connect(name, func)

def unregister(self, name: str):
'''Unregister hook.'''
del self._callbacks[name]
@@ -102,8 +108,9 @@ def is_registered(self, name: str) -> bool:


class EventDispatcher(collections.abc.Mapping):
def __init__(self):
def __init__(self, plugins: Optional[Iterable[WpullPlugin]] = []):
self._callbacks = {}
self._plugins = plugins

def __getitem__(self, key):
return self._callbacks[key]
@@ -120,6 +127,11 @@ def register(self, name: str):

self._callbacks[name] = set()

for plugin in self._plugins:
for func, f_name, f_category in plugin.get_plugin_functions():
if f_category == PluginFunctionCategory.event and f_name == name:
self.add_listener(name, func)

def unregister(self, name: str):
del self._callbacks[name]

@@ -138,10 +150,12 @@ def is_registered(self, name: str) -> bool:


class HookableMixin(object):
_plugins = [] # type: Iterable[WpullPlugin]

def __init__(self):
super().__init__()
self.event_dispatcher = EventDispatcher()
self.hook_dispatcher = HookDispatcher(event_dispatcher_transclusion=self.event_dispatcher)
self.event_dispatcher = EventDispatcher(plugins=self._plugins)
self.hook_dispatcher = HookDispatcher(event_dispatcher_transclusion=self.event_dispatcher, plugins=self._plugins)

def connect_plugin(self, plugin: WpullPlugin):
for func, name, category in plugin.get_plugin_functions():
@@ -156,6 +170,12 @@ def connect_plugin(self, plugin: WpullPlugin):
_logger.debug('Connected event %s %s', name, func)
self.event_dispatcher.add_listener(name, func)

@classmethod
def set_plugins(cls, plugins: Iterable[WpullPlugin]):
HookableMixin._plugins = plugins
# Note that HookableMixin is hardcoded here as the plugin list is always defined at the level of this class.
# If cls._plugins was used instead, calling set_plugins of a subclass would break unit tests, for example.


class HookStop(Exception):
'''Stop the engine.