10 changes: 0 additions & 10 deletions doc/api/driver.phantomjs.rst

This file was deleted.

10 changes: 0 additions & 10 deletions doc/api/processor.coprocessor.phantomjs.rst

This file was deleted.

7 changes: 0 additions & 7 deletions doc/differences.rst
@@ -69,12 +69,5 @@ Missing in Wget
* ``--proxy-server``
* ``--proxy-server-address``
* ``--proxy-server-port``
* ``--phantomjs``
* ``--phantomjs-exe``
* ``--phantomjs-max-time``
* ``--phantomjs-scroll``
* ``--phantomjs-wait``
* ``--no-phantomjs-snapshot``
* ``--no-phantomjs-smart-scroll``
* ``--youtube-dl``
* ``--youtube-dl-exe``
8 changes: 0 additions & 8 deletions doc/install.rst
@@ -25,8 +25,6 @@ The following are optional:

* `psutil` for monitoring disk space
* `Manhole <https://pypi.python.org/pypi/manhole>`_ for a REPL debugging socket
* `PhantomJS 1.9.8, 2.1 <http://phantomjs.org/>`_ for capturing interactive
JavaScript pages
* `youtube-dl <https://rg3.github.io/youtube-dl/>`_ for downloading complex
video streaming sites

@@ -116,9 +114,3 @@ pip. Note for Linux users, ensure you are executing the appropriate
Python version when installing pip.


PhantomJS (Optional)
++++++++++++++++++++

It is recommended to download a prebuilt binary build from
http://phantomjs.org/download.html.

2 changes: 1 addition & 1 deletion doc/intro.rst
@@ -15,7 +15,7 @@ Notable Features:

* Written in Python: lightweight, modifiable, robust, & scriptable
* Graceful stopping; on-disk database resume
* PhantomJS & youtube-dl integration (experimental)
* youtube-dl integration (experimental)

.. ⬆ Please keep this intro above in sync with the README file. ⬆
Additional intro stuff not in the README should go below.
14 changes: 0 additions & 14 deletions doc/usage.rst
@@ -83,20 +83,6 @@ The requests will go through the proxy to Wpull's HTTP client (which can be reco
It is not possible to use the proxy standalone at this time.


PhantomJS Integration
+++++++++++++++++++++

**PhantomJS support is currently experimental.**

``--phantomjs`` will enable PhantomJS integration.

If a HTML document is encountered, Wpull will open the URL in PhantomJS. After the page is loaded, Wpull will try to scroll the page as specified by ``--phantomjs-scroll``. Then, the HTML DOM source is scraped for URLs as normal. HTML and PDF snapshots are taken by default.

Currently, Wpull will *not do anything else* to manipulate the page such as clicking on links. As a consequence, Wpull with PhantomJS is *not* a complete solution for dynamic web pages yet!

Storing console logs and alert messages inside the WARC file is not yet supported.


youtube-dl Integration
++++++++++++++++++++++

29 changes: 0 additions & 29 deletions doc/warc.rst
@@ -45,35 +45,6 @@ The response data is recorded as
* WARC-Concurrent-To: a WARC Record ID of the Control Conversation


PhantomJS
+++++++++


Snapshot
--------

A PhantomJS Snapshot represents the state of the DOM at the time of capture.

A Snapshot is recorded as

* WARC-Type: ``resource``
* WARC-Target-URI: ``urn:X-wpull:snapshot?url=URLHERE`` where ``URLHERE`` is a percent-encoded URL of the PhantomJS page.
* Content-Type: one of ``application/pdf``, ``text/html``, ``image/png``
* WARC-Concurrent-To: a WARC Record ID of a Snapshot Action Metadata.


Snapshot Action Metadata
------------------------

An Action Metadata is a log of steps performed before a Snapshot is taken.

It is recorded as

* WARC-Type: ``metadata``
* Content-Type: ``application/json``
* WARC-Target-URI: ``urn:X-wpull:snapshot?url=URLHERE`` where ``URLHERE`` is a percent-encoded URL of the PhantomJS page.


Wpull Metadata
++++++++++++++

4 changes: 0 additions & 4 deletions wpull/application/builder.py
@@ -32,15 +32,13 @@
from wpull.cookie import DeFactoCookiePolicy
from wpull.database.sqltable import URLTable as SQLURLTable
from wpull.database.wrap import URLTableHookWrapper
from wpull.driver.phantomjs import PhantomJSDriver
from wpull.network.bandwidth import BandwidthLimiter
from wpull.network.dns import Resolver
from wpull.network.pool import ConnectionPool
from wpull.path import PathNamer
from wpull.pipeline.app import AppSource, AppSession
from wpull.pipeline.pipeline import Pipeline, PipelineSeries
from wpull.pipeline.session import URLItemSource
from wpull.processor.coprocessor.phantomjs import PhantomJSCoprocessor
from wpull.processor.coprocessor.proxy import ProxyCoprocessor
from wpull.processor.coprocessor.youtubedl import YoutubeDlCoprocessor
from wpull.processor.delegate import DelegateProcessor
@@ -106,8 +104,6 @@ def __init__(self, args, unit_test=False):
'HTMLScraper': HTMLScraper,
'JavaScriptScraper': JavaScriptScraper,
'PathNamer': PathNamer,
'PhantomJSDriver': PhantomJSDriver,
'PhantomJSCoprocessor': PhantomJSCoprocessor,
'PipelineSeries': PipelineSeries,
'ProcessingRule': ProcessingRule,
'Processor': DelegateProcessor,
49 changes: 0 additions & 49 deletions wpull/application/options.py
@@ -205,7 +205,6 @@ def _add_app_args(self):
self._add_recursive_args()
self._add_accept_args()
self._add_proxy_server_args()
self._add_phantomjs_args()
self._add_youtube_dl_args()

def _add_startup_args(self):
@@ -1303,54 +1302,6 @@ def _add_proxy_server_args(self):
help=_('bind the proxy server port to PORT')
)

def _add_phantomjs_args(self):
group = self.add_argument_group(_('PhantomJS'))
group.add_argument(
'--phantomjs',
action='store_true',
help=_('use PhantomJS for loading dynamic pages'),
)
group.add_argument(
'--phantomjs-exe',
metavar='PATH',
default='phantomjs',
help=_('path of PhantomJS executable')
)
group.add_argument(
'--phantomjs-max-time',
default=900,
type=self.int_0_inf,
help=_('maximum duration of PhantomJS session')
)
group.add_argument(
'--phantomjs-scroll',
type=int,
default=20,
metavar='NUM',
help=_('scroll the page up to NUM times'),
)
group.add_argument(
'--phantomjs-wait',
type=float,
default=1.0,
metavar='SEC',
help=_('wait SEC seconds between page interactions'),
)
group.add_argument(
'--no-phantomjs-snapshot',
action='store_false',
dest='phantomjs_snapshot',
default=True,
help=_('don’t take dynamic page snapshots'),
)
group.add_argument(
'--no-phantomjs-smart-scroll',
action='store_false',
dest='phantomjs_smart_scroll',
default=True,
help=_('always scroll the page to maximum scroll count option'),
)

def _add_youtube_dl_args(self):
group = self.add_argument_group(_('youtube-dl'))
group.add_argument(
72 changes: 2 additions & 70 deletions wpull/application/tasks/download.py
@@ -7,7 +7,6 @@

from wpull.backport.logging import BraceMessage as __
from wpull.cookie import BetterMozillaCookieJar
from wpull.processor.coprocessor.phantomjs import PhantomJSParams
from wpull.namevalue import NameValueRecord
from wpull.pipeline.pipeline import ItemTask
from wpull.pipeline.session import ItemSession
@@ -18,7 +17,6 @@
from wpull.protocol.http.stream import Stream as HTTPStream
import wpull.util
import wpull.processor.coprocessor.youtubedl
import wpull.driver.phantomjs
import wpull.application.hook

_logger = logging.getLogger(__name__)
@@ -224,7 +222,7 @@ class ProxyServerSetupTask(ItemTask[AppSession]):
def process(self, session: AppSession):
'''Build MITM proxy server.'''
args = session.args
if not (args.phantomjs or args.youtube_dl or args.proxy_server):
if not (args.youtube_dl or args.proxy_server):
return

proxy_server = session.factory.new(
@@ -388,81 +386,15 @@ class CoprocessorSetupTask(ItemTask[ItemSession]):
@asyncio.coroutine
def process(self, session: AppSession):
args = session.args
if args.phantomjs or args.youtube_dl or args.proxy_server:
if args.youtube_dl or args.proxy_server:
proxy_port = session.proxy_server_port
assert proxy_port

if args.phantomjs:
phantomjs_coprocessor = self._build_phantomjs_coprocessor(session, proxy_port)
else:
phantomjs_coprocessor = None

if args.youtube_dl:
youtube_dl_coprocessor = self._build_youtube_dl_coprocessor(session, proxy_port)
else:
youtube_dl_coprocessor = None

@classmethod
def _build_phantomjs_coprocessor(cls, session: AppSession, proxy_port: int):
'''Build proxy server and PhantomJS client. controller, coprocessor.'''
page_settings = {}
default_headers = NameValueRecord()

for header_string in session.args.header:
default_headers.parse(header_string)

# Since we can only pass a one-to-one mapping to PhantomJS,
# we put these last since NameValueRecord.items() will use only the
# first value added for each key.
default_headers.add('Accept-Language', '*')

if not session.args.http_compression:
default_headers.add('Accept-Encoding', 'identity')

default_headers = dict(default_headers.items())

if session.args.read_timeout:
page_settings['resourceTimeout'] = session.args.read_timeout * 1000

page_settings['userAgent'] = session.args.user_agent \
or session.default_user_agent

# Test early for executable
wpull.driver.phantomjs.get_version(session.args.phantomjs_exe)

phantomjs_params = PhantomJSParams(
wait_time=session.args.phantomjs_wait,
num_scrolls=session.args.phantomjs_scroll,
smart_scroll=session.args.phantomjs_smart_scroll,
snapshot=session.args.phantomjs_snapshot,
custom_headers=default_headers,
page_settings=page_settings,
load_time=session.args.phantomjs_max_time,
)

extra_args = [
'--proxy',
'{}:{}'.format(session.args.proxy_server_address, proxy_port),
'--ignore-ssl-errors=true'
]

phantomjs_driver_factory = functools.partial(
session.factory.class_map['PhantomJSDriver'],
exe_path=session.args.phantomjs_exe,
extra_args=extra_args,
)

phantomjs_coprocessor = session.factory.new(
'PhantomJSCoprocessor',
phantomjs_driver_factory,
session.factory['ProcessingRule'],
phantomjs_params,
root_path=session.args.directory_prefix,
warc_recorder=session.factory.get('WARCRecorder'),
)

return phantomjs_coprocessor

@classmethod
def _build_youtube_dl_coprocessor(cls, session: AppSession, proxy_port: int):
'''Build youtube-dl coprocessor.'''
6 changes: 0 additions & 6 deletions wpull/application/tasks/warc.py
@@ -9,7 +9,6 @@
from wpull.pipeline.app import AppSession
from wpull.pipeline.pipeline import ItemTask
from wpull.warc.recorder import WARCRecorder, WARCRecorderParams
import wpull.driver.phantomjs
import wpull.processor.coprocessor.youtubedl
import wpull.warc.format

@@ -43,11 +42,6 @@ def process(self, session: AppSession):

software_string = WARCRecorder.DEFAULT_SOFTWARE_STRING

if args.phantomjs:
software_string += ' PhantomJS/{0}'.format(
wpull.driver.phantomjs.get_version(exe_path=args.phantomjs_exe)
)

if args.youtube_dl:
software_string += ' youtube-dl/{0}'.format(
wpull.processor.coprocessor.youtubedl.get_version(exe_path=args.youtube_dl_exe)
4 changes: 0 additions & 4 deletions wpull/driver/Makefile

This file was deleted.
