Misc patch p2 #86
base: dev
Changes from all commits
bf620c1
9f22628
b57cf16
1e6af1c
3c0c02a
34f5b03
8767b77
451dc20
b27be09
5f2a9cc
b11a924
9d55361
9848d1e
d725382
8fca673
    @@ -39,8 +39,9 @@ def scraper_post_processing(raw_articles, model_start_date, id_col='id',
                 param.irrelevant_link)]

         # Subset the data only after the model_start_date
    -    processed_articles = processed_articles[processed_articles[date_col]
    -                                            > model_start_date]
    +    processed_articles = processed_articles[
    +        (processed_articles[date_col] > model_start_date) &
    +        (processed_articles[date_col] <= datetime.today())]
Contributor
Using today's date will behave unpredictably; I'd refer to the Airflow execution date to enforce reproducibility between runs.
         return processed_articles
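The reviewer's concern about `datetime.today()` can be illustrated with a small sketch in which the upper bound is passed in by the caller rather than read from the system clock. This is only an illustration under assumptions: `subset_by_date`, the `run_date` parameter, and the sample records are hypothetical, and plain dictionaries stand in for the DataFrame used in the PR.

```python
from datetime import datetime


def subset_by_date(articles, model_start_date, run_date, date_col="date"):
    """Keep articles dated after model_start_date and up to run_date.

    run_date is supplied by the caller (for example, the scheduler's
    execution date) instead of being read from the system clock, so a
    rerun of the same scheduled interval selects the same rows.
    """
    return [a for a in articles
            if model_start_date < a[date_col] <= run_date]


# Hypothetical sample data for illustration only.
articles = [
    {"id": 1, "date": datetime(2019, 12, 31)},
    {"id": 2, "date": datetime(2020, 1, 15)},
    {"id": 3, "date": datetime(2020, 3, 1)},
]
kept = subset_by_date(articles,
                      model_start_date=datetime(2020, 1, 1),
                      run_date=datetime(2020, 2, 1))
print([a["id"] for a in kept])  # [2]
```

With a DataFrame, the same two bounds would go into the boolean mask shown in the diff, with `run_date` taking the place of `datetime.today()`.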
    @@ -15,7 +15,12 @@
    NEWSPIDER_MODULE = 'thereadingmachine.scraper.news_scraper.spiders'

    # Logging
    LOG_STDOUT = False
    LOG_LEVEL = 'ERROR'
Contributor
A bit harsh as a default; consider using environment variables instead and defaulting to something milder.
    # Only scrap data that is new
    SCRAPE_ONLY_NEW = True

    # Crawl responsibly by identifying yourself (and your website) on the
    # user-agent
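One way to read the reviewer's suggestion on `LOG_LEVEL` is to resolve the level from an environment variable and fall back to a milder default. The sketch below is an assumption, not the project's code: the variable name `SCRAPER_LOG_LEVEL`, the helper `resolve_log_level`, and the choice of `INFO` as the default are all hypothetical.

```python
import os

# Hypothetical variable name; the project may prefer a different one.
LOG_LEVEL_ENV_VAR = "SCRAPER_LOG_LEVEL"

# Level names accepted by Scrapy's LOG_LEVEL setting.
VALID_LEVELS = {"CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"}


def resolve_log_level(environ=os.environ, default="INFO"):
    """Return the level named in the environment, or a milder default.

    Unknown values fall back to the default rather than raising, so a
    typo in the environment does not stop the crawler from starting.
    """
    level = environ.get(LOG_LEVEL_ENV_VAR, default).upper()
    return level if level in VALID_LEVELS else default


LOG_STDOUT = False
LOG_LEVEL = resolve_log_level()
```

A deployment that wants the stricter behaviour could then export `SCRAPER_LOG_LEVEL=ERROR` without changing the settings file.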
    @@ -42,5 +47,5 @@
     ITEM_PIPELINES = {
         'thereadingmachine.scraper.news_scraper.pipelines.DuplicatesPipeline': 100,
         'thereadingmachine.scraper.news_scraper.pipelines.SanitizeArticlePipeline': 300,
    -    'thereadingmachine.scraper.news_scraper.pipelines.AmisJsonPipeline': 500
    +    'thereadingmachine.scraper.news_scraper.pipelines.AmisScrapePipeline': 500
     }
Not a good practice.
This is an old one. Dynamic dates are not recommended in the documentation, but there are no other solutions.
This is fairly standard practice as far as I am aware, and there is a reason why the function `days_ago` actually exists, unless you know of any other solution.
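Assuming this thread concerns a schedule start date computed with a `days_ago`-style helper, the following sketch shows why such a value is called dynamic: it changes with the moment at which it is evaluated, whereas a fixed date does not. `days_ago_like` is a hypothetical stand-in written for illustration, not the library function, and the dates are invented.

```python
from datetime import datetime, timedelta


def days_ago_like(n, now):
    """Stand-in for a days_ago-style helper: midnight n days before `now`.

    Hypothetical helper for illustration; `now` is passed in explicitly
    so the effect of the clock is visible.
    """
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight - timedelta(days=n)


# A dynamic start date moves with the time at which it is evaluated...
parsed_monday = days_ago_like(1, now=datetime(2020, 3, 2, 9, 30))
parsed_tuesday = days_ago_like(1, now=datetime(2020, 3, 3, 9, 30))

# ...whereas a static start date is identical on every evaluation.
STATIC_START_DATE = datetime(2020, 3, 1)

print(parsed_monday)   # 2020-03-01 00:00:00
print(parsed_tuesday)  # 2020-03-02 00:00:00
```

Whether a fixed date is practical here depends on how the pipeline is scheduled, which the visible part of the thread does not show.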