Code to scrape PPT slides for duplicate content
- Create a folder on your Desktop to contain all the downloaded files.
- For each PowerPoint presentation located in the Google Drive, download its content as a text file.
  a. Click File > Download > Plain Text (.txt).
- Repeat the previous step until all PowerPoint presentations are downloaded.
- Move all the downloaded text files into the folder created in Step 1.
- Right-click on the folder created in Step 1 and click on "Copy as path" to get the file path.
  a. Remove the quotation marks surrounding the file path.
  b. Example of a file path: C:\Users\XXXXX\files
- Run the "duplicate_search_tool.exe" file.
- Paste the file path into the topmost entry box.
- Click on "Step 2: Validate filepath".
- Type in the phrase that you would like to search for across all files.
- Click on "Step 4: Search".
- A text file named "duplicated_slides.txt" is created in the folder from Step 1. It contains the titles of the PowerPoint slides that match the specified input phrase.
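
For readers who want to see what the validate and search steps amount to, here is a minimal Python sketch. It is not the source of "duplicate_search_tool.exe", whose internals are not shown here. The function names `validate_folder` and `search_phrase` are hypothetical, and the sketch assumes that matching is case-insensitive and that a slide deck's "title" is the name of its text file without the .txt extension.

```python
from pathlib import Path


def validate_folder(raw_path: str) -> Path:
    """Strip surrounding quotation marks and confirm the path is an existing folder."""
    folder = Path(raw_path.strip().strip('"'))
    if not folder.is_dir():
        raise NotADirectoryError(f"Not a folder: {folder}")
    return folder


def search_phrase(folder: Path, phrase: str,
                  output_name: str = "duplicated_slides.txt") -> list[str]:
    """Write the titles of all .txt files containing `phrase` to `output_name`."""
    needle = phrase.casefold()  # assumption: case-insensitive matching
    matches = []
    for txt_file in sorted(folder.glob("*.txt")):
        if txt_file.name == output_name:
            continue  # do not search the results file from an earlier run
        # utf-8-sig reads files both with and without a byte-order mark
        text = txt_file.read_text(encoding="utf-8-sig", errors="ignore")
        if needle in text.casefold():
            # assumption: the file name (minus .txt) stands in for the title
            matches.append(txt_file.stem)
    (folder / output_name).write_text("\n".join(matches), encoding="utf-8")
    return matches
```

Calling `search_phrase(validate_folder(r'"C:\Users\XXXXX\files"'), "your phrase")` would mirror the steps above: the quotation marks are removed, the folder is checked, and the matching titles are written to "duplicated_slides.txt" in that folder.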