This was my first experience coding in Python.
Date: Summer 2018
This repository contains Python code used to parse and organize gene variants from supplemental data files associated with human genomics research articles.
Code is in `src`.
Example output is in `output.zip`; it contains a subset of the files usually produced, including extracted data and logs of the files processed.
Install the following packages on your computer if you don't already have them:
- pandas
- xlrd
- docx
- docx2csv
- XlsxWriter
- antiword
- pdftotext --> command-line tool (there may also be an alternative Python library)
```
pip install pandas
pip install xlrd
pip install docx
pip install docx2csv
pip install XlsxWriter
sudo apt-get install antiword unrtf poppler-utils libjpeg-dev
```

For pdftotext on macOS, see http://macappstore.org/pdftotext/, or use `pip install pdftotext` (Python package).
- Place the code and `genelist.txt` in the directory that contains the folder of supplemental data files.
- Run `suppdata_scraper.py`. You will be prompted in the terminal to input the name of the folder containing the files and your name.
- As the script runs, the following should happen:
  - File progress will be logged via different `.txt` files. Check `files_processed.txt` for overall progress.
  - The terminal will print statements indicating the filename and index currently being processed. Files are not processed in exact index order because of multiprocessing.
  - `.txt` files will be created in the `dataframes` folder for every individual file that contains data.
  - Files that may contain amino acids or nucleotides are copied to the `manual` folder.
  - `.txt` and `.xlsx` files will be created in the `workspace` folder for parsing purposes.
- Once `suppdata_scraper.py` is done running, run `dataframe.py`. When prompted, input the name of the folder containing the dataframes. Once this finishes, `masterlist.txt` will be in the `output` folder.
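The out-of-order progress printing comes from multiprocessing: workers pull files from a pool and finish at different speeds. A minimal sketch of that dispatch pattern (the per-file function here is a hypothetical placeholder, not the scraper's actual code):

```python
from multiprocessing import Pool

def process_file(indexed_name):
    # Placeholder for the real per-file parsing work.
    idx, name = indexed_name
    return idx, name.upper()

if __name__ == "__main__":
    files = ["table_s1.xlsx", "variants.csv", "supp.txt"]
    with Pool(processes=2) as pool:
        # imap_unordered yields results as workers finish,
        # so completion order need not match index order.
        for idx, result in pool.imap_unordered(process_file, enumerate(files)):
            print(idx, result)
```

Because results arrive in completion order, the logged indices can appear shuffled even though every file is eventually processed.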
- `suppdata_scraper.py` is the main scraper program used to parse files and extract gene variants.
- `dataframe.py` combines all the dataframes containing extracted data from different files into a single masterlist file. Use it when `suppdata_scraper.py` hits a roadblock and is not able to concatenate all the dataframes during its run.
- `big_manual.py` screens and prioritizes large files that contain amino acids and/or nucleotides and need to be manually extracted.
- `manual.py` screens files containing amino acids and/or nucleotides and counts the number of occurrences of each. These files will need to be manually extracted.
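The kind of screening `manual.py` performs can be sketched with regular expressions. The patterns below are illustrative assumptions, not the scripts' actual regexes, and will produce some false positives (e.g. "Pro" inside ordinary words):

```python
import re

# Illustrative patterns; the real scripts may use different ones.
NUCLEOTIDE_RE = re.compile(r"\b[ACGT]{6,}\b")  # runs of DNA bases
AMINO_ACID_RE = re.compile(
    r"(?:Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
    r"Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val)"
)

def count_sequences(text):
    """Count likely nucleotide runs and three-letter amino-acid codes."""
    return {
        "nucleotides": len(NUCLEOTIDE_RE.findall(text)),
        "amino_acids": len(AMINO_ACID_RE.findall(text)),
    }

print(count_sequences("p.Arg117His near ACGTACGTAA"))
```

Files whose counts exceed some threshold would then be flagged for manual extraction.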
Input: Directory of supplemental data files scraped from the web. These files can be any of the following types:
- doc/docx
- pdf
- txt
- xls/xlsx
- csv/tsv
Output:
`output` folder with the following:
- `masterlist.txt` --> Main output. All gene variants are stored with the files they came from. Also `masterlist.csv` and `masterlist.xlsx`, which contain the same info in different file types.
- The following `.txt` files that characterize the data:
  - `files_processed.txt`: filenames and index in list
  - `bad_files.txt`: files that produce an error
  - `good_files.txt`: files that contain gene variants
  - `manual.txt`: files that contain nucleotides or amino acids
  - `files_ignored.txt`: other file types, such as media files, that are not relevant
  - `variant_counts.txt`: counts of the total number of different gene variants for each file that contains data
  - `process_time.txt`: file size and the time it takes for the script to process each file

- `dataframes` folder with dataframe files containing data extracted from all files
- `manual` folder with copies of files that need to be manually extracted
- `big_files_manual` folder with copies of large files that need to be manually extracted
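A sketch of the kind of concatenation `dataframe.py` performs when building the masterlist from the per-file dataframe dumps. The directory layout and tab-separated format are assumptions; the real script's format may differ:

```python
import glob
import pandas as pd

def build_masterlist(dataframes_dir, output_path):
    """Combine per-file dataframe .txt dumps into one masterlist file."""
    frames = []
    for path in sorted(glob.glob(f"{dataframes_dir}/*.txt")):
        # Assumes tab-separated text dumps with a header row.
        frames.append(pd.read_csv(path, sep="\t"))
    master = pd.concat(frames, ignore_index=True)
    master.to_csv(output_path, sep="\t", index=False)
    return master
```

`pd.concat` with `ignore_index=True` stacks the per-file tables and renumbers the rows, which is why each dump must carry a column identifying its source file if that association is to survive the merge.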
Some more details:
- Scraper finds genes by comparing against genelist.txt and finds different variants using regular expressions.
- For pdf, doc, and txt files, the scraper goes through line by line and pulls out gene and variant matches.
- For xlsx and xls files, the scraper goes through every cell row by row and pulls out gene and variant matches.
- For docx files, the scraper extracts any tables it finds, converts them into xlsx files, and then follows the same procedure as for an xlsx file.
This methodology, while perhaps not the most efficient, proved to be quite accurate and ensured that, in most cases, associations between genes and variants on the same lines/rows were maintained during extraction.
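The line-by-line matching described above can be sketched as follows. The variant regex here is a simplified HGVS-style pattern for illustration, not necessarily the one the scraper uses:

```python
import re

# Simplified HGVS-style patterns; the scraper's actual regexes may differ.
VARIANT_RE = re.compile(
    r"\b(?:c\.\d+[ACGT]>[ACGT]"              # coding substitutions, e.g. c.76A>T
    r"|p\.[A-Z][a-z]{2}\d+[A-Z][a-z]{2})\b"  # protein substitutions, e.g. p.Arg117His
)

def scan_line(line, genes):
    """Return (gene, variant) pairs found on a single line."""
    found_genes = [g for g in genes if re.search(rf"\b{re.escape(g)}\b", line)]
    variants = VARIANT_RE.findall(line)
    # Pair every gene on the line with every variant on the same line,
    # preserving the same-line association described above.
    return [(g, v) for g in found_genes for v in variants]

print(scan_line("CFTR mutation c.76A>T", {"CFTR", "BRCA1"}))
```

Pairing genes and variants per line is what keeps a variant attached to the gene it was reported next to, rather than to some gene elsewhere in the file.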