GitHub - tgr2uk/search-log-analysis: Python scripts for the search log file analysis described in the book Text Mining and Visualization

.gitignore
LICENSE
README
extract_basic_vectors.py
extract_default_vectors.py
format_as_csv.awk
how_to_extract_query_frequencies_from_AOL.txt
how_to_extract_term_frequencies_from_AOL.txt
normalise_sessions.py
parse14.py
selectRandomVectors.py
session_stats.py


Overall process for clustering AOL sessions
===========================================

Remove queries that are just '-':
> awk '!/\t-\t/' fullcollection.1000000 > fullcollection.1000000.filtered
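For readers who prefer to stay in Python, the awk filter above can be sketched as follows. This is a minimal equivalent, not part of the repository: the function name drop_dash_queries and the sample lines are made up for illustration, and the tab-separated field layout of the sample is an assumption.

```python
def drop_dash_queries(lines):
    """Yield only the lines that do not contain a tab-delimited '-' field.

    Mirrors awk '!/\\t-\\t/': a line is dropped when the sequence
    TAB, '-', TAB appears anywhere in it.
    """
    for line in lines:
        if "\t-\t" not in line:
            yield line


# Made-up sample lines (hypothetical layout: id, query, timestamp).
sample = [
    "1\tweather london\t2006-03-01 10:00:00\n",
    "1\t-\t2006-03-01 10:01:00\n",
]
kept = list(drop_dash_queries(sample))
```

Because the function is a generator, it can be fed an open file object directly and will not load the whole collection into memory.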


Extract the feature vectors:
> python parseN.py -v fullcollection.1000000


Select a sample from output.log:
> python selectRandomVectors.py output.log -s 100000 > 100000.sample1
(repeat N times, once per sample)
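The internals of selectRandomVectors.py are not shown here, so the following is only a sketch of one common way to draw a fixed-size uniform sample from a large file in a single pass (reservoir sampling). The function name reservoir_sample and its seed parameter are hypothetical, and treating -s as the sample size is an assumption based on the command above.

```python
import random


def reservoir_sample(lines, k, seed=None):
    """Return k items chosen uniformly at random from an iterable.

    Uses reservoir sampling (Algorithm R), so the input is read once
    and only k items are held in memory at any time.
    """
    rng = random.Random(seed)
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(line)
        else:
            # Replace a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = line
    return sample


# Hypothetical usage: draw 10 of 1000 made-up vectors.
picked = reservoir_sample(range(1000), 10, seed=1)
```

Passing a different seed for each run is one way to obtain the N independent samples the process calls for.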

Edit the ARFFheader file to suit the data
(e.g. in TextEdit)
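As an illustration of what the ARFFheader file needs to contain, the sketch below generates a minimal ARFF header for purely numeric attributes. The helper build_arff_header, the relation name "sessions", and the attribute names are all hypothetical; the real header must list the features that the parser actually emits, in the same order.

```python
def build_arff_header(relation, numeric_attributes):
    """Build a minimal ARFF header declaring only NUMERIC attributes.

    The returned text ends with '@DATA' and a newline, so data rows
    can be appended directly after it.
    """
    lines = ["@RELATION " + relation, ""]
    for name in numeric_attributes:
        lines.append("@ATTRIBUTE " + name + " NUMERIC")
    lines.extend(["", "@DATA", ""])
    return "\n".join(lines)


# Hypothetical feature names, for illustration only.
header = build_arff_header("sessions", ["query_count", "mean_query_length"])
print(header)
```

Writing the returned string to a file gives a header that can be prepended to a sample with cat, provided each data row has one comma-separated value per declared attribute.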

Prepend the appropriate Weka header:
> cat ARFFheader 100000.sample1 > ARFF/100000.sample1.arff
(repeat for each of the N samples)

Apply feature scaling:
> python normalise_sessions.py ARFF/100000.sample1.arff > ARFF/100000.sample1.norm.arff
(repeat for each of the N samples)
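The scaling method used by normalise_sessions.py is not described here. As one possible reading of "feature scaling", the sketch below applies min-max scaling to each column of an ARFF file, mapping values to the range 0 to 1. It assumes every attribute is numeric, rows are comma-separated, and the file has no comment lines in the data section; the function name min_max_scale_arff is hypothetical.

```python
def min_max_scale_arff(lines):
    """Min-max scale the data rows of an all-numeric ARFF file.

    Header lines up to and including '@DATA' are passed through
    unchanged. A constant column is mapped to 0 to avoid dividing
    by zero.
    """
    lines = [line.rstrip("\n") for line in lines]
    data_start = next(
        i for i, line in enumerate(lines) if line.strip().lower() == "@data"
    ) + 1
    header = lines[:data_start]
    rows = [
        [float(value) for value in line.split(",")]
        for line in lines[data_start:]
        if line.strip()
    ]
    if not rows:
        return header
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    scaled = []
    for row in rows:
        scaled.append(",".join(
            format((v - lo) / (hi - lo) if hi > lo else 0.0, ".6g")
            for v, lo, hi in zip(row, lows, highs)
        ))
    return header + scaled


# Made-up two-attribute example.
example = [
    "@RELATION s", "@ATTRIBUTE a NUMERIC", "@ATTRIBUTE b NUMERIC",
    "@DATA", "1,10", "3,10", "2,10",
]
result = min_max_scale_arff(example)
```

Scaling matters for distance-based clustering because attributes with large numeric ranges would otherwise dominate the distance between sessions.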

Cluster using Weka.