55 commits
- 87265cb Update to lucene 5.2.1 (kostafey, Aug 25, 2015)
- e373e3a Remove java 6 support, update to lucene 5.3.0 (kostafey, Sep 22, 2015)
- 75406ac Implement positions searcher for dictionary words. (kostafey, Sep 25, 2015)
- 8e898e8 Update fork version and url. (kostafey, Sep 28, 2015)
- a99e999 Update README - clojars artifact name. (kostafey, Sep 28, 2015)
- ee0ba5b Add stemming-text fn. (kostafey, Sep 28, 2015)
- b74b057 Fix docstring. (kostafey, Sep 28, 2015)
- 2849bcc Fix meta-map for strings and sets. (kostafey, Oct 6, 2015)
- 18d15d4 Add matched text visualization. (kostafey, Oct 7, 2015)
- f05144f Use string as possible source for pos searcher visualization. (kostafey, Oct 7, 2015)
- 1a93d53 Fix test. (kostafey, Oct 7, 2015)
- c5d6dd6 Return text position from searcher instead of matched text. (kostafey, Oct 8, 2015)
- a11b74b Add streams usage possibility for index building. (kostafey, Oct 8, 2015)
- 8222635 Remove unnecessary stack, add document from file. (kostafey, Oct 15, 2015)
- 1eaaea7 Improve performance for small index and large dict. (kostafey, Oct 19, 2015)
- 828407d Add return-stemmed flag for dict-searcher. (kostafey, Oct 21, 2015)
- f4227e0 Add text visualization for structure with stemmed. (kostafey, Oct 21, 2015)
- 5cea68e Change *with-stemmed* to dynamic var. (kostafey, Oct 23, 2015)
- 2042655 Fix build. (kostafey, Oct 23, 2015)
- d9f5d26 Add custom analyzer construct. (kostafey, Nov 13, 2015)
- a3de142 Fix README. (kostafey, Nov 13, 2015)
- 3b2e613 Add words frequency iterator. (kostafey, Nov 24, 2015)
- 33de273 Add get-top-phrases fn. (kostafey, Nov 25, 2015)
- abd9b8d Add get-word-count fn. (kostafey, Nov 25, 2015)
- d05d630 Add words positions to top-words-iterator. (kostafey, Nov 26, 2015)
- 15af96f Add get-top-words fn. (kostafey, Nov 27, 2015)
- c0459f5 Add type hints for performance. (kostafey, Dec 1, 2015)
- 26f4980 Handle empty TermsEnum. (kostafey, Dec 1, 2015)
- 561cf26 Use RussianLightStemmer. (kostafey, Dec 4, 2015)
- 3e86b77 Fix build. (kostafey, Dec 4, 2015)
- 466fc6f Fix light and min stemmers analyzer configuration. (kostafey, Dec 4, 2015)
- ff137e3 Use unstemmed versions of words in dict search. (kostafey, Dec 16, 2015)
- 0364986 Add custom phrase separator regex. (kostafey, Dec 21, 2015)
- 53881c3 Add LengthFilter usage. (kostafey, Dec 24, 2015)
- 064c82c Add search phrase by words distance by default. (kostafey, Dec 28, 2015)
- 1342d63 Fix words-distance-in-phrase usage. (kostafey, Dec 28, 2015)
- 88738b1 Update to lucene 5.4.0 (kostafey, Dec 29, 2015)
- c1726a5 Released 0.5.4.0 (kostafey, Dec 29, 2015)
- cea231f Update README. (kostafey, Dec 29, 2015)
- 5798318 Use cloverage. (kostafey, Dec 30, 2015)
- 7253763 Fix coveralls usage. (kostafey, Dec 30, 2015)
- 945c627 Update README.md (kostafey, Dec 30, 2015)
- 3c7a420 Add with-index macro. (kostafey, Jan 21, 2016)
- 1af68f0 Update clojure to 1.8.0 (kostafey, Jan 21, 2016)
- 2a10b91 Update to lucene 5.4.1. Fix build index from file. (kostafey, Jan 25, 2016)
- abd950a Add file-index? fn. (kostafey, Jan 27, 2016)
- 8aaa3c0 Add support for explaining queries. (ieure, Aug 24, 2013)
- 63e801d Add field boosting. (ieure, Aug 24, 2013)
- 50e2ba4 Merge pull request #1 from ieure/boost_and_explain_new (kostafey, Feb 15, 2016)
- 6114753 Update version. (kostafey, Feb 15, 2016)
- a2fffd1 Nested maps and array values are indexed (kokosro, Feb 18, 2016)
- c30d8ec Moved functions to util. (kokosro, Feb 19, 2016)
- c7703d7 Merge pull request #3 from kokosro/master (kostafey, Feb 20, 2016)
- 2febc8f Update version. (kostafey, Feb 20, 2016)
- c34e542 Update lucene to 5.5.0 (kostafey, Mar 11, 2016)
1 change: 1 addition & 0 deletions .gitignore
@@ -5,3 +5,4 @@ clucy*.jar
pom.xml
pom.xml.asc
.lein-failures
.nrepl-port
7 changes: 5 additions & 2 deletions .travis.yml
@@ -3,6 +3,9 @@ language: clojure
lein: lein2

jdk:
- openjdk6
- openjdk7
- oraclejdk7
- oraclejdk7
- oraclejdk8

after_script:
- bash -ex test/coveralls.sh
23 changes: 0 additions & 23 deletions ChangeLog

This file was deleted.

63 changes: 61 additions & 2 deletions README.md
@@ -1,7 +1,11 @@
Clucy
=====

[![Build Status](https://secure.travis-ci.org/weavejester/clucy.png?branch=master)](http://travis-ci.org/weavejester/clucy)
[![License EPL](https://img.shields.io/badge/license-EPL-yellow.svg)](https://www.eclipse.org/legal/epl-v10.html)
[![Build Status](https://travis-ci.org/kostafey/clucy.svg?branch=master)](https://travis-ci.org/kostafey/clucy)
[![Clojars Project](https://img.shields.io/badge/clojars-clucy-blue.svg)](https://clojars.org/org.clojars.kostafey/clucy)
[![Coverage Status](https://coveralls.io/repos/kostafey/clucy/badge.svg?branch=master)](https://coveralls.io/github/kostafey/clucy?branch=master)
[![Dependencies Status](https://jarkeeper.com/kostafey/clucy/status.svg)](https://jarkeeper.com/kostafey/clucy)

Clucy is a Clojure interface to [Lucene](http://lucene.apache.org/).

@@ -11,11 +15,13 @@ Installation
To install Clucy, add the following dependency to your `project.clj`
file:

[clucy "0.4.0"]
[![Clojars Project](http://clojars.org/org.clojars.kostafey/clucy/latest-version.svg)](http://clojars.org/org.clojars.kostafey/clucy)

Usage
-----

#### Search in documents

To use Clucy, first require it:

(ns example
@@ -51,6 +57,59 @@ scientists...

(clucy/search-and-delete index "job:scientist")

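For context around the elided README hunk above, the core API it relies on can be sketched end-to-end. This is a minimal sketch based on clucy's documented usage; the sample documents (`Bob`, `Donald`) are illustrative:

```clojure
(require '[clucy.core :as clucy])

;; Build an in-memory Lucene index and add documents as plain maps.
(def index (clucy/memory-index))

(clucy/add index
           {:name "Bob" :job "Builder"}
           {:name "Donald" :job "Computer Scientist"})

;; search takes the index, a Lucene query string, and the maximum
;; number of hits to return; each hit comes back as a map.
(clucy/search index "bob" 10)
```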
#### Search text positions in single document

```clojure
(ns example
(:use [clucy.core
clucy.analyzers
clucy.positions-searcher]))

(binding [*analyzer* (make-analyzer :class :en)]
(let [test-text "This is the house that Jack built.
This is the malt
That lay in the house that Jack built."
index (doto (memory-index)
(add (set-field-params
test-text
{:positions-offsets true
:vector-positions true})))
searcher (make-dict-searcher
#{"house"
"lay"
"Jack built"})
result-iter (searcher index)]
(sort-by second
(show-text-matches result-iter test-text))))
```

=> (["house" 12] ["Jack built" 23] ["lay" 95] ["house" 106] ["Jack built" 117])

#### Statistics for single document

```clojure
(ns example
(:use clucy.core
clucy.analyzers
clucy.document-statistics))

(binding [*analyzer* (make-analyzer :class :en)]
(let [index (doto (memory-index)
(add (set-field-params
"This is the house that Jack built.
This is the malt
That lay in the house that Jack built."
{:positions-offsets true})))
iterator (get-top-words-iterator index 2)]
{:word-count (get-word-count index)
:most-frequent (iterator)}))

=> {:word-count 8,
:most-frequent (["built" {:count 2, :pos ([130 135] [28 33])}]
["hous" {:count 2, :pos ([114 119] [12 17])}]
["jack" {:count 2, :pos ([125 129] [23 27])}])}
```

Storing Fields
--------------

24 changes: 12 additions & 12 deletions project.clj
@@ -1,15 +1,15 @@
(defproject clucy "0.4.0"
(defproject org.clojars.kostafey/clucy "0.5.5.0"
:description "A Clojure interface to the Lucene search engine"
:url "http://github/weavejester/clucy"
:dependencies [[org.clojure/clojure "1.4.0"]
[org.apache.lucene/lucene-core "4.2.0"]
[org.apache.lucene/lucene-queryparser "4.2.0"]
[org.apache.lucene/lucene-analyzers-common "4.2.0"]
[org.apache.lucene/lucene-highlighter "4.2.0"]]
:url "http://github/kostafey/clucy"
:dependencies [[org.clojure/clojure "1.8.0"]
[org.apache.lucene/lucene-core "5.5.0"]
[org.apache.lucene/lucene-queryparser "5.5.0"]
[org.apache.lucene/lucene-analyzers-common "5.5.0"]
[org.apache.lucene/lucene-highlighter "5.5.0"]
[me.raynes/fs "1.4.6"]]
:license {:name "Eclipse Public License"
:url "http://www.eclipse.org/legal/epl-v10.html"}
:profiles {:1.4 {:dependencies [[org.clojure/clojure "1.4.0"]]}
:1.5 {:dependencies [[org.clojure/clojure "1.5.0"]]}
:1.6 {:dependencies [[org.clojure/clojure "1.6.0-master-SNAPSHOT"]]}}
:codox {:src-dir-uri "http://github/weavejester/clucy/blob/master"
:src-linenum-anchor-prefix "L"})
:profiles {:1.6 {:dependencies [[org.clojure/clojure "1.6.0"]]}
:1.7 {:dependencies [[org.clojure/clojure "1.7.0"]]}
:1.8 {:dependencies [[org.clojure/clojure "1.8.0"]]}}
:plugins [[lein-cloverage "1.0.6"]])
198 changes: 198 additions & 0 deletions src/clucy/analyzers.clj
@@ -0,0 +1,198 @@
(ns clucy.analyzers
(:use clucy.util)
(:require [clojure.java.io :as io])
(:import
(java.io InputStream)
(java.nio.charset StandardCharsets)
(org.apache.lucene.analysis.util WordlistLoader
CharArraySet)
(org.apache.lucene.util IOUtils)
(org.apache.lucene.analysis.Analyzer$TokenStreamComponents)
(org.apache.lucene.analysis Analyzer
TokenStream
Tokenizer
TokenFilter
CachingTokenFilter)
(org.apache.lucene.analysis.standard StandardAnalyzer
ClassicAnalyzer
StandardFilter
StandardTokenizer
ClassicTokenizer
ClassicFilter)
(org.apache.lucene.analysis.snowball SnowballFilter)
(org.apache.lucene.analysis.ar ArabicAnalyzer)
(org.apache.lucene.analysis.bg BulgarianAnalyzer)
(org.apache.lucene.analysis.de GermanAnalyzer
GermanLightStemFilter)
(org.apache.lucene.analysis.en EnglishAnalyzer
EnglishMinimalStemFilter)
(org.apache.lucene.analysis.fr FrenchAnalyzer
FrenchLightStemFilter)
(org.apache.lucene.analysis.ru RussianAnalyzer
RussianLightStemFilter)
(org.apache.lucene.analysis.core LowerCaseFilter
StopFilter
WhitespaceTokenizer
LetterTokenizer
KeywordTokenizer
LowerCaseTokenizer)
(org.apache.lucene.analysis.path PathHierarchyTokenizer)
(org.apache.lucene.analysis.wikipedia WikipediaTokenizer)
(org.apache.lucene.analysis.miscellaneous SetKeywordMarkerFilter
LengthFilter)
(org.tartarus.snowball.ext EnglishStemmer
FrenchStemmer
GermanStemmer
RussianStemmer)))

(def analysers-class-map
{:basic Analyzer
:standard StandardAnalyzer
:classic ClassicAnalyzer
:ar ArabicAnalyzer
:bg BulgarianAnalyzer
:fr FrenchAnalyzer
:de GermanAnalyzer
:en EnglishAnalyzer
:ru RussianAnalyzer})

(def tokenizers-class-map
{:standard StandardTokenizer
:whitespace WhitespaceTokenizer
:letter LetterTokenizer
:classic ClassicTokenizer
:keyword KeywordTokenizer
:lowercase LowerCaseTokenizer
:path-hierarchy PathHierarchyTokenizer
:wikipedia WikipediaTokenizer})

(def filters-class-map
{:standard StandardFilter
:snowball SnowballFilter
:classic ClassicFilter
:caching-token CachingTokenFilter})

(def stemmers-class-map
{:en EnglishStemmer
:fr FrenchStemmer
:de GermanStemmer
:ru RussianStemmer
:en-min EnglishMinimalStemFilter
:fr-light FrenchLightStemFilter
:de-light GermanLightStemFilter
:ru-light RussianLightStemFilter})

(defn- build-analyzer
([analyzer-class]
(.newInstance (analysers-class-map analyzer-class)))
([analyzer-class stop-words]
(let [ctor (.getConstructor (analysers-class-map analyzer-class)
(into-array [CharArraySet]))]
(.newInstance ctor (into-array [stop-words]))))
([analyzer-class stop-words stem-exclusion-words]
(let [ctor (.getConstructor (analysers-class-map analyzer-class)
(into-array [CharArraySet
CharArraySet]))]
(.newInstance ctor (into-array [stop-words
stem-exclusion-words])))))

(defn- get-analyzer [analyzer-class stop-words stem-exclusion-words]
(assert (not (and (some #{analyzer-class} [:standard :classic])
(not (nil? stem-exclusion-words))))
"Can't set stem-exclusion-words for Standard or Classic Analyzer.")
(cond
(and stop-words stem-exclusion-words) (build-analyzer
analyzer-class
stop-words
stem-exclusion-words)
(boolean stop-words) (build-analyzer analyzer-class stop-words)
:else (build-analyzer analyzer-class)))


(defn- get-tokenizer [key-or-object]
(if (instance? Tokenizer key-or-object)
key-or-object
(.newInstance (tokenizers-class-map key-or-object))))

(defn make-analyzer
([] (make-analyzer :class :standard))
([& {:keys [class
version
stop-words
stem-exclusion-words
tokenizer
filter
stemmer
lower-case
length-filter]
:or {class :basic
version org.apache.lucene.util.Version/LATEST
stop-words nil
stem-exclusion-words nil
tokenizer :standard
filter :standard
stemmer nil
lower-case true
length-filter nil}}]
(let [analyzer
(if (not (= :basic class))
;; ------------------------------------------------------------
;; Use pre-defined analyzer class.
;; All params except stop-words, stem-exclusion-words and version
;; are ignored.
(get-analyzer class stop-words stem-exclusion-words)
;; ------------------------------------------------------------
;; Custom analyser.
(proxy [Analyzer] []
(createComponents [fieldName]
(let [^Tokenizer source (get-tokenizer tokenizer)
^TokenStream result (.newInstance
(.getConstructor
(filters-class-map filter)
(into-array [TokenStream]))
(into-array [source]))
result (if lower-case (LowerCaseFilter. result) result)
result (if length-filter (LengthFilter.
result
(first length-filter)
(second length-filter)) result)
result (if stop-words (StopFilter. result stop-words) result)
result (if stem-exclusion-words
(SetKeywordMarkerFilter. result stem-exclusion-words)
result)
result (if stemmer
;; for light stemmers
(if (or (ends-with (name stemmer) "light")
(ends-with (name stemmer) "min"))
(.newInstance
(.getConstructor
(stemmers-class-map stemmer)
(into-array [TokenStream]))
(into-array [result]))
;; for snowball stemmers
(SnowballFilter.
result
(.newInstance (stemmers-class-map stemmer))))
result)]
(org.apache.lucene.analysis.Analyzer$TokenStreamComponents.
source result)))))]
(.setVersion analyzer version)
analyzer)))
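The keyword options above split `make-analyzer` into two modes: a pre-defined analyzer class (`:class` other than `:basic`, where only `:stop-words`, `:stem-exclusion-words`, and `:version` apply) and a custom pipeline assembled from tokenizer, filter, and stemmer keywords. A hypothetical usage sketch, derived only from the option map in this diff:

```clojure
;; Pre-defined analyzer class: remaining options are ignored.
(def en-analyzer (make-analyzer :class :en))

;; Custom pipeline: whitespace tokenizer -> standard filter ->
;; lower-case -> length filter (keep tokens of 2..20 chars) ->
;; English snowball stemmer.
(def custom-analyzer
  (make-analyzer :tokenizer :whitespace
                 :filter :standard
                 :length-filter [2 20]
                 :stemmer :en))
```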

(defn file->wordset ^CharArraySet [^String file-name]
(WordlistLoader/getSnowballWordSet
(IOUtils/getDecodingReader SnowballFilter
file-name
StandardCharsets/UTF_8)))

(defn resource->wordset ^CharArraySet [^String resource-file-name]
(WordlistLoader/getSnowballWordSet
(IOUtils/getDecodingReader
(io/input-stream
(io/resource resource-file-name))
StandardCharsets/UTF_8)))

(defn stream->wordset ^CharArraySet [^InputStream istream]
(WordlistLoader/getSnowballWordSet
(IOUtils/getDecodingReader istream
StandardCharsets/UTF_8)))
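The three `...->wordset` helpers above all decode a Snowball-format word list into a Lucene `CharArraySet`; they differ only in the source (file name, classpath resource, or open input stream). A hypothetical combination with `make-analyzer` (the resource path `stopwords_en.txt` is an assumed, illustrative name):

```clojure
;; Load stop words from a Snowball-format list on the classpath.
(def stop-words (resource->wordset "stopwords_en.txt"))

;; Use them in a custom analyzer together with the English
;; minimal stemmer (:en-min maps to EnglishMinimalStemFilter above).
(def analyzer (make-analyzer :stop-words stop-words
                             :stemmer :en-min))
```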