dependencies
| (this space intentionally left almost blank) | ||||||||||||
Clojure Finite State Transdurer (FST)This Clojure FST implementation is a wrapper above the Lucene FST package which is part of Lucene core. Finite state transducers are finite state machines with two tapes: an input and an output tape. The automaton map an input string to an output. The output can be another string or an integer. The FST produced by this application are implemented as a bytes array which makes FST really effective indexes in terms of speed and memory consumption. In fact, a 10 millions terms index will takes roughtly 256 MB of memory (depending of the string composition of the input strings, and if the outputs are integers or strings). LimitationsThe main limitation is that a FST index cannot be updated once it is created. This means that it cannot evolves over time. If you want to add or remove inputs/outputs, then you have to re-create the FST entirely. | |||||||||||||
(ns clj-fst.core (:use [clj-fst.utils]) (:refer-clojure :exclude [load]) (:import (org.apache.lucene.util.fst PositiveIntOutputs CharSequenceOutputs ListOfOutputs Builder FST Util) (org.apache.lucene.util BytesRef BytesRefBuilder) (org.apache.lucene.util IntsRef IntsRefBuilder) (org.apache.lucene.util CharsRef CharsRefBuilder))) | |||||||||||||
(declare int-outputs char-outputs chars-ref chars-ref-builder ints-ref ints-ref-builder bytes-ref bytes-ref-builder builder!) | |||||||||||||
Building and Populating a FSTThe creation of a new FST is a 3 steps process:
Builder CreationThe first step is to create the FST's builder by using the
Here is some code that shows you how to create the builder:
Optionally, you can tell the builder to
| |||||||||||||
Create a new FST builder map.
| (defn create-builder! [& {:keys [type pack] :or {type :char pack false}}] {:builder (case type :int (builder! type :pack pack) :char (builder! type :pack pack)) :type type}) | ||||||||||||
Populating the FSTPopulating the FST with However, there is quite an important thing to keep in mind: You have to sort the index you want to create by their input keys If you miss to perform this step, then you will end-up with unexpected results. Populating a FST is quite simple. Here is a code example that will populate a FST
using a
What this code shows you is how you can iteratively
populate the FST. However, if you already have a single
| |||||||||||||
Populate a FST with
| (defn add! [builder values] (let [scratch-bytes (bytes-ref) scratch-ints (ints-ref-builder)] (doseq [[word index] values] (case (builder :type) :int (do (.copyChars scratch-bytes word) (.add (builder :builder) (. Util toIntsRef scratch-bytes scratch-ints) index)) :char (.add (builder :builder) (. Util toUTF16 word scratch-ints) (new CharsRef index)))))) | ||||||||||||
FST CreationOnce the builder is create and populated with the inputs and outputs, then the final step is to create the actual FST. Once the FST is created, there is no way to add or remove any inputs/outputs. If this become necessary, the FST needs to be re-created. Creating the FST is as simple as calling
| |||||||||||||
Create a new FST based on a builder that has been populated with inputs/outputs
| (defn create-fst! [builder] (.finish (builder :builder))) | ||||||||||||
Querying the FSTNow that we have a fully operational FST, the end goal is to be able to use it.
What we want to do is to know if there exists an output for a given input, and if
there is one to return the output associated with the input. Finally, if there is
no output for a given input, we want to get a Querying the FST is as simple as:
This will return the output of the input string | |||||||||||||
Get the output for a given input.
| (defn get-output [input fst] (let [result (. Util get fst (. Util toUTF16 input (ints-ref-builder)))] (if-not (nil? result) (process-output result)))) | ||||||||||||
Loading and Saving FSTIt is possible to save FST on the file system. That way, it is possible to reload a FST from the file system when your application starts.
| |||||||||||||
Save a FST to a file on the file system
| (defn save! [file fst] (. fst save (clojure.java.io/file file))) | ||||||||||||
Load a FST to a file on the file system [file] is the file path on the file system [output-type] (optional) :int (default) when the output of the FST file are integers. :char when the output of the FST file are characters. Returns the loaded FST | (defn load! [file & {:keys [output-type] :or {output-type :char}}] (let [outputs (if (= output-type :int) (int-outputs) (char-outputs))] (. FST read (clojure.java.io/file file) outputs))) | ||||||||||||
Utility functionsThis section list a series of utilities functions used by the
core | |||||||||||||
Create a builder object. You can directly use this function instead of the
| (defn builder! [type & {:keys [min-suffix-count-1 min-suffix-count-2 share-suffix share-non-singleton-nodes share-max-tail-length pack-fst acceptable-overhead-ratio allow-array-arcs bytes-page-bits] :or {min-suffix-count-1 0 min-suffix-count-2 0 share-suffix true share-non-singleton-nodes true share-max-tail-length Integer/MAX_VALUE pack-fst false acceptable-overhead-ratio org.apache.lucene.util.packed.PackedInts/COMPACT allow-array-arcs true bytes-page-bits 15}}] (if (= type :int) (Builder. org.apache.lucene.util.fst.FST$INPUT_TYPE/BYTE1 min-suffix-count-1 min-suffix-count-2 share-suffix share-non-singleton-nodes share-max-tail-length (int-outputs) pack-fst acceptable-overhead-ratio allow-array-arcs bytes-page-bits) (Builder. org.apache.lucene.util.fst.FST$INPUT_TYPE/BYTE4 min-suffix-count-1 min-suffix-count-2 share-suffix share-non-singleton-nodes share-max-tail-length (char-outputs) pack-fst acceptable-overhead-ratio allow-array-arcs bytes-page-bits))) | ||||||||||||
Create a PositiveIntOutputs | (defn int-outputs [] (new ListOfOutputs (. PositiveIntOutputs getSingleton))) | ||||||||||||
Create a CharSequenceOutputs | (defn char-outputs [] (new ListOfOutputs (. CharSequenceOutputs getSingleton))) | ||||||||||||
Create a BytesRef | (defn bytes-ref [] (new BytesRef)) | ||||||||||||
Create a BytesRefBuilder | (defn bytes-ref-builder [] (new BytesRefBuilder)) | ||||||||||||
Create a IntsRef | (defn ints-ref [] (new IntsRef)) | ||||||||||||
Create a IntsRefBuilder | (defn ints-ref-builder [] (new IntsRefBuilder)) | ||||||||||||
Create a CharsRef | (defn chars-ref [] (new CharsRef)) | ||||||||||||
Create a CharsRefBuilder | (defn chars-ref-builder [] (new CharsRefBuilder)) | ||||||||||||
FST EnumerationsFST enumerations are used to search for the existance of inputs in the FST.
The first thing you have to do is to create an enumeration from a FST using
the | |||||||||||||
(ns clj-fst.enum (:use [clj-fst.core] [clj-fst.utils]) (:import (org.apache.lucene.util.fst BytesRefFSTEnum IntsRefFSTEnum Util) (org.apache.lucene.util BytesRef BytesRefBuilder IntsRef IntsRefBuilder) (org.apache.lucene.util.clj CljUtils))) | |||||||||||||
FST CreationBefore creating an enumeration, you have to have a FST. Once your FST is created, you can create its enumeration instance. Creating an enum is as simple as:
| |||||||||||||
Create an enumeration of all the tuples that compose the FST | (defn create-enum! [fst] (new IntsRefFSTEnum fst)) | ||||||||||||
Iterating & Searching the EnumerationThere are two things you can do with the Enumerations API:
| |||||||||||||
Iterating the content of a FSTTo iterate the content of the FST, you have to use the
will return the tuple in the form of a Clojure map:
then you can always use the
| |||||||||||||
Returns the term of the current input of the enumeration | (defn current! [enum] (let [input-output (.current enum) input (CljUtils/inputToString (.input input-output) true) output (process-output (.output input-output))] {:input input :output (if (vector? output) output [output])})) | ||||||||||||
Returns the term of the next input of the enumeration | (defn next! [enum] (let [input-output (.next enum) input (CljUtils/inputToString (.input input-output) true) output (process-output (.output input-output))] {:input input :output (if (vector? output) output [output])})) | ||||||||||||
Searching the content of a FSTThree functions will let you search the content of a FST: | |||||||||||||
Returns the smallest term that is greater or equal to the input term, nil otherwise. | (defn get-ceil-term! [input enum] (let [input-output (.seekCeil enum (. Util toUTF16 input (ints-ref-builder)))] (if input-output (let [input (CljUtils/inputToString (.input input-output) true) output (process-output (.output input-output))] {:input input :output output})))) | ||||||||||||
Returns the biggest term that is smaller or equal to the input term, nil otherwise. | (defn get-floor-term! [input enum] (let [input-output (.seekFloor enum (. Util toUTF16 input (ints-ref-builder)))] (if input-output (let [input (CljUtils/inputToString (.input input-output) true) output (process-output (.output input-output))] {:input input :output (if (vector? output) output [output])})))) | ||||||||||||
Returns the term if the exact input term exists, nil otherwise. | (defn get-exact-term! [input enum] (let [input-output (.seekExact enum (. Util toUTF16 input (ints-ref-builder)))] (if input-output (let [input (CljUtils/inputToString (.input input-output) true) output (process-output (.output input-output))] {:input input :output (if (vector? output) output [output])})))) | ||||||||||||