clj-fst

0.1.0


Finite State Transducers (FST) for Clojure

dependencies

org.clojure/clojure
1.6.0
org.apache.lucene/lucene-core
4.10.2
org.apache.lucene/lucene-misc
4.10.2
lein-marginalia
0.8.1-SNAPSHOT



(this space intentionally left almost blank)
 

Clojure Finite State Transdurer (FST)

This Clojure FST implementation is a wrapper above the Lucene FST package which is part of Lucene core.

Finite state transducers are finite state machines with two tapes: an input and an output tape. The automaton map an input string to an output. The output can be another string or an integer.

The FST produced by this application are implemented as a bytes array which makes FST really effective indexes in terms of speed and memory consumption. In fact, a 10 millions terms index will takes roughtly 256 MB of memory (depending of the string composition of the input strings, and if the outputs are integers or strings).

Limitations

The main limitation is that a FST index cannot be updated once it is created. This means that it cannot evolves over time. If you want to add or remove inputs/outputs, then you have to re-create the FST entirely.

(ns clj-fst.core
  (:use [clj-fst.utils])  
  (:refer-clojure :exclude [load])
  (:import (org.apache.lucene.util.fst PositiveIntOutputs CharSequenceOutputs ListOfOutputs Builder FST Util)
           (org.apache.lucene.util BytesRef BytesRefBuilder)
           (org.apache.lucene.util IntsRef IntsRefBuilder)
           (org.apache.lucene.util CharsRef CharsRefBuilder)))
(declare int-outputs char-outputs chars-ref chars-ref-builder ints-ref ints-ref-builder bytes-ref bytes-ref-builder builder!)

Building and Populating a FST

The creation of a new FST is a 3 steps process:

  1. Create a new builder using (create-builder!). The builder is used to populate the index with <input,output> tuples
  2. Populate the index using (add!)
  3. Create the FST using (create-fst!)

Builder Creation

The first step is to create the FST's builder by using the (create-builder!) function. There are two types of outputs currently supported by this wrapper:

  1. integers
  2. unicode strings

Here is some code that shows you how to create the builder:

;; Create a new FST builder with `:char` outputs
(def builder (create-builder! :type :char))

;; Create a new FST builder with `:int` outputs
(def builder (create-builder! :type :int))

Optionally, you can tell the builder to pack the index to make it even more compact/smaller by specifying the :pack true optional parameter. The default is false.

;; Create a new FST builder with `:int` outputs that is packed
(def builder (create-builder! :type :char :pack true))

Create a new FST builder map.

  • [type] (optional): Output type of the FST. Can be :int or :char (default)
  • [pack] (optional): Specify that you want to pack the index. Default is false
(defn create-builder!
  [& {:keys [type pack]
      :or {type :char
           pack false}}]
  {:builder (case type
                :int (builder! type :pack pack)
                :char (builder! type :pack pack))
   :type type})

Populating the FST

Populating the FST with <input,output> tuples is quite simple. The only thing you have to do once your builder is created, is to add the tuples iteratively by calling multiple types the (add!) function.

However, there is quite an important thing to keep in mind:

You have to sort the index you want to create by their input keys

If you miss to perform this step, then you will end-up with unexpected results.

Populating a FST is quite simple. Here is a code example that will populate a FST using a sorted-map:

;; The first thing to do is to create the Builder
(def builder (create-builder! :type :int))

;; This small `sorted-map` defines the things
;; to add to the FST
(def values (into (sorted-map) {"cat" 1
                                "dog" 2
                                "mice" 3}))

;; Populate the FST using that `sorted-map`
(doseq [[input output] values]
  (add! builder {input output}))

What this code shows you is how you can iteratively populate the FST. However, if you already have a single sorted-map that does have all the <input,output> tuples that you want in your FST, then you can simply use (add!) this way:

;; Populate the FST using that `sorted-map`
(add! builder values))

Populate a FST with <input,output> tuples. This function can be called iteratively multiple times before the (create-fst!) function is called to actually create the FST.

  • [builder]: builder where to populate the FST
  • [values]: map of the inputs->ouputs. The keys of the maps are the inputs, and their values are the outputs.

    Note: if (add!) is used iteratively, then you have to make sure that the structure it iterates over has been previously sorted by the input keys.

(defn add!
  [builder values]
  (let [scratch-bytes (bytes-ref)
        scratch-ints (ints-ref-builder)]
    (doseq [[word index] values]      
      (case (builder :type)
          :int (do
                 (.copyChars scratch-bytes word)
                 (.add (builder :builder) (. Util toIntsRef scratch-bytes scratch-ints) index))
          :char (.add (builder :builder) (. Util toUTF16 word scratch-ints) (new CharsRef index))))))

FST Creation

Once the builder is create and populated with the inputs and outputs, then the final step is to create the actual FST. Once the FST is created, there is no way to add or remove any inputs/outputs. If this become necessary, the FST needs to be re-created.

Creating the FST is as simple as calling (create-fst!):

;; Creating a new FST
(def fst (create-fst! builder))

Create a new FST based on a builder that has been populated with inputs/outputs

  • [builder]: builder option that has been created and populated
(defn create-fst!
  [builder]
  (.finish (builder :builder)))

Querying the FST

Now that we have a fully operational FST, the end goal is to be able to use it. What we want to do is to know if there exists an output for a given input, and if there is one to return the output associated with the input. Finally, if there is no output for a given input, we want to get a nil value.

Querying the FST is as simple as:

;; Query the FST
(get-output "cat" fst)

This will return the output of the input string "cat"

Get the output for a given input.

  • [input]: input for which we want its output
  • [fst]: FST object where to look for the output
(defn get-output
  [input fst]
  (let [result (. Util get fst (. Util toUTF16 input (ints-ref-builder)))]
    (if-not (nil? result)
      (process-output result))))

Loading and Saving FST

It is possible to save FST on the file system. That way, it is possible to reload a FST from the file system when your application starts.

;; Save a FST on the file system
(save! "resources/fst.srz" fst)

;; Load a FST from the file system
(load! "resources/fst.srz)

Save a FST to a file on the file system

  • [file] is the file path on the file system
  • [fst] is the FST instance
(defn save!
  [file fst]
  (. fst save (clojure.java.io/file file)))

Load a FST to a file on the file system

[file] is the file path on the file system [output-type] (optional) :int (default) when the output of the FST file are integers. :char when the output of the FST file are characters.

Returns the loaded FST

(defn load!
  [file & {:keys [output-type]
           :or {output-type :char}}]
  (let [outputs (if (= output-type :int)
                  (int-outputs)
                  (char-outputs))]
    (. FST read (clojure.java.io/file file) outputs)))

Utility functions

This section list a series of utilities functions used by the core clj-fst functions

Create a builder object.

You can directly use this function instead of the (create-builder!) function if you require really specific settings.

  • [type]: type of the output. Can be :int or :char
  • [min-suffix-count-1] (optional): If pruning the input graph during construction, this threshold is used for telling if a node is kept or pruned. If transition_count(node) >= minSuffixCount1, the node is kept.
  • [mind-suffix-count-2] (optional): (Note: only Mike McCandless knows what this one is really doing...)
  • [share-suffix] (optional): If true, the shared suffixes will be compacted into unique paths. This requires an additional RAM-intensive hash map for lookups in memory. Setting this parameter to false creates a single suffix path for all input sequences. This will result in a larger FST, but requires substantially less memory and CPU during building.
  • [share-non-singleton-nodes] (optional): Only used if share-suffix is true. Set this to true to ensure FST is fully minimal, at cost of more CPU and more RAM during building.
  • [share-max-tail-length] (optional): Only used if share-suffix is true. Set this to Integer.MAX_VALUE to ensure FST is fully minimal, at cost of more CPU and more RAM during building.
  • [pack-fst] (optional): Pass true to create a packed FST.
  • [acceptrable-overhead-ratio] (optional): How to trade speed for space when building the FST. This option is only relevant when doPackFST is true.
  • [allow-array-arcs] (optional): Pass false to disable the array arc optimization while building the FST; this will make the resulting FST smaller but slower to traverse.
  • [bytes-page-bits] (optional): How many bits wide to make each byte[] block in the BytesStore; if you know the FST will be large then make this larger. For example 15 bits = 32768 byte pages.
(defn builder!
  [type & {:keys [min-suffix-count-1
                  min-suffix-count-2
                  share-suffix
                  share-non-singleton-nodes
                  share-max-tail-length
                  pack-fst
                  acceptable-overhead-ratio
                  allow-array-arcs
                  bytes-page-bits]
           :or {min-suffix-count-1 0
                min-suffix-count-2 0
                share-suffix true
                share-non-singleton-nodes true
                share-max-tail-length Integer/MAX_VALUE
                pack-fst false
                acceptable-overhead-ratio org.apache.lucene.util.packed.PackedInts/COMPACT
                allow-array-arcs true
                bytes-page-bits 15}}]
  (if (= type :int)
    (Builder. org.apache.lucene.util.fst.FST$INPUT_TYPE/BYTE1
              min-suffix-count-1
              min-suffix-count-2
              share-suffix
              share-non-singleton-nodes
              share-max-tail-length
              (int-outputs)
              pack-fst
              acceptable-overhead-ratio
              allow-array-arcs
              bytes-page-bits)
    (Builder. org.apache.lucene.util.fst.FST$INPUT_TYPE/BYTE4
              min-suffix-count-1
              min-suffix-count-2
              share-suffix
              share-non-singleton-nodes
              share-max-tail-length
              (char-outputs)
              pack-fst
              acceptable-overhead-ratio
              allow-array-arcs
              bytes-page-bits)))

Create a PositiveIntOutputs

(defn int-outputs
  []
  (new ListOfOutputs (. PositiveIntOutputs getSingleton)))

Create a CharSequenceOutputs

(defn char-outputs
  []
  (new ListOfOutputs (. CharSequenceOutputs getSingleton)))

Create a BytesRef

(defn bytes-ref
  []
  (new BytesRef))

Create a BytesRefBuilder

(defn bytes-ref-builder
  []
  (new BytesRefBuilder))

Create a IntsRef

(defn ints-ref
  []
  (new IntsRef))

Create a IntsRefBuilder

(defn ints-ref-builder
  []
  (new IntsRefBuilder))

Create a CharsRef

(defn chars-ref
  []
  (new CharsRef))

Create a CharsRefBuilder

(defn chars-ref-builder
  []
  (new CharsRefBuilder))
 

FST Enumerations

FST enumerations are used to search for the existance of inputs in the FST. The first thing you have to do is to create an enumeration from a FST using the create-enum! function. Then you have a series of utility function that you can use to seek for a given input term.

(ns clj-fst.enum
  (:use [clj-fst.core]
        [clj-fst.utils])
  (:import (org.apache.lucene.util.fst BytesRefFSTEnum
                                       IntsRefFSTEnum
                                       Util)
           (org.apache.lucene.util BytesRef BytesRefBuilder
                                   IntsRef IntsRefBuilder)
           (org.apache.lucene.util.clj CljUtils)))

FST Creation

Before creating an enumeration, you have to have a FST. Once your FST is created, you can create its enumeration instance. Creating an enum is as simple as:

(def enum (create-enum! fst))

Create an enumeration of all the tuples that compose the FST

(defn create-enum!
  [fst]
  (new IntsRefFSTEnum fst))

Iterating & Searching the Enumeration

There are two things you can do with the Enumerations API:

  1. Iterating over the content of the FST
  2. Searching over the content of the FST

Iterating the content of a FST

To iterate the content of the FST, you have to use the (next!) function. What this function does is to return the next <input,output> tuple of the FST:

(next! enum)

will return the tuple in the form of a Clojure map:

{:input "some input", :output ["some output"]}

then you can always use the (current!) function to return the current tuple pointed by the internal enumeration pointer without moving it to the next tuple:

(current! enum)

Returns the term of the current input of the enumeration

(defn current!
  [enum]
  (let [input-output (.current enum)
        input (CljUtils/inputToString (.input input-output) true)
        output (process-output (.output input-output))]
    {:input input :output (if (vector? output) output [output])}))

Returns the term of the next input of the enumeration

(defn next!
  [enum]
  (let [input-output (.next enum)
        input (CljUtils/inputToString (.input input-output) true)
        output (process-output (.output input-output))]
    {:input input :output (if (vector? output) output [output])}))

Searching the content of a FST

Three functions will let you search the content of a FST: (get-ceil-term!), (get-floor-term!), (get-exact-term!). These functions will search the FST in different ways. They will move the internal pointer to whatever input they find. This means that if you use one of these functions, then if you use the (current!) function than the same result will be returned because the internal pointer got moved.

Returns the smallest term that is greater or equal to the input term, nil otherwise.

(defn get-ceil-term!
  [input enum]
  (let [input-output (.seekCeil enum (. Util toUTF16 input (ints-ref-builder)))]
    (if input-output
      (let [input (CljUtils/inputToString (.input input-output) true)
            output (process-output (.output input-output))]
        {:input input :output output}))))

Returns the biggest term that is smaller or equal to the input term, nil otherwise.

(defn get-floor-term!
  [input enum]
  (let [input-output (.seekFloor enum (. Util toUTF16 input (ints-ref-builder)))]
    (if input-output
      (let [input (CljUtils/inputToString (.input input-output) true)
            output (process-output (.output input-output))]
        {:input input :output (if (vector? output) output [output])}))))

Returns the term if the exact input term exists, nil otherwise.

(defn get-exact-term!
  [input enum]
  (let [input-output (.seekExact enum (. Util toUTF16 input (ints-ref-builder)))]
    (if input-output
      (let [input (CljUtils/inputToString (.input input-output) true)
            output (process-output (.output input-output))]
        {:input input :output (if (vector? output) output [output])}))))