clojure-opennlp

0.2.2-SNAPSHOT


Natural Language Processing with Clojure: a library for Apache OpenNLP.

dependencies

org.clojure/clojure
1.5.1
org.apache.opennlp/opennlp-tools
1.5.2-incubating



 

The main namespace for the clojure-opennlp project. Functions that perform NLP tasks (sentence detection, tokenizing, tagging, and so on) are created with the tools in this namespace.

(ns opennlp.nlp
  (:use [clojure.java.io :only [input-stream]])
  (:require [opennlp.span :as nspan])
  (:import
   (opennlp.tools.doccat DoccatModel
                         DocumentCategorizerME)
   (opennlp.tools.namefind NameFinderME TokenNameFinderModel)
   (opennlp.tools.postag POSModel POSTaggerME)
   (opennlp.tools.sentdetect SentenceDetectorME SentenceModel)
   (opennlp.tools.tokenize DetokenizationDictionary
                           DetokenizationDictionary$Operation
                           Detokenizer$DetokenizationOperation
                           DictionaryDetokenizer
                           TokenSample
                           TokenizerME
                           TokenizerModel)
   (opennlp.tools.util Span)))

OpenNLP beam-size property for pos-tagging. Meant to be rebound before calling the tagging creators (or around calls to the taggers they return; see the sketch below).

(def #^{:dynamic true} *beam-size* 3)

Cache size to use for pos-tagging.

(def #^{:dynamic true} *cache-size* 1024)
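
A minimal sketch of rebinding these vars (the model path and tokens are illustrative); since the pos-tagger reads them when it is invoked, the binding wraps the call:

(comment
  (def pos-tag (make-pos-tagger "models/en-pos-maxent.bin"))
  (binding [*beam-size* 5
            *cache-size* 512]
    (pos-tag ["Mary" "likes" "programming" "."])))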

Takes a collection of spans and the data they refer to. Returns a list of substrings corresponding to spans.

(defn- opennlp-span-strings
  [span-col data]
  (if (seq span-col)
    (seq (Span/spansToStrings (into-array span-col)
                              (if (string? data) data (into-array data))))
    []))

Take an OpenNLP span object and return a pair [i j] where i and j are the start and end positions of the span.

(defn- to-native-span
  [span]
  (nspan/make-span (.getStart span) (.getEnd span) (.getType span)))

Return a function for splitting sentences given a model file.

(defmulti make-sentence-detector
  class)
(defmethod make-sentence-detector :default
  [modelfile]
  (with-open [model-stream (input-stream modelfile)]
    (make-sentence-detector (SentenceModel. model-stream))))
(defmethod make-sentence-detector SentenceModel
  [model]
  (fn sentence-detector
    [text]
    {:pre [(string? text)]}
    (let [detector  (SentenceDetectorME. model)
          spans     (.sentPosDetect detector text)
          sentences (opennlp-span-strings spans text)
          probs     (seq (.getSentenceProbabilities detector))]
      (with-meta
        (into [] sentences)
        {:probabilities probs
         :spans         (map to-native-span spans)}))))
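
For example, a sketch of usage (the model path is illustrative):

(comment
  (def get-sentences (make-sentence-detector "models/en-sent.bin"))
  (get-sentences "First sentence. Then a second sentence.")
  ;; => ["First sentence." "Then a second sentence."]
  (meta (get-sentences "First sentence. Then a second sentence."))
  ;; => {:probabilities (...), :spans (...)}
  )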

Return a function for tokenizing a sentence based on a given model file.

(defmulti make-tokenizer
  class)
(defmethod make-tokenizer :default
  [modelfile]
  (with-open [model-stream (input-stream modelfile)]
    (make-tokenizer (TokenizerModel. model-stream))))
(defmethod make-tokenizer TokenizerModel
  [model]
  (fn tokenizer
    [sentence]
    {:pre [(string? sentence)]}
    (let [tokenizer (TokenizerME. model)
          spans     (.tokenizePos tokenizer sentence)
          probs     (seq (.getTokenProbabilities tokenizer))
          tokens    (opennlp-span-strings spans sentence)]
      (with-meta
        (into [] tokens)
        {:probabilities probs
         :spans         (map to-native-span spans)}))))
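
For example, a sketch of usage (the model path is illustrative):

(comment
  (def tokenize (make-tokenizer "models/en-token.bin"))
  (tokenize "Mr. Smith gave a car to his son on Friday.")
  ;; => ["Mr." "Smith" "gave" "a" "car" "to" "his" "son" "on" "Friday" "."]
  )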

Return a function for tagging tokens based on a given model file.

(defmulti make-pos-tagger
  class)
(defmethod make-pos-tagger :default
  [modelfile]
  (with-open [model-stream (input-stream modelfile)]
    (make-pos-tagger (POSModel. model-stream))))
(defmethod make-pos-tagger POSModel
  [model]
  (fn pos-tagger
    [tokens]
    {:pre [(coll? tokens)]}
    (let [token-array (into-array tokens)
          tagger (POSTaggerME. model *beam-size* *cache-size*)
          tags (.tag tagger token-array)
          probs (seq (.probs tagger))]
      (with-meta
        (map vector tokens tags)
        {:probabilities probs}))))
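
A sketch of usage (the model path is illustrative; the tags shown are indicative of a typical English maxent model):

(comment
  (def pos-tag (make-pos-tagger "models/en-pos-maxent.bin"))
  (pos-tag ["Mary" "likes" "programming" "."])
  ;; => (["Mary" "NNP"] ["likes" "VBZ"] ["programming" "NN"] ["." "."])
  )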

Return a function for finding names from tokens based on a given model file.

(defmulti make-name-finder
  (fn [model & args] (class model)))
(defmethod make-name-finder :default
  [modelfile & args]
  (with-open [model-stream (input-stream modelfile)]
    (make-name-finder (TokenNameFinderModel. model-stream))))
(defmethod make-name-finder TokenNameFinderModel
  [model & {:keys [feature-generator beam] :or {beam *beam-size*}}]
  (fn name-finder
    [tokens & contexts]
    {:pre [(seq tokens)
           (every? #(= (class %) String) tokens)]}
    (let [finder (NameFinderME. model feature-generator beam)
          matches (.find finder (into-array String tokens))
          probs (seq (.probs finder))]
      (with-meta
        (distinct (Span/spansToStrings matches (into-array String tokens)))
        {:probabilities probs
         :spans (map to-native-span matches)}))))
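
A sketch of usage (the model path is illustrative; results depend on the model):

(comment
  (def name-find (make-name-finder "models/namefind/en-ner-person.bin"))
  (name-find ["My" "name" "is" "Lee" "," "not" "John" "."])
  ;; => ("Lee" "John")
  )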

Return a function for taking tokens and recombining them into a sentence based on a given model file.

(defmulti make-detokenizer
  class)
(defmethod make-detokenizer :default
  [modelfile]
  (with-open [model-stream (input-stream modelfile)]
    (make-detokenizer (DetokenizationDictionary. model-stream))))

Given a sequence of DetokenizationOperations, produce a string.

TODO: clean this up, recursion is a smell.
TODO: remove debug printlns once I'm satisfied.

#_(defn- collapse-tokens
    [tokens detoken-ops]
    (let [sb (StringBuilder.)
          token-set (atom #{})]
      ;;(println :ops detoken-ops)
      (loop [ts tokens dt-ops detoken-ops]
        (let [op (first dt-ops)
              op2 (second dt-ops)]
          ;; (println :op op)
          ;; (println :op2 op)
          ;; (println :ts (first ts))
          ;; (println :sb (.toString sb))
          (cond
           (or (= op2 nil)
               (= op2 Detokenizer$DetokenizationOperation/MERGE_TO_LEFT))
           (.append sb (first ts))
           (or (= op nil)
               (= op Detokenizer$DetokenizationOperation/MERGE_TO_RIGHT))
           (.append sb (first ts))
           (= op DetokenizationDictionary$Operation/RIGHT_LEFT_MATCHING)
           (if (contains? @token-set (first ts))
             (do
               ;; (println :token-set @token-set)
               ;; (println :ts (first ts))
               (swap! token-set disj (first ts))
               (.append sb (first ts)))
             (do
               ;;(println :token-set @token-set)
               ;;(println :ts (first ts))
               (swap! token-set conj (first ts))
               (.append sb (str (first ts) " "))))
           :else
           (.append sb (str (first ts) " ")))
          (when (and op op2)
            (recur (next ts) (next dt-ops)))))
      (.toString sb)))
;; In the current documentation there is no RIGHT_LEFT_MATCHING and
;; I've never seen such an operation in practice.
;; http://opennlp.apache.org/documentation/apidocs/opennlp-tools/opennlp/tools/tokenize/Detokenizer.DetokenizationOperation.html
(defn- detokenize*
  [tokens ops]
  (loop [toks        (seq tokens)
         ops         (seq ops)
         result-toks []]
    (if toks
      (let [op    (first ops)
            rtoks (cond
                   (= op Detokenizer$DetokenizationOperation/MERGE_TO_LEFT)
                   (if (not-empty result-toks)
                     (conj (pop result-toks) (first toks) " ")
                     (conj result-toks (first toks) " "))
                   (= op Detokenizer$DetokenizationOperation/MERGE_TO_RIGHT)
                   (conj result-toks (first toks))
                   :else
                   (conj result-toks (first toks) " "))]
        (recur (next toks) (next ops) rtoks))
      (apply str (butlast result-toks)))))
#_(defmethod make-detokenizer DetokenizationDictionary
    [model]
    (fn detokenizer
      [tokens]
      {:pre [(coll? tokens)
             (every? #(= (class %) String) tokens)]}
      (let [detoken (DictionaryDetokenizer. model)
            ops     (.detokenize detoken (into-array String tokens))]
        (detokenize* tokens ops))))

(defmethod make-detokenizer DetokenizationDictionary
  [model]
  (fn detokenizer
    [tokens]
    {:pre [(coll? tokens)
           (every? #(= (class %) String) tokens)]}
    (-> (DictionaryDetokenizer. model)
        (TokenSample. (into-array String tokens))
        (.getText))))
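
A sketch of usage (the dictionary path is illustrative; output depends on the detokenization dictionary):

(comment
  (def detokenize (make-detokenizer "models/english-detokenizer.xml"))
  (detokenize ["Mr." "Smith" "gave" "a" "car" "."])
  ;; => "Mr. Smith gave a car."
  )
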
(defn parse-categories
  "Given a string that represents the OpenNLP outcomes and an array of
  probability outcomes, zip them into a map of category-probability pairs."
  [outcomes-string outcomes]
  (zipmap
   (map first (map rest (re-seq #"(\w+)\[.*?\]" outcomes-string)))
   outcomes))
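
For example, given a typical getAllResults string:

(comment
  (parse-categories "Technology[0.874] Sports[0.126]" [0.874 0.126])
  ;; => {"Technology" 0.874, "Sports" 0.126}
  )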

Return a function for determining a category given a model.

(defmulti make-document-categorizer
  class)
(defmethod make-document-categorizer :default
  [modelfile]
  (with-open [model-stream (input-stream modelfile)]
    (make-document-categorizer (DoccatModel. model-stream))))
(defmethod make-document-categorizer DoccatModel
  [model]
  (fn document-categorizer
    [text]
    {:pre [(string? text)]}
    (let [categorizer (DocumentCategorizerME. model)
          outcomes (.categorize categorizer text)]
      (with-meta
        {:best-category (.getBestCategory categorizer outcomes)}
        {:probabilities (parse-categories
                         (.getAllResults categorizer outcomes)
                         outcomes)}))))
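
A sketch of usage (the model path is illustrative; the category depends on how the model was trained):

(comment
  (def categorize (make-document-categorizer "models/en-doccat.bin"))
  (categorize "This is some text about sports and games.")
  ;; => {:best-category "Sports"}, with :probabilities in the metadata
  )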
 
(ns opennlp.sample
  (:require [clojure.java.io :as io])
  (:import (opennlp.tools.doccat DocumentSample)
           (opennlp.tools.util ObjectStream)))
(defn print-sample [sample ^java.io.Writer w]
  (.write w "#opennlp/sample {")
  (.write w ":category ")
  (binding [*out* w]
    (pr (.getCategory sample)))
  (.write w " :text ")
  (binding [*out* w]
    (pr (vec (.getText sample))))
  (.write w "}"))
(defmethod print-method DocumentSample
  [sample w]
  (print-sample sample w))
(defmethod print-dup DocumentSample
  [sample w]
  (print-sample sample w))
(defn read-document-sample [{:keys [category text]}]
  (DocumentSample. category (into-array String text)))
(defn clojure-document-sample-stream [in]
  (let [i (java.io.PushbackReader. (io/reader in))
        buf (atom [])
        pos (atom 0)]
    (reify
      ObjectStream
      (read [_]
        (if (= @pos (count @buf))
          (when-let [obj (read i false nil)]
            (swap! buf conj obj)
            (swap! pos inc)
            obj)
          (let [p @pos]
            (swap! pos inc)
            (nth @buf p))))
      (close [_]
        (.close i)
        (.close in))
      (reset [_]
        (reset! pos 0)))))
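
A sketch of the intended round trip (category and text are illustrative): a DocumentSample prints as a tagged literal whose map form read-document-sample can rebuild.

(comment
  (pr-str (DocumentSample. "Sports" (into-array String ["a" "great" "game"])))
  ;; => "#opennlp/sample {:category \"Sports\" :text [\"a\" \"great\" \"game\"]}"
  (read-document-sample {:category "Sports" :text ["a" "great" "game"]})
  ;; => an equivalent DocumentSample
  )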
 
(ns opennlp.span)
(defrecord Span [start end type])

Make a native span object.

(defn make-span
  [start end type]
  (Span. start end type))

Return true if location k is in span. We assume span is [i,j).

(defn in-span?
  [span k]
  (and (>= k (:start span)) (< k (:end span))))

Return true if span s1 contains span s2.

(defn contains-span?
  [s1 s2]
  (and (>= (:start s2) (:start s1))
       (<= (:start s2) (:end s1))
       (>= (:end s2) (:start s1))
       (<= (:end s2) (:end s1))))

Return true if location k is to the right of span.

(defn right-of-span?
  [span k]
  (>= k (:end span)))

Return true if location k is the end of span.

(defn end-of-span?
  [span k]
  (== (dec (:end span)) k))

Given two overlapping spans where the first comes before the second, return a merged span with the type of the first.

(defn merge-spans
  [A B]
  (assoc A :end (:end B)))

Return true if A does not overlap B.

(defn span-disjoint?
  [A B]
  (or (<= (:end A) (:start B)) (>= (:start A) (:end B))))

Return true if A overlaps B.

(defn span-overlaps?
  [A B]
  (not (span-disjoint? A B)))

Return the intersection of two spans as a span. Type of new span is :intersection.

(defn intersection-span
  [A B]
  {:pre [(not (span-disjoint? A B))]}
  (->Span (max (:start A) (:start B)) (min (:end A) (:end B)) :intersection))

Return the length of the span.

(defn span-length
  [s]
  (- (:end s) (:start s)))

Return the substring corresponding to the span.

(defn subs-span
  [s span]
  (subs s (:start span) (:end span)))

Shift a span by i positions.

(defn shift-span
  [span i]
  (->Span (+ (:start span) i) (+ (:end span) i) (:type span)))

Return a span of the area between two spans A and B. Type of new span is :between.

(defn between-span
  [a b]
  {:pre [(<= (:end a) (:start b))]}
  (->Span (:end a) (:start b) :between))
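
A few illustrative calls:

(comment
  (def a (make-span 0 5 :np))
  (def b (make-span 3 8 :vp))
  (in-span? a 4)              ;; => true
  (span-overlaps? a b)        ;; => true
  (merge-spans a b)           ;; => #opennlp.span.Span{:start 0, :end 8, :type :np}
  (subs-span "Hello world" a) ;; => "Hello"
  )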
 

Namespace used for filtering POS-tagged datastructures by grammatical classification. Also provides methods for building your own filters.

(ns opennlp.tools.filters)

Declare a filter for pos-tagged vectors with the given name and regex.

(defmacro pos-filter
  [n r]
  (let [docstring (str "Given a list of pos-tagged elements, "
                       "return only the " n " in a list.")]
    `(defn ~n
       ~docstring
       [elements#]
       (filter (fn [t#] (re-find ~r (second t#))) elements#))))

Declare a filter for treebank-chunked lists with the given name and regex.

(defmacro chunk-filter
  [n r]
  (let [docstring (str "Given a list of treebank-chunked elements, "
                       "return only the " n " in a list.")]
    `(defn ~n
       ~docstring
       [elements#]
       (filter (fn [t#] (re-find ~r (:tag t#))) elements#))))

It's easy to define your own filters!

(pos-filter nouns #"^NN")
(pos-filter nouns-and-verbs #"^(NN|VB)")
(pos-filter proper-nouns #"^NNP")
(pos-filter verbs #"^VB")
(chunk-filter verb-phrases #"^VP$")
(chunk-filter noun-phrases #"^NP$")
(chunk-filter adverb-phrases #"^ADVP$")
(chunk-filter adjective-phrases #"^ADJP$")
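
For instance, a sketch of defining and using a custom filter (the tagged data is illustrative):

(comment
  (pos-filter determiners #"^DT")
  (determiners [["The" "DT"] ["car" "NN"] ["a" "DT"]])
  ;; => (["The" "DT"] ["a" "DT"])
  )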
 

Tools for lazily separating, tokenizing and tagging sentences.

(ns opennlp.tools.lazy)

Given a sequence of texts and a sentence-finder, return a lazy sequence of sentences for each text.

TODO: collapse these 3 functions into a generic one

(defn lazy-get-sentences
  [text sentence-finder]
  (lazy-seq
   (when-let [s (seq text)]
     (cons (sentence-finder (first text))
           (lazy-get-sentences (rest s) sentence-finder)))))

Given a sequence of sentences, and a tokenizer, return a lazy sequence of tokenized sentences.

(defn lazy-tokenize
  [sentences tokenizer]
  (lazy-seq
   (when-let [s (seq sentences)]
     (cons (tokenizer (first s)) (lazy-tokenize (rest s) tokenizer)))))

Given a sequence of sentences, a tokenizer and a pos-tagger, return a lazy sequence of pos-tagged sentences.

(defn lazy-tag
  [sentences tokenizer pos-tagger]
  (lazy-seq
   (when-let [s (seq sentences)]
     (cons (pos-tagger (tokenizer (first s)))
           (lazy-tag (rest s) tokenizer pos-tagger)))))

Given a sequence of sentences, a tokenizer, a pos-tagger and a chunker, return a lazy sequence of treebank-chunked sentences.

(defn lazy-chunk
  [sentences tokenizer pos-tagger chunker]
  (lazy-seq
   (when-let [s (seq sentences)]
     (cons (chunker (pos-tagger (tokenizer (first s))))
           (lazy-chunk (rest s) tokenizer pos-tagger chunker)))))
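
A sketch of wiring these together (model paths are illustrative; the make-* creators live in opennlp.nlp):

(comment
  (def tokenize (make-tokenizer "models/en-token.bin"))
  (def pos-tag (make-pos-tagger "models/en-pos-maxent.bin"))
  (lazy-tag ["This is a test." "And another sentence."] tokenize pos-tag)
  ;; => a lazy seq of pos-tagged sentences
  )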

Lazily read sentences from rdr, returning a lazy sequence of sentence strings produced by the given sentence-finder. rdr must be a java.io.BufferedReader.

(defn sentence-seq
  [^java.io.BufferedReader rdr sentence-finder]
  (.mark rdr 0)
  (let [sb (StringBuilder.)]
    (loop [c (.read rdr)]
      (if-not (= -1 c)
        (do (.append sb (char c))
            (let [sents (sentence-finder (.toString sb))]
              (if (> (count sents) 1)
                (do (.reset rdr)
                    (cons (first sents)
                          (lazy-seq (sentence-seq rdr sentence-finder))))
                (do (.mark rdr 0)
                    (recur (.read rdr))))))
        [(.trim (.toString sb))]))))
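
A sketch of usage (the file and model paths are illustrative):

(comment
  (with-open [rdr (clojure.java.io/reader "/tmp/bigfile.txt")]
    (let [get-sentences (make-sentence-detector "models/en-sent.bin")]
      (doall (take 5 (sentence-seq rdr get-sentences))))))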
 

This namespace contains tools used to train OpenNLP models

(ns opennlp.tools.train
  (:use [clojure.java.io :only [output-stream reader]])
  (:import (opennlp.tools.util PlainTextByLineStream TrainingParameters)
           (opennlp.tools.util.model BaseModel ModelType)
           (opennlp.tools.dictionary Dictionary)
           (opennlp.tools.tokenize TokenizerME
                                   TokenizerModel
                                   TokenSampleStream)
           (opennlp.tools.sentdetect SentenceDetectorME
                                     SentenceModel
                                     SentenceSampleStream)
           (opennlp.tools.namefind NameFinderEventStream
                                   NameSampleDataStream
                                   NameFinderME
                                   TokenNameFinderModel)
           (opennlp.tools.chunker ChunkerME ChunkSampleStream ChunkerModel)
           (opennlp.tools.parser ParseSampleStream ParserModel)
           (opennlp.tools.parser.lang.en HeadRules)
           (opennlp.tools.parser.chunking Parser)
           (opennlp.tools.postag POSTaggerME
                                 POSModel
                                 POSDictionary
                                 WordTagSampleStream
                                 POSContextGenerator)
           (opennlp.tools.doccat DoccatModel
                                 DocumentCategorizerME
                                 DocumentSampleStream)))

Write a model to disk

(defn write-model
  [#^BaseModel model out-stream]
  (with-open [out (output-stream out-stream)]
    (.serialize model out)))

Build a Dictionary based on a file in the appropriate format (one entry per line).

(defn build-dictionary
  [in]
  (with-open [rdr (reader in)]
    (Dictionary/parseOneEntryPerLine rdr)))

Build a POSDictionary based on a file in the appropriate format.

A POSDictionary records which part-of-speech tags a word may be assigned

(defn build-posdictionary
  [in]
  (with-open [rdr (reader in)]
    (POSDictionary/create rdr)))

Returns a treebank chunker based on a given training file.

(defn ^ChunkerModel train-treebank-chunker
  ([in] (train-treebank-chunker "en" in))
  ([lang in] (train-treebank-chunker lang in 100 5))
  ([lang in iter cut]
     (with-open [rdr (reader in)]
       (ChunkerME/train
        lang
        (ChunkSampleStream.
         (PlainTextByLineStream. rdr))
        cut iter))))
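
A sketch of training and saving a chunker model (paths are illustrative; the training data follows the word/POS/chunk-per-line, CoNLL-2000-style format that ChunkSampleStream expects):

(comment
  (def chunker-model (train-treebank-chunker "training/en-chunker.train"))
  (write-model chunker-model "models/my-chunker.bin"))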

Returns a treebank parser based on a training file and a set of head rules.

(defn ^ParserModel train-treebank-parser
  ([in headrules] (train-treebank-parser "en" in headrules))
  ([lang in headrules] (train-treebank-parser lang in headrules 100 5))
  ([lang in headrules iter cut]
     (with-open [rdr (reader headrules)
                 fis (java.io.FileInputStream. in)]
       (Parser/train
        lang
        (ParseSampleStream.
         (PlainTextByLineStream.
          (.getChannel fis) "UTF-8"))
        (HeadRules. rdr) iter cut))))

Returns a trained name finder based on a given training file. Uses a non-deprecated train() method that allows for perceptron training with minimal modification. Optional keyword arguments include the entity type (e.g. "person"), a custom feature generator, and a classifier knob for switching to perceptron training (maxent is the default). Prefer a cutoff of 0 for perceptron and 5 for maxent.

(defn ^TokenNameFinderModel train-name-finder
  ([in] (train-name-finder "en" in))
  ([lang in] (train-name-finder lang in 100 5))
  ([lang in iter cut & {:keys [entity-type feature-gen classifier]
                        ;;MUST be either "MAXENT" or "PERCEPTRON"
                        :or  {entity-type "default" classifier "MAXENT"}}]
     (with-open [rdr (reader in)]
       (NameFinderME/train
        lang
        entity-type
        (->> rdr
             (PlainTextByLineStream.)
             (NameSampleDataStream.))
        (doto (TrainingParameters.)
          (.put TrainingParameters/ALGORITHM_PARAM classifier)
          (.put TrainingParameters/ITERATIONS_PARAM (Integer/toString iter))
          (.put TrainingParameters/CUTOFF_PARAM     (Integer/toString cut)))
        feature-gen {}))))
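
A sketch of usage (paths are illustrative); training lines mark entities with <START> ... <END> (optionally <START:type>):

(comment
  ;; e.g. a training line: "The mayor <START:person> John Smith <END> spoke ."
  (def nf-model (train-name-finder "en" "training/en-ner-person.train" 100 0
                                   :classifier "PERCEPTRON"))
  (write-model nf-model "models/my-name-finder.bin"))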

Returns a tokenizer based on a given training file.

(defn ^TokenizerModel train-tokenizer
  ([in] (train-tokenizer "en" in))
  ([lang in] (train-tokenizer lang in 100 5))
  ([lang in iter cut]
     (with-open [rdr (reader in)]
       (TokenizerME/train
        lang
        (->> rdr
             (PlainTextByLineStream.)
             (TokenSampleStream.))
        false
        cut
        iter))))

Returns a pos-tagger based on a given training file.

(defn ^POSModel train-pos-tagger
  ([in] (train-pos-tagger "en" in))
  ([lang in] (train-pos-tagger lang in nil))
  ([lang in tagdict] (train-pos-tagger lang in tagdict 100 5))
  ([lang in tagdict iter cut]
     (with-open [rdr (reader in)]
       (POSTaggerME/train
        lang
        (WordTagSampleStream. rdr)
        (ModelType/MAXENT)
        tagdict
        nil
        cut
        iter))))
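
A sketch of usage (the path is illustrative); WordTagSampleStream expects whitespace-separated word_TAG pairs:

(comment
  ;; e.g. a training line: "The_DT dog_NN ran_VBD ._."
  (def pos-model (train-pos-tagger "training/en-pos.train"))
  (write-model pos-model "models/my-pos-tagger.bin"))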

Returns a sentence model based on a given training file

(defn ^SentenceModel train-sentence-detector
  ([in] (train-sentence-detector "en" in))
  ([lang in]
     (with-open [rdr (reader in)]
       (SentenceDetectorME/train lang
                                 (->> rdr
                                      (PlainTextByLineStream.)
                                      (SentenceSampleStream.))
                                 true
                                 nil))))

Returns a classification model based on a given training file

(defn ^DoccatModel train-document-categorization
  ([in] (train-document-categorization "en" in 1 100))
  ([lang in] (train-document-categorization lang in 1 100))
  ([lang in cutoff iterations]
     (with-open [rdr (reader in)]
       (DocumentCategorizerME/train lang
                                    (->> rdr
                                         (PlainTextByLineStream.)
                                         (DocumentSampleStream.))
                                    cutoff iterations))))
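
A sketch of usage (the path is illustrative); DocumentSampleStream expects one document per line, category first:

(comment
  ;; e.g. a training line: "Sports The game went into overtime ."
  (def doccat-model (train-document-categorization "training/doccat.train"))
  ;; the model can be passed straight to opennlp.nlp/make-document-categorizer
  (def categorize (make-document-categorizer doccat-model)))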
 

Namespace containing tools pertaining to the treebank NLP tools. This includes treebank chunking, parsing and linking (coref).

(ns #^{:doc "Namespace containing tools pertaining to the treebank NLP
             tools. This includes treebank chunking, parsing and linking
             (coref)."
       :author "Lee Hinman"}
  opennlp.treebank
  (:use [opennlp.nlp :only [*beam-size*]]
        [clojure.java.io :only [input-stream]])
  (:import (opennlp.tools.chunker ChunkerModel ChunkerME)
           (opennlp.tools.cmdline.parser ParserTool)
           (opennlp.tools.parser Parse ParserModel
                                 ParserFactory AbstractBottomUpParser)
           (opennlp.tools.parser.chunking Parser)
           (opennlp.tools.coref.mention Mention DefaultParse)
           (opennlp.tools.coref LinkerMode DefaultLinker)))

Default advance percentage as defined by AbstractBottomUpParser.defaultAdvancePercentage

(def #^{:dynamic true} *advance-percentage* 0.95)

Partition a sequence of treebank chunks by their phrases.

(defn- split-chunks
  [chunks]
  (let [seqnum    (atom 0)
        splitfunc (fn
                    [item]
                    (if (.startsWith item "B-")
                      (swap! seqnum inc)
                      @seqnum))]
    (partition-by splitfunc (pop chunks))))

Given a chunk ('B-NP' 'I-NP' 'I-NP' ...), return a vector of the chunk type and item count. So, for ('B-NP' 'I-NP' 'I-NP') it would return ['B-NP' 3].

(defn- size-chunk
  [tb-chunk]
  (let [chunk-type  (second (re-find #"B-(.*)" (first tb-chunk)))
        chunk-count (count tb-chunk)]
    [chunk-type chunk-count]))

Thanks chouser

(defn- split-with-size
  [[v & vs] s]
  (if-not v
    (list s)
    (cons (take v s) (split-with-size vs (drop v s)))))

De-interleave a sequence, returning a vector of the two resulting sequences.

(defn- de-interleave
  [s]
  [(map first s) (map last s)])
(defstruct treebank-phrase :phrase :tag)

Return a function for chunking phrases from pos-tagged tokens based on a given model file.

(defmulti make-treebank-chunker
  class)
(defmethod make-treebank-chunker :default
  [modelfile]
  (with-open [modelstream (input-stream modelfile)]
    (make-treebank-chunker (ChunkerModel. modelstream))))
(defmethod make-treebank-chunker ChunkerModel
  [model]
  (fn treebank-chunker
    [pos-tagged-tokens]
    (let [chunker (ChunkerME. model *beam-size*)
          [tokens tags] (de-interleave pos-tagged-tokens)
          chunks  (into [] (seq (.chunk chunker tokens tags)))
          sized-chunks (map size-chunk (split-chunks chunks))
          [types sizes] (de-interleave sized-chunks)
          token-chunks (split-with-size sizes tokens)
          probs (seq (.probs chunker))]
      (with-meta
        (map #(struct treebank-phrase (into [] (last %)) (first %))
             (partition 2 (interleave types token-chunks)))
        {:probabilities probs}))))

Given the chunks from a treebank-chunker, return just a list of phrase word-lists.

(defn phrases
  [phrases]
  (map :phrase phrases))

Given the chunks from a treebank-chunker, return a list of phrase strings.

(defn phrase-strings
  [phrase-chunks]
  (map #(apply str (interpose " " %)) (phrases phrase-chunks)))

Docs for treebank chunking:

(def chunker (make-treebank-chunker "models/EnglishChunk.bin.gz"))
(pprint (chunker (pos-tag (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed."))))

(map size-chunk (split-chunks (chunker (pos-tag (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed.")))))

opennlp.nlp=> (split-with-size (sizes (map size-chunk (split-chunks (chunker (pos-tag (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed.")))))) (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed."))
(("The" "override" "system") ("is" "meant" "to" "deactivate") ("the" "accelerator") ("when") ("the" "brake" "pedal") ("is" "pressed") ("."))

(["NP" 3] ["VP" 4] ["NP" 2] ["ADVP" 1] ["NP" 3] ["VP" 2])

opennlp.nlp=> (pprint (chunker (pos-tag (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed."))))
#<ArrayList [B-NP, I-NP, I-NP, B-VP, I-VP, I-VP, I-VP, B-NP, I-NP, B-ADVP, B-NP, I-NP, I-NP, B-VP, I-VP, O]>

So, B-* starts a sequence and I-* continues it; a new phrase starts whenever B-* is encountered.

Treebank parsing

Treebank-parser does not like parens and braces, so replace them.

(defn- strip-parens
  [s]
  (-> s
      (.replaceAll "\\(" "-LRB-")
      (.replaceAll "\\)" "-RRB-")
      (.replaceAll "\\{" "-LCB-")
      (.replaceAll "\\}" "-RCB-")))

Given a line and Parser object, return a list of Parses.

(defn- parse-line
  [line parser]
  (let [line (strip-parens line)
        results (StringBuffer.)
        parse-num 1]
    (.show (first (ParserTool/parseLine line parser parse-num)) results)
    (.toString results)))

Return a function for treebank parsing a sequence of sentences, based on a given model file.

(defmulti make-treebank-parser
  class)
(defmethod make-treebank-parser :default
  [modelfile]
  (with-open [modelstream (input-stream modelfile)]
    (make-treebank-parser (ParserModel. modelstream))))
(defmethod make-treebank-parser ParserModel
  [model]
  (fn treebank-parser
    [text]
    (let [parser (ParserFactory/create model
                                       *beam-size*
                                       *advance-percentage*)
          parses (map #(parse-line % parser) text)]
      (vec parses))))
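
A sketch of usage (the model path is illustrative; the output string is approximate):

(comment
  (def treebank-parser
    (make-treebank-parser "parser-model/en-parser-chunking.bin"))
  (treebank-parser ["This is a sentence ."])
  ;; => ["(TOP (S (NP (DT This)) (VP (VBZ is) (NP (DT a) (NN sentence))) (. .)))"]
  )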

Strip out some characters that might cause trouble parsing the tree.

(defn- strip-funny-chars
  [s]
  (-> s
      (.replaceAll "'" "-SQUOTE-")
      (.replaceAll "\"" "-DQUOTE-")
      (.replaceAll "~" "-TILDE-")
      (.replaceAll "`" "-BACKTICK-")
      (.replaceAll "," "-COMMA-")
      (.replaceAll "\\\\" "-BSLASH-")
      (.replaceAll "\\/" "-FSLASH-")
      (.replaceAll "\\^" "-CARROT-")
      (.replaceAll "@" "-ATSIGN-")
      (.replaceAll "#" "-HASH-")
      (.replaceAll ";" "-SEMICOLON-")
      (.replaceAll ":" "-COLON-")))

Undo the character substitutions applied by strip-funny-chars.

(defn- unstrip-funny-chars
  [s]
  (-> s
      (.replaceAll "-SQUOTE-" "'")
      (.replaceAll "-DQUOTE-" "\"")
      (.replaceAll "-TILDE-" "~")
      (.replaceAll "-BACKTICK-" "`")
      (.replaceAll "-COMMA-" ",")
      (.replaceAll "-BSLASH-" "\\\\")
      (.replaceAll "-FSLASH-" "\\/")
      (.replaceAll "-CARROT-" "\\^")
      (.replaceAll "-ATSIGN-" "@")
      (.replaceAll "-HASH-" "#")
      (.replaceAll "-SEMICOLON-" ";")
      (.replaceAll "-COLON-" ":")))

Generate a tree from the string output of a treebank-parser.

Credit for this function goes to carkh in #clojure

(defn- tr
  [to-parse]
  (if (seq? to-parse)
    {:tag (first to-parse) :chunk (map tr (rest to-parse))}
    (str to-parse)))

Make a tree from the string output of a treebank-parser.

(defn make-tree
  [tree-text]
  (let [text (strip-funny-chars tree-text)]
    (tr (read-string text))))
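
For example, on an (illustrative) parser output string:

(comment
  (make-tree "(TOP (S (NP (NN Eats)) (VP (VBZ shoots)) (. .)))")
  ;; => {:tag TOP :chunk ({:tag S :chunk ({:tag NP ...} {:tag VP ...} ...)})}
  )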

------------------------------------------------------------------------

Treebank Linking WIP, do not use yet.

(declare print-parse)

Given a child, its parent, a start-position atom, and the parse-map, print out the child parse.

(defn print-child
  [c p start parse-map]
  (let [s (.getSpan c)]
    (when (< @start (.getStart s))
      (print (.substring (.getText p) @start (.getStart s))))
    (print-parse c parse-map)
    (reset! start (.getEnd s))))

Given a parse and the EntityMentions-to-parse map, print out the parse.

This is broken, don't use this.

(defn print-parse
  [p parse-map]
  (let [start (atom (.getStart (.getSpan p)))
        children (.getChildren p)]
    (when-not (= Parser/TOK_NODE (.getType p))
      (print (str "(" (.getType p)))
      (when (contains? parse-map p)
        (print (str "#" (get parse-map p))))
      (print " "))
    ;; force the side-effecting prints for each child
    (dorun (map #(print-child % p start parse-map) children))
    ;; FIXME: don't use substring
    (print (.substring (.getText p) @start (.getEnd (.getSpan p))))
    (when-not (= Parser/TOK_NODE (.getType p))
      (print ")"))))

Add a single mention to the parse-map with index.

(defn add-mention!
  [mention index parse-map]
  (let [mention-parse (.getParse (.getParse mention))]
    (swap! parse-map assoc mention-parse (+ index 1))))

Add mentions to the parse map.

(defn add-mentions!
  [entity index parse-map]
  (dorun (map #(add-mention! % index parse-map)
              (iterator-seq (.getMentions entity)))))

Given a list of entities, return a map of parses to entities.

(defn add-entities
  [entities]
  (let [parse-map (atom {})
        i-entities (map vector (iterate inc 0) entities)]
    (dorun (map (fn [[index entity]] (add-mentions! entity index parse-map))
                i-entities))
    @parse-map))

Given a list of parses and entities, print them out.

This is intended to actually be called.

(defn show-parses
  [parse entities]
  (let [parse-map (add-entities entities)]
    (println "parse-map:" parse-map)
    (println "parse:" parse)
    (print-parse parse parse-map)
    parse-map))
(defn coref-extent
  [extent p index]
  (when (nil? (.getParse extent))
    ;; Synthesize an NML parse node covering the mention's span and
    ;; attach it, mirroring OpenNLP's coreference examples.
    (let [snp (Parse. (.getText p) (.getSpan extent) "NML" 1.0 0)]
      (.insert p snp) ; FIXME
      (.setParse extent (DefaultParse. snp index)))))
(defn coref-sentence
  [sentence parses index tblinker]
  (let [p (Parse/parseParse sentence)
        extents (.getMentions (.getMentionFinder tblinker)
                              (DefaultParse. p index))]
    (swap! parses #(assoc % (count %) p))
    ;; force the side-effecting attachment of parses to each extent
    (dorun (map #(coref-extent % p index) extents))
    (map bean extents)))

Given a coref extent, a treebank linker, a parses atom and the index of the extent, return a tuple of the corresponding parse and discourse entities.

TODO: fix this function, currently doesn't parse correctly

(defn parse-extent
  [extent tblinker parses pindex]
  (println :ext (bean extent))
  (let [e (filter #(not (nil? (:parse (bean %)))) extent)
        ;;_ (println :e e)
        mention-array (into-array Mention e)
        entities (.getEntities tblinker mention-array)]
    (println :entities (seq entities) (bean (first entities)))
    [(get @parses pindex) (seq entities)]))

Make a TreebankLinker, given a model directory.

Second Attempt

(defn make-treebank-linker
  [modeldir]
  (let [tblinker (DefaultLinker. modeldir LinkerMode/TEST)]
    (fn treebank-linker
      [sentences]
      (let [parses (atom [])
            indexed-sentences (map vector (iterate inc 0) sentences)
            extents (doall (map #(coref-sentence (second %) parses
                                                 (first %) tblinker)
                                indexed-sentences))
            i-extents (map vector (iterate inc 0) extents)]
        #_(map #(parse-extent %1 tblinker parses %2) i-extents)
        (doall (map println extents))
        extents))))

Set the location of the WordNet 'dict' directory.

This is used for treebank linking; it sets a system property pointing at the WordNet installation's 'dict' directory. See: http://wordnet.princeton.edu/wordnet/

(defn set-wordnet-location!
  [location]
  (System/setProperty "WNSEARCHDIR" location))

What I really need is a good way to express this in Clojure's datastructures.

(comment
  (def tbl (make-treebank-linker "coref"))
  (def treebank-parser
    (make-treebank-parser "parser-model/en-parser-chunking.bin"))
  (def s (treebank-parser ["Mary said she liked me ."]))
  (tbl s))