wordlist2dawg - convert a wordlist to a DAWG for Tesseract
Arguments
WORDLIST A plain text file in UTF-8, one word per line.
DAWG The output DAWG to write.
lang.unicharset The unicharset of the language. This is the unicharset generated by mftraining(1).
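The WORDLIST format described above (UTF-8, one word per line) can be produced with a short script. The following is a minimal sketch, not part of Tesseract: the helper name `write_wordlist` is hypothetical, and trimming whitespace and dropping duplicate entries are assumptions made here for tidiness, not requirements stated by this page.

```python
def write_wordlist(words, path):
    """Write `words` to `path` as UTF-8 text, one word per line.

    Hypothetical helper. Trimming and de-duplication are assumptions
    of this sketch, not documented requirements of wordlist2dawg.
    """
    # Strip surrounding whitespace, skip blank entries, and drop
    # duplicates while preserving first-seen order.
    cleaned = list(dict.fromkeys(w.strip() for w in words if w.strip()))
    # newline="\n" keeps line endings identical on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for word in cleaned:
            handle.write(word + "\n")
    return cleaned
```

The resulting file can then be passed as the WORDLIST argument.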
Copying
Copyright (C) 2006 Google, Inc. Licensed under the Apache License, Version 2.0
Description
wordlist2dawg(1) converts a wordlist to a Directed Acyclic Word Graph (DAWG) for use with Tesseract. A
DAWG is a compressed, space- and time-efficient representation of a word list.
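To illustrate why a DAWG is more compact than a plain word list or trie, the sketch below builds a trie and then merges identical subtrees so that shared suffixes are stored once. It shows the general idea only: the node layout and the function names (`build_trie`, `minimize`, `count_nodes`, `contains`) are assumptions of this example and do not reflect Tesseract's internal DAWG format or construction algorithm.

```python
def build_trie(words):
    """Build a plain trie; each node is {"end": bool, "next": {char: node}}."""
    root = {"end": False, "next": {}}
    for word in words:
        node = root
        for ch in word:
            node = node["next"].setdefault(ch, {"end": False, "next": {}})
        node["end"] = True
    return root


def minimize(node, registry):
    """Merge identical subtrees bottom-up and return the canonical node."""
    for ch, child in list(node["next"].items()):
        node["next"][ch] = minimize(child, registry)
    # Two nodes are equivalent when they agree on end-of-word and point
    # to the same canonical children under the same characters.
    key = (node["end"],
           tuple(sorted((ch, id(c)) for ch, c in node["next"].items())))
    return registry.setdefault(key, node)


def count_nodes(node, seen=None):
    """Count distinct nodes reachable from `node`."""
    seen = set() if seen is None else seen
    if id(node) in seen:
        return 0
    seen.add(id(node))
    return 1 + sum(count_nodes(c, seen) for c in node["next"].values())


def contains(node, word):
    """Look up `word` by following one edge per character."""
    for ch in word:
        node = node["next"].get(ch)
        if node is None:
            return False
    return node["end"]


words = ["tap", "taps", "top", "tops"]
trie = build_trie(words)
trie_size = count_nodes(trie)   # 8 nodes before merging
dawg = minimize(trie, {})
dawg_size = count_nodes(dawg)   # 5 nodes after merging shared suffixes
```

Lookups still take one step per character, which is the sense in which the structure is time-efficient as well as space-efficient.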
Name
wordlist2dawg - convert a wordlist to a DAWG for Tesseract
Options
-t Verify that a given dawg file is equivalent to a given wordlist.
-r 1 Reverse a word if it contains an RTL character.
-r 2 Reverse all words.
-l <short> <long> Produce a file containing several DAWGs, one for each word length from <short> to
<long> (<short>, <short+1>, ..., <long>).
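The effect of the -r and -l options on the input words can be sketched as follows. This is an illustration under stated assumptions, not Tesseract's code: the helper names are hypothetical, an "RTL character" is taken to be one whose Unicode bidirectional class is R or AL, reversal is done per code point, word length is counted in code points, and words outside the <short>..<long> range are simply skipped. The real tool may differ on any of these points.

```python
import unicodedata


def has_rtl_char(word):
    """Assumption: RTL means Unicode bidirectional class R or AL."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in word)


def apply_reverse_policy(word, policy=None):
    """Mimic -r 1 / -r 2; policy=None stands for no -r option."""
    if policy == 2 or (policy == 1 and has_rtl_char(word)):
        # Code-point reversal; combining marks are not handled specially.
        return word[::-1]
    return word


def bucket_by_length(words, short, long):
    """Group words by length, one bucket per length in short..long."""
    buckets = {n: [] for n in range(short, long + 1)}
    for word in words:
        # Assumption: words outside the range are ignored.
        if len(word) in buckets:
            buckets[len(word)].append(word)
    return buckets
```

For example, `apply_reverse_policy("abc", 1)` leaves the word unchanged because it contains no RTL character, while `apply_reverse_policy("abc", 2)` returns `"cba"`.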
See Also
tesseract(1), combine_tessdata(1), dawg2wordlist(1)
https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html
Synopsis
wordlist2dawg WORDLIST DAWG lang.unicharset
wordlist2dawg -t WORDLIST DAWG lang.unicharset
wordlist2dawg -r 1 WORDLIST DAWG lang.unicharset
wordlist2dawg -r 2 WORDLIST DAWG lang.unicharset
wordlist2dawg -l <short> <long> WORDLIST DAWG lang.unicharset
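When the tool is driven from a script, the invocation forms in the synopsis can be assembled as an argument list. The sketch below is a hypothetical helper (`build_command` and its parameter names are not part of Tesseract). Because the synopsis lists each option on its own, the helper accepts at most one of them; that restriction is an assumption of this example.

```python
def build_command(wordlist, dawg, unicharset, *,
                  verify=False, reverse=None, lengths=None):
    """Return an argv list matching one of the synopsis forms.

    verify=True       -> -t
    reverse=1 or 2    -> -r 1 / -r 2
    lengths=(lo, hi)  -> -l <short> <long>
    """
    chosen = [verify, reverse is not None, lengths is not None]
    if sum(chosen) > 1:
        raise ValueError("this sketch allows at most one option")
    args = ["wordlist2dawg"]
    if verify:
        args.append("-t")
    elif reverse is not None:
        if reverse not in (1, 2):
            raise ValueError("reverse must be 1 or 2")
        args += ["-r", str(reverse)]
    elif lengths is not None:
        short, long = lengths
        args += ["-l", str(short), str(long)]
    # Positional arguments always come last, in synopsis order.
    return args + [wordlist, dawg, unicharset]
```

The returned list can be passed to `subprocess.run` on a system where wordlist2dawg is installed; the file names used with it are up to the caller.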
