
wordlist2dawg - convert a wordlist to a DAWG for Tesseract

Arguments

       WORDLIST A plain text file in UTF-8, one word per line.

       DAWG The output DAWG to write.

       lang.unicharset The unicharset of the language. This is the unicharset generated by mftraining(1).
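       The WORDLIST requirement above (plain UTF-8 text, one word per line) can be met with a small
       cleanup step. The following is a minimal sketch, not part of the tool itself: the file names
       raw.txt and eng.wordlist are hypothetical, and sorting is used here only as a convenient way
       to drop duplicates, not because wordlist2dawg is known to require sorted input.

```shell
#!/bin/sh
# Hypothetical file names for illustration only; substitute your own.
workdir=$(mktemp -d)
raw="$workdir/raw.txt"
wordlist="$workdir/eng.wordlist"

# Sample input containing a Windows line ending, a blank line, and a duplicate.
printf 'apple\r\n\nbanana\napple\ncherry\n' > "$raw"

# Strip carriage returns, drop blank lines, and remove duplicate words,
# leaving exactly one word per line.
tr -d '\r' < "$raw" | grep -v '^[[:space:]]*$' | LC_ALL=C sort -u > "$wordlist"

cat "$wordlist"
```

       The resulting file can then be passed as the WORDLIST argument.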

Author

       The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985-1995)
       and Google (2006-2018).

                                                   01/19/2025                                   WORDLIST2DAWG(1)

Copying

       Copyright (C) 2006 Google, Inc. Licensed under the Apache License, Version 2.0

Description

       wordlist2dawg(1) converts a wordlist to a Directed Acyclic Word Graph (DAWG) for use with Tesseract. A
       DAWG is a compressed, space- and time-efficient representation of a word list.

Name

       wordlist2dawg - convert a wordlist to a DAWG for Tesseract

Options

       -t Verify that a given dawg file is equivalent to a given wordlist.

       -r 1 Reverse a word if it contains an RTL character.

       -r 2 Reverse all words.

       -l <short> <long> Produce a file with several DAWGs in it, one each for words of length <short>,
       <short+1>, ..., <long>.
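       As an illustration of how the options combine with the three positional arguments, the
       sketch below builds a DAWG, verifies it with -t, and shows the -r 1 form. The file names
       (eng.wordlist, eng.word-dawg, eng.unicharset) and the run helper are assumptions made for
       this example. The helper prints each command and executes it only when wordlist2dawg and
       the input files are actually present.

```shell
#!/bin/sh
# Hypothetical file names; replace with the files for your language.
WORDLIST=eng.wordlist
DAWG=eng.word-dawg
UNICHARSET=eng.unicharset

# Illustrative helper: echo the command, then run it only if the tool
# is installed and the input files exist.
run() {
    echo "+ $*"
    if command -v "$1" >/dev/null 2>&1 && [ -f "$WORDLIST" ] && [ -f "$UNICHARSET" ]; then
        "$@"
    fi
}

# Build the DAWG from the wordlist.
run wordlist2dawg "$WORDLIST" "$DAWG" "$UNICHARSET"

# Verify that the DAWG just written is equivalent to the wordlist.
run wordlist2dawg -t "$WORDLIST" "$DAWG" "$UNICHARSET"

# Reverse only those words that contain an RTL character.
run wordlist2dawg -r 1 "$WORDLIST" "$DAWG" "$UNICHARSET"
```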

See Also

       tesseract(1), combine_tessdata(1), dawg2wordlist(1)

       https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html

Synopsis

       wordlist2dawg WORDLIST DAWG lang.unicharset

       wordlist2dawg -t WORDLIST DAWG lang.unicharset

       wordlist2dawg -r 1 WORDLIST DAWG lang.unicharset

       wordlist2dawg -r 2 WORDLIST DAWG lang.unicharset

       wordlist2dawg -l <short> <long> WORDLIST DAWG lang.unicharset
