.\" Hey, Emacs! This is -*-nroff-*- you know...
.\"
.\" gendict.1: manual page for the gendict utility
.\"
.\" Copyright (C) 2016 and later: Unicode, Inc. and others.
.\" License & terms of use: http://www.unicode.org/copyright.html
.\" Copyright (C) 2012 International Business Machines Corporation and others
.\"
.TH GENDICT 1 "1 June 2012" "ICU MANPAGE" "ICU @VERSION@ Manual"
.SH NAME
.B gendict
\- Compiles a word list into an ICU string trie dictionary
.SH SYNOPSIS
.B gendict
[
.BR "\fB\-\-uchars"
|
.BR "\fB\-\-bytes"
.BI "\fB\-\-transform" " transform"
]
[
.BR "\-h\fP, \fB\-?\fP, \fB\-\-help"
]
[
.BR "\-V\fP, \fB\-\-version"
]
[
.BR "\-c\fP, \fB\-\-copyright"
]
[
.BR "\-v\fP, \fB\-\-verbose"
]
[
.BI "\-i\fP, \fB\-\-icudatadir" " directory"
]
.IR " input\-file"
.IR " output\-file"
.SH DESCRIPTION
.B gendict
reads the word list from
.I input\-file
and creates a string trie dictionary file. Normally this data file has the
.B .dict
extension.
.PP
Words begin at the beginning of a line and are terminated by the first whitespace.
Lines that begin with whitespace are ignored.
.SH OPTIONS
.TP
.BR "\-h\fP, \fB\-?\fP, \fB\-\-help"
Print help about usage and exit.
.TP
.BR "\-V\fP, \fB\-\-version"
Print the version of
.B gendict
and exit.
.TP
.BR "\-c\fP, \fB\-\-copyright"
Embed the standard ICU copyright into the
.IR output\-file .
.TP
.BR "\-v\fP, \fB\-\-verbose"
Display extra informative messages during execution.
.TP
.BI "\-i\fP, \fB\-\-icudatadir" " directory"
Look for any necessary ICU data files in
.IR directory .
For example, the file
.B pnames.icu
must be located when ICU's data is not built as a shared library.
The default ICU data directory is specified by the environment variable
.BR ICU_DATA .
Most configurations of ICU do not require this argument.
.TP
.BR "\fB\-\-uchars"
Set the output trie type to UChar. Mutually exclusive with
.BR \-\-bytes .
.TP
.BR "\fB\-\-bytes"
Set the output trie type to Bytes. Mutually exclusive with
.BR \-\-uchars .
.TP
.BI "\fB\-\-transform" " transform"
Set the transform type. Should only be specified with
.BR \-\-bytes .
Currently supported transforms are:
.BR offset\-<hex\-number> ,
which specifies an offset to subtract from all input characters.
The offset transform also maps U+200D to 0xFF and U+200C to 0xFE,
for compatibility with languages that require these characters.
A transform must be specified for a bytes trie, and when applied to the
non-value characters in the
.I input\-file
it must produce output between 0x00 and 0xFF.
.TP
.BI " input\-file"
The source file to read.
.TP
.BI " output\-file"
The file to write the output dictionary to.
.SH CAVEATS
The
.I input\-file
is assumed to be encoded in UTF-8.
The integers in the
.I input\-file
that are used as values must be made up of ASCII digits. They
may be specified either in hex, by using a 0x prefix, or in decimal.
Either
.B \-\-bytes
or
.B \-\-uchars
must be specified.
.SH ENVIRONMENT
.TP 10
.B ICU_DATA
Specifies the directory containing ICU data. Defaults to
.BR @thepkgicudatadir@/@PACKAGE@/@VERSION@/ .
Some tools in ICU depend on the presence of the trailing slash. It is thus
important to make sure that it is present if
.B ICU_DATA
is set.
.SH AUTHORS
Maxime Serrano
.SH VERSION
1.0
.SH COPYRIGHT
Copyright (C) 2012 International Business Machines Corporation and others
.SH SEE ALSO
.BR http://www.icu-project.org/userguide/boundaryAnalysis.html
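.SH EXAMPLES
The following Python sketch illustrates the
.B offset
transform as it is described under
.BR \-\-transform .
It is not taken from the
.B gendict
sources: the helper name
.B offset_transform
and the sample offset 0x0E00 are assumptions made for illustration only.
.PP
.nf
```python
ZWJ = 0x200D   # ZERO WIDTH JOINER, mapped to 0xFF
ZWNJ = 0x200C  # ZERO WIDTH NON-JOINER, mapped to 0xFE


def offset_transform(word, offset):
    """Hypothetical helper: map each character of word to one byte.

    Follows the description of the offset transform: subtract the
    offset from every code point, with fixed bytes for ZWJ and ZWNJ.
    """
    out = bytearray()
    for ch in word:
        cp = ord(ch)
        if cp == ZWJ:
            out.append(0xFF)
        elif cp == ZWNJ:
            out.append(0xFE)
        else:
            value = cp - offset
            # The result must fit in a single byte (0x00 to 0xFF).
            if not 0x00 <= value <= 0xFF:
                raise ValueError(
                    "U+%04X is out of range for offset 0x%X" % (cp, offset)
                )
            out.append(value)
    return bytes(out)


if __name__ == "__main__":
    # Two Thai letters, U+0E01 and U+0E32, with an assumed offset of 0x0E00.
    sample = chr(0x0E01) + chr(0x0E32)
    print(offset_transform(sample, 0x0E00).hex())
```
.fi
.PP
With an offset of 0x0E00, the code points U+0E01 and U+0E32 map to the
bytes 0x01 and 0x32, while a character such as U+0061 falls outside the
range 0x00 to 0xFF and is rejected.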