looks quite awesome.
Text below is just cut and pasted from the Maven mailing list.
================================================================
PROBLEM:
At a customer site there is a custom, company-wide dictionary available for
spellchecking. This dictionary is managed in a proprietary application from
which it can be exported. For the webapp we're building we need to transform
this dictionary into a very simple format: a single file with one
dictionary entry per line. The export format is somewhat special, as
it's spread over a bunch of files (one for each letter of the
alphabet), contains additional syllabication info which we don't need,
and also has some comments that have to be removed. The specifics of
the format aren't really that important here though...
After some testing I came up with the following short bash script that fulfills
all my needs:
8<-----------------------------------------------------------
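# Folder for intermediate files and folder that ends up in the jar
tmp_folder=target/dict
cls_folder=target/classes
mkdir -p "$tmp_folder"
mkdir -p "$cls_folder"
# Concatenate all exported files into one file
cat src/main/dictionary/*.lst > "$tmp_folder/tmp1.dict"
# Remove all ~ and ? characters
sed 's/[~?]//g' "$tmp_folder/tmp1.dict" > "$tmp_folder/tmp2.dict"
# Cut everything from the first space to the end of the line
sed 's/ .*$//' "$tmp_folder/tmp2.dict" > "$tmp_folder/tmp3.dict"
# Sort the entries and drop duplicates
sort -u -o "$cls_folder/my.dict" "$tmp_folder/tmp3.dict"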
8<-----------------------------------------------------------
(In other words: Take all files src/main/dictionary/*.lst, concat them into one
single file, match some strings with simple regexes and remove those, and
finally sort the dictionary entries and remove all duplicates.)
This script is then called from within Maven with exec-maven-plugin. Afterwards
maven-jar-plugin wraps the file in a simple jar, so the dictionary can
be easily consumed in Java using
getClassLoader().getResourceAsStream().
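As a rough illustration of the consuming side, the following sketch reads the packaged file from the classpath into a set of words. The class name DictionaryLoader, the method names, and the UTF-8 encoding are assumptions made for this example; only the file name my.dict and the getResourceAsStream() call come from the description above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of the original project.
class DictionaryLoader {

    // The build script writes target/classes/my.dict, so the file sits at the jar root.
    static final String RESOURCE = "my.dict";

    /** Reads one dictionary entry per line from the stream; empty lines are skipped. */
    static Set<String> load(InputStream in) {
        // Assumption: the dictionary is UTF-8 encoded.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            Set<String> words = new HashSet<>();
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    words.add(line);
                }
            }
            return words;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Looks the dictionary up on the classpath and loads it. */
    static Set<String> loadFromClasspath() {
        InputStream in = DictionaryLoader.class.getClassLoader().getResourceAsStream(RESOURCE);
        if (in == null) {
            // getResourceAsStream() returns null when the resource is missing.
            throw new IllegalStateException("Resource not found on classpath: " + RESOURCE);
        }
        return load(in);
    }
}
```

Splitting load() from loadFromClasspath() keeps the parsing testable without a jar on the classpath.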
Now all is well and good, and this script even performs sufficiently well given
about 1.6 million dictionary entries (~38MB). But of course it's not really the
Maven way to do things, especially because it's not portable. You need
to have some kind of Unix-like environment in place for this script to
work.
SOLUTION:
Assuming the dictionary source files are already broken down by letters
of the alphabet, the following 5 lines of code do most of it.
(Note that the sed scripts are pretty close to just a line-by-line
trim() and, of course, you may need to sort the file names.)
<plugin>
  <groupId>org.codehaus.groovy.maven</groupId>
  <artifactId>gmaven-plugin</artifactId>
  <executions>
    <execution>
      <phase>generate-resources</phase>
      <goals>
        <goal>execute</goal>
      </goals>
      <configuration>
        <source>
          <![CDATA[
            new File( "target" ).mkdirs();
            def resultfile = new File( "target/dictionary" );
            new File("src/main/dictionary").eachFileMatch(~/.*\.lst/){ file ->
              file.readLines().sort().each(){resultfile << it.trim() + "\n";}
            }
          ]]>
        </source>
      </configuration>
    </execution>
  </executions>
</plugin>
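The Groovy snippet above concatenates and trims the lines. The remaining steps of the original script (removing ~ and ?, cutting each line at the first space, and a global sort with duplicate removal) could be sketched portably in plain Java as below. This is only a sketch under assumptions: the class name DictionaryBuilder, its method names, and the UTF-8 encoding are made up for illustration, and empty entries are skipped, which the sed/sort version does not do.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper, not part of the original thread.
class DictionaryBuilder {

    /** Mirrors the two sed expressions: drop '~' and '?', then cut at the first space. */
    static String clean(String line) {
        String stripped = line.replaceAll("[~?]", "");
        int space = stripped.indexOf(' ');
        return space >= 0 ? stripped.substring(0, space) : stripped;
    }

    /** Reads every *.lst file in sourceDir and writes sorted, unique entries to target. */
    static void build(Path sourceDir, Path target) {
        // A TreeSet sorts and de-duplicates in one step (like sort -u),
        // but uses plain String ordering rather than the locale's collation.
        SortedSet<String> entries = new TreeSet<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(sourceDir, "*.lst")) {
            for (Path file : files) {
                // Assumption: the exported files are UTF-8 encoded.
                for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                    String entry = clean(line);
                    if (!entry.isEmpty()) { // deviation from the script: skip empty entries
                        entries.add(entry);
                    }
                }
            }
            Path parent = target.toAbsolutePath().getParent();
            if (parent != null) {
                Files.createDirectories(parent);
            }
            // Overwrites the target, so repeated builds do not append to an old file.
            Files.write(target, entries, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because all entries go through one set, the order in which the files are read does not matter. The whole dictionary is held in memory, which should be acceptable for a file of roughly 38MB but is worth keeping in mind.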