Corpus Processing Tools
- Myself, so far. I've searched the whole web for the appropriate
set of tools while working on my Ph.D. Even though some tools were
close to what I was looking for, I never found THE killer application
that would suit my corpus-processing needs. This is why I decided to
develop it myself. But if you would like to participate, your help
would be most appreciated.
What is CoPT?
- CoPT, Corpus Processing Tools, is a set of Java classes intended
to assist field linguists, NLP (Natural Language Processing)
researchers and developers, students and software developers in all
- CoPT is also intended as an open and collaborative
corpus-processing platform, allowing individuals to contribute their
- CoPT will be released as a standalone application for end
users, and as a set of beans or plug-ins for developers (to be
integrated into existing NLP-related platforms: GATE, OpenNLP, Unitex,
PRAAT, Anvil). CoPT will therefore aim to provide a corpus-processing
API, together with a set of methods and procedures
for that task.
- At this stage, I intend to develop the following tools:
- N-gram-related classes for creating statistical models of
natural languages, and for studying collocations and word associations.
The intended goal is to provide a set of basic pluggable statistical
tests useful for spotting collocational behavior and word association,
which can be extended as needed.
- Tools for converting existing transcription or text files into
open source database formats.
- Basic formatting classes for sending data to and retrieving data from
other corpus-related applications, namely Unitex and PRAAT, which are
the ones I currently use most of the time. Interaction with Anvil is
also planned. Eventually, if the need arises, interaction with GATE
will be taken into consideration.
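To make the N-gram item in the list above more concrete, here is a minimal sketch of word N-gram counting over an already tokenized text. It is an illustration only, not CoPT code: the class name NGramSketch and the method countNGrams are hypothetical, and the token list is invented.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not part of the CoPT code base. */
final class NGramSketch {

    /**
     * Counts the word n-grams of size n found in a tokenized text.
     * Each n-gram is stored as its words joined by single spaces.
     */
    static Map<String, Integer> countNGrams(List<String> tokens, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        // LinkedHashMap keeps n-grams in order of first occurrence.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            String gram = String.join(" ", tokens.subList(i, i + n));
            counts.merge(gram, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Invented example text, already split into words.
        List<String> tokens = Arrays.asList("the", "cat", "saw", "the", "cat");
        System.out.println(countNGrams(tokens, 2));
    }
}
```

Counts of this kind are the raw input that collocation tests typically work from: the frequency of a word pair is compared with the frequencies of its parts.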
What is a corpus?
- For linguists dealing with actual spoken, written, or signed
languages, a corpus is a sample of linguistic behavior which they
intend to study. So, for those linguists, a corpus represents the set
of experimental data they will use in their investigation.
- For CoPT, a corpus is a set of linearly-ordered symbols (left
to right OR right to left, for now). This means that CoPT only works on
transcription files or texts taken from natural languages. CoPT DOES
NOT work on the original signal: audio or video, for example.
Who needs CoPT?
- Students working on corpora, whatever their theoretical and
methodological school of thought, should be able to use CoPT with some
profit, I hope. Students working on Natural Language Processing
assignments should also consider CoPT as a good framework for learning
about project development.
- Researchers: field linguists and corpus linguists should find
useful tools for mining their corpora and for turning them into
- Developers should also find here a concentration of otherwise
scattered corpus-related features, which will be available through an
API: tagging (requires some third-party software), concordancing,
statistical tools (e.g. the chi-squared test).
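As an illustration of the kind of statistical tool mentioned above, here is a minimal sketch of Pearson's chi-squared statistic for a 2x2 contingency table, a form often used to test whether two words co-occur more than chance would predict. It is an assumption about what such a tool could look like, not the CoPT API: the class ChiSquareSketch, the method chiSquare2x2, and the counts in main are all hypothetical. No continuity correction is applied.

```java
/** Hypothetical sketch, not part of the CoPT code base. */
final class ChiSquareSketch {

    /**
     * Pearson's chi-squared statistic for the 2x2 table
     *
     *              w2     not w2
     *   w1          a       b
     *   not w1      c       d
     *
     * computed as N * (a*d - b*c)^2 / ((a+b) * (c+d) * (a+c) * (b+d)),
     * where N = a + b + c + d.
     */
    static double chiSquare2x2(long a, long b, long c, long d) {
        double n = (double) a + b + c + d;
        double denominator = (double) (a + b) * (c + d) * (a + c) * (b + d);
        if (denominator == 0.0) {
            throw new IllegalArgumentException("a row or column total is zero");
        }
        double diff = (double) a * d - (double) b * c;
        return n * diff * diff / denominator;
    }

    public static void main(String[] args) {
        // Invented counts: a = bigram "w1 w2", b = w1 followed by another word,
        // c = w2 preceded by another word, d = all remaining bigrams.
        System.out.println(chiSquare2x2(8, 92, 42, 9858));
    }
}
```

The resulting value would then be compared with a critical value of the chi-squared distribution with one degree of freedom; that lookup is left out of the sketch.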
When will CoPT be available?
As of this date, no precise schedule
is available for the different releases of CoPT.
A set of test applets is available, though, for
different modules, such as:
- WordLCS (in French for now,
English version coming soon)
- other applets coming soon.
- WordLCS_V1, 21/03/2005: type
the following command: java WordLCS_V1 "String one here" "String two
here", and you will get:
- Contiguous Longest Common Substring (i.e. "String -")
- Discontiguous Longest Common Subsequence (i.e. "String - here").
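The two results above can be reproduced with standard dynamic programming over words rather than characters. The following is a minimal sketch under that assumption, not the source of WordLCS_V1: the class WordLcsSketch and its methods are hypothetical, words are split on whitespace, ties are resolved in favour of the first match found, and the "-" gap marker shown in the output above is not reproduced.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch, not the WordLCS_V1 implementation. */
final class WordLcsSketch {

    /** Splits a string into words on runs of whitespace. */
    static List<String> tokenize(String s) {
        String trimmed = s.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    /** Longest run of consecutive words shared by a and b. */
    static List<String> longestCommonSubstring(List<String> a, List<String> b) {
        // len[i][j] = length of the common run ending at a[i-1] and b[j-1].
        int[][] len = new int[a.size() + 1][b.size() + 1];
        int best = 0;
        int endInA = 0;
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                if (a.get(i - 1).equals(b.get(j - 1))) {
                    len[i][j] = len[i - 1][j - 1] + 1;
                    if (len[i][j] > best) {
                        best = len[i][j];
                        endInA = i;
                    }
                }
            }
        }
        return new ArrayList<>(a.subList(endInA - best, endInA));
    }

    /** Longest sequence of words shared by a and b, gaps allowed. */
    static List<String> longestCommonSubsequence(List<String> a, List<String> b) {
        // len[i][j] = LCS length of the first i words of a and first j of b.
        int[][] len = new int[a.size() + 1][b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                if (a.get(i - 1).equals(b.get(j - 1))) {
                    len[i][j] = len[i - 1][j - 1] + 1;
                } else {
                    len[i][j] = Math.max(len[i - 1][j], len[i][j - 1]);
                }
            }
        }
        // Walk back through the table to recover one optimal subsequence.
        List<String> out = new ArrayList<>();
        int i = a.size();
        int j = b.size();
        while (i > 0 && j > 0) {
            if (a.get(i - 1).equals(b.get(j - 1))) {
                out.add(a.get(i - 1));
                i--;
                j--;
            } else if (len[i - 1][j] >= len[i][j - 1]) {
                i--;
            } else {
                j--;
            }
        }
        Collections.reverse(out);
        return out;
    }

    public static void main(String[] args) {
        List<String> a = tokenize("String one here");
        List<String> b = tokenize("String two here");
        System.out.println(longestCommonSubstring(a, b));
        System.out.println(longestCommonSubsequence(a, b));
    }
}
```

For the two example strings, the contiguous result is the single word "String" and the discontiguous result is the word pair "String", "here". Both methods use a table of size (|a|+1) x (|b|+1), which is fine for sentence-length input but would need a more economical layout for whole texts.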
- 23 Sept. 2003: submission of the project description to
Sourceforge.net's assessment staff
- 24 Sept. 2003: acceptance notification by Sourceforge.net's
- 25 Sept. 2003: first public description of the project
- 29 Sept. 2003: first release of public description on
- 21 Mar. 2005 (at last!): first release of WordLCS_V1 class and
Here you will find the documentation for the following classes:
Author: Antonio BALVET
Last updated: 21 Mar. 2005
antonio.balvet at u-paris10.fr (replace "at" with @)
antonio.balvet at univ-lille3.fr
Affiliation: Université Lille 3, UMR Silex, UMR Modyco,