The corpora in ICE are being annotated at various levels to enhance their value in linguistic research. These levels are:
- Textual Markup
- Wordclass Tagging
- Syntactic Parsing
1. Textual Markup
In written texts, features of the original layout are marked, including sentence and paragraph boundaries, headings, deletions, and typographic features.
Spoken texts are transcribed orthographically, and are marked for pauses, overlapping strings, discourse phenomena such as false starts and hesitations, and speaker turns.
The markup manual is available here.
2. Wordclass Tagging
2.1 ICE-GB tagging
ICE-GB texts are automatically tagged for wordclass by the ICE Tagger, developed by Sean Wallis at the Survey of English Usage, University College London. The tagger assigns a wordclass tag to each lexical item in the corpus. The tagset was developed especially for ICE and is largely based on Quirk et al. (1985), A Comprehensive Grammar of the English Language. An example of a grammatically tagged sentence is shown below:
The tagging manual is available here.
2.2 Automatic POS Tagging with PENN and CLAWS
Each ICE component that is available on ICE-online has been automatically tagged with both the Penn Treebank tagset and the CLAWS tagset. The part-of-speech tags have not been corrected manually, but their accuracy has been evaluated.
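As a rough illustration of working with automatically tagged text, the sketch below reads word/tag pairs and counts how often each tag occurs. The `word_TAG` input format, the sample sentence, and the function names `parse_tagged` and `tag_frequencies` are assumptions made for this example; they are not the ICE-online export format. The tags in the sample are ordinary Penn Treebank tags.

```python
from collections import Counter


def parse_tagged(line, sep="_"):
    """Split a 'word_TAG word_TAG ...' line into (word, tag) pairs.

    The 'word_TAG' layout is an assumption for this sketch, not the
    documented ICE-online format. Tokens without a separator get tag None.
    """
    pairs = []
    for token in line.split():
        word, found, tag = token.rpartition(sep)
        if found:
            pairs.append((word, tag))
        else:
            pairs.append((token, None))
    return pairs


def tag_frequencies(pairs):
    """Count how often each tag occurs, ignoring untagged tokens."""
    return Counter(tag for _, tag in pairs if tag is not None)


# Hypothetical sample sentence with Penn Treebank tags.
sample = "The_DT corpus_NN is_VBZ tagged_VBN ._."
pairs = parse_tagged(sample)
freqs = tag_frequencies(pairs)
```

Because the tags have not been corrected manually, counts obtained this way are best treated as estimates.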
3. Syntactic Parsing
3.1 Constituency Annotation
Every sentence in the ICE-GB corpus is analysed at phrase, clause, and sentence level.
The analysis is shown in the form of a parse tree:
For more details about the grammatical constituency annotation, see the Quick Guide to the ICE-GB Grammar (on the UCL server).
The parse trees have been edited and, where necessary, corrected using a version of ICECUP, a dedicated syntactic tree editor and retrieval system developed specifically for ICE.
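A constituency analysis of this kind can be thought of as a tree whose internal nodes are phrases and clauses and whose leaves are words. The sketch below is a minimal, generic representation of such a tree. The `Node` class and the labels `CL`, `NP`, `VP`, `ART`, `N` and `V` are placeholders chosen for illustration; they are not the ICE-GB label scheme, and this is not how ICECUP stores trees.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of a parse tree: a label, an optional word, and children."""
    label: str                      # illustrative label, not an ICE-GB label
    word: Optional[str] = None      # set only on leaf nodes
    children: List["Node"] = field(default_factory=list)

    def leaves(self):
        """Return the words under this node in left-to-right order."""
        if not self.children:
            return [self.word] if self.word is not None else []
        words = []
        for child in self.children:
            words.extend(child.leaves())
        return words

    def depth(self):
        """Return the number of levels from this node down to its deepest leaf."""
        return 1 + max((child.depth() for child in self.children), default=0)


# Hypothetical three-level analysis: clause > phrases > words.
tree = Node("CL", children=[
    Node("NP", children=[Node("ART", "the"), Node("N", "corpus")]),
    Node("VP", children=[Node("V", "grows")]),
])
sentence = " ".join(tree.leaves())
```

Reading the leaves left to right recovers the sentence, while the internal nodes carry the phrase-level and clause-level analysis.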
3.2 Dependency Annotation
Each of the ICE components available in ICE-online has been automatically annotated with the dependency parser Pro3GreS. The annotations have not been manually verified, but their accuracy has been evaluated.
An example of an analysis, showing the part-of-speech tags and the dependencies, is shown below: