PRINTED FROM OXFORD HANDBOOKS ONLINE. © Oxford University Press, 2022. All Rights Reserved. Under the terms of the licence agreement, an individual user may print out a PDF of a single chapter of a title in Oxford Handbooks Online for personal use (for details see Privacy Policy and Legal Notice).

date: 28 June 2022

Abstract and Keywords

Electronic text is essentially just a sequence of characters, but the majority of text processing tools operate in terms of linguistic units such as words and sentences. Tokenization is the process of segmenting text into words, and sentence splitting is the process of determining sentence boundaries in the text. In this chapter we describe the major challenges for text tokenization and sentence splitting in different languages, and outline various computational approaches to tackling them.

Keywords: text segmentation, word splitting, text preprocessing, tokenization
