This article focuses on the use of probabilistic context-free grammars (PCFGs) in natural language processing, applied to a large-scale natural language parsing task. It describes detailed, highly structured Bayesian modelling in which model dimension and complexity respond naturally to observed data. The framework, termed the hierarchical Dirichlet process probabilistic context-free grammar (HDP-PCFG), combines structured hierarchical Dirichlet process modelling with customized model fitting via variational methods to address the problem of syntactic parsing and the underlying problems of grammar induction and grammar refinement. The central object of study is the parse tree, which can be used to describe a substantial amount of the syntactic structure and relational semantics of natural language sentences. The article provides an overview of the formal probabilistic specification of the HDP-PCFG, algorithms for posterior inference under the HDP-PCFG, and experiments on grammar learning run on the Wall Street Journal portion of the Penn Treebank.

Keywords: probabilistic context-free grammars (PCFGs), natural language processing, Bayesian modelling, parse tree, hierarchical Dirichlet process probabilistic context-free grammar (HDP-PCFG), syntactic parsing, grammar induction, grammar refinement, posterior inference, grammar learning

