A generative model of text documents capturing bursts and similarity

M. Ángeles Serrano

IFISC (CSIC-UIB), Palma de Mallorca, Spain

Various universal regularities characterize text from different domains and languages. Most notable are Zipf's law on the distribution of word frequencies, Heaps' law on vocabulary size, and the bursty nature of topical words. However, no single model of text generation explains how these properties emerge or predicts the empirical distribution of similarity between documents. Here we present and validate a generative model that simultaneously reproduces several statistical features of textual corpora. Our results point to frequency ranking, together with dynamic reordering and memory across documents, as key mechanisms for understanding text generation. A picture of how structure and topicality emerge in written text can shed light on the collective cognitive processes we use to organize and store information, and can find broad applications in topic detection, literature analysis, and Web mining.