Bag of biterms modeling for short texts

Published in Knowledge and Information Systems, volume 62, pages 4055–4090, 2020

Recommended citation: Tuan, A. P., Tran, B., Nguyen, T. H., Van, L. N., & Than, K. (2020). Bag of biterms modeling for short texts. Knowledge and Information Systems, 62(10), 4055-4090. https://link.springer.com/article/10.1007/s10115-020-01482-z

Analyzing texts from social media is challenging because such texts are short, massive in volume, and dynamic. Short texts do not provide enough context information, which causes traditional statistical models to fail. Furthermore, many applications face massive and dynamic short-text collections, which pose computational challenges for current batch learning algorithms. This paper presents a novel framework, bag of biterms modeling (BBM), for modeling massive, dynamic, and short text collections. BBM comprises two main ingredients: (1) the concept of bag of biterms (BoB) for representing documents, and (2) a simple way to help statistical models include BoB. Our framework can be easily deployed for a large class of probabilistic models, and we demonstrate its usefulness with two well-known models: latent Dirichlet allocation (LDA) and hierarchical …
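To make the bag of biterms (BoB) representation more concrete, here is a minimal Python sketch. It assumes that a biterm is an unordered pair of words co-occurring in the same short document, which is the usual meaning of the term in the short-text literature; the paper's exact definition may differ. The function name `bag_of_biterms` and the toy document are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def bag_of_biterms(tokens):
    """Count unordered word pairs that co-occur in one short document.

    Assumption: every pair of token positions in the document forms a biterm.
    """
    # Sort each pair so that ("store", "apple") and ("apple", "store")
    # are counted as the same biterm.
    pairs = (tuple(sorted(pair)) for pair in combinations(tokens, 2))
    return Counter(pairs)


# Toy example: a three-word document yields three biterms.
doc = "visit apple store".split()
bob = bag_of_biterms(doc)
```

Under this assumption, a document with n tokens yields n(n-1)/2 biterm occurrences, so even a very short text provides more co-occurrence units to count than its n individual words.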