Topic modeling finds human-readable structures in large sets of unstructured SE data. A widely used topic modeler is Latent Dirichlet Allocation (LDA). When run on SE data, LDA suffers from "order effects", i.e., different topics may be generated if the training data is shuffled into a different order. Such order effects introduce a systematic error into any study that uses topics to draw conclusions. This paper introduces LDADE, a Search-Based SE tool that tunes LDA's parameters using DE (Differential Evolution). LDADE has been tested on data from a programmer information exchange site (Stackoverflow), on the title and abstract text of thousands of SE papers, and on software defect reports from NASA. Results were collected across different implementations of LDA (Python+Scikit-Learn, Scala+Spark), across different platforms (Linux, Macintosh), and for different kinds of LDA (the traditional VEM method, or Gibbs sampling). In all tests the pattern was the same: LDADE's tunings dramatically reduce topic instability. The implications of this study for other software analytics tasks are now an open and pressing issue. In how many domains can search-based SE dramatically improve software analytics?
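Since the abstract mentions a Python+Scikit-Learn implementation, here is a minimal sketch of the core idea in that stack: use scipy's differential_evolution to search over LDA's (k, alpha, beta), scoring each candidate by how well the top topic words agree across two shuffled orderings of the corpus. The toy corpus, the parameter bounds, and the overlap-based stability score below are illustrative assumptions, not the authors' LDADE implementation or their exact fitness measure.

```python
# Minimal sketch (not the authors' LDADE code): tune LDA's k, alpha, beta with
# Differential Evolution so that topics stay stable when the corpus is shuffled.
# Toy corpus, bounds, and the stability score (median top-word overlap between
# two shuffled-order runs) are illustrative assumptions only.
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "topic models find structure in unstructured text",
    "lda topics change when documents are shuffled",
    "order effects introduce systematic error",
    "differential evolution tunes lda parameters",
    "defect reports and stackoverflow posts are noisy",
    "search based software engineering improves analytics",
]
X = CountVectorizer().fit_transform(docs)

def top_words(model, n=5):
    """Index sets of the n highest-weighted words for each topic."""
    return [set(np.argsort(row)[-n:]) for row in model.components_]

def instability(params):
    """Negative median top-word overlap between two shuffled-order runs."""
    k, alpha, beta = int(round(params[0])), params[1], params[2]
    runs = []
    for seed in (0, 1):  # two different document orderings
        order = np.random.RandomState(seed).permutation(X.shape[0])
        lda = LatentDirichletAllocation(n_components=k, doc_topic_prior=alpha,
                                        topic_word_prior=beta, random_state=0)
        lda.fit(X[order])
        runs.append(top_words(lda))
    # for each topic in run 0, overlap with its best-matching topic in run 1
    overlaps = [max(len(a & b) for b in runs[1]) for a in runs[0]]
    return -float(np.median(overlaps))  # DE minimizes, so negate the overlap

# Illustrative bounds for (k, alpha, beta); small budget to keep the demo quick.
bounds = [(2, 6), (0.01, 1.0), (0.01, 1.0)]
result = differential_evolution(instability, bounds, maxiter=5, popsize=5, seed=1)
print("tuned (k, alpha, beta):", result.x, "stability:", -result.fun)
```

The design point this sketch tries to capture is that the fitness function is stability itself: DE keeps parameter settings whose topics survive a reshuffle of the documents, which is exactly the order effect the paper says untuned LDA fails.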
from cs.AI updates on arXiv.org http://ift.tt/2bvVgSF
via IFTTT