Topic modeling finds human-readable structures in large sets of unstructured SE data. A widely used topic modeler is Latent Dirichlet Allocation (LDA). When run on SE data, LDA suffers from "order effects", i.e., different topics may be generated if the training data is shuffled into a different order. Such order effects introduce a systematic error into any study that uses topics to draw conclusions. This paper introduces LDADE, a Search-Based SE tool that tunes LDA's parameters using DE (Differential Evolution). LDADE has been tested on data from a programmer information exchange site (Stackoverflow), on the title and abstract text of thousands of SE papers, and on software defect reports from NASA. Results were collected across different implementations of LDA (Python+Scikit-Learn, Scala+Spark), across different platforms (Linux, Macintosh), and for different kinds of LDA (the traditional VEM method, or Gibbs sampling). In all tests the pattern was the same: LDADE's tunings dramatically reduce topic instability. The implications of this study for other software analytics tasks are now an open and pressing issue. In how many domains can search-based SE dramatically improve software analytics?
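Since the abstract mentions a Python+Scikit-Learn implementation, here is a minimal sketch of the core idea in that stack: use scipy's differential_evolution to search over LDA's (k, alpha, beta), scoring each candidate by how well the top topic words agree across two shuffled orderings of the corpus. The toy corpus, the parameter bounds, and the overlap-based stability score below are illustrative assumptions, not the authors' LDADE implementation or their exact fitness measure.

```python
# Minimal sketch (not the authors' LDADE code): tune LDA's k, alpha, beta with
# Differential Evolution so that topics stay stable when the corpus is shuffled.
# Toy corpus, bounds, and the stability score (median top-word overlap between
# two shuffled-order runs) are illustrative assumptions only.
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "topic models find structure in unstructured text",
    "lda topics change when documents are shuffled",
    "order effects introduce systematic error",
    "differential evolution tunes lda parameters",
    "defect reports and stackoverflow posts are noisy",
    "search based software engineering improves analytics",
]
X = CountVectorizer().fit_transform(docs)

def top_words(model, n=5):
    """Index sets of the n highest-weighted words for each topic."""
    return [set(np.argsort(row)[-n:]) for row in model.components_]

def instability(params):
    """Negative median top-word overlap between two shuffled-order runs."""
    k, alpha, beta = int(round(params[0])), params[1], params[2]
    runs = []
    for seed in (0, 1):  # two different document orderings
        order = np.random.RandomState(seed).permutation(X.shape[0])
        lda = LatentDirichletAllocation(n_components=k, doc_topic_prior=alpha,
                                        topic_word_prior=beta, random_state=0)
        lda.fit(X[order])
        runs.append(top_words(lda))
    # for each topic in run 0, overlap with its best-matching topic in run 1
    overlaps = [max(len(a & b) for b in runs[1]) for a in runs[0]]
    return -float(np.median(overlaps))  # DE minimizes, so negate the overlap

# Illustrative bounds for (k, alpha, beta); small budget to keep the demo quick.
bounds = [(2, 6), (0.01, 1.0), (0.01, 1.0)]
result = differential_evolution(instability, bounds, maxiter=5, popsize=5, seed=1)
print("tuned (k, alpha, beta):", result.x, "stability:", -result.fun)
```

The design point this sketch tries to capture is that the fitness function is stability itself: DE keeps parameter settings whose topics survive a reshuffle of the documents, which is exactly the order effect the paper says untuned LDA fails.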
from cs.AI updates on arXiv.org http://ift.tt/2bvVgSF
via IFTTT