Patrick McGuire: Information Extraction in Illicit Domains. (arXiv:1703.03097v1 [cs.CL])

Thursday, March 9, 2017

Extracting useful entities and attribute values from illicit domains such as human trafficking is a challenging problem with the potential for widespread social impact. Such domains employ atypical language models, have "long tails" and suffer from the problem of concept drift. In this paper, we propose a lightweight, feature-agnostic Information Extraction (IE) paradigm specifically designed for such domains. Our approach uses raw, unlabeled text from an initial corpus, and a few (12-120) seed annotations per domain-specific attribute, to learn robust IE models for unobserved pages and websites. Empirically, we demonstrate that our approach can outperform feature-centric Conditional Random Field baselines by over 18% F-Measure on five annotated sets of real-world human trafficking datasets in both low-supervision and high-supervision settings. We also show that our approach is demonstrably robust to concept drift, and can be efficiently bootstrapped even in a serial computing environment.
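
For readers unfamiliar with the metric, F-Measure is commonly computed as the harmonic mean of precision and recall over the extracted values. The sketch below is only an illustration of that standard definition, not the paper's evaluation code: the abstract does not state the exact matching protocol, so exact set matching is an assumption, and the function name `f_measure` and the sample values are hypothetical.

```python
def f_measure(predicted, gold):
    """Return (precision, recall, F1) for two collections of extracted values.

    Assumption: an extraction counts as correct only if it exactly matches
    a gold annotation. The paper may use a different matching rule.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)

    # Guard against empty sets so the function never divides by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up attribute values for a single page, purely for illustration.
    extracted = {"blue", "brown", "green", "hazel"}
    annotated = {"blue", "brown", "grey"}
    print(f_measure(extracted, annotated))
```

Here two of the four extracted values are correct and two of the three annotated values are found, so precision is 1/2, recall is 2/3, and F1 is 4/7.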

from cs.AI updates on arXiv.org http://ift.tt/2m7SWS0
via IFTTT
