Monday, March 28, 2016

How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation. (arXiv:1603.08023v1 [cs.CL])

We investigate evaluation metrics for end-to-end dialogue systems where supervised labels, such as task completion, are not available. Recent works in end-to-end dialogue systems have adopted metrics from machine translation and text summarization to compare a model's generated response to a single target response. We show that these metrics correlate very weakly or not at all with human judgements of response quality in both technical and non-technical domains. We provide quantitative and qualitative results highlighting specific weaknesses in existing metrics, and provide recommendations for future development of better automatic evaluation metrics for dialogue systems.
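
To make the setup concrete, here is a minimal sketch (not the paper's code) of the evaluation pipeline the abstract describes: each generated response is scored against a single target response with a word-overlap metric borrowed from machine translation (BLEU is used here as one representative example; the abstract does not name specific metrics), and the scores are then correlated with human quality ratings. The data, rating scale, and choice of NLTK and SciPy are assumptions for illustration only.

```python
# Illustrative sketch of metric-vs-human correlation, assuming nltk and scipy.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import spearmanr

# Hypothetical data: (generated response, single target response, human rating).
examples = [
    ("i am doing well thanks", "i am fine thank you", 4),
    ("what is your favourite movie", "i like science fiction films", 2),
    ("see you later", "goodbye , talk to you soon", 3),
]

smooth = SmoothingFunction().method1  # avoid zero BLEU on short responses
bleu_scores, human_scores = [], []
for generated, reference, rating in examples:
    # Compare the model's response to the single reference, as the abstract describes.
    bleu = sentence_bleu([reference.split()], generated.split(),
                         smoothing_function=smooth)
    bleu_scores.append(bleu)
    human_scores.append(rating)

# The paper's central question: does the automatic metric track human judgement?
rho, p_value = spearmanr(bleu_scores, human_scores)
print(f"Spearman correlation between BLEU and human ratings: {rho:.3f} (p={p_value:.3f})")
```

A weak or near-zero correlation in this kind of analysis is exactly the failure mode the paper reports for overlap-based metrics applied to dialogue.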

from cs.AI updates on arXiv.org http://ift.tt/1PBZSwK
via IFTTT
