p-Value: A statistically rigorous approach to machine learning model comparison
This event took place on Wednesday 07 September 2022 at 13:00
Evaluation of machine learning models often involves a two-stage comparison: first a comparison of model predictions to a human-annotated "gold standard" - yielding a metric score such as accuracy or correlation - and then a comparison of those metric scores between a baseline model and some proposed improvement to it. These comparisons are used, e.g., to establish new "state-of-the-art" results via benchmarks in the literature, or in practice to evaluate whether an engineering or data change made things better or worse. For the past decade of advances in AI, a mechanism to measure the confidence of these two-stage evaluations has eluded the community, and therefore we have largely failed to provide a measure of confidence on the comparative performance of the machines, even if statistical guarantees are provided for the individual stages. We propose that this ranking of machines should be grounded in a notion of statistical significance, and that this grounding must be robust in the face of the multiple stages of comparison. In this work, we explored the production of p-value confidence scores for the models' comparative performance, by testing a null hypothesis that the machine predictions being compared are drawn from the same distribution. We then developed an approach to producing two-sided horizontal and vertical variance that allows us to test this null hypothesis and produce a p-value for the comparison of two sets of machine scores (e.g. proposed vs. baseline). In order to evaluate the p-values we produce, we developed a simulator that allows us to experiment with different metrics, sampling methods, and comparative distributions. Our initial results provide insight into which sampling methods and metrics provide the most accurate p-value for machine comparisons.
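To make the idea of a p-value for a model comparison concrete, here is a minimal Python sketch of a standard paired permutation test. It is not the method presented in the talk, and it only accounts for variation across test items, not for variation in the human annotations. The function names (`accuracy`, `paired_permutation_p_value`) and the toy data are illustrative assumptions. Under the null hypothesis that both models' predictions come from the same distribution, swapping the two models' predictions on any item should not systematically change the difference in metric scores.

```python
import random


def accuracy(preds, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def paired_permutation_p_value(baseline, proposed, gold,
                               metric=accuracy, n_resamples=2000, seed=0):
    """Two-sided paired permutation test on the difference in metric scores.

    Hypothetical helper for illustration: for each resample, the two models'
    predictions are swapped on a random subset of items, and the metric
    difference is recomputed.
    """
    rng = random.Random(seed)
    observed = abs(metric(proposed, gold) - metric(baseline, gold))
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        a, b = [], []
        for x, y in zip(baseline, proposed):
            # Under the null, the model labels are exchangeable per item.
            if rng.random() < 0.5:
                x, y = y, x
            a.append(x)
            b.append(y)
        # Small tolerance guards against floating-point rounding.
        if abs(metric(b, gold) - metric(a, gold)) >= observed - 1e-12:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_resamples + 1)


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    gold = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1] * 5
    baseline = [1, 0, 0, 1, 0, 0, 0, 1, 1, 0] * 5
    proposed = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0] * 5
    print(paired_permutation_p_value(baseline, proposed, gold))
```

A small p-value suggests that the observed gap between the two models would be unlikely if their predictions were interchangeable. Any other metric with the same `(preds, gold)` signature can be passed in place of `accuracy`.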
Dr. Chris Welty is a Sr. Research Scientist at Google in New York. His main area of interest is the interaction between structured knowledge (e.g. knowledge graphs such as Freebase), unstructured knowledge (e.g. natural language text), and human knowledge (e.g. crowdsourcing). His latest work focuses on understanding the continuous nature of truth in the presence of a diversity of perspectives, and he has been working with the Google Maps team to better understand user contributions that often disagree. He is most active in the Crowdsourcing and Human Computation community, as well as The Web Conf, AKBC, Information and Knowledge Management, and AAAI.
His first project at Google was launched as Explore in Google Docs; he then worked on improving the quality and expanding the coverage of price level labels on Maps using user signals. Before Google, Dr. Welty was a member of the technical leadership team for IBM's Watson - the question answering computer that defeated the all-time best Jeopardy! champions in a widely televised contest. He appeared on the broadcast, discussing the technology behind Watson, and in many articles in the popular and scientific press. His proudest moment was being interviewed for StarTrek.com about the project. He is a recipient of the AAAI Feigenbaum Prize for his work.
Welty has played a seminal role in the development of the Semantic Web and Ontologies, and co-developed OntoClean, the first formal methodology for evaluating ontologies. He is on the editorial board of AI Magazine, the Journal of Applied Ontology, the Journal of Web Semantics, and the Semantic Web Journal. He is currently an editor for the AI Magazine column "AI Bookies", which fosters science bets on the progress of AI. He published many papers before those shown below; see his Google Scholar entry.
More information can be found on his Google Research page.
The event will be in person. Join us on the 4th Floor, Berrill Building.