QuestionMark: The Dataset Generator
QuestionMark: The Dataset Generator is a Python program that creates a dataset for probabilistic product matching. This dataset is required to run the benchmark tests with QuestionMark: The Probabilistic Benchmark.
This project was written by Nikki Zandbergen as part of her M.Sc. Computer Science thesis at the University of Twente. It was supervised by Maurice van Keulen, Tom van Dijk and Jan Flokstra.
The dataset created by this program is an adaptation of the WDC Product Data Corpus for Large-Scale Product Matching. The clustering provided by the original dataset is removed and replaced with a new probabilistic clustering.