Datasets

Tripadvisor Reviews

The Tripadvisor reviews of hotels and restaurants in Singapore were the basis for generating our word graph, which in turn was the basis for generating questions and queries. The ZIP archive contains three files: two plain text files with all hotel and restaurant reviews, and one comma-separated text file with the basic information about all venues (Tripadvisor ID, category, name, latitude, longitude).

Click here to download the ZIP archive (80Mb)
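
As a rough illustration of how the venue file might be read, here is a minimal Python sketch. It assumes that each row holds the five fields in the order listed above (Tripadvisor ID, category, name, latitude, longitude); the actual column order, the presence of a header row, and the file name are not documented here, so check them against the archive. The Venue class, the read_venues function, and the sample row are made up for illustration.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Venue:
    tripadvisor_id: str
    category: str
    name: str
    latitude: float
    longitude: float


def read_venues(handle):
    """Parse venue rows from an open text handle.

    Assumes five columns per row in the order:
    Tripadvisor ID, category, name, latitude, longitude.
    """
    venues = []
    for row in csv.reader(handle):
        if len(row) != 5:
            continue  # skip blank or malformed lines
        venue_id, category, name, lat, lon = row
        try:
            venues.append(Venue(venue_id, category, name, float(lat), float(lon)))
        except ValueError:
            continue  # non-numeric coordinates, e.g. a header row if one exists
    return venues


# Made-up sample row, used only to show the expected shape of the data.
sample = io.StringIO('123,hotel,"Example Hotel",1.2834,103.8607\n')
venues = read_venues(sample)
```

To read the real file, open it with open(path, newline="", encoding="utf-8") and pass the handle to read_venues; the encoding is also an assumption.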

Generated Training & Test Data

The query-to-question pipeline of CloseUp consists of three components: (1) query classification (QC) to identify live questions, (2) named entity recognition in queries (NERQ) to identify the names of venues in queries, and (3) query-to-question (Q2Q) translation to convert keyword-based queries into well-formed questions. Below, we provide the datasets we used to train all three components. Each dataset contains one file with the training data (1M items) and one file with the test data (100k items).

Click here to download the ZIP archive of QC dataset (15Mb)

Click here to download the ZIP archive of NERQ dataset (16Mb)

Click here to download the ZIP archive of Q2Q dataset (16Mb)
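
To make the role of the three datasets concrete, the following Python sketch shows one way the three components described above could be chained. It is not the CloseUp implementation: the function name, the signatures, and the decision to stop when a query is not classified as a live question are all assumptions, and the three callables are placeholders for models trained on the QC, NERQ, and Q2Q data.

```python
from typing import Callable, List, Optional


def query_to_question(
    query: str,
    is_live_question: Callable[[str], bool],     # stand-in for the QC model
    find_venues: Callable[[str], List[str]],     # stand-in for the NERQ model
    translate: Callable[[str, List[str]], str],  # stand-in for the Q2Q model
) -> Optional[str]:
    """Run a keyword query through QC, then NERQ, then Q2Q.

    Returns None when the query is not classified as a live question
    (an assumed behavior, not taken from the source).
    """
    if not is_live_question(query):
        return None
    venues = find_venues(query)
    return translate(query, venues)


# Toy stand-ins that only demonstrate the data flow between the components.
result = query_to_question(
    "queue example cafe",
    is_live_question=lambda q: True,
    find_venues=lambda q: ["example cafe"],
    translate=lambda q, v: f"[question about {', '.join(v)}]",
)
skipped = query_to_question("x", lambda q: False, lambda q: [], lambda q, v: q)
```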