The Tripadvisor reviews of hotels and restaurants in Singapore formed the basis for generating our word graph, which in turn was the basis for generating questions and queries. The ZIP archive contains three files: two plain-text files with all hotel and restaurant reviews, and one comma-separated text file with the basic information about all venues (Tripadvisor ID, category, name, latitude, longitude).
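As a rough illustration of how the venue file might be read, the following Python sketch parses comma-separated rows into simple records. It assumes the five fields appear in the order listed above (Tripadvisor ID, category, name, latitude, longitude) and that the file has no header row; neither is confirmed here, so check the actual file first. The `Venue` class, the `read_venues` function, and the sample rows are hypothetical and not taken from the archive.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Venue:
    tripadvisor_id: str
    category: str
    name: str
    latitude: float
    longitude: float


def read_venues(handle):
    """Parse venue rows from an open text handle.

    Assumption: each row holds exactly five fields in the order
    id, category, name, latitude, longitude, with no header row.
    Rows with a different number of fields are skipped.
    """
    venues = []
    for row in csv.reader(handle):
        if len(row) != 5:
            continue  # skip rows that do not match the assumed layout
        venue_id, category, name, lat, lon = row
        venues.append(Venue(venue_id, category, name, float(lat), float(lon)))
    return venues


# Made-up sample rows for illustration only; not real entries from the dataset.
SAMPLE = (
    'd0001,hotel,"Example Hotel",1.2834,103.8607\n'
    'd0002,restaurant,"Example Cafe, Singapore",1.3000,103.8000\n'
)

venues = read_venues(io.StringIO(SAMPLE))
for venue in venues:
    print(venue.category, venue.name, venue.latitude, venue.longitude)
```

To read the real file, pass an open file handle instead of the in-memory sample, for example `open(path, newline="", encoding="utf-8")`, where `path` points to the extracted venue file.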
Generated Training & Test Data
The query-to-question pipeline of CloseUp consists of three components: (1) query classification (QC) to identify live questions, (2) named entity recognition in queries (NERQ) to identify the names of venues in queries, and (3) query-to-question (Q2Q) translation to convert keyword-based queries into well-formed questions. Below, we provide the datasets we used to train all three components. Each dataset consists of one training file with 1M items and one test file with 100k items.
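The way the three components fit together can be sketched in Python as follows. This is a minimal illustration and not the CloseUp implementation: it assumes the stages run in the order listed and that the Q2Q step receives the venue names found by NERQ, which the description above does not state. The function `closeup_pipeline` and the three `toy_*` stand-ins are hypothetical placeholders for the trained models.

```python
from typing import Callable, List, Optional


def closeup_pipeline(
    query: str,
    classify: Callable[[str], bool],
    tag_venues: Callable[[str], List[str]],
    translate: Callable[[str, List[str]], str],
) -> Optional[str]:
    """Chain QC, NERQ and Q2Q; return None if the query is not a live question."""
    if not classify(query):            # (1) QC: is this a live question?
        return None
    venues = tag_venues(query)         # (2) NERQ: venue names in the query
    return translate(query, venues)    # (3) Q2Q: keywords -> well-formed question


# Toy stand-ins for the trained models (assumptions, for illustration only).
KNOWN_VENUES = ["Example Hotel"]


def toy_classify(query: str) -> bool:
    # Treat a query as "live" if it contains the keyword "now".
    return "now" in query.lower().split()


def toy_tag_venues(query: str) -> List[str]:
    # Case-insensitive lookup against a tiny hard-coded venue list.
    return [v for v in KNOWN_VENUES if v.lower() in query.lower()]


def toy_translate(query: str, venues: List[str]) -> str:
    # Fixed template; a trained Q2Q model would generate the question instead.
    subject = venues[0] if venues else "this place"
    return f"What is it like at {subject} right now?"


result = closeup_pipeline(
    "example hotel crowded now", toy_classify, toy_tag_venues, toy_translate
)
skipped = closeup_pipeline(
    "example hotel address", toy_classify, toy_tag_venues, toy_translate
)
print(result)
print(skipped)
```

Passing the stages in as callables keeps the sketch independent of any particular model, so each toy function could be swapped for a model trained on the corresponding dataset.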