html2struct.github.io

This page provides the data used in the paper "Does Structure Matter? Encoding Documents for Machine Reading Comprehension", which appeared at NAACL-HLT 2021.

If you find the data useful, please cite the paper using the following BibTeX entry.

@inproceedings{wan-etal-2021-structure,
    title = "Does Structure Matter? Encoding Documents for Machine Reading Comprehension",
    author = "Wan, Hui  and
      Feng, Song  and
      Gunasekara, Chulaka  and
      Patel, Siva Sankalp  and
      Joshi, Sachindra  and
      Lastras, Luis",
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.naacl-main.367",
    pages = "4626--4634",
    abstract = "Machine reading comprehension is a challenging task especially for querying documents with deep and interconnected contexts. Transformer-based methods have shown advanced performances on this task; however, most of them still treat documents as a flat sequence of tokens. This work proposes a new Transformer-based method that reads a document as tree slices. It contains two modules for identifying more relevant text passage and the best answer span respectively, which are not only jointly trained but also jointly consulted at inference time. Our evaluation results show that our proposed method outperforms several competitive baseline approaches on two datasets from varied domains.",
}

Document structures extracted from Doc2Dial

Please find below the document tree structures and the queries obtained from version v0.9 of the Doc2Dial dataset.

doc2dial_doc_tree.json.gz
d2d_dialogue_QAs.tar.gz

Document structures extracted from Natural Questions

Please find below the queries and document tree structures obtained from each of the original Natural Questions dataset files. The examples are filtered so that each one contains short answers drawn from the tree structure rather than from the first paragraph of the Wikipedia page.

nq-train-00.jsonl.gz
nq-train-01.jsonl.gz
nq-train-02.jsonl.gz
nq-train-03.jsonl.gz
nq-train-04.jsonl.gz
nq-train-05.jsonl.gz
nq-train-06.jsonl.gz
nq-train-07.jsonl.gz
nq-train-08.jsonl.gz
nq-train-09.jsonl.gz
nq-train-10.jsonl.gz
nq-train-11.jsonl.gz
nq-train-12.jsonl.gz
nq-train-13.jsonl.gz
nq-train-14.jsonl.gz
nq-train-15.jsonl.gz
nq-train-16.jsonl.gz
nq-train-17.jsonl.gz
nq-train-18.jsonl.gz
nq-train-19.jsonl.gz
nq-train-20.jsonl.gz
nq-train-21.jsonl.gz
nq-train-22.jsonl.gz
nq-train-23.jsonl.gz
nq-train-24.jsonl.gz
nq-train-25.jsonl.gz
nq-train-26.jsonl.gz
nq-train-27.jsonl.gz
nq-train-28.jsonl.gz
nq-train-29.jsonl.gz
nq-train-30.jsonl.gz
nq-train-31.jsonl.gz
nq-train-32.jsonl.gz
nq-train-33.jsonl.gz
nq-train-34.jsonl.gz
nq-train-35.jsonl.gz
nq-train-36.jsonl.gz
nq-train-37.jsonl.gz
nq-train-38.jsonl.gz
nq-train-39.jsonl.gz
nq-train-40.jsonl.gz
nq-train-41.jsonl.gz
nq-train-42.jsonl.gz
nq-train-43.jsonl.gz
nq-train-44.jsonl.gz
nq-train-45.jsonl.gz
nq-train-46.jsonl.gz
nq-train-47.jsonl.gz
nq-train-48.jsonl.gz
nq-train-49.jsonl.gz
nq-dev-00.jsonl.gz
nq-dev-01.jsonl.gz
nq-dev-02.jsonl.gz
nq-dev-03.jsonl.gz
nq-dev-04.jsonl.gz

The test set used in the paper is the dev set above (the nq-dev files), and the dev set used in the paper is composed of the last 50 examples from each of the training files.
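
The split described above can be sketched in Python as follows. This is a minimal sketch, not the authors' script. It assumes that each line of a .jsonl.gz file holds one JSON example, that the files have been downloaded into a local directory, and that the 50 held-out examples are excluded from the training portion (the page does not state this explicitly). The helper names train_shard_names, read_jsonl_gz, split_shard, and build_splits are hypothetical.

```python
import gzip
import json
from pathlib import Path

DEV_SIZE = 50  # "the last 50 examples from each of the training files"


def train_shard_names(n_shards=50):
    """File names nq-train-00.jsonl.gz ... nq-train-49.jsonl.gz, as listed on this page."""
    return [f"nq-train-{i:02d}.jsonl.gz" for i in range(n_shards)]


def read_jsonl_gz(path):
    """Read a gzipped JSON Lines file; assumes one JSON object per non-empty line."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def split_shard(examples, dev_size=DEV_SIZE):
    """Return (train_part, dev_part), where dev_part is the last dev_size examples."""
    if dev_size <= 0:
        return list(examples), []
    return examples[:-dev_size], examples[-dev_size:]


def build_splits(data_dir):
    """Collect train and dev examples over all training shards found in data_dir."""
    train, dev = [], []
    for name in train_shard_names():
        path = Path(data_dir) / name
        if not path.exists():
            continue  # shard not downloaded; skipped in this sketch
        shard_train, shard_dev = split_shard(read_jsonl_gz(path))
        train.extend(shard_train)  # assumption: held-out examples are removed from train
        dev.extend(shard_dev)
    return train, dev
```

Calling build_splits("path/to/downloads") returns two lists of parsed examples. The nq-dev files can be read with read_jsonl_gz in the same way to obtain the test set.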