KR 2023 homework part 1

Source: Lambda

The goal of the lab is to explore and prepare data for the question answering system to be built in the following labs. We will later modify, filter and format this data to make it suitable for simple commonsense reasoning / question answering.

Since the goal really is first exploring and then converting the data, there is no exact specification in the style of "take this and do that".

If you use sqlite, first check out how to speed up inserts by a factor of about 1000, or google analogous tips. For postgres, search for analogous tips to make bulk inserts several orders of magnitude faster than naive insertion.
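As a rough illustration of the kind of speed-up tips meant here, the sketch below puts all inserts into a single transaction and uses `executemany`. The table name `facts`, its columns and the generated rows are made up for the example, and the two `PRAGMA` settings trade crash safety for speed, so they are assumptions that only suit a throwaway import.

```python
import sqlite3

def bulk_insert(conn, rows):
    """Insert (subject, predicate, object) triples in a single transaction."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS facts (subject TEXT, predicate TEXT, object TEXT)"
    )
    # 'with conn' commits once at the end instead of once per row.
    with conn:
        conn.executemany("INSERT INTO facts VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")  # use a file name instead to keep the data
# Optional and unsafe on power loss: skip syncing and keep the journal in memory.
conn.execute("PRAGMA synchronous = OFF")
conn.execute("PRAGMA journal_mode = MEMORY")

# Hypothetical rows standing in for parsed Yago facts.
rows = [(f"place_{i}", "locatedIn", "Estonia") for i in range(10000)]
bulk_insert(conn, rows)
count = conn.execute("SELECT COUNT(*) FROM facts").fetchone()[0]
print(count)
```

The main gain usually comes from the single transaction; the pragmas are a secondary tweak.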

If you have very little or no experience using SQL, then it is recommended to

  • Use sqlite, not Postgresql or any other database: sqlite is easiest to use.
  • Run this small example from python: how to use sqlite
  • Read and try out live examples from the w3schools SQL tutorial: at least from the beginning up to the "delete" chapter and then the first few of the "join" chapters.
  • Look for the specific details and options of using sqlite in python from the python 3 sqlite api docs
  • Do not worry about sql injection or such security stuff: your code will not be accessible from the web and nobody can attack it using sql injection.
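For readers new to SQL, a minimal sketch of using sqlite from python could look as follows. It is not the linked course example; the tables `city` and `country` and their contents are invented here purely to show create, insert, select and a simple join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file name such as "kr.db" to persist
cur = conn.cursor()

# Two small tables linked by country_id.
cur.execute("CREATE TABLE country (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE city (id INTEGER PRIMARY KEY, name TEXT, country_id INTEGER)")

cur.execute("INSERT INTO country (id, name) VALUES (1, 'Estonia')")
cur.executemany(
    "INSERT INTO city (name, country_id) VALUES (?, ?)",
    [("Tallinn", 1), ("Tartu", 1)],
)
conn.commit()

# Join the tables to list the cities of one country.
query = """
    SELECT city.name
    FROM city JOIN country ON city.country_id = country.id
    WHERE country.name = ?
    ORDER BY city.name
"""
cities = [row[0] for row in cur.execute(query, ("Estonia",))]
print(cities)  # ['Tallinn', 'Tartu']
```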

Yago for geodata

As a source for geographical data we use yago (here is their old page). Investigate parts of it and download from the download page (again, the old download page is here). The second important taxonomy source is wordnet: Yago contains some (?) of it, but you can use wordnet directly in your system.

Yago is large. You do not need to incorporate all of Yago in your database: the geographical facts for, say, one country, plus the taxonomies, are enough.
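One possible way to cut out a single country is a two-pass filter over the fact file, sketched below. It assumes the facts have already been brought into plain tab-separated `subject, predicate, object` lines; the real Yago files differ between versions, and the predicate name `isLocatedIn` and the sample lines are illustrative assumptions, so check them against the download you actually use.

```python
def iter_triples(lines):
    """Yield (subject, predicate, object) from tab-separated lines, skipping others."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3:
            yield tuple(parts)

def select_country_facts(lines, country, located_in="isLocatedIn"):
    """Keep facts about the country and about entities located directly in it.

    'lines' must be re-iterable (e.g. a list); for a real file, reopen it
    for the second pass instead of loading everything into memory.
    """
    # Pass 1: collect the entities located in the country.
    places = {country}
    for s, p, o in iter_triples(lines):
        if p == located_in and o == country:
            places.add(s)
    # Pass 2: keep every fact whose subject is one of these entities.
    return [t for t in iter_triples(lines) if t[0] in places]

# Made-up sample lines, not actual Yago content.
sample = [
    "Tallinn\tisLocatedIn\tEstonia\n",
    "Tallinn\ttype\tcity\n",
    "Riga\tisLocatedIn\tLatvia\n",
    "Estonia\thasCapital\tTallinn\n",
    "a malformed line\n",
]
kept = select_country_facts(sample, "Estonia")
print(kept)
```

This only follows one level of containment; places located in a county that is in turn located in the country would need the first pass repeated until no new entities appear.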

Your task is to

  • select a sensible sub-part of geographical facts in Yago (like, one country) and store it in the SQL database, using both standard SQL and, where it seems useful, json. Postgresql and sqlite are the best options: Postgresql due to its special capabilities for handling json, sqlite due to its simplicity of use.
  • select a sensible sub-part of Yago taxonomies and store it in the same SQL database.
  • investigate whether your yago taxonomy contains relevant parts of wordnet: if not, also incorporate wordnet into your database.
  • perform some sample queries to verify that you can actually find information: try searching for both simple facts and also using taxonomies for searching for more abstract concepts.
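The tasks above can be sketched in sqlite roughly as follows: one table for entities with a json text column for loosely structured extras, one table for taxonomy edges, and a recursive query that finds all instances of an abstract class. The schema, the class names and the sample rows are assumptions made for illustration, not a prescribed design.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (name TEXT PRIMARY KEY, class TEXT, extra TEXT);  -- extra: json text
CREATE TABLE taxonomy (subclass TEXT, superclass TEXT);
""")
with conn:
    conn.executemany("INSERT INTO taxonomy VALUES (?, ?)", [
        ("city", "settlement"), ("village", "settlement"),
        ("settlement", "place"), ("river", "place"),
    ])
    conn.executemany("INSERT INTO entity VALUES (?, ?, ?)", [
        ("Tallinn", "city", json.dumps({"labels": {"en": "Tallinn", "de": "Reval"}})),
        ("Tartu", "city", json.dumps({})),
        ("Emajõgi", "river", json.dumps({})),
    ])

def instances_of(conn, cls):
    """Return entities whose class is cls or any direct or indirect subclass of it."""
    sql = """
    WITH RECURSIVE sub(name) AS (
        SELECT ?
        UNION
        SELECT taxonomy.subclass
        FROM taxonomy JOIN sub ON taxonomy.superclass = sub.name
    )
    SELECT entity.name FROM entity JOIN sub ON entity.class = sub.name
    ORDER BY entity.name
    """
    return [row[0] for row in conn.execute(sql, (cls,))]

# Simple fact lookup: read a value out of the json column in python.
row = conn.execute("SELECT extra FROM entity WHERE name = ?", ("Tallinn",)).fetchone()
german_label = json.loads(row[0])["labels"]["de"]

# Taxonomy-based lookup for a more abstract concept.
settlements = instances_of(conn, "settlement")
places = instances_of(conn, "place")
print(german_label, settlements, places)
```

Parsing the json in python keeps the sketch independent of the database's json functions; with Postgresql the same data could go into a native json column and be queried directly.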

Extras

It is a good idea - but not at all obligatory! - to explore the following basic commonsense knowledge datasets, find out whether they offer useful additional information for the geography domain, filter that information out and store it in the database.

Quasimodo

As a source for basic commonsense knowledge we explore and possibly use quasimodo. Beware: it does not cover very much and contains a lot of weird statements, like "estonia, has_color, yellow".

In particular, please

  • Find out if the dataset contains a meaningful amount of information about the geographical places (or additional information about the facts about these places) for the sub-part you chose from Yago.
  • If yes, please store the relevant, meaningful part of this information in the SQL database, using both standard SQL and, where it seems useful, json.
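To judge whether the coverage is meaningful, a quick count of statements per place can help before storing anything. The sketch below assumes a tab-separated file with a header row and columns named `subject`, `predicate`, `object` and `plausibility`; these names, the threshold and the sample rows are assumptions, so compare them with the actual Quasimodo download first.

```python
import csv
import io
from collections import Counter

def coverage(tsv_file, places, min_plausibility=0.5):
    """Count statements per place whose plausibility reaches the threshold."""
    counts = Counter()
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        if row["subject"] in places and float(row["plausibility"]) >= min_plausibility:
            counts[row["subject"]] += 1
    return counts

# Made-up rows in the assumed layout; with real data, pass an open file instead.
sample = io.StringIO(
    "subject\tpredicate\tobject\tplausibility\n"
    "estonia\thas_color\tyellow\t0.2\n"
    "estonia\tbe in\teurope\t0.9\n"
    "tallinn\tbe\tcapital\t0.8\n"
    "latvia\tbe in\teurope\t0.9\n"
)
result = coverage(sample, {"estonia", "tallinn", "tartu"})
print(result)
```

Places that never show up in the counter, like `tartu` here, are the ones for which the dataset adds nothing at the chosen threshold.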

Conceptnet

Similarly to Quasimodo, explore, and if it makes sense, use conceptnet. This contains a lot more than Quasimodo, but does not have as good metainformation (context, plausibility, etc).

In particular,

  • Find out if the dataset contains a meaningful amount of information about the geographical places (or additional information about the facts about these places) for the sub-part you chose from Yago.
  • If yes, please store a significant amount of the relevant, meaningful information in the SQL database, using both standard SQL and, where it seems useful, json. Since Conceptnet is big, do not attempt to store all the seemingly relevant information.
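A capped import along these lines might look like the sketch below. It assumes the commonly distributed Conceptnet assertions dump with five tab-separated fields per line (edge URI, relation, start node, end node, json metadata); the table name `conceptnet`, the `limit` parameter and the sample lines are hypothetical, and the sample weights are invented.

```python
import json
import sqlite3

def store_conceptnet_edges(conn, lines, concepts, limit=1000):
    """Store at most 'limit' edges that touch one of the given concept URIs."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS conceptnet "
        "(relation TEXT, start_node TEXT, end_node TEXT, weight REAL, info TEXT)"
    )
    kept = 0
    with conn:  # one transaction for the whole import
        for line in lines:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 5:
                continue  # skip lines that do not match the assumed layout
            _uri, relation, start, end, info = parts
            if start not in concepts and end not in concepts:
                continue
            weight = json.loads(info).get("weight")
            # Frequently used fields go into SQL columns, the rest stays as json text.
            conn.execute(
                "INSERT INTO conceptnet VALUES (?, ?, ?, ?, ?)",
                (relation, start, end, weight, info),
            )
            kept += 1
            if kept >= limit:
                break
    return kept

# Made-up lines in the assumed format.
sample = [
    '/a/1\t/r/IsA\t/c/en/tallinn\t/c/en/city\t{"weight": 1.0}\n',
    '/a/2\t/r/PartOf\t/c/en/riga\t/c/en/latvia\t{"weight": 1.0}\n',
    '/a/3\t/r/PartOf\t/c/en/tallinn\t/c/en/estonia\t{"weight": 2.0}\n',
    '/a/4\t/r/IsA\t/c/en/estonia\t/c/en/country\t{"weight": 1.0}\n',
]
conn = sqlite3.connect(":memory:")
kept = store_conceptnet_edges(conn, sample, {"/c/en/tallinn", "/c/en/estonia"}, limit=2)
stored = conn.execute(
    "SELECT relation, start_node, end_node, weight FROM conceptnet"
).fetchall()
print(kept, stored)
```

Stopping at the first `limit` matches is the simplest cap; sorting by weight and keeping the strongest edges per concept would be a more selective alternative.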

Ascent++

  • Check out the newest Max Planck dataset ascent++, explore the data, read the paper.
  • If some parts look useful, please filter these out and store a significant amount of the relevant, meaningful information in the SQL database, using both standard SQL and, where it seems useful, json.