KR lab idea by Priit Järv

Source: Lambda

Ontology for segmenting persons based on online searches

There is a demand for automatically profiling persons based on the information that is available about them on the open Internet. A typical application is in marketing.

1.) What is given

A body of documents harvested online that is related to a person. We may assume that we can extract and match English language words and phrases from the documents.

A list of labels, or tags, used to classify this person for profiling purposes, for example "Sport" indicating that the person is involved or interested in sports, or "Pets" indicating that the person keeps or likes pets.

2.) What is needed

An ontology that maps a given set of tags (assume that these are ubiquitous concepts related to interests and activities) to words and phrases frequently found in online documents (these can be as diverse as news reports, data dumps, encyclopedic data, but first and foremost self-published content).

The mapping should preferably allow querying the semantic distance between a short text segment and a tag (for example, as a value in the interval (0, 1] such that 0.1 is very remotely related and 1.0 is very closely related). Alternatively, a "yes"/"no" answer is also adequate for practical purposes.

Data-structure-wise, the ontology may be a graph, but even a flat list can serve the same purpose.
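A flat list of this kind could look roughly like the sketch below. The entry fields, the example scores and the substring-based lookup are all assumptions made for illustration, not part of the task definition; the value 0.0 is used here to mean "no match", which lies outside the (0, 1] interval reserved for related phrases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    phrase: str   # word or phrase expected in online documents
    tag: str      # profiling tag, e.g. "Sport"
    score: float  # closeness in (0, 1]; 1.0 means very closely related


# Hypothetical entries; the scores are invented for illustration only.
ONTOLOGY = [
    Entry("play basketball", "Sport", 0.9),
    Entry("stadium", "Sport", 0.6),
    Entry("dog food", "Pets", 0.8),
]


def closeness(segment, tag, entries=ONTOLOGY):
    """Best score among entries for `tag` whose phrase occurs in `segment`.

    Returns 0.0 when nothing matches. The naive substring test stands in
    for whatever matching method is eventually used.
    """
    text = segment.lower()
    scores = [e.score for e in entries if e.tag == tag and e.phrase in text]
    return max(scores, default=0.0)


def is_related(segment, tag, threshold=0.5):
    """The "yes"/"no" variant: threshold the closeness score."""
    return closeness(segment, tag) >= threshold
```

For example, closeness("We play basketball every Friday", "Sport") picks up the "play basketball" entry, while the same segment queried against "Pets" finds nothing.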

The ontology should take the morphology of the language into account. For example, it should at least be possible to identify the phrase "play basketball" as consisting of a verb and a noun, such that the text segments "played basketball" or "play some basketball" could also be matched (the algorithm for such matching is not part of the task; the ontology merely needs to provide adequate data so that this can be accomplished).
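To illustrate what "adequate data" might mean, the sketch below stores a phrase as a sequence of (lemma, part of speech) pairs. The toy lemma table and the gap-tolerant matcher are hypothetical and are included only to show that such data is enough to match the example segments; the matcher itself remains outside the task.

```python
# A phrase stored as (lemma, part-of-speech) pairs. The POS labels are kept
# as data for downstream use; this toy matcher does not consult them.
PHRASE = [("play", "VERB"), ("basketball", "NOUN")]

# Hypothetical toy lemma table; a real system would use a morphological analyser.
LEMMAS = {"played": "play", "playing": "play", "plays": "play"}


def lemmatize(word):
    """Map a word form to its lemma, falling back to the lowercased word."""
    word = word.lower()
    return LEMMAS.get(word, word)


def matches(phrase, segment, max_gap=1):
    """True if the phrase lemmas occur in order in `segment`, with at most
    `max_gap` extra words between consecutive lemmas."""
    words = [lemmatize(w) for w in segment.split()]
    targets = [lemma for lemma, _pos in phrase]
    for start, word in enumerate(words):
        if word != targets[0]:
            continue
        pos, found_all = start, True
        for target in targets[1:]:
            window = words[pos + 1 : pos + 2 + max_gap]
            if target in window:
                pos = pos + 1 + window.index(target)
            else:
                found_all = False
                break
        if found_all:
            return True
    return False
```

With these definitions, both "played basketball" and "play some basketball" match PHRASE, whereas a segment with the words in the opposite order does not.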

3.) What is not needed

Developing the ontology is only one subtask of segmenting persons; many other problems do not need to be solved within this subtask, such as:

- correctly identifying persons online
- linguistic or structural analysis of online documents
- actual classification methods and automatic reasoning based on the data

4.) Existing resources

Many ontologies that establish semantic relations between words and phrases exist, for example OpenCyc, WordNet and ConceptNet.

Superficially, ConceptNet provides exactly the ontology that is needed. It is a reasonably large aggregation of other ontologies as well as original content. Its API provides a method of querying the semantic distance between two words or phrases.

However, querying related concepts reveals that for the specific task of profiling the interests of a person, ConceptNet data can be counterproductive. For example, if we search for concepts related to "sport", we will find the terms "scrabble", "venison", "c++" and "smoke marijuana" being fairly close, higher up the list than the single word "basketball". The concepts related to the word "pet" are completely dominated by dog concepts, but also contain, for example, "reverse god". While there is a semantic relation, it is extremely unlikely that a pet blogger or a pet website would use such terminology. As a rough estimate, when we look at the 1000 closest concepts, as many as 80% can be considered noise in the context of this task.

5.) Suggestions

- Consider ambiguity. Words and phrases strongly connected to multiple tags are probably not useful.
- Consider the exact semantic relation to prune "bad" connections in semantic networks. ConceptNet and especially WordNet have classified the relation type. For example, since the tags are already general terms, moving to more abstract terms is not useful.
- Consider what words and phrases actually appear most frequently in online documents, for example in search results when searching for the tag names.
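The first two suggestions could be sketched as a simple pruning pass over candidate links. Everything here is hypothetical: the candidate tuples are invented, and the relation labels "narrower", "broader" and "related" are placeholders rather than the actual relation names used by ConceptNet or WordNet. The frequency-based suggestion is not covered by this sketch.

```python
from collections import defaultdict

# Hypothetical candidate links: (phrase, tag, relation of the phrase to the tag).
CANDIDATES = [
    ("basketball", "Sport", "narrower"),  # more specific than the tag
    ("activity", "Sport", "broader"),     # more abstract than the tag
    ("training", "Sport", "related"),
    ("training", "Pets", "related"),      # same phrase linked to a second tag
    ("dog food", "Pets", "related"),
]


def prune(candidates, max_tags=1, dropped_relations=("broader",)):
    """Drop links with unwanted relation types, then drop ambiguous phrases."""
    # Relation filter: tags are already general, so more abstract terms go.
    kept = [c for c in candidates if c[2] not in dropped_relations]

    # Ambiguity filter: count how many distinct tags each phrase connects to.
    tags_per_phrase = defaultdict(set)
    for phrase, tag, _relation in kept:
        tags_per_phrase[phrase].add(tag)
    return [c for c in kept if len(tags_per_phrase[c[0]]) <= max_tags]
```

On the sample data, "activity" is removed as a more abstract term and "training" is removed because it connects to both tags, leaving only the "basketball" and "dog food" links.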