geek-guides.com, elijah wright's weblog (circa 1993)

elke michlmayr: a case study on emergent semantics in communities

folksonomies - what are they?
comparison to taxonomies


multi-user web applications that provide a simple categorization system
- items: web pages, images, citations
tags = keywords ... can be chosen freely
every user has a web page with a list of own items
- sorted in reverse-chron order
- can be filtered by tags
public access to item collections and metadata

bottom-up approach to categorization
- no pre-defined model or hierarchy
- inconsistencies
-- synonyms, homonyms
-- singular and plural versions of a tag
-- keywords that consist of two terms (e.g. semantic web, semantic_web,
semanticweb)
- relies on aggregation of metadata
-- tag frequency distribution: tags most often used to annotate an item
categorize it best; no need to reach consensus
-- relationships between tags evolve from metadata

- amount of metadata crucial!
-- number of users, lifetime of folksonomy
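
a minimal sketch of the aggregation idea above, not taken from the talk: collapse spelling variants of a tag, then rank tags by how many users applied them to one item. the normalization rules and the names normalize and top_tags are assumptions made for illustration; the plural folding in particular is deliberately naive.

```python
from collections import Counter


def normalize(tag):
    """Collapse spelling variants of a tag (hypothetical rule set)."""
    t = tag.lower().replace("_", "").replace("-", "").replace(" ", "")
    # Naive plural folding; real stemming would need more care.
    if len(t) > 3 and t.endswith("s") and not t.endswith("ss"):
        t = t[:-1]
    return t


def top_tags(annotations, k=5):
    """annotations: one list of tags per user, all for the same item."""
    counts = Counter()
    for user_tags in annotations:
        # Count each normalized tag at most once per user.
        counts.update(sorted({normalize(t) for t in user_tags}))
    return [tag for tag, _ in counts.most_common(k)]


annotations = [
    ["semantic web", "blogs"],
    ["semantic_web", "blog", "rdf"],
    ["semanticweb", "RDF"],
]
print(top_tags(annotations, k=1))
```

with more users the ranking stabilizes without anyone agreeing on a vocabulary, which is the "no need to reach consensus" point.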

comparison of metadata
- lots of discussion about taxonomies vs. folksonomies, e.g. clay shirky 2005
- experiment: compare metadata from two big community projects that 
  categorize web pages to find out about the differences
- dmoz open directory project http://dmoz.org
-- taxonomy for web pages
-- ~ 600k concepts and about 5M instances
-- available in RDF format (two big files)
- social bookmarking site del.icio.us
-- no official numbers; ~100k users
-- download the web pages (simple html)


procedure
-- use only items from del.icio.us that were annotated by more than 
   100 users (=popular items)
-- download random popular items from del.icio.us
-- lookup if items are present in the dmoz collection (~25% of the items 
   were also present in dmoz)
-- 788 items with metadata from both sources (~50% of them are instances of 
   dmoz concept Top/Computers)

preparation of data
- preparations (much mangling of data here...)
- example:  Top/Science/Math/Publication -> publication math science
- how to compare?
-- avg dmoz hierarchy length: 4.67
-- avg del.icio.us tags per item: 24.59

comparison
-- lookup for each dmoz category (is it included in the del.icio.us tags?)
-- take top 1,3,5,10,15,all tags into account
--- top tag is included in ~50% of all cases
--- top 5 is the fairest comparison
--- top tags match more often than the less popular ones
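
a minimal sketch of the lookup described above, under assumptions: the talk did not say how a match was scored, so this uses exact string equality between the lowercased dmoz path terms and the k most frequent del.icio.us tags. dmoz_terms and match_rate are hypothetical names.

```python
def dmoz_terms(path):
    """'Top/Science/Math/Publication' -> ['publication', 'math', 'science']."""
    parts = path.split("/")[1:]  # drop the 'Top' root
    return [p.lower() for p in reversed(parts)]


def match_rate(path, ranked_tags, k=None):
    """Fraction of dmoz terms found among the k most frequent tags.

    ranked_tags is ordered from most to least frequent;
    k=None means all tags are taken into account.
    """
    terms = dmoz_terms(path)
    if not terms:
        return 0.0
    top = set(ranked_tags if k is None else ranked_tags[:k])
    return sum(t in top for t in terms) / len(terms)


tags = ["math", "science", "journal", "publication"]
for k in (1, 3, None):
    print(k, match_rate("Top/Science/Math/Publication", tags, k))
```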


folksonomies and peer to peer networks
- architectures are very different
-- folksonomies are centralized systems, aggregation is easy
-- p2p networks are distributed, aggregation is hard.
- user behavior is comparable
-- act autonomously
-- no central authority
-- want to share information
- data from a folksonomy can be used to model peers and content distribution
- ...


can interest-based locality be observed?
- interest based locality (defn)
- method
-- retrieve all users from del.icio.us that store a random bookmark
-- retrieve all their collections
- retrieved 4 test sets
-- 155, 248, 280, 551 users
-- distribution of items among users nearly equal in the test sets
-- avg.: 84% of items are not shared.
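
a minimal sketch of how the "not shared" share could be computed for one test set. it assumes "not shared" means an item is stored by exactly one user in the set, which the notes do not spell out; unshared_fraction is a hypothetical name.

```python
from collections import Counter


def unshared_fraction(collections):
    """collections: dict mapping user -> iterable of item URLs.

    Returns the share of distinct items stored by exactly one user.
    """
    owners = Counter()
    for items in collections.values():
        owners.update(set(items))  # one vote per user and item
    if not owners:
        return 0.0
    return sum(1 for n in owners.values() if n == 1) / len(owners)


test_set = {
    "user_a": {"x", "y"},
    "user_b": {"x", "z"},
    "user_c": {"w"},
}
print(unshared_fraction(test_set))
```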

related work

adam mathes, 2004: folksonomies - cooperative classification and
communication through shared metadata

clay shirky, 2005: ontology is overrated: categories, links, and tags

scott golder and bernardo huberman, 2005...


summary
- investigated the properties of metadata provided by a folksonomy
- compared it to dmoz data collection
- tried to find interest based locality
- paper contains some other experiments i did not have time to tell you
about
- open questions
-- is there a way to combine the bottom-up and top-down approach for
creating metadata
-- how much could the semantic web benefit from it?


audience questions:

have you thought about comparing the tags used at delicious to the meta tags
provided by page authors?  e.g. to detect page authors spamming search
engines

- mention of delicious director


Philippe Cudre-Mauroux: analyzing semantic interoperability in bioinformatic database networks

1) peer data management systems
2) semantic interoperability in the large
3) the sequence retrieval system
- degree distribution
- analysis of giant component
- weighted analysis
4) conclusions


beyond keyword search - searching semantically richer objects in large
scale heterogeneous networks  (semi-structured or structured data)

decentralized data integration
large scale information systems (e.g. WWW) VS distributed databases

data integration: LAV/GAV
- traditional database techniques (LAV/GAV) rely on centralized schemas to
integrate data sources.
- not applicable to our context
-- scale (upper ontologies?)
-- churn
-- autonomy
- how can we foster semantic interoperability in decentralized settings?

semantic interoperability
- from 'own schema' to 'known schema'
- extending semantic interoperability to ....

peer data management systems
- pairwise mappings
-- peer data management systems (PDMS)
- local mappings overcome global heterogeneity
-- interactive query rewriting


semantic mediation layer
- semantic mediation layer
over:
- overlay layer
over:
- physical layer

correlated/uncorrelated among the three layers.


schema-to-schema graph
- inter-organization of the different schemas used by the peers
-- logical model
-- directed
-- weighted
-- redundant


the semantic connectivity graph
- definition (semantic interoperability)
-
- 
-

observations
- theorem
- observation 1
- observation 2

semantic interop in the large
- how can we analyze semantic interop in large-scale pdms?
-

size of the giant component


the sequence retrieval system

why is srs interesting?
- applying our heuristics on a real large-scale corpus of interconnected
databases
-- more than 380 databanks
-- more than 500 (undirected) links
-- data used by professionals on a daily basis

crawling the srs schema-to-schema graph
- custom crawler
- as of may 2005 (ebi repository)
-- 388 nodes
-- 518 edges

- giant connected component (187 nodes)
- power law distribution of node degrees
- clustering coefficient = 0.32
- diameter = 9

results
- connectivity indicator ci = 25.4
-- super critical state
- size of the giant component
-- 0.47 (derived)
-- 0.48 (observed)
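
the observed value is consistent with the crawl figures above: 187 of 388 nodes is about 0.48. a minimal sketch of how that fraction could be measured on an undirected edge list follows; it is a plain breadth-first search, not the crawler or the derivation from the talk, and giant_component_fraction is a hypothetical name.

```python
from collections import deque


def giant_component_fraction(n_nodes, edges):
    """Size of the largest connected component divided by n_nodes.

    Nodes are numbered 0..n_nodes-1; edges are undirected (a, b) pairs.
    """
    adj = {v: set() for v in range(n_nodes)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            v = queue.popleft()
            size += 1
            for w in adj[v] - seen:
                seen.add(w)
                queue.append(w)
        best = max(best, size)
    return best / n_nodes if n_nodes else 0.0


print(giant_component_fraction(6, [(0, 1), (1, 2), (3, 4)]))
print(round(187 / 388, 2))
```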

graphs with same power-law degree distribution
- varying number of edges

analyzing weighted networks
- do we have a sufficient number of 'good' mappings
- introducing quality measures from the mappings
-- weights
-- attribute /schema level
-- cf. Chatty Web (WWW03)

- semantic query forwarding 
-- per hop forwarding behaviors
-- only forward if w_i >= tau
--- tau = 0 : flooding
--- tau = 1 : exact answers
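
a minimal sketch of the per-hop rule above, assuming each mapping carries a single weight in [0, 1] and that the check is made hop by hop only (composing weights along a path is listed as current work, so it is not modeled here). reachable and the toy mappings are hypothetical.

```python
from collections import deque


def reachable(mappings, source, tau):
    """mappings: dict (src, dst) -> weight in [0, 1], directed.

    Returns the set of schemas a query starting at `source` reaches
    when a peer only forwards over mappings with weight >= tau.
    """
    out = {}
    for (a, b), w in mappings.items():
        out.setdefault(a, []).append((b, w))
    reached, queue = {source}, deque([source])
    while queue:
        v = queue.popleft()
        for nxt, w in out.get(v, []):
            if w >= tau and nxt not in reached:
                reached.add(nxt)
                queue.append(nxt)
    return reached


mappings = {("a", "b"): 1.0, ("b", "c"): 0.6, ("a", "d"): 0.2}
for tau in (0.0, 0.5, 1.0):
    print(tau, sorted(reachable(mappings, "a", tau)))
```

with tau = 0 every mapping is followed (flooding); with tau = 1 only perfect mappings are, so fewer schemas are reached but the answers stay exact.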


weighted results
- same degree distribution (388 nodes)
- uniformly distributed weights between 0 and 1

conclusions
- analysing a real network of bioinformatic databases
-- accurate results (even for relatively small networks)
-- weighted / unweighted
- current works
-- compositions of weights along a path
-- semantic random walkers
-- public domain simulator
- future works
-- analysing other forwarding behaviors
-- implementation in a real pdms (self-organizing mappings)
--- gridvine


references
a necessary condition for semantic interoperability in the large
cudre-mauroux and karl aberer (ODBASE 2004)

gridvine: building internet-scale semantic overlay networks
ISWC2004

semantic overlay networks (tutorial)  VLDB 2005


complete reference list available at http://lsirpeople.epfl.ch/pcudre



heiner stuckenschmidt: social network analysis as a basis for partitioning
ontologies

motivations - the case for ontology partitioning

a partitioning method
- create a dependency graph
- strength of dependencies

ontologies are the backbone of semantic web applications
more and more large ontologies become available
maintenance and handling is becoming a problem

the case for partitioning:

distributed development and maintenance
selective publication and use of terminologies
manual inspection and validation
editing, visualization, and reasoning


an abstract view of the problem:

despite the standardization of languages there is no agreement on the
way ontologies are represented.

- all ontologies contain classes
- most organize them in a hierarchy
- many define relations between classes
- some provide formal definitions of classes

we concentrate on partitioning ontologies into disjoint sets of
concepts.  class hierarchy, relations, and definitions provide input
for the partitioning algorithm.


overview of the process:

1) create dependency graph

dependencies I: subclass relations
dependencies II: shared relations

2) determine strength of dependencies

relative strength networks
- compute relative strength [Burt, '92] of dependencies
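
a minimal sketch of one common reading of relative (proportional) strength: the weight of a dependency divided by the total weight of all dependencies touching the node it is viewed from. the notes do not give the formula used in the talk, so this definition, the function name, and the assumption of positive weights are all illustrative.

```python
def relative_strength(weights):
    """weights: dict (i, j) -> positive raw weight of an undirected dependency.

    Returns dict (i, j) -> strength of the i-j dependency as seen from i.
    """
    total = {}
    for (i, j), w in weights.items():
        total[i] = total.get(i, 0.0) + w
        total[j] = total.get(j, 0.0) + w
    strength = {}
    for (i, j), w in weights.items():
        strength[(i, j)] = w / total[i]
        strength[(j, i)] = w / total[j]
    return strength


# A class with two subclasses: each link is half of the parent's
# dependencies but all of the child's.
s = relative_strength({("a", "b"): 1, ("a", "c"): 1})
print(s[("a", "b")], s[("b", "a")])
```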

3) compute partitions

computing islands 

- we use maximal line islands [Batagelj 2000] to compute partitions in
  the dependency graph [a set of vertices is a line island in a network if
  and only if it induces a connected subgraph and the lines inside the
  island are more strongly related to each other than to the neighboring
  vertices. in particular there is a maximal spanning tree T over nodes
  in the island such that....]

- the minimal weight in the spanning tree is called the 'height' of an
island.

- understanding islands
- result for the example
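
a minimal sketch of the 'height' definition above: build a maximum spanning tree over a candidate set of vertices (Kruskal with a small union-find) and return its smallest edge weight. it only computes the height of a given vertex set; it does not find the islands themselves, and island_height is a hypothetical name.

```python
def island_height(vertices, weights):
    """Minimal edge weight in a maximum spanning tree over `vertices`.

    weights: dict (a, b) -> weight of an undirected line.
    Returns None if the set has fewer than two vertices or is not connected.
    """
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    inside = [(w, a, b) for (a, b), w in weights.items()
              if a in parent and b in parent]
    tree = []
    # Kruskal, heaviest lines first, yields a maximum spanning tree.
    for w, a, b in sorted(inside, key=lambda e: e[0], reverse=True):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            tree.append(w)
    if not tree or len(tree) != len(parent) - 1:
        return None
    return min(tree)


lines = {("a", "b"): 0.9, ("b", "c"): 0.5, ("a", "c"): 0.2}
print(island_height({"a", "b", "c"}, lines))
```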

4) improve partitioning

improving partitions

- islands are often very small (2-4 nodes) resulting in unwanted
partitions of the ontology
- observation: small islands almost always have a large height value
(1 or 0.5)
- approach: merge partitions with a height of 1 or 0.5 with
neighboring partitions, based on strength of connection: ...

ontology partitioning tool
- features:
-- owl and kif import
-- selection of criteria 
-- computation of line islands
-- graph export
-- precision and recall measurement

an experiment
- data: acm topic hierarchy
- partitioning method:
-- relations: hierarchy
-- maximal size: 100
-- merging threshold: 0.2

- evaluation:
-- topics on dutch cs department home pages
-- compared with root nodes of determined modules
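
a minimal sketch of a precision and recall measurement over two term sets, e.g. module root nodes against topics taken from home pages. the notes do not say how terms were matched in the experiment; exact set membership and the sample terms below are assumptions.

```python
def precision_recall(predicted, reference):
    """Precision and recall of `predicted` terms against `reference` terms."""
    predicted, reference = set(predicted), set(reference)
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical term sets, for illustration only.
roots = {"databases", "networks", "graphics"}
topics = {"databases", "networks", "ai", "theory"}
print(precision_recall(roots, topics))
```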

- results
-- terms do correspond to major areas in CS
-- quite some overlap with the extracted terms
-- further experiments needed



