elke michlmayr: a case study on emergent semantics in communities
folksonomies - what are they?
comparison to taxonomies
multi-user web applications that provide a simple categorization system
- items: web pages, images, citations
tags = keywords ... can be chosen freely
every user has a web page with a list of own items
- sorted in reverse-chron order
- can be filtered by tags
public access to item collections and metadata
bottom-up approach to categorization
- no pre-defined model or hierarchy
- inconsistencies
-- synonyms, homonyms
-- singular and plural versions of a tag
-- keywords that consist of two terms (e.g. semantic web, semantic_web,
semanticweb)
-- relies on aggregation of metadata
-- tag frequency distribution: tags most often used to annotate an item
categorize it best; no need to reach consensus
-- relationships between tags evolve from metadata
- amount of metadata crucial!
-- number of users, lifetime of folksonomy
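a minimal sketch of how the tag inconsistencies and the aggregation step above might be handled. the normalization rules and the names `normalize_tag` and `top_tags` are assumptions for illustration, not something shown in the talk; synonyms and homonyms are not addressed at all.

```python
from collections import Counter

def normalize_tag(tag):
    """Collapse spelling variants of a tag (assumed rules, not from the talk)."""
    t = tag.lower()
    # treat "semantic web", "semantic_web", "semantic-web" and "semanticweb" alike
    for sep in (" ", "_", "-"):
        t = t.replace(sep, "")
    # very naive plural folding: "blogs" -> "blog"
    if len(t) > 3 and t.endswith("s") and not t.endswith("ss"):
        t = t[:-1]
    return t

def top_tags(annotations, k=5):
    """annotations: one list of tags per user who bookmarked the item.

    Returns the k normalized tags used by the most users.
    """
    counts = Counter()
    for user_tags in annotations:
        # count each normalized tag at most once per user
        counts.update({normalize_tag(t) for t in user_tags})
    return [tag for tag, _ in counts.most_common(k)]

annotations = [
    ["semantic web", "rdf"],
    ["semantic_web", "blogs"],
    ["semanticweb", "blog", "RDF"],
]
print(top_tags(annotations, 1))
```

no consensus step is needed here: the ranking falls out of counting alone, which is why the amount of metadata matters so much.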
comparison of metadata
- lots of discussion about taxonomies vs. folksonomies, e.g. clay shirky 2005
- experiment: compare metadata from two big community projects that
categorize web pages to find out about the differences
- dmoz open directory project http://dmoz.org
-- taxonomy for web pages
-- ~ 600k concepts and about 5M instances
-- available in RDF format (two big files)
- social bookmarking site del.icio.us
-- no official numbers; ~100k users
-- download the web pages (simple html)
procedure
-- use only items from del.icio.us that were annotated by more than
100 users (=popular items)
-- download random popular items from del.icio.us
-- lookup if items are present in the dmoz collection (~25% of the items
were also present in dmoz)
-- 788 items with metadata from both sources (~50% of them are instances of
dmoz concept Top/Computers)
preparation of data
- preparations (much mangling of data here...)
- example: Top/Science/Math/Publication -> publication math science
- how to compare?
-- avg dmoz hierarchy length: 4.67
-- avg del.icio.us tags per item: 24.59
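a minimal sketch of the path-to-keywords step from the example above. it only covers what the example shows (drop the `Top` root, lowercase, most specific concept first); the actual preparation involved more mangling than this, and the function name `dmoz_path_to_keywords` is hypothetical.

```python
def dmoz_path_to_keywords(path):
    """Turn a dmoz category path into a flat keyword list, most specific first."""
    parts = [p for p in path.split("/") if p]
    # the root concept "Top" carries no information about the page
    if parts and parts[0] == "Top":
        parts = parts[1:]
    return [p.lower() for p in reversed(parts)]

print(dmoz_path_to_keywords("Top/Science/Math/Publication"))
```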
comparison
-- lookup for each dmoz category (is it included in the del.icio.us tags?)
-- take top 1,3,5,10,15,all tags into account
--- top tag is included in ~50% of all cases
--- top 5 is the fairest comparison
--- top tags match more often than the less popular ones
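a rough sketch of the lookup described above, assuming the comparison asks, for each dmoz keyword of an item, whether it appears among the item's k most-used del.icio.us tags. the notes leave the exact direction of the comparison open, so this reading and the name `match_rate` are assumptions.

```python
def match_rate(items, k=None):
    """items: list of (dmoz_keywords, ranked_tags) pairs, ranked_tags most-used first.

    Returns the fraction of dmoz keywords found among the top-k tags
    (k=None takes all tags into account).
    """
    hits = 0
    total = 0
    for keywords, ranked_tags in items:
        considered = set(ranked_tags if k is None else ranked_tags[:k])
        for kw in keywords:
            total += 1
            if kw in considered:
                hits += 1
    return hits / total if total else 0.0

items = [
    (["publication", "math", "science"], ["math", "science", "reference", "paper"]),
    (["linux"], ["ubuntu", "linux"]),
]
for k in (1, 2, None):
    print(k, match_rate(items, k))
```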
folksonomies and peer to peer networks
- architectures are very different
-- folksonomies are centralized systems, aggregation is easy
-- p2p networks are distributed, aggregation is hard.
- user behavior is comparable
-- act autonomously
-- no central authority
-- want to share information
- data from a folksonomy can be used to model peers and content distribution
- ...
can interest-based locality be observed?
- interest based locality (defn)
- method
-- retrieve all users from del.icio.us that store a random bookmark
-- retrieve all their collections
- retrieved 4 test sets
-- 155, 248, 280, 551 users
-- distribution of items among users nearly equal in the test sets
-- avg.: 84% of items are not shared.
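a minimal sketch of how the share of unshared items in a test set could be computed, assuming "not shared" means an item is stored by exactly one user in that set. the data layout and the name `unshared_fraction` are hypothetical.

```python
from collections import Counter

def unshared_fraction(collections):
    """collections: dict mapping user -> set of bookmarked URLs.

    Returns the fraction of distinct items that only one user stores.
    """
    owners = Counter()
    for items in collections.values():
        owners.update(set(items))
    if not owners:
        return 0.0
    unshared = sum(1 for n in owners.values() if n == 1)
    return unshared / len(owners)

test_set = {
    "a": {"x", "y", "z"},
    "b": {"x", "w"},
    "c": {"v"},
}
print(unshared_fraction(test_set))
```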
related work
adam mathes, 2004: folksonomies - cooperative classification and
communication through shared metadata
clay shirky, 2005: ontology is overrated: categories, links, and tags
scott golder and bernardo huberman, 2005...
summary
- investigated the properties of metadata provided by a folksonomy
- compared it to dmoz data collection
- tried to find interest based locality
- paper contains some other experiments i did not have time to tell you
about
- open questions
-- is there a way to combine the bottom-up and top-down approach for
creating metadata?
-- how much could the semantic web benefit from it?
audience questions:
have you thought about comparing the tags used at delicious to the meta tags
provided by page authors? e.g. to detect page authors who spam search
engines
- mention of delicious director
Philippe Cudre-Mauroux: analyzing semantic interoperability in bioinformatic database networks
1) peer data management systems
2) semantic interoperability in the large
3) the sequence retrieval system
- degree distribution
- analysis of giant component
- weighted analysis
4) conclusions
beyond keyword search - searching semantically richer objects in large
scale heterogeneous networks (semi-structured or structured data)
decentralized data integration
large scale information systems (e.g. WWW) VS distributed databases
data integration: LAV/GAV
- traditional database techniques (LAV/GAV) rely on centralized schemas to
integrate data sources.
- not applicable to our context
-- scale (upper ontologies?)
-- churn
-- autonomy
- how can we foster semantic interoperability in decentralized settings?
semantic interoperability
- from 'own schema' to 'known schema'
- extending semantic interoperability to ....
peer data management systems
- pairwise mappings
-- peer data management systems (PDMS)
- local mappings overcome global heterogeneity
-- interactive query rewriting
semantic mediation layer
- semantic mediation layer
over:
- overlay layer
over:
- physical layer
correlated/uncorrelated among the three layers.
schema-to-schema graph
- inter-organization of the different schemas used by the peers
-- logical model
-- directed
-- weighted
-- redundant
the semantic connectivity graph
- definition (semantic interoperability)
-
-
-
observations
- theorem
- observation 1
- observation 2
semantic interop in the large
- how can we analyze semantic interop in large-scale pdms?
-
size of the giant component
the sequence retrieval system
why is srs interesting?
- applying our heuristics on a real large-scale corpus of interconnected
databases
-- more than 380 databanks
-- more than 500 (undirected) links
-- data used by professionals on a daily basis
crawling the srs schema-to-schema graph
- custom crawler
- as of may 2005 (ebi repository)
-- 388 nodes
-- 518 edges
- giant connected component (187 nodes)
- power law distribution of node degrees
- clustering coefficient = 0.32
- diameter = 9
results
- connectivity indicator ci = 25.4
-- super critical state
- size of the giant component
-- 0.47 (derived)
-- 0.48 (observed)
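the observed value can be checked directly on the crawled graph: 187 of 388 nodes is about 0.48. below is a minimal sketch of measuring the relative size of the giant component with a breadth-first search over an undirected edge list. it does not reproduce the derivation of the connectivity indicator, and the function name `giant_component_fraction` is hypothetical.

```python
from collections import defaultdict, deque

def giant_component_fraction(n_nodes, edges):
    """Relative size of the largest connected component of an undirected graph.

    n_nodes counts all nodes, including isolated ones that appear in no edge.
    """
    if n_nodes <= 0:
        return 0.0
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen = set()
    largest = 0
    for start in list(adj):
        if start in seen:
            continue
        # breadth-first search over one component
        queue = deque([start])
        seen.add(start)
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        largest = max(largest, size)
    # isolated nodes form components of size 1
    if n_nodes > len(adj):
        largest = max(largest, 1)
    return largest / n_nodes

print(giant_component_fraction(7, [(1, 2), (2, 3), (4, 5)]))
```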
graphs with same power-law degree distribution
- varying number of edges
analyzing weighted networks
- do we have a sufficient number of 'good' mappings
- introducing quality measures from the mappings
-- weights
-- attribute /schema level
-- cf. Chatty Web (WWW03)
- semantic query forwarding
-- per hop forwarding behaviors
-- only forward if w_i >= tau
--- tau = 0 : flooding
--- tau = 1 : exact answers
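a minimal sketch of the per-hop forwarding rule, assuming each mapping carries a single weight in [0, 1] and a query is passed on only over mappings whose weight is at least tau. the graph layout and the name `reachable_schemas` are hypothetical, and composing weights along a path is left out.

```python
from collections import deque

def reachable_schemas(mappings, source, tau):
    """mappings: dict schema -> list of (target_schema, weight) pairs.

    Returns the set of schemas a query issued at `source` reaches when
    every hop requires weight >= tau.
    """
    reached = {source}
    queue = deque([source])
    while queue:
        schema = queue.popleft()
        for target, weight in mappings.get(schema, []):
            if weight >= tau and target not in reached:
                reached.add(target)
                queue.append(target)
    return reached

mappings = {
    "A": [("B", 1.0), ("C", 0.4)],
    "B": [("D", 0.7)],
    "C": [("D", 1.0)],
}
for tau in (0.0, 0.5, 1.0):
    print(tau, sorted(reachable_schemas(mappings, "A", tau)))
```

with tau = 0 every mapping is followed (flooding); with tau = 1 only perfect mappings are followed, so fewer schemas are reached but the answers are exact.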
weighted results
- same degree distribution (388 nodes)
- uniformly distributed weights between 0 and 1
conclusions
- analysing a real network of bioinformatic databases
-- accurate results (even for relatively small networks)
-- weighted / unweighted
- current works
-- compositions of weights along a path
-- semantic random walkers
-- public domain simulator
- future works
-- analysing other forwarding behaviors
-- implementation in a real pdms (self-organizing mappings)
--- gridvine
references
a necessary condition for semantic interoperability in the large
cudre-mauroux and karl aberer (ODBASE 2004)
gridvine: building internet-scale semantic overlay networks
ISWC2004
semantic overlay networks (tutorial) VLDB 2005
complete reference list available at http://lsirpeople.epfl.ch/pcudre
heiner stuckenschmidt: social network analysis as a basis for partitioning
ontologies
motivations - the case for ontology partitioning
a partitioning method
- create a dependency graph
- strength of dependencies
ontologies are the backbone of semantic web applications
more and more large ontologies become available
maintenance and handling is becoming a problem
the case for partitioning:
distributed development and maintenance
selective publication and use of terminologies
manual inspection and validation
editing, visualization, and reasoning
an abstract view of the problem:
despite the standardization of languages there is no agreement on the
way ontologies are represented.
- all ontologies contain classes
- most organize them in a hierarchy
- many define relations between classes
- some provide formal definitions of classes
we concentrate on partitioning ontologies into disjoint sets of
concepts. class hierarchy, relations, and definitions provide input
for the partitioning algorithm.
overview of the process:
1) create dependency graph
dependencies I: subclass relations
dependencies II: shared relations
2) determine strength of dependencies
relative strength networks
- compute relative strength [Burt, '92] of dependencies
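a minimal sketch of one common reading of the proportional strength measure: the weight of the connection between i and j, in both directions, divided by the total weight of all connections that i is involved in, p_ij = (a_ij + a_ji) / sum_k (a_ik + a_ki). whether the talk used exactly this form is an assumption, and the name `proportional_strength` is hypothetical.

```python
def proportional_strength(weights):
    """weights: dict (i, j) -> weight a_ij of the directed dependency i -> j.

    Returns dict (i, j) -> p_ij = (a_ij + a_ji) / sum_k (a_ik + a_ki).
    """
    nodes = {n for edge in weights for n in edge}

    def both(i, j):
        # combined weight of the connection in both directions
        return weights.get((i, j), 0.0) + weights.get((j, i), 0.0)

    strength = {}
    for i in nodes:
        total = sum(both(i, k) for k in nodes if k != i)
        for j in nodes:
            if j != i and both(i, j) > 0:
                strength[(i, j)] = both(i, j) / total
    return strength

weights = {("a", "b"): 1.0, ("a", "c"): 1.0, ("d", "a"): 2.0}
p = proportional_strength(weights)
print(p[("a", "b")], p[("a", "d")], p[("b", "a")])
```

the measure is not symmetric: a node with a single neighbor gets strength 1 towards it, while the neighbor's strength back is diluted by its other connections.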
3) compute partitions
computing islands
- we use maximal line islands [Batagelj 2000] to compute partitions in
the dependency graph [a set of vertices is a line island in a network if
and only if it induces a connected subgraph and the lines inside the
island are more strongly related to each other than to the neighboring
vertices. in particular there is a maximal spanning tree T over the nodes
in the island such that....]
- the minimal weight in the spanning tree is called the 'height' of an
island.
- understanding islands
- result for the example
4) improve partitioning
improving partitions
- islands are often very small (2-4 nodes) resulting in unwanted
partitions of the ontology
- observation: small islands almost always have a large height value
(1 or 0.5)
- approach: merge partitions with a height of 1 or 0.5 with
neighboring partitions, based on strength of connection: ...
ontology partitioning tool
- features:
-- owl and kif import
-- selection of criteria
-- computation of line islands
-- graph export
-- precision and recall measurement
an experiment
- data: acm topic hierarchy
- partitioning method:
-- relations: hierarchy
-- maximal size: 100
-- merging threshold: 0.2
- evaluation:
-- topics on dutch cs department home pages
-- compared with root nodes of determined modules
- results
-- terms do correspond to major areas in CS
-- quite some overlap with the extracted terms
-- further experiments needed