...

Data Integration

by user

on
Category: Documents
27

views

Report

Comments

Transcript

Data Integration
Answer Set Programming
and Data Integration Systems
S. Costantini
Università degli Studi di L’Aquila
[email protected]
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
2
A Traditional Database Architecture
Query
(SQL)
Answer
(relation)
Database Manager
(DBMS)
-Storage mgmt
-Query processing
-View management
-(Transaction processing)
S. Costantini / Tutorial Data Integration CILC'05
Database
(relational)
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
An Online Shopper’s Information Integration Problem
El Cheapo: “Where can I get the cheapest copy (including shipping cost) of
Wittgenstein’s Tractatus Logicus-Philosophicus within a week?”
addall.com
?
Information
Integration
amazon.com
barnes&noble.com
S. Costantini / Tutorial Data Integration CILC'05
half.com
“One-World”
Mediation
A1books.com
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
3
4
A Home Buyer’s Information Integration Problem
What houses for sale under $500k have at least 2 bathrooms, 2 bedrooms,
a nearby school ranking in the upper third, in a neighborhood
with below-average crime rate and diverse population?
?
Information
Integration
Realtor
Crime Stats
S. Costantini / Tutorial Data Integration CILC'05
School Rankings
“Multiple-Worlds”
Mediation
Demographics
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
A Geoscientist’s Information Integration
Problem
What is the distribution and U/ Pb zircon ages of A-type plutons in VA?
How about their 3-D geometry ?
How does it relate to host rock structures?
?
Information
Integration
Geologic Map
(Virginia)
GeoChemical
GeoPhysical
(gravity contours)
S. Costantini / Tutorial Data Integration CILC'05
“Complex
Multiple-Worlds”
Mediation
GeoChronologic
(Concordia)
Foliation Map
(structure DB)
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
5
6
A Neuroscientist’s Information Integration Problem
What is the cerebellar distribution of rat proteins with more than 70%
homology with human NCS-1? Any structure specificity?
How about other rodents?
?
Information
Integration
protein localization
sequence info
(NCMIR)
(CaPROT)
S. Costantini / Tutorial Data Integration CILC'05
“Complex
Multiple-Worlds”
Mediation
morphometry
neurotransmission
(SYNAPSE)
(SENSELAB)
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
7
What is a Data Integration System?
A system providing:
›
›
›
›
›
›
›
Uniform (same query interface to all sources)
Access to (queries; eventually updates too)
Multiple (we want many, but 2 is hard too)
Autonomous (DBA doesn’t report to you)
Heterogeneous (data models are different)
Structured (or at least semi-structured)
Data Sources (not only databases).
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
8
Past
User must know which sites have relevant info
User must go to each one in turn
Slow: Sequential access takes time
Confusing: Each site has a different interface
User must manually integrate information
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
9
Perspective
» “Intelligent” agents such that
• User says what she wants
• Agent decides how & when to achieve it
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
10
References
» Data Integration System Group at the Dipartimento di Informatica e
Sistemistica, Univ. Of Roma “La Sapienza” (Prof. Maurizio Lenzerini)
» INFOMIX: information integration, database theory, computational logic and
deductive databases, database implementation and optimization, and logicbased software agent technology.
University of Calabria (Italy), Vienna University of Technology (Austria),
University of Rome “La Sapienza” (Italy), Rodan Systems (Poland).
» Università degli Studi di Roma "Tor Vergata"
Dipartimento di informatica, sistemi e produzione
(Prof. Maria Teresa Pazienza)
» Database Group of Microsoft Research (Phil Bernstein)
» Answer Set Programming in Data Integration Systems, Leopoldo Bertossi,
Loreto Bravo (Carleton University, Ottawa, Canada) Jan Chomichi (University
of Buffalo, USA)
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
11
Integration vs. Distributed DBMS
No common schema
› Sources with heterogeneous schemas
› Semi-structured sources
Legacy Sources
› Not necessarily relational
› Access/process limitations
Autonomous sources
› Uncontrolled source content overlap
Unpredictable run-time behavior
› Makes query execution hard
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
12
Building a Data Integration System
Create a middleware “mediator” or “data
integration system” over the sources
› Can be virtual or a data wharehouse
› Presents a uniform query interface and schema
› Abstracts away multitude of sources; consults
them for relevant data
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
13
Data Warehouse Architecture
User queries
OLAP / Decision support/
Data mining
(Relational?) database (warehouse)
Data extraction,
cleaning/
scrubbing
Data extraction
programs
Data
source
S. Costantini / Tutorial Data Integration CILC'05
Data
source
Data
source
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
14
OLTP vs. OLAP
» OLTP: On-Line Transaction
Processing
› Many short transactions
(queries + updates)
› Examples:
- Update account balance
- Enroll in course
- Add book to shopping cart
› Queries touch small amounts of
data (one record or a few
records)
› Updates are frequent
› Concurrency is biggest
performance concern
S. Costantini / Tutorial Data Integration CILC'05
» OLAP: On-Line Analytical
Processing
› Long transactions, complex
queries
› Examples:
- Report total sales for each
department in each month
- Identify top-selling books
- Count classes with fewer than
10 students
› Queries touch large amounts of
data
› Updates are infrequent
› Individual queries can require
lots of resources
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
15
Data Warehouse
Enterprise
“Database”
Customers
Orders
Transactions
Etc…
Vendors
Etc…
Data Miners:
• “Farmers” – they know
• “Explorers” - unpredictable
Copied,
organized
summarized
Data
Warehouse
S. Costantini / Tutorial Data Integration CILC'05
Data Mining
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
16
Architecture for Virtual Integration
Leave the data in the sources.
When a query comes in:
1)
2)
3)
4)
Determine which sources are relevant to query.
Break query into sub-queries for each source.
Get answers from sources
Combine.
Data is fresh.
Challenge: performance.
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
17
Virtual Integration Architecture
User queries
Mediated schema
Reformulator
Mediator:
Optimizer
Execution engine
Data source
catalog
wrapper
wrapper
wrapper
Data
source
Data
source
Data
source
Sources can be: relational, hierarchical (IMS), structured files, web sites.
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
18
Mediated Integration
» User poses queries in terms of global schema
» Relationship between global schema and local, source
schemas specied in the mediator, as source descriptions
» Mediator is responsible of solving problems of data:
› redundancy
› complementarity
› inconsistency: sources, independently, may be
consistent, but together, possibly not. E.g., Same ID
card number may be assigned to different people in
different sources
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
19
XML-Based Mediator Architecture
USER/Client
Query Q ( G (S1,..., Sk) )
Integrated Global
(XML) View G
Integrated View
Definition
MEDIATOR
G(..) S1(..)…Sk(..)
(XML) Queries & Results
(XML) View
(XML) View
(XML) View
Wrapper
Wrapper
Wrapper
S1
S2
Sk
S. Costantini / Tutorial Data Integration CILC'05
wrappers implemented
as web services
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
20
Wrapper Programs
» Task
› to communicate with the data sources and do format
translations.
» Built w.r.t. a specific source.
» Can sit either at the source or mediator.
» Often hard to build
› (very little science).
» Can be “intelligent”
› perform source-specific optimizations.
» Exploit ontologies to relate analogous domains
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
21
Example
<b> Introduction to DB </b>
Transform:
<i> Phil Bernstein </i>
<i> Eric Newcomer </i>
Addison Wesley, 1999
into:
<book>
<title> Introduction to DB </title>
<author> Phil Bernstein </author>
<author> Eric Newcomer </author>
<publisher> Addison Wesley </publisher>
<year> 1999 </year>
</book>
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
22
Semantic Mappings between Schemas
» Source schemas = XML DTDs
house
address
contact-info
agent-name
num-baths
agent-phone
1-1 mapping
non 1-1 mapping
house
location
contact
name
S. Costantini / Tutorial Data Integration CILC'05
full-baths
half-baths
phone
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
23
Accessing Sources via Wrappers
SELECT address, tel
FROM Restaurant
WHERE cuisine =
“chinese”
Chinois, 2720 Main St, 310-777-9876
Peking Star, 1 Broad St, 213-999-7676
.....
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
24
Wrapper Generation
» Approaches to automating the process
› Exploit format information (structure, HTML etc. )
› Template based approaches
› Machine learning techniques
(e.g., shallow NLP where NLP = Natural Language
Processing)
» LIXTO: semi-automated wrapper generation
› Elog: prolog + patterns
› HTML  … operator … Elog  XML
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
25
“Semantic Web”
» The “semantic-web” initiative attempts to
automate schema mapping
› Idea: Allow pages to write logical axioms
relating their vocabulary (tags) to other
external tags
› Support automatic inference of relations
between source and mediator schema using
these rules
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
26
Query Model in Virtual Integration
» User formulates query in terms of his/her
ontology on the mediated (or “global”)
schema
» System reformulates queries in terms of
sub-queries for each source (“local”
schema)
» Structure of the query model should be
more intuitive for the user
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
27
Reformulation Problem
» Given:
› A query Q posed over the mediated schema
› Descriptions of the data sources
» Find:
› A query Q’ over the data source relations, such
that:
- Q’ provides only correct answers to Q, and
- Q’ provides all possible answers from to Q
given the sources.
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
28
Deductive Databases
» Tables viewed as predicates.
» Ops on tables expressed as “datalog” rules
› (Horn clauses, without function symbols)
Enames(Name) :- Employe(Name, SSN) [Projection]
Wealthy-Employee(Name):Employee(Name,SSN), Salary(SSN,Money),Money> 10
[Selection]
Ed(Name, Dname):Employee(Name, SSN), E_Dependents(SSN, Dname)
[Join]
ERelated(Name,Dname) :- Ed(Name,Dname)
ERelated(Name,Dname) :- Ed(Name,D1), ERelated(D1,D2)
[Recursion]
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
30
Approaches to Specifying
Schema Descriptions
» Global-as-View: express the mediated
schema relations as a set of views over the
data source relations
» Local-as-View: express the source relations
as views over the mediated schema.
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
31
GaV: an example
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
32
Approaches for Mapping
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
33
Global-as-view vs. Local-as-view
» Global-as-view approach
› Each item in Global schema/ontology as a view (query)
over source schemas/ontologies
› query(G) = query(f(S1, S2, …, Sn))
» Local-as-view approach
› Each source as a view/query over global
schema/ontology
› query(G) = query(f1-1 (S1), f2-1(S2), …, fn-1 (Sn))
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
34
GAV
vs.
» Not modular
› Addition of new sources
changes the schema mapping
» Can be awkward to write
mediated schema without loss
of information
» Query reformulation easy
› Often reduces to view
unfolding (polynomial)
› Can build hierarchies of
mediated schemas
» Best when
› Few, stable, data sources
› well-known to the mediator
(e.g. corporate integration)
S. Costantini / Tutorial Data Integration CILC'05
LAV
» Modular--adding new sources is
easy
» Very flexible--power of the
entire query language available
to describe sources
» Reformulation is hard
› Involves answering queries only
using views (can be
intractable)
» Best when
› Many, relatively unknown data
sources
› possibility of addition/deletion
of sources
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
Data Integration: Global-as-View in
Answer Set Programming
S. Costantini
Università degli Studi di L’Aquila
[email protected]
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
36
Computing Answers
in Data Integration Systems
» Answer set programming (ASP) gives a semantics
to datalog (logic) programs with negation (and
possibly disjunction in the heads)
» ASP-based specification of a data integration
system
» ASP-based specification of extensions (plausible
answers, repairs)
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
37
Computing Answers
in Data Integration Systems
Methodology works for first-order queries (and
Datalog extensions), and many ICs
› LAV: Bravo and Bertossi, IJCAI’03
› GAV: Costantini and Formisano, ASP’03
(experiments and a working program)
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
38
GaV in ASP
» Instead of unfolding w.r.t. each query,
helper model so as to get a general
inference engine
» Easily express integrity constraints
» Cope with incomplete information (also
resulting form adding new sources)
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
39
GaV: an example
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
Front-end:
Global Schema through a Helper Model
person(X):- entity(1,X). % meta-concepts + naming
organization(X):- entity(2,X).
member(X,Y):- relat(1,X,Y).
student(X):- entity(3,X).
university(X):- entity(4,X).
enrolled(X,Y):- relat(2,X,Y).
age(X,Y):- attr(1,X,Y).
% Constraints on student and enrolled
enrolledE(X):- student(X),university(Y),enrolled(X,Y).
:- not student(X),enrolled(X,Y).
:- student(X),not enrolledE(X).
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
40
41
Back-end: Mapping to the Source Schemas
entity(1,X):- s(1,X).
entity(2,X):- s(2,X).
entity(4,X):- s(5,X).
% student is described implicitly as the domain of
an unknown relationship
entity(3,X):- s(3,X,Y).
% student is described implicitly as the domain of
one of the sources of enrolled
entity(3,X):- s(4,X,Z).
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
42
Back-end: Mapping to the Source Schemas
% Relationships
relat(1,X,Y):- s(7,X,Z),s(8,Z,Y).
relat(2,X,Y):- s(4,X,Y).
relat(2,X,Y):-s(9,X,Y).
attr(1,X,Y):- s(3,X,Y).
attr(1,X,Y):- s(6,X,Y,Z).
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
43
Helper Model: Relational Inference Engine
» Meta-rules for is_a chaining:
relat(M,X,Y):- schema(N,E1,E2),
entity(E1,X), entity(E2,Y),
is_a_r(N,M), relat(N,X,Y).
entity(M,X):- is_a_e(N,M), entity(N,X).
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
44
GaV: an example
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
45
Dealing with Incomplete Information
» Assume to add a new source for Enrolled, that in
the back-end becomes, e.g.,:
relat(2,X,Y):- s(4,X,Y).
relat(2,X,Y):- s(9,X,Y).
» No connection to student: but, one who is
enrolled to a University should be a student
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
Dealing with Incomplete Information
(cont. 1): ASP constraints
% Constraints on student and enrolled
enrolledE(X):- university(Y),enrolled(X,Y).
% Student mandatory in Enrolled
:- student(X),not enrolledE(X).
% One if enrolled is a student
:- not student(X),enrolled(X,Y).
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
46
Dealing with Incomplete Information
(cont. 2): Generating a Search Space
% specifies possibly incomplete entities
... student(X):- entity(3,X).
incomplete(3). % Student
% Any individual might occur in
% an incomplete entity
entity(N,X):- incomplete(N), not noentity(N,X).
noentity(N,X):- incomplete(N), not entity(N,X).
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
47
Dealing with Incomplete Information
(cont. 2): Pruned Search space
» One who is enrolled is a student
› Instance of a well-known problem in GaV
› Not solvable with simple unfolding
› Simply (though not efficiently) solved in
ASP
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
48
49
Dealing with Incomplete Information
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
Dealing with Incomplete Information
(cont. 1)
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
50
Dealing with Incomplete Information
(cont. 2)
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
51
Dealing with Incomplete Information
(cont. 3)
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
52
Dealing with Incomplete Information
(cont. 4)
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
53
54
Observations and Issues
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
55
The End, thank
you for your
attention!
S. Costantini / Tutorial Data Integration CILC'05
Dipartimento di Informatica
Università degli Studi di L’Aquila
http://www.di.univaq.it/
Fly UP