Comments
Description
Transcript
Data Integration
Answer Set Programming and Data Integration Systems S. Costantini Università degli Studi di L’Aquila [email protected] Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 2 A Traditional Database Architecture Query (SQL) Answer (relation) Database Manager (DBMS) -Storage mgmt -Query processing -View management -(Transaction processing) S. Costantini / Tutorial Data Integration CILC'05 Database (relational) Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ An Online Shopper’s Information Integration Problem El Cheapo: “Where can I get the cheapest copy (including shipping cost) of Wittgenstein’s Tractatus Logicus-Philosophicus within a week?” addall.com ? Information Integration amazon.com barnes&noble.com S. Costantini / Tutorial Data Integration CILC'05 half.com “One-World” Mediation A1books.com Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 3 4 A Home Buyer’s Information Integration Problem What houses for sale under $500k have at least 2 bathrooms, 2 bedrooms, a nearby school ranking in the upper third, in a neighborhood with below-average crime rate and diverse population? ? Information Integration Realtor Crime Stats S. Costantini / Tutorial Data Integration CILC'05 School Rankings “Multiple-Worlds” Mediation Demographics Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ A Geoscientist’s Information Integration Problem What is the distribution and U/ Pb zircon ages of A-type plutons in VA? How about their 3-D geometry ? How does it relate to host rock structures? ? Information Integration Geologic Map (Virginia) GeoChemical GeoPhysical (gravity contours) S. Costantini / Tutorial Data Integration CILC'05 “Complex Multiple-Worlds” Mediation GeoChronologic (Concordia) Foliation Map (structure DB) Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 5 6 A Neuroscientist’s Information Integration Problem What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents? ? Information Integration protein localization sequence info (NCMIR) (CaPROT) S. Costantini / Tutorial Data Integration CILC'05 “Complex Multiple-Worlds” Mediation morphometry neurotransmission (SYNAPSE) (SENSELAB) Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 7 What is a Data Integration System? A system providing: › › › › › › › Uniform (same query interface to all sources) Access to (queries; eventually updates too) Multiple (we want many, but 2 is hard too) Autonomous (DBA doesn’t report to you) Heterogeneous (data models are different) Structured (or at least semi-structured) Data Sources (not only databases). S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 8 Past User must know which sites have relevant info User must go to each one in turn Slow: Sequential access takes time Confusing: Each site has a different interface User must manually integrate information S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 9 Perspective » “Intelligent” agents such that • User says what she wants • Agent decides how & when to achieve it S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 10 References » Data Integration System Group at the Dipartimento di Informatica e Sistemistica, Univ. Of Roma “La Sapienza” (Prof. Maurizio Lenzerini) » INFOMIX: information integration, database theory, computational logic and deductive databases, database implementation and optimization, and logicbased software agent technology. University of Calabria (Italy), Vienna University of Technology (Austria), University of Rome “La Sapienza” (Italy), Rodan Systems (Poland). » Università degli Studi di Roma "Tor Vergata" Dipartimento di informatica, sistemi e produzione (Prof. Maria Teresa Pazienza) » Database Group of Microsoft Research (Phil Bernstein) » Answer Set Programming in Data Integration Systems, Leopoldo Bertossi, Loreto Bravo (Carleton University, Ottawa, Canada) Jan Chomichi (University of Buffalo, USA) S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 11 Integration vs. Distributed DBMS No common schema › Sources with heterogeneous schemas › Semi-structured sources Legacy Sources › Not necessarily relational › Access/process limitations Autonomous sources › Uncontrolled source content overlap Unpredictable run-time behavior › Makes query execution hard S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 12 Building a Data Integration System Create a middleware “mediator” or “data integration system” over the sources › Can be virtual or a data wharehouse › Presents a uniform query interface and schema › Abstracts away multitude of sources; consults them for relevant data S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 13 Data Warehouse Architecture User queries OLAP / Decision support/ Data mining (Relational?) database (warehouse) Data extraction, cleaning/ scrubbing Data extraction programs Data source S. Costantini / Tutorial Data Integration CILC'05 Data source Data source Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 14 OLTP vs. OLAP » OLTP: On-Line Transaction Processing › Many short transactions (queries + updates) › Examples: - Update account balance - Enroll in course - Add book to shopping cart › Queries touch small amounts of data (one record or a few records) › Updates are frequent › Concurrency is biggest performance concern S. Costantini / Tutorial Data Integration CILC'05 » OLAP: On-Line Analytical Processing › Long transactions, complex queries › Examples: - Report total sales for each department in each month - Identify top-selling books - Count classes with fewer than 10 students › Queries touch large amounts of data › Updates are infrequent › Individual queries can require lots of resources Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 15 Data Warehouse Enterprise “Database” Customers Orders Transactions Etc… Vendors Etc… Data Miners: • “Farmers” – they know • “Explorers” - unpredictable Copied, organized summarized Data Warehouse S. Costantini / Tutorial Data Integration CILC'05 Data Mining Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 16 Architecture for Virtual Integration Leave the data in the sources. When a query comes in: 1) 2) 3) 4) Determine which sources are relevant to query. Break query into sub-queries for each source. Get answers from sources Combine. Data is fresh. Challenge: performance. S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 17 Virtual Integration Architecture User queries Mediated schema Reformulator Mediator: Optimizer Execution engine Data source catalog wrapper wrapper wrapper Data source Data source Data source Sources can be: relational, hierarchical (IMS), structured files, web sites. S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 18 Mediated Integration » User poses queries in terms of global schema » Relationship between global schema and local, source schemas specied in the mediator, as source descriptions » Mediator is responsible of solving problems of data: › redundancy › complementarity › inconsistency: sources, independently, may be consistent, but together, possibly not. E.g., Same ID card number may be assigned to different people in different sources S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 19 XML-Based Mediator Architecture USER/Client Query Q ( G (S1,..., Sk) ) Integrated Global (XML) View G Integrated View Definition MEDIATOR G(..) S1(..)…Sk(..) (XML) Queries & Results (XML) View (XML) View (XML) View Wrapper Wrapper Wrapper S1 S2 Sk S. Costantini / Tutorial Data Integration CILC'05 wrappers implemented as web services Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 20 Wrapper Programs » Task › to communicate with the data sources and do format translations. » Built w.r.t. a specific source. » Can sit either at the source or mediator. » Often hard to build › (very little science). » Can be “intelligent” › perform source-specific optimizations. » Exploit ontologies to relate analogous domains S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 21 Example <b> Introduction to DB </b> Transform: <i> Phil Bernstein </i> <i> Eric Newcomer </i> Addison Wesley, 1999 into: <book> <title> Introduction to DB </title> <author> Phil Bernstein </author> <author> Eric Newcomer </author> <publisher> Addison Wesley </publisher> <year> 1999 </year> </book> S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 22 Semantic Mappings between Schemas » Source schemas = XML DTDs house address contact-info agent-name num-baths agent-phone 1-1 mapping non 1-1 mapping house location contact name S. Costantini / Tutorial Data Integration CILC'05 full-baths half-baths phone Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 23 Accessing Sources via Wrappers SELECT address, tel FROM Restaurant WHERE cuisine = “chinese” Chinois, 2720 Main St, 310-777-9876 Peking Star, 1 Broad St, 213-999-7676 ..... S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 24 Wrapper Generation » Approaches to automating the process › Exploit format information (structure, HTML etc. ) › Template based approaches › Machine learning techniques (e.g., shallow NLP where NLP = Natural Language Processing) » LIXTO: semi-automated wrapper generation › Elog: prolog + patterns › HTML … operator … Elog XML S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 25 “Semantic Web” » The “semantic-web” initiative attempts to automate schema mapping › Idea: Allow pages to write logical axioms relating their vocabulary (tags) to other external tags › Support automatic inference of relations between source and mediator schema using these rules S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 26 Query Model in Virtual Integration » User formulates query in terms of his/her ontology on the mediated (or “global”) schema » System reformulates queries in terms of sub-queries for each source (“local” schema) » Structure of the query model should be more intuitive for the user S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 27 Reformulation Problem » Given: › A query Q posed over the mediated schema › Descriptions of the data sources » Find: › A query Q’ over the data source relations, such that: - Q’ provides only correct answers to Q, and - Q’ provides all possible answers from to Q given the sources. S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 28 Deductive Databases » Tables viewed as predicates. » Ops on tables expressed as “datalog” rules › (Horn clauses, without function symbols) Enames(Name) :- Employe(Name, SSN) [Projection] Wealthy-Employee(Name):Employee(Name,SSN), Salary(SSN,Money),Money> 10 [Selection] Ed(Name, Dname):Employee(Name, SSN), E_Dependents(SSN, Dname) [Join] ERelated(Name,Dname) :- Ed(Name,Dname) ERelated(Name,Dname) :- Ed(Name,D1), ERelated(D1,D2) [Recursion] S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 30 Approaches to Specifying Schema Descriptions » Global-as-View: express the mediated schema relations as a set of views over the data source relations » Local-as-View: express the source relations as views over the mediated schema. S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 31 GaV: an example S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 32 Approaches for Mapping S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 33 Global-as-view vs. Local-as-view » Global-as-view approach › Each item in Global schema/ontology as a view (query) over source schemas/ontologies › query(G) = query(f(S1, S2, …, Sn)) » Local-as-view approach › Each source as a view/query over global schema/ontology › query(G) = query(f1-1 (S1), f2-1(S2), …, fn-1 (Sn)) S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 34 GAV vs. » Not modular › Addition of new sources changes the schema mapping » Can be awkward to write mediated schema without loss of information » Query reformulation easy › Often reduces to view unfolding (polynomial) › Can build hierarchies of mediated schemas » Best when › Few, stable, data sources › well-known to the mediator (e.g. corporate integration) S. Costantini / Tutorial Data Integration CILC'05 LAV » Modular--adding new sources is easy » Very flexible--power of the entire query language available to describe sources » Reformulation is hard › Involves answering queries only using views (can be intractable) » Best when › Many, relatively unknown data sources › possibility of addition/deletion of sources Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ Data Integration: Global-as-View in Answer Set Programming S. Costantini Università degli Studi di L’Aquila [email protected] Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 36 Computing Answers in Data Integration Systems » Answer set programming (ASP) gives a semantics to datalog (logic) programs with negation (and possibly disjunction in the heads) » ASP-based specification of a data integration system » ASP-based specification of extensions (plausible answers, repairs) S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 37 Computing Answers in Data Integration Systems Methodology works for first-order queries (and Datalog extensions), and many ICs › LAV: Bravo and Bertossi, IJCAI’03 › GAV: Costantini and Formisano, ASP’03 (experiments and a working program) S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 38 GaV in ASP » Instead of unfolding w.r.t. each query, helper model so as to get a general inference engine » Easily express integrity constraints » Cope with incomplete information (also resulting form adding new sources) S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 39 GaV: an example S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ Front-end: Global Schema through a Helper Model person(X):- entity(1,X). % meta-concepts + naming organization(X):- entity(2,X). member(X,Y):- relat(1,X,Y). student(X):- entity(3,X). university(X):- entity(4,X). enrolled(X,Y):- relat(2,X,Y). age(X,Y):- attr(1,X,Y). % Constraints on student and enrolled enrolledE(X):- student(X),university(Y),enrolled(X,Y). :- not student(X),enrolled(X,Y). :- student(X),not enrolledE(X). S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 40 41 Back-end: Mapping to the Source Schemas entity(1,X):- s(1,X). entity(2,X):- s(2,X). entity(4,X):- s(5,X). % student is described implicitly as the domain of an unknown relationship entity(3,X):- s(3,X,Y). % student is described implicitly as the domain of one of the sources of enrolled entity(3,X):- s(4,X,Z). S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 42 Back-end: Mapping to the Source Schemas % Relationships relat(1,X,Y):- s(7,X,Z),s(8,Z,Y). relat(2,X,Y):- s(4,X,Y). relat(2,X,Y):-s(9,X,Y). attr(1,X,Y):- s(3,X,Y). attr(1,X,Y):- s(6,X,Y,Z). S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 43 Helper Model: Relational Inference Engine » Meta-rules for is_a chaining: relat(M,X,Y):- schema(N,E1,E2), entity(E1,X), entity(E2,Y), is_a_r(N,M), relat(N,X,Y). entity(M,X):- is_a_e(N,M), entity(N,X). S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 44 GaV: an example S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 45 Dealing with Incomplete Information » Assume to add a new source for Enrolled, that in the back-end becomes, e.g.,: relat(2,X,Y):- s(4,X,Y). relat(2,X,Y):- s(9,X,Y). » No connection to student: but, one who is enrolled to a University should be a student S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ Dealing with Incomplete Information (cont. 1): ASP constraints % Constraints on student and enrolled enrolledE(X):- university(Y),enrolled(X,Y). % Student mandatory in Enrolled :- student(X),not enrolledE(X). % One if enrolled is a student :- not student(X),enrolled(X,Y). S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 46 Dealing with Incomplete Information (cont. 2): Generating a Search Space % specifies possibly incomplete entities ... student(X):- entity(3,X). incomplete(3). % Student % Any individual might occur in % an incomplete entity entity(N,X):- incomplete(N), not noentity(N,X). noentity(N,X):- incomplete(N), not entity(N,X). S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 47 Dealing with Incomplete Information (cont. 2): Pruned Search space » One who is enrolled is a student › Instance of a well-known problem in GaV › Not solvable with simple unfolding › Simply (though not efficiently) solved in ASP S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 48 49 Dealing with Incomplete Information S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ Dealing with Incomplete Information (cont. 1) S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 50 Dealing with Incomplete Information (cont. 2) S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 51 Dealing with Incomplete Information (cont. 3) S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 52 Dealing with Incomplete Information (cont. 4) S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 53 54 Observations and Issues S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/ 55 The End, thank you for your attention! S. Costantini / Tutorial Data Integration CILC'05 Dipartimento di Informatica Università degli Studi di L’Aquila http://www.di.univaq.it/