...

Answering Queries using views

by user

on
Category:

hardware

52

views

Report

Comments

Transcript

Answering Queries using views
Answering Queries using
views: A survey
The problem…
• We have a schema A and a set of materialized
views V1, V2,…, Vn
• We are given a query Q over schema A
• Can we answer Q using another query Q’
referring only to (one or more) views?
• Depending on the specific context:
– Schema A can be either virtual or materialized
– Q’ may be required to be an equivalent rewriting or a
maximally contained rewriting of Q
Equivalent and Containment
rewritings
• Query Containment: Q1 is contained in Q2 (Q1Q2), if for all db
instances D, Q1(D)Q2(D)
• Query Containment: Q1 is equivalent to Q2 if Q1Q2 and Q2  Q1
Let A a schema and V={V1, V2, …, Vm} a set of views over A and Q
a query over A
• Q’ is a equivalent rewriting of Q, if
– Q’ refers only to views in V
– Q’ is equivalent to Q
• Q’ is a maximally contained rewriting of Q, if
– Q’ refers only to views in V
– Q’  Q
– There is no other rewriting Q1, such as Q’  Q1  Q
When the problem arises?
• Data integration
– The sources are described as views over the mediated schema
• Data warehouse
– The schema of the warehouse can be defined as views over the
schemata of the sources
• Query Optimization
– Computing the query using previously materialized views
– semantic caching
• Database design
– Independence of the physical view of the data and logical view
– Rewriting of a query over the logical schema into a query over the
physical view
The variations of the problem
Data
integration
Equivalent
rewriting
Maximallycontained
rewriting
Views are
materialized
The db
schema is
virtual
Only if the
sources are
complete
Data
Warehouses
Query
optimization
Database
design
A running example…
• Suppose we have the following schema
Student(st_id, st_name, dp_id)
Professor(pr_id, pr_name)
Department(dp_id, dp_description)
Course(cr_id, cr_title, cr_semester)
Prof_Dep(pr_id, dp_id, university)
Student_course(st_id, cr_id)
Prof_course(pr_id, cr_id)
Data Integration
Mediator
Student(st_id, st_name, dp_id)
Professor(pr_id, pr_name)
Department(dp_id, dp_description)
Course(cr_id, cr_title, cr_semester, cr_field)
Prof_Dep(pr_id, dp_id, university)
Student_course(st_id, cr_id)
Prof_course(pr_id, cr_id)
Create View DBProfs As
Select cr_title, pr_name
From Course, Prof_course,
Professor
Where Course.cr_id =
Source 1
Prof_course.cr_id and
Professors
Of Databases Prof_course.pr_id =
Professor.pr_id and
Course.cr_field = “DB”
Source 1
Professors
Of AUEB
Create View AuebProfs As
Select cr_title, pr_name
From Course, Prof_course,
Professor
Where Course.cr_id =
Prof_course.cr_id and
Prof_course.pr_id =
Professor.pr_id and
Course.univ = “AUEB”
Data Integration
Mediator
Student(st_id, st_name, dp_id)
Professor(pr_id, pr_name)
Department(dp_id, dp_description)
Course(cr_id, cr_title, cr_semester, cr_field)
Prof_Dep(pr_id, dp_id, university)
Student_course(st_id, cr_id)
Prof_course(pr_id, cr_id)
Query Q:
Select pr_name
From Professor
Query Q’:
Select pr_name
From DBProfs
Union
Select pr_name
From AUEBProfs
Source 1
Professors
Of Databases
Source 1
Professors
Of AUEB
Q’ is a maximally-contained
Rewriting of Q
(professors of Algorithms in
the university of Piraeus cannot
be retrieved…)
Data Integration
Mediator
Student(st_id, st_name, dp_id)
Professor(pr_id, pr_name)
Department(dp_id, dp_description)
Course(cr_id, cr_title, cr_semester, cr_field)
Prof_Dep(pr_id, dp_id, university)
Student_course(st_id, cr_id)
Prof_course(pr_id, cr_id)
Query P:
Select pr_name
From Course, Prof_course, Professor
Where Course.cr_id = Prof_course.cr_id
and Prof_course.pr_id = Professor.pr_id
and Course.cr_field = “DB”
And Course.univ = “AUEB”
Query P’:
Select pr_name
From DBProfs, AUEBProfs
Where
DBProfs.pr_name = AUEBProfs.pr_name
Source 1
Professors
Of Databases
Source 1
Professors
Of AUEB
P’ is an Equivalent Rewriting of P
Query Optimization
• Suppose that we have a database with the schema of
our running example
• Suppose that we have the following materialized views
–
–
create view StudentCoursesView As
Select st_name, cr_id
from Student, Stud_course
where Student.st_id = Stud_course.st_id
create view ProfCoursesView As
Select pr_name, pr_id
from Professor, Prof_course
where Professor.pr_id = Prof_course.pr_id
Query Optimization
• Suppose we are given the query
Select st_name, prof_name
from Student, Stud_course, Prof_course,
Professor
where Student.st_id = Stud_course.cr_id
and Stud_course.cr_id = Prof_course.cr_id
and Prof_course.pr_id = Professor.pr_id
Query Optimization
• We can exploit the materialized views in
order to eliminate joins
• There are many possible combinations
• One could be:
Select st_name, pr_name
From StudentCoursesView, ProfessorCoursesView
Where StudentCoursesView.cr_id =
ProfessorCoursesView.cr_id
Query Optimization
• But:
– Our aim here is not just to find an equivalent rewriting
– We have to find all rewritings and evaluate their costs in order to
select the cheapest plan
– In some cases views may not help due to loss of indexes
– Therefore, we need to make a cost-based rewriting
• Similar case is the semantic caching
– Client-server architecture
– Query results can be temporary maintained as relations in the
client side to be exploited for the evaluation of future queries
A common question: When a view
V is useful for a query Q?
• There must be a mapping μ from the relations of V to the
relations of Q
• If so, if R1 is both in V and Q then:
– If R1 participates in a join in Q with relation R2, then
• Either R1 and R2 must be joined in the same attributes with the same
conditions in V
• Or R1 and R2 must be joined in V in the same attributes but with weaker
conditions than those in Q, provided that the join attributes are projected
in V
• Or only R1 is in V and the attributes of R1 participating in the join in Q,
are projected in V
– If Q applies a selection predicate in an attribute A of R1, then V:
• Either V must apply the same selection predicate in attribute A of R1
• Or V must apply a weaker selection predicate, provided that attribute A is
projected in V
• Or does not apply a predicate in attribute A of R1 at all and attribute A is
projected in V
• V must project at least all attributes projected in Q
Examples
Q: Select st_name, prof_name
from Student, Stud_course, Prof_course, Professor
where Student.st_id = Stud_course.st_id
and Stud_course.cr_id = Prof_course.cr_id
and Prof_course.pr_id = Professor.pr_id
and Student.st_year > 2000
V1: Select st_name
from Student where st_year>2000
V1: Select st_name, st_id
from Student
where st_year> 2000
V1: Select st_name, st_id
from Student
where st_year >1996
V1: Select st_name, st_id, ma
from Student
where am >1996
V1: Select st_name, cr_name
from Student, Stud_course,
Course
where Student.st_id = Stud_course.st_id
and Stud_course.cr_id = Course.cr_id
V1: Select st_name, st_year, cr_name,
cr_id
from Student, Stud_course,
Course
where Student.st_id = Stud_course.st_id
and Stud_course.cr_id = Course.cr_id
Algorithms for finding query
rewritings using views
• In the query optimization context
– A variation of the Selinger Algorithm
• Cost-based equivalent rewritings
• In the data integration context
– The bucket algorithm
– The inverse-rules algorithm
– The MiniCon algorithm
Maximally-contained
rewritings
Query Optimization – the Selinger
Algorithm
• Recall: the Selinger Algorithm chooses the
best join order
• A bottom-up dynamic programming
algorithm
• Each iteration explores all join combinations
of certain length
• For each combination of joins, it keeps only
the cheapest plan, taking into account costs
estimated in the previous iteration
A variation of the Selinger Algorithm
• The Selinger Algorithm can be modified in order
to choose the cheapest equivalent rewriting
• We start with views that seem useful for a
certain query
• We continue in a bottom-up fashion exploring all
join combinations
• For each combination we prune plans estimated
to be more expensive or less contributing than
others
A variation of the Selinger Algorithm
• First Iteration:
– Find all views that are relevant (useful) to the query
– Choose the best access path for each view and evaluate its cost
• Second Iteration:
– Find all pair combinations
– For each pair find all possible Joins in all possible attributes that
lead to useful plans
– Plans that are complete (i.e. are sufficient to answer the query) are
distinguished as final candidates
– For the rest plans do the following:
• Compare pairs of plans
• For each pair (p, p’), if p’ is cheaper and has greater or equal
contribution (covers more relations), then p is pruned
• Third Iteration:
– Find all possible pairs of plans derived from the previous iteration
and the plans derived from the first iteration
– …
A variation of the Selinger Algorithm
Query: R1 R2 R3 R4
Views: V1=R1 R2 , V2=R2
V3=R2
R3, V4=R3
R3
R4,
R4
(V1-V3)-V2 (V1-V3)-V4
V1-V2
V1-V3
V1
V1-V4
V2
V2-V3
V3
V4
V2-V4
V3-V4
Algorithms in the Data Integration
Context
• We will discuss 2 algorithms
– The bucket algorithm
– The inverse-rules algorithm
• But first a short introduction to conjunctive
queries
Conjunctive Queries
• Conjunctive queries are able to express selectproject-join queries
• h(X,…) :- a(Y,…), b(Z,…), …
head
body
subgoals
• Head and subgoals are atoms
• An atom consists of a predicate and a tuple of
arguments: a(Y, Z, ‘aueb’, W, 2006)
variables
constants
• Predicates stand for relations
Conjunctive Queries
• Subgoals can also be arithmetic comparisons
h(X,…) :- a(Y,…), b(Z,…), Z>3
• A variable that appears in the head is said to be
distinguished ; otherwise nondistinguished.
h(X,Y):-a(X,Z),b(Z,Y)
X,Y: distinguished, Z:nondestinguished
• project-select-join SQL queries as conjunctive queries:
Select pr_name, dep_name
From Professor, Department
Where Professor.pr_dep = Department.dep_id
and Department.dep_year>1996
and Professor.field = “DB”
q(pr_name, dep_name) :- Professor(pr_name, dep_id, “DB”),
Department(dep_id, dep_name, dep_year),
dep_year>1996.
Conjunctive Queries
• The problem of rewriting queries using
views in the notation of conjunctive queries:
– He have
• a set of Database Relations
• a set of views declared as conjunctive queries over
the Database Relations
• a query q expressed as a conjunctive query over the
Database Relations
– We want to find:
• A set of conjunctive queries over the view relations,
the disjunction of which being a maximally contained
rewriting of q
example
Relations:
R(S,C,Q), CO(C,T), TE(P,C,Q)
Views:
V1(S, C, Q, T) :- R(S, C, Q), CO(C, T), C>500, Q>97
V2(S, P, C, Q) :- R(S, C, Q), TE(P, C, Q)
V3(S, C) :- R(S, C, Q), Q<94
V4(P, C, T, Q) :- R(S, C, Q), CO(C, T), TE(P, C, Q), Q<97
Query:
Q(S, C, P) :- TE(P,C,Q), R(S,C,Q), CO(C,T), C>300, Q>95
The Bucket Algorithm
• Create a ‘bucket’ for each subgoal of the query q
• Fill each bucket with views that ‘cover’ the subgoal
• A view V covers a subgoal p(A,B) of q if there is a
subgoal in V with the same predicate p(X,Y) and:
– there is a mapping from the variables A, B of the subgoal of q to
the variables X, Y of the subgoal of the view
– For variables of the subgoal of q appearing in the head of q, their
mapping variables in the subgoal of the view must also appear in
the head of the view
– For each variable of the subgoal of q that also appear in
arithmetic conditions C in q, if its mapping variable in the subgoal
of the view also appear in arithmetic conditions C’ in the view,
then C and C’ should not be mutually exclusive
• A solution must include q view from each bucket
The Bucket Algorithm
V1(S, C, Q, T) :- R(S, C, Q), CO(C, T), C>500, Q>97
V2(S, P, C, Q) :- R(S, C, Q), TE(P, C, Q)
V3(S, C) :- R(S, C, Q), Q<94
V4(P, C, T, Q) :- R(S, C, Q), CO(C, T), TE(P, C, Q), Q<97
Q(St, Cn, Pr) :- TE(Pr,Cn,Qu), R(St,Cn,Qu), CO(Cn,Tl),
Cn>300, Qu>95
V4(P,C,T’, Q)
V2(S’,P,C, Q)
TE
V4(S’, C, T, Q’)
V2(S,P’,C,Q)
V1(S,C,Q,T’)
V1(S’, C, Q’, T)
R
CO
The Bucket Algorithm
• There are 8 combinations we have to check
• For each, we equaling appropriate variables to enforce
joins
• We check for condition confligs
• If there are duplicate subgoals we try to eliminate them
V4(P,C,T’, Q)
V2(S’,P,C, Q)
TE
V2(S,P’,C,Q)
V1(S,C,Q,T’)
R
V4(S’, C, T, Q’)
V1(S’, C, Q’, T)
CO
q’(S,C,P):-V2(S’,P,C,Q), V1(S,C,Q,T’), V1(S’,C,Q’,T)
q’(S,C,P):-V2(S’,P,C,Q), V1(S,C,Q,T’)
The first solution
The inverse-rules algorithm
• Key idea: invert the view definitions to give the
global predicates (not only those appearing in
the query) in terms of views
• For each nondistinguished variable X, pick a
new function symbol f
Example:
v(X,Y) :- p(X,Z), p(Z,Y)
Replace Z by f(X,Y):
v(X,Y) :- p(X,f(X,Y)), p(f(X,Y),Y)
The inverse-rules algorithm
• Replace a view definition by rules with:
– A subgoal as the head, and
– The view itself as the only subgoal of the
body.
• Example:
v(X,Y) :- p(X,f(X,Y)), p(f(X,Y),Y)
becomes:
p(X,f(X,Y)) :- v(X,Y)
p(f(X,Y),Y) :- v(X,Y)
The inverse-rules algorithm - example
View: v(D,C):-m(S,D), r(S, C)
Query: q(D) :- m(S,D), r(S, 444)
After applying the inverse-rules, we have:
m(f1(D,C), D) :- v(D, C)
r(f1(D,C), C) :- v(D,C)
The inverse-rules algorithm - example
• Supose that v has the following tuples:
{(CS, 444), (EE, 444), (CS, 333)}
• Applying the inverse-rules on these tuples
we have:
m: { (f1(CS, 444), CS), (f1(EE, 444),EE),
(f1(CS, 333), CS)}
r: { (f1(CS, 444), 444), (f1(EE, 444),444),
(f1(CS, 333), 333)}
The inverse-rules algorithm - example
• We can now apply the query in this
instance
q: {(CS),(EE)}
The MiniCon algorithm
• A variation of the bucket algorithm
• It takes into account the non-distinguished variables as
well, which usually imply joins
• If a view contains a ‘target’ subgoal, which participates
in a join in the query, then the ‘join’ variables should
appear in the head of the view (unless the join takes
place also within the view)
• It excludes much more views from participating in
buckets (here called MCDs)
• There are fewer combination to check at the end
Discussion…
Fly UP