This shows you the differences between two versions of the page.
magik-demo:user:school-example [2013/08/23 10:32] alex |
magik-demo:user:school-example [2017/07/06 15:24] |
||
---|---|---|---|
Line 1: | Line 1: | ||
- | {{:magik-demo:magik_logo.png?nolink&500|}} | ||
- | |||
- | ====== Completeness of School Data ====== | ||
- | Content: | ||
- | * [[#Introduction|Introduction]] | ||
- | * [[#Example 1: TC-QC plain|Example 1: TC-QC plain]] | ||
- | * [[#Example 2: TC-QC under Finite Domain Constraints|Example 2: TC-QC under Finite Domain Constraints]] | ||
- | * [[#Example 3: TC-QC under Foreign Keys|Example 3: TC-QC under Foreign Keys]] | ||
- | * [[#Example 4: TC-QC under Foreign Keys and Finite Domain Constraints|Example 4: TC-QC under Foreign Keys and Finite Domain Constraints]] | ||
- | * [[#Example 5: TC-QC under Set and Bag semantics|Example 5: TC-QC under Set and Bag semantics]] | ||
- | |||
- | ===== Introduction ===== | ||
- | In database applications such as information integration and decision support, | ||
- | one is in interested in data sets that are complete, in the sense that the data | ||
- | represent all relevant facts that hold in the real world. In many situations, though, | ||
- | it is only possible to guarantee partial completeness of the data, which means that | ||
- | for certain aspects of the application domain the data are complete, but not for others. | ||
- | In such a situation, one would like to know at least whether the available data are | ||
- | sufficient to answer a given query completely, that is, whether the answers to the | ||
- | query over the avalable data are the same as if the data set were complete. | ||
- | |||
- | An example is the management of school data in the province of Bolzano, which motivated | ||
- | the work reported here. Data in the school information system of the province are often | ||
- | incomplete because each school individually is responsible for inserting its data into | ||
- | the system and because for certain kinds of data the contribution is optional. | ||
- | Decision makers, however, need to know whether or not the statistics on which they base | ||
- | the allocation of resources to schools are derived from complete data. | ||
- | |||
- | |||
- | The problem statement that MAGIK solves is the following: | ||
- | Given a collection of table completeness (TC) statements | ||
- | (meta-statements that states which parts of tables are certainly complete) | ||
- | that hold over a database instance | ||
- | D, we can ask whether they entail that for | ||
- | a query Q the set of answers Q(D) is complete, that is, | ||
- | whether Q(D) contains all answer records that one would expect | ||
- | if D were complete (it contains all information). | ||
- | |||
- | In the rest we illustrate main reasoning featutes of MAGIK using the following | ||
- | toy schema: | ||
- | <code sql> | ||
- | pupil(name,level,code), | ||
- | class(level,code,branch), | ||
- | learns(name,lang). | ||
- | </code> | ||
- | ===== Example 1: TC-QC plain ===== | ||
- | |||
- | Assume we are interested who are the pupils from the 1st level: | ||
- | <code sql> | ||
- | Q1= SELECT p.name | ||
- | FROM pupil AS p | ||
- | WHERE p.level='1' | ||
- | </code> | ||
- | |||
- | |||
- | We also assume that we are complete for all pupils in the school, expressed with the following | ||
- | TC-statement | ||
- | ((TC statements are written in a datalog-like syntax. Syntactically, | ||
- | a statement for a table R consists of two parts: an R-atom, | ||
- | representing a selection on R and, possibly, a condition, representing | ||
- | a semijoin with other tables. Here we adapt logic programming convention that variables start with upper-case letters, while constants | ||
- | start with lower-case letters.)): | ||
- | <code sql> | ||
- | TC1= TABLE: pupil(Name,Level,Code) WHERE: | ||
- | </code> | ||
- | |||
- | We reason in the following way. If we are complete for all pupils in the school, then we are complete also for the pupils from the 1st level. | ||
- | Not surprisingly, if TC1 holds over our database instance the answer obtained from query Q1 will be complete, regardless of the completeness of the other parts of the database. | ||
- | Now if we {{:magik-demo:go.jpg?nolink&10| }}run query Q1 by selecting only TC1, the MAGIK reasoner inform us that | ||
- | {{:magik-demo:ok.jpg?nolink&10| }}**<fc green>query is complete</fc>**. | ||
- | |||
- | Now assume a different scenario where our database is less complete. | ||
- | In particular, we assume that we are complete only for pupils from the class ''1a'': | ||
- | <code sql> | ||
- | TC2= TABLE: pupil(Name,1,a) WHERE: | ||
- | </code> | ||
- | |||
- | Now TC2 states the completeness only for certain pupils on the 1st level. | ||
- | Without additional information, we don't | ||
- | know how much we are complete in the case of other classes (f.e., ''1b'' or ''1c''). | ||
- | Consequently, if we {{:magik-demo:go.jpg?nolink&10| }}run query Q1 selecting only TC2, MAGIK informs us | ||
- | that {{:magik-demo:no.jpg?nolink&10| }}**<fc red>query is not complete</fc>**. | ||
- | |||
- | Considering that the query is not complete, MAGIK analysis which parts of the database are needed for Q1 in order to be complete. | ||
- | Parts that are missing are expressed in the form of TC-statements. | ||
- | Proposed set of TC-statements is the minimal extension of the existing set of TC-statements. | ||
- | In the case our settings, we see that MAGIK suggests to complete all pupils | ||
- | from the 1st class: | ||
- | <code sql> | ||
- | TC3= TABLE: pupil(P_Name,1,P_Code) WHERE: | ||
- | </code> | ||
- | This is expected considering that we do not have other information about possible ''class'' codes. | ||
- | After adding TC3 and {{:magik-demo:go.jpg?nolink&10| }} running Q1 again, | ||
- | MAGIK informs us that | ||
- | {{:magik-demo:ok.jpg?nolink&10| }}**<fc green>query is complete</fc>**. | ||
- | ===== Example 2: TC-QC under Finite Domain Constraints ===== | ||
- | In this section we illustrate how finite domain constraints can be used to infer query completeness. | ||
- | Let us consider again query Q1 | ||
- | <code sql> | ||
- | Q1= SELECT p.name | ||
- | FROM pupil AS p | ||
- | WHERE p.level='1' | ||
- | </code> | ||
- | and TC-statement TC2 | ||
- | <code sql> | ||
- | TC2= TABLE:pupil(Name,1,a) WHERE: | ||
- | </code> | ||
- | |||
- | In addition to the previous example, we have an information that ''code'' attribute in the relation ''pupil'' can be either | ||
- | ''a'' or ''b''. We express this with the following finide domain constraint: | ||
- | <code sql> | ||
- | FD1= pupil(code) IN {a,b} | ||
- | </code> | ||
- | |||
- | We see that considering together TC2 and FD1, to answer Q1 completely, we might be only missing pupils from ''1b''. | ||
- | We {{:magik-demo:go.jpg?nolink&10| }} run the query Q1 under this setting and not surprisingly MAGIK informes us that {{:magik-demo:no.jpg?nolink&10| }}<fc red> query is not complete</fc>. | ||
- | |||
- | However, if look at the MAGIK suggestions regarding the missing TC-statements, we can see that MAGIK proposes us to collect the pupil data | ||
- | about the class ''1b'', and to confirm it by adding the following TC-statement: | ||
- | <code sql> | ||
- | TC4= TABLE:pupil(P_Name,1,b) WHERE: | ||
- | </code> | ||
- | We follow those instructions and we call {{:magik-demo:go.jpg?nolink&10| }}run query Q1, under the configuration of TC2,TC4 and FD1. | ||
- | MAGIK reasoner inform us the now the {{:magik-demo:ok.jpg?nolink&10| }}<fc green> query is complete</fc>. | ||
- | ===== Example 3: TC-QC under Foreign Keys ===== | ||
- | In this section we illustrate how foreign keys impact the query completeness. | ||
- | We assume that there exist a foreign keys from ''pupil'' to ''class'' | ||
- | <code sql> | ||
- | FK1: FOREIGN KEY pupil(level,code) REFERENCES class(level,code), | ||
- | </code> | ||
- | and from ''learns'' to ''pupil'' | ||
- | <code sql> | ||
- | FK2: FOREIGN KEY learns(name) REFERENCES pupil(name). | ||
- | </code> | ||
- | |||
- | |||
- | === Enforced semantics === | ||
- | |||
- | Let us assume that foreign keys obey so ''enforced'' semantics. That is when the foreign keys hold both | ||
- | over the available database instance and over the ideal (complete) version of it. | ||
- | Note, that we do not have ideal version at the disposal, but only available (incomplete) version of a database. | ||
- | In general, one can assume that foreign keys hold only over the | ||
- | ideal version, while the available instance can violate foreign keys. | ||
- | This situation is plausible in information | ||
- | integration scenario. Nevertheless, we for now we assume ''enforced'' semantics. | ||
- | |||
- | |||
- | Now, we ask for the all pupils in the ''science'' branch that learn French language: | ||
- | <code sql> | ||
- | Q2= SELECT p.name | ||
- | FROM pupila AS p, class AS c, learns AS l | ||
- | WHERE p.name=l.name AND l.lang='french' | ||
- | AND p.level=c.level AND p.code=c.code | ||
- | AND c.branch='science' | ||
- | </code> | ||
- | |||
- | Additionally, we assume that we are complete for all French class attendances: | ||
- | <code sql> | ||
- | TC5= TABLE: learns(Name,french) WHERE: | ||
- | </code> | ||
- | |||
- | Under this setting, we can reason in the following way. If for a some student exists a French class attendance in the ideal database then | ||
- | according TC5 it must exists | ||
- | also in the available databas instance. Considering that available database instance obeys the foreign key constraint FK1, then for that student there exists a corresponding record in the ''pupil'' table. Further, considering this and FK2 there must exists a corresponding record | ||
- | in the ''class'' table as well. In other words, if this student is a | ||
- | ''science'' student in ideal database instance then that student is ''science'' in the available instance as well. | ||
- | If we {{:magik-demo:go.jpg?nolink&10| }}run the query Q2, MAGIK | ||
- | takes all this into consideration, | ||
- | and it returns that {{:magik-demo:ok.jpg?nolink&10| }} <fc green> query is complete</fc>. | ||
- | |||
- | === Non-Enforced semantics === | ||
- | |||
- | Now we relax the assumptions from the setttings above, by assuming that foreign keys are not enforced | ||
- | over the available database instance. | ||
- | If we {{:magik-demo:go.jpg?nolink&10| }} run query Q2 for the FK1, FK2 and TC4, MAGIK inform us that | ||
- | {{:magik-demo:no.jpg?nolink&10| }} <fc red> query is not complete</fc>. | ||
- | If we look at the suggestions, MAGIK proposes us to add the following TC-statements: | ||
- | <code sql> | ||
- | TC6= TABLE: class(C_level,C_code,science) WHERE: learns(P_name,french), pupil(P_name,P_level,C_code) | ||
- | TC7= TABLE: pupil(P_name,P_level,C_code) WHERE: class(C_level,C_code,science), learns(P_name,french) | ||
- | </code> | ||
- | After adding those statements and {{:magik-demo:go.jpg?nolink&10| }} running Q1 again, MAGIK concludes that | ||
- | {{:magik-demo:ok.jpg?nolink&10| }} <fc green> query is complete</fc>. | ||
- | ===== Example 4: TC-QC under Foreign Keys and Finite Domain Constraints ===== | ||
- | In this section we illustrate how interplay between foreign keys and finite domains impact the query completeness. | ||
- | We illustrate this in the following scenario. | ||
- | We assume that there exists a foreign key (no metter if it is enforced or not) | ||
- | <code sql> | ||
- | FK1= FOREIGN KEY pupil(level,code) REFERENCES class(level,code), | ||
- | </code> | ||
- | |||
- | Then we know that ''branch'' in the ''class'' relation can be either ''science'' or ''humanities''. | ||
- | <code sql> | ||
- | FD2= class(branch) IN {humanities,science}, | ||
- | </code> | ||
- | |||
- | Lastly, we assume that we are complete for all pupils from the humanities branch and for all pupils from science branch expressed with the following TC-statement: | ||
- | <code sql> | ||
- | TC8= TABLE: pupil(Name,Class,Level) WHERE: class(Level,Code,humanities), | ||
- | TC9= TABLE: pupil(Name,Class,Level) WHERE: class(Level,Code,science). | ||
- | </code> | ||
- | |||
- | Now we have a query that ask for all pupils: | ||
- | <code sql> | ||
- | Q3= SELECT p.name | ||
- | FROM pupila AS p | ||
- | </code> | ||
- | |||
- | We reason in the following way. If there is a record of a pupil in the idel database instance, then following the foreign FK1, there exists a | ||
- | corresponding records in the ''class'' relation. Due to the FD1 this record has either ''humanities'' or ''science'' on the branch position. | ||
- | If the first is the case, then following TC8 the pupil record will exists in the available database. | ||
- | If the second is the case, then following TC9 the pupil record will exists in the available database again. | ||
- | We {{:magik-demo:go.jpg?nolink&10| }} run query Q3 for FK1, FD2, TC8 and TC9, and MAGIK inform us that | ||
- | {{:magik-demo:ok.jpg?nolink&10| }} <fc green> query is complete</fc>. | ||
- | |||
- | |||
- | ===== Example 5: TC-QC under Set and Bag semantics ===== | ||
- | |||
- | In this section we illustrate the difference between set and bag semantics. | ||
- | Default SQL semantics is beg semantics. To express set semantics one needs to use ''DISTINCT'' after the ''SELECT''. | ||
- | For example consider the following two queries. | ||
- | <code sql> | ||
- | Q4= SELECT DISTINCT l1.Name | ||
- | FROM learns AS l1, learns AS l2 | ||
- | WHERE l1.Name = l2.Name | ||
- | AND l1.Lang = ’french’ | ||
- | |||
- | Q5= SELECT l1.Name | ||
- | FROM learns AS l1, learns AS l2 | ||
- | WHERE l1.Name = l2.Name | ||
- | AND l1.Lang = ’french’ | ||
- | </code> | ||
- | |||
- | Query Q4 returns a pupil if such a pupil learns French language. Each pupil will be returned at most once, considering ''DISTINCT'' command in the | ||
- | ''SELECT''. In this case appearance of ''l2'' is absolute superfouls. | ||
- | |||
- | On the other hand, query Q5 returns each | ||
- | name of a learner of French as many times as there are learns tuples with that name. In this case appearance of ''l2'' is absolute necessary | ||
- | to preserve the answers that returns Q5. | ||
- | |||
- | Now assume that we are complete for all French learners: | ||
- | <code sql> | ||
- | TC10= TABLE: learns(Name,french) WHERE: | ||
- | </code> | ||
- | |||
- | If we {{:magik-demo:go.jpg?nolink&10| }}run query Q4 under consideration of TC10, MAGIK informs us that {{:magik-demo:ok.jpg?nolink&10| }} <fc green> query is complete</fc>. | ||
- | Contrary, if we {{:magik-demo:go.jpg?nolink&10| }}run query Q5 under consideration of TC10, MAGIK informs us that {{:magik-demo:no.jpg?nolink&10| }} <fc red> query is not complete</fc>. | ||
- | |||
- | In addition to the standard SQL ''select-project-join'' queries, MAGIK allows use of ''count(*)'' and ''GROUP BY''. | ||
- | For example, we can write a query, that prints the similar result as query Q5, but in a more elegant way: | ||
- | <code sql> | ||
- | Q6= SELECT l1.name, count(*) AS number | ||
- | FROM learns AS l1, learns AS l2 | ||
- | WHERE l1.name = l2.name | ||
- | AND l1.lang = ’french’ | ||
- | GROUP BY l1.Name | ||
- | </code> | ||
- | |||
- | Now, if we look at MAGIK suggestions for the query Q5 (the same will be for query Q6), MAGIK proposes to | ||
- | to complete the learns tuples for the names of those pupils | ||
- | who learn French: | ||
- | <code sql> | ||
- | TC11= TABLE: learns(Tl2_name,Tl2_lang) WHERE: learns(Tl2_name,french), | ||
- | </code> | ||
- | (note, the request is not for all tuples, but only | ||
- | those whose Tl2_name variable satisfies the condition). We follow this instruction and | ||
- | we {{:magik-demo:go.jpg?nolink&10| }} run query Q5 (Q6) where TC10 are TC11 activated. Finally, | ||
- | MAGIK displays that {{:magik-demo:ok.jpg?nolink&10| }} <fc green> query is complete</fc>. | ||
- | |||
- |