User Tools

Site Tools


magik-demo:user:school-example

Completeness of School Data

Introduction

In database applications such as information integration and decision support, one is in interested in data sets that are complete, in the sense that the data represent all relevant facts that hold in the real world. In many situations, though, it is only possible to guarantee partial completeness of the data, which means that for certain aspects of the application domain the data are complete, but not for others. In such a situation, one would like to know at least whether the available data are sufficient to answer a given query completely, that is, whether the answers to the query over the avalable data are the same as if the data set were complete.

An example is the management of school data in the province of Bolzano, which motivated the work reported here. Data in the school information system of the province are often incomplete because each school individually is responsible for inserting its data into the system and because for certain kinds of data the contribution is optional. Decision makers, however, need to know whether or not the statistics on which they base the allocation of resources to schools are derived from complete data.

The problem statement that MAGIK solves is the following: Given a collection of table completeness (TC) statements (meta-statements that states which parts of tables are certainly complete) that hold over a database instance D, we can ask whether they entail that for a query Q the set of answers Q(D) is complete, that is, whether Q(D) contains all answer records that one would expect if D were complete (it contains all information).

In the rest we illustrate main reasoning featutes of MAGIK using the following toy schema:

pupil(name,level,code),
class(level,code,branch),
learns(name,lang).

Example 1: TC-QC plain

Assume we are interested who are the pupils from the 1st level:

Q1=    SELECT p.name
       FROM   pupil AS p
       WHERE  p.level='1'

We also assume that we are complete for all pupils in the school, expressed with the following TC-statement 1):

TC1=   TABLE: pupil(Name,Level,Code)  WHERE:

We reason in the following way. If we are complete for all pupils in the school, then we are complete also for the pupils from the 1st level. Not surprisingly, if TC1 holds over our database instance the answer obtained from query Q1 will be complete, regardless of the completeness of the other parts of the database. Now if we  run query Q1 by selecting only TC1, the MAGIK reasoner inform us that  <fc green>query is complete</fc>.

Now assume a different scenario where our database is less complete. In particular, we assume that we are complete only for pupils from the class 1a:

TC2=   TABLE: pupil(Name,1,a)         WHERE:

Now TC2 states the completeness only for certain pupils on the 1st level. Without additional information, we don't know how much we are complete in the case of other classes (f.e., 1b or 1c). Consequently, if we  run query Q1 selecting only TC2, MAGIK informs us that  <fc red>query is not complete</fc>.

Considering that the query is not complete, MAGIK analysis which parts of the database are needed for Q1 in order to be complete. Parts that are missing are expressed in the form of TC-statements. Proposed set of TC-statements is the minimal extension of the existing set of TC-statements. In the case our settings, we see that MAGIK suggests to complete all pupils from the 1st class:

TC3=   TABLE: pupil(P_Name,1,P_Code)  WHERE:

This is expected considering that we do not have other information about possible class codes. After adding TC3 and  running Q1 again, MAGIK informs us that  <fc green>query is complete</fc>.

Example 2: TC-QC under Finite Domain Constraints

In this section we illustrate how finite domain constraints can be used to infer query completeness. Let us consider again query Q1

Q1=    SELECT p.name
       FROM   pupil AS p
       WHERE  p.level='1'

and TC-statement TC2

TC2=   TABLE:pupil(Name,1,a)  WHERE:

In addition to the previous example, we have an information that code attribute in the relation pupil can be either a or b. We express this with the following finide domain constraint:

FD1=   pupil(code)  IN {a,b}

We see that considering together TC2 and FD1, to answer Q1 completely, we might be only missing pupils from 1b. We  run the query Q1 under this setting and not surprisingly MAGIK informes us that  <fc red> query is not complete</fc>.

However, if look at the MAGIK suggestions regarding the missing TC-statements, we can see that MAGIK proposes us to collect the pupil data about the class 1b, and to confirm it by adding the following TC-statement:

TC4=   TABLE:pupil(P_Name,1,b)  WHERE:

We follow those instructions and we call  run query Q1, under the configuration of TC2,TC4 and FD1. MAGIK reasoner inform us the now the  <fc green> query is complete</fc>.

Example 3: TC-QC under Foreign Keys

In this section we illustrate how foreign keys impact the query completeness. We assume that there exist a foreign keys from pupil to class

FK1: FOREIGN KEY pupil(level,code) REFERENCES class(level,code), 

and from learns to pupil

FK2: FOREIGN KEY learns(name)      REFERENCES pupil(name).

Enforced semantics

Let us assume that foreign keys obey so enforced semantics. That is when the foreign keys hold both over the available database instance and over the ideal (complete) version of it. Note, that we do not have ideal version at the disposal, but only available (incomplete) version of a database. In general, one can assume that foreign keys hold only over the ideal version, while the available instance can violate foreign keys. This situation is plausible in information integration scenario. Nevertheless, we for now we assume enforced semantics.

Now, we ask for the all pupils in the science branch that learn French language:

Q2=    SELECT p.name
       FROM   pupila AS p, class AS c, learns AS l
       WHERE  p.name=l.name      AND l.lang='french'
       AND    p.level=c.level    AND p.code=c.code
       AND    c.branch='science'

Additionally, we assume that we are complete for all French class attendances:

TC5=   TABLE: learns(Name,french)   WHERE:

Under this setting, we can reason in the following way. If for a some student exists a French class attendance in the ideal database then according TC5 it must exists also in the available databas instance. Considering that available database instance obeys the foreign key constraint FK1, then for that student there exists a corresponding record in the pupil table. Further, considering this and FK2 there must exists a corresponding record in the class table as well. In other words, if this student is a science student in ideal database instance then that student is science in the available instance as well. If we  run the query Q2, MAGIK takes all this into consideration, and it returns that  <fc green> query is complete</fc>.

Non-Enforced semantics

Now we relax the assumptions from the setttings above, by assuming that foreign keys are not enforced over the available database instance. If we  run query Q2 for the FK1, FK2 and TC4, MAGIK inform us that  <fc red> query is not complete</fc>. If we look at the suggestions, MAGIK proposes us to add the following TC-statements:

TC6=   TABLE: class(C_level,C_code,science)	WHERE: learns(P_name,french), pupil(P_name,P_level,C_code)
TC7=   TABLE: pupil(P_name,P_level,C_code)	WHERE: class(C_level,C_code,science), learns(P_name,french)

After adding those statements and  running Q1 again, MAGIK concludes that  <fc green> query is complete</fc>.

Example 4: TC-QC under Foreign Keys and Finite Domain Constraints

In this section we illustrate how interplay between foreign keys and finite domains impact the query completeness. We illustrate this in the following scenario. We assume that there exists a foreign key (no metter if it is enforced or not)

FK1=  FOREIGN KEY pupil(level,code) REFERENCES class(level,code), 

Then we know that branch in the class relation can be either science or humanities.

FD2=  class(branch) IN  {humanities,science}, 

Lastly, we assume that we are complete for all pupils from the humanities branch and for all pupils from science branch expressed with the following TC-statement:

TC8=  TABLE: pupil(Name,Class,Level)  WHERE: class(Level,Code,humanities), 
TC9=  TABLE: pupil(Name,Class,Level)  WHERE: class(Level,Code,science).

Now we have a query that ask for all pupils:

Q3=   SELECT p.name
      FROM   pupila AS p

We reason in the following way. If there is a record of a pupil in the idel database instance, then following the foreign FK1, there exists a corresponding records in the class relation. Due to the FD1 this record has either humanities or science on the branch position. If the first is the case, then following TC8 the pupil record will exists in the available database. If the second is the case, then following TC9 the pupil record will exists in the available database again. We  run query Q3 for FK1, FD2, TC8 and TC9, and MAGIK inform us that  <fc green> query is complete</fc>.

Example 5: TC-QC under Set and Bag semantics

In this section we illustrate the difference between set and bag semantics. Default SQL semantics is beg semantics. To express set semantics one needs to use DISTINCT after the SELECT. For example consider the following two queries.

Q4=   SELECT DISTINCT l1.Name
      FROM   learns AS l1, learns AS l2
      WHERE  l1.Name = l2.Name 
      AND    l1.Lang = ’french’
 
Q5=   SELECT l1.Name
      FROM   learns AS l1, learns AS l2
      WHERE  l1.Name = l2.Name 
      AND    l1.Lang = ’french’

Query Q4 returns a pupil if such a pupil learns French language. Each pupil will be returned at most once, considering DISTINCT command in the SELECT. In this case appearance of l2 is absolute superfouls.

On the other hand, query Q5 returns each name of a learner of French as many times as there are learns tuples with that name. In this case appearance of l2 is absolute necessary to preserve the answers that returns Q5.

Now assume that we are complete for all French learners:

TC10=    TABLE: learns(Name,french)        WHERE:

If we  run query Q4 under consideration of TC10, MAGIK informs us that  <fc green> query is complete</fc>. Contrary, if we  run query Q5 under consideration of TC10, MAGIK informs us that  <fc red> query is not complete</fc>.

In addition to the standard SQL select-project-join queries, MAGIK allows use of count(*) and GROUP BY. For example, we can write a query, that prints the similar result as query Q5, but in a more elegant way:

Q6=   SELECT   l1.name, COUNT(*) AS NUMBER
      FROM     learns AS l1, learns AS l2
      WHERE    l1.name = l2.name 
      AND      l1.lang = ’french’
      GROUP BY l1.Name

Now, if we look at MAGIK suggestions for the query Q5 (the same will be for query Q6), MAGIK proposes to to complete the learns tuples for the names of those pupils who learn French:

TC11=    TABLE: learns(Tl2_name,Tl2_lang)  WHERE: learns(Tl2_name,french),

(note, the request is not for all tuples, but only those whose Tl2_name variable satisfies the condition). We follow this instruction and we  run query Q5 (Q6) where TC10 are TC11 activated. Finally, MAGIK displays that  <fc green> query is complete</fc>.

1)
TC statements are written in a datalog-like syntax. Syntactically, a statement for a table R consists of two parts: an R-atom, representing a selection on R and, possibly, a condition, representing a semijoin with other tables. Here we adapt logic programming convention that variables start with upper-case letters, while constants start with lower-case letters.
magik-demo/user/school-example.txt · Last modified: 2017/07/06 15:24 (external edit)