====== Process-aware Data Quality Assessment ======

====== Questions ======

  * What are DQ problems?
    * Data Completeness, Correctness, Consistency, ...
  * What other properties of data are interesting to observe?
    * Data Stability (how stable is our data in the future)
  * Properties can be investigated statically (at the current state) or dynamically (properties that may occur in the future)

====== Approach ======

==== Leitmotif: Solve an ordinary problem in DQ ====

  * without a complex model or arguments that reality demands such complexity
  * The simpler the better
  * Crucial bit: it has a small application and use (at least in some example)

==== Simple example (starting point) ====

  * Consult books (and hard papers) **only** when the research gets stuck
  * **Think rather than read!**
  * Use imagination

====== Model Assumptions and Comments ======

Properties of a BP language:
  * modeling states: marking (a state with multiple instances running)
    * alternatively, one instance starting from the beginning is floating around
    * alternatively, current marking + introduction of new tokens via //start//
  * modeling parallelism
    * using AND
  * modeling choice (and non-determinism)
    * using OR
  * modeling DATA
    * variables
    * fully fledged SQL database
  * modeling Recursion
    * e.g., allow cycles in the Petri Net graph

More sophisticated modeling properties include:
  * modeling TIME
  * modeling Comparisons (and intervals)
  * modeling Aggregates
  * modeling Arithmetics

==== (Color) Petri Nets ====

**Color Petri Nets == YAWL**

Drawbacks:
  * concurrency of executions is implicitly assumed even though sometimes we do not need it
  * AND-gate models parallelism but requires more than one token to be executed

Data:
  * Supported via **Local** and **Global** variables **=/=** fully fledged SQL database

Still, with proper restrictions one gets a very elegant model with clear semantics for representing BP.
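
The token semantics noted above can be illustrated with a minimal sketch. It is plain Python, not YAWL or any existing Petri net library, and it ignores colors and data entirely. The class ''PetriNet'' and all place and transition names are hypothetical, chosen only for illustration. A marking is a multiset of tokens, and the AND-join is enabled only when every one of its input places holds a token.

```python
from collections import Counter


class PetriNet:
    """A tiny place/transition net; a marking is a Counter of tokens per place."""

    def __init__(self):
        # transition name -> (input places, output places)
        self.transitions = {}

    def add_transition(self, name, inputs, outputs):
        self.transitions[name] = (tuple(inputs), tuple(outputs))

    def enabled(self, marking, name):
        # A transition is enabled when each input place holds enough tokens.
        inputs, _ = self.transitions[name]
        needed = Counter(inputs)
        return all(marking[place] >= n for place, n in needed.items())

    def fire(self, marking, name):
        # Consume one token per input place, produce one per output place.
        if not self.enabled(marking, name):
            raise ValueError(f"transition {name!r} is not enabled")
        inputs, outputs = self.transitions[name]
        new_marking = Counter(marking)
        new_marking.subtract(Counter(inputs))
        new_marking.update(outputs)
        return +new_marking  # unary plus drops places with zero tokens


# Hypothetical process: AND-split into two parallel tasks, then AND-join.
net = PetriNet()
net.add_transition("split", ["start"], ["a_ready", "b_ready"])
net.add_transition("do_a", ["a_ready"], ["a_done"])
net.add_transition("do_b", ["b_ready"], ["b_done"])
net.add_transition("join", ["a_done", "b_done"], ["end"])

m0 = Counter({"start": 1})       # one process instance at the beginning
m1 = net.fire(m0, "split")       # both branches now hold a token
m2 = net.fire(m1, "do_a")        # only branch A has finished
join_after_a = net.enabled(m2, "join")      # join still waits for branch B
m3 = net.fire(m2, "do_b")
join_after_both = net.enabled(m3, "join")   # both inputs are marked
m4 = net.fire(m3, "join")
```

Starting from a marking with two tokens in ''start'' would model two instances running at once, which corresponds to the first way of modeling states listed above.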
====== References ======

===== Business Processes: A Database Perspective (2012) =====

A [[http://www.morganclaypool.com/doi/abs/10.2200/S00430ED1V01Y201206DTM027?journalCode=dtm | book]] **by Daniel Deutch, Tova Milo**

Summary:
  * Business processes are important to consider because they tell us the context in which data is generated and manipulated (processes, users and goals). The ultimate goal is a declarative model and query language that possesses all the advantages of its relational counterpart. We need a flow-and-data framework that allows us to develop analysis and optimization techniques like those we have for the relational model. Therefore, this research is of fundamental importance.

Highlights:
  * BP and data main challenges: a model and a query language for that model
  * Existing BP languages (e.g., BPEL) support explicit description of data operations only weakly. Instead, the operations are hidden below some high-level code (e.g., in Java methods).
  * On the other hand, existing data flow languages developed within the database community (Hull, Deutsch, Abiteboul, etc.) define the data flow only implicitly (in a way it is hidden) by explicitly specifying data transformations.
  * Neither existing solution fulfills the starting goal of an elegant model and a declarative query language.

==== Observations ====

Query languages for BP and data can perform analysis in:
  * **Design-phase** (no underlying data)
    * Checking consistency of the model (deadlock, reachability, etc.)
      * Petri Nets
    * Speculating about the future data state by abstracting data transformations
      * work by Simon on data completeness (real and ideal data)
    * Speculating about the future data state by looking at the future data transformations
      * checking the properties of **stability**
        * can a numeric value increase or decrease
  * **Run-time**
    * **FUTURE** (infinitely many branches)
      * temporal logic
      * mu-calculus (Diego, Babak, Marco)
      * alternative predicates for checking data properties
        * e.g., always stable = AS(pupil(X,emcl))
    * **PAST** (logs)
      * Q: Why is the data in a state like this (a bad state)?
      * We know the exact trace, but how can we use it to fix the data or to fix the process?
    * **PAST + FUTURE**
      * Knowing the past of our data, what can we tell about the future?

===== Data Quality: Concepts, Methodologies and Techniques (2006) =====

A book by **Carlo Batini and Monica Scannapieco**

===== Foundations of Data Quality Management (2012) =====

A book by **Wenfei Fan**