Approximate Relational Acyclic Schemas: From Theory to Practice
In database design, approximate acyclic schemas offer a flexible approach to extraction of relational structure from data. Traditional acyclic schemas require the data to strictly adhere to the schema structure, which often leads to few representations if any. By relaxing these constraints, approximate acyclic schemas greatly increase the space of available representations. However, by allowing a broader range of schemas, the risk of errors and inconsistencies increases. Balancing the benefits of flexibility with the potential for inaccuracies is a critical consideration in the mining and managing of data, emphasizing the need for careful evaluation and validation of the resulting schemas.
In this research, we explored this problem from two aspects. The first part is concerned with how the data error in approximate acyclic schemas relates to information-theoretic properties of the schema. The second part is focused on Position List Index, a technique for efficient computation of frequencies in relational data, which plays a key role in mining approximate acyclic schemas from relational data. The management of PLIs in-memory is a bottleneck in many applications. Using adaptive sampling, we improved the computation of frequencies on several benchmark datasets. Additionally, we characterize our suggested solution as a new variant to the classical combinatorial Coupon Collector Problem.
M.Sc. student under the supervision of Prof. Nir Weinberger and Prof. Batya Kenig.