Exercises – CSCI 6441

Chapter 5 Exercises

Explain the relationship between data independence and physical database structure. How does a DBMS provide data independence, and why is this considered a fundamental goal of the relational approach?
Database systems store data on disk storage devices rather than keeping all data in main memory. Explain the technical and practical reasons for this decision and discuss how this impacts DBMS design.
Compare and contrast the B-tree and B+ tree index structures. What specific advantages does the B+ tree offer for database systems? Include a discussion of how these advantages impact query performance.
Given a B+ tree with a branching factor of 200: a) How many entries can be indexed with 1, 2, 3, and 4 levels? b) If accessing each level requires one disk I/O, calculate the maximum number of disk I/Os needed to find a value in a table with 10 million rows.
Draw a small B-tree with a maximum of 3 keys per node. Show what happens when you insert the following values in this order: 10, 5, 15, 3, 7, 12, 17, 4, 6, 13, 19. Include the tree state after each insertion.
Compare the suitability of three different search algorithms (linear search, binary search, and hash-based search) for each of these database operations: a) Finding all employees with salaries over $50,000 b) Locating a specific employee by their exact employee ID c) Retrieving all employees in a specific department d) Finding all employees whose last names start with “M”
Explain how the DBMS software architecture components (lexical analyzer, SQL parser, optimizer, and executor) work together to process a SQL query. For a simple query like SELECT * FROM employees WHERE salary > 50000, trace how each component contributes to producing the result.
If the WAL (Write-Ahead Logging) protocol is not followed, what specific types of database failures could occur? Explain the mechanism by which WAL prevents these failures.
The clustering of tables based on a foreign key relationship can significantly improve join performance. Using the EMP and DEPT tables as examples, explain: a) How clustering would physically organize these tables on disk b) The performance advantages for joins c) Any potential disadvantages for other operations d) When an administrator should consider implementing clustering
Describe the purpose and content of the system catalog in a database system. What specific information does it store, and how is this information used during query processing?
Design a database page layout that efficiently stores variable-length rows. Your design should include: a) A header structure for the page b) A method for locating rows within the page c) A strategy for dealing with rows that don’t fit on a single page d) An approach for managing free space on partially filled pages
Explain how database extents and tablespaces help manage the physical storage of database objects. What performance advantages do they provide, and what configuration options might a database administrator consider when setting them up?

Performance Exercises

This assignment is a further exploration of some performance issues, particularly dealing with how queries are performed. Use the performance testing database that you’ve created with large tables with columns of different data types.

Be ready to present your results at our conference with Powerpoint charts with graphs of your results. The graphs of interest are of how performance varies at different table sizes, for the various scenarios. Also, submit your results in Blackboard. Your submission should include your performance charts, an explanation of them, and analysis of the results.

Here is the scenario. Think of yourself as working on a project to develop a major database application, for Citibank, for example. As one of the most expert technical people on the project, you’ve been asked to show some data about various performance issues for your colleagues. Do some performance experiments so that you have interesting data to share, and decide what’s a good way to present it so that it is easy to understand and is persuasive in its support of the guidance you give.

First, let’s divide the class into three sections using the final digit of your GWID. Divide itss by 3, and use the remainder as your group number. We will have groups zero, one and two. Carry out the assignment that matches your group number.

You may do the exercises independently or as a group. If you do them as a group, divide the topics so that each person has a separate topic to explore and discuss in class independently. Note that for all these topics, what’s expected is graphs of the variation in performance as the table size grows, and your analysis of that performance. Be careful to clear the database cache between experiments, so that the cache does not distort your results. There are SQL statements to clear cache, so that you don’t need to end your session to clear cache. It would be appropriate to do an experiment to verify that the statement you use to clear cache does in fact do the job.

0. Multi-table joins.

You are going to give your colleagues guidance on how Oracle processes multi-table joins, so that they understand how to get good performance on their queries. The most important choice that pregrammers will have is which columns to index and how to index them, so that’s the guidance you want to give, supported by data that you obtain with experiments. Some of the areas you may want to exploare are: a six-table join, using all columns of high selectivity; one six-table join of columns of high and low selectivity; one six-table join of columns of low selectivity. These are just examples, you decide which experimental data to present and how to present it. The topic of interest is how performance varies as the table size increases in various situations, depending on how Oracle processes the queries. You can use EXPLAIN to determine how Oracle processed these queries. The important result of this experiment is your explanation of how the joins were processed and why. What guidance will you offer to your colleagues?

Indexing and NULL.

You will give your colleagues guidance about the use of indexes and query performance when a column can contain null values, and the cost of searching for null values. First, insert a single null value into half the columns of each table, varying the data type for columns you use. Now create indexes and composite indexes on another column as well as a column that contains a null value. Show the methods that force Oracle to use an index to search for a null value, and show experimental results. The variation in performance as table size grows, and as the methods that Oracle uses to process these queries, is of importance. What guidance do you give to your colleagues?

2. Join Performance.

You will give your colleagues guidance on the performance of two-table joins with and without indexes, for different levels of selectivity and different table sizes as well as exact, inequality and LIKE comparisons. Construct SELECT statements and use time reports and EXPLAIN to measure performance of various join processing methods. Show the change in performance as the table size varies for a number of different methods that Oracle uses for processing joins. The important result of your experiments is how Oracle decides which method to use in processing a join; your experimental results should support your conclusions. What guidance do you give to your colleagues?

Assignment 11

Construct a star schema data model for a data warehouse for sales made at a chain of stores that operates in many countries. The store is interested in analyzing data based on day of the week, month, season, average temperature that day, city, state, country, price, time of day, product weight, customer age, gender, store size, and years store has been open. Use ERDPlus, available on the Web, for your ERDs. Write queries for typical analysis of trends in sales data by day of the week, by city, and product weight.
Use your ERD to create an Oracle database. Enter data for hundreds of sales into your database. Write queries for analysis of trends in sales by day of the week, by city and by product weight. Show the results of those queries.
Compare and contrast the star schema with the snowflake schema. When would you choose one over the other? Provide specific business scenarios where each would be appropriate.
Explain why OLAP databases are typically separate from OLTP databases, focusing on at least three key differences in their design and usage patterns. Design a fact table and associated dimension tables for an e-commerce website that wants to analyze customer purchasing behavior. Your schema should include dimensions for customers, products, time, geography, and marketing campaigns.
List five common metrics that would appear in the fact table. Explain the concept of “grain” in the context of a fact table. How does the choice of grain impact the size and analytical capabilities of a data warehouse? Provide an example.
Walmart’s data warehouse was mentioned as a pioneering example of OLAP Research. Describe two specific ways Walmart has used their data warehouse to gain competitive advantage. Create a snowflake schema for a university that wants to analyze student enrollment patterns. Include dimensions for students, courses, instructors, departments, time, and campus locations. Explain your design choices. Bitmap indexes are crucial for OLAP performance. Explain how bitmap indexes work and why they’re particularly well-suited for star schema queries compared to B-tree indexes. Many organizations implement both a data warehouse and separate data marts. Explain the relationship between these components and list three advantages and two disadvantages of using data marts alongside a central data warehouse. Describe the ETL (Extract, Transform, Load) process for populating a data warehouse. What kinds of data transformations are typically performed during this process, and why are they necessary? Write a query using the star schema from Exercise 1 that would help a store manager analyze which products have the highest profit margins by season and customer demographic (age group and gender). Explain the concept of “slowly changing dimensions” in data warehousing. Describe three different methods for handling changes in dimension data, and when each might be appropriate. Design a multidimensional schema for a hospital that wants to analyze patient outcomes. Include dimensions for patients, treatments, physicians, diagnoses, time, and hospital departments. What measures would be appropriate for the fact table? Describe how OLAP operations (roll-up, drill-down, slice, dice, and pivot) support different analytical needs. Provide a business example for each operation using the retail star schema from Exercise 1. Research and compare two commercial OLAP products mentioned in the chapter (such as products from Oracle, IBM, or SAP). Discuss their approaches to star schema processing and their relative advantages. Master Data Management (MDM) relates to OLAP through the need for consistent dimension data. Explain how MDM principles can improve the quality and usefulness of a data warehouse, and describe a practical approach for implementing MDM alongside a data warehouse initiative.