November 21, 2008

Wondering What Comes Next...

 

DDP: Explicit and Implied Statuses

Recently I wrote about the problem of storing data for a single record type in two different tables and using the table the record belongs to in order to calculate an implied status. The headache was pretty clear, as well as the potential for data loss if not everything were properly kept up to date.

However, there are other ways to implement implied statuses that can get you in trouble as well. One is the use of a record table in correlation with an event table to track changes to the record table. While the use of these tables is a good practice where event histories need to be maintained, depending on them for statuses is suspect at best.

The reason why it's a bad idea is that it assumes that the developer, and every developer to follow, is intimately familiar with the business logic of the records being manipulated, and combine that logic into all their database views and procedures. If these records do not get combined properly, disaster is sure to follow. Even worse, the logic is likely to not be clear to a new developer seeing the database for the first time, leading to potential "alternate" solutions.

The other problem with these implied statuses is that they require the processing time to be calculated. They're not readily available to pull from a table, like an explicit status.

The worst approach I've seen was to set up the main record table with a status field, not update it on an event, and then calculate an implied status. What happened then was developers creating new views tied to the table's status field, which didn't contain the correct status. This lead to digging into the system, because the previous development team was no longer with the company, to find why some statuses were not being filtered properly out of a view, and discovering the implied status calculation based on the events. It was very misleading a quite a headache to straighten out.

The lesson is that if you need to use implied statuses, make them clear and well documented. Otherwise, stick with updating the explicit status tied to a given record to store your status information.

DDP: Nested Views

A few jobs ago, I was taking over as the DBA at a local real estate finance company, and it was my first dedicated DBA position. Prior to that, I had done a combination of application development and database programming. Well, not long after I started, we had one of our developers, who also did some database programming, quit and I had to sit down with him for a "transfer of knowledge".

What I quickly learned was that this developer did not have a grasp on how our MSSQL database server worked. Looking at his code, I began to see a number of nested views in his views called by the application, and I asked him why he used those. He told me "views are pre-compiled in the database, so that makes them run faster", all while the views took 15-45 seconds to pull back their data with nesting as many as four levels deep. The truth is that when you use nested views in 90% of databases, each view needs to execute independently, and only once a nested view has retrieved it's data can the parent view begin processing. However, when I tried to correct this developer, who was about 20 years my elder, he dismissed me as a young programmer who didn't know what he was talking about. Instead of arguing with a guy I probably wouldn't see again after the end of the week, I just let it go and happily went about rewriting his views after he left and getting their execution time to be a fraction of what they had been.

The point of this isn't to brag though, but to educate those who will listen. Nesting views can be a bad idea for a number of reasons:

  1. As mentioned above, nested views execute in a cascaded fashion. If you have a large view and nest it in another view to filter it's contents down, it will always take at least as long as the initial view to execute. This is because the parent view doesn't know what data it's working with until the nested view has returned the data.
  2. Nested views often are written from an Object Oriented perspective, trying to have a base object that can be used in a number of other parent views. However, this often means that the nested view is pulling information that the parent view doesn't need because it is trying to accommodate a number of similar but differing parent views, which just chews up processing time.
  3. While using nested views often cleans up the appearance of the code in the view, it makes them more difficult to read because each nested view needs to be opened and evaluated to find the source of the data.
  4. Using nested views can become a slippery slope when code changes need to be made as well. Every change you make to a nested view will impact every parent view involved.

While not every database handles nested views in this manner (some sophisticated databases will create optimized execution plans for the view basically just using your code as a guide), the majority of them work this way. And even in the cases that don't work this way, we still have to worry about points 3 & 4, and how changes will impact parent views.

In general, your best bet is to not use nested views, as they are a bad database practice that have consequences that most developers and DBAs do not foresee. Even if it means you have to make the same update to code in multiple views, at least you know exactly what you're impacting with that change.

DDP: Don't Move Data If You Don't Have To

Today's Database Design Principle is based around data movement within the database, specifically when it doesn't need to be done.

The basic problem I came across here was that two identical tables (A and B) were being used, along with a shared key, to manage the status of information inside the database. The application had a web form that could be filled out, saved, and later recalled to be submitted. In order to differentiate between the saved and the submitted data, the saved form fields were stored in table A, with an assumed status of 'ENTRY', while the submitted form fields were being stored in table B, with an assumed status of 'SUBMITTED'. When a saved form was then submitted, the data from table A would be copied to table B, and then the record in table A would be deleted.

While this design was functional, there were some definite drawbacks to the approach:

  • Table Maintenance - If a new field were added, or a field deleted from the form, then the same change had to be made in two places in the database.
  • Doing With Two Transactions What Could Be Done With One - With this design, the submission of data then requires two transactions: one insert to populate table B, and a delete to remove the save record from table A. If a single table had been used with a status field, then this could have been accomplished with a single update transaction.
  • Complicating Queries and Views - With the two-table approach, I now have to union together the two tables and calculate a status field to get the full set of data. This just adds a layer of complexity to the database, and it increases exponentially with each time this structure is used. Imagine if you have information in three or more different forms that are stored like this that are linked together and all the extra code that is required to keep things straight.

In this case, the easiest solution, as I implied above, would be to maintain a single table with a status column to no longer make the status implied, but explicit. This not only makes any coding linked to the table much easier, but if you are going to do any reporting or allowing non-developer business analysts access to the tables, it will be much easier for them to read the data.

The lesson here is to keep your tables simple and logically laid out. If you can avoid the hassle of moving your data all over (especially just to maintain implied statuses), you should. It will decrease the amount of work that needs to be done, and will be easier to read for anyone else who joins takes over your project.

Database Design Principles

Over the past few years, I've done a significant amount of Database Administrator (DBA) work in Oracle, Microsoft SQL and MySQL, and I've seen a lot of different approaches to databases. Needless to say, some have been more successful than others, and some have downright confounded me. While everyone has their own style and approaches, there are just some that are bad and should be avoided.

Hence, I've decided to create a topic on my blog here for Database Design Principles (DDP for short) where I can share the general mistakes I've seen, along with better suggested approaches to the problem. Some of these will be specific to certain database applications, but I plan on keeping things pretty general in order to help as many people as possible.

So keep an eye out for this new intermittent feature here on my blog!