Ah…more Sql stuff to talk about…weeee.
In my last post I rambled on about what it’s like to create a database from scratch these days using a Visual Studio Sql Server Project template and a couple of the pitfalls that I encounter…
Grab a cup o’ coffee and get comfy because today I’m posting my thoughts on table design and when to normalize or not…with another post to follow that will further expand on what happens in Linq to Sql when you have it build objects out of the tables we design in this post.
I know…exciting stuff right? I know I’m already setting up a caffine drip IV into my arm too!
Ok…so the basics which I’m sure you all know if you read the TLA SQL and dind’t GTFO…a database is a way to logically store data. Put simply a database will let you define a structure for how you want to store data and then allow you to create relationships between that data. A table is a way to logically group your data. A column is one piece of data that’s being collected. A row is one record of data containing one or more columns of data. So those are the basics. If you can wrap your head around the above paragraph…that’s all you really need to make a good and performant database…now you just need the wisdom on how to apply the above things.
Creating the database is an important step, but often there’s a Sql Administrator who takes care of that stuff…and I’m more of a schema kinda guy. Sure I know how to do the creating stuff…but it’s not the part I like to focus on because the database configuration often needs to be fiddled with after it’s deployed, so when creating a new database I tend to do just enough to get it running and available over the network…then I work on the schema. A schema is a way to refer and look at the entire collection of tables, views, stored procedures, functions, and their relationships that define the data being stored within a database. Since out of those things, I’ve only mentioned tables and relationships…lets take a closer look at what those are.
So for a data set to use for my example, we’re going to build a database for a junk yard who gets junked vehicles and sells the parts off. Lets start picking apart what we can logically group together as data. So there’s going to be “vehichles”…all vehicles pretty much have a “make”, “model”, “year”, “number of wheels”, “engine”, “how many can ride in / on the vehicle”…that’s a good group of things that pretty much ever vehicle has. So here’s how we make a table for it and I’ll explain some things after.
CREATE TABLE [dbo].[Vehicle] ( [Id] INT IDENTITY(1, 1) NOT NULL, [Make] NVARCHAR(50) NOT NULL, [Model] NVARCHAR(100) NOT NULL, [Year] INT NOT NULL, [WheelCount] INT NOT NULL, [Engine] NVARCHAR(MAX) NOT NULL, [PassengerCapacity] INT NOT NULL, CONSTRAINT PK_Vehicle PRIMARY KEY CLUSTERED (Id ASC), CONSTRAINT UNIQUE_Vehicle unique ([Make], [Model], [Year]) ) GO CREATE NONCLUSTERED INDEX NCI_Make ON Vehicle([Make] ASC) GO CREATE NONCLUSTERED INDEX NCI_Model ON Vehicle([Model] ASC) GO CREATE NONCLUSTERED INDEX NCI_Year ON Vehicle([Year] ASC) GO CREATE NONCLUSTERED INDEX NCI_WheelCount ON Vehicle([WheelCount] ASC) GO CREATE NONCLUSTERED INDEX NCI_PassengerCapacity ON Vehicle([PassengerCapacity] ASC)
Ok, so there’s the basic layout. The first part before the CONSTRAINT definitions is where I’m actually defining the columns and what kind of data it contains. Id’s are generally always an INT…but I’ve seen them be GUID’s before. Make and Model are both strings…but in Sql we call them VARCHAR…and NVARCHAR is a unicode version of VARCHAR so I like to use that as it’s more friendly when doing globalization. Year, WheelCount, and PassengerCapacity are all INT’s…but Engine is setup as a NVARCHAR(MAX). What that’s saying is to allocate an obscenely huge chunk of space for a text blob that’s going to contain the defenition of the Vehicles engine. It could take a lot to explain an engine so I made it big.
Now we can talk about the constraints. The first one I’ve already spoken about in my last post…but the second one is new. What that one is saying is to setup a rule that defines what a “unique” vehicle is. I chose to say that for every combination of Make, Model, and Year…I’m considering that unique. What being unique does is it restricts any more data from being inserted if it matches a unique record that already exists within the table. Technically I should probably add “Engine” into that equation…however NVARCHAR(MAX) cannot be used in a unique constraint…so that’s why I left it out. I have a plan for that later though. 🙂
Now…onto the CREATE NONCLUSTERED INDEX lines. What the heck is a NONCLUSTERED INDEX? To be perfectly honest I used to know what the difference is between clustered and nonclustered but I’m a little fuzzy on it now and I’m going to be too lazy to look it up mostly because I know that you can only have one clustered index and the Id “usually” takes it up…for every database I have designed it does anyway. 🙂 So…let’s not worry about what the difference is…we’ll just focus on what an INDEX is. Think of it like a cheat sheet for the database. As you’re doing querying, the database is keeping track of things that are queried for…and as that happens it uses a private lookup table called the “INDEX” to help speed itself up. What those CREATE NONCLUSTERED INDEX statements are for is to tell the database to add those columns into it’s little cheat sheet. You may have noticed that there isn’t one for “Engine”. That’s because NVARCHAR(MAX) cannot be indexed…it’s too big…again, like I said I have a plan for that. Now just what should be indexed? Well…that’s a topic of great discussion. I tend to index everything possible…I feel it should never hurt to over-index things…the worst that’ll happen is those index’s just don’t get used. The most important things to index are Id’s…or better put any sort of column that is being used to reference a specific row in a table. So that means not just the identity column of a table, but Id’s being used in foreign key relationships as well…which I’ll be getting into in just a moment.
Ok…so now we have a basic table with some explination about what each part of the syntax is doing. Woo to the hoo…but Engine is causing us some trouble because it’s big. Can’t be used in consideration for uniqueness…can’t be indexed. That darn column is going to slow down our database! We can’t have that…so this is something in my experience that pops a red flag. Whenever you have a piece of data in a table that should be considered for uniqueness and should be indexed but can’t because it’s a huge NVARCHAR(MAX)…see if you can break that huge string into it’s own table. In our case Engine can probably be broken down to “make”, “cylenders”, “displacement”, “horsepower”…I’m sure there’s probably more things to call out, but I’m just coming up with example material so I’m not going to fully flesh this out. So here’s our second table…
CREATE TABLE [dbo].[Engine] ( [Id] INT IDENTITY(1, 1) NOT NULL, [Make] NVARCHAR(50) NOT NULL, [Cylenders] INT NOT NULL, [Displacement] NVARCHAR(20) NOT NULL, [Horsepower] INT NOT NULL, CONSTRAINT PK_Engine PRIMARY KEY CLUSTERED (Id ASC), CONSTRAINT UNIQUE_Engine unique (Make, Cylenders, Displacement) ) GO CREATE NONCLUSTERED INDEX NCI_Make ON Engine(Make ASC) GO CREATE NONCLUSTERED INDEX NCI_Cylenders ON Engine(Cylenders ASC) GO CREATE NONCLUSTERED INDEX NCI_Displacement ON Engine(Displacement ASC) GO CREATE NONCLUSTERED INDEX NCI_Horsepower ON Engine(Horsepower ASC)
Now we have 2 tables defined…but currently they really have nothing to do with one another. Sure if you look at them…someone with a brain can start relating the data…but Sql is dumb and has no idea how they relate to one another so we have to be literal and tell it. We do that by adding a foreign key relationship between the two. To this we use a CONSTRAINT. I’ll repost the code for the Vehicle table replacing the Engine column with a new EngineId column that points at the Engine table.
CREATE TABLE [dbo].[Vehicle] ( [Id] INT IDENTITY(1, 1) NOT NULL, [EngineId] INT NOT NULL, [Make] NVARCHAR(50) NOT NULL, [Model] NVARCHAR(100) NOT NULL, [Year] INT NOT NULL, [WheelCount] INT NOT NULL, [PassengerCapacity] INT NOT NULL, CONSTRAINT PK_Vehicle PRIMARY KEY CLUSTERED (Id ASC), CONSTRAINT UNIQUE_Vehicle UNIQUE ([EngineId], [Make], [Model], [Year]), CONSTRAINT FK_Vehicle_EngineId FOREIGN KEY (EngineId) REFERENCES Engine(Id) ) GO CREATE NONCLUSTERED INDEX NCI_EngineId ON Vehicle([EngineId] ASC) GO CREATE NONCLUSTERED INDEX NCI_Make ON Vehicle([Make] ASC) GO CREATE NONCLUSTERED INDEX NCI_Model ON Vehicle([Model] ASC) GO CREATE NONCLUSTERED INDEX NCI_Year ON Vehicle([Year] ASC) GO CREATE NONCLUSTERED INDEX NCI_WheelCount ON Vehicle([WheelCount] ASC) GO CREATE NONCLUSTERED INDEX NCI_PassengerCapacity ON Vehicle([PassengerCapacity] ASC)
Now you can see I’ve changed “Engine” to “EngineId”, changed the data type to an INT, modified the UNIQUE CONSTRAINT, added a new CONSTRAINT called FK_Vehicle_EngineId, and added an index for the EngineId column…that pretty much solves our problems but what does it really mean? The only new thing you haven’t seen is the CONSTRAINT for the FOREIGN KEY. So what this is saying is whenever something is being inserted into the Vehicle table…the Id being used in the EngineId column MUST exist in the Engine table before it can be inserted. So if you had a 2 engines in the Engine table, Id’s 1 and 2…and then tried insert the value 3 into the Vehicle.EngineId column…you’d get an error telling you Id 3 doesn’t exist in the Engine table.
The whole process we just went through is called “normalizing” Finding data that should be in it’s own table, creating a new table out of it, then setting up a foreign key relationship between the two. There are dangers to watch out for though. You should be aware on how this effects querying.
Before normalizing, this is pretty much what the query would look like…
SELECT [Make], [Model], [Year], [WheelCount], [Engine], [PassengerCapacity] FROM Vehicle
and now after normalizing we’d have to do something like the following…
SELECT [v.Make], [v.Model], [v.Year], [v.WheelCount], [v.PassengerCapacity], [e.Make], [e.Cylenders], [e.Displacement], [e.Horsepower] FROM Vehicle AS v JOIN Engine AS e ON v.EngineId = e.Id
See now how we had to add a JOIN to the statement in order to get to any of the Engine data…sure the data is a lot more defined and structured…however doing a join in a query is expensive in terms of perf. The more you have, the slower your query will get. Sometimes it’s an ok trade off…in this case, it was the right decision. Notice how there’s a “Make” column in each table now…here’s something else that “could” be normalized, but I’m choosing not too…it’s a single string of data and it just adds too much overhead and complexity to the table design and the t-sql needed to query against it to bother for a small being of string data.
So that’s where I’ll leave off for now. In my next post, I’ll show some Linq to Sql examples that really show the pitfalls of over-normalizing…but for now, hope this helps someone. Enjoy! 🙂