Horizontal Database scaling

Handruin

Administrator
Joined
Jan 13, 2002
Messages
13,862
Location
USA
Has anyone done any work with larger databases that scale horizontally, versus trying to increase hardware for improved performance (when dealing with millions of rows of data)?

What I mean by horizontal is that tables are broken up by time (month/quarter), so that older, less frequently accessed data lives in its own table instead of having millions of rows all in a single table. We collect large amounts of data for reporting, and I'm trying to find good examples of spanning horizontally; I was curious if anyone had any pointers on this. The database engine in use is SQL Server 2005. I wrote the tool that collects the data, and now I'm extending it with the idea of moving older data into tables based on specific date/time stamps.
 

LiamC

Storage Is My Life
Joined
Feb 7, 2002
Messages
2,016
Location
Canberra
Sounds to me like you're after partitioning. You would partition the table on (say) month or year. In effect, you have multiple smaller tables that look to be one logical large table. Oracle and DB/2 certainly can do this, and I would assume that SQL Server could.
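For reference, a minimal sketch of how this looks in SQL Server 2005; all object, table, and column names here are invented for illustration:

```sql
-- Partition function: the boundary dates carve the datetime range into
-- four partitions (before 2006, 2006, 2007, 2008 onward).
CREATE PARTITION FUNCTION pfByYear (datetime)
AS RANGE RIGHT FOR VALUES ('2006-01-01', '2007-01-01', '2008-01-01');

-- Partition scheme: maps every partition to a filegroup (all to PRIMARY
-- here; in practice you might spread them across filegroups on different
-- disks).
CREATE PARTITION SCHEME psByYear
AS PARTITION pfByYear ALL TO ([PRIMARY]);

-- The table is created on the scheme: one logical table, four physical
-- partitions, chosen by the value of CollectedAt in each row.
CREATE TABLE CollectedData (
    CollectionID int      NOT NULL,
    CollectedAt  datetime NOT NULL,
    Summary      varchar(max) NULL
) ON psByYear (CollectedAt);
```

Queries address CollectedData as one table; the storage engine routes rows to partitions by CollectedAt.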
 

time

Storage? I am Storage!
Joined
Jan 18, 2002
Messages
4,932
Location
Brisbane, Oz
Unfortunately, partitioning does not automatically improve performance, and particularly on MS SQL Server, may well decrease performance. Any query that crosses the boundary between partitions will suffer a performance penalty.

It can be useful where you need to access large contiguous chunks of data that are still only a fraction of the whole, e.g. one year out of seven, or one month out of three years. It only makes sense if your queries need to do full table scans rather than use indexes for the selected period.
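To make the boundary point concrete, here is a sketch assuming a table partitioned by year on a datetime column (hypothetical names): a range predicate that stays inside one partition only touches that partition, while a wider range has to visit every partition it covers.

```sql
-- Stays within the 2007 partition: only that partition is scanned.
SELECT COUNT(*)
FROM CollectedData
WHERE CollectedAt >= '2007-03-01' AND CollectedAt < '2007-04-01';

-- Crosses two partition boundaries: three partitions must each be
-- visited, and the per-partition overhead adds up.
SELECT COUNT(*)
FROM CollectedData
WHERE CollectedAt >= '2006-06-01' AND CollectedAt < '2008-04-01';
```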
 

Handruin

Administrator
I had looked into partitioning with SQL Server, but was reluctant to implement it due to the complexity and the limit of 1,000 partitions per table (so I read). What I was looking at doing was creating my own version of this practice: moving data into tables based on date, for every month or maybe quarter. I can't find specifics on how others do this, and we have to do it with SQL Server 2005 (no other database option will be considered). I see (and read) lots of suggestions about horizontal scaling, but no one offers examples of how they build the SQL programmatically. I'm not a DBA by any means, but it's a task I have to figure out...
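For what it's worth, the hand-rolled version is usually built with dynamic SQL: construct the monthly table name, then move the rows. A rough sketch follows; the table and column names are assumptions, and in practice you would wrap the move in a transaction:

```sql
-- Archive one calendar month of rows into its own table, e.g. Archive_200801.
DECLARE @first datetime, @next datetime, @name sysname, @sql nvarchar(max);
SET @first = '2008-01-01';
SET @next  = DATEADD(month, 1, @first);
SET @name  = 'Archive_' + CONVERT(char(6), @first, 112);  -- yyyymmdd cut to yyyymm

SET @sql = N'SELECT * INTO ' + QUOTENAME(@name) + N'
             FROM CollectedData
             WHERE CollectedAt >= @first AND CollectedAt < @next;
             DELETE FROM CollectedData
             WHERE CollectedAt >= @first AND CollectedAt < @next;';
EXEC sp_executesql @sql,
     N'@first datetime, @next datetime', @first, @next;
```

Scheduling this monthly (e.g. as an agent job) gives the manual equivalent of a sliding window.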

time, the performance penalty you speak of will be acceptable, because as our data gets older it will be accessed less frequently. From what I've been reading, the model I'm after is something called a sliding window. What's most important is the performance of reading the current month's data.
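If it helps, the sliding window on a partitioned table is normally done with ALTER TABLE ... SWITCH, which detaches a whole partition as a metadata-only operation, so the old rows never need to be copied. A sketch, assuming a table CollectedData partitioned yearly by a function pfByYear on scheme psByYear, and a staging table with an identical structure on the same filegroup (all names invented):

```sql
-- 1. Switch the oldest partition out of the live table into the staging
--    table; this moves metadata only, not rows, so it is nearly instant.
ALTER TABLE CollectedData SWITCH PARTITION 1 TO CollectedData_Staging;

-- 2. Remove the now-empty oldest boundary.
ALTER PARTITION FUNCTION pfByYear() MERGE RANGE ('2006-01-01');

-- 3. Tell the scheme where the next partition will live, then open a new
--    boundary at the leading edge of the window.
ALTER PARTITION SCHEME psByYear NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pfByYear() SPLIT RANGE ('2009-01-01');
```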

Fushigi, the HSM is certainly interesting, but in my environment we don't have the luxury of tiered storage. I'm looking for more specifics at the database table level right now (though I understand the importance of storage and location). We are using the highest tier of storage for everything, so that's about the only thing going for the project, even if it means higher cost right now.
 

blakerwry

Storage? I am Storage!
Joined
Oct 12, 2002
Messages
4,203
Location
Kansas City, USA
Website
justblake.com
Partitioning does sound right up your alley. But before you get to that point...

How many tables and rows are you talking about, what data size per row, and what data types? Have you optimized your queries and indexes? Are you using any queries that pull data from more than one table, etc? Have you monitored slow queries?

Also, is this something that can be distributed... e.g. Can you have multiple servers in sync with each other (replication) and have requests read from each?
 

ddrueding

Fixture
Joined
Feb 4, 2002
Messages
19,671
Location
Horsens, Denmark
Replication was what I had in mind, but I am not a DBA.

Replicating the more recent data to another server, and running the common queries off that, seems like a not-so-complicated answer.
 

blakerwry

Storage? I am Storage!
Unfortunately, dealing with separate tables or databases (or potentially servers) based on date can be a little tricky.

First, it requires some additional logic on the front end to detect where to look, and second, this logic must stay in sync with the back end storing the data.

If you're creating new databases per year, month, or whatever, your database permissions may also create another issue.

Partitioning was designed specifically to address these issues: it presents a single, visibly unified database that lets you forgo the additional front-end complexity. For data sets that are stored and queried by date range, partitions are great.

This sounds like what Handy described. I would suggest creating a db backup and trying it out. If it doesn't work you can always revert back to your backup.



Your mention of another server may work. To implement such a setup, I would probably configure one server to have all the data and another server to have only recent data. When a query requires looking at more than just recent data, use the server with all the data otherwise use the 'recent' server. You could also do the same thing on a single server using two databases.


You could potentially accomplish this with one-way replication. I'm not sure how the specifics work in MSSQL, but MySQL would allow replicating the 'all' server to the 'recent' server; nightly, you could delete old records from the 'recent' server. If you are not modifying old records, that's it. If you are, then you'd need to manually configure the 'recent' server to ignore certain replication error types, or not to halt on replication errors at all. I'd figure MSSQL is similar.
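In MSSQL the nightly cleanup on the 'recent' server could be a simple scheduled job; deleting in batches keeps each transaction, and therefore the log, small. A sketch with assumed table and column names:

```sql
-- Trim anything older than 90 days from the 'recent' copy, in batches.
DECLARE @cutoff datetime;
SET @cutoff = DATEADD(day, -90, GETDATE());

WHILE 1 = 1
BEGIN
    DELETE TOP (10000) FROM CollectedData
    WHERE CollectedAt < @cutoff;
    IF @@ROWCOUNT = 0 BREAK;   -- nothing left to delete
END
```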

You could also accomplish this by writing all data to both servers in your front end, or wherever data is modified.
 

blakerwry

Storage? I am Storage!
Though I think a good review and double-check to ensure you've covered the basics (indexing, optimization) should be done before considering making something more complex and error-prone.
 

Handruin

Administrator
blakerwry said:
> Partitioning does sound right up your alley. But before you get to that point...
>
> How many tables and rows are you talking about, what data size per row, and what data types? Have you optimized your queries and indexes? Are you using any queries that pull data from more than one table, etc? Have you monitored slow queries?
>
> Also, is this something that can be distributed... e.g. Can you have multiple servers in sync with each other (replication) and have requests read from each?

OK, I've been busy, so I haven't had a chance to reply until now. The number of tables is small right now. We have basically two tables: one for the date and collection ID, and the other has the meat of the data.

The number of rows will be several million (30-50M) over the next year, maybe two. The data size per row will vary based on the content collected, but only one column varies greatly (plain text from a problem summary). We've done our best with the queries, but given the nature of what we have to do, we do suffer some queries that use DISTINCT, which isn't ideal.

This database collects from our ClearQuest bug-tracking database hourly. We aren't given access to the CQ database, so we've had to do our own collections. We don't normalize the data pulled in, so we get a straight dump row by row (which I've never liked, but it's what we've had to do since we don't have access to their DB... the data is pulled through the ClearQuest API).

Yes, almost all the SQL queries pull from both tables; one holds the collection ID and a DateTime stamp of the collection. We aren't seeing any slow queries as of right now, but our concern is that we may down the road if the table gets very large.

We are awaiting two servers to build a Microsoft MSCS active/active cluster; we might be able to put them in sync with each other, or we might offload some of the other databases between the two systems. For now we're basing everything off one SQL server.
 

Handruin

Administrator
blakerwry said:
> Unfortunately, dealing with separate tables or databases (or potentially servers) based on date can be a little tricky.
>
> First, it requires some additional logic on the front end to detect where to look, and second, this logic must stay in sync with the back end storing the data.
>
> If you're creating new databases per year, month, or whatever, your database permissions may also create another issue.
>
> Partitioning was designed specifically to address these issues: it presents a single, visibly unified database that lets you forgo the additional front-end complexity. For data sets that are stored and queried by date range, partitions are great.
>
> This sounds like what Handy described. I would suggest creating a db backup and trying it out. If it doesn't work you can always revert back to your backup.
>
> Your mention of another server may work. To implement such a setup, I would probably configure one server to have all the data and another server to have only recent data. When a query requires looking at more than just recent data, use the server with all the data; otherwise use the 'recent' server. You could also do the same thing on a single server using two databases.
>
> You could potentially accomplish this with one-way replication. I'm not sure how the specifics work in MSSQL, but MySQL would allow replicating the 'all' server to the 'recent' server; nightly, you could delete old records from the 'recent' server. If you are not modifying old records, that's it. If you are, then you'd need to manually configure the 'recent' server to ignore certain replication error types, or not to halt on replication errors at all. I'd figure MSSQL is similar.
>
> You could also accomplish this by writing all data to both servers in your front end, or wherever data is modified.

(In my case, the back-end move is done with a SELECT INTO statement based on the dates: current date minus 90 days.)

The complexity is the part I'm most concerned with. I've done a small implementation for now because we needed to get something going. I've built an archive table that holds all records past 90 days. The production table will only hold 90 days' worth of data, and my collection tool will move the old data into the archive nightly.

Now I'm having issues with writing unions between the two tables, since I've found a severe lack of examples of this type of scaling. I know people have done tables based on months for this sort of thing, but I've not seen how they combine queries across the tables with this sort of strategy.
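One way to do this, assuming the production table holds the last 90 days and the archive holds everything older (the names here are placeholders), is to hide the split behind a view; the date predicate on each branch keeps any overlapping rows from showing up twice:

```sql
CREATE VIEW AllCollectedData AS
SELECT CollectionID, CollectedAt, Summary
FROM   CollectedData                 -- production: recent rows
WHERE  CollectedAt >= DATEADD(day, -90, GETDATE())
UNION ALL
SELECT CollectionID, CollectedAt, Summary
FROM   CollectedDataArchive          -- archive: older rows
WHERE  CollectedAt <  DATEADD(day, -90, GETDATE());
```

Queries then go against AllCollectedData; when a query's own WHERE clause contradicts one branch's predicate, the optimizer can often skip that branch entirely.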

How does partitioning handle the combined queries, or know, when I'm selecting data, to pull in the additional tables? Maybe this is what I should be doing instead of trying to union my own queries and logic in the back end. All of our SQL is run in ASP.net web services, and the front end sees no difference from the changes I'm working on to handle the 90-day archive we've started. As you can tell, I've never done partitioning and have only read a handful of information, but nothing explained to me whether the joining of data (once partitioned) is magically handled by the DB...

All my development is done on a test database server, so I can certainly try this out. The part which concerned me with partitioning in SQL Server 2005 was that it puts things in different filegroups, and I'm not so certain why. My guess is that this lets it break up the load between the actual storage files, for locking issues, etc.?

I've done replication with MySQL, but not MSSQL, so I'm definitely learning more. The trouble is that I have to be both the DBA and the developer, and I'm learning both as I go...
 

Handruin

Administrator
blakerwry said:
> Though I think a good review and double-check to ensure you've covered the basics (indexing, optimization) should be done before considering making something more complex and error-prone.

I've done this; I've gone through and even added a couple more indexes over the past week to improve queries. The performance right now is actually fine. What I'm worried about is the growth that is going to happen in the next few months, once we open the doors to all the other projects we will be collecting.
 

ddrueding

Fixture
Rather than attempting to have queries bridge the two databases, I was thinking that the "current" database would only replicate 90 days' worth of the "old" database, and "old" would hold _all_ records, including the last 90 days. The "old" database would be the master that received writes.

The logic would have to be in the front-end, where queries that were within 90 days would go to "new", and any that involved any older data went to "old".

Again, I don't know what I'm talking about, but I'm interested in learning everything I can.
 

Handruin

Administrator
ddrueding said:
> Rather than attempting to have queries bridge the two databases, I was thinking that the "current" database would only replicate 90 days' worth of the "old" database, and "old" would hold _all_ records, including the last 90 days. The "old" database would be the master that received writes.
>
> The logic would have to be in the front-end, where queries that were within 90 days would go to "new", and any that involved any older data went to "old".
>
> Again, I don't know what I'm talking about, but I'm interested in learning everything I can.

I think I'm on the same track as you described... my boss and I discussed this same config just this evening before I left work. I'm going to change over to the model you described: the archive table will hold everything, so there will basically be 90 days of overlap with the production table. I've already written the class/code to handle the 90-day check, and I can apply it once I change my collection tool to handle the new way of having the archive table and production table hold the same data.
 