I have been working with a whole mess of data as of late in a series of MyISAM tables. The short of it is I have a 256GB XML file that I want in MySQL. A few tools already exist designed for this specific data. Problem is they import everything and I only want about 75% of the data and they were not as fast as I know they could be. So, being how I have so much free time, I decided to explore this a bit further and wrote my own PHP script to handle the job.
I came the conclusion that the indexes were what was killing me. The database ends up being ~3.5 billion records– yes, it is nearly that much– and indexes are kind of a very important thing. First I tried with the indexes enabled for the whole import. I do not know how long it would have taken because the larger the database grew the slower it got, kind of like how you can never quite reach the speed of light. After that I tried creating the tables without any indexing and using
ALTER TABLE after. It looked good at first but when you have such huge tables it gets slower and slower with each added column (IE the first column indexed plenty fast but the second was half, the third half that, ect). The problem was, using
ALTER TABLE, you have to add them one-by-one.
After further researching it seems to me that the best way to do this is to add the indexes while creating the table and then just disabling them. Before any of the data is inserted you do a
ALTER TABLE table DISABLE KEYS and
SET FOREIGN_KEY_CHECKS=0. After you have done your massive, massive import you just do
SET FOREIGN_KEY_CHECKS=1 and
ALTER TABLE table ENABLE KEYS to re-enable them. This way MySQL does all the work at once and virtually any time you let MySQL handle the work itself– versus multiple calls via external script– you end up with a big benefit.
A few notes:
- The primary keys are still updated during insertion. I suppose you could use
ALTER TABLEafter to add those but I am pretty happy with the speed of this setup.
- I also tried
INSERT DELAYED(both alone and in combination with the above) and it resulted in a ~10% increase in time.
I just found this article. It seems I was wrong about having to index one column at a time but there are a few caveats. The most notable one is that the order in which you specify the columns has a direct result on if it can be used or not. Any way, this is a huge topic and that is why there are DBAs. Good luck to us all…