How do I efficiently import a large SQL query result into MATLAB?
I have a SQL table that contains the data I need for analysis: about 60 million records and 5 fields. What is the best way to get that into MATLAB, either as several variables or a single one? I have plenty of RAM (64 GB), but I find that the import takes a very long time and often encounters errors.
- Is the quickest way to import all data into a single cell array? A single table?
- Should I do a single pull with 'FetchInBatches' set to yes to avoid JVM errors?
- Should I loop through sections of the table and import it in batches that way?
- Should I loop over each field individually so that I can import each one into the right data type?
Thanks in advance for your help, Brian
Rahul Goel on 29 Oct 2015
It will help the community assist you better if you could share more details about your setup: for example, the database and the driver, and in some cases the bitness (32- or 64-bit).
Following are some observations with MySQL database using JDBC driver:
By default, the MySQL JDBC driver retrieves the complete 'ResultSet' and stores it in memory. This can lead to heap-space errors if you are working with 'ResultSets' that have a large number of rows, or values too large to be allocated in the heap space available to your JVM.
One workaround for this limitation is to rewrite the SQL query to use the 'LIMIT' and 'OFFSET' constructs.
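As an alternative worth trying first, the batched-fetch preferences mentioned in the question can be enabled via setdbprefs (a sketch; the preference names come from older Database Toolbox releases, so check the documentation for your version, and <table> is a placeholder for your table name):
setdbprefs('FetchInBatches','yes');      % fetch the result set in batches
setdbprefs('FetchBatchSize','100000');   % rows per batch; tune to your JVM heap
curs = exec(conn, 'SELECT * FROM <table>');
curs = fetch(curs);
data = curs.Data;
If this still runs out of heap space, the LIMIT/OFFSET approach below gives you finer control.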
The LIMIT clause limits the number of rows returned by a SQL statement. So if you have 1000 rows in a table but only want the first 10, you would do something like this:
SELECT column FROM table LIMIT 10
Now suppose you wanted rows 11-20. With the OFFSET keyword it's just as easy; the following query will do:
SELECT column FROM table LIMIT 10 OFFSET 10
Use the following MATLAB code, which makes use of these SQL constructs:
LIMIT = <limit>;   % batch size per query; tune for the best response time
conn = database(instance,username,password,Name,Value);
for fetchCount = 1:N
    query = ['SELECT * FROM <table> LIMIT ' int2str(LIMIT) ...
             ' OFFSET ' int2str((fetchCount-1)*LIMIT)];
    curs = exec(conn, query);
    rowlimit = <row_limit>;   % rows per fetch; tune for the best response time
    curs = fetch(curs, rowlimit);
    while ~strcmp(curs.Data, 'No Data')
        data = curs.Data;     % process or accumulate this batch here
        curs = fetch(curs, rowlimit);
    end
    close(curs);
end
close(conn);
As noted in the comments in the code above, please choose limit and row_limit values that give the best performance. Response time as a function of these values follows a bell-curve trend, so there is an optimal value; for instance, try multiples of 10^N depending on the number of rows in your table.
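If you want everything in a single MATLAB variable at the end, avoid growing an array inside the loop; collect each batch in a cell array and concatenate once with vertcat. A sketch, reusing conn, N, LIMIT, and the <table> placeholder from the code above:
batches = cell(N,1);                 % preallocate one cell per batch
for fetchCount = 1:N
    query = ['SELECT * FROM <table> LIMIT ' int2str(LIMIT) ...
             ' OFFSET ' int2str((fetchCount-1)*LIMIT)];
    curs = fetch(exec(conn, query));
    batches{fetchCount} = curs.Data; % keep the batch; no array growth
    close(curs);
end
allData = vertcat(batches{:});       % concatenate all batches once
Concatenating once at the end keeps the copy cost to a single allocation instead of one per iteration.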
Hope this helps.