Categories: MSDN / DotNet / Java / Scripts / Linux / PHP Ask - La ask - La Answer

Best Way to Store Large Dataset for use in Simulation

I am creating a simulation that requires "typical hour" data-value inputs for different entities for each hour of a single "typical week" for each month of a "typical year". In total, I have approximately 2000 x 56 values (typical hours x entities) that are currently stored in a matrix-style spreadsheet in excel.

Currently, in order to extract the hourly value for each entity, I am using the Month(), Weekday() and Hour() functions in VBA to traverse the matrix.

My problem is that I need to incorporate a few more similarly-structured datasets and that I eventually want to get away from Excel (preferably c/c++ or vb.net only).

Is there a better way to store/access my data?

(In advance, sorry for crossposting but my question is interdisciplinary.)
[795 byte] By [Nai] at [2007-11-11 8:10:44]
# 1 Re: Best Way to Store Large Dataset for use in Simulation
Certainly, a custom - app is going to fit your needs better. However, you need a clearer set of requirements about how you plan to visualize and process the data. For example, I do a lot o real-time linear algebra, and I store everything as a double * because this is compatible with lapack & matlab. You might want a 3-d array instead, to be able to talk about Data[march][23][1800] or the like. Or you might find vectors or valarrays (check out the slice on valarrays!!!) to support something that makes your life easy.

Excel is an option too, however our datasets are too large for it so we use matlab scripts to plot our data. In my opinion, Matlab is way expensive and overkill if you dont *have* to have it.

Anyway, sit down and decide how you need to plot (if any), how you need to process (rows, cols, other?), and how you need to be able to access it, and any other ideas (text or binary file storage, database needed/handy, more?).
jonnin at 2007-11-11 21:01:33 >
# 2 Re: Best Way to Store Large Dataset for use in Simulation
Thanks for the advice,

However, I do have a clear idea of how I am going to access some of my datasets.

For example,

One data set currently consists of about 50 entities (columns) by 2000 hours (rows), with the hours being divided into a "typical week" for each month. What I do with this dataset is that as my hourly simulation runs, it extracts the "typical" hourly input value for each entity for the current hour, then the next, and so forth. This involves moving back and forth among the rows.

In the future, this dataset will grow by both the number of hours modelled and entities and I will be working with many similar datasets like it for various groups of entities.

Matlab is an option, but I would need to create a runtime since this app will be used by people who do not have matlab (something I have never done before).

As for binary files... maybe you can offer some enlightenment :)

Thanks.
Nai at 2007-11-11 21:02:38 >
# 3 Re: Best Way to Store Large Dataset for use in Simulation
This sounds fairly trivial in C++ -- you don't do any odd column-wise processing or graphing then. A good old fashoned 2-d array of a struct or class will do fine, or a 3-d one if you prefer the month/week/hour. You can swap in a vector or valarray if you prefer objects for everything or find their build in tools to be handy for the job at hand.

Binary files are smaller and more efficient to read/write because each record is the same size -- but you can't read them with a text editor anymore. C++ makes this trivial as well. Basically its the raw bytes that make up a number (for example, zero in a 32 bit integer is 00 00 00 00 ) While a text file might have "0" or "0.0" instead.

Matlab compiles into sort of C code but they make it difficult (basically, they want to hide their proprietary matrix algorithms). Its terribly slow even when compiled. For so simple a task, a small C++ program should be good enough.

The data set you mention is not really that large and you should not have to take any special efforts until it gets much larger -- it sounds like the entire file will fit into ram as is, even after you have added more, correct (the data files are less than say 128 MB in size)
jonnin at 2007-11-11 21:03:32 >
# 4 Re: Best Way to Store Large Dataset for use in Simulation
Thanks for bearing with my ignorance, but one more question.

Is it better to bring large amounts of data into memory (RAM) rather than access it through database recordsets (eg. ADO, ADODB.recordset)?. For example, another database I am using involves 8760hours x #Entities x #years values that I am accessing sequentially by moving forward through an access recordset. Is there a better way of doing this?
Nai at 2007-11-11 21:04:32 >
# 5 Re: Best Way to Store Large Dataset for use in Simulation
Going through the database mechanics is going to be slower but will provide power (query / report is awesome for some tasks). If you forsee using database like access and all, then of course those tools are going to be better than re-inventing the wheel. If you just want a small program to provide a line from a data file, c++ will be much faster and fairly simple to write. Does speed even matter (is this a realtime simulation?). (By default, I usually consider speed first, I work in an odd environment however. )

Anytime you use a data store you should strive for O(1) search time (for example, to fetch a number from an array in location 10, you get it instantly by array[10] without having to search through array[0], array[1], ...). This is easy to accomplish via 'hash' algorithms & the like. However, if your current method is "fast enough" a code re-write is wasted time. If the hash algorithm is too much effort simply sorting the data and doing a binary search will return your data a lot faster than a linear search -- this is fairly basic "data structures and algorithms" if you are a computer science graduate and have studied these "formal" topics. If not, you can probably find a good used book on the concepts fairly cheap.

All this stuff is language independent -- Data storage and retreival are algorithms and concepts not C++ or Basic or excel. Any "real" language will do here. THe same for file access and the like, and also for database plugins. C++ has no knowledge of "database" until you use a library from someone -- its not part of the language at all. Pick the language that you know best which supports the tools you need.
jonnin at 2007-11-11 21:05:42 >