Data-driven web applications frequently require some sort of background data processing or indexing operation (building search engine indexes or aggregating event data, for example) that involves parsing and analyzing large data sets and extracting information from them. In some situations these tasks may even require parsing and combining several large data sets and relating data between them, and as the size of the data sets grows (along with the processor time required to parse them), these operations frequently need to be broken into discrete sub-tasks that save to or operate on temporary data storage. Decoupling each sub-task into discrete operations provides enhanced flexibility and makes for easier troubleshooting (especially when run times and data sets get very large), and careful design of the temporary data stores can make for even easier diagnosis when something you don't expect shows up somewhere in the gigabytes of XML you're stuck parsing.
The Flat Text File Index
A frequent first (and last) stop in this design process is the flat text file. It's dead simple to write the code to open a file on disk and start dumping data to it, the resulting file can be opened and inspected easily in any text editor, and (on most systems) choosing between memory-backed and disk-backed storage is as simple as choosing the mount or path the file is written to. At first glance this seems like a nearly ideal solution, and it is certainly simple to build, but once data sets start growing into gigabyte sizes and requirements start to change, things can get ugly. A common problem is the need to randomly access or search the temporary data set, and with flat text files you don't have many good options for doing this in a high-performance manner.
For example, imagine that as the first stage of a search index build process we want to pull the available commerce products from an upstream PIM system for later relation to some other data set (like the extended product data interchange formats now in common use in many industries). We may be dealing with millions of individual product records that take a significant amount of time to load from an API, so locally indexing these records makes sense. We save all of them into a giant flat text file (maybe CSV) for later stages to use, and all seems right with the world. That is, until QA sends back a note saying there appear to be a few duplicate products in the indexes, and now we have a serious problem to wade into. It's tempting to just declare that upstream data issues like this need to be fixed upstream, but in reality, when dealing with very large or automatically generated data sets, data errors will always creep in somewhere, and we're going to have to handle them.
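As a sketch of that flat-file stage (the file path, field layout, and sample records here are illustrative, not from a real PIM feed), note that nothing in the CSV itself catches the repeated SKU, and finding a record later means scanning the whole file:

```php
<?php
// Hypothetical product records pulled from the upstream PIM API;
// note the third record repeats the first record's SKU.
$products = [
    ['sku' => 'Test 1', 'name' => 'Test Product #1',           'price' => 9999],
    ['sku' => 'Test 2', 'name' => 'Test Product #2',           'price' => 7999],
    ['sku' => 'Test 1', 'name' => 'Test Product #1 Duplicate', 'price' => 9999],
];

// Dump everything to a flat CSV file -- dead simple to write, but the
// file happily accepts the duplicate SKU with no error raised.
$handle = fopen('/tmp/products.csv', 'w');
foreach ($products as $product) {
    fputcsv($handle, [$product['sku'], $product['name'], $product['price']]);
}
fclose($handle);

// Looking a record up later means a linear scan of the entire file,
// which gets painful as the data set grows into the gigabytes.
$handle = fopen('/tmp/products.csv', 'r');
while (($row = fgetcsv($handle)) !== false) {
    if ($row[0] === 'Test 2') {
        echo $row[1], "\n"; // Test Product #2
    }
}
fclose($handle);
```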
The SQLite Index
SQLite is a self-contained database engine that implements a full SQL feature set but is designed to operate in embedded environments using a binary file format (which can live in RAM or in a file on disk). SQLite is also high performance and very rigorously tested, and perhaps surprisingly, by many metrics it is the most used database engine in the world (largely due to its extensive use in embedded applications – the phone in your pocket right now probably has dozens or hundreds of SQLite database instances on it). In addition to allowing fast queries of data (including indexes), it supports transactions and unique constraints like any other database engine. Utilizing SQLite for this type of temporary index makes many of the tough problems with flat text files trivial to solve. The database storage typically consists of a single disk file, making it easy to manage, move around, and share as needed just like a text file, and also like text files it enjoys wide support across many different programming languages for easy interoperability. Languages that support other SQL databases usually support SQLite through the same interfaces, and it is typically used the same way as any other SQL database.
If you've used other SQL databases, SQLite should feel very familiar. Here's a simple example showing how we can interact with a database to solve our duplicate product problem (note that this would typically be abstracted into service classes in a production system; this is just a simplified example):
$dbFilePath = '/tmp/products.db3';
$dbConnection = new \PDO('sqlite:' . $dbFilePath);

$dbConnection->exec('
    CREATE TABLE IF NOT EXISTS products (
        id INTEGER PRIMARY KEY,
        sku TEXT NOT NULL,
        name TEXT NOT NULL,
        price INTEGER
    )
');
$dbConnection->exec('
    CREATE UNIQUE INDEX product_sku ON products (sku ASC)
');

$insert = $dbConnection->prepare('
    INSERT INTO products (id, sku, name, price)
    VALUES (:id, :sku, :name, :price)
');
$insert->execute([':id' => 1, ':sku' => 'Test 1', ':name' => 'Test Product #1', ':price' => '9999']);
$insert->execute([':id' => 2, ':sku' => 'Test 2', ':name' => 'Test Product #2', ':price' => '7999']);
// This third insert violates the unique index on sku and is rejected by SQLite:
$insert->execute([':id' => 3, ':sku' => 'Test 1', ':name' => 'Test Product #1 Duplicate', ':price' => '9999']);
Here we've set up a database connection (to a SQLite database located in the /tmp/products.db3 file) and created a new table for our product records, along with a unique index on the product SKU. Then a normal prepared query is built to insert some records and run with three test products (note that the third record has a duplicate product SKU, so the unique index rejects it). After inserting our data, we can connect to and query it in arbitrary ways as you'd expect (and very quickly, thanks to SQLite's internal indexing). Here we'll connect with the standalone sqlite3 command line tool (which is particularly handy for examining and debugging your index data manually), but the database can be used in code just like a standard SQL connection as well:
$ sqlite3
SQLite version 3.11.0 2016-02-15 17:29:24
sqlite> .open /tmp/products.db3
sqlite> select * from products where price > 5999;
1|Test 1|Test Product #1|9999
2|Test 2|Test Product #2|7999
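The same query can of course be run from application code through the PDO connection. A minimal sketch, reusing the /tmp/products.db3 file from above (the CREATE and INSERT OR IGNORE statements are included only so this snippet runs on its own; against the already-populated file they are no-ops):

```php
<?php
// Re-open the SQLite database file built in the earlier example.
$dbConnection = new \PDO('sqlite:/tmp/products.db3');
$dbConnection->setAttribute(\PDO::ATTR_ERRMODE, \PDO::ERRMODE_EXCEPTION);

// Standalone-run scaffolding: recreate the schema and seed data if absent.
$dbConnection->exec('CREATE TABLE IF NOT EXISTS products (
    id INTEGER PRIMARY KEY, sku TEXT NOT NULL, name TEXT NOT NULL, price INTEGER
)');
$dbConnection->exec('CREATE UNIQUE INDEX IF NOT EXISTS product_sku ON products (sku ASC)');
$dbConnection->exec("INSERT OR IGNORE INTO products (id, sku, name, price) VALUES
    (1, 'Test 1', 'Test Product #1', 9999),
    (2, 'Test 2', 'Test Product #2', 7999)");

// The same query run in the sqlite3 shell above, as a prepared statement.
$query = $dbConnection->prepare('SELECT sku, name FROM products WHERE price > :price');
$query->execute([':price' => 5999]);
$rows = $query->fetchAll(\PDO::FETCH_ASSOC);

foreach ($rows as $row) {
    echo $row['sku'], ': ', $row['name'], "\n";
}
```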
Just like that, our duplicate problem is solved with no additional application code required! SQLite also offers full transaction support and error handling, along with many other features, so even very large or complex indexes can be built with confidence that the data ends up indexed as expected.
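As a sketch of what that transaction and error handling might look like in a batch-load loop (an in-memory database and hypothetical record array keep this self-contained; the schema mirrors the products table from the earlier example):

```php
<?php
// In-memory SQLite database with the same products schema as before.
$dbConnection = new \PDO('sqlite::memory:');
$dbConnection->setAttribute(\PDO::ATTR_ERRMODE, \PDO::ERRMODE_EXCEPTION);
$dbConnection->exec('CREATE TABLE products (
    id INTEGER PRIMARY KEY, sku TEXT NOT NULL, name TEXT NOT NULL, price INTEGER
)');
$dbConnection->exec('CREATE UNIQUE INDEX product_sku ON products (sku ASC)');

$insert = $dbConnection->prepare(
    'INSERT INTO products (id, sku, name, price) VALUES (:id, :sku, :name, :price)'
);

// Batching many inserts into one transaction is dramatically faster in
// SQLite than committing each row individually.
$dbConnection->beginTransaction();
$duplicates = 0;
foreach ([
    ['id' => 1, 'sku' => 'Test 1', 'name' => 'Test Product #1', 'price' => 9999],
    ['id' => 2, 'sku' => 'Test 2', 'name' => 'Test Product #2', 'price' => 7999],
    ['id' => 3, 'sku' => 'Test 1', 'name' => 'Test Product #1 Duplicate', 'price' => 9999],
] as $product) {
    try {
        $insert->execute($product);
    } catch (\PDOException $e) {
        // The unique index on sku rejects the duplicate row; only the
        // failed statement is rolled back, so we can log it and move on.
        $duplicates++;
    }
}
$dbConnection->commit();

echo $duplicates, " duplicate(s) rejected\n"; // 1 duplicate(s) rejected
```

Note that a constraint violation under SQLite's default conflict handling aborts only the failing statement, not the enclosing transaction, so the surviving rows still commit as one batch.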