Here are some important features that I think all databases should have:
- A text standardization function to help with (1) data cleaning, (2) reducing the volume of text data, and (3) merging data coming from different sources or using different character sets.
- Ability to categorize information (text in particular, and tagged data more generally), using built-in or ad hoc taxonomies (with a customizable number of categories and subcategories), together with clustering algorithms. A data record can be assigned to multiple categories.
- Ability to efficiently store images, books, and records with high variance in length, possibly through an auxiliary file management system and data compression algorithms, accessible from the database.
- Ability to process data either locally on your own machine or remotely on the server, especially for compute-intensive computations performed on a small number of records. Also, optimized use of cache systems for faster retrieval.
- Offer SQL-to-NoSQL translation and an SQL code beautifier.
- Offer great visuals (integration with Tableau?) and useful, efficient dashboards (integration with BIRT?)
- API and web/browser access: database calls can be made via HTTPS requests, with parameters (argument and value pairs) embedded in the query string attached to the URL. Also, allow automatic retrieval of extracted / processed data from a server via HTTPS calls. This is a solution that enables machine-to-machine communication.
- DBA tools available to sophisticated users, such as fine-tuning query optimization algorithms, allowing hash joins, or switching from a row-based to a column-based layout when needed (to optimize data retrieval in specific cases).
- Real-time transaction processing and built-in functions such as computing "time elapsed since the last 5 transactions" at various levels of aggregation.
- Ability to automatically erase old records and keep only historical summaries (aggregates) for data older than (say) 12 months.
- Security (TBD)
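The text standardization bullet above could be sketched as follows. This is a minimal illustration in Python; the function name and the exact normalization steps (accent folding, lowercasing, whitespace collapsing) are my own assumptions, not a prescribed standard.

```python
import re
import unicodedata

def standardize_text(text):
    """Normalize text for cleaning, volume reduction, and cross-source merging."""
    # Fold accented characters to ASCII equivalents (NFKD decomposition,
    # then drop the combining marks) to reconcile different character sets
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase and collapse runs of whitespace to reduce text volume
    text = text.lower()
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(standardize_text("  Caf\u00e9   LATTE\n"))  # -> cafe latte
```

Applying the same standardization to every incoming source before merging means "Café" and "cafe" deduplicate to a single key.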
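The API bullet (database calls made via HTTPS with parameters in the query string) might look like this sketch. The endpoint URL and parameter names here are hypothetical, purely for illustration.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only
BASE_URL = "https://example.com/db/query"

def build_api_call(table, where, limit=10):
    """Build a database call as an HTTPS URL with argument/value pairs
    embedded in the query string attached to the URL."""
    params = {"table": table, "where": where, "limit": limit}
    return BASE_URL + "?" + urlencode(params)

url = build_api_call("sales", "region=EU", limit=5)
print(url)
# Another machine can then fetch this URL (e.g. with urllib.request.urlopen)
# and parse the returned payload, enabling machine-to-machine communication.
```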
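The "time elapsed since last 5 transactions" metric mentioned above can be computed with a fixed-size rolling window of timestamps. A minimal sketch, assuming the metric means "seconds since the oldest of the last 5 transactions"; the class and method names are my own.

```python
import time
from collections import deque

class TransactionTracker:
    """Track 'time elapsed since the last 5 transactions' in real time."""

    def __init__(self, window=5):
        # deque with maxlen keeps only the most recent `window` timestamps
        self.times = deque(maxlen=window)

    def record(self, timestamp=None):
        self.times.append(timestamp if timestamp is not None else time.time())

    def elapsed_since_window(self, now=None):
        """Seconds since the oldest of the last `window` transactions."""
        if not self.times:
            return None
        now = now if now is not None else time.time()
        return now - self.times[0]

t = TransactionTracker()
for ts in [0, 10, 20, 30, 40, 50]:
    t.record(ts)
print(t.elapsed_since_window(now=60))  # -> 50 (oldest of the last 5 is at t=10)
```

Running one tracker per aggregation level (per user, per account, overall) gives the metric at various levels of aggregation.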
Note that some database systems are very different from traditional DB architecture. For instance, I created a web app that finds keywords related to a keyword specified by the user. This system (it's more like a student project than a real app, though you can test it here) has the following features:
- It's based on a file management system (no actual database)
- It is a table with several million entries. Each entry consists of a keyword and a related keyword, plus metrics that measure the quality of the match (how strong the relationship between the two keywords is), as well as the frequencies attached to each of the two keywords and how often they are found together. The function that measures the quality of the match can be customized by the user.
- The table is split into 27 x 27 small text files. For instance, the file cu_keywords.txt contains all entries for keywords starting with the letters cu (it's a bit more complicated than that, but this shows you how I replicated the indexing capabilities of modern databases).
- It runs on a shared server; at its peak, hundreds of users were using it, and the time to get an answer (retrieve keyword-related data) for each query was under 0.5 seconds. Most of that time was spent transferring data over the Internet, with very little time used to extract the data on the server.
- It offers API and web access (and indeed, no other types of access)
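The 27 x 27 file-splitting scheme above amounts to bucketing keywords by their first two characters, with a 27th bucket for anything that isn't a letter. A minimal sketch of the idea (the original system is described as "a bit more complicated than that", so this is an approximation, not the actual code):

```python
def bucket_file(keyword):
    """Map a keyword to one of 27 x 27 small text files by its first two
    characters; non-letters fall into an extra '_' bucket (hence 27, not 26)."""
    def letter(ch):
        return ch if ch.isalpha() and ch.isascii() else "_"
    k = (keyword.lower() + "__")[:2]   # pad keywords shorter than 2 chars
    return letter(k[0]) + letter(k[1]) + "_keywords.txt"

print(bucket_file("customer"))  # -> cu_keywords.txt
```

A lookup then only has to scan one small file instead of the whole multi-million-entry table, which is how the scheme replicates a database index.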
Related articles:
- Interesting database questions
- Nasty data corruption getting exponentially worse with the size of your data
- SQL to NoSQL translator
- Excel for Big Data
- Fast clustering algorithms for massive datasets
- Big Data Analytics Ecosystem
- Source code for our Big Data keyword correlation API
- R in your Browser
- SQL: optimizing or eliminating joins?
Originally posted on Data Science Central