I am not an expert in database design; for most of my career I have worked with alternative data storage and data access solutions. But one of the very first projects I had to do back in 1985, when I was a student, was to write the code for a fully functional database architecture, in Pascal, from scratch. You will probably find some of my questions naive, and some intriguing.
- When were variable-width fields first introduced? How is this type of data stored physically, depending on the vendor or implementation: arrays, variable-length arrays, linked lists, or something else? Why is it always necessary to specify a maximum length? Is it to make indexing more efficient, or because of limitations in the way packets are transmitted across an intranet or the Internet? Is NoSQL technology better at dealing with variable-width records? (A sketch of one common storage layout appears after this list.)
- How do you store images or videos in databases? I'm talking about physically storing the videos, e.g. using the 'blob' binary data type. And how do you store vector images? What about in graph databases? (See the BLOB example after this list.)
- Why is importing CSV columns that contain variable-width text so cumbersome with SQL Server? Is it any easier with other database systems? (A small import sketch appears after this list.)
- Are there any fast, efficient database clients that let you perform some of the computations (say, sorting or simple analytics) in memory with traditional SQL: not on the database server itself, but locally on your machine, or even on some external machine? Or is the only option based on cloud-computing technology such as MapReduce and Hadoop? By a fast client, I mean something far superior to Toad or Brio (the two clients I have been working with): basic data selection (without joins) through their interfaces, on your desktop, is 100 times slower than a Python script connecting directly to the database server. (A sketch of this local-compute pattern appears after this list.)
- Code to run SQL queries 10 times faster than Brio, Toad, etc.
- Why is Vlookup (in Excel) 1,000 times slower than hash tables in Python?
- Did I create a new NoSQL database environment?
- SQL to NoSQL translator
- Nasty data corruption getting exponentially worse with the size of your data
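On the variable-width question: below is a minimal sketch, in Python, of one common physical layout, where each text value is written with a length prefix. It is an illustration only, not how any particular vendor does it; the 2-byte prefix is an assumption, and it shows one reason a declared maximum length is convenient: it bounds the prefix size and the record size the storage engine has to plan for.

```python
import struct

def pack_record(fields):
    """Serialize variable-width text fields, each preceded by a 2-byte length prefix.

    This mirrors one common length-prefixed layout; real engines use
    vendor-specific page and slot structures.
    """
    out = bytearray()
    for value in fields:
        data = value.encode("utf-8")
        if len(data) > 0xFFFF:  # enforce a declared maximum so the 2-byte prefix always fits
            raise ValueError("field exceeds declared maximum length")
        out += struct.pack("<H", len(data))  # 2-byte little-endian length prefix
        out += data
    return bytes(out)

def unpack_record(buf):
    """Read the fields back by walking the length prefixes."""
    fields, offset = [], 0
    while offset < len(buf):
        (length,) = struct.unpack_from("<H", buf, offset)
        offset += 2
        fields.append(buf[offset:offset + length].decode("utf-8"))
        offset += length
    return fields

row = pack_record(["Ada Lovelace", "mathematician"])
print(unpack_record(row))  # ['Ada Lovelace', 'mathematician']
```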
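On storing images and videos: the usual approach is a binary (BLOB) column written through a parameterized insert. The sketch below uses Python's built-in sqlite3 module as a stand-in for any SQL database; the file name and payload are made up for the example.

```python
import sqlite3

# A minimal sketch of BLOB storage; the same pattern (parameterized insert of
# raw bytes) applies to most SQL databases, not just SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE media (id INTEGER PRIMARY KEY, name TEXT, content BLOB)")

# In practice the bytes would come from a file, e.g. open("photo.jpg", "rb").read();
# here a placeholder byte string keeps the example self-contained.
payload = bytes(range(256)) * 4

conn.execute("INSERT INTO media (name, content) VALUES (?, ?)", ("photo.jpg", payload))
conn.commit()

name, size = conn.execute("SELECT name, length(content) FROM media").fetchone()
print(name, size)  # photo.jpg 1024
```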
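On CSV import: one workaround for awkward bulk loaders is to let Python's csv module do the parsing, since it copes with quoted variable-width text (embedded commas and line breaks), and then push rows through parameterized inserts. SQLite stands in here for SQL Server; the table, column names, and sample data are invented for the illustration.

```python
import csv
import io
import sqlite3

# Sample CSV with a quoted field containing a comma and a line break,
# the kind of variable-width text that trips up rigid bulk importers.
raw = 'id,comment\n1,"short"\n2,"a longer comment, with a comma\nand a line break"\n'

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER, comment TEXT)")

reader = csv.DictReader(io.StringIO(raw))  # csv handles the quoting rules
conn.executemany(
    "INSERT INTO notes (id, comment) VALUES (?, ?)",
    ((row["id"], row["comment"]) for row in reader),
)
conn.commit()

print(conn.execute("SELECT id, length(comment) FROM notes").fetchall())
```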
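On fast local computation: a common pattern is to pull the rows you need from the server once, then sort and aggregate them in local memory rather than round-tripping through a heavy client GUI. The sketch below illustrates the idea; an in-memory SQLite database stands in for the remote server, and the table and figures are invented.

```python
import sqlite3
from collections import defaultdict

# SQLite stands in for the remote database server in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 120.0), ("west", 75.5), ("east", 30.0), ("north", 200.0)])

# Single fetch over the wire...
rows = conn.execute("SELECT region, amount FROM sales").fetchall()

# ...then sorting and simple analytics happen locally, with no further queries.
totals = defaultdict(float)
for region, amount in rows:
    totals[region] += amount

for region, total in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(region, total)
```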
Originally posted on Data Science Central