It is impossible to cover every new database technology innovation in a short article, but as TM Data ICT Solutions, we would like to share some of the recent trends we observed while implementing and evaluating database systems for our clients, particularly for time series, IoT (Internet of Things), machine learning, and other types of data sets. We believe that key decision makers, solution architects, data engineers, and data scientists will want to take these trends into account for their ongoing and upcoming data projects.
Similar to the famous saying, there are decades where (almost) nothing happens; and then there are years where decades happen. When it comes to innovation in the world of database systems, we are witnessing a similar technology explosion and expect things to get even more exciting in 2022 and beyond.
Recent and Upcoming Database Technologies and Trends
One of the database systems we worked with recently is TimescaleDB, developed as a PostgreSQL extension. The primary reason you would be interested in TimescaleDB is that you are ingesting a lot of time series data from IoT, IIoT, energy, and similar data sources, but your team doesn’t want to give up its know-how for a proven, mature relational database management system such as PostgreSQL. Even though PostgreSQL has built-in partitioning support, wrestling with these PostgreSQL primitives to build a database system that ingests, stores, and processes tens of millions of rows or more per day is generally not where your business adds value. An extension such as TimescaleDB takes care of many of the underlying PostgreSQL primitives, so that you can focus on adding business value for your millions of rows, easily growing to billions of rows and consuming terabytes of storage space. And speaking of storage space: even though storage is considered cheap nowadays, and there are a lot of scalable solutions, everything still comes with a price tag. When you know that some parts of your data set can be easily compressed because they contain a lot of repeating values and patterns, you will want your database system to use storage in a smart manner, compressing data transparently based on different data patterns, and letting you query compressed data without thinking about compression.
We observed that TimescaleDB was very good at compressing repetitive time series data, sometimes achieving 7x to 10x compression ratios, letting us use storage space in an efficient manner without hurting query performance.
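To see why repetitive time series compress so well, consider a toy sketch of two classic techniques, delta encoding and run-length encoding. (TimescaleDB’s actual compression uses more sophisticated algorithms, such as delta-of-delta and Gorilla-style encodings; this is only an illustration of the principle.)

```python
from itertools import groupby

def delta_encode(values):
    """Store the first value, then only each difference to the previous value.
    Regular time series (e.g. one timestamp per second) become long runs of
    a single repeated delta, which compresses extremely well."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def run_length_encode(values):
    """Collapse runs of identical values into (value, count) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

# A sensor reporting every second with a constant reading:
timestamps = list(range(1_000_000, 1_000_010))  # 10 timestamps, 1 s apart
readings = [21.5] * 10                          # 10 repeated readings

encoded_ts = run_length_encode(delta_encode(timestamps))
encoded_rd = run_length_encode(readings)

print(encoded_ts)  # [(1000000, 1), (1, 9)] -- 10 values stored as 2 pairs
print(encoded_rd)  # [(21.5, 10)]           -- 10 values stored as 1 pair
```

The more regular the data, the longer the runs, which is exactly the pattern machine-generated time series exhibit.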
TimescaleDB has a lot of great features, and their team is constantly adding new ones focused on time series querying (such as compression for continuous aggregates), but you should also be aware that, in terms of operations, it is PostgreSQL for all practical purposes, with almost all of the pros and cons of operating PostgreSQL databases, especially when you’re dealing with a multi-node TimescaleDB cluster for horizontal scalability. Having said that, we predict many teams will be happy to have their regular PostgreSQL databases and tables sitting transparently next to TimescaleDB hypertables or distributed hypertables with full SQL compatibility.
The company behind this system, Timescale, has received a lot of attention from developers and investors alike because of their latest funding round, and Timescale co-founder & CTO Michael Freedman shed light on upcoming developments related to TimescaleDB and PostgreSQL in a live Twitter Spaces chat. One of the exciting things Prof. Freedman mentioned was the PostgreSQL Table Access Method interface and building extensions on top of it, which we expect to remain a hot topic for the current PostgreSQL 14.x and upcoming major versions. Only a few days after this interesting talk, a new database technology, again based on PostgreSQL, was announced by another team: OrioleDB, “a new storage engine for PostgreSQL, bringing a modern approach to database capacity, capabilities and performance to the world’s most-loved database platform”. This generated a lot of interest, too, as can be seen in “OrioleDB – solving some PostgreSQL wicked problems | Hacker News“. Buried deep inside the heated discussions was a mention of the Table Access Method: “important distinction – this uses the Postgres Table Access Method to add an additional storage engine to the Postgres platform.. this is NOT a replacement for the current storage engine and can co-exist and be used on a per-table basis”.
Another powerful database management system we had the opportunity to evaluate was CrateDB. Unlike the database technologies we mentioned earlier, CrateDB is built on Elasticsearch and Lucene, and even though these are not the first technologies that come to mind for large-scale time series processing, CrateDB’s shared-nothing, horizontally scalable architecture, coupled with its full support for the PostgreSQL wire protocol, good-enough SQL compatibility, and zero-downtime rolling upgrades, were among the strong points of this database technology, together with its built-in web-based administration user interface. Thanks to its shared-nothing architecture and PostgreSQL wire protocol compatibility, it was also straightforward to put an HAProxy load balancer in front of our CrateDB installations, making it operationally comfortable to work with.
Its sharding and partitioning capabilities, as well as its default compression, made dealing with hundreds of millions of rows of time series data very performant and manageable, and we expect similar results for billions of rows, reflecting the ever-growing nature of time series data, especially when you are dealing with hundreds of metrics for thousands of devices, sending data every second.
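To make the idea of combining time-based partitioning with hash-based sharding concrete, here is a minimal, hypothetical routing sketch. CrateDB handles this internally; the shard count, hashing scheme, and function names below are purely illustrative.

```python
import hashlib
from datetime import datetime, timezone

NUM_SHARDS = 6  # hypothetical cluster-wide shard count

def route(device_id: str, ts: datetime):
    """Illustrative routing: a row lands in a time-based partition
    (here: one per month) and, within it, a hash-based shard.
    Queries filtered by time range can then skip whole partitions."""
    partition = ts.strftime("%Y-%m")
    digest = hashlib.md5(device_id.encode("utf-8")).hexdigest()
    shard = int(digest, 16) % NUM_SHARDS
    return partition, shard

row_ts = datetime(2022, 3, 14, tzinfo=timezone.utc)
print(route("device-0042", row_ts))  # ('2022-03', <some shard in 0..5>)
```

The point of the sketch is the division of labor: partitions bound how much data a time-range query touches, while hash-based shards spread write and read load evenly across nodes.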
We also evaluated another database, not from a time series perspective, but to answer the following question: “what if you want a fully horizontally scalable PostgreSQL?” YugabyteDB, a relatively new and innovative database technology, turned out to be a strong answer. What drew our attention was the fact that the YugabyteDB team reused PostgreSQL’s query layer to achieve a high degree of compatibility with PostgreSQL applications. Having the nice features of a mature relational database management system was very important for developer experience, and having easy horizontal scalability and resilience was important from a performance and operational perspective. Migrating some of the complex PostgreSQL databases, their data, as well as stored procedures, turned out to be easier than we expected. On the other hand, one thing to be aware of is that YugabyteDB is, at the end of the day, a different, distributed database technology, and even though interfacing with it is almost the same as interfacing with PostgreSQL, the underlying differences should be taken into account, not only when it comes to database design and modeling, but also with respect to expectations for latency, throughput, etc. The curious reader can find a lot of detail in “The cost and benefit of synchronous replication in PostgreSQL and YugabyteDB”, written by one of the YugabyteDB experts. (See also “Which 🐘PostgreSQL problems are solved with 🚀YugabyteDB“.)
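The latency point deserves a concrete illustration. In a Raft-style replicated database such as YugabyteDB, a write commits once a majority of replicas have acknowledged it, so commit latency tracks the quorum-th fastest replica rather than the slowest one. The following is a deliberately simplified model of that idea (real systems add network, disk, and consensus overheads on top):

```python
def quorum_commit_latency(replica_rtts_ms):
    """Simplified model: a write is acknowledged once a majority of
    replicas have persisted it, so commit latency is driven by the
    quorum-th fastest replica, not the slowest one."""
    n = len(replica_rtts_ms)
    quorum = n // 2 + 1  # majority
    return sorted(replica_rtts_ms)[quorum - 1]

# Three replicas: the leader itself (RTT ~0 ms), a nearby node,
# and a slow remote node.
print(quorum_commit_latency([0, 2.1, 9.8]))  # 2.1 -- slow replica hidden
```

The model also shows the flip side: a single-node PostgreSQL instance pays none of this replication cost, which is part of the latency trade-off discussed in the article cited above.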
Finally, it would be unfair not to mention DuckDB. This high-performance analytical database system started its life at CWI in the Netherlands, also known as the birthplace of Python. Unlike the previous database technologies we mentioned, this is not a database server system; we saw that we could use it more like SQLite-for-Big-Data. What made DuckDB interesting and practical for us was its dead-simple deployment with no dependencies, its columnar-vectorized query execution engine, its APIs for languages such as Python, Java, and R (in addition to a practical, SQLite-like command line interface), direct Parquet & CSV querying, and SQL support including window functions.
If you export your time series data out of your database servers into compressed CSV or Parquet files, and then wish to run ad-hoc analytical queries on millions or billions of rows of data without installing a complex analytical database, we strongly recommend giving DuckDB a try. Data analysts and data scientists who like using Pandas should also check out “A zero-copy data integration between Apache Arrow and DuckDB” (and keep in mind that the DuckDB Python package can run queries directly on Pandas data without ever importing or copying any data).
As concluding remarks, we can say it is obvious that the world of databases is converging along some clearly visible dimensions: upcoming database technologies compete with each other to provide horizontal scalability and shared-nothing architecture, while not sacrificing the benefits and comforts of SQL compliance as well as relational data modeling patterns. Being compatible with a mature database system such as PostgreSQL, and playing nicely with it, is another strong convergence, and rightfully so. Needless to say, compatibility with major ORM frameworks for .NET, Java, Python, etc. is also taken for granted. All of the database technologies we mentioned so far are open source systems with transparent development practices, and they are also available as cloud services, either offered directly by their backing companies or by being straightforward to install on your favorite public cloud provider.
We expect to see even more exciting developments in the world of databases, and encourage you to contact TM Data ICT Solutions for more information regarding your database needs ranging from scalable and optimal data architecture design to data governance, data quality, database migration, and database system integration.