Making a Redis ORM in Rust - Preface
While the word “Rust” is in the title, there will not be any Rust code discussed in this post.
This blog post is intended to serve more as a literature review than the start of a tutorial. Prior experience using existing ORM tools or using Redis is not required, but I highly encourage you to at least get familiar with Redis before reading subsequent posts on this project. You can get started by reading this very small book; it’s only 30 pages.
Basic knowledge of and experience with SQL will be useful for understanding this post.
In places where I use the term ‘model class’, this can also refer to a model type or struct in the case of languages that do not use OOP.
Prelude⌗
You can skip to the next section if you know what Redis and an ORM are
I was reading a blog post by Agiliq where they made a simple Redis ORM in Python to mimic the Django models API. This piqued my interest as it seemed like a niche thing to have an ORM for Redis. I then got to thinking about just how much it would take to make a Redis ORM in Rust and also make a good case for it. All of this led me down a path of more questions like, “Will I need to write some Rust macros?”, “How much performance am I giving away by using an ORM?” and “Do you even ORM, bro?”…
With more questions than answers, I began to research the things I needed to know to make my first ORM. Let’s begin by defining some terms.
What is Redis?⌗
Redis can be described as an in-memory key-value store. It is a NoSQL database and has a number of unique data types (or data structures, if you will) such as Strings, Hashes, Lists, Streams etc.
A common use case for Redis is to cache information, thanks to its highly efficient read/write speeds.
What is ORM?⌗
From an academic perspective, ORM stands for Object Relational Mapping. This is a pattern used to convert data between two incompatible type systems. Normally, an ORM tool is used in the context of converting data between a program and a database, whereby data from database tables is converted into data that can be used by the data structures in your program.
ORM can also stand for Object Relational Mapper which in this article is referred to as ‘ORM tool’.
Common open source ORM tools include:
- Diesel (Rust)
- SQLAlchemy (Python)
- ActiveRecord (Ruby on Rails)
- Django models (Django)
- Waterline (Node.js)
- Sequelize (Node.js)
To better understand the use of an ORM, we will first take a look at how programs interact with databases.
You and your data⌗
More often than not, when making a program you find yourself needing to persist some data in a database. If you use a relational database, you will have your (hopefully) normalised tables and will frequently be making calls to read from and insert into them.
Let us assume we have a program that manages a video store and has tables with the following schema:
```sql
CREATE TABLE movies (
    id SERIAL PRIMARY KEY, -- SERIAL (used in Postgres) is similar to AUTO_INCREMENT in MySQL
    name VARCHAR(100) NOT NULL,
    release_date DATE NOT NULL,
    genre VARCHAR(100) NOT NULL, -- We shall assume a movie will only have one genre
    language VARCHAR(40) NOT NULL,
    runtime INTEGER NOT NULL
);

CREATE TABLE directors (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

-- Given that one movie can have many directors and vice versa, we introduce this table
CREATE TABLE movies_directors (
    movie_id INTEGER NOT NULL, -- INTEGER rather than SERIAL: these columns reference existing ids
    director_id INTEGER NOT NULL,
    PRIMARY KEY (movie_id, director_id),
    FOREIGN KEY (movie_id) REFERENCES movies(id),
    FOREIGN KEY (director_id) REFERENCES directors(id)
);
```
We have a number of ways to access data from the database:
1. Hand coded SQL statements for all the data moving to and from the database⌗
This is the simplest and usually most straightforward approach. If we needed all movies released in 2020 or later, the query would look like this:
```sql
SELECT * FROM movies WHERE release_date >= '2020-01-01';
```
This works for a lot of use cases and such queries are put into functions specific to their task.
Advantages⌗
- This approach is simple and has the least amount of overhead when making queries.
- Hand coded SQL statements allow you to make any kind of query allowed by the SQL database.
Disadvantages⌗
If you are not well versed in SQL, it is very easy to make queries that are not as performant as they should be.
You end up writing as many SQL statements by hand as there are data requirements. If you need to get all movies between a given period, that’s a query. All movies that are in English? Another hand coded SQL statement. All comedy movies released in 2018? Another one. The `movies` table alone can have up to 32 different `SELECT ... WHERE` statements! (See the explanation in the footnotes.) This particular disadvantage leads us to the next way one can access data.
2. Abstracting common statements into methods⌗
Various SELECT statements can be easily grouped so as to be called from a single function, especially if they only query one table. For example, we can find a movie with any search criteria using the following method in Python:
```python
class Movie:
    ...  # Add your class properties here

    @staticmethod
    def find_movie(where_clause):  # where_clause is a string argument
        # Assume that a database connection is made in this way
        conn = DBConnection("postgresql://localhost/test")
        sql = "SELECT * FROM movies WHERE " + where_clause
        return conn.query(sql)

Movie.find_movie("genre = 'comedy'")
Movie.find_movie("release_date >= '2020-01-01'")
Movie.find_movie("genre = 'drama' AND release_date < '2019-11-18'")
```
From the code snippet above, you can see that a single method can serve for a broader number of queries.
Advantages⌗
- Introducing this layer of abstraction makes your code DRY.
- Properly named methods make it easier for other developers to understand what database interactions are being done.
Disadvantages⌗
It is difficult to make reusable methods when making queries that span multiple tables. For example, if we wanted to get all movies by a specific director, we could have the following method in Java:
```java
public static List<DirectorMovie> findDirectorMovies(String director) throws SQLException {
    // Note: concatenating input into SQL like this is vulnerable to injection;
    // it is kept here purely for illustration.
    String sql = "SELECT movies.name movie_name, movies.release_date, movies.genre, movies.runtime, directors.name director"
        + " FROM movies INNER JOIN movies_directors ON movies.id = movies_directors.movie_id"
        + " INNER JOIN directors ON directors.id = movies_directors.director_id"
        + " WHERE directors.name = '" + director + "'";
    Connection con = DriverManager.getConnection("jdbc:postgresql://localhost/test", "name", "pwd");
    Statement stmt = con.createStatement();
    ResultSet res = stmt.executeQuery(sql);
    List<DirectorMovie> movies = new ArrayList<DirectorMovie>();
    while (res.next()) {
        DirectorMovie directorMovie = new DirectorMovie();
        directorMovie.setName(res.getString(1));
        directorMovie.setReleaseDate(res.getDate(2));
        directorMovie.setGenre(res.getString(3));
        directorMovie.setRuntime(res.getInt(4));
        directorMovie.setDirector(res.getString(5));
        movies.add(directorMovie);
    }
    con.close();
    return movies;
}
```

Given that it returns a specific type of class and has a rather specific `SELECT` statement, it is difficult to extract reusable parts of this method.

Abstracting the statements into methods still requires one to type in the method implementations across all the classes representing database objects. For example, the `INSERT` statement for the `movies` table will not be identical to the `INSERT` statement for the `directors` table, hence the `insert()` method for both classes will need to be implemented differently.
Notwithstanding, it is also important to note a common disadvantage shared by both approaches: if the underlying database is changed, e.g. from PostgreSQL to MariaDB, differences in certain SQL commands will require refactoring most of your SQL code.
3. Using an ORM⌗
So then, how would you handle accessing your data with an ORM? Let us look at how the Django ORM handles this:
```python
import datetime
from django.db import models

class Movie(models.Model):
    name = models.CharField(max_length=100)
    release_date = models.DateField()
    genre = models.CharField(max_length=100)
    language = models.CharField(max_length=40)
    runtime = models.IntegerField()

movie1 = Movie.objects.create(name="Speed Racer", release_date="2008-04-28", genre="Action", runtime=135)
movie2 = Movie.objects.create(name="Bad Boys for Life", release_date="2020-07-01", genre="Crime")
Movie.objects.filter(release_date__gte=datetime.date(2020, 1, 1))
```
After defining a new Movie class that inherits from a base Model class, we create two movies and save them to the database using create. To search for movies released in or after 2020, we use the filter method. The 32 SELECT statements (mentioned in the footnotes) can all be done using this filter method alone!
```python
# get all English movies
Movie.objects.filter(language="English")

# get all German movies released in 2019 that are at least 2 hours long
Movie.objects.filter(language="German", runtime__gte=120, release_date__year=2019)
```
This is already much more convenient than hand coded SQL statements.
Working with multiple tables⌗
Let us see how we can implement the findDirectorMovies method we saw here.
First, we shall define both model classes representing the movies and directors
tables.
```python
from django.db import models

class Movie(models.Model):
    name = models.CharField(max_length=100)
    release_date = models.DateField()
    genre = models.CharField(max_length=100)
    language = models.CharField(max_length=40)
    runtime = models.IntegerField()

class Director(models.Model):
    name = models.CharField(max_length=100)
    movies = models.ManyToManyField(Movie)
```
The code above will generate SQL similar to the one we saw here. Next, we create some directors and movies.
```python
director = Director.objects.create(name="Guillermo Del Toro")
movie1 = Movie.objects.create(name="Blade II", release_date="2002-03-21", genre="Sci-fi", language="English", runtime=117)
movie2 = Movie.objects.create(name="Pacific Rim", release_date="2013-07-01", genre="Action", language="English", runtime=132)
movie3 = Movie.objects.create(name="Between Two Ferns: The Movie", release_date="2019-09-20", genre="Comedy", language="English", runtime=83)
```
We then establish the relationship between a movie and a director.
```python
# Relating a director to a movie they directed
director.movies.add(movie1)

# You can even relate a movie with its director
movie2.director_set.add(director)

# movie3 is not related to the director object since he didn't actually direct it
```
Finally, to query for all movies by Guillermo Del Toro, we type the following:
```python
Director.objects.get(name="Guillermo Del Toro").movies.all()
```
Writing no SQL statements of our own, we have managed to make the same query for movies by a given director. This is very convenient as you are able to build your applications much faster than you would using the previous approaches. The scope of queries you can make is much larger than this but that is outside the scope of this post. Refer to the links posted here for further documentation on each tool.
Hopefully, you can see why you would want to use an ORM tool. Of course, ORMs come with their own drawbacks, which we shall see later in the footnotes. Let us look at how ORMs actually manage to provide this much convenience.
ORM * ORM = an ORMageddon?⌗
From the previous code snippets in Python showcasing the Django ORM, we saw
that a model class like Movie has to inherit from the Model superclass and
its properties were of class types found in the models package like CharField.
The Model superclass is the one that provides the filter and create methods.
This is standard OOP behaviour, so it should be easy to just recreate this Model
class ourselves, right? We could simply make a similar superclass, an abstract class,
or a trait/interface if you’re not using an OOP language, and have all these methods
there. However, there is still something else we have to consider.
That’s so meta…⌗
On closer inspection, we see some kwargs (keyword arguments) in the
filter method like runtime__gte and release_date__year. runtime__gte
is used to get movies that have a runtime greater than or equal to a given value
and release_date__year is used to get movies that have a release date within
a given year. Both kwargs seem to be very similar to the names of our
properties, release_date and runtime. But how did they get there? Are they
part of the Model superclass? If they were, then they would be pretty useless
if our Movie class never had these properties. In fact, if you changed the name
of the property from runtime to runstimes (or whatever other gibberish you would
like), the kwarg changes to runstimes__gte. Thus the following method call still
returns all movies that are at least 2 hours long:
```python
# First we redefine our property from runtime to runstimes
class Movie(models.Model):
    ...
    runstimes = models.IntegerField()
    ...

Movie.objects.filter(runstimes__gte=120)
```
What code magic is this? runstimes is not even a word. This is achieved
thanks to a feature present in most programming languages to varying degrees:
METAPROGRAMMING.
Metaprogramming is the technique of treating code as data. This allows a program to read another program, or even itself, and either modify it or generate a new program. Many common programs you already use as a programmer already make use of metaprogramming, such as compilers and interpreters.
Metaprogramming, normally via reflection, is how most (if not all) ORMs are able to provide methods for querying and insertion while also remaining generic enough to fit a multitude of use cases. So instead of writing all the methods by hand, which is a drawback noted earlier, ORMs leverage metaprogramming to automatically generate those method implementations.
How metaprogramming is used depends on your preferred language and is not covered in this blog post.
In the case of the Django ORM, the Movie class is read through by the Model
superclass to get information on the class type of each property, and then
other methods with special kwargs and properties that you are likely to use are
created. This is a very complex process and it can even span multiple classes.
We saw this when we used the Django ORM to associate a movie with a director:
the director_set property of the Movie class was created because of the
following in the Director class:
```python
class Director(models.Model):
    ...
    movies = models.ManyToManyField(Movie)
```
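To demystify this a little, here is a toy sketch in Python of how a base class can inspect its subclass’s fields at runtime and accept `__gte`-style kwargs. This is emphatically not Django’s real implementation; every name here is made up for illustration:

```python
class Field:
    pass

class IntegerField(Field):
    pass

class CharField(Field):
    pass

class Model:
    @classmethod
    def fields(cls):
        # Inspect the subclass at runtime: every class attribute that is a
        # Field instance counts as a queryable column.
        return [name for name, value in vars(cls).items() if isinstance(value, Field)]

    @classmethod
    def build_where(cls, **kwargs):
        ops = {"gte": ">=", "lte": "<=", "exact": "="}
        clauses = []
        for key, value in kwargs.items():
            name, _, lookup = key.partition("__")
            if name not in cls.fields():
                raise AttributeError(f"{cls.__name__} has no field {name!r}")
            clauses.append(f"{name} {ops.get(lookup or 'exact', '=')} {value!r}")
        return " AND ".join(clauses)

class Movie(Model):
    runtime = IntegerField()
    genre = CharField()

# Rename `runtime` to `runstimes` and `runstimes__gte` works automatically,
# because the accepted kwargs are derived from whatever fields the class defines.
clause = Movie.build_where(runtime__gte=120, genre="comedy")
# clause == "runtime >= 120 AND genre = 'comedy'"
```

Because the base class only ever looks at the subclass through reflection, renaming a field changes the accepted kwargs with no other code changes, which is exactly the “code magic” we saw above.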
So we have seen that ORMs make use of metaprogramming to generate methods that match the structure of our model classes and provide a lot of convenience. Next, let us see what I have found to be common parts or features of an ORM tool.
Parts of an ORM tool⌗
Determining the parts of an ORM tool involved looking at what the ORM tools I had listed had in common. Some features were found to be absolutely essential to any ORM, while others (marked with an asterisk) were less common and can be left out for simplicity if you were to make your own ORM.
Their order is completely arbitrary.
- Data mapping
- Data modelling
- Migration tooling
- Query building
- DSLs *
- Schema inference *
Data Mapping⌗
Data Mapping is the process whereby data types in your programming language are
converted to the types that are available in the database being used. This can be
as simple as mapping a VARCHAR to a String, or as complex as returning
a data structure from the result of multiple joins.

Data Mapping also involves serializing and deserializing the information sent to
and from the database.
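As a rough sketch of what that mapping step looks like in practice, here is a hypothetical `map_row` helper (not taken from any real ORM) that converts a raw row, as a database driver might return it, into native Python types:

```python
import datetime

class Movie:
    def __init__(self, name, release_date, runtime):
        self.name = name                  # VARCHAR  -> str
        self.release_date = release_date  # DATE     -> datetime.date
        self.runtime = runtime            # INTEGER  -> int

def map_row(row):
    name, release_date, runtime = row
    # Some drivers hand dates and numbers back as strings; the mapper
    # normalises them before they reach your program's data structures.
    if isinstance(release_date, str):
        release_date = datetime.date.fromisoformat(release_date)
    return Movie(name, release_date, int(runtime))

movie = map_row(("Speed Racer", "2008-04-28", "135"))
# movie.release_date is datetime.date(2008, 4, 28), movie.runtime is 135
```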
Data Modelling⌗
Data Modelling refers to how data and relationships in your database are represented in your code. This includes the names of tables and their columns as well as representing database relationships. In some ORMs like the Django ORM, this even involves naming database tables in a specific way so as to ensure that everything works correctly.
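As a tiny illustration of the naming conventions involved, here is a sketch of deriving a table name from a model class name, similar in spirit to Django’s `<app>_<model>` convention (the function and app label are made up, not Django’s actual code):

```python
def table_name(cls, app_label="videostore"):
    # Convention over configuration: the table name is derived from the
    # class name, so the ORM always knows which table a model maps to.
    return f"{app_label}_{cls.__name__.lower()}"

class Movie:
    pass

class Director:
    pass

# table_name(Movie) -> "videostore_movie"
# table_name(Director) -> "videostore_director"
```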
Migration tooling⌗
A database migration (in this context) is the process of applying a set of schema changes to a database. Migrations are useful for keeping track of changes made to a database, and normally these changes can be reverted at any time. ORMs have a migration tool used to make these database migrations. This tool should be responsible for ensuring migrations are applied successfully, keeping track of the migrations, and applying migrations in an idempotent way such that repeating a migration will have no effect on the database.
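A minimal sketch of how such tooling stays idempotent: record which migrations have already been applied and skip them on later runs. The table and function names here are illustrative, not taken from any real migration tool:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The tool's own bookkeeping table: one row per applied migration.
conn.execute("CREATE TABLE IF NOT EXISTS applied_migrations (name TEXT PRIMARY KEY)")

def apply_migration(name, sql):
    already = conn.execute(
        "SELECT 1 FROM applied_migrations WHERE name = ?", (name,)
    ).fetchone()
    if already:
        return False  # re-running a migration has no effect on the database
    conn.execute(sql)
    conn.execute("INSERT INTO applied_migrations (name) VALUES (?)", (name,))
    return True

first = apply_migration("0001_create_movies", "CREATE TABLE movies (id INTEGER PRIMARY KEY, name TEXT)")
second = apply_migration("0001_create_movies", "CREATE TABLE movies (id INTEGER PRIMARY KEY, name TEXT)")
# first is True, second is False: the migration only ran once
```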
Query Building⌗
Query building is the process by which database queries are made from a series
of method calls. Query building is a very complex process as it should be both
expressive enough for any developer using your ORM to actually find it useful,
and efficient, such that the actual SELECT query that is produced is as close
as you can get to an efficient hand written query. This is not always
possible, which leads us to using DSLs.
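To make “queries from a series of method calls” concrete, here is a toy fluent query builder (illustrative only, far simpler than any real ORM’s): each call records one piece of the query, and a final call assembles the SELECT statement.

```python
class Query:
    def __init__(self, table):
        self.table = table
        self.clauses = []
        self._order = None

    def where(self, clause):
        self.clauses.append(clause)
        return self  # returning self is what lets the calls chain

    def order_by(self, column):
        self._order = column
        return self

    def sql(self):
        stmt = f"SELECT * FROM {self.table}"
        if self.clauses:
            stmt += " WHERE " + " AND ".join(self.clauses)
        if self._order:
            stmt += f" ORDER BY {self._order}"
        return stmt + ";"

q = Query("movies").where("genre = 'comedy'").where("runtime >= 120").order_by("release_date")
# q.sql() -> "SELECT * FROM movies WHERE genre = 'comedy' AND runtime >= 120 ORDER BY release_date;"
```

A real query builder has to do far more (joins, parameter escaping, dialect differences), which is where the complexity and the efficiency concerns come from.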
DSLs (Domain Specific Languages) *⌗
A Domain Specific Language is a programming language used to address a specific domain. The DSLs in ORM tools have very small grammars and are primarily used to simplify making queries or manipulating data.
A DSL is especially useful when the language it is implemented in constrains it from being expressive. For example, in Rust there is no native support yet for variadic functions (functions that can take a varying number of arguments).
Schema inference *⌗
Schema inference is whereby your ORM tool can infer the schema of a database, from which it can generate the appropriate data models. Schema inference is particularly useful in cases where you as a developer want to use an ORM on an existing database. ORMs without an inference engine would force you to change existing databases, tables and views, or require you to copy all your data over to newly created ones.
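As a toy sketch of the idea, a database’s own catalogue can be read to recover the schema. Here SQLite’s `PRAGMA table_info` is used; a real inference engine would emit actual model code rather than a dict:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (id INTEGER PRIMARY KEY, name TEXT, runtime INTEGER)")

def infer_schema(connection, table):
    # PRAGMA table_info returns (cid, name, type, notnull, default, pk) rows,
    # from which we recover each column's name and declared type.
    return {row[1]: row[2] for row in connection.execute(f"PRAGMA table_info({table})")}

schema = infer_schema(conn, "movies")
# schema -> {"id": "INTEGER", "name": "TEXT", "runtime": "INTEGER"}
```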
Aside from knowing these parts of an ORM, there is one more important thing to take note of.
The pattern within a pattern⌗
If you have actually used more than one ORM tool, especially across different programming languages, you may have realised that not all ORM tools are the same in terms of features. Some are “batteries included” like the Django ORM while some ORM tools like Diesel need you to define the database tables yourself.
While researching on the ORM pattern, I came across this book by Martin Fowler called Patterns of Enterprise Application Architecture and I found that these ORM tools fall under two sub-patterns of ORM:
- Data Mapper
- Active Record
The other two patterns for Object Relational Mapping mentioned in the book above are not exactly prominent in the ORM tools I had listed before so I have omitted them.
Data Mapper⌗
In the book, a data mapper is defined as a layer of mappers that moves data between objects and a database while keeping them independent of each other and the mapper itself. When using a data mapper the model classes do not have any database interaction code in their methods. Some of the ORM tools using this pattern are SQLAlchemy and Diesel.
Advantages⌗
- It is much more performant than an Active Record based ORM tool since there is less overhead when generating SQL queries.
- It allows the database and the model classes to be independent of each other so they can be represented differently.
Disadvantages⌗
- Making queries using a Data Mapper is much more verbose than with an Active Record based ORM.
Active Record⌗
In the Active Record pattern, an active record is an object that wraps a row in a database table or view, encapsulates the database access, and adds domain logic on that data. When using an ORM based on the Active Record pattern, the model classes contain methods for database interaction. Examples of such ORM tools include Waterline, Sequelize, the Django ORM and, of course, ActiveRecord.
Advantages⌗
- Making queries is less verbose and simpler than with a Data Mapper based ORM.
Disadvantages⌗
- Model classes and database tables are tightly coupled and hence must match for the ORM to work effectively.
- Complex queries involving multiple tables are not easy to express and can lead to deeply nested method calls.
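In rough Python strokes, the difference between the two patterns comes down to where the database code lives. This is an illustrative sketch, not any specific tool’s API:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT)")

# Active Record style: the model object itself knows how to save itself.
class ActiveMovie:
    def __init__(self, name):
        self.name = name

    def save(self):
        conn.execute("INSERT INTO movies (name) VALUES (?)", (self.name,))

# Data Mapper style: the model is plain data, and a separate mapper
# moves it to and from the database.
class PlainMovie:
    def __init__(self, name):
        self.name = name

class MovieMapper:
    def __init__(self, connection):
        self.connection = connection

    def insert(self, movie):
        self.connection.execute("INSERT INTO movies (name) VALUES (?)", (movie.name,))

ActiveMovie("Blade II").save()
MovieMapper(conn).insert(PlainMovie("Pacific Rim"))
```

The Active Record call is shorter, but the Data Mapper keeps `PlainMovie` completely unaware of the database, which is exactly the trade-off listed above.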
Understanding which pattern to use is essential to making an ORM as it helps determine how much functionality is left to the ORM to handle.
Hoping that I have not bored you with details, we now have an extensive understanding of ORMs. Let’s see how these concepts can be applied to Redis.
Redis, meet ORM⌗
The very first thing I did when I thought of making a Redis ORM was to Google the words “Redis ORM”. Much to my disappointment, I didn’t find many tools doing this to draw inspiration from. Given that Redis is such a popular open source project (based on GitHub stars alone), why are there so few Redis ORMs, if any? This is the case for two reasons. The first one is all about semantics. The second one is about Redis being so damn simple in design. Let me explain.
Redis != ORM⌗
If you can remember from earlier on, we saw that ORM stands for Object Relational Mapping. The word relational implies that this is used with relational databases, which are usually SQL databases. Redis is NOT a relational database, thus making a tool with ORM-like behaviour would mean giving it a new term. Maybe ORM (Object Redis Mapper)? LOL.
MongoDB, which is also a non-relational database, can be used with an ORM-like tool called Mongoose (used in Node.js). Mongoose describes itself as an ODM (Object Document Mapper), which is appropriate since MongoDB is a document based database. In the case of Redis then, we could call this an OKVM (Object Key Value Mapper), but I’ll leave that up for debate.
Redis is better than simple. Simple is better than complex.⌗
The commands used to query and manipulate data in Redis are very simple and straightforward. Most of the time, you will find yourself executing commands that are only 3 or 4 words long. This keeps the Redis API simple but also very limiting. Among the 5 data types discussed in ‘The Little Redis Book’, i.e. Strings, Lists, Hashes, Sets and Sorted Sets, only sorted sets have a means of searching for values by filtering on a given value.
It should be known that Redis is intentionally designed this way since it solves a different set of problems from relational databases.
For example, fetching the members of a sorted set whose score falls between 90 and 100:

```
ZRANGEBYSCORE scores:2020-02-01 90 100
```
A simple SQL SELECT statement, on the other hand, can look like this:
```sql
SELECT * FROM table_name WHERE table_column LIKE '%s' AND table_column < 25 ORDER BY table_column;
```
This has 15 words (a word here being anything separated by a space), which is a large enough interface to benefit from abstraction, whereas abstracting Redis commands would largely involve just wrapping commands in functions with little to no gain. This reduces the need for having an ORM for Redis. Hopefully, in future posts I am able to justify the ORM tool I make for Redis.
Conclusion⌗
In this post, we looked at why ORMs exist and what an ORM is made up of. We related these ORM concepts to Redis, showing how it is not a 1:1 mapping. In the next post, we shall get started with making an ORM tool with most of the parts I listed here using Rust.
Footnotes⌗
… 32 different SELECT … WHERE statements⌗
If we exclude the id column in the movies table, one can query for a
specific movie using 5 columns. If we combine any number of these 5 columns
(minimum of 1, maximum of 5; order does not matter in a WHERE clause),
we can have up to 31 different SELECT ... WHERE statements. This is shown using
the following formula:
$$ \begin{gather*} \sum_{n=1}^5 \binom{5}{n} = \binom{5}{1} + \binom{5}{2} + \binom{5}{3} + \binom{5}{4} + \binom{5}{5} = 31 \end{gather*} $$
We then add the SELECT ... WHERE statement that uses the id column and we get 32.
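This count can also be sanity-checked with a few lines of Python:

```python
from itertools import combinations

columns = ["name", "release_date", "genre", "language", "runtime"]
# Count every non-empty subset of the 5 queryable columns.
subsets = sum(1 for n in range(1, 6) for _ in combinations(columns, n))
# subsets is 31; adding the single query on the id column gives 32
```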
Disadvantages of using ORMs⌗
While ORMs provide a lot of utility, they also come with their own disadvantages:

- Expressing complex queries involving multiple tables can be problematic with ORMs. This is usually why ORMs allow for calling raw SQL commands, which can better structure the query needed. In fact, queries made from ORM query builders are not always guaranteed to be the most efficient.
- Most ORMs introduce a level of vendor lock-in, as they may have a very rigid data model making it hard to switch between ORMs.
- Due to how most ORMs model data, you are at the mercy of the ORM in terms of database design, e.g. most ORMs do not have a mechanism for reading data from database views rather than tables.