Truth first, or why systems should be designed around the database structure

Hey Habr!

We continue to explore the topic of Java and Spring, including at the database level. Today we offer a read about why, when designing large applications, it is the database structure, and not the Java code, that should be of decisive importance, how this is done, and what the exceptions to this rule are.

In this long-overdue article, I will explain why I believe that in almost all cases the data model in an application should be designed "from the database" rather than "from the capabilities of Java" (or whatever client language you are working with). By choosing the second approach, you set out on a long path of pain and suffering once your project starts to grow.

The article was written in response to a question asked on Stack Overflow.

There are also interesting discussions on Reddit, in /r/java and /r/programming.

Code generation

I am always surprised by the small but vocal share of users who, on becoming acquainted with jOOQ, resent the fact that jOOQ relies heavily on source code generation. No one stops you from using jOOQ as you see fit, and no one forces you to use code generation. But by default (as described in the manual), jOOQ works like this: you start with a (legacy) database schema, reverse-engineer it with the jOOQ code generator to get a set of classes representing your tables, and then write type-safe queries against those tables:

for (Record2<String, String> record : DSL.using(configuration)
//   ^^^^^^^^^^^^^^^^^^^^^^^ Type information inferred from the
//   generated code referenced by the SELECT clause below

       .select(ACTOR.FIRST_NAME, ACTOR.LAST_NAME)
//            vvvvv ^^^^^^^^^^^  ^^^^^^^^^^^^^^^ generated names
       .from(ACTOR)
       .orderBy(1, 2)) {
    // ...
}

The code is generated either manually outside of the build, or automatically on every build. For example, such regeneration can follow immediately after a Flyway database migration, which can itself be run manually or automatically.

Source code generation

There are various philosophies, advantages, and disadvantages associated with these two approaches to code generation (manual and automatic), which I am not going to discuss in detail in this article. In general, though, the whole point of generated code is that it reproduces in Java the "truth" that we take for granted, whether it lives inside our system or outside of it. In a sense, compilers that generate bytecode, machine code, or some other kind of code from source code do the same thing: we get a representation of our "truth" in another language, whatever the specific reasons.

There are many such code generators. For example, XJC can generate Java code based on XSD or WSDL files. The principle is always the same:

  • There is some truth (internal or external) - for example, a specification, a data model, etc.
  • We need a local representation of this truth in our programming language.

Moreover, it is almost always advisable to generate such a representation - in order to avoid redundancy.

Type Providers and Annotation Processing

Note: another, more modern and specialized approach to code generation for jOOQ involves type providers, as implemented in F#. In that case, the code is generated by the compiler, at compile time, and essentially never exists in source form at all. Java has similar, though less elegant, tools: annotation processors, for example Lombok.

In a certain sense, the same things happen here as in the first case, except:

  • You do not see the generated code (perhaps some people find that less off-putting?)
  • You must ensure that the types can be provided, that is, the "truth" must always be available. This is easy with Lombok, whose annotations sit directly on the "truth" itself. It is a little harder with database models, which depend on a live connection that must always be available.

What is the problem with code generation?

Besides the tricky question of whether code generation is better run manually or automatically, I have to mention that there are people who believe code generation is not needed at all. The justification I have come across most often is that it is then hard to set up the build pipeline. And yes, it really is hard. There are additional infrastructure costs. If you are just getting started with a particular product (be it jOOQ, or JAXB, or Hibernate, etc.), setting up that workbench takes time you would rather spend learning the API itself and getting value out of it.

If the cost of understanding how the generator works is too high, then the API has indeed done a poor job on the code generator's usability (and, later, customizing it will also turn out to be difficult). Usability should be the highest priority for any such API. But that is only one argument against code generation. Beyond it, writing the local representation of internal or external truth entirely by hand makes no sense at all.

Many will say they have no time for all this. They are on deadline with their Super Product. Someday later we will tidy up the build pipelines; we will find the time. To them I answer:

(Illustration: original by Alan O'Rourke, Audience Stack)

But with Hibernate/JPA it is so easy to write the code "in Java".

It really is. For Hibernate and its users, this is both a blessing and a curse. In Hibernate, you can simply write a couple of entities, like this:

@Entity
class Book {
  @Id
  int id;
  String title;
}

And you are almost done. Now it is Hibernate's job to generate the complicated "details" of how exactly this entity will be defined in the DDL of your SQL "dialect":

CREATE TABLE book (
  id INTEGER GENERATED ALWAYS AS IDENTITY,
  title VARCHAR(50),
 
  CONSTRAINT pk_book PRIMARY KEY (id)
);
 
CREATE INDEX i_book_title ON book (title);

... and the application starts running. A really cool feature for getting up and running quickly and trying things out.

But wait. I lied.

  • Will Hibernate actually apply this named primary key definition?
  • Will Hibernate create the index on TITLE? I know for sure we will need one.
  • Will Hibernate actually make this ID an identity column, with an identity specification?

Probably not. While you are developing your project from scratch, it is always convenient to simply drop the old database and generate a new one as soon as you add the necessary annotations. So the Book entity eventually takes this form:

@Entity
@Table(name = "book", indexes = {
  @Index(name = "i_book_title", columnList = "title")
})
class Book {
  @Id
  @GeneratedValue(strategy = IDENTITY)
  int id;
  String title;
}

Cool. Regenerate. Again, while you are starting out, this is all very easy.

But you'll have to pay for it later.

Sooner or later you will have to go into production. That's when the model stops working. Because:

In production, you can no longer simply drop the old database and start over from scratch whenever you need to. Your database has become legacy.

From now on, and forever, you will have to write DDL migration scripts, e.g. using Flyway. And what happens to your entities then? You can either adapt them manually (doubling your workload) or have Hibernate regenerate them for you (and how likely is the result to meet your expectations?). You lose either way.
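As a sketch of what such a migration script might look like (the file name follows Flyway's versioned-migration naming convention; the column and index names here are made up for illustration):

```sql
-- V2__add_book_language.sql: a hypothetical Flyway migration.
-- Flyway applies files named V<version>__<description>.sql exactly once,
-- in version order, and records each run in its schema history table.
ALTER TABLE book ADD COLUMN language VARCHAR(10) DEFAULT 'en' NOT NULL;
CREATE INDEX i_book_language ON book (language);
```

From this point on, the DDL scripts are the authoritative record of how the schema evolved, and the entities must follow them, not the other way around.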

And so, as soon as you go to production, you will need hotfixes. And they have to reach production very quickly. Since you never prepared a smooth pipeline for shipping your migrations to production, you patch wildly. Then you run out of time to do things right. And you blame Hibernate, because it is always someone else's fault, never yours...

Instead, from the very beginning, everything could have been done completely differently. For example, put round wheels on a bicycle.

Database first

The real "truth" of your database schema, and the "sovereignty" over it, reside within the database. The schema is defined only in the database itself and nowhere else, and each client holds a copy of this schema, so it makes perfect sense to enforce the schema and its integrity right there in the database, where the information is stored.
This is old, even hackneyed, wisdom. Primary and unique keys are good. Foreign keys are good. Check constraints are good. Assertions are good.

And that's not all. If you use Oracle, for example, you will probably want to specify:

  • Which tablespace your table lives in
  • What its PCTFREE value is
  • What the cache size of your sequence (behind the id) is
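A hedged sketch of what such vendor-specific DDL could look like in Oracle; the tablespace name, PCTFREE value, and sequence cache size below are illustrative placeholders, not recommendations:

```sql
-- Illustrative Oracle DDL: storage options an ORM typically cannot express.
CREATE TABLE book (
  id    NUMBER(18) NOT NULL,
  title VARCHAR2(50 CHAR),
  CONSTRAINT pk_book PRIMARY KEY (id)
)
PCTFREE 10
TABLESPACE my_tablespace;

-- The sequence backing the id, with a pre-allocated cache of values.
CREATE SEQUENCE s_book_id CACHE 200;
```

None of these physical-storage clauses change the logical model, which is exactly why they belong in hand-written DDL rather than in client-side annotations.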

None of this may matter in small systems, but you don't have to wait for the move into "big data" territory: you can benefit from vendor-specific storage optimizations, such as those above, much earlier. None of the ORMs I have seen (including jOOQ) provide access to the full set of DDL options you might want to use in your database. ORMs offer some tools to help you write DDL, nothing more.

But at the end of the day, a well-designed schema is hand-written in DDL. Any generated DDL is only an approximation of it.

What about the client model?

As mentioned above, on the client you will need a copy of your database schema, the client view. Needless to say, this client view must be in sync with the real model. What is the best way to achieve this? With a code generator.

All databases provide their meta-information via SQL. Here's how to get all tables in different SQL dialects from your database:

-- H2, HSQLDB, MySQL, PostgreSQL, SQL Server
SELECT table_schema, table_name
FROM information_schema.tables
 
-- DB2
SELECT tabschema, tabname
FROM syscat.tables
 
-- Oracle
SELECT owner, table_name
FROM all_tables
 
-- SQLite
SELECT name
FROM sqlite_master
 
-- Teradata
SELECT databasename, tablename
FROM dbc.tables

These queries (or similar ones, depending on whether you also need to consider views, materialized views, or table-valued functions) can also be run by calling DatabaseMetaData.getTables() from JDBC, or by using the jOOQ meta module.

From the results of such queries, it is relatively easy to generate any client-side representation of your database model, no matter what technology you use on the client.

  • If you are using JDBC or Spring, you can create a set of string constants
  • If you are using JPA, you can generate the entities themselves
  • If you are using jOOQ, you can generate the jOOQ meta model
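To make the first option concrete, here is a minimal, hand-rolled sketch (not jOOQ's actual generator) that turns table names obtained from such a metadata query into a Java class of string constants:

```java
import java.util.List;

// Minimal sketch of a code generator: turns table names read from
// INFORMATION_SCHEMA (or DatabaseMetaData.getTables()) into a Java
// source file of string constants. Purely illustrative; real generators
// such as jOOQ's emit far richer meta models.
public class ConstantsGenerator {

    public static String generate(String className, List<String> tableNames) {
        StringBuilder sb = new StringBuilder();
        sb.append("public final class ").append(className).append(" {\n");
        for (String table : tableNames) {
            // One constant per table, e.g. BOOK = "book"
            sb.append("    public static final String ")
              .append(table.toUpperCase())
              .append(" = \"").append(table).append("\";\n");
        }
        sb.append("}\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(generate("Tables", List.of("book", "author")));
    }
}
```

Run on every build, such a generator guarantees that a renamed or dropped table turns into a compile error on the client, instead of a runtime surprise.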

Depending on how much power your client API offers (e.g., jOOQ or JPA), the generated meta model can be really rich and complete. Take, for example, implicit joins, introduced in jOOQ 3.11, which rely on generated meta-information about the foreign key relationships between your tables.

From now on, any increment of the database automatically updates the client code. For example, imagine:

ALTER TABLE book RENAME COLUMN title TO book_title;

Would you really want to do this job twice? No way. Just commit the DDL, run it through your build pipeline, and receive the updated entity:

@Entity
@Table(name = "book", indexes = {
 
  // Did you think about this?
  @Index(name = "i_book_title", columnList = "book_title")
})
class Book {
  @Id
  @GeneratedValue(strategy = IDENTITY)
  int id;
 
  @Column("book_title")
  String bookTitle;
}

Or the updated jOOQ classes. Most DDL changes affect not only syntax but semantics as well. It is therefore very useful to be able to see, in compiled code, which code is (or could be) affected by the database increment.

The only truth

Regardless of the technology you use, there is always one model that serves as the single source of truth for a subsystem, or at least that is what we should strive for, avoiding the enterprise-wide muddle where "truth" is everywhere and nowhere at once. Everything can be so much simpler. If you just exchange XML files with some other system, just use XSD. Take jOOQ's INFORMATION_SCHEMA meta model in XML form:
https://www.jooq.org/xsd/jooq-meta-3.10.0.xsd

  • XSD is well understood
  • XSD marks up XML content very well and allows validation in all client languages
  • XSD is well versioned and highly backward compatible
  • XSD can be translated into Java code using XJC

The last point is important. When we communicate with an external system using XML messages, we want to be sure our messages are valid. That is very easy to achieve with JAXB, XJC, and XSD. It would be sheer madness to expect that, with a Java-first design approach where messages are modeled as Java objects, those objects could somehow be mapped sensibly to XML for consumption by another system. XML generated that way would be of poor quality, undocumented, and hard to evolve. If there were a service-level agreement (SLA) on such an interface, we would violate it immediately.
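The validation half of this needs nothing beyond the JDK. A minimal sketch using the standard javax.xml.validation API; the XSD and the messages here are toy examples invented for illustration, not jOOQ's actual schema:

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

// Validates an XML message against an XSD using only JDK classes.
public class XsdValidation {

    // Toy schema: a single string-typed "book" element.
    static final String XSD =
        "<?xml version=\"1.0\"?>" +
        "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">" +
        "  <xs:element name=\"book\" type=\"xs:string\"/>" +
        "</xs:schema>";

    public static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(
                new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            // Throws SAXException if the document violates the schema
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<book>SQL Basics</book>"));   // conforms
        System.out.println(isValid("<magazine>Nope</magazine>")); // does not
    }
}
```

Combined with XJC-generated classes, this gives both ends of the contract: the XSD is the truth, the Java classes are its generated local representation, and validation enforces it at the boundary.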

To be honest, this is exactly what happens with JSON APIs all the time, but that is another story; I will pick that fight another day...

Databases: they are all alike

Working with databases, you understand that they are all basically the same. The database owns its data and must manage the schema. Any modifications made to the schema must be implemented directly in DDL so that the single source of truth is updated.

Once the source of truth is updated, all clients must update their copies of the model as well. Some clients may be written in Java using jOOQ and Hibernate or JDBC (or both). Other clients may be written in Perl (good luck to them), still others in C#. It does not matter. The main model is in the database. Models generated out of an ORM instead are usually of poor quality, poorly documented, and hard to evolve.

So do not make this mistake. Do not make it from the very start. Work from the database. Build a deployment pipeline that can be automated. Let code generators conveniently copy your database model and distribute it to the clients. And stop worrying about code generators. They are good. They will make you more productive. All it takes is a little time to set them up properly at the start, and you get years of improved productivity for the rest of your project's life.

You can thank me later.

Clarification

To be clear: this article in no way advocates bending your entire system (i.e., the domain, business logic, etc.) to fit your database model. What I am saying is that the client code that interacts with the database should act on the basis of the database model, rather than modeling the database again in "first class" status inside the client. Such logic usually lives in your client's data access layer.

In two-tier architectures, which still survive in places, this may well be the only model of your system. In most systems, however, I regard the data access layer as a "subsystem" that encapsulates the database model.

Exceptions

There are exceptions to every rule, and I have said before that the database-first and source-code-generation approach can sometimes be a poor fit. Here are a couple of such exceptions (there are probably others):

  • When the schema is unknown and must be discovered. For example, when you provide a tool that helps users navigate any schema. Phew: no code generation here. But still, database first.
  • When the schema must be generated on the fly to solve some problem. This looks like a slightly fancy variant of the entity-attribute-value pattern, i.e. you do not really have a well-defined schema. In that case, you often cannot even be sure that an RDBMS is the right choice at all.

Exceptions are by nature exceptional. In most cases involving the use of RDBMS, the schema is known in advance, it is inside the RDBMS and is the only source of "truth", and all clients have to acquire copies derived from it. Ideally, this should involve a code generator.

Source: habr.com
