Pragmatic me

The blog

Five reasons database integration sucks

It’s common across organisations to employ multiple IT applications that need to share data. CRM, billing systems, and technical support portals might all need to access the list of customers and their account details. Each one will no doubt be backed by a database, and they might all support the same database vendor (SQL Server, Postgresql, or heaven forbid Oracle). It also makes sense to host each of their database schemas on a centralised central database server.

The close proximity of all this data makes it extremely convenient, and possibly attractive to further integrate these applications upon this database server. On the surface, this can appear to be a simple solution requiring no middleware nor web services. A shared view from one schema can allow other applications to gain read access to say, the canonical customer list from the CRM. The odd stored procedure with appropriate privileges might also allow other applications to exploit shared functionality, for example, changing the credit limit of a customer.

However, while this integration is initially easy to implement, it can introduce a wide set of constraints that reduce agility both in terms of application development and strategy; by reducing the organisation's ability to switch out one of these applications.

1 It’s the very definition of tight coupling

Encapsulation is a key tenet in object orientation, SOA (service-oriented architecture), and other architectural styles that advocate loose coupling and high cohesion. The idea’s simple: the less collaborating applications know about each other, the easier it is to change, replace, or relocate one whilst minimising impact on the others. Encapsulation is typically implemented by separating implementation and a published external interface and ensuring any collaboration between applications is constrained through this interface.

Web architecture, and especially HATEOAS provide a great example of loose coupling where consumers such as browsers need only a base URI, a shared understanding of link relations, and a negotiated set of data formats to consume content. Providing a website continues to support this contract, it's free to change its database schema or even vendor with limited impact on the consumers. How an application persists it’s data should be considered an internal design detail, and allowance for change should be encouraged to allow the application to evolve. If other applications integrate with it by way of direct database links, a schema change might impact not just the application itself, but also it’s collaborators. This can drastically lessen the predictability for any unwitting developer asked to make such change. As well as understanding the change itself, they'd also be responsible for identifying and mitigating the impact to other applications, and potentially enable the orchestrated deployment of all related applications needed to keep them compatible.

2 Tight coupling can cause development pain

Modern development practice typically employs extensive test automation to ensure applications function correctly, and continue to do so in the future. Test automation in turn commonly relies on test isolation using test fakes, thus allowing the scope of tests to be constrained to a single application. This isolation delivers an array of benefits, although it does require the interface between an application and it’s dependencies to use protocols that allow faking. Examples of such protocols include HTTP and SMTP, which allow separation of the interface and implementation. Conveniently, many tools support stubbing of services using these protocols such as VCR.

On the other hand, database integration through SQL conflates the technology protocol with implementation details such as table names, column definitions, and implied knowledge of triggers. These concerns can be somewhat mitigated by establishing an API using stored procedures, although to maintain performance, these APIs frequently end up leaking details of the underlying database schemas, and therefore allowing tight coupling.

Database integration creates difficulties for isolated test automation. Further, equivalent tools like VCR aren’t available to fake SQL commands. Thus test automation engineers are left to home-bake tools and approaches.

The ‘opposite’ of isolated testing is end-to-end integration testing, which is frequently criticised for being unreliable, slow, and complex. Further, end-to-end testing often requires the correct versions of all collaborating applications to be deployed to a test environment to enable test execution. Licensing and resource requirements can constrain this, and often the end result are shared development environments. Shared environments throttle parallel development within a software team causing bottlenecks, and unintended systems breakages based on individual developer activities.

3 Tight coupling can prevent technical progression

Earlier, I described how tight coupling through database integration can constrain changes in an application's schema due to it negatively impacting consumers. For example, it could prevent an application maintainer's ability to change a table name to better reflect system functionality. However, it can also prevent more fundamental improvements. Imagine the component’s team need to migrate to a NoSQL database such as MongoDB, Neo4J, or Redis for it’s better suitability to the task. Perhaps they would like to migrate from a database to a CMS, replicated filestore, or a persistent queue. These changes would not be possible without significant change to collaborators that integrate at the database level, and thus this need for progression can be denied.

This is a further illustration of how tight coupling can reduce a team or organisation’s agility; their ability to choose technologies, approaches, and tools to suit an ever-changing environment.

4 Paradoxically, it can reduce performance

Increased performance is commonly cited as the justification for database integration. Enterprise databases are relatively fast and tuneable. Integrating at the database level can, on first inspection, appear the most performant option for sharing data between applications.

However, there's another option; applications can cache data owned by collaborating systems, thus avoiding the need to request it from its owner frequently. However, this cache needs updating. This can be implemented using a traditional request and cache approach with associated cache invalidation techniques. Alternatively, techniques exist such as event sourcing where data owning applications can raise notifications of organisation level events that signify data updates, and thus the need for data consuming applications to update their caches.

Further, the advantage of a consumer caching data owned by other applications offers performance improvements not just limited to reducing the need to frequently make requests to the owning schema. In addition, the cache can be optimised for it's sole responsibility of serving the consumer; it's data can be simplified, de-normalised, compressed to fit in RAM. Greater knowledge of usage patterns allows for many optimisations that provide potentially far greater performance than repeatedly going back to the data owner's schema arguably at the expense of a more complex architecture.

5 Databases are the hardest (and most expensive) components to scale

Databases are hard to scale and understandably so. CAP theorem, to which they're subject to, is complex and databases are typically designed to optimise for consistency and availability at the expense of scalability. Database scaling is often only achievable by buying expensive enterprise licenses and deploying complicated clusters of database servers. Even then, scalability can still be constrained. Further, integration itself can add even more resource pressure to a database server already operating close to capacity.

However, as introduced in the last section, autonomous datastores for each application including caches of collaborator's data can reduce the reliance on one large enterprise database server or cluster. Instead, each application can, if required, be deployed with its own database server or cluster allocated with exactly the computing resources needed for that application. It's easier to adjust the resources available on a per-application basis as demand varies, rather than having the more complicated challenge of scaling 'all the things'.

A side benefit of this approach is that it eliminates the need for synchronous collaboration between applications, and thus can offer a more resilient architecture; applications can continue to serve users to a limited fashion with cached data if their collaborators fail.

So does database integration suck, and if so, what are the alternatives?

In case I’ve not been sufficiently clear, I believe so. Integrating applications via their databases constrains performance, scalability, and a product team's ability to evolve application architecture to take advantage of advances in datastores. It tightly couples applications limiting test automation through lack of support for isolated testing, which in turn prevents effective widespread test automation, and the ability for developers to make local changes to applications without the fear of negatively impacting other applications.

Clearly, applications have to integrate across organisations to provide value, and I’d strongly recommend researching and applying alternative integration options. SOA (microservices) is the obvious option. Each service is deployed separately, and if they do require a database, it's considered an internal detail. Another service can only collaborate with another service through it’s published interface which prevents tight coupling. Additionally, embracing enterprise service buses, message queues, and REST / HATEOAS can increase an organisation’s ability to change or replace services while limiting impact across other applications.

Through this post, I hope I've conveyed my interest in this subject. I’d love to hear your opinions and chat more on the topic.