Learning Resources for Software Engineering Students »
Authors: Syed Abdullah
The programs that we create would handle data in some way or another. Be it a simple calculator application that tabulates user calculations (and maybe store them in a log of recent calculation) or a cluster of servers that handle indexing of a large number of documents. Data is what is worked on by any program, the inputs and outputs of a program or even a simple function are data.
As our programs get larger and the data that is worked on becomes much more complex, there needs to be a way for us to systematically store and access data that is being worked on. It could be a crude structure, designed specifically for a particular use case or something that is more standardised, for instance, an application that handles data.
This notion of structuring data and providing an easy to use abstraction for accessing, storing and performing mundane operations on data (for instance, sorting them) is what this guide will cover.
A database is a collection of data. While any data in a medium can theoretically be a database (for instance, scribbling on a piece of paper), databases in the context of this chapter typically involve data that are organised in some manner.
Programs generally do not access the raw data directly. Instead, a database management system (DBMS) is used, which will handle the storage, retrieval and updating of the data.
Sometimes people use the term database to refer to a DBMS. However, to avoid any confusion, this guide will use these terms as defined above.
There are various concepts in the field of databases and DBMS and this guide will cover the basic concepts that are useful for someone who is starting out.
For databases to make any sense, there has to be a certain logical structure in the database (for instance, how data is stored). This logical structure is known as a database model.
There are many types of database models. Some examples are:
Click here to learn more about the different database models.
There are advantages and disadvantages to utilising one of the many database models available. These different models seek to solve certain issues when programs deal with data.
For instance, the relational model is the most structured out of the three that were shown. The presence of structure allows the database to enhance and improve certain common operations, for instance, searching for a specific data for an entry would be faster.
However, having such a rigid structure would mean that there's a limitation on how and what kind of data can be stored in the database.
For the purposes of introduction, we would mainly cover on aspects that are used in the relational and/or document-oriented database models.
The most popular database model, relational model, makes use of relations. This model assumes that the data to be stored follow a certain 'pattern'. For instance, a database that stores products sold by a shop would contain data such as: name, description, price and current stock levels. A visualisation of the database and data can be seen below:
|Bread||Sliced for your convenience.||1.40||50|
|Water||Essential for life.||0.50||1000|
From this visualisation, we can define the different parts of the relation:
Click here for a more in-depth and formal definition of relations.
Databases are useless if the data cannot be used in a meaningful manner. However, uncontrolled access to the database would not be ideal, as it might cause problems, especially when other actions may depend on the previous action's result.
Take for instance, a program that transfers money from one bank account to another. The actions that the program would need to do involves:
A problem arises if another action takes place in between any of the steps (e.g. another transfer from the same sender), or if any of the steps fails (e.g. due to a program crash). As a result, it may cause the data to be manipulated in an undesired manner (e.g. money not credited to receiver).
Thus, transactions allow us to guard against these problems. A transaction symbolises a logical unit of work, which consists of multiple database actions, performed on a set of databases. [Source] Properties of a database transaction ensures that these actions are done in a predictable (i.e. in the particular order) and reliable (i.e. all actions must be performed correctly) manner.
Thus, the transaction that would be implemented in the program could be something like this:
As demonstrated above, the actions that needs to be done in order to transfer funds is wrapped as one large transaction. Hence, the funds transfer can be seen as the unit of work to be done on the database. The actions that make up the transaction are executed as though transaction is a single action.
For a transaction to be considered as an implementation of the transaction concept, it has to satisfy the ACID principle.
This principle states that a transaction must contain these characteristics:
The above section demonstrates how data can be related to each other. However, this demonstration assumes one thing: there is only one record of the data that is stored. What if there is a need to scale the database in such a way that the data is distributed across several servers?
Relational model DBMSes usually do not scale as well, as the ACID principle, more specifically, durability, forces the database to propagate any changes to the data across all servers. One famous theorem, the CAP theorem, states the a distributed computer system can only fulfil two out of three guarantees.
|Consistency||Read should receive latest write|
|Availability||Every request receives a response (non-error)|
|Partition tolerance||System works even though there are some communication errors|
A relational model DBMS trades off availability for consistency. As the changes are propagated across the network, subsequent requests might be dropped by the DBMS as the current state of the database violates ACID.
However, in other DBMSes, like MongoDB, consistency is the trade off. This allows the database system to scale up to multiple nodes, as all requests are served, but the requests may result in incorrect or out of date data.
As such, these DBMSes follow the BASE philosophy:
A formal method of modelling the relations that have been demonstrated in this chapter is through the use of relational algebra. This is a formal method for modelling the data and actions performed on a relational database.
We have seen how databases are structured and how the underlying DBMS ensures that a certain set of characteristics, with regards to the system, hold true.
Now, the data consumer (for instance, an application or an actual human) would preferably want to access the data in a manner that is not DBMS specific. The DBMS implementation should have very little effect on the actual method of accessing the data. If there's a need to switch over to a different DBMS that has the same set of features as the previous DBMS, the application should preferably not have to change its method of accessing the data.
Query languages solves that issue, as some of them are designed to be platform-independent. As such, the query language can be seen as an abstraction of the possible actions that can be performed on a specific set of DBMSes. However, be forewarned that query languages are not totally platform-independent, as certain DBMSes may implement features that are unique to the certain DBMS.
Take for instance SQL, which is one of the most popular query languages for relational DBMSes. While most features in the language are supported by relational DBMSes that uses SQL, certain features, for instance
SAMPLE (which allows the consumer to pick a random set of data) are not available on all of the DBMses that supports SQL.
Another level of abstraction is the database abstraction layer. This is usually an API level solution, as the programmer does not even need to know about a specific query language. Some abstractions are DBMS agnostic and as such, can be used to access data from any kind of DBMS, regardless of its features.