Do all the things like ++ or -- rants, post your own rants, comment on others' rants and build your customized dev avatarSign Up
From the creators of devRant, Pipeless lets you power real-time personalized recommendations and activity feeds using a simple APILearn More
Voxera9095211dLogging and audit.
IntrusionCM5206211dI'm gonna be very nitpicky now.
Not because I want to make you angry or insult you, but because you asked a technical question.
Database. What do you mean?
There are tons of tons of tons of different approaches regarding database out there.
Guess you mean an relational database..... It's very important to distinguish this, since a relational database usually means you want ACID compliance.
And ACID compliance is one of the reasons heavy write is a touch harder than just read only.
But an RDBMs must not run in an ACID compatible mode, so ACID would be a requirement you should mention in your question.
You mentioned a ratio of 10:1 ...
Can you explain in more detail what you mean by that?
That for 10 single write operations there is one read operation?
Where do these numbers come from? They seem... Arbitrary.
Note: I'm avoiding the word query / transaction since that's extremely ambiguous.
An operation is a neutral term that doesn't include "time".
That's the second factor which I'm missing from your question.
There are different heavy write scenarios, and each scenario requires a different solution.
Most common are imho (terms are chosen randomly):
- logging (write once, seldom read)
- data processing (write once, read later, after read maybe deleted)
- data filtering (write once, maybe read later, after read maybe deleted)
Logging is usually never a problem. You'll just insert data.
Processing is where things get funky.
Since you insert and read from the table at the same time (!).
In the worst case, you even want to delete data.
In an RDBMs with ACID, processing is cumbersome, since locking can bite u.
Filtering is same as processing, but has a tiny detail that can make it very painful: You insert and you select data from "random" parts of the table. Random in the sense that the read operation locking will be tricky, since - in worst case - you'll filter few rows from billion.
EDIT: Filter increases locking, additionaly by the maybe delete.
invisible128211d@IntrusionCM I did mean RDBMS specifically. But I think we can add NoSQL databases which are used to store and retrieve data.
The read:write ratio of 10:1 that i specified is arbitrary. Many applications read the data more than write them into databases.
The scenarios you mentioned is what I wanted to know and gather more knowledge on.
I know about logging (and audit) as 2 cases where the amount of data written is more than the amount of data read. Even here, there could be cases where the logs/audit info is read more number of times than read.
I want to know more about the other 2 cases. Data processing and data filtering. Can you give an example of a scenario? (either an application or a part of an application)
Sorry if my initial question was too vague. To simplify it, if an interviewer asks for a use case where more data is written into db than read from it, I want to be able to provide examples that aren't logging or audit.
First, since we are now at RDBMs specifically, NoSQL isn't (necessarily) an solution to heavy write.
In most cases, it's adding a whole bunch of new things to worry about (especially transactional handling of the process itself, not transactional in database sense).
Data processing is the classic.
You've an incoming data stream… eg from an import of CSV data, monitoring data, logging data, anything.
The filtering and processing are quite the same.
Let's say the input is delivered by an REST API.
You'll continously insert data that is fed from the outside. In monitoring, this could be eg. temperature data from hardware.
The processing part is transforming the incoming data to useable data. Eg. you read every temperature for every hardware (eg CPU) written in the last 5 seconds and generate an AVG from it.
An average is an aggregate function, so you'll have one read operation, but waaay more write operations.
For data filtering I would change the example. Filtering makes most sense when the data that's delivered via your rest API might not be in order.
Let's say you're importing statistical data. You'll have tons of import files lying in a folder. Reading the files, analyzing the data, ordering it and then inserting would be extremely resource intensive.
But since each file can only contain a specific time range, you'll utilize a two fold database process.
You'll just insert all data via the rest API. As soon as eg an year is covered, you start via an second API the aggregation part for the finished time range.
Inserting data unsorted - API Write.
Keeping track of time ranges - Process feeding API Write.
When time range fulfilled - notify API read.
API read then get's the timerange that should be processed and maybe additional filters.
It then extracts / aggregates the data.
Voxera9095211dMy opinion is that if you really beed to worry about performance you first need to understand your workload and how different parts depend on each other.
An example I have is we had a solution a bit similar to google analytics (this was before analytics got big) where we had massive amounts if writes but in the end only displayed aggregates or very small parts.
Our approach was a solution where every page visit was written to a document database appending to a single document per “session” and then had a separate process reading those documents to generate aggregates, writing them to another db for reading, and once no more pages appear the document is erased from the first db.
The dbs are atomic on write and lock on document but since there very rarely are any competition on the same document this is not a problem.
And since the transfer process runs on its own, any delays is also acceptable, its not real time data anyway.
This way we could handle millions of log events per day on a single machine back in 2005 with lots of performance to spare.
But the solution was all dependent on the needs of the customer, if we had needed more real time statistics it would not be a viable solution.
vane9223211dFor example OLTP and OLAP are write heavy
You can optimize database schema for write performance. For example sqlite writes can be faster then using filesystem directly.
To achieve better write performance you need to understand how disk i/o works and optimize there.