13
dmonkey
3y

Ok now I'm gonna tell you about my "Databases 2" exam. This is gonna be long.

I'd like to know if DB designers actually have this workflow. I'm gonna "challenge" the reader, but I'm not playing smartass. The mistakes I point out here are MY mistakes.

So, in my uni there's this course, "Databases 2" ("Databases 1" is relational algebra and theoretical stuff), which consist in one exercise: design a SQL database.

We get the description of a system. Almost a two pages pdf. Of course it could be anything. Here I'm going to pretend the project is a YouTube clone (it's one of the practice exercises).

We start designing a ER diagram that describes the system. It must be fucking accurate: e.g. if we describe a "view" as a relationship between the entities User and Video, it MUST have at least another attribute, e.g. the datetime, even if the description doesn't say it. The official reason?

"The ER relationship describes a set of couples. You can not have two elements equal, thus if you don't put any attribute, it means that any user could watch a video only once. So you must put at least something else."

Do you get my point? In this phase we're not even talking about a "database", this is an analysis phase.

Then we describe the type dictionary. So far so good, we just have to specify the type of any attribute.

And now... Constraints.
Oh my god the constraints. We have to describe every fucking constraint of our system. In FIRST ORDER LOGIC. Every entity is a set, and Entity(e) means that an element e belongs to the set Entity. "A user must leave a feedback after he saw a video" becomes like

For all u,v,dv,df,f ( User(u) and Video(v) and View(u, v, dv) and feedback(u, v, f) ) ---> dv < df

provided that dv and df are the datetimes of the view and the feedback creation (it is clear in the exercise, here seems kinda cryptic)

Of course only some of the constraints are explicitly described. This one, for example, was not in the text. If you fail to mention any "hidden" constraint, you lose a lot of points. Same thing if you not describe it correctly.

Now it's time for use cases.
You start with the usual stickman diagram. So far so good.
Then you have to describe their main functions.

In first order logic. Yes.

So, if you got the point, you may think that the following is correct to get "the average amount of feedback values on a single video" (1 to 5, like the old YT).
(let's say that feedback is a relationship with attribute between User and Video

getAv(Video v): int
Let be F = { va | feedback(v, u, va) } for any User u
Let av = (sum forall f in F) / | F |
return av

But nope, there's an error here. Can you spot it (I didn't)?

F is a set. Sets do not have duplicates! So, the F set will lose some feedback values! I can not define that as a simple set!

It has to be a set of couples, like (v, u), where v is the value and u the user; this way we can have duplicate feedback values in our set.

This concludes the analysis phase. Now, the design.

Well we just refactor everything we have done until now. Is-a relations become relationships, many-to-many relationships get an "association entity" between them, nothing new.
We write down on paper every SQL statement to build any table, entity or not. We write down every possible primary key or foreign key. The constraint that are not natively satisfied by SQL and/or foreign keys become triggers, and so on.

This exam is considered the true nightmare at our department. I just love it.

Now my question is, do actually DB designers follow this workflow? Or is this just a bloody hard training in Pai Mei style?

Comments
  • 5
    This practice is only used when migrating or redesigning an existing enterprise system in order to preserve functionality and ensure data consistency. The chance of you coming across it outside of study is about 0, unless you end up as a codemonkey for some large corporation and need to implement such specifications.

    See, projects simply aren't designed like that any more. If I go to your example where you're dealing with video views and ratings: these two won't be in the same database as users and videos. There won't be any constraints. They'll get pushed to some "data hose", aggregated, and archived. Maybe the sistem will periodically prune views and ratings for deleted users and videos, but they probably won't even bother, since there's no identifying data there.
  • 0
    🤔
  • 1
    @hitko Then I wonder, how does db design works nowdays? Is there any formal practice or is it just "pen, paper and experience"?

    Of course I suppose it depends on the domain, but you said that in a situation like the one I mentioned "videos, feedbacks and users" would be separate. What influences such decision?
  • 1
    @dmonkey There are still tools like you've described for designing entities and relationships, but you start from the other end, focusing on the way some data will be consumed and produced rather than on the data itself.

    For example:
    You define User, Video, and View. Then you check who will produce and consume each of these: users and videos will be produced and consumed by an application connected to the database. But views will be produced by some kind of logger, which might or might not be a part of the same application, and they won't be consumed directly. Instead, application will only consume them in some aggregated form, e.g. as total amount of views per video, or as an index where it can quickly check whether a user has seen given video at least once. Then you check which constraints are absolutely necessary: if you delete a user, you have to delete their videos, but that doesn't affect the fact that someone once watched some video, so there's no need for constraints there.
  • 0
    @hitko Moving on, you can see that:
    - logger only needs a key from user and video
    - view is only created by logger
    - view doesn't have to be deleted when linked user or video is deleted
    - no entity depends on view
    - view is only consumed by aggregator
    - aggregator doesn't consume any other entity along with view (it doesn't perform a join while aggregating views)

    This makes 'logger -> view -> aggregator' a "black box", i.e. what's inside doesn't affect our data structure and relationships / constraints, and nothing that happens outside affects views after they've been created.
  • 0
    @hitko I see, thank you for sharing. This is gonna be useful
Add Comment