
I've looked in many places for an answer to this question - most of the top Google search results are copy-pasted from a single source, and the others are not particularly helpful. I'm not sure whether I'm allowed to include the links from my search, so I'm refraining for now.

URL shortening has two aspects:

  1. Write: take a long URL, store a mapping of short to long, return the short URL.
  2. Read: take a short URL, find the long URL, and possibly redirect to it.
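Concretely, both operations reduce to something like the following sketch (single node, in-memory store; the alphabet, code length, and domain are made-up placeholders):

```python
import secrets

# Stand-in for the real mapping store: short code -> long URL.
MAPPINGS: dict[str, str] = {}

ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

def write(long_url: str) -> str:
    """Write path: generate a key, record the mapping, return the short URL."""
    code = "".join(secrets.choice(ALPHABET) for _ in range(7))  # collision handling omitted
    MAPPINGS[code] = long_url
    return f"https://sho.rt/{code}"

def read(code: str) -> str | None:
    """Read path: look up the long URL; the caller would then redirect (301/302) to it."""
    return MAPPINGS.get(code)
```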

The common design I see is:

> (Key Generation service) ----> (Application server)
>         |                              |
>         |                              |
>         |                              |
>         V                              V
>     (Key DB)                    DB with Short vs Long mapping

It seems the same application server is doing both read and write activities.

Now for data partitioning - I get the same advice from almost everywhere: hash (possibly with consistent hashing) the long URL to a particular server. This only helps the write aspect. For reading, we would need to ask every application server: do you have a long URL for this short URL? And everyone admits that reading is much, much more important than writing for this design. So they are optimizing for the least important use case?
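To make the mismatch concrete, here is a rough sketch of what that advice implies (the shard count and the simple modulo scheme are made up; real consistent hashing is elided):

```python
import hashlib

NUM_SHARDS = 8
SHARDS = [{} for _ in range(NUM_SHARDS)]  # stand-ins for per-server databases

def shard_for(key: str) -> int:
    """Pick a shard from a stable hash of the key."""
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % NUM_SHARDS

def write(long_url: str, short_code: str) -> None:
    # Write path: the long URL is in hand, so the owning shard is known.
    SHARDS[shard_for(long_url)][short_code] = long_url

def read(short_code: str) -> str | None:
    # Read path: only the short code is in hand, so shard_for(long_url) cannot
    # be computed and every shard has to be asked for this code.
    for shard in SHARDS:
        if short_code in shard:
            return shard[short_code]
    return None
```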

Am I missing something very basic ?

  • But the short URL is your key or hash that you can use for sharding, right?
    – amon
    Commented Aug 15, 2020 at 8:35
  • Two answers, sorry: 1. At the time of the write request, this short URL doesn't exist yet. 2. Hashing to create the short URLs has many problems, so typically people suggest using a key generator to create them. So mostly, no, the sorry URL is not the hash.
    – Jeeves
    Commented Aug 15, 2020 at 9:08
  • I meant the short URL is not the hash.
    – Jeeves
    Commented Aug 15, 2020 at 9:15
  • 1
    how many urls are you planning to store on which sufficiently tiny machines that you can't store all of your short/long (aka key/value) pairs on each machine? in other words, do you need data partitioning at all? why not just replication (for availability)?
    – davidbak
    Commented Aug 17, 2020 at 23:00

2 Answers


When dealing with a distributed system, you have to accept the fact that, at some point, your nodes will see different data, either because of network partitions or because of network latency. Once you accept that, you can decide what the application will do: throw an error, try to rebuild the data, go look for the data somewhere else, display stale information, etc.

I can't know your exact use cases from the information in your question, so I'm making some assumptions in what follows:

  • You can replicate the writes. Once a write arrives on a server to map a long URL to a short one, you write the data on that server and then send it to all the other servers to store as well. When a read later arrives asking for a short URL, whichever server receives it already has the data.
  • Of course, as already mentioned, the servers might see different data. The previous point works for all servers that receive the data and update their DB. If some servers are offline for some reason, they will be missing some data when they come back online. If such a server receives a read for something it is missing, it goes and asks the other servers whether they have it; once it receives the data, it stores it so it doesn't have to look for it again (see the sketch after this list). You might also have periodic jobs that synchronize data between the servers eagerly, without being asked, instead of lazily on demand.
  • At some point you can still end up with no data and no way to retrieve it, in which case you need to decide what to do: fail with an error message, tell the user to come back later, shut down that node, etc. It really depends on your use case.
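That lazy, on-demand lookup is essentially a read-repair. As a rough illustration (all names are made up, `Peer` stands in for a client handle to another application server, and a real system would use a proper replication protocol):

```python
LOCAL_DB: dict[str, str] = {}  # stand-in for this node's database

class Peer:
    """Stand-in for a client handle to another application server."""

    def __init__(self, db: dict[str, str]):
        self.db = db

    def lookup(self, short_code: str) -> str | None:
        return self.db.get(short_code)

def read(short_code: str, peers: list[Peer]) -> str | None:
    # Fast path: this node already has the mapping.
    if short_code in LOCAL_DB:
        return LOCAL_DB[short_code]
    # Slow path: ask the other servers; the first one that knows wins.
    for peer in peers:
        long_url = peer.lookup(short_code)
        if long_url is not None:
            LOCAL_DB[short_code] = long_url  # repair the local copy
            return long_url
    # Nobody has it: the caller decides what to do (404, "try again later", ...).
    return None
```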

The approach is not very different from how a cache works: got the requested data in the cache? Serve it. If not, retrieve it from the main data source, store it, serve it. You might also look at how DNS works, for example, and borrow some ideas from there, since a mapping from a short URL to a long one is not very different from a mapping from a URL to an IP.

  • Yes, it makes sense. From your first point, I gather that all servers store nearly all the data. So if there is 10 TB of URL data in total (short, long, and some metadata) and we use 10 servers: total hard disk space needed is 150 TB?
    – Jeeves
    Commented Aug 15, 2020 at 9:58
  • Other constraints (like data constraints, which in turn can cause cost constraints) can of course affect how you decide to do things and how you build your application architecture. You can shard the data so that each server holds something different, but then you increase the number of lookups for a URL mapping, or you use the short URL as a way to identify the server that holds the data. What's even worse is that if one server goes down or is otherwise inaccessible, you lose the physical shard it holds, so you might no longer have access to a lot of mappings.
    – Bogdan
    Commented Aug 15, 2020 at 10:10
  • The important thing to remember is that there will be tradeoffs involved, that depend on your use cases and your constraints. You need to carefully analyze them before deciding on an implementation. It will also be helpful to extrapolate how much data you will have in 1 year, 3 years, 5, etc, so you don't build something now and in one year have to rewrite everything to change the architecture.
    – Bogdan
    Commented Aug 15, 2020 at 10:14
  • @Jeeves 10 TB would be enough for at least 5 billion URLs, which is massive. It's not necessary for every server to have its own DB as long as they can intelligently cache popular URLs. Typically, the vast majority of entries will be requested approximately never, so replicating or caching them makes no sense for performance.
    – amon
    Commented Aug 15, 2020 at 14:03

One option is to include a shard id in the tiny URL you generate, e.g. let the first n characters of the tiny URL represent the shard id. Then, when a read request comes in, your load balancer parses out the first n characters and uses them to route the request to the correct application server(s). The obvious disadvantage here is that it increases your tiny URL length by n characters.
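A minimal sketch of that routing (n = 2 here; the prefixes and server names are hypothetical):

```python
N = 2  # leading characters of the tiny URL reserved for the shard id

# Hypothetical routing table: shard id prefix -> application server address.
SHARD_TABLE = {
    "00": "app-server-0.internal",
    "01": "app-server-1.internal",
}

def make_short_code(shard_id: str, key: str) -> str:
    """Prepend the shard id when generating the code, so reads can route on it."""
    return shard_id + key

def route(short_code: str) -> str:
    """At the load balancer: pick the application server that owns this code's shard."""
    return SHARD_TABLE[short_code[:N]]
```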

Also, to clarify why you may want to use partitioning even if the data fits on a single machine (as it does in this case): one reason could be query throughput, since queries against a partition run over a smaller volume of data and are therefore faster. Another reason could be wanting to store older (and presumably less frequently accessed) data on different (maybe cheaper?) storage. I found this article pretty useful when reasoning about these things: https://docs.microsoft.com/en-us/azure/architecture/best-practices/data-partitioning
