535.Encode and Decode TinyURL
Note: This is a companion problem to the System Design problem:Design TinyURL.
TinyURL is a URL shortening service where you enter a URL such ashttps://leetcode.com/problems/design-tinyurl
and it returns a short URL such ashttp://tinyurl.com/4e9iAk
.
Design theencode
anddecode
methods for the TinyURL service. There is no restriction on how your encode/decode algorithm should work. You just need to ensure that a URL can be encoded to a tiny URL and the tiny URL can be decoded to the original URL.
Thoughts: 1. link to algorithm discussed in 534. Design TinyURL :
Hash Function: long_url -> md5/sha1: a. md5 convert a string into 128 binary bits, generally represented as 16 bytes hex:
http://site.douban.com/chuan-> c93a360dc7f3eb093ab6e304db516653; b. sha1 convert a string into 160 binary bits, generally represented as 20 bytes hex: http://site.douban.com/chuan-> dff85871a72c73c3eae09e39ffe97aea63047094
These two algorithms could make sure hash values are randomly distributed, implementation-wise simple.
Cons: the conflicts are inevitable. Any hash algorithm could have inevitable conflicts.
Solutions: 1. use (long_url + timestamp) as the hash function key. 2. When conflicts, regenerates the hash value(it's different because timestamp changes).
Overall, when urls are over 1 billion, there would be a lot of conflicts and the efficiency could be very low.
- Base62: Take short_url as a 62 base notation. 6 bits could represent 62^6=57 billion.
- This implementation simply adopts the base-62 (26*2 letters + 10 digits) mapping to the URL string. Take short_url as a 62 base notation. 6 bits could represent 62^6=57 billion.
- Each short_url represent a decimal digit. It could be the auto_increment_id in SQL database.
// You can type code here and execute it.
#include<iostream>
#include<algorithm>
#include<string>
using namespace std;
string idToShortURL(long int test){
char map[] = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
string shorturl;
while(test){
shorturl.push_back(map[test%62]);
test = test/62;
}
reverse(shorturl.begin(), shorturl.end());
return shorturl;
}
long int shortURLtoID (string shortURL){
long int id = 0;
for (int i = 0; i < shortURL.length(); i++){
if ('a' <= shortURL[i] && shortURL[i]<='z')
id = id * 62 + shortURL[i] - 'a';
if ('A' <= shortURL[i] && shortURL[i] <= 'Z')
id = id * 62 + shortURL[i] - 'A' + 26;
if('0' <= shortURL[i] && shortURL[i]<='9')
id = id * 62 + shortURL[i] - '0' + 52;
}
return id;
}
// function
int main(int argc, char** argv) {
unordered_map<int, string> record;
for (int i = 1; i< argc; i++){
cout<<"original URL: "<<(argv[i])<<endl;
record[i] = (argv[i]);
}
for (int i = 1; i< argc; i++){
cout<<"test URL: "<<record[i]<<endl;
string shortURL = idToShortURL(i);
cout<<"generated URL: "<<shortURL<<endl;
cout<<"Mapped URL back to ID: "<<shortURLtoID(shortURL)<<endl;
}
// int test = 66666;
// string shortURL = idToShortURL(test);
// cout<<"generated URL: "<<shortURL<<endl;
// cout<<"Mapped URL back to ID: "<<shortURLtoID(shortURL)<<endl;
}
Base 62 with Random Generator: by StefanPochmann's post
The following solution doesn't have these problems. It produces short URLs like
http://tinyurl.com/KtLa2U
, using a random code of six digits or letters. If a long URL is already known, the existing short URL is used and no new entry is generated.
class Codec:
alphabet = string.ascii_letters + '0123456789'
def __init__ (self):
self.long2short = {}
self.short2long = {}
def encode(self, longUrl):
"""Encodes a URL to a shortened URL.
:type longUrl: str
:rtype: str
"""
while longUrl not in self.long2short:
# randromly generate a code
shortUrl = ''.join(random.choice(Codec.alphabet) for _ in range(6))
# check the existence
if shortUrl not in self.short2long:
self.long2short[longUrl] = shortUrl
self.short2long[shortUrl] = longUrl
return 'http://tinyurl.com/' + shortUrl
def decode(self, shortUrl):
"""Decodes a shortened URL to its original URL.
:type shortUrl: str
:rtype: str
"""
return self.short2long[shortUrl[-6:]]
# Your Codec object will be instantiated and called as such:
# codec = Codec()
# codec.decode(codec.encode(url))
It's possible that a randomly generated code has already been generated before. In that case, another random code is generated instead. Repeat until we have a code that's not already in use. How long can this take? Well, even if we get up to using half of the code space, which is a whopping 626/2 = 28,400,117,792 entries, then each code has a 50% chance of not having appeared yet. So the expected/average number of attempts is 2, and for example only one in a billion URLs takes more than 30 attempts. And if we ever get to an even larger number of entries and this does become a problem, then we can just use length 7. We'd need to anyway, as we'd be running out of available codes.