xisto Community

Downloading The Internet

Recommended Posts

I don't know, but some institutions save on bandwidth costs by implementing a cache server or cache box. This device stores copies of internet requests (URLs, DNS lookups, images, code, etc.) for a set duration until the cache is purged. The cache pays off when staff or computer labs visit the same web sites all the time: pages appear to load faster than the Internet connection would allow, and bandwidth is saved.

Companies that build cache solutions, along with the major backbone ISPs, would have a better chance of answering (or at least making an educated guess at) the question you are trying to ask.

Levimage ;)

P.S. I myself have no clue. 10 Gb/s Ethernet is coming out soon, probably good news for DVD torrent hippies ;)
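The cache-box idea above can be sketched in a few lines. This is only a toy illustration, not how a real appliance like Squid works (real caches also honor HTTP Cache-Control headers, cache DNS lookups, and evict under memory pressure); the `loader` parameter and the 5-minute TTL are assumptions for the sketch:

```python
import time
import urllib.request

# Toy in-memory web cache: each fetched URL's body is kept until it expires,
# so repeat requests cost no bandwidth.
CACHE = {}          # url -> (expiry_timestamp, body)
TTL_SECONDS = 300   # purge entries after 5 minutes (assumed policy)

def fetch(url, loader=lambda u: urllib.request.urlopen(u).read()):
    """Return the body for url, serving repeat requests from the cache."""
    now = time.time()
    hit = CACHE.get(url)
    if hit and hit[0] > now:
        return hit[1]                  # cache hit: no bandwidth used
    body = loader(url)                 # cache miss: go out to the internet
    CACHE[url] = (now + TTL_SECONDS, body)
    return body
```

Every repeat visit within the TTL is served locally, which is exactly where the bandwidth savings come from.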


I'm wondering if it is possible to save a copy of everything on the Internet. Ignoring ISP data transfer limits (max GB per month), I have a download speed of approximately 4 Mbps.

The Internet isn't limited to web pages, though; it includes everything that is publicly accessible (not password-protected), which covers all music, videos, pictures, software, etc. Furthermore, I am not limiting it to HTTP servers: torrents, files on FTP servers, and anything on peer-to-peer networks (Gnutella/LimeWire) count as well.

Saving everything in its current state (ignoring changes to the live version after it is saved), how long would this take? What if I upgraded my Internet connection, or theoretically used all the bandwidth of, for example, educational institutions (universities), ISPs (Shaw, Comcast, etc.) and large corporations (Microsoft, Google, etc.)?

I am not talking about indexing content; I mean saving the actual files. Every web page would be one file, and pictures, JavaScript, CSS, etc. would be their own files.


I also think that would be a big problem: almost every computer is on the internet, so in order to back up everything reachable through the internet, you would need as much disk space as all the computers around the world combined. As you mention, Google's disk space would be peanuts, because you don't want to only index, you want the whole contents.


The disk space available to me is truly infinite; the only questions here are process and bandwidth, as well as the CPU and disk speed needed to save and access everything.

A "backup" of the Internet is for experimental purposes only, and believe me, even if I manage to back up the entire Internet, including private network content, I would be using almost no disk space (an analogy: one electron out of the entire universe).


Well, there are a lot of data centers holding millions of terabytes of data, which take a lot of energy to run, so I doubt you could compete with them ;)

For example, a lot of content is dynamic, so it would be really hard to tell different files apart, and a crawler could end up in an infinite loop unless it detected the differences and ignored some of the content on the Internet.

Moreover, as far as I know, Google doesn't offer its cached version of pages anymore... or does it? Maybe that is also due to resource costs?

https://archive.org/ offers the Wayback Machine, which is quite cool, but it's usually slow and it doesn't hold all the content; it doesn't cache everything and it rarely caches images. Well, it is a non-profit project ;)


Interesting question. I am actually surprised that you, FireFoxRules, asked it, as it sounds like a crazy idea that I would expect from a newb. At any rate, it did get me thinking, so I will propose an answer.

 

Assumptions

⢠You have an insane Internet backbone connection will guaranteed reliability and speed. I will assume that you have a 100 Mb/sec connection which is usually only available to ISP level organizations.

⢠You have an appropriately sized upstream connection to do all the requesting.

⢠You actually get the bandwidth you paid for. I personally have a â10 Mb/sec down and 1 Mb/sec upâ consumer cable connection. I have never seen anything close to these numbers in real life. The closest I have seen is 2 Mb/sec down (downloading ISOs from Microsoft MSDN) and there is a hard limit of around 115 kb/sec up that I constantly hit. A more typical download speed is around 500 kb/sec for regular web browsing.

⢠We will ignore all network structure and latency issues and assume you have a direct connection to your target with no hops in between.

o The nature of TCP/IP will limit you to around 80% of your bandwidth under ideal operation. Even when you have only two computers on a network (the ideal case), you will still never get 100% of the bandwidth because of TCP header overhead, IP header overhead, other traffic such as ARP requests, and IP timing issues. A typical network usually sees only 45-50% of its bandwidth because of collisions. A stressed-out network may only get 10%.

o There is latency between your request and the data.

Machine and router hardware delays. Usually microseconds.

Every hop adds delay. Usually milliseconds.

Server response time. Usually small compared to everything else but could become an issue. Ranges from milliseconds (typical) to minutes.

o In total, you should expect to take at least 50% off your promised bandwidth in an ideal case. This brings our 100 Mb/sec connection to more like 50 Mb/sec; but as stated earlier, we are ignoring this.

o Internet speed is based on more than your connection speed. The bandwidth of the server is also very important. You may have sufficient bandwidth, but if you request from a server that is slower than your connection, you are stuck with their speed. I find that a typical website will only transfer up to 50 kb/sec, so you will have to download from many different servers at the same time to fill your 100 Mb/sec pipe.

⢠You have enough computing power. At 100 Mb/sec you are starting to get into the range of IDE hard drive data transfer range. You will also want to have several threads going at the same time to maximize bandwidth utilization. You want to download a different webpage while you are waiting on the request for a separate page. Better yet, you want to keep your bandwidth pipe full even if you hit a slow server or a timeout which can be up to 2 minutes. I would guess that you would need 150-300 threads or requests going at the same time to meet this demand. A single computer likely will not be able to do this alone so you would end up with at least 5-10 servers on your end to pull this off. This of course breaks the idea case of no network congestion or collisions as described earlier.

⢠You have enough storage space. A quick search shows that YouTube alone has around 7.7 petabytes of content (http://beerpla.net/2008/08/14/how-to-find-out-the-number-of-videos-on-youtube/ . Newegg is showing 1TB hard drives for around $90. With the needed hardware and controllers, you are looking at around $100/ TB. At this rate you will need 7700 1 TB hard drives which would cost you around $770,000. A related article on BackBlaze (https://www.backblaze.com/blog/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/) shows you how to build your own 67 TB 4U rack server for $7,867 including drives and rack hardware. At the BackBlaze rate, 7.7 PB will cost $904,118 or almost 1 million dollars.


Gotchas

⢠Connection speeds are measured in BITS and not BYTES. There are 8 bits to a byte so this means that you need to divide your connection speed by 8 right off the top. This will make our 100 Mbit/sec connection a 12.5 Mbyte/sec connection. With typical network delays, this would become 6.25 Mbyte/sec.


Now let's do some calculations (whips out trusty TI-89 calculator).

 

12.5 Mbyte/sec * 60 sec = 750 MB/min

750 MB/min * 60 min = 45 GB/hour

45 GB/hour * 24 hours = 1080 GB/day, or ~1 TB/day (1.08e12 bytes)

 

With the YouTube example above of 7.7 petabytes (7.7e15 bytes)...

 

7.7e15 bytes / 1.08e12 bytes/day = 7129.63 days

7129.63 days / 365 days/year = 19.5332 years

 

Just downloading the YouTube archive over an insanely fast Internet connection will take you almost 20 years, and almost 1 million dollars in hard drive storage alone.

 

Hope this answers your question ;)


With your 4 Mbps download speed you'll never be able to keep up with all the data that is put on the internet daily, especially on sites like YouTube. (Do note, YouTube uses 7.7 PB for storing all the data, but for every video they keep the original plus a 360p, 480p, 720p and 1080p version where possible; the most efficient approach would be to store only the highest-resolution version or the original video.)

The next problem is power consumption. A single disk doesn't use a lot of power, especially compared to a modern CPU, but 1000 disks easily use a few kilowatts, generating heat which you then have to remove with cooling. You'll also need a huge room for all the racks, plus extra free air for cooling purposes (and spare capacity as a backup).

Imho, it's impossible to do.
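The power point above is easy to put numbers on. These figures are assumptions, not from the thread: roughly 6 W per spinning 3.5" drive is a typical idle-to-active draw, and 7700 drives is the 1 TB drive count from the earlier storage estimate:

```python
# Rough power draw of a disk farm big enough for the 7.7 PB estimate.
disks = 7700            # 1 TB drives from the earlier storage estimate
watts_per_disk = 6      # assumed average draw per 3.5" drive
total_kw = disks * watts_per_disk / 1000

print(total_kw)         # -> 46.2 (kW of disks alone, before CPUs and cooling)
```

And since essentially every watt ends up as heat, the cooling load is of the same order again.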


Five years from now it would not be impossible, but then again there will probably be 25 times as much data out there. Interesting, huh?


I came across this blog post while surfing; I'm not sure if this is the one company that is up to downloading and archiving the internet. Leaving aside the video and audio content on the web, text-based sites are easy to crawl and archive, I guess, if these companies are doing it.

Check that blog post again; it's an interesting read. I'm not sure whether the site still offers such a service or not. But I guess archiving Wikipedia is possible to some extent, and I think that would also be enough for some people.


5 years from now it would not be impossible but then again there will probably be 25 times the data out there. Interesting huh?

Five years from now you might archive the internet as it is now, but not the way it'll be in 5 years ;)


I think the topic starter was considering somebody with the throughput of an ISP: several hundred gigabytes per second. Unfortunately, these guys spend so much money renting such communication infrastructure that they urgently need to make money simply to survive, so they cannot do something as interesting as downloading the whole internet, even if they happen to have the disk space available.

And of course there is probably no point in spending a lot of time and a lot of money to copy data which is already free and probably backed up individually several times over.

And, of course, I would not appreciate somebody exhausting my website's bandwidth, probably making it temporarily unreachable, in order to perform a useless copy (useless for my own personal purposes, of course ;) ).


Well, for personal use or from a business point of view it would be completely useless to back up the internet (unless you charged money for recovering websites from your backup, but only a few companies would use that ;) ). For research purposes it might be interesting, though I can't really think of a good research question right now that could use the complete internet ;)

It doesn't have to make your website unreachable, either. If someone has a Gb downlink and downloads from 100 different websites at a time, he'll use (roughly) 10 Mbps of your website's bandwidth. Nothing to worry about :P
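That last claim is just division, but worth checking:

```python
# A 1 Gb/s downlink spread evenly over 100 sites at once.
downlink_mbps = 1000
concurrent_sites = 100
per_site_mbps = downlink_mbps / concurrent_sites

print(per_site_mbps)   # -> 10.0 Mb/s drawn from each site, on average
```

So a well-behaved crawler with many targets barely dents any single site's bandwidth.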

