Honesty Rocks! truth rules.

Downloading The Internet


levimage

I don't know, but some institutions save on bandwidth costs by running a cache server or cache box. This device or host stores copies of Internet requests (URLs, DNS lookups, images, code, etc.) until they are purged. The cache pays off when people, staff, or computer labs visit the same web sites all the time: pages appear to load faster than the Internet connection would allow, and bandwidth is saved.

Companies that build cache solutions, along with the major backbone ISPs, would have a better chance of answering, or at least making an educated guess at, the question you are trying to ask.

Levimage ;)

P.S. I myself have no clue. 10 Gbps Ethernet is coming out soon. Probably good news for DVD torrent hippies ;)
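To make the cache box idea concrete, here is a minimal sketch in Python. It is not how any particular product works; the class name `TinyCache`, the `fetch` callback and the 60-second lifetime are all made up for illustration, and the "download" is faked so the sketch runs offline.

```python
import time


class TinyCache:
    """Toy cache: keeps a copy of each response until its lifetime expires."""

    def __init__(self, ttl_seconds, fetch):
        self.ttl = ttl_seconds      # how long a copy stays valid (assumed value)
        self.fetch = fetch          # hypothetical function: url -> bytes
        self.store = {}             # url -> (expiry time, body)
        self.bytes_saved = 0        # traffic that never went over the uplink

    def get(self, url, now=None):
        now = time.monotonic() if now is None else now
        entry = self.store.get(url)
        if entry is not None and entry[0] > now:
            self.bytes_saved += len(entry[1])   # cache hit: no download
            return entry[1]
        body = self.fetch(url)                  # cache miss: go to the Internet
        self.store[url] = (now + self.ttl, body)
        return body


downloads = []


def fake_fetch(url):
    """Stand-in for a real download so the sketch runs offline."""
    downloads.append(url)
    return b"x" * 1000


cache = TinyCache(ttl_seconds=60, fetch=fake_fetch)
cache.get("http://example.com/", now=0)    # miss, downloaded
cache.get("http://example.com/", now=10)   # hit, served from the cache
cache.get("http://example.com/", now=100)  # expired, downloaded again
print(len(downloads), cache.bytes_saved)   # 2 1000
```

Three requests cost only two real downloads here; the more often the same page is requested within its lifetime, the more bandwidth the cache saves.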


FirefoxRocks

I'm wondering if it is possible to save a copy of everything on the Internet. Ignoring ISP data transfer limitations (max GB per month), I have a download speed of approximately 4 Mbps.

The Internet isn't limited to web pages, though. It includes everything that is publicly accessible (not password-protected), which covers all music, videos, pictures, software, etc. Furthermore, I am not limiting it to HTTP servers: torrents, files on FTP servers and anything on peer-to-peer networks (Gnutella/LimeWire) count as well.

Saving everything in its current state (ignoring changes to the live version after it is saved), how long will this take? What if I upgrade my Internet connection, or theoretically use all the bandwidth of, for example, educational institutions (universities), ISPs (Shaw, Comcast, etc.) and large corporations (Microsoft, Google, etc.)?

I am not talking about indexing content; I mean saving the actual files. Every web page would be considered one file, and pictures, JavaScript, CSS, etc. would be their own files.


yordan

I also think that would be a big problem: almost every computer is on the Internet, so in order to back up everything reachable through the Internet, your computer would need as much disk space as all the computers around the world combined. As you mention, Google's disk space would be peanuts, because you don't want to only index, you want the whole contents.


Бојан

First of all, you can't find enough free space to do that, or enough Internet bandwidth to download it. Second, why would you need a backup of the Internet? ;) Greetings.


FirefoxRocks

The disk space available to me is truly infinite; the only questions here are process and bandwidth, as well as the CPU and disk speed needed to save and access everything.

A "backup" of the Internet is for experimental purposes only, and believe me, even if I manage to back up the entire Internet, including private network content, I would be using almost no disk space (an analogy would be, say, 1 electron out of the entire universe).


Quatrux

Well, there are a lot of data centers holding millions of terabytes of data, and they take a lot of energy, so I doubt you could compete with them ;)

For example, a lot of content is dynamic, so it would be really hard to tell different files apart, and you could end up in an infinite loop unless you detect the differences and ignore some of the content on the Internet.
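One common way to avoid saving the same thing over and over is to fingerprint each downloaded body and skip anything already seen. The sketch below is only an illustration of that idea using SHA-256 from Python's standard library; the `SeenStore` class is a made-up name. It catches only byte-for-byte duplicates, so a dynamic page that changes a timestamp on every request would still look "new" unless the changing parts are stripped out first.

```python
import hashlib


def content_key(body: bytes) -> str:
    """Return a SHA-256 hex digest identifying this exact byte content."""
    return hashlib.sha256(body).hexdigest()


class SeenStore:
    """Remembers the digests of everything that has already been saved."""

    def __init__(self) -> None:
        self._seen = set()

    def is_new(self, body: bytes) -> bool:
        key = content_key(body)
        if key in self._seen:
            return False        # exact duplicate, no need to store it again
        self._seen.add(key)
        return True


store = SeenStore()
pages = [b"<html>hello</html>", b"<html>hello</html>", b"<html>bye</html>"]
saved = [page for page in pages if store.is_new(page)]
print(len(saved))  # 2: the second copy of the first page is skipped
```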

Moreover, as far as I know, Google doesn't offer its cached versions of pages anymore? Or does it? Maybe that is also due to resources?

https://archive.org/ offers the Wayback Machine, which is quite cool, but it's usually slow and it doesn't offer all the content. I mean, it doesn't cache everything and it rarely caches images. Well, it's a non-profit project ;)


tansqrx

Interesting question. I am actually surprised that you, FirefoxRocks, asked it, as it sounds like a crazy idea that I would expect from a newb. At any rate, it did get me thinking, so I will propose an answer.

 

Assumptions

• You have an insane Internet backbone connection with guaranteed reliability and speed. I will assume that you have a 100 Mb/sec connection, which is usually only available to ISP-level organizations.

• You have an appropriately sized upstream connection to do all the requesting.

• You actually get the bandwidth you paid for. I personally have a "10 Mb/sec down and 1 Mb/sec up" consumer cable connection. I have never seen anything close to these numbers in real life. The closest I have seen is 2 Mb/sec down (downloading ISOs from Microsoft MSDN), and there is a hard limit of around 115 kb/sec up that I constantly hit. A more typical download speed is around 500 kb/sec for regular web browsing.

• We will ignore all network structure and latency issues and assume you have a direct connection to your target with no hops in between.

    o The nature of TCP/IP will limit you to around 80% of your bandwidth under ideal operation. Even with only two computers on a network (the ideal case) you will never get 100% of the bandwidth because of TCP header overhead, IP header overhead, other traffic such as ARP requests, and IP timing issues. A typical network usually sees only 45-50% of its bandwidth because of collisions. A stressed-out network may only get 10%.

    o There is latency between your request and the data:

        Machine and router hardware delays. Usually microseconds.

        Every hop adds delay. Usually milliseconds.

        Server response time. Usually small compared to everything else, but it could become an issue. Ranges from milliseconds (typical) to minutes.

    o In total you should expect to lose at least 50% of your promised bandwidth even in an ideal case. This brings our 100 Mb/sec connection down to more like 50 Mb/sec; but as stated earlier, we are ignoring this.

    o Internet speed is based on more than your connection speed. The bandwidth of the server is also very important. You may have sufficient bandwidth, but if you request from a server that is slower than your connection, you are stuck with its speed. I find that a typical website will only transfer up to 50 kb/sec, so you will have to download from many different servers at the same time to fill your 100 Mb/sec pipe.

• You have enough computing power. At 100 Mb/sec you are starting to approach the data transfer rate of an IDE hard drive. You will also want to have several threads going at the same time to maximize bandwidth utilization: you want to download a different web page while you are waiting on the request for another one. Better yet, you want to keep your bandwidth pipe full even if you hit a slow server or a timeout, which can last up to 2 minutes. I would guess that you would need 150-300 threads or requests going at the same time to meet this demand. A single computer likely will not be able to do this alone, so you would end up with at least 5-10 servers on your end to pull this off. This of course breaks the ideal case of no network congestion or collisions described earlier.

• You have enough storage space. A quick search shows that YouTube alone has around 7.7 petabytes of content (http://beerpla.net/2008/08/14/how-to-find-out-the-number-of-videos-on-youtube/). Newegg is showing 1 TB hard drives for around $90. With the needed hardware and controllers, you are looking at around $100/TB. At this rate you will need 7,700 1 TB hard drives, which would cost you around $770,000. A related article on BackBlaze (https://www.backblaze.com/blog/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/) shows you how to build your own 67 TB 4U rack server for $7,867, including drives and rack hardware. At the BackBlaze rate, 7.7 PB will cost $904,118, or almost 1 million dollars.
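The storage and concurrency figures in these assumptions can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch using the numbers quoted in the post; the function names are made up, and reading "50 kb/sec" as 50 kilobytes per second per server is an assumption (it is the reading that lands inside the 150-300 range guessed above).

```python
import math

YOUTUBE_TB = 7_700        # 7.7 PB expressed in TB (figure from the post)
COST_PER_TB_USD = 100     # drive plus controller estimate from the post
POD_TB = 67               # capacity of one BackBlaze-style pod (from the post)
POD_COST_USD = 7_867      # price of one pod (from the post)


def bare_drive_cost(total_tb, cost_per_tb):
    """Cost of buying plain drives at a flat price per terabyte."""
    return total_tb * cost_per_tb


def pod_cost(total_tb, pod_tb, pod_cost_usd, whole_pods=False):
    """Cost of pods; optionally round up because you cannot buy part of one."""
    pods = total_tb / pod_tb
    if whole_pods:
        pods = math.ceil(pods)
    return pods * pod_cost_usd


def connections_needed(link_mbit_per_s, per_server_kbyte_per_s):
    """Simultaneous downloads needed to fill the link (assumes kilobytes/sec)."""
    link_kbyte_per_s = link_mbit_per_s * 1000 / 8
    return math.ceil(link_kbyte_per_s / per_server_kbyte_per_s)


print(bare_drive_cost(YOUTUBE_TB, COST_PER_TB_USD))                      # 770000
print(round(pod_cost(YOUTUBE_TB, POD_TB, POD_COST_USD)))                 # 904118
print(pod_cost(YOUTUBE_TB, POD_TB, POD_COST_USD, whole_pods=True))       # 904705
print(connections_needed(100, 50))                                       # 250
```

The $904,118 figure corresponds to a fractional number of pods; rounding up to 115 whole pods nudges it to $904,705, which does not change the "almost 1 million dollars" conclusion.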


Gotchas

• Connection speeds are measured in BITS and not BYTES. There are 8 bits to a byte, so you need to divide your connection speed by 8 right off the top. This makes our 100 Mbit/sec connection a 12.5 Mbyte/sec connection. With typical network delays, this would become 6.25 Mbyte/sec.


Now let's do some calculations (whips out trusty TI-89 calculator).

 

12.5 Mbyte/sec*60 seconds = 750 MB/min

750 MB/min* 60 mins = 45 GB/hour

45 GB/hour * 24 hours = 1080 GB/day or ~1 TB/day (1.08e12 bytes/day)

 

With the YouTube example above of 7.7 petabytes (7.7e15 bytes)…

 

7.7e15 Bytes/1.08e12 Bytes/day=7129.63 days

7129.63 days/365 days/year = 19.5332 years

 

Just downloading the YouTube database with an insane Internet connection will take you almost 20 years and almost 1 million dollars just in hard drive storage.
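The same arithmetic can be written as a small Python sketch so that other link speeds are easy to try. The function name and the `efficiency` parameter are illustrative; the 7.7e15 byte figure and the 100 Mbit/sec link come from the post, the 4 Mbit/s line is the home connection mentioned in the original question, and the 1 Gbit/s line is purely hypothetical.

```python
SECONDS_PER_DAY = 86_400
YOUTUBE_BYTES = 7.7e15   # 7.7 PB, the figure used in the post


def days_to_download(total_bytes, link_mbit_per_s, efficiency=1.0):
    """Days needed to pull total_bytes over a link of the given speed.

    efficiency=1.0 ignores all overhead, as the post does; 0.5 would model
    the "take 50% off" rule of thumb mentioned in the assumptions.
    """
    bytes_per_s = link_mbit_per_s * 1_000_000 / 8 * efficiency
    return total_bytes / (bytes_per_s * SECONDS_PER_DAY)


for label, mbit in [("4 Mbit/s", 4), ("100 Mbit/s", 100), ("1 Gbit/s", 1000)]:
    days = days_to_download(YOUTUBE_BYTES, mbit)
    print(f"{label}: {days:,.0f} days (~{days / 365:,.1f} years)")
```

At 100 Mbit/s this reproduces the 7129.63 days (about 19.5 years) worked out by hand; halving the efficiency doubles the time.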

 

Hope this answers your question ;)


wutske

With your 4 Mbps download speed you'll never be able to keep up with all the data that is put on the Internet daily, especially on sites like YouTube. (Do note: YouTube uses 7.7 PB to store all its data, but for every video they keep the original plus 360p, 480p, 720p and 1080p versions where possible. The most efficient way would be to store only the highest-resolution version or the original video.)

The next problem is power consumption. A single disk doesn't use a lot of power, especially compared to a modern CPU, but 1000 disks easily use a few kilowatts, generating a lot of heat that you then have to get rid of.

You'll also need a huge room for all the racks and for extra free air for cooling purposes (a backup).

IMHO, it's impossible to do.
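The power concern can be put into rough numbers with a short Python sketch. The 8 watts per drive used here is purely an assumed round figure for illustration, not something stated in the thread, and the function name is made up; cooling, controllers and servers would come on top.

```python
ASSUMED_WATTS_PER_DRIVE = 8   # illustrative assumption, not a measured value


def drive_power_kw(drive_count, watts_per_drive=ASSUMED_WATTS_PER_DRIVE):
    """Continuous electrical draw of the drives alone, in kilowatts."""
    return drive_count * watts_per_drive / 1000


for count in (1_000, 7_700):   # 7,700 is the 1 TB drive count quoted earlier
    kw = drive_power_kw(count)
    print(f"{count} drives: {kw:.1f} kW, {kw * 24:.0f} kWh per day")
```

Under that assumption 1000 drives draw about 8 kW, consistent with "a few kilowatts", and the 7,700 drives needed for the YouTube example draw several times that, all of which ends up as heat in the room.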


levimage

5 years from now it would not be impossible, but then again there will probably be 25 times the data out there. Interesting, huh?


mahesh2k

I came across this blog post while surfing. I'm not sure whether this is the one company that is up to downloading and archiving the Internet. Leaving out the video and audio content of the web, text-based sites are easy to crawl and archive, I guess, if these companies are doing it.

Check this blog post again. It's an interesting read, though I'm not sure whether that site still offers such a service. But I guess archiving Wikipedia is possible to some extent, so it's not a bad idea; I think that alone would be enough for some people.


wutske

levimage wrote: "5 years from now it would not be impossible but then again there will probably be 25 times the data out there. Interesting huh?"

5 years from now you might archive the Internet as it is now, but not the way it'll be in 5 years ;)

tansqrx

wutske wrote: "5 years from now you might archive the Internet as it is now, but not the way it'll be in 5 years"

As I demonstrated earlier, it is 20 years just to get YouTube.

wutske

tansqrx wrote: "As I demonstrated earlier, it is 20 years just to get YouTube."

Or about 2 years if you can afford an FTTH Gigabit connection ;)

yordan

I think that the topic starter had in mind somebody with the throughput of an ISP: several hundred gigabytes per second. Unfortunately, these guys spend so much money renting such communication infrastructure that they urgently need to make money simply to survive, so they cannot do something as interesting as downloading the whole Internet, even if they happen to have the disk space available.

And of course there is probably no point in spending a lot of time and a lot of money to copy data that is already free and probably backed up individually several times.

And, of course, I would not appreciate somebody exhausting my website's bandwidth, probably making it temporarily unreachable, in order to perform a useless copy (useless for my own personal purposes, of course ;) ).


wutske

Well, for personal use or from a business point of view it would be completely useless to back up the Internet (unless they charge money for recovering websites from their backup, but only a few companies would use that ;) ).

For research purposes it might be interesting, though I can't really think of a good research question right now that could use the complete Internet ;)

It doesn't have to make your website unreachable. If one has a Gb downlink and downloads 100 different websites at a time, he'll use (roughly) 10 Mbps of your website's bandwidth. Nothing to worry about :P


yordan

wutske wrote: "If one has a Gb downlink to download 100 different websites at a time, he'll use (roughly) 10 Mbps of your website's bandwidth. Nothing to worry about ;)"

If he really wants to do that, he will have tens of 8-gig fiber links, so he will exhaust your own website server's network bandwidth until the backup completes.

wutske

yordan wrote: "If he really wants to do that, he will have tens of 8-gig fiber links, so he will exhaust your own website server network bandwidth until the backup completes."

Then he'll have 800 downloads running at the same time ;)
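The back-and-forth above is just division, which a short Python sketch makes explicit. The function names are made up for illustration, and the assumption is that the downloader's link is split evenly across all the sites it is pulling from at once.

```python
import math


def per_site_mbps(link_gbps, simultaneous_sites):
    """Bandwidth each site sees if the link is shared evenly."""
    return link_gbps * 1000 / simultaneous_sites


def sites_to_stay_polite(link_gbps, max_mbps_per_site):
    """Simultaneous sites needed so no single site sees more than the cap."""
    return math.ceil(link_gbps * 1000 / max_mbps_per_site)


print(per_site_mbps(1, 100))         # 10.0 Mbps per site on a 1 Gb link
print(sites_to_stay_polite(8, 10))   # 800 sites at once on one 8 Gb link
```

So a bigger pipe does not have to mean a heavier load on any one site, as long as the number of sites being downloaded in parallel grows with it.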

tansqrx

I don't believe that an unknown Internet user creating a backup of your site will cause a significant site slowdown. While reviewing my Xisto-hosted site, ycoderscookbook.com, I have seen several instances of a complete site download. Some of the automated tools even left their calling card in the form of browser identification in the HTTP headers. I'm not sure why someone would want to copy my entire site, but I don't see anything wrong with it. During the months when this happens I only see a slight increase in Xisto bandwidth usage, and since Xisto has ISP-level servers and bandwidth, I don't think anyone noticed the site was any slower. Now, if someone decided to copy the entire site a few thousand times a month, then I would have an issue, because that would use up my bandwidth limit for the month. Outside of this, feel free to copy as long as you remain a respectable human in the process (don't plagiarize or use all of my bandwidth).


BCD

Talking about the main topic: it is very much possible to have a backup of the World Wide Web. To do so we would first need to create a clone of Earth, probably orbiting the Earth. The Moon is the best contender: build enough data centers and other infrastructure, and then start the download engines. Maybe in a few years the Moon would hold a backup of the Internet, plus the infrastructure to store tons more GBs every minute.


FirefoxRocks

Well, my main concern here is speed. I have more than enough hard disk space to store infinite copies of data. Right now there are enough hard drives to cover every square centimeter of Canada and half of the USA. And if I need more hard drives, I can get them for free and by the trillions, instantaneously. I don't even know how many hard drives are hooked up, each one of them holding 1,208,925,819,614,629,174,706,176 bytes of data.

The problem, as tansqrx mentioned, is that I do not have an "insane Internet backbone": I do not run an ISP network or anything close to that, and I'm just on a regular personal (not business) high-speed Internet connection (no, not even High Speed Extreme or High Speed Nitro). The max upload speed I have seen is around 40 kb/s, and the max download speed I have ever seen is 7 MBps, when downloading a Linux distribution from a server.

Network latency and server speed raise another issue on top of that.

As for processing power, I only have two computers, neither of them a server: one with a 3.00 GHz Pentium 4 processor and the other with a 2.66 GHz Intel Core 2 Duo.

Even if I do download the Internet, I will have to make backups of the hard drives in case any of them fail, and that is not a problem of capacity, but a problem of CPU power.