xisto Community
snlildude87

Reading a Site's Source Code With PHP: is this possible?


Hey guys. I want to know if it's possible to read a site's source code using PHP. Would using $blah = file('https://www.google.de/?gfe_rd=cr&ei=7AkjVIatDsKH8QfNkoC4DQ&gws_rd=ssl'); work?

 

I want to make a script that will retrieve the comic of the day from http://www.gocomics.com/?ref=comics. Currently, I have another source, http://www.gocomics.com/, but that site doesn't have some of the ones that I want. The thing is, http://www.gocomics.com/ uses an easy file name for each comic of the day. Here's an example of the file name: ga050826.gif. That is for August 26th, 2005. Simple stuff. http://www.gocomics.com/?ref=comics, however, uses an obscure file name: barkeaterlake2005018313826.jpg, and that changes every day, so it's hard to find a pattern. Check their sites out if you want to see what I mean.
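For the easy case, the date-based name can be built with date(). This is only a minimal sketch: comic_filename() is a made-up helper, and the "ga" prefix and ".gif" extension are simply copied from the ga050826.gif example.

```php
<?php
// Hypothetical helper: builds a date-based file name such as ga050826.gif.
// The prefix and extension are assumptions taken from the example file name.
function comic_filename(string $prefix, int $timestamp): string
{
    // 'ymd' gives a two-digit year, month and day, e.g. 050826
    return $prefix . date('ymd', $timestamp) . '.gif';
}

// mktime(hour, minute, second, month, day, year)
echo comic_filename('ga', mktime(0, 0, 0, 8, 26, 2005)), "\n"; // ga050826.gif
echo comic_filename('ga', time()), "\n";                        // today's file name
```

The obscure names on the other source can't be built this way, which is why the page source has to be read instead.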

 

If you have another suggestion besides getting the source code, post it below.

 

Thanks!


Maybe you can try reading the source code, then running a regexp along the lines of:

preg_match_all ( '/src="[hH][tT][tT][pP]:\/\/.+?\.(jpg|gif)"/', $big_string_with_sourcecode_in_it, $results_array), which should put any URIs that reference either .jpgs or .gifs into the $results_array (preg_match on its own would stop at the first match). From there on, you can roll your own code to pull each URL from the array, filter out the ones you're not interested in (like the ads, layout graphics, and so on) and download the rest - maybe call wget from PHP?
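Here is a rough sketch of that idea, run against a made-up HTML string instead of a real page. The example.com URLs are placeholders, and the pattern is a slightly tightened variant ([^"] instead of a bare dot) rather than a tested recipe for any particular site.

```php
<?php
// Made-up HTML, standing in for the fetched page source
$big_string_with_sourcecode_in_it =
    '<img src="http://example.com/ads/banner.gif">' .
    '<img src="HTTP://example.com/comics/today.jpg">' .
    '<img src="/layout/logo.png">';

// # is used as the delimiter so the slashes in http:// need no escaping
$pattern = '#src="([hH][tT][tT][pP]://[^"]+?\.(?:jpg|gif))"#';

// preg_match_all collects every match; preg_match would stop at the first
preg_match_all($pattern, $big_string_with_sourcecode_in_it, $results_array);

// $results_array[1] holds the captured URLs (absolute .jpg/.gif only)
print_r($results_array[1]);
```

The relative /layout/logo.png is skipped because the pattern insists on an absolute http:// URL.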

If you don't know how regexps work or what they are, try reading http://forums.xisto.com/no_longer_exists/

 

In fact, read the whole thing - it's sure better than my paper hardcopy book, and it's free.


I don't fully understand your question. If you are trying to read the PHP source of a file on another server, that is impossible as far as I know. If you are just trying to get the HTML source, you can use the Filesystem functions in PHP (http://us3.php.net/manual/en/ref.filesystem.php).

Let me know if this helps and if you still need help getting the Filesystem functions to work.


I remember seeing a script called "abc news": after putting its source code on his website, the person would get filtered news updates from the ABC site. I think using the same script would be the most helpful here too.


There's a great PHP class called Snoopy on SourceForge. Find it here: https://sourceforge.net/projects/snoopy/

Here's an example of how to use it.

 

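<?php

// Load the class file that ships with Snoopy
include "Snoopy.class.php";

$snoopy = new Snoopy;

// submit() is only needed for posting form data, and both variables
// would have to be defined first:
// $snoopy->submit($submit_url, $submit_vars);

// Fetch the page; the source ends up in $snoopy->results
$snoopy->fetch("https://www.google.de/?gfe_rd=cr&ei=BwkjVKfAD8uH8QfckIGgCQ&gws_rd=ssl");

print $snoopy->results;

?>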


The file() function is capable of retrieving HTML documents, as are the file_get_contents(), include(), require(), and readfile() functions (although without a bit of tweaking, the last three automatically pass the retrieved data to the output buffer).
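As a small sketch of the file_get_contents() route with some error checking: fetch_source() below is a made-up helper name, and the demonstration reads a local temporary file so that it runs without network access. For remote URLs, the allow_url_fopen setting has to be enabled.

```php
<?php
// Hypothetical helper: returns the document as a string, or null on failure.
function fetch_source(string $location): ?string
{
    // Remote URLs only work with file_get_contents() when allow_url_fopen is on
    $is_remote = preg_match('#^https?://#i', $location) === 1;
    if ($is_remote && !ini_get('allow_url_fopen')) {
        return null;
    }
    // @ suppresses the warning; failure is signalled by a false return value
    $html = @file_get_contents($location);
    return $html === false ? null : $html;
}

// Demonstration on a local temporary file, so no network access is needed
$tmp = tempnam(sys_get_temp_dir(), 'src');
file_put_contents($tmp, '<html><body>hello</body></html>');
var_dump(fetch_source($tmp));              // the sample document
var_dump(fetch_source($tmp . '.missing')); // NULL, the file does not exist
unlink($tmp);
```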

 

file() returns an array, whereas file_get_contents() returns a single string - so file_get_contents() is more or less the equivalent of:

$html = file('document_path');
$html = implode('', $html); // file() keeps each line's trailing newline, so no separator is needed

If you are wanting to examine the source code, it would probably be better to have a single string value rather than an array.
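One detail worth checking when going from the array to a string: file() keeps the trailing newline on every element, so the pieces should be joined with an empty string. A quick sketch on a temporary sample file (the file contents are made up):

```php
<?php
// Write a small sample document to a temporary file
$path = tempnam(sys_get_temp_dir(), 'doc');
file_put_contents($path, "<html>\n<body>hi</body>\n</html>\n");

$lines  = file($path);               // array, one element per line
$string = file_get_contents($path);  // one single string

// file() keeps each line's trailing newline, so join with an empty string
var_dump(count($lines));                    // int(3)
var_dump(implode('', $lines) === $string);  // bool(true)

unlink($path);
```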

 

littleweseth, whilst regular expressions are a good idea, your expression will not work - the image is stored locally, and is only referenced by a relative path. I would recommend narrowing it down to images that exist within the '/comics/' directory - I don't know if all 'comic of the day' images are stored there, but a quick look at the site shows that is where the current image is located, and it is the only image displayed on that page located in that directory. If that is the case, try this:

 

$html = @file_get_contents('http://comics.com/');
$pattern = '[<(img|IMG).*?(src|SRC)=\"/comics/(.*?).(jpg|JPG|gif|GIF|png|PNG)]';
preg_match($pattern,$html,$matches);
$image_path = 'http://comics.com/comics/'.$matches[3].'.'.$matches[4];
$image = file_get_contents($image_path);

Keep in mind that it may not be legal for you to leech content from that site. I would recommend you check first.


my bad - i just hacked out a regexp to look for image urls, without actually going to look at the site source. *blush*

btw, instead of using the alternation operator, a quicker and more elegant way of doing (PNG|png|JPG|jpg..) would be to just use the case-insensitive modifier, 'i'.

$html = @file_get_contents('http://comics.com/');
$pattern = '/<(img).*?(src|SRC)=\"\/comics\/(.*?).(jpg|gif|png)/i';
preg_match($pattern,$html,$matches);
$image_path = 'http://comics.com/comics/'.$matches[3].'.'.$matches[4];
$image = file_get_contents($image_path);


By the way, i think you need to escape the backslashes, but i'm not sure. Also, don't you need /'s at the beginning and end of the pattern?
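To see what the case-insensitive modifier changes compared with spelling out the alternatives, here is a quick sketch on some made-up file names:

```php
<?php
// Made-up file names in mixed case
$names = ['today.JPG', 'today.Gif', 'today.png', 'today.txt'];

foreach ($names as $name) {
    // With the i modifier, (jpg|gif|png) also matches JPG, Gif, PnG, ...
    $with_i = preg_match('/\.(jpg|gif|png)$/i', $name);
    // Explicit alternation only matches the spellings that are listed
    $without_i = preg_match('/\.(jpg|JPG|gif|GIF|png|PNG)$/', $name);
    printf("%-10s i: %d  alternation: %d\n", $name, $with_i, $without_i);
}
```

The mixed-case today.Gif is only caught by the version with the i modifier, because "Gif" is not one of the listed spellings.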


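No, you don't need /s specifically. A PCRE pattern in PHP has to be wrapped in delimiters, but almost any non-alphanumeric character other than a backslash will do, and bracket pairs such as the [ and ] I used are allowed too. Pattern modifiers go after the closing delimiter, and you only need them if you require certain operators to act in a slightly different way.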

 

The /i option may work, but it is most likely slower - it treats the whole pattern as case insensitive, and matching any combination of upper and lower case characters to form a word is much more resource intensive than binary/explicit matching.

 

You do need to escape backslashes - but not forward slashes, as long as / is not your delimiter. Backslashes only need to be escaped because they themselves are used for 'escaping' characters.
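A small sketch of how the delimiter choice affects escaping, using a made-up img tag. All three patterns are meant to be equivalent; only the delimiter differs.

```php
<?php
$html = '<img src="/comics/example.gif">'; // made-up markup

// Three spellings of the same pattern with different delimiters
$patterns = [
    '/src="\/comics\/(.*?)"/',  // / as delimiter: inner slashes must be escaped
    '#src="/comics/(.*?)"#',    // # as delimiter: slashes are ordinary characters
    '[src="/comics/(.*?)"]',    // bracket pair as delimiter
];

foreach ($patterns as $pattern) {
    preg_match($pattern, $html, $matches);
    echo $pattern, ' => ', $matches[1], "\n";
}

// A literal backslash must be escaped for PHP and again for the regex engine
var_dump(preg_match('/\\\\/', 'C:\\dir')); // int(1)
```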

 

I made a mistake in my pattern. It should be:

..."/comics/(.*?)\.(jpg|...
Instead of what it currently is ('.', a single dot, matches almost any character - with the exception of linebreaks, UNLESS /s is used - and the character we are trying to match here is just a normal period/fullstop/dot).
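The difference between the unescaped and the escaped dot can be seen on two made-up tags, one of which has no dot before the extension. The # delimiter is my own choice here, just to keep the slashes readable.

```php
<?php
// Made-up markup: the second file name has "_" where the dot should be
$samples = [
    '<img src="/comics/strip.gif">',
    '<img src="/comics/strip_gif">',
];

$loose  = '#src="/comics/(.*?).(jpg|gif|png)#';   // "." matches any character
$strict = '#src="/comics/(.*?)\.(jpg|gif|png)#';  // "\." matches only a real dot

foreach ($samples as $html) {
    printf(
        "%s  loose: %d  strict: %d\n",
        $html,
        preg_match($loose, $html),
        preg_match($strict, $html)
    );
}
```

The loose pattern accepts strip_gif as well, while the strict one only matches when there is a real period in front of the extension.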


*thwack*

Me needs to go back to regexp school - and possibly get a PHP book that doesn't suxor (I was looking at PHP in 24 Hours (SAMS publishing) because I couldn't be bothered opening my local copy of hudzilla when I had a hardcopy book just sitting there. In that book, they always use /'s.)

[As a matter of curiosity, I didn't learn PCREs by looking at the Perl docs, or the PHP docs, or even any websites - I learned from the manual of BBEdit, that beacon of Mac l33tness. It covers pretty much everything, right down to conditionals and lookaheads, so I tend to think in terms of those. I find it impossible to work with POSIX regexps that don't even understand .+?, à la the Apache mod_rewrite engine. Dammit.]

In any case, there you go: you learn something every day, folks.
