vujsa, posted April 13, 2005

So I'm trying to learn how to pull useful content from a web page. Here's what I've got so far:

<?php
$filename = "http://www.forum500.com/rg-erdr.php?_rpo=t";
$html = file_get_contents($filename);

/* **************************************************************** */
/*                                                                  */
/* Got this code @ http://php.net/                                  */
/*                                                                  */
/* **************************************************************** */

// $html should contain an HTML document.
// This will remove HTML tags, javascript sections
// and white space. It will also convert some
// common HTML entities to their text equivalent.
$search = array ('@<script[^>]*?>.*?</script>@si', // Strip out javascript
                 '@<[\/\!]*?[^<>]*?>@si',          // Strip out HTML tags
                 '@([\r\n])[\s]+@',                // Strip out white space
                 '@&(quot|#34);@i',                // Replace HTML entities
                 '@&(amp|#38);@i',
                 '@&(lt|#60);@i',
                 '@&(gt|#62);@i',
                 '@&(nbsp|#160);@i',
                 '@&(iexcl|#161);@i',
                 '@&(cent|#162);@i',
                 '@&(pound|#163);@i',
                 '@&(copy|#169);@i',
                 '@&#(\d+);@e');                   // evaluate as php

$replace = array ('', '', '\1', '"', '&', '<', '>', ' ',
                  chr(161), chr(162), chr(163), chr(169), 'chr(\1)');

$text = preg_replace($search, $replace, $html);
echo $text;
?>

The only line that I don't completely understand is:

'@<script[^>]*?>.*?</script>@si',

I know what it does, I just don't know how.

Here is the output:

Test Forum - Index Welcome, Guest. Please login or register. 1 Hour 1 Day 1 Week 1 Month Forever Login with username, password and session length Shoutbox guest123 : Hello SpaceWaste : meh though, Im off to bed, night man SpaceWaste : lol damn bugs....ah well. Its ok, this is just fine SpaceWaste : damn, youre right View All 2 Posts in 1 Topics by 4 Members - Latest Member: marc April 12, 2005, 10:30:12 PM Test Forum News SMF - Just Installed General Category Posts Topics Last post General Discussion Feel free to talk about anything and everything in this board. 2 1 April 06, 2005, 08:50:39 PM Re: Welcome to SMF! by Dooga Test Forum - Info Center Forum Stats Total Topics: 1Total Posts: 2 Latest Post: "Re: Welcome to SMF!" (April 06, 2005, 08:50:39 PM) View the 10 most recent posts on the forum. [More Stats] Total Members: 4 Latest Member: marc Users Online 1 Guest, 0 Users Most users online today: 1 Most users online ever: 4 ( April 10, 2005, 08:40:22 PM ) Test Forum | Powered by SMF 1.0.2. © 2001-2005, Lewis Media. All Rights Reserved. Helios design by Bloc

So let's say I wanted to get the name of the Latest Member ("marc" in the output above). How do I get just that information?

Any help here would be great.

vujsa
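A note for anyone running the snippet above on a current PHP: the last pattern in the array relies on the e modifier, which was removed in PHP 7.0. Below is a minimal sketch of the same clean-up using preg_replace_callback() instead. The function name html_to_text() is made up for illustration, and the sketch decodes entities to UTF-8 rather than to single bytes via chr(), which is an assumption about what output encoding is wanted.

```php
<?php
// Sketch only: html_to_text() is a hypothetical helper, not part of the original post.
function html_to_text(string $html): string
{
    // Remove <script> blocks first, then any remaining tags.
    $text = preg_replace('@<script[^>]*?>.*?</script>@si', '', $html);
    $text = preg_replace('@<[\/\!]*?[^<>]*?>@si', '', $text);

    // Collapse a line break followed by whitespace into the line break alone.
    $text = preg_replace('@([\r\n])[\s]+@', '\1', $text);

    // Decode numeric entities with a callback instead of the removed e modifier.
    $text = preg_replace_callback(
        '@&#(\d+);@',
        function (array $m): string {
            return html_entity_decode($m[0], ENT_QUOTES, 'UTF-8');
        },
        $text
    );

    // Decode the named entities (&amp;, &lt;, &quot;, ...).
    return html_entity_decode($text, ENT_QUOTES, 'UTF-8');
}

$sample = '<p>Latest Member: <b>marc</b> &amp; co &#169; 2005</p><script>var x = 1;</script>';
echo html_to_text($sample), "\n";
```

The callback is kept only to show the direct replacement for e; html_entity_decode() on its own already handles numeric entities, so the callback step could be dropped.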
iGuest, posted April 13, 2005

Quoting vujsa: The only line that I don't completely understand is: '@<script[^>]*?>.*?</script>@si', I know what it does, I just don't know how. [...] So let's say I wanted to get the name of the Latest Member. So, how do I get just that information?

Some of this I don't understand, as I haven't learned much PCRE syntax, but it has a lot in common with Perl's and grep's regular expressions. I don't understand the @ bit, but it seems to mark the start and end of the regex, and si are modifiers: i is case-insensitive; s, I'm not sure.

OK, <script[^>]*?> finds the <script language="javascript"> part. <script matches that exact string. [^>] is a negated character class, meaning any character that is not >, and *? repeats it zero or more times while taking as few characters as possible. The > after it marks the end of the opening tag.

.*? means any character, zero or more times, again as few as possible, so the script body may be empty. </script> matches that exact string. Put together, it will match anything of the form <script blah blah>anything here</script>, but the exact strings it asks for, <script, > and </script>, must be present.

Sorry, I have to head off for a bit, but I'll come back and explain anything else I've left out.

Cheers,
MC
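To see why the lazy .*? matters in that pattern, here is a small sketch comparing it with a greedy .* on made-up input (the $html string is invented for the example, not taken from the forum page):

```php
<?php
// Two script blocks with ordinary text between them.
$html = "<SCRIPT type=\"text/javascript\">\nvar a = 1;\n</SCRIPT>keep me<script>var b = 2;</script>";

// Lazy .*? stops at the first </script> it can reach.
$lazy = preg_replace('@<script[^>]*?>.*?</script>@si', '', $html);

// Greedy .* runs on to the last </script>, swallowing "keep me" as well.
$greedy = preg_replace('@<script[^>]*?>.*</script>@si', '', $html);

var_dump($lazy);   // string(7) "keep me"
var_dump($greedy); // string(0) ""
```

The i modifier is what lets <SCRIPT> in capitals match, and the s modifier is what lets the dot cross the line breaks inside the first block.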
vujsa, posted April 13, 2005

Well, I've been playing around with this a bit now and am beginning to understand some of it.

The "@" is used as a delimiter. Nearly anything will work, I guess, except the escape character "\" (letters, digits and whitespace are not allowed either). Pipe "|" or slash "/" could just as easily have been used.

Eagerly awaiting more of your vast knowledge!

vujsa
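A quick sketch of the delimiter point, using an invented one-line input: the same tag-stripping pattern is written with three different delimiters and gives the same result each time. The only practical difference is which character has to be escaped inside the pattern.

```php
<?php
$html = '<b>bold</b> text';

// The same pattern with three different delimiters.
$patterns = [
    '@</?b>@i',   // "/" needs no escaping when "@" is the delimiter
    '#</?b>#i',   // same with "#"
    '/<\/?b>/i',  // with "/" as delimiter, the "/" in the pattern must be escaped
];

$results = [];
foreach ($patterns as $p) {
    $results[] = preg_replace($p, '', $html);
}

print_r($results); // three copies of "bold text"
```

This is probably why the php.net snippet picked @: HTML is full of slashes, so / would need escaping everywhere.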
iGuest, posted April 13, 2005

What do you need to understand?

[^0-9]*? matches anything that's not a digit. *? is the greedy inverter: it makes * lazy, so it matches as few characters as possible and only takes more when whatever follows it in the pattern needs them in order to match. E.g. for the string 'Hello, World!' you could use [^0-9]*?! and it'd match the whole string, because the pattern ends in a definite match, the ! character. If you left the ! out, [^0-9]*? on its own would match only the empty string. If you want to match runs of anything that is not a digit, leave out the greedy inverter:

[^0-9]*

E.g. in the string Hey1234You it matches Hey and You, leaving out the digits. This is greedy because each match keeps going until it hits a digit or the end of the input.

The s modifier makes . match newline characters as well.

The greedy inverter stops greediness. E.g. .* will match as much of anything and everything as it can, except newline characters (unless the s modifier is set). .*? also matches anything, but as little as possible, so it needs a definite match after it to be useful.

If you could explain what you need more understanding with, I'd be glad to help.

Cheers,
MC
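The examples from this post can be checked directly with preg_match(). This is just a sketch using the two sample strings mentioned above; the variable names are arbitrary.

```php
<?php
// [^0-9]*?! : the lazy run has to grow until the literal "!" can match.
preg_match('/[^0-9]*?!/', 'Hello, World!', $m1);
// $m1[0] is "Hello, World!"

// [^0-9]*? alone: lazy with nothing after it, so the empty string is enough.
preg_match('/[^0-9]*?/', 'Hello, World!', $m2);
// $m2[0] is ""

// [^0-9]* : greedy, so the first match takes the whole leading run of non-digits.
preg_match('/[^0-9]*/', 'Hey1234You', $m3);
// $m3[0] is "Hey"

// [^0-9]+ with preg_match_all collects every non-empty run of non-digits.
preg_match_all('/[^0-9]+/', 'Hey1234You', $m4);
// $m4[0] is ["Hey", "You"]

print_r([$m1[0], $m2[0], $m3[0], $m4[0]]);
```

The last call uses + rather than * so that the empty matches * would produce between the digits do not clutter the result.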
iGuest, posted April 13, 2005

OK, well, I missed your question, but matching the latest member bit is quite easy. We know for a fact that this is how it goes:

Latest Member: User Month

So from this we could do:

Latest\sMember:\s.+?(?=\s)

Notice any problems with this?

Well, \s matches a whitespace character; basically, we're assuming that a username does not contain a space. Now, I'm not sure what characters a username can or can't contain without viewing the source. If you know for a fact that spaces can't be used inside a username, then this is quite an acceptable approach. I'm not using the s modifier; I'm matching the spaces myself with \s. What I'm assuming is that the username is separated from what follows by a space; if usernames can contain spaces, this expression is useless.

.+? means there must be an occurrence here, while not being greedy, i.e. it must match at least one character. * means 0 or many, + means 1 or many.

(?=\s) means look ahead to see if the next character is a space; this is so we don't include the space in the match. (?=...) is a lookahead.

* is the same as {0,}
+ is the same as {1,}

We could use the modifiers to our advantage. There are many methods, even using PHP's date function and not matching the month as well.

I'll leave it up to you to decide how this should be done; if you knew which characters aren't allowed in the username, you would have a better chance of solving this problem.

Cheers,
MC
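Here is a small sketch of that pattern run against a fragment of the forum output from the first post. The second call is a variant, not MC's pattern: it uses a capture group so that only the name comes back, under the same assumption that usernames contain no whitespace.

```php
<?php
$text = 'View All 2 Posts in 1 Topics by 4 Members - Latest Member: marc '
      . 'April 12, 2005, 10:30:12 PM Test Forum';

// MC's pattern: the match includes the "Latest Member: " label,
// and the lookahead keeps the trailing space out of the match.
preg_match('/Latest\sMember:\s.+?(?=\s)/', $text, $whole);
// $whole[0] is "Latest Member: marc"

// Variant with a capture group; \S+ assumes usernames have no whitespace.
preg_match('/Latest\sMember:\s(\S+)/', $text, $named);
// $named[1] is "marc"

echo $whole[0], "\n", $named[1], "\n";
```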
vujsa, posted April 14, 2005

Wow, MC, again a lot more information than I thought I needed to know. That's what's nice about your posts: they answer the next three questions too. The PHP site has a very nice explanation of the differences between PHP and Perl regex. The problem is that I don't know Perl regex either. As you mentioned, you haven't done much with PCRE, so I really appreciate you going ahead and helping anyhow. I tend to stay away from stuff if I have to research it before I can answer the question.

The explanation you gave for '@<script[^>]*?>.*?</script>@si' was very helpful. Being the most complex statement in the array, it gives the most opportunity to learn from. Now that I know how it works, this statement will be my starting point for learning regex. [^0-9]*?! really clarified the greedy usage for me.

Here is what I came up with before you got back.

<?php
$filename = "http://forums.xisto.com/no_longer_exists/";
$html = file_get_contents($filename);

// $html should contain an HTML document.
// This will remove HTML tags, javascript sections
// and white space. It will also convert some
// common HTML entities to their text equivalent.
$search = array ('@<script[^>]*?>.*?</script>@si', // Strip out javascript
                 '@<[\/\!]*?[^<>]*?>@si',          // Strip out HTML tags
                 '@([\r\n])[\s]+@',                // Strip out white space
                 '@&(quot|#34);@i',                // Replace HTML entities
                 '@&(amp|#38);@i',
                 '@&(lt|#60);@i',
                 '@&(gt|#62);@i',
                 '@&(nbsp|#160);@i',
                 '@&(iexcl|#161);@i',
                 '@&(cent|#162);@i',
                 '@&(pound|#163);@i',
                 '@&(copy|#169);@i',
                 '@&#(\d+);@e',                    // evaluate as php
                 );

$replace = array ('', '', '\1', '"', '&', '<', '>', ' ',
                  chr(161), chr(162), chr(163), chr(169), 'chr(\1)',
                  );

// Remove everything up to and including "Latest Member: ".
$search2 = array ('@.*?Latest Member:\s@si');
$replace2 = array ('');

$text = preg_replace($search, $replace, $html);
$text2 = preg_replace($search2, $replace2, $text);

echo "<html><head><title>Content Capture</title></head><body>
<b>Test String:</b><br />\n $text <br /><br />
<b>Extracted Data:</b><br />\n \n\n $text2 <br /><br />
</body></html>";
?>

With the following output:

Test String:
Test Forum - Index Welcome, Guest. Please login or register. 1 Hour 1 Day 1 Week 1 Month Forever Login with username, password and session length Shoutbox guest123 : Hello SpaceWaste : meh though, Im off to bed, night man SpaceWaste : lol damn bugs....ah well. Its ok, this is just fine SpaceWaste : damn, youre right View All 2 Posts in 1 Topics by 4 Members - Latest Member: marc April 13, 2005, 10:34:58 PM Test Forum News SMF - Just Installed General Category Posts Topics Last post General Discussion Feel free to talk about anything and everything in this board. 2 1 April 06, 2005, 08:50:39 PM Re: Welcome to SMF! by Dooga Test Forum - Info Center Forum Stats Total Topics: 1Total Posts: 2 Latest Post: "Re: Welcome to SMF!" (April 06, 2005, 08:50:39 PM) View the 10 most recent posts on the forum. [More Stats] Total Members: 4 Latest Member: marc Users Online 1 Guest, 0 Users Most users online today: 1 Most users online ever: 4 ( April 10, 2005, 08:40:22 PM ) Test Forum | Powered by SMF 1.0.2. © 2001-2005, Lewis Media. All Rights Reserved. Helios design by Bloc

Extracted Data:
marc

Quoting MC: OK well I missed your question, but to match the latest member bit is quite easy, we know for a fact that this is how it goes. [...]

Actually, the general explanation is quite helpful because, ultimately, my example is just a way for me to learn how to do more complex things with regex. Come to think of it, if I could just borrow your brain for a few days, it would really save on the typing.

Hey, thanks for all the help.

vujsa

By the way, the "s" modifier at the end of the regex, as explained by the PHP web site:

s (PCRE_DOTALL)
If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded. This modifier is equivalent to Perl's /s modifier. A negative class such as [^a] always matches a newline character, independent of the setting of this modifier.
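The PCRE_DOTALL description quoted above can be tried out with three short calls. This is only a sketch on an invented three-line string; preg_match() returns 1 when the pattern matches and 0 when it does not.

```php
<?php
$text = "start\nmiddle\nend";

// Without s, "." stops at a newline, so start.*end cannot span the lines.
$withoutS = preg_match('/start.*end/', $text);   // 0

// With s (PCRE_DOTALL), "." also matches "\n".
$withS = preg_match('/start.*end/s', $text);     // 1

// A negated class matches newlines regardless of the s modifier.
$negated = preg_match('/start[^x]*end/', $text); // 1

var_dump($withoutS, $withS, $negated);
```

This is why the script-stripping pattern carries the s modifier: script blocks usually span several lines.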
vizskywalker, posted April 14, 2005

I followed most of the explanation. But I'm still unclear as to the purpose of the "." in "@<script[^>]*?>.*?</script>@si". Also, what exactly does preg_replace() do? Thanks.

~Viz
vujsa, posted April 14, 2005 (edited April 14, 2005)

Quoting vizskywalker: I followed most of the explanation. But I'm still unclear as to the purpose of the "." in "@<script[^>]*?>.*?</script>@si". Also, what exactly does preg_replace() do? Thanks. ~Viz

The "." matches any single character (including newlines when the s modifier is set). So here is an example:

$string = "Hello world! Welcome to my example. <br />\n Please feel free to increase my reputation points! <br />\n Thank you.";
$string2 = preg_replace('@.*?\sThank you.@si', 'Hello world! How cool am I!', $string);

echo "<html><head><title>Content Capture</title></head><body>
<b>Test String:</b><br />\n $string <br /><br />
<b>New Data:</b><br />\n \n\n $string2 <br /><br />
</body></html>";

Returns:

Test String:
Hello world! Welcome to my example.
Please feel free to increase my reputation points!
Thank you.

New Data:
Hello world! How cool am I!

So '@.*?\sThank you.@si' means match all characters up to and including one whitespace character followed by "Thank you." The ".*?" basically acts like a wild card.

So for '@<script[^>]*?>.*?</script>@si', the "." means match all characters in between the <script ...> and </script> tags.

preg_replace() is the PCRE (Perl-Compatible Regular Expressions) replace function. It works like ereg_replace() but is a little more complex, more flexible, and has a slightly different syntax. Basically, you can pass an array of patterns that, when matched, will be replaced by the corresponding values in an array of replacement values. Instead of typing ereg_replace() over and over, just use the array method with preg_replace().

For more information, try the PHP web site's section about PCRE.

Hope this helps.

vujsa
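The array form of preg_replace() described above can be sketched as follows. The input string and the three pattern/replacement pairs are made up for the example; each pattern is paired with the replacement at the same index, and the pairs are applied one after another.

```php
<?php
$input = 'Tom &amp; Jerry &lt;3 cheese';

// Pattern i is replaced by replacement i.
$search  = ['@&amp;@i', '@&lt;@i', '@cheese@i'];
$replace = ['&',        '<',       'pizza'];

$output = preg_replace($search, $replace, $input);

echo $output, "\n"; // Tom & Jerry <3 pizza
```

Because the pairs run in order, the output of one replacement is the input to the next, which is worth keeping in mind when one replacement could produce text that a later pattern matches.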
overture, posted April 14, 2005

I'd just like to thank you for starting this topic. It was very convenient for me, and it has helped me as much as it has helped you, vujsa. An extra thanks to M^E for the extensive explanations. Thanks.

vizskywalker, posted April 14, 2005

Yes, it is very helpful to have this post. And I'm sure overture meant MC, not M^E.

~Viz

overture, posted April 15, 2005

lol yes I did, Viz
^zer0dyer-, posted May 2, 2005

First of all, if you are just trying to escape HTML characters, use:

- htmlentities()
OR
- htmlspecialchars()

which are built-in PHP functions. If you are just trying to learn regexp, more power to you!

In PCRE regular expressions, there are several types of delimiters you can use for your patterns:

<?php
# This finds all opening tags
$pattern = "@<\w+?[^>]*>@is"; // @ is the delimiter in pattern.
?>

That was a very eloquent pattern, mastercomputers. I just recently started using lookaheads/lookbehinds, and have had fun toying around with them.

Also, if you want to print out mastercomputers' result with preg_match, try the following for some good practice:

<?php
$file = "path/to/file";
$handle = @file_get_contents($file) or die("File not acquired!\n</body>\n</html>"); // This is safer!
$pattern = "#Latest\sMember:\s.+?(?=\s)#i";

# Finds the first match
if ( preg_match($pattern, $handle, $matches) )
{
    # print_r() needs true as its second argument to return a string
    print "<pre>\n".print_r($matches, true)."\n</pre>\n";
}
else
{
    print "<p>Nothing found =(</p>";
}

# Finds all matches
if ( preg_match_all($pattern, $handle, $matches) )
{
    # $matches is now a 2-dimensional array
    print "<pre>\n".print_r($matches, true)."\n</pre>\n";
}
else
{
    print "<p>Nothing found =(</p>";
}
?>

Peace,
+CurTis-
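To make the difference between the two result arrays concrete, here is a sketch on an inline string instead of a fetched file. The sample text is pieced together from the forum output earlier in the thread, and the pattern is a variant with a capture group (assuming usernames contain no whitespace), so that both the full match and the name show up.

```php
<?php
// Sample text with two occurrences, shaped like the forum output in this thread.
$text = 'by 4 Members - Latest Member: marc April 12, 2005 '
      . 'Total Members: 4 Latest Member: marc Users Online';

$pattern = '#Latest\sMember:\s(\S+)#i';

// preg_match stops at the first hit; $first is a flat array.
preg_match($pattern, $text, $first);
// $first[0] is the full match, $first[1] is capture group 1.

// preg_match_all returns the number of hits; $all is two-dimensional.
$count = preg_match_all($pattern, $text, $all);
// $all[0] lists the full matches, $all[1] lists group 1 for each match.

// print_r() returns a string only when its second argument is true;
// htmlspecialchars() keeps the dump safe to embed in a page.
$dump = "<pre>\n" . htmlspecialchars(print_r($all, true)) . "</pre>\n";
echo $dump;
```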