
Extracting HTML Tables To A Database. Efficient Method?


I have a few hundred HTML files which contain tables (HTML tables), and I need to get the tabular data from these files into a database. Using a PHP parser known as Simple HTML DOM Parser, I am able to assign the values to variables; the next step would be to insert them into the database. While testing the extraction (each file contains one to two hundred data sets), I found that the Apache server crashes while running the script. I reduced the number of data sets extracted in a single pass and changed the memory settings, but still no luck. (Note: Apache doesn't say it is a memory problem; it just crashes with no notice.)

Is Simple HTML DOM Parser right for my problem? Since I have all the files locally, is there any other method to efficiently get them into the database? I am already midway through the script, and the only problem is Apache crashing during execution, so is there any way I can improve the efficiency of the script and keep Apache from crashing?

Some info: max_execution_time = 60000, max_input_time = -1, memory_limit = 512M, Apache timeout 300 seconds. Are there any other values that would affect performance and need changing?

How does flush() work? I read about flush() in the PHP manual; it says it outputs the write buffers at whatever point in the script flush() is called. So does this act dynamically, giving output as and when the script processes certain parts, like displaying success or failure after extracting each table? I am asking because it might be helpful to implement in my script. In the user examples I saw sleep() and usleep() being used after calling flush(). Can someone explain how all these things work together?

Also, after going through the script, I noticed that the variables hold quite a large amount of data (around 100 to 200 data sets), so this may be the reason for the crash, a kind of memory leak. How can I solve this issue?
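For reference, this is roughly the shape of the loop I am running. It is only a minimal sketch: the records table and its column names are hypothetical, and it assumes simple_html_dom.php is in the include path.

    <?php
    // Minimal sketch of the extraction loop (not the exact script).
    // Assumes simple_html_dom.php from the Simple HTML DOM project and a
    // hypothetical database table `records` with columns col_a and col_b.
    include 'simple_html_dom.php';

    $pdo  = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
    $stmt = $pdo->prepare('INSERT INTO records (col_a, col_b) VALUES (?, ?)');

    foreach (glob('pages/*.html') as $file) {      // main loop over local files
        $html = file_get_html($file);              // parse one file into a DOM

        foreach ($html->find('table tr') as $row) {
            $cells = $row->find('td');
            if (count($cells) < 2) continue;       // skip header/short rows
            $stmt->execute(array(
                trim($cells[0]->plaintext),
                trim($cells[1]->plaintext),
            ));
        }
    }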



PHP can be run without Apache, as PHP isn't limited to just web development. Hearing that Apache crashed because of a PHP script sounds odd to me, though. I can't think of a better way to achieve what you are trying to do, but you could try running your script through the console (I'm not sure what operating system you are using). I've had PHP parse several text files of several megabytes each in one go and compress them using PHP's zlib extension, and it has never failed to complete the task.

How does flush() work?

flush(), as the manual says, sends any data that hasn't already been sent. It is mostly used in combination with echo and print, but it is not limited to them. flush() will, for example, send output to the browser even though the script is still finishing its job. This is useful when you want to be informed of progress before the script ends. However, flush() may not always work as expected.
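As a rough sketch of how these pieces fit together (the loop here just simulates work, and whether the browser actually shows output early depends on server and browser buffering):

    <?php
    // Report progress while a long job runs.
    for ($i = 1; $i <= 5; $i++) {
        echo "Processed table $i of 5<br/>\n";
        if (ob_get_level() > 0) {
            ob_flush();   // empty PHP's own output buffer first, if one is active
        }
        flush();          // then ask the web server to send what it has so far
        sleep(1);         // stand-in for the real parsing work
    }
    echo "Done.";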


I enabled 'info' level reporting in the error logs and found this:

zend_mm_heap corrupted
[Wed Feb 10 07:26:11 2010] [notice] Parent: child process exited with status 1 -- Restarting.

I have no idea what this is. A bit of searching leads to pages where people report the same thing in version 5.2.1; I am using 5.3.0 on Windows XP with XAMPP. I am not sure how to run these scripts through the CLI, but I will try, and I will get back on whether the issue is resolved or whether there are any other issues too.
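In case it helps anyone else searching: XAMPP ships a CLI binary alongside Apache, so running a script from the Windows command prompt should look something like this (the paths assume the stock XAMPP layout, and extract.php is just a placeholder name):

    C:\> C:\xampp\php\php.exe C:\xampp\htdocs\myproject\extract.php

Note that, depending on the setup, the CLI can read a different php.ini than Apache does, so memory_limit and friends may need to be checked again there.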


The issue has been resolved.

Apparently it was a sort of memory leak. This was the first time I had experienced such an issue. Moreover, PHP never gave any error or warning about the memory leak; the only signs of failure were Apache crashing and that zend_mm_heap notice in the Apache error logs.

I logged the memory usage before and after the foreach loop inside the main for loop. The memory consumption doubled on every pass through the loop, and the crash came at around 11 MB. This is weird, since I have the memory limit set to a much higher value and there was no error of any sort from PHP.

Here is a useful function found on the manual page for memory_get_usage():

function echo_memory_usage() {
    $mem_usage = memory_get_usage(true);

    if ($mem_usage < 1024)
        echo $mem_usage . " bytes";
    elseif ($mem_usage < 1048576)
        echo round($mem_usage / 1024, 2) . " kilobytes";
    else
        echo round($mem_usage / 1048576, 2) . " megabytes";

    echo "<br/>";
}


Then, after learning this and browsing the Simple HTML DOM forums: the $html variable used to get the contents needs to be cleared and unset at the end of each pass through the main for loop.

$html->clear();
unset($html);
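Put together, each pass of the main loop now ends like this (a sketch; the loop and variable names are mine, not the exact script):

    foreach (glob('pages/*.html') as $file) {
        $html = file_get_html($file);

        foreach ($html->find('table tr') as $row) {
            // ... extract the cells and insert them into the database ...
        }

        // Free the DOM tree before the next file: clear() breaks the parser's
        // internal circular references, then unset() drops our own reference.
        $html->clear();
        unset($html);

        echo_memory_usage();   // should now report a flat value from file to file
    }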


This solved the issue. The script now runs with a constant memory usage of 1.5 MB, even when parsing 300 records at a time.

So that is how I got through this issue, in case anyone else runs into something similar. Flushing output with flush() might also be worth considering in particular cases.

