I’m currently trying to work with the latest database dump from the English Wikipedia. It’s massive (slightly under 10 GB uncompressed) and a pain to work with, especially since some of the behavior of PHP file functions with large files is not quite right. So, what I’ve been trying to do is break the XML dump down into sections (I’m losing a small handful of articles this way, fewer than 10) and then process those chunks into text files, which are then stored in a three-level directory tree by letters: the “Disgaea” article would be stored as /home/myusername/wiki/d/i/s/disgaea.txt.
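To make the naming scheme concrete, here is a minimal sketch of how a title could be mapped to its path. The function name article_path, the '_' catch-all bucket for anything outside a-z, and the '_' padding for titles shorter than three characters are all assumptions made for illustration; the only part taken from the example above is the "Disgaea" path.

```php
<?php
// Hypothetical helper: map an article title to its path in the tree.
// Assumptions: characters outside a-z fall into a catch-all '_' bucket,
// and titles shorter than three characters are padded with '_'.
function article_path(string $title, string $base = '/home/myusername/wiki'): string
{
    $name = strtolower($title);
    $parts = [];
    for ($i = 0; $i < 3; $i++) {
        $char = $name[$i] ?? '_';                        // pad short titles
        $parts[] = ($char >= 'a' && $char <= 'z') ? $char : '_';
    }
    return $base . '/' . implode('/', $parts) . '/' . $name . '.txt';
}

echo article_path('Disgaea'), "\n"; // /home/myusername/wiki/d/i/s/disgaea.txt
```

A real version would also have to deal with titles containing characters that are not safe in file names (a slash, for instance), which this sketch ignores.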
In order to create these directories, I used the following function:
function make_directory_tree ($level = 0, $parent = '', $maxlevel = 3) {
    global $CONFIG;
    foreach ($CONFIG['directories'] as $letter) {
        // $letter only exists inside the loop, so the progress message belongs here
        echo ("Creating directory $parent$letter\n");
        $status = mkdir($parent.$letter);
        if ($status === FALSE) {
            die('Could not create directory: ' . $parent . $letter . "\n");
        }
        // $level runs from 0 to $maxlevel inclusive, so this builds $maxlevel + 1 layers
        if ($level < $maxlevel) {
            make_directory_tree($level + 1, $parent . $letter.'/', $maxlevel);
        }
    }
}
Truthfully, there are better ways to do this than a recursive function, but I didn't think there'd be a big difference in performance, so I was surprised by how long PHP was taking to create these directories.
And, when it was all said and done, I had the nasty surprise of finding that there were four levels of directories, rather than three. I had forgotten that every call creates its own layer of directories in the foreach loop before it checks whether to recurse, so with $level running from 0 to 3 the function built four layers. Oops. I wanted 19683 directories at the bottom level, and accidentally created 531441. Oh well, set $maxlevel = 2, and try again, I suppose.
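Those two numbers are 27^3 = 19683 and 27^4 = 531441, so the list of directory names must have 27 entries. As a rough sketch of one way to sidestep the off-by-one, the version below builds the list of leaf paths with a plain loop over the depth and lets mkdir() create the parents through its recursive flag. The name make_tree, the $letters parameter (standing in for $CONFIG['directories']) and the scratch directory are assumptions for the example, and I have not measured whether it is any faster.

```php
<?php
// Sketch: create exactly $depth levels of directories under $base.
// $letters stands in for $CONFIG['directories'] (an assumption).
// Returns the number of leaf directories created.
function make_tree(string $base, array $letters, int $depth = 3): int
{
    // Build every leaf path level by level: '', then '/a', then '/a/a', ...
    $paths = [''];
    for ($level = 0; $level < $depth; $level++) {
        $next = [];
        foreach ($paths as $prefix) {
            foreach ($letters as $letter) {
                $next[] = $prefix . '/' . $letter;
            }
        }
        $paths = $next;
    }
    foreach ($paths as $path) {
        // The third argument tells mkdir() to create missing parents too.
        if (!mkdir($base . $path, 0777, true) && !is_dir($base . $path)) {
            throw new RuntimeException("Could not create directory: $base$path");
        }
    }
    return count($paths);
}

// Try it on a small alphabet in a scratch directory first.
$base = sys_get_temp_dir() . '/tree_' . uniqid();
$leaves = make_tree($base, ['a', 'b', 'c'], 3);
echo "$leaves leaf directories\n"; // 3^3 = 27
```

Running it on three letters before the full list is a cheap way to confirm the depth is right: 27 leaves means three levels, 81 would mean four.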