I do a lot of ETL work for my job. One of our vendors recently gave us a deal on the 2.0 version of their data, part of which is several million images. On the old system it was possible to download several large zip files which contained all of the images, but the current system demands that, at least for the initial load, we walk the vendor's FTP site and download all of the files. (Which, I have to say, Xceed's components made easy enough to do.)
Anyway, the images are split into directories on the vendor site based on size, so we can download a 75, 170, 250 or 400 pixel image and display whichever size is most appropriate. A week after downloading, we discovered that a large number of images in the 75 pixel directory were actually 170 pixel images. We confirmed this on the vendor's FTP site and brought it to the vendor's attention, and they got the problem corrected. But the correction involved too many files to download through the normal daily updates, so they recommended re-downloading the files.
Which is all well and good, but I didn't want to download all of the files, just the ones that were the wrong size. So I decided to get a list of the images from the database and check each one to see that it existed in the system and that it was the right size. Simple enough plan:
- Check the image size.
- If the image size is wrong, delete the file.
- If the file doesn't exist, download and check image size.
- If image size is still wrong, write to log file.
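The steps above can be sketched roughly as follows. This is only an illustration of the plan, not the actual program: the class name ImageVerifier and the measure, download and log delegates are hypothetical stand-ins for the size checker, the FTP download and the log file writer.

```csharp
using System;
using System.IO;

public static class ImageVerifier
{
    // Returns true when the file ends up on disk with the expected size.
    // measure, download and log are hypothetical helpers supplied by the caller.
    public static bool Verify(
        string path,
        float expectedSize,
        Func<string, float> measure,
        Action<string> download,
        Action<string> log)
    {
        // Check the image size; if it is wrong, delete the file.
        // Pixel sizes are whole numbers, so an exact comparison is safe here.
        if (File.Exists(path) && measure(path) != expectedSize)
        {
            File.Delete(path);
        }

        // If the file doesn't exist (never fetched, or just deleted),
        // download it and check the size again.
        if (!File.Exists(path))
        {
            download(path);

            // Still wrong (or still missing) after a fresh download: log it.
            if (!File.Exists(path) || measure(path) != expectedSize)
            {
                log(path + ": expected " + expectedSize + " pixels");
                return false;
            }
        }
        return true;
    }
}
```

Passing the helpers in as delegates keeps the control flow separate from the FTP and imaging code, which also makes it easy to try the logic against dummy files.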
Only thing was, I didn't have a handy image size checker. Fortunately, Google had links to one and the original code looked a lot like this:
    private float MaxPixelSize(string Filename)
    {
        System.Drawing.Image theImage = System.Drawing.Image.FromFile(Filename);
        float UploadedImageWidth = theImage.PhysicalDimension.Width;
        float UploadedImageHeight = theImage.PhysicalDimension.Height;
        return (UploadedImageHeight > UploadedImageWidth ? UploadedImageHeight : UploadedImageWidth);
    }
Happy as a lark when a quick check worked, I quickly kicked off 6 copies of the program and almost as quickly saw them stop running due to memory issues. Here is the first Evil, and to understand it, let's look at what is happening in this function.
First, it creates an object in memory called theImage. Then it opens the file Filename and loads it into the object. It then determines the Height and Width and returns the max of those two values. And it's done, leaving the object to be disposed of by .NET's garbage collector, which runs on its own schedule: "most collection cycles might look only at a few generations, while occasionally a mark-and-sweep is performed, and even more rarely a full copying is performed to combat fragmentation." In many cases, this wouldn't be a problem, but I was processing a hundred images a minute, the collector wasn't keeping up with everything being stuffed into memory, and the system quickly filled up.
Fortunately, the fix for this was easy. I turned theImage into a global object and that cut down on my memory issues. Then my program started failing for all-new reasons: I couldn't delete a file that was the wrong size because someone had a lock on it. That seemed reasonable enough, given that these images are used by a web server, so I modified my file delete code to go into a threaded loop while the file existed, pausing a half second between attempts to delete the file. That seemed to work better, though I sometimes had to wait as long as a minute for the file to be unlocked. Which got me wondering and examining and, lo and behold, it was me all along.
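For the curious, a delete-and-retry loop of the kind described above might look something like this. It is a minimal sketch rather than the code I actually ran: the FileRetry class, the DeleteWithRetry name and the timeout parameter are made up for illustration, and it leaves out the threading.

```csharp
using System;
using System.IO;
using System.Threading;

public static class FileRetry
{
    // Keeps trying to delete the file until it is gone or the timeout passes.
    // Returns true if the file no longer exists when the method returns.
    public static bool DeleteWithRetry(string path, TimeSpan pause, TimeSpan timeout)
    {
        DateTime deadline = DateTime.UtcNow + timeout;
        while (File.Exists(path))
        {
            try
            {
                File.Delete(path);
                return true;
            }
            catch (IOException)
            {
                // Someone still has the file open; fall through and retry.
            }
            catch (UnauthorizedAccessException)
            {
                // Access denied for now; treated the same as a lock here.
            }

            if (DateTime.UtcNow >= deadline)
            {
                return false;
            }
            Thread.Sleep(pause);
        }
        return true;
    }
}
```

A call such as FileRetry.DeleteWithRetry(path, TimeSpan.FromMilliseconds(500), TimeSpan.FromMinutes(2)) would match the half-second pause mentioned above; the two-minute limit is an arbitrary choice.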
You see, garbage collection doesn't just take care of the memory used by the objects created, it also deals with things like file handles. And using the FromFile method to load the image doesn't close the file. It leaves it for the collector to deal with when it feels like it. And there's no FileClose on the object, I guess because you don't really need one since the collector will handle it. But you don't have to use FromFile to load the image; you can just as easily use streams. The advantage here is that I can open and close the stream without worrying about garbage collection. Here's the function in its final format:
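    private float MaxPixelSize(string Filename)
    {
        float retVal = 0;
        try
        {
            // Open read-only; the using block closes the file handle even if loading throws.
            using (FileStream fs = File.Open(Filename, FileMode.Open, FileAccess.Read))
            {
                // theImage is the global object; the stream stays open while the image is in use.
                theImage = System.Drawing.Image.FromStream(fs);
                float UploadedImageWidth = theImage.PhysicalDimension.Width;
                float UploadedImageHeight = theImage.PhysicalDimension.Height;
                retVal = (UploadedImageHeight > UploadedImageWidth ? UploadedImageHeight : UploadedImageWidth);
                // Release the image's memory now instead of waiting for the collector.
                theImage.Dispose();
                theImage = null;
            }
        }
        catch
        {
            // Missing or unreadable image: fall through and report a size of 0.
        }
        return retVal;
    }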
This reuses a global variable and opens and closes the stream itself, so I don't step on myself, and the 6 instances can now verify several million files within the course of a couple of hours, even when they encounter an image of the wrong size.
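For what it's worth, System.Drawing.Image also implements IDisposable, so the same idea can be written without a global at all by wrapping both the stream and the image in using blocks. The following is a sketch of that variant, not the code that ran against the vendor files. The ImageSize class name is made up, and it assumes a return value of 0 is an acceptable signal for a missing or unreadable image. Keeping the image inside the stream's using block also follows the documented requirement that the stream stay open for as long as the image is in use.

```csharp
using System;
using System.Drawing;
using System.IO;

public static class ImageSize
{
    // Returns the larger of the image's width and height,
    // or 0 if the file is missing or cannot be read as an image.
    public static float MaxPixelSize(string filename)
    {
        try
        {
            // Both objects are released as soon as the block exits,
            // so the file can be deleted right away if the size is wrong.
            using (FileStream fs = new FileStream(filename, FileMode.Open, FileAccess.Read, FileShare.Read))
            using (Image image = Image.FromStream(fs))
            {
                SizeF size = image.PhysicalDimension;
                return Math.Max(size.Width, size.Height);
            }
        }
        catch (IOException)
        {
            return 0; // file missing or locked by another process
        }
        catch (UnauthorizedAccessException)
        {
            return 0; // no permission to read the file
        }
        catch (ArgumentException)
        {
            return 0; // not a valid image
        }
    }
}
```

Because nothing is left for the collector to clean up, a call to File.Delete right after ImageSize.MaxPixelSize should not run into a lock held by this process.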
The moral of the story? Streams are good! Files and Garbage Collection are EVOL!