Archive for July 3rd, 2008

PDF to Text + PHP

I have had a little project at work. The task is to convert a .pdf file into plain text and then cut some parts out of the text to use in my database. The hardest part of it all was that I had to do it in PHP. I looked for a conversion tool for a while without finding anything suitable. FPDI did not do what I needed, and then I came across xpdf. I had to use PHP's exec() function to run the conversion, and the output turned out to be very messy. After some manipulation of the text I managed to clean up almost all of the line breaks. How much time I am going to spend trying to preg_match the whole text for the stuff I need, I don’t know. I am sure it is not going to be easy, since plain text does not even have tags to hold on to the way HTML does. The thing is, the format of the conversion output depends strictly on how the document was laid out in the PDF, and judging by my .pdf file, its creator was a long way from being a publishing professional. Oh yeah, the tool is called pdftotext, and I used the Windows executable. The xpdf package is available for many operating systems, though, so everybody can find a suitable build.
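Here is a rough sketch of how that pipeline could look in PHP. It assumes pdftotext is on the PATH. The function names, the clean-up rules and the "Label: value" pattern are all made up for illustration, so treat it as a starting point and not as the code from the project.

```php
<?php
// Hypothetical helper: build the command line for pdftotext.
// pdftotext is called as: pdftotext [options] PDF-file [text-file]
function buildPdfToTextCommand(string $pdfPath, string $txtPath): string
{
    // escapeshellarg() keeps spaces and odd characters in paths from breaking the command
    return 'pdftotext ' . escapeshellarg($pdfPath) . ' ' . escapeshellarg($txtPath);
}

// Hypothetical helper: run the conversion and return the raw text.
function convertPdfToText(string $pdfPath, string $txtPath): string
{
    exec(buildPdfToTextCommand($pdfPath, $txtPath), $output, $exitCode);
    if ($exitCode !== 0) {
        throw new RuntimeException("pdftotext failed with exit code $exitCode");
    }
    return file_get_contents($txtPath);
}

// Hypothetical clean-up: one guess at what "fixing the breaks" might involve.
function normalizeBreaks(string $text): string
{
    // Unify line endings and turn form feeds (page breaks) into newlines
    $text = str_replace(["\r\n", "\r", "\f"], "\n", $text);
    // Rejoin words split by a hyphen at the end of a line
    // (this also joins genuinely hyphenated words that happen to wrap)
    $text = preg_replace('/(\w)-\n(\w)/', '$1$2', $text);
    // Collapse three or more newlines into a single blank line
    $text = preg_replace('/\n{3,}/', "\n\n", $text);
    // A lone newline inside a paragraph becomes a space
    $text = preg_replace('/(?<!\n)\n(?!\n)/', ' ', $text);
    // Collapse runs of spaces and tabs
    $text = preg_replace('/[ \t]+/', ' ', $text);
    return trim($text);
}

// Hypothetical extraction: grab the rest of the line after "Label:".
function extractField(string $text, string $label): ?string
{
    $pattern = '/' . preg_quote($label, '/') . '[ \t]*:[ \t]*([^\n]+)/';
    return preg_match($pattern, $text, $m) === 1 ? trim($m[1]) : null;
}
```

With a made-up file name, usage would be something like `$text = normalizeBreaks(convertPdfToText('report.pdf', 'report.txt'));` followed by `extractField($text, 'Number')`. How well the patterns hold up depends entirely on how the PDF was laid out, so expect to tune them per document.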

No WWW

This 3w thing has always made me mad. I can understand DNS servers having “www” as an alias for every domain, but when you go to http://domain.com and cannot reach it unless you put “www” in front, that just gets me. Whether they do it on purpose or the server administrators are just lame, I don’t know. So I have decided to put a no-www.org button on my page once I have my own domain and hosting. Be smart, people: learn how to adjust your BIND settings!
