PDF to Text + PHP
I have had a little project at work. The task is to convert a .pdf file into plain text and then cut some parts out of the text to use in my database. The hardest part of it all was that I had to do it in PHP. I had been looking for a conversion tool for a while without finding anything suitable. FPDI did not do what I needed, but then I came across xpdf. The tool I actually use from that package is called pdftotext. I run the Windows executable, but xpdf is available for many operating systems, so everybody should be able to find a suitable package.
I had to use PHP's exec() function to perform the conversion, and the output turned out to be very messy. After some manipulation of the text I managed to fix almost all of the line breaks. How much time I am going to spend trying to preg_match the whole text for the stuff I need, I don't know. I am sure it is not going to be easy, since the text does not even have tags to hold on to the way HTML does. The thing is, the format of the conversion output depends entirely on how the document was laid out in the PDF. And judging by my .pdf file, its creator was a long way from being a publishing professional.
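Here is a minimal sketch of what this kind of pipeline could look like. It is not the code from my project. It assumes the pdftotext binary is on the PATH, and the function names (pdfToText, cleanText, extractField) and the "Label: value" line format are made up for illustration; a real document will need its own patterns.

```php
<?php
// Sketch only: assumes the pdftotext binary from xpdf is on the PATH.

/**
 * Run pdftotext on a PDF file and return the extracted text.
 */
function pdfToText(string $pdfPath): string
{
    // "-layout" tries to keep the physical page layout;
    // the trailing "-" sends the text to stdout instead of a file.
    $cmd = 'pdftotext -layout ' . escapeshellarg($pdfPath) . ' -';
    exec($cmd, $lines, $status);
    if ($status !== 0) {
        throw new RuntimeException("pdftotext failed with exit code $status");
    }
    return implode("\n", $lines);
}

/**
 * Normalize line breaks so the text is easier to match against.
 */
function cleanText(string $text): string
{
    // Treat Windows/Mac line endings and page breaks (form feeds) as "\n".
    $text = str_replace(["\r\n", "\r", "\f"], "\n", $text);
    // Strip trailing spaces and tabs from every line.
    $text = preg_replace('/[ \t]+$/m', '', $text);
    // Collapse runs of three or more newlines into one blank line.
    $text = preg_replace('/\n{3,}/', "\n\n", $text);
    return trim($text);
}

/**
 * Hypothetical helper: return the value after "Label:" on a line,
 * or null if no such line exists.
 */
function extractField(string $text, string $label): ?string
{
    $pattern = '/^[ \t]*' . preg_quote($label, '/') . '[ \t]*:[ \t]*(.+)$/m';
    return preg_match($pattern, $text, $m) === 1 ? trim($m[1]) : null;
}
```

Typical use would be something like extractField(cleanText(pdfToText('report.pdf')), 'Invoice No'), where the file name and the label are placeholders. Cleaning the text before matching keeps the regular expressions short, since they no longer have to account for stray carriage returns or page breaks.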