zulooworking.blogg.se - A pdf extractor

I convert a ton of text documents like PDFs to spreadsheets. So every time a new iteration of AI technology arrives, I wonder if it’s capable of doing what so many people ask for: to hand off a PDF, ask for a spreadsheet, and get one back. After throwing a couple of programming problems at OpenAI’s ChatGPT and getting a viable result, I wondered if we were finally there.

Back when OpenAI’s GPT-3 was the hot new thing, I saw Montreal journalist Roberto Rocha attempt a similar test. The results were lackluster, but ChatGPT, OpenAI’s newest model, has several improvements that make it better suited to extraction: It’s 10 times larger than GPT-3 and is generally more coherent as a result, it’s been trained to explicitly follow instructions, and it understands programming languages.

To test how well ChatGPT could extract structured data from PDFs, I wrote a Python script (which I’ll share at the end!) to convert two document sets to spreadsheets:

A 7,000-page PDF of New York data breach notification forms. There were five different forms, bad OCR, and some freeform letters mixed in.

1,400 memos from internal police investigations. These were completely unstructured and contained emails and document scans.

The process had four steps:

1) Redo the OCR, using the highest quality tools possible. This was critically important because ChatGPT refused to work with poorly OCR’d text.
2) Clean the data as well as I could, maintaining physical layout and removing garbage characters and boilerplate text.
3) Break the documents into individual records.
4) Ask ChatGPT to turn each record into JSON.

I spent about a week getting familiar with both datasets and doing all this preprocessing. Once that’s done, getting ChatGPT to convert a piece of text into JSON is really easy. You can paste in a record and say “return a JSON representation of this” and it will do it. But doing this for multiple records is a bad idea because ChatGPT will invent its own schema, using randomly chosen field names from the text. It will also decide on its own way to parse values. Addresses, for example, will sometimes end up as a string and sometimes as a JSON object or an array, with the constituent parts of an address split up.

Prompt design is the most important factor in getting consistent results, and your language choices make a huge difference. One tip: Figure out what wording ChatGPT uses when referring to a task and mimic that. (If you don’t know, you can always ask: “Explain how you’d _ using _.”)

Because ChatGPT understands code, I designed my prompt around asking for JSON that conforms to a given JSON schema. I tried to extract a JSON object from every response and run some validation checks against it. Two checks were particularly important: 1) making sure the JSON was complete, not truncated or broken, and 2) making sure the keys and values matched the schema. I retried if the validation check failed, and usually I’d get valid JSON back on the second or third attempt. If it continued to fail, I’d make a note of it and skip the record.

Impressively, ChatGPT built a mostly usable dataset. At first glance, I even thought I had a perfectly extracted dataset. But the errors, although subtle and relatively infrequent, were enough to prevent me from doing the basic analyses that most data journalists want to do.
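The schema-based prompt and the validate-and-retry loop described in this post can be sketched roughly as follows. This is not the script the post refers to, just a minimal illustration under stated assumptions: the schema fields, the prompt wording, the `ask_model` callable (a stand-in for whatever sends a prompt to ChatGPT and returns its reply), and the limit of three attempts are all hypothetical. The schema check is hand-rolled and only covers flat objects with required keys and simple types.

```python
import json

# Hypothetical schema for one record; the field names are illustrative only.
SCHEMA = {
    "type": "object",
    "properties": {
        "organization": {"type": "string"},
        "date_of_breach": {"type": "string"},
        "affected_count": {"type": "integer"},
    },
    "required": ["organization", "date_of_breach", "affected_count"],
}

# Mapping from JSON schema type names to Python types.
JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def build_prompt(record_text, schema):
    """Ask for JSON that conforms to a given schema (wording is an assumption)."""
    return (
        "Return a JSON object that conforms to this JSON schema:\n"
        f"{json.dumps(schema, indent=2)}\n\n"
        "Text to convert:\n"
        f"{record_text}"
    )


def extract_json(response):
    """Check 1: return the first complete JSON object in the reply, else None."""
    decoder = json.JSONDecoder()
    start = response.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(response, start)
        except json.JSONDecodeError:
            obj = None  # truncated or broken at this position
        if isinstance(obj, dict):
            return obj
        start = response.find("{", start + 1)
    return None


def matches_schema(obj, schema):
    """Check 2: required keys present, no invented keys, values of the right type."""
    props = schema["properties"]
    if any(key not in obj for key in schema.get("required", [])):
        return False
    for key, value in obj.items():
        if key not in props:
            return False  # the model invented a field name
        type_name = props[key]["type"]
        # bool is a subclass of int in Python, so reject it for non-boolean fields.
        if isinstance(value, bool) and type_name != "boolean":
            return False
        if not isinstance(value, JSON_TYPES[type_name]):
            return False
    return True


def extract_record(record_text, ask_model, schema=SCHEMA, max_attempts=3):
    """Retry until a reply passes both checks; return None so the caller can skip."""
    prompt = build_prompt(record_text, schema)
    for _ in range(max_attempts):
        obj = extract_json(ask_model(prompt))
        if obj is not None and matches_schema(obj, schema):
            return obj
    return None
```

A caller would loop over the individual records, pass each one to `extract_record`, and log any record that comes back as `None` for manual review. Passing both checks only shows that the output is well formed; it says nothing about whether the extracted values are correct.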