Pdf to Markdown

Turn a PDF file into Markdown text.

Try it in the Widget Center

Click this URL to try the widget and copy its Pro Config template.

Usage

Converts the input PDF file into a Markdown string.

Input Parameters

| Name | Type | Description | Default | Required |
|------|------|-------------|---------|----------|
| document | string | The input file (PDF, EPUB, MOBI, XPS, or FB2). The PDF should not contain blank pages. | | |
| page_range | integer | The last page to parse. Defaults to -1, which means all pages. | -1 | |
| parallel_factor | integer | The parallel factor to use for OCR. | 1 | |
| lang | string | The language to use for OCR. | English | |
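
As a rough illustration of how the inputs fit together, the sketch below merges caller-supplied values with the defaults listed above and checks their types. The helper name `build_inputs` is hypothetical and not part of the widget. How the resulting dictionary is actually submitted (Pro Config template, HTTP call, or otherwise) is not covered here, and the exact form of the `document` string (URL, path, or ID) is an assumption left to the caller.

```python
from typing import Any, Dict

# Defaults copied from the input parameter table; `document` has no default.
DEFAULTS: Dict[str, Any] = {
    "page_range": -1,       # -1 means all pages
    "parallel_factor": 1,   # parallel factor used for OCR
    "lang": "English",      # language used for OCR
}


def build_inputs(document: str, **overrides: Any) -> Dict[str, Any]:
    """Hypothetical helper: merge overrides with the documented defaults."""
    if not isinstance(document, str) or not document:
        raise ValueError("document must be a non-empty string")

    # Reject parameter names that the table does not list.
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameter(s): {sorted(unknown)}")

    inputs = {"document": document, **DEFAULTS, **overrides}

    for name in ("page_range", "parallel_factor"):
        value = inputs[name]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(f"{name} must be an integer")
    if not isinstance(inputs["lang"], str):
        raise TypeError("lang must be a string")
    return inputs
```

For example, `build_inputs("sample.pdf", page_range=3)` keeps the documented defaults for `parallel_factor` and `lang` and overrides only the last page to parse.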

Output Parameters

| Name | Type | Description | File Type |
|------|------|-------------|-----------|
| markdown_string | string | The markdown that was generated. | |
| metadata_string | string | The metadata of the pdf file. | |

Output Example

{
  "markdown_string": "\n## An H1 Header\n\nParagraphs are separated by a blank line.\n\n2nd paragraph. Italic, **bold**, and monospace. Itemized lists look like:\nthis one that one the other one Note that - not considering the asterisk - the actual text content starts at 4- columns in.\n\nBlock quotes are written like so. They can span multiple paragraphs, if you like.\n\nUse 3 dashes for an em-dash. Use 2 dashes for ranges (ex., “it’s all in chapters 12– 14”). Three dots … will be converted to an ellipsis. Unicode is supported. ☺\n\n## An H2 Header\n\nHere’s a numbered list:\n\n1. first item 2. second item 3. third item\nNote again how the actual text starts at 4 columns in (4 characters from the left side). Here’s a code sample:\n# Let me re-iterate ... for i in 1 .. 10 { do-something(i) }\nAs you probably guessed, indented 4 spaces. By the way, instead of indenting the block, you can use delimited blocks, if you like:\ndefine foobar() {\n    print \"Welcome to flavor country!\"; }\n(which makes copying \u0026 pasting easier). You can optionally mark the delimited block for Pandoc to syntax highlight it:\nimport time\n# Quick, count to ten!\n\nfor i in range(10): # (but not *too* quick) time.sleep(0.5) print i\n\n## An H3 Header\n\nNow a nested list:\n1. First, get these ingredients:\ncarrots celery lentils\n\n2. Boil some water. 3. Dump everything in the pot and follow this algorithm:\nfind wooden spoon uncover pot stir cover pot balance wooden spoon precariously on pot handle wait 10 minutes goto first step (or shut off burner when done)\nDo not bump wooden spoon or it will fall.\n\nNotice again how text always lines up on 4-space indents (including that last line which continues item 3 above). Here’s a link to a website, to a local doc, and to a section heading in the current doc. 
Here’s a footnote 1.\n\nTables can look like this:\nShoes, their sizes, and what they’re made of size material color\n9\nleather brown\n10\nhemp canvas natural\n11\nglass transparent\n(The above is the caption for the table.) Pandoc also supports multi-line tables:\n\n| keyword                   |\n|---------------------------|\n| red                       |\n| Sunsets, apples, and      |\n| other red or reddish      |\n| things.                   |\n| green                     |\n| Leaves, grass, frogs      |\n| and other things it’s not |\n| easy being.               |\n\nA horizontal rule follows. Here’s a definition list: apples Good for making applesauce. oranges Citrus! tomatoes There’s no “e” in tomatoe.\n\nAgain, text is indented 4 spaces. (Put a blank line between each term/definition pair to spread things out more.) Here’s a “line block”: Line one Line too Line tree and images can be specified like so:\n\n## Example Image\n\nInline math equations go in like so: ω = dϕ/dt. Display math should get its own line and be put in in double-dollarsigns:\n\n## I = ∫Ρr2Dv\n\nAnd note that you can backslash-escape any punctuation characters which you wish to be displayed literally, ex.: `foo`, *bar*, etc.\n\n1. Footnote text goes here.↩︎",
  "metadata_string": "{\"language\": \"English\", \"filetype\": \"pdf\", \"toc\": [[1, \"An h1 header\", 1], [2, \"An h2 header\", 1], [3, \"An h3 header\", 1]], \"pages\": 3, \"ocr_stats\": {\"ocr_pages\": 0, \"ocr_failed\": 0, \"ocr_success\": 0}, \"block_stats\": {\"header_footer\": 0, \"code\": 0, \"table\": 1, \"equations\": {\"successful_ocr\": 0, \"unsuccessful_ocr\": 0, \"equations\": 0}}, \"postprocess_stats\": {\"edit\": {}}}"
}
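
Note that `metadata_string` in the example above is itself a JSON document encoded as a string, so it needs a second decoding step after the outer response is parsed. The sketch below shows one way to do that in Python. The function names are hypothetical, the sample is abridged from the example output, and reading each `toc` entry as `[level, title, page]` is an assumption inferred from that example rather than a documented contract.

```python
import json
from typing import Any, Dict, List, Tuple


def parse_widget_output(raw: Dict[str, str]) -> Tuple[str, Dict[str, Any]]:
    """Return the markdown text and the decoded metadata dictionary."""
    markdown = raw["markdown_string"]
    # metadata_string holds JSON text, so decode it separately.
    metadata = json.loads(raw["metadata_string"])
    return markdown, metadata


def format_toc(metadata: Dict[str, Any]) -> List[str]:
    """Render toc entries as indented lines (assumes [level, title, page])."""
    lines = []
    for level, title, page in metadata.get("toc", []):
        indent = "  " * (level - 1)
        lines.append(f"{indent}{title} (p. {page})")
    return lines


# Abridged sample based on the output example.
sample = {
    "markdown_string": "\n## An H1 Header\n\nParagraphs are separated by a blank line.",
    "metadata_string": (
        '{"language": "English", "filetype": "pdf", '
        '"toc": [[1, "An h1 header", 1], [2, "An h2 header", 1], '
        '[3, "An h3 header", 1]], "pages": 3}'
    ),
}

markdown, metadata = parse_widget_output(sample)
print(metadata["pages"])
print("\n".join(format_toc(metadata)))
```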
