
Pdf to Markdown

Convert a PDF file into Markdown text.


Last updated 1 year ago

Try it in the Widget Center

Click this to try this widget and copy the Pro Config template.

Usage

Converts the input PDF file into a Markdown string.

Input Parameters

| Name | Type | Description | Default | Required |
|---|---|---|---|---|
| document | string | Input file (PDF, EPUB, MOBI, XPS, or FB2). The PDF must not contain blank pages. | | |
| page_range | integer | The last page to parse. Defaults to -1, which parses all pages. | -1 | |
| parallel_factor | integer | The parallel factor to use for OCR. | 1 | |
| lang | string | The language to use for OCR. | English | |
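The input parameters and their defaults can be collected into a plain dictionary before they are handed to the widget. The following is a minimal sketch under stated assumptions: the helper name `build_pdf_to_markdown_inputs` is hypothetical and not part of Pro Config, the example URL is a placeholder, and treating `document` as required is an assumption, since the table leaves the Required column empty. How the dictionary is actually submitted depends on your Pro Config setup and is not shown here.

```python
from typing import Any, Dict


def build_pdf_to_markdown_inputs(
    document: str,
    page_range: int = -1,
    parallel_factor: int = 1,
    lang: str = "English",
) -> Dict[str, Any]:
    """Hypothetical helper: assemble the widget inputs with the documented defaults."""
    if not document:
        # Assumption: `document` has no default, so it is treated as required here.
        raise ValueError("document must be a non-empty string")
    return {
        "document": document,
        # -1 means "parse all pages"; any other value is the last page to parse.
        "page_range": page_range,
        "parallel_factor": parallel_factor,
        "lang": lang,
    }


# Placeholder URL, used only for illustration.
inputs = build_pdf_to_markdown_inputs("https://example.com/sample.pdf")
print(inputs)
```

Keeping the defaults in one place makes it easier to override a single parameter, for example `page_range=5` to stop after the fifth page.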

Output Parameters

| Name | Type | Description | File Type |
|---|---|---|---|
| markdown_string | string | The generated Markdown. | |
| metadata_string | string | Metadata of the PDF file, encoded as a JSON string. | |

Output Example

{
  "markdown_string": "\n## An H1 Header\n\nParagraphs are separated by a blank line.\n\n2nd paragraph. Italic, **bold**, and monospace. Itemized lists look like:\nthis one that one the other one Note that - not considering the asterisk - the actual text content starts at 4- columns in.\n\nBlock quotes are written like so. They can span multiple paragraphs, if you like.\n\nUse 3 dashes for an em-dash. Use 2 dashes for ranges (ex., “it’s all in chapters 12– 14”). Three dots … will be converted to an ellipsis. Unicode is supported. ☺\n\n## An H2 Header\n\nHere’s a numbered list:\n\n1. first item 2. second item 3. third item\nNote again how the actual text starts at 4 columns in (4 characters from the left side). Here’s a code sample:\n# Let me re-iterate ... for i in 1 .. 10 { do-something(i) }\nAs you probably guessed, indented 4 spaces. By the way, instead of indenting the block, you can use delimited blocks, if you like:\ndefine foobar() {\n    print \"Welcome to flavor country!\"; }\n(which makes copying \u0026 pasting easier). You can optionally mark the delimited block for Pandoc to syntax highlight it:\nimport time\n# Quick, count to ten!\n\nfor i in range(10): # (but not *too* quick) time.sleep(0.5) print i\n\n## An H3 Header\n\nNow a nested list:\n1. First, get these ingredients:\ncarrots celery lentils\n\n2. Boil some water. 3. Dump everything in the pot and follow this algorithm:\nfind wooden spoon uncover pot stir cover pot balance wooden spoon precariously on pot handle wait 10 minutes goto first step (or shut off burner when done)\nDo not bump wooden spoon or it will fall.\n\nNotice again how text always lines up on 4-space indents (including that last line which continues item 3 above). Here’s a link to a website, to a local doc, and to a section heading in the current doc. Here’s a footnote 1.\n\nTables can look like this:\nShoes, their sizes, and what they’re made of size material color\n9\nleather brown\n10\nhemp canvas natural\n11\nglass transparent\n(The above is the caption for the table.) Pandoc also supports multi-line tables:\n\n| keyword                   |\n|---------------------------|\n| red                       |\n| Sunsets, apples, and      |\n| other red or reddish      |\n| things.                   |\n| green                     |\n| Leaves, grass, frogs      |\n| and other things it’s not |\n| easy being.               |\n\nA horizontal rule follows. Here’s a definition list: apples Good for making applesauce. oranges Citrus! tomatoes There’s no “e” in tomatoe.\n\nAgain, text is indented 4 spaces. (Put a blank line between each term/definition pair to spread things out more.) Here’s a “line block”: Line one Line too Line tree and images can be specified like so:\n\n## Example Image\n\nInline math equations go in like so: ω = dϕ/dt. Display math should get its own line and be put in in double-dollarsigns:\n\n## I = ∫Ρr2Dv\n\nAnd note that you can backslash-escape any punctuation characters which you wish to be displayed literally, ex.: `foo`, *bar*, etc.\n\n1. Footnote text goes here.↩︎",
  "metadata_string": "{\"language\": \"English\", \"filetype\": \"pdf\", \"toc\": [[1, \"An h1 header\", 1], [2, \"An h2 header\", 1], [3, \"An h3 header\", 1]], \"pages\": 3, \"ocr_stats\": {\"ocr_pages\": 0, \"ocr_failed\": 0, \"ocr_success\": 0}, \"block_stats\": {\"header_footer\": 0, \"code\": 0, \"table\": 1, \"equations\": {\"successful_ocr\": 0, \"unsuccessful_ocr\": 0, \"equations\": 0}}, \"postprocess_stats\": {\"edit\": {}}}"
}
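
In the example output, `metadata_string` is itself a JSON document stored inside a string, so it needs a second decoding step before fields such as `pages` or `toc` can be read. The following is a minimal sketch of that step using Python's standard `json` module. The function names `parse_widget_output` and `outline_from_toc` are hypothetical, the sample payload is a shortened version of the example output, and reading each `toc` entry as `[level, title, page]` is an assumption based on the values shown in that example.

```python
import json
from typing import Any, Dict, List, Tuple


def parse_widget_output(raw_output: str) -> Tuple[str, Dict[str, Any]]:
    """Split the widget output into Markdown text and decoded metadata."""
    result = json.loads(raw_output)
    markdown = result["markdown_string"]
    # metadata_string holds JSON encoded as a string, so decode it a second time.
    metadata = json.loads(result["metadata_string"])
    return markdown, metadata


def outline_from_toc(metadata: Dict[str, Any]) -> List[str]:
    """Render toc entries as an indented outline.

    Assumption: each entry is [level, title, page], as the example output suggests.
    """
    lines = []
    for level, title, page in metadata.get("toc", []):
        indent = "  " * (level - 1)
        lines.append(f"{indent}- {title} (page {page})")
    return lines


# Shortened stand-in for the example output, built here for illustration only.
sample_output = json.dumps({
    "markdown_string": "\n## An H1 Header\n\nParagraphs are separated by a blank line.",
    "metadata_string": json.dumps({
        "language": "English",
        "filetype": "pdf",
        "toc": [[1, "An h1 header", 1], [2, "An h2 header", 1], [3, "An h3 header", 1]],
        "pages": 3,
    }),
})

markdown, metadata = parse_widget_output(sample_output)
print(metadata["pages"])
print("\n".join(outline_from_toc(metadata)))
```

Decoding the metadata once at the boundary keeps the rest of a workflow working with ordinary dictionaries rather than nested JSON strings.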
