Document and Content Analysis

Most of the data we interact with day to day comes not as data structures or databases but as documents and document images. This course introduces students to the formats, techniques, and algorithms used for representing, compressing, analyzing, processing, and displaying documents. Topics covered include:

  • document formats and standards (TIFF, JPEG, PDF, PostScript, SVG)
  • document image compression (G4, MRC, token-based compression, JPEG 2000)
  • logical markup (HTML, XML, word processing formats, DocBook)
  • writing systems of the world
  • character sets and character encodings (ASCII, Unicode, special coding systems)
  • text rendering, layout, ligatures, and hyphenation (Pango)
  • typesetting and page layout systems (text flow, Word, LaTeX, etc.)
  • OCR (character recognition, page segmentation)
  • spelling and orthographic variation, statistical language modeling
  • document capture, page image dewarping, and handheld document capture
  • named entity recognition, information extraction, table recognition
  • document search and retrieval, text mining, document databases
  • reading, psychophysics, and human-document interaction
  • document security and forensics

Materials

Course Materials