Readability (OpenIMAJ master project 1.3.10 API)

java.lang.Object
- org.openimaj.web.readability.Readability

```
public class Readability
extends Object
```
Class for extracting the main "content" from web pages while ignoring adverts and similar clutter. Based upon readability.js (http://lab.arc90.com/experiments/readability/) and modified to behave better for certain sites (and typically to better mimic Safari Reader functionality).

Author:

Jonathon Hare (jsh2@ecs.soton.ac.uk), Michael Matthews (mikemat@yahoo-inc.com), David Dupplaw (dpd@ecs.soton.ac.uk)

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`protected class`	`Readability.MappingNode`
`protected static class`	`Readability.Regexps` Regular expressions for different types of content

Field Summary

Fields
Modifier and Type	Field and Description
`protected boolean`	`addTitle`
`protected String`	`article_contentType`
`protected Date`	`article_date`
`protected String`	`article_date_string`
`protected Element`	`articleContent`
`protected String`	`articleTitle`
`protected boolean`	`debug`
`protected Document`	`document`
`protected EnumSet<org.openimaj.web.readability.Readability.Flag>`	`flags`
`static float`	`LINK_DENSITY_THRESHOLD` Threshold for removing elements with lots of links

Constructor Summary

Constructors
Constructor and Description
`Readability(Document document)` Construct with the given document.
`Readability(Document document, boolean debug)` Construct with the given document.
`Readability(Document document, boolean debug, boolean addTitle)` Construct with the given document.

Method Summary

Modifier and Type	Method and Description
`static void`	`augmentDocument(Document document)` Iterates through all the ELEMENT nodes in a document and gives them ids if they don't already have them.
`protected void`	`clean(Element e, String tag)` Clean a node of all elements of type "tag".
`protected void`	`cleanConditionally(Element e, String tag)` Clean an element of all tags of type "tag" if they look fishy.
`protected void`	`cleanHeaders(Element e)` Clean out spurious headers from an Element.
`protected void`	`cleanStyles()`
`protected void`	`cleanStyles(Element e)` Remove the style attribute on every e and under.
`protected void`	`dbg(String s)`
`protected void`	`findArticleDate()`
`protected void`	`findArticleEncoding()`
`protected String`	`findArticleTitle()` Get the article title.
`protected int`	`findChildNodeIndex(Node parent, Node childToFind)`
`protected List<Node>`	`findChildNodesWithName(Node parent, String name)`
`List<Anchor>`	`getAllLinks()`
`String`	`getArticleContentType()`
`Date`	`getArticleDate()`
`protected String`	`getArticleDateString()`
`Node`	`getArticleHTML_DOM()`
`String`	`getArticleHTML()`
`List<String>`	`getArticleImages()`
`List<Anchor>`	`getArticleLinks()`
`List<String>`	`getArticleSubheadings()`
`String`	`getArticleText()`
`List<Readability.MappingNode>`	`getArticleTextMapping()` Get the mapping between bits of text in the dom & their xpaths
`protected void`	`getArticleTextMapping(org.w3c.dom.traversal.TreeWalker walker, List<Readability.MappingNode> map)`
`String`	`getArticleTitle()`
`protected Element`	`getBody()` Equivalent to document.body in JS
`protected int`	`getCharCount(Element e)`
`protected int`	`getCharCount(Element e, String s)` Get the number of times a string s appears in the node e.
`protected int`	`getClassWeight(Element e)` Get an element's class/id weight.
`protected String`	`getInnerHTML(Node n)`
`protected String`	`getInnerText(Element e)`
`protected String`	`getInnerText(Element e, boolean normalizeSpaces)` Get the inner text of a node - cross browser compatibly.
`protected String`	`getInnerTextSep(Node e)`
`protected float`	`getLinkDensity(Element e)` Get the density of links as a percentage of the content. This is the amount of text that is inside a link divided by the total text in the node.
`static Readability`	`getReadability(String html)` Convenience method to build a `Readability` instance from an html string.
`static Readability`	`getReadability(String html, boolean addTitle)` Convenience method to build a `Readability` instance from an html string.
`protected String`	`getTitle()`
`protected Element`	`grabArticle()` grabArticle - Using a variety of metrics (content score, classname, element types), find the content that is most likely to be the stuff a user wants to read.
`boolean`	`hasContent()`
`protected void`	`init()` Runs readability.
`protected void`	`initializeNode(Element node)` Initialize a node with the readability object.
`protected void`	`killBreaks(Element e)` Remove extraneous break tags from a node.
`static void`	`main(String[] argv)` Testing
`protected String[]`	`match(String input, String regex)` Javascript-like String.match
`protected String`	`nodeToString(Node n)`
`protected static String`	`nodeToString(Node n, boolean pretty)`
`protected void`	`parseDate()`
`protected void`	`prepArticle(Element articleContent)` Prepare the article node for display.
`protected void`	`prepDocument()` Prepare the HTML document for readability to scrape it.
`protected void`	`removeChildren(Node n)`
`protected void`	`removeComments(Node n)`
`protected int`	`search(String input, String regex)` Javascript-like String.search
`protected Node`	`stringToNode(String str)`

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - LINK_DENSITY_THRESHOLD
```
public static float LINK_DENSITY_THRESHOLD
```
    Threshold for removing elements with lots of links
  - document
```
protected Document document
```
  - flags
```
protected EnumSet<org.openimaj.web.readability.Readability.Flag> flags
```
  - articleTitle
```
protected String articleTitle
```
  - articleContent
```
protected Element articleContent
```
  - article_date_string
```
protected String article_date_string
```
  - article_date
```
protected Date article_date
```
  - article_contentType
```
protected String article_contentType
```
  - debug
```
protected boolean debug
```
  - addTitle
```
protected boolean addTitle
```
- Constructor Detail
  - Readability
```
public Readability(Document document)
```
    Construct with the given document. Debugging is disabled.
    
    Parameters:
    
    document - The document.
  - Readability
```
public Readability(Document document,
                   boolean debug)
```
    Construct with the given document. The second argument can be used to enable debugging output.
    
    Parameters:
    
    document - The document.
    
    debug - Enable debugging output.
  - Readability
```
public Readability(Document document,
                   boolean debug,
                   boolean addTitle)
```
    Construct with the given document. The second argument can be used to enable debugging output. The third option controls whether the title should be included in the output.
    
    Parameters:
    
    document - The document.
    
    debug - Enable debugging output.
    
    addTitle - Add title to output.
- Method Detail
  - augmentDocument
```
public static void augmentDocument(Document document)
```
    Iterates through all the ELEMENT nodes in a document and gives them ids if they don't already have them.
    
    Parameters:
    
    document -
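    
    The id-assignment pass described above can be sketched with the JDK's own DOM classes. This is a minimal illustration, not the OpenIMAJ implementation: the class name `AugmentSketch`, the helper `parse`, and the `gen-id-` prefix are all assumptions made for the example, and the sketch does not guard against a generated id colliding with an existing one.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch; names and the id scheme are assumptions.
final class AugmentSketch {
    /** Gives each element that lacks an id attribute a generated one; returns how many were added. */
    static int addMissingIds(Document document) {
        NodeList all = document.getElementsByTagName("*"); // every element, in document order
        int added = 0;
        for (int i = 0; i < all.getLength(); i++) {
            Element el = (Element) all.item(i);
            if (!el.hasAttribute("id")) {
                el.setAttribute("id", "gen-id-" + added); // assumed naming scheme
                added++;
            }
        }
        return added;
    }

    /** Parses well-formed XML/XHTML into a DOM document (helper for trying the sketch). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception ex) {
            throw new IllegalArgumentException("input is not well-formed XML", ex);
        }
    }
}
```

    Calling `addMissingIds` a second time on the same document adds nothing, because every element already carries an id after the first pass.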
  - dbg
```
protected void dbg(String s)
```
  - getTitle
```
protected String getTitle()
```
  - match
```
protected String[] match(String input,
                         String regex)
```
    Javascript-like String.match
    
    Parameters:
    
    input -
    
    regex -
    
    Returns:
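    
    As a rough illustration of what "Javascript-like String.match" could mean in Java, the sketch below mimics the non-global form of the JavaScript method: it returns the whole first match followed by its capture groups, or `null` when nothing matches. Whether the OpenIMAJ method follows the global or non-global behaviour is not stated here, so treat this as an assumption; `MatchSketch` is a hypothetical name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a JavaScript-style (non-global) String.match.
final class MatchSketch {
    /** Returns {whole match, group 1, ...} for the first match, or null if there is none. */
    static String[] match(String input, String regex) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        String[] groups = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            groups[i] = m.group(i); // null for a group that did not take part in the match
        }
        return groups;
    }
}
```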
  - hasContent
```
public boolean hasContent()
```
    Returns:
    
    True if the article has any detected content; false otherwise.
  - search
```
protected int search(String input,
                     String regex)
```
    Javascript-like String.search
    
    Parameters:
    
    input -
    
    regex -
    
    Returns:
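    
    JavaScript's `String.search` returns the index of the first match, or -1 if there is none. A minimal Java equivalent, assuming that is the behaviour being mirrored here (`SearchSketch` is a hypothetical name), might look like this:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a JavaScript-style String.search.
final class SearchSketch {
    /** Returns the start index of the first match of regex in input, or -1 if there is none. */
    static int search(String input, String regex) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }
}
```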
  - findArticleEncoding
```
protected void findArticleEncoding()
```
  - findArticleDate
```
protected void findArticleDate()
```
  - parseDate
```
protected void parseDate()
```
  - findArticleTitle
```
protected String findArticleTitle()
```
    Get the article title.
    
    Returns:
    
    The article title.
  - getBody
```
protected Element getBody()
```
    Equivalent to document.body in JS
    
    Returns:
  - init
```
protected void init()
```
    Runs readability. Workflow:
    
    1. Prep the document by removing script tags, CSS, etc.
    2. Build readability's DOM tree.
    3. Grab the article content from the current DOM tree.
    4. Replace the current DOM tree with the new one.
    5. Read peacefully.
  - prepDocument
```
protected void prepDocument()
```
    Prepare the HTML document for readability to scrape it. This includes things like stripping javascript, CSS, and handling terrible markup.
  - removeComments
```
protected void removeComments(Node n)
```
  - prepArticle
```
protected void prepArticle(Element articleContent)
```
    Prepare the article node for display. Clean out any inline styles, iframes, forms, strip extraneous tags, etc.
    
    Parameters:
    
    articleContent - The article element to prepare.
  - removeChildren
```
protected void removeChildren(Node n)
```
  - initializeNode
```
protected void initializeNode(Element node)
```
    Initialize a node with the readability object. Also checks the className/id for special names to add to its score.
    
    Parameters:
    
    node - The element to initialize.
  - getClassWeight
```
protected int getClassWeight(Element e)
```
    Get an element's class/id weight. Uses regular expressions to tell if this element looks good or bad.
    
    Parameters:
    
    e - The element to score.
    
    Returns:
    
    number (Integer)
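    
    The idea of scoring an element by its class and id names can be sketched as follows. The patterns and the step of 25 are illustrative assumptions loosely modelled on readability.js; the actual patterns live in `Readability.Regexps` and are not shown on this page. `ClassWeightSketch` and `weightOf` are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; patterns and weights are assumptions, not the library's values.
final class ClassWeightSketch {
    private static final Pattern POSITIVE = Pattern.compile(
            "article|body|content|entry|main|post|text", Pattern.CASE_INSENSITIVE);
    private static final Pattern NEGATIVE = Pattern.compile(
            "comment|footer|footnote|masthead|sidebar|sponsor|widget", Pattern.CASE_INSENSITIVE);
    private static final int STEP = 25; // assumed bonus/penalty per hit

    /** Sums the scores of the class attribute and the id attribute. */
    static int weightOf(String className, String id) {
        return score(className) + score(id);
    }

    /** Scores a single name: penalised if it looks like clutter, rewarded if it looks like content. */
    private static int score(String name) {
        if (name == null || name.isEmpty()) {
            return 0;
        }
        int weight = 0;
        if (NEGATIVE.matcher(name).find()) {
            weight -= STEP;
        }
        if (POSITIVE.matcher(name).find()) {
            weight += STEP;
        }
        return weight;
    }
}
```

    A name that matches both patterns (for example `main sidebar`) nets out to zero in this sketch.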
  - cleanStyles
```
protected void cleanStyles()
```
  - cleanStyles
```
protected void cleanStyles(Element e)
```
    Remove the style attribute from e and from every element under it. TODO: Test if getElementsByTagName(*) is faster.
    
    Parameters:
    
    e - The root element to strip style attributes from.
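    
    A recursive walk that strips `style` attributes can be written against the standard `org.w3c.dom` API. This is a sketch of the technique only; `CleanStylesSketch`, `removeStyles` and `parse` are hypothetical names, and the real method may differ in detail.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch of removing style attributes from an element subtree.
final class CleanStylesSketch {
    /** Removes the style attribute from e and, recursively, from every descendant element. */
    static void removeStyles(Element e) {
        e.removeAttribute("style"); // no-op when the attribute is absent
        for (Node child = e.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                removeStyles((Element) child);
            }
        }
    }

    /** Parses well-formed XML/XHTML into a DOM document (helper for trying the sketch). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception ex) {
            throw new IllegalArgumentException("input is not well-formed XML", ex);
        }
    }
}
```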
  - killBreaks
```
protected void killBreaks(Element e)
```
    Remove extraneous break tags from a node.
    
    Parameters:
    
    e - The element to clean.
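    
    One way to remove extraneous break tags is to collapse each run of `<br>` tags (plus any whitespace or `&nbsp;` between them) into a single break in the serialized HTML. The sketch below works on a string rather than an `Element`, and the pattern is an assumption modelled on readability.js; `KillBreaksSketch` is a hypothetical name.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; the pattern is an assumption, not the library's own.
final class KillBreaksSketch {
    // One or more <br> tags, each optionally followed by whitespace or &nbsp; entities.
    private static final Pattern BREAK_RUN =
            Pattern.compile("(<br\\s*/?>(\\s|&nbsp;?)*)+", Pattern.CASE_INSENSITIVE);

    /** Replaces every run of break tags with a single "<br />". */
    static String collapseBreaks(String html) {
        return BREAK_RUN.matcher(html).replaceAll("<br />");
    }
}
```

    Note that whitespace directly after the last break in a run is consumed as part of the run.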
  - clean
```
protected void clean(Element e,
                     String tag)
```
    Clean a node of all elements of type "tag". (Unless it's a youtube/vimeo video. People love movies.)
    
    Parameters:
    
    e - The element to clean.
    
    tag - The tag to clean.
  - cleanHeaders
```
protected void cleanHeaders(Element e)
```
    Clean out spurious headers from an Element. Checks things like classnames and link density.
    
    Parameters:
    
    e - The element to clean.
  - getLinkDensity
```
protected float getLinkDensity(Element e)
```
    Get the density of links as a percentage of the content. This is the amount of text that is inside a link divided by the total text in the node.
    
    Parameters:
    
    e - The element to measure.
    
    Returns:
    
    number (float)
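    
    The ratio described above (text inside links divided by all text in the node) can be sketched with the standard DOM API. This is an illustration under assumptions: it returns a fraction between 0 and 1, looks only for lowercase `a` elements, and uses hypothetical names (`LinkDensitySketch`, `linkDensity`, `parse`).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch of a link-density calculation.
final class LinkDensitySketch {
    /** Returns (characters of text inside <a> elements) / (characters of text in e), or 0 if e has no text. */
    static float linkDensity(Element e) {
        int totalLength = e.getTextContent().length();
        if (totalLength == 0) {
            return 0f; // avoid dividing by zero for empty elements
        }
        int linkLength = 0;
        NodeList anchors = e.getElementsByTagName("a"); // descendant anchors only
        for (int i = 0; i < anchors.getLength(); i++) {
            linkLength += anchors.item(i).getTextContent().length();
        }
        return (float) linkLength / totalLength;
    }

    /** Parses well-formed XML/XHTML into a DOM document (helper for trying the sketch). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception ex) {
            throw new IllegalArgumentException("input is not well-formed XML", ex);
        }
    }
}
```

    A value like this would typically be compared against a threshold such as `LINK_DENSITY_THRESHOLD` to decide whether an element is mostly navigation.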
  - cleanConditionally
```
protected void cleanConditionally(Element e,
                                  String tag)
```
    Clean an element of all tags of type "tag" if they look fishy. "Fishy" is an algorithm based on content length, classnames, link density, number of images & embeds, etc.
  - getCharCount
```
protected int getCharCount(Element e,
                           String s)
```
    Get the number of times a string s appears in the node e.
    
    Parameters:
    
    e - The element to search.
    
    s - The string to split on. Default is ",".
    
    Returns:
    
    number (integer)
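    
    Counting how often a separator occurs in a node's text is a simple heuristic for "does this look like prose". A minimal sketch of the counting step, operating on text that has already been extracted from the element (`CharCountSketch` and `countOccurrences` are hypothetical names):

```java
// Hypothetical sketch of counting non-overlapping occurrences of a string.
final class CharCountSketch {
    /** Returns how many times s occurs in text, scanning left to right without overlaps. */
    static int countOccurrences(String text, String s) {
        if (s.isEmpty()) {
            return 0; // an empty needle would otherwise match everywhere
        }
        int count = 0;
        for (int from = text.indexOf(s); from >= 0; from = text.indexOf(s, from + s.length())) {
            count++;
        }
        return count;
    }
}
```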
  - getCharCount
```
protected int getCharCount(Element e)
```
  - getArticleTitle
```
public String getArticleTitle()
```
    Returns:
    
    The article title
  - getArticleContentType
```
public String getArticleContentType()
```
    Returns:
    
    The content type of the article
  - grabArticle
```
protected Element grabArticle()
```
    grabArticle - Using a variety of metrics (content score, classname, element types), find the content that is most likely to be the stuff a user wants to read. Then return it wrapped up in a div.
    
    Returns:
    
    Element
  - getInnerHTML
```
protected String getInnerHTML(Node n)
```
  - nodeToString
```
protected String nodeToString(Node n)
```
  - nodeToString
```
protected static String nodeToString(Node n,
                                     boolean pretty)
```
  - stringToNode
```
protected Node stringToNode(String str)
```
  - getInnerText
```
protected String getInnerText(Element e,
                              boolean normalizeSpaces)
```
    Get the inner text of a node - cross browser compatibly. This also strips out any excess whitespace to be found.
    
    Parameters:
    
    e - The element.
    
    Returns:
    
    string
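    
    The whitespace handling mentioned above can be sketched on plain strings. The assumption here, borrowed from readability.js, is that the text is always trimmed and that `normalizeSpaces` additionally collapses runs of two or more whitespace characters into a single space; `InnerTextSketch` and `normalize` are hypothetical names.

```java
// Hypothetical sketch of the trimming and space-normalizing step.
final class InnerTextSketch {
    /** Trims text and, if requested, collapses each run of 2+ whitespace characters into one space. */
    static String normalize(String text, boolean normalizeSpaces) {
        String trimmed = text.trim();
        return normalizeSpaces ? trimmed.replaceAll("\\s{2,}", " ") : trimmed;
    }
}
```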
  - getInnerTextSep
```
protected String getInnerTextSep(Node e)
```
  - getInnerText
```
protected String getInnerText(Element e)
```
  - getArticleHTML
```
public String getArticleHTML()
```
    Returns:
    
    The article HTML content as a String.
  - getArticleHTML_DOM
```
public Node getArticleHTML_DOM()
```
    Returns:
    
    The article's HTML DOM node.
  - getArticleDateString
```
protected String getArticleDateString()
```
  - getArticleDate
```
public Date getArticleDate()
```
    Returns:
    
    The article date.
  - getArticleText
```
public String getArticleText()
```
    Returns:
    
    The text of the article.
  - getArticleLinks
```
public List<Anchor> getArticleLinks()
```
    Returns:
    
    Any links in the article.
  - getAllLinks
```
public List<Anchor> getAllLinks()
```
    Returns:
    
    Any links in the document.
  - getArticleImages
```
public List<String> getArticleImages()
```
    Returns:
    
    Any images in the article.
  - getArticleSubheadings
```
public List<String> getArticleSubheadings()
```
    Returns:
    
    Any subheadings in the article.
  - findChildNodesWithName
```
protected List<Node> findChildNodesWithName(Node parent,
                                            String name)
```
  - findChildNodeIndex
```
protected int findChildNodeIndex(Node parent,
                                 Node childToFind)
```
  - getArticleTextMapping
```
protected void getArticleTextMapping(org.w3c.dom.traversal.TreeWalker walker,
                                     List<Readability.MappingNode> map)
                              throws DOMException
```
    Throws:
    
    DOMException
  - getArticleTextMapping
```
public List<Readability.MappingNode> getArticleTextMapping()
```
    Get the mapping between bits of text in the dom & their xpaths
    
    Returns:
    
    mapping from xpath to text
  - getReadability
```
public static Readability getReadability(String html)
                                  throws SAXException,
                                         IOException
```
    Convenience method to build a Readability instance from an html string.
    
    Parameters:
    
    html - The html string
    
    Returns:
    
    new Readability instance.
    
    Throws:
    
    SAXException
    
    IOException
  - getReadability
```
public static Readability getReadability(String html,
                                         boolean addTitle)
                                  throws SAXException,
                                         IOException
```
    Convenience method to build a Readability instance from an html string.
    
    Parameters:
    
    html - The html string
    
    addTitle - Should the title be added to the generated article?
    
    Returns:
    
    new Readability instance.
    
    Throws:
    
    SAXException
    
    IOException
  - main
```
public static void main(String[] argv)
                 throws Exception
```
    Testing
    
    Parameters:
    
    argv -
    
    Throws:
    
    Exception
