
vincensiusadi/durian

 
 

Durian Extractor

Web page extractor and readability parser built on Jsoup. Server-side JavaScript rendering is supported through JBrowserDriver (a Selenium WebDriver implementation).

Prerequisites:

Install

This project is not published to any public Maven repository, so install it into your local repository first:

    mvn clean install

Then add it as a dependency of your project:

    <dependency>
        <groupId>co.mailtarget</groupId>
        <artifactId>durian</artifactId>
        <version>0.0.2-SNAPSHOT</version>
    </dependency>

Usage

### Kotlin

    val extractor = WebExtractor.Builder()
                    .strategy(Strategy.HYBRID)
                    .build()

    val webData = extractor.extract(url)

or

    val forceJavascript = false
    val webData = extractor.extract(url, forceJavascript)

### Java

    WebExtractor extractor = new WebExtractor.Builder()
                    .strategy(Strategy.HYBRID)
                    .build();
    WebData webData = extractor.extract(url);

or

    boolean forceJavascript = false;
    WebData webData = extractor.extract(url, forceJavascript);
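
One way to use the second argument is to try the cheaper static fetch first and re-run with JavaScript rendering only when nothing usable comes back. The sketch below is an illustration, not part of Durian: `RenderFallbackSketch` and `extractWithFallback` are hypothetical names, the `BiFunction` stands in for a call such as `extractor.extract(url, forceJavascript)`, and it assumes that passing `true` forces rendering through JBrowserDriver.

```java
import java.util.function.BiFunction;

// Hypothetical helper, not part of Durian: try the static fetch first and
// re-run with JavaScript rendering only if the first attempt found nothing.
final class RenderFallbackSketch {

    // `extract` stands in for a call such as extractor.extract(url, forceJavascript).
    // It is assumed to return the extracted text, or null/blank when nothing was found.
    static String extractWithFallback(String url,
                                      BiFunction<String, Boolean, String> extract) {
        String result = extract.apply(url, false);
        if (result != null && !result.trim().isEmpty()) {
            return result; // static HTML was enough
        }
        // Assumption: `true` forces the slower, rendered extraction path.
        return extract.apply(url, true);
    }
}
```

Since rendering a page in a headless browser is much slower than parsing static HTML, a fallback like this keeps the expensive path for pages that actually need it.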

Options

### Extract Strategy

  • META: the fastest method; parses values from the page's meta tags only
  • CONTENT: prefers the page content as the source of extraction
  • HYBRID: reads from the meta tags first and, if a value is not found, searches deeper in the content
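
The three strategies differ mainly in which source they consult and in what order. The following is a minimal sketch of that selection logic, not Durian's actual implementation: the `StrategySketch` class and its `pick` method are hypothetical, plain maps stand in for values parsed from meta tags and from the page body, and the fallback from content to meta under `CONTENT` is an assumption read from the word "prefer".

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration of META / CONTENT / HYBRID selection.
// Plain maps stand in for values parsed from <meta> tags and from the page body.
final class StrategySketch {

    enum Strategy { META, CONTENT, HYBRID }

    // Returns the value for `field` according to the chosen strategy.
    static Optional<String> pick(Strategy strategy, String field,
                                 Map<String, String> meta,
                                 Map<String, String> content) {
        Optional<String> fromMeta = nonBlank(meta.get(field));
        Optional<String> fromContent = nonBlank(content.get(field));
        switch (strategy) {
            case META:
                // meta tags only, no deeper search
                return fromMeta;
            case CONTENT:
                // assumption: content first, meta as a fallback
                return fromContent.isPresent() ? fromContent : fromMeta;
            default:
                // HYBRID: meta first, content as a fallback
                return fromMeta.isPresent() ? fromMeta : fromContent;
        }
    }

    // Treats null and whitespace-only strings as "not found".
    private static Optional<String> nonBlank(String value) {
        if (value == null || value.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(value.trim());
    }
}
```

With a page whose meta tags carry a title but no author, `HYBRID` would return the meta title and take the author from the body, while `META` would report the author as missing.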

### System Config

Tested on macOS, where it works well. On CentOS, install the following packages first:

    yum groupinstall -y "Fonts"
    yum install gtk2

Optional packages: gtkhtml3 libXtst libxslt alsa-lib
