13 changes: 13 additions & 0 deletions gutenberg/README.md
@@ -0,0 +1,13 @@
## Plugin information

[Project Gutenberg](https://www.gutenberg.org/) is an open initiative that offers free ebooks of works that are no longer protected by copyright. It was created to encourage the creation and distribution of ebooks, and it contains most major works of literature in the public domain.

This plugin lets you retrieve the complete content of books directly as Dataiku datasets, with one record per line of the book. It is a great way to get started with Natural Language Processing (NLP), i.e. processing human-written text.
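Once a dataset is built, you can read it with the Dataiku Python API. A minimal sketch, assuming the dataset was named `gutenberg_books` (the name is a placeholder):

```python
import dataiku

# Read the plugin-built dataset: one record per line of the book
books = dataiku.Dataset("gutenberg_books")
df = books.get_dataframe()

print(df.head())
print("Number of lines:", len(df))
```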

## How to set up

Right after installing the plugin, you will need to build its code environment. This plugin supports Python 2.7, 3.5, 3.6, and 3.7.

## How to use

We have a great [tutorial](https://academy.dataiku.com/natural-language-processing-with-visual-tools-open) that uses this plugin to get you started on your first NLP predictive model: a service that automatically recognizes writings by Mark Twain and Charles Dickens.
33 changes: 29 additions & 4 deletions import-io/README.md
@@ -1,8 +1,33 @@
# Plugin Information

[import.io](https://www.import.io/) lets users automatically turn web pages into data, thanks to its powerful and easy-to-use scraping and parsing technology.

This plugin offers advanced connectivity to import.io scrapers. With it, you can easily retrieve data hidden in web pages, or enrich existing datasets with external web data.

The import.io plugin can:

- Retrieve data from a single import.io API, using the **dataset**
- Bulk-enrich a dataset containing URLs, repeatedly getting data from an import.io extractor on each URL, using the **recipes**

The import.io plugin offers this connectivity through three different components:

#### Dataset for single API

The **Import.io dataset** is the simplest integration. It calls the import.io API once and populates a dataset with the results.

Use this to fetch structured data from a single page.

Start by defining your extractor in import.io, then create the dataset and paste the import.io API URL into the dataset configuration.
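For illustration, here is roughly what that single call looks like. This is a sketch, not the plugin's code: the endpoint and parameter names follow import.io's documented Extraction API, and `EXTRACTOR_ID`, `API_KEY` and the page URL are placeholders.

```python
import requests

EXTRACTOR_ID = "your-extractor-id"
API_KEY = "your-api-key"

resp = requests.get(
    "https://extraction.import.io/query/extractor/{}".format(EXTRACTOR_ID),
    params={"_apikey": API_KEY, "url": "https://example.com/some-page"},
)
resp.raise_for_status()
print(resp.json())  # structured rows extracted from the page
```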

#### Recipes for bulk enrich

The enrichment recipes enrich a dataset: for each row of the input dataset, the recipe reads the URL in a given column, calls import.io’s API with it, and writes the results to the output dataset. This way of repeatedly calling the API to retrieve data is sometimes called “Bulk extract” or “Chain API” on the import.io website. A sketch of this per-row pattern is shown below.

Start by defining your extractor on one example page in import.io, then create the recipe.

A great way to use this is together with the **editable datasets** in Dataiku.
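A minimal sketch of the per-row pattern, written as a Dataiku Python recipe. The dataset names (`pages`, `pages_enriched`), the `url` column, and the `call_extractor()` helper are hypothetical placeholders:

```python
import dataiku

def call_extractor(url):
    # Placeholder for one import.io API call (see the sketch above);
    # it should return the fields extracted from the page.
    return {}

input_df = dataiku.Dataset("pages").get_dataframe()

# One API call per row, on the URL stored in the "url" column
input_df["extracted"] = input_df["url"].apply(call_extractor)

dataiku.Dataset("pages_enriched").write_with_schema(input_df)
```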

The “Connector” recipe is also used for bulk enrichment. To get new data in Import.io, one can choose between “Magic”, “Extractor”, “Crawler”, or the more advanced “Connector”; this recipe lets you query an API created with the latter.

# Changelog

12 changes: 12 additions & 0 deletions ip-range-matcher/README.md
@@ -0,0 +1,12 @@
# Plugin Information

IP addresses are identifiers used for almost all network connections.

This plugin provides preparation processors for checking whether IPv4 addresses belong to specific network ranges.

You need to install the plugin and then restart Dataiku.

You will then be able to filter and flag IPv4 addresses based on the ranges you enter. These can be either (see the sketch after this list):

- **“Normal” ranges**: IP – IP (e.g. “192.168.0.1 – 192.168.255.255”)
- **CIDR ranges** (e.g. 10.0.0.0/8)
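The two range types amount to the following checks, shown here with Python’s standard `ipaddress` module (a sketch of equivalent logic, not the plugin’s own code):

```python
import ipaddress

ip = ipaddress.ip_address("192.168.10.7")

# CIDR range, e.g. 192.168.0.0/16
print(ip in ipaddress.ip_network("192.168.0.0/16"))  # True

# "Normal" range, e.g. 192.168.0.1 - 192.168.255.255
low = ipaddress.ip_address("192.168.0.1")
high = ipaddress.ip_address("192.168.255.255")
print(low <= ip <= high)  # True
```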
7 changes: 7 additions & 0 deletions musixmatch/README.md
@@ -2,6 +2,13 @@

This plugin is a recipe that uses the Musixmatch API to retrieve artists’ tracks from artist identifiers.

## How To Use

To configure the recipe, you’ll need:

- A dataset with Musixmatch artist identifiers;
- A Musixmatch API key (see the [Musixmatch developers website](https://developer.musixmatch.com/)). An illustrative API call is sketched below.
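For illustration, the kind of call the recipe performs can be sketched with the `track.search` method of the public Musixmatch API. Treat the endpoint, parameters, and response fields as assumptions based on the public documentation; `API_KEY` and the artist identifier are placeholders.

```python
import requests

API_KEY = "your-musixmatch-api-key"
artist_id = 118  # placeholder Musixmatch artist identifier

resp = requests.get(
    "https://api.musixmatch.com/ws/1.1/track.search",
    params={"apikey": API_KEY, "f_artist_id": artist_id, "page_size": 10},
)
resp.raise_for_status()

# Response layout per the public docs: message -> body -> track_list
for item in resp.json()["message"]["body"]["track_list"]:
    print(item["track"]["track_name"])
```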

## Author

This plugin was written by ICTeam (Luca Grazioli).
12 changes: 10 additions & 2 deletions uspto-patents/README.md
@@ -1,5 +1,13 @@
# US Patents dataset

This plugin provides a mechanism to download the US patents datasets from Google, as described [here](https://cloud.google.com/blog/products/gcp/google-patents-public-datasets-connecting-public-paid-and-private-patent-data), as XML files.

This connector retrieves that data.

Since these XML files are not well-formed, this connector provides built-in cleansing and parsing. The resulting dataset contains one “patent” column (JSON) containing the patent metadata, abstract, description, and claims.
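The bulk files are typically a concatenation of several XML documents in a single file, which is why standard parsers reject them. A minimal sketch of the kind of cleansing involved, assuming each document starts with its own `<?xml` declaration (the file name is a placeholder, and real files may also need DTD/entity handling):

```python
import xml.etree.ElementTree as ET

def iter_patent_documents(path):
    # Split a concatenated bulk file into individual XML documents.
    # For full-year files (multi-GB), a streaming split would be needed.
    with open(path, encoding="utf-8", errors="replace") as f:
        content = f.read()
    for chunk in content.split("<?xml")[1:]:
        yield ET.fromstring("<?xml" + chunk)

for doc in iter_patent_documents("ipg050104.xml"):
    print(doc.tag)
```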

The connector maintains a local folder cache to simplify your developments. You can choose to retrieve the whole patent database (beware, it is large: about 40 GB) or any single year between 2005 and 2015.

The user can choose a partitioning strategy before building the dataset.

You need to install the dependencies of the plugin: go to the **Administration > Plugins** page to get the command line to install them.