From 153a52c704d0348cd64ff47190b96e1f4ca15f2f Mon Sep 17 00:00:00 2001
From: Sevan Garois
Date: Thu, 8 Jan 2026 17:23:06 +0100
Subject: [PATCH 1/2] add doc for plugin from website

---
 gutenberg/README.md        | 13 +++++++++++++
 import-io/README.md        | 33 +++++++++++++++++++++++++++++----
 ip-range-matcher/README.md | 12 ++++++++++++
 musixmatch/README.md       |  7 +++++++
 uspto-patents/README.md    | 12 ++++++++++--
 5 files changed, 71 insertions(+), 6 deletions(-)
 create mode 100644 gutenberg/README.md
 create mode 100644 ip-range-matcher/README.md

diff --git a/gutenberg/README.md b/gutenberg/README.md
new file mode 100644
index 00000000..324d8472
--- /dev/null
+++ b/gutenberg/README.md
@@ -0,0 +1,13 @@
+## Plugin information
+
+[Project Gutenberg](https://www.gutenberg.org/) is a volunteer initiative that offers free ebooks of works no longer protected by copyright. It was created to encourage the creation and distribution of ebooks, and its library covers most major works of classic literature.
+
+This plugin lets you retrieve the complete content of books directly as Dataiku datasets, with one record per line of the book. It is a great way to get started with Natural Language Processing (NLP), i.e. the processing of human-written text.
+
+## How to set up
+
+Right after installing the plugin, you will need to build its code environment. This plugin supports Python 2.7, 3.5, 3.6, and 3.7.
+
+## How to use
+
+We have a great [tutorial](https://academy.dataiku.com/natural-language-processing-with-visual-tools-open) that uses this plugin to get you started on your first NLP predictive model: a service that automatically recognizes writings by Mark Twain and Charles Dickens.
\ No newline at end of file
diff --git a/import-io/README.md b/import-io/README.md
index a5d9a350..024b51f3 100644
--- a/import-io/README.md
+++ b/import-io/README.md
@@ -1,8 +1,33 @@
-This plugin offers connectivity to import.io thanks to:
+# Plugin Information
 
-* the **dataset** “Import.io dataset” (it calls import.io once and populates a dataset with the results),
-* the **recipe** “Extractor / Magic”. This recipe enriches a dataset: for each row of the input dataset, this recipe reads the URL in a given column, calls import.io's API with it, and writes the results to the output dataset. This way of repeatedly calling the API to retrieve data is sometimes called “Bulk extract” or “Chain API” on import.io website.
-* the **recipe** “Connector”. Indeed, in Import.io, to get new data one has the choice between “Magic”, “Extractor”, “Crawler” or the more advanced “Connector”. This recipe allows to connect to the last one.
+
+[import.io](https://www.import.io/) lets users automatically turn web pages into data, thanks to its powerful and easy-to-use scraping and parsing technology.
+
+This plugin offers advanced connectivity to import.io extractors. With it, you can easily retrieve data hidden in web pages, or enrich existing datasets with external web data.
+
+The import.io plugin can:
+
+- Retrieve data from a single import.io API, using the **dataset**
+- Bulk-enrich a dataset containing URLs, repeatedly getting data from an import.io extractor on each URL, using the **recipes**
+
+The plugin offers this connectivity through three components:
+
+#### Dataset for single API
+
+The **Import.io dataset** is the simplest integration. It calls the import.io API once and populates a dataset with the results.
+
+Use this to fetch structured data from a single page.
+
+Start by defining your extractor in import.io, then create the dataset and paste the import.io API URL into the dataset configuration.
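+
+Under the hood, this is a single HTTP call to the extractor's API URL. The sketch below illustrates the idea with Python's `requests` library; the endpoint shape, extractor ID, and API key are placeholder assumptions, not values taken from the plugin:
+
+```python
+import requests
+
+# Placeholder URL -- copy the real API URL from your import.io account.
+API_URL = "https://extraction.import.io/query/extractor/EXTRACTOR_ID"
+
+response = requests.get(
+    API_URL,
+    params={"_apikey": "YOUR_API_KEY", "url": "https://example.com/page-to-scrape"},
+)
+response.raise_for_status()
+
+# Print the raw JSON payload that the dataset component turns into rows.
+print(response.json())
+```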
+
+#### Recipes for bulk enrichment
+
+The enrichment recipes enrich an existing dataset: for each row of the input dataset, the recipe reads the URL in a given column, calls import.io's API with it, and writes the results to the output dataset. This way of repeatedly calling the API to retrieve data is sometimes called “Bulk extract” or “Chain API” on the import.io website.
+
+Start by defining your extractor on one example page in import.io, then create the recipe.
+
+A great way to use this is together with **editable datasets** in Dataiku.
+
+The “Connector” recipe is also used for bulk enrichment. To get new data in Import.io, you can choose between “Magic”, “Extractor”, “Crawler” or the more advanced “Connector”; this recipe lets you query an API created with the latter.
 
 # Changelog

diff --git a/ip-range-matcher/README.md b/ip-range-matcher/README.md
new file mode 100644
index 00000000..86eea5c8
--- /dev/null
+++ b/ip-range-matcher/README.md
@@ -0,0 +1,12 @@
+# Plugin Information
+
+IP addresses are identifiers used for almost all network connections.
+
+This plugin provides preparation processors for checking whether IPv4 addresses belong to specific network ranges.
+
+You need to install the plugin and then restart DSS.
+
+You will then be able to filter and flag IPv4 addresses based on the ranges you entered. These can be either:
+
+- **“Normal” ranges**: IP – IP (e.g. “192.168.0.1 – 192.168.255.255”)
+- **CIDR ranges** (e.g. 10.0.0.0/8)
diff --git a/musixmatch/README.md b/musixmatch/README.md
index 4eebc26b..83ce6d18 100644
--- a/musixmatch/README.md
+++ b/musixmatch/README.md
@@ -2,6 +2,13 @@
 
 This plugin is a recipe that uses the Musixmatch API to retrieve artists tracks from artist identifiers
 
+## How To Use
+
+To configure the recipe, you’ll need:
+
+- A dataset containing Musixmatch artist identifiers;
+- A Musixmatch API key (see the [Musixmatch developer website](https://developer.musixmatch.com/))
+
 ## Author
 
 This plugin was written by ICTeam (Luca Grazioli)
diff --git a/uspto-patents/README.md b/uspto-patents/README.md
index 91c26b42..87ab7b48 100644
--- a/uspto-patents/README.md
+++ b/uspto-patents/README.md
@@ -1,5 +1,13 @@
 # US Patents dataset
 
-This plugin provides a mechanism to download the US patents datasets from Google, as described [here](https://cloud.google.com/blog/products/gcp/google-patents-public-datasets-connecting-public-paid-and-private-patent-data).
+This plugin provides a mechanism to download the US patents dataset from Google as XML files, as described [here](https://cloud.google.com/blog/products/gcp/google-patents-public-datasets-connecting-public-paid-and-private-patent-data).
 
-The user can choose a partitioning strategy before building the dataset.
\ No newline at end of file
+Since these XML files are not well-formed, the connector provides built-in cleansing and parsing. The resulting dataset contains a single “patent” column (JSON) holding each patent’s metadata, abstract, description, and claims (see the sketch at the end of this README).
+
+The connector provides a local folder cache to speed up your development. You can choose to retrieve the whole patent database (beware, it is large: about 40 GB) or any single year between 2005 and 2015.
+
+You can choose a partitioning strategy before building the dataset.
+
+You need to install the dependencies of the plugin. Go to the **Administration > Plugins** page to get the command line for installing them.
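+
+Once the dataset is built, the “patent” column can be unpacked with a few lines of Python. This is only a sketch: the dataset name `uspto_patents` is a placeholder, and the real JSON fields should be checked by inspecting one record.
+
+```python
+import json
+
+import dataiku
+
+# Hypothetical dataset name -- use the name you gave the dataset in your project.
+df = dataiku.Dataset("uspto_patents").get_dataframe()
+
+# Each row stores one patent as a JSON document in the "patent" column.
+patents = df["patent"].apply(json.loads)
+
+# List the top-level fields (metadata, abstract, description, claims, ...).
+print(sorted(patents.iloc[0].keys()))
+```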
From d61703d44277215a6fba119e93fe54a38ea120f8 Mon Sep 17 00:00:00 2001
From: Sevan <35869501+sevanga@users.noreply.github.com>
Date: Fri, 9 Jan 2026 15:45:04 +0100
Subject: [PATCH 2/2] Update ip-range-matcher/README.md

Co-authored-by: Nicolas Courazier
---
 ip-range-matcher/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ip-range-matcher/README.md b/ip-range-matcher/README.md
index 86eea5c8..9d7bb956 100644
--- a/ip-range-matcher/README.md
+++ b/ip-range-matcher/README.md
@@ -4,7 +4,7 @@ IP addresses are identifiers used for almost all network connections.
 
 This plugin provides preparation processors for checking whether IPv4 addresses belong to specific network ranges.
 
-You need to install the plugin and then restart DSS.
+You need to install the plugin and then restart Dataiku.
 
 You will then be able to filter and flag IPv4 addresses based on the ranges you entered. These can be either:
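
For reference, both range notations described in the ip-range-matcher README can be checked with Python's standard `ipaddress` module. This is only an illustrative sketch of the matching semantics, not the plugin's actual implementation:

```python
import ipaddress

ip = ipaddress.ip_address("192.168.42.7")

# CIDR range, e.g. 10.0.0.0/8
in_cidr = ip in ipaddress.ip_network("10.0.0.0/8")

# "Normal" range, e.g. 192.168.0.1 - 192.168.255.255, checked by
# comparing the address against its lower and upper bounds.
low = ipaddress.ip_address("192.168.0.1")
high = ipaddress.ip_address("192.168.255.255")
in_range = low <= ip <= high

print(in_cidr, in_range)  # False True
```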