
Introduction

johans edited this page May 29, 2018 · 3 revisions

     Pider aims to be an elegant and useful spider framework written in PHP.

Motivation

     PHP is a good language for web programming. It has many web frameworks, but far fewer frameworks for scraping or data processing. I believe that PHP, like Python, can do more than serve the web. So I want to create a scraping and data-processing framework that incorporates crawling, data cleaning, data analysis, and data visualization.

Features

  • Templatize

    Pider allows you to write a spider and manage its life cycle by customizing just a template.

  • Command Line

    The Pider framework provides many command-line tools to manage spiders and scraped data.

  • Multiple Process

    A single process is too slow when you want to scrape an enormous number of pages. So Pider supplies a multi-process module that lets you request and extract data at the same time. This feature can remarkably shorten the runtime of scrapes that cover a large number of pages.

  • Group

   Sometimes a scrape task has to request more than one page first and can process the data scraped from those pages only after all requests are done. The Group feature bundles different requests into a group, and the responses to those requests are bundled as well. You can then process these responses together easily.

  • Data clean

   Most data that we pull from web pages is half-baked for one reason or another. So we often have to do a lot of work to clean, reorganize, or complement the original data. The Pider framework comes with a data-cleaning model, the ActiveCarbon Model, that can free you from cumbersome data-cleaning tasks.
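
The idea behind the Group feature described above can be sketched in plain PHP. This is only an illustration of the concept, not Pider's actual API: the class `RequestGroup`, its methods `add` and `run`, and the stand-in fetchers are all hypothetical names invented for this example, and the fetchers return fixed strings instead of making real HTTP requests.

```php
<?php
// Hypothetical illustration of the "group" idea; this is NOT Pider's API.
final class RequestGroup
{
    /** @var array<string, callable> name => fetcher returning a response body */
    private array $fetchers = [];

    // Register a named request. The fetcher stands in for an HTTP call.
    public function add(string $name, callable $fetcher): void
    {
        $this->fetchers[$name] = $fetcher;
    }

    // Run every request first, then hand all responses to the handler at once.
    public function run(callable $handler)
    {
        $responses = [];
        foreach ($this->fetchers as $name => $fetcher) {
            $responses[$name] = $fetcher();
        }
        return $handler($responses);
    }
}

$group = new RequestGroup();
// Stand-in "pages": fixed HTML snippets instead of real network responses.
$group->add('list', fn(): string => '<li>item 1</li><li>item 2</li>');
$group->add('detail', fn(): string => '<h1>item 1</h1>');

// The handler sees the bundled responses only after all requests are done.
$summary = $group->run(function (array $responses): array {
    return [
        'pages' => count($responses),
        'items' => substr_count($responses['list'], '<li>'),
    ];
});

print_r($summary);
```

The point of the sketch is the ordering: every request in the group completes before the handler runs, so the handler can combine data from several pages in one place.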
