Data Process

johans edited this page May 30, 2018 · 17 revisions

     Pider offers an ETL model, ActivedCarbon, for data processing. It is inspired by the Item, ItemLoader and ItemPipeline design in Scrapy, but ships with more precise controls.

Synopsis

    The following diagram describes the essential parts of ActivedCarbon and outlines the data flow through it (the flow is indexed by arrows).

ActivedCarbon-Model

Feature

    As the compact diagram above shows, several Pores are attached to an ActivedCarbon, which acts as the single data-cleaning unit. Likewise, a Pore is divided into three parts, Absorber, Filter and Reaction, each of which acts as a procedure of the Transformation step in ETL. Furthermore, either a Rule or a Throttle is attached to each procedure for filtering.

    To make this easier to follow, we work through a full example. First, suppose that some book data has been collected from the internet.

$books = [
    [ 
        'name'=>'Harry Potter and the Sorcerer\'s Stone',
        'author'=> 'Jack Thorne, John Tiffany, J.K. Rowling',
        'price' => '177.40 RMB'
    ],
    [
        'name'=>'Harry Potter and the Cursed Child – Parts I & II',
        'author'=> 'Jack Thorne, John Tiffany, J.K. Rowling',
        'price' => '169.00 RMB'
    ],
    [
        'name'=>'Disney Princess Snow White',
        'author'=>'Kathryn Harper',
        'price' => '62.00 RMB'
    ],
    [
        'name'=>'Shortest History of Europe',
        'author'=>'John Hirst',
        'price'=> '119.00 RMB'
    ]
];

    As we can see, there are several pieces of book information, and each has three properties: name, author and price. Our task now is to process them.

Rule

     Rule defines a set of basic rules held by procedures. A Rule accepts a callback function as its parameter, and the callback must return a bool value.

  • Format
$rule = new Rule(function() {
        //the rule you define; a bool must be returned
        });
  • Create a Rule
use Pider\Prepost\Data\Rule;

$BookNameRule = new Rule(function($book) {
        if ($book['name'] == 'Harry Potter and the Sorcerer\'s Stone') {
           return false;
        }
        return true;
     });

     The code above defines a Rule that accepts all book data except the book named Harry Potter and the Sorcerer's Stone.
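
     To see which records such a rule lets through, the callback can be tried on its own, outside Pider. The sketch below is plain PHP, not Pider's API: it assumes only that a Rule calls its callback once per record, and uses array_filter as a stand-in for that. The names $bookNameCheck and $sampleBooks are hypothetical.

```php
<?php
// Stand-in for the Rule callback: a plain closure, no Pider classes involved.
$bookNameCheck = function (array $book): bool {
    return $book['name'] !== 'Harry Potter and the Sorcerer\'s Stone';
};

// Hypothetical sample records, shortened from the book data above.
$sampleBooks = [
    ['name' => 'Harry Potter and the Sorcerer\'s Stone', 'price' => '177.40 RMB'],
    ['name' => 'Disney Princess Snow White', 'price' => '62.00 RMB'],
    ['name' => 'Shortest History of Europe', 'price' => '119.00 RMB'],
];

// array_filter keeps the records for which the callback returns true.
$accepted = array_values(array_filter($sampleBooks, $bookNameCheck));
$acceptedNames = array_column($accepted, 'name');
print_r($acceptedNames);
```

     Only the two books whose name differs from the excluded title remain in $acceptedNames.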

Throttle

     Throttle is similar to Rule. The only difference is that the callback accepted by a Throttle must return an array rather than a bool. Its purpose is to select the specific properties to be processed.

  • Format
$throttle = new Throttle(function() {
        //the rule you define; an array must be returned
        });

  • Create a Throttle
use Pider\Prepost\Data\Throttle;

$BookNameThrottle = new Throttle(function($book) {
        //the parameter is $book, so the body must use $book as well
        if ($book['name'] == 'Harry Potter and the Sorcerer\'s Stone') {
            return [];
        }
        return ['name'=> $book['name']];
});

     In the code above, BookNameThrottle accepts all book data except the book named Harry Potter and the Sorcerer's Stone, and returns a subarray that contains only the name property.
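
     The shape of a throttle callback's output can be checked without Pider. This is a minimal sketch in plain PHP, assuming only that the callback is called once per record; array_map stands in for that, and $bookNameSelect and $sampleBooks are hypothetical names.

```php
<?php
// Stand-in for the Throttle callback: returns the properties to process,
// or an empty array when the record should not be processed at all.
$bookNameSelect = function (array $book): array {
    if ($book['name'] === 'Harry Potter and the Sorcerer\'s Stone') {
        return [];
    }
    return ['name' => $book['name']];
};

// Hypothetical sample records, shortened from the book data above.
$sampleBooks = [
    ['name' => 'Harry Potter and the Sorcerer\'s Stone', 'price' => '177.40 RMB'],
    ['name' => 'Disney Princess Snow White', 'price' => '62.00 RMB'],
];

// Apply the callback to every record and inspect the resulting subarrays.
$selected = array_map($bookNameSelect, $sampleBooks);
print_r($selected);
```

     The first record yields an empty array, while the second yields a subarray holding only its name; the price property is dropped in both cases.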

Absorber

     Absorber provides a data-sinking operation. Information that does not match your requirements, and would otherwise be discarded, is occasionally still essential for analysis. You may want to filter and collect such data and store it somewhere, rather than simply throwing it away.

  • Format
use Pider\Prepost\Data\Absorber;
use Pider\Prepost\Data\Pore;

//define as a normal class
class BookNameAbsorber extends Absorber {
    public function absorb(array $data, Pore $pore): array {
        //your logic; an array must be returned
    }
}
//use as a normal class; $rule is a Rule instance
$BookNameAbsorber = new BookNameAbsorber($rule);

//define and use as an anonymous class; $rule is a Rule instance
$BookNameAbsorber = new class($rule) extends Absorber {
    public function absorb(array $data, Pore $pore): array {
        //your logic; an array must be returned
    }
};
  • Create an Absorber
use Pider\Prepost\Data\Absorber;
use Pider\Prepost\Data\Pore;
use Pider\Prepost\Data\Rule;

//the rule matches only the book that should be absorbed
$BookNameRule = new Rule(function($book) {
        if ($book['name'] == 'Harry Potter and the Sorcerer\'s Stone') {
           return true;
        }
        return false;
     });
//define as a normal class
class BookNameAbsorber extends Absorber {
    public function absorb(array $data, Pore $pore): array {
        //wrap the JSON string in an array to satisfy the declared return type
        return [json_encode($data)];
    }
}
$BookNameAbsorber = new BookNameAbsorber($BookNameRule);
//define and use as an anonymous class
$BookNameAbsorber = new class($BookNameRule) extends Absorber {
    public function absorb(array $data, Pore $pore): array {
        return [json_encode($data)];
    }
};

     The BookNameAbsorber created above will store the info of the book named Harry Potter and the Sorcerer's Stone in JSON format.

Reaction

     Reaction is used for transforming information. Data transformation is very common during data processing: large chunks of data that do not match the expected standard inevitably have to be converted.

  • Format
use Pider\Prepost\Data\Reaction;
use Pider\Prepost\Data\Pore;

//define as a normal class
class BookNameReaction extends Reaction {
    public function react(array $data, Pore $pore): array {
        //the transformation you define
    }
}
//define as an anonymous class; $throttle is a Throttle instance

$BookNameReaction = new class($throttle) extends Reaction {
    public function react(array $data, Pore $pore): array {
       //the transformation you define
    }
};
  • Create a Reaction
use Pider\Prepost\Data\Pore;
use Pider\Prepost\Data\Reaction;
use Pider\Prepost\Data\Throttle;

$BookNameThrottle = new Throttle(function($book) {
        if ($book['name'] == 'Harry Potter and the Sorcerer\'s Stone') {
            return [];
        }
        return ['name'=> $book['name']];
});
//use as a normal class
class BookNameReaction extends Reaction {
    public function react(array $data, Pore $pore): array {
        return ['NAME'=> strtoupper($data['name'])];
    }
}
$BookNameReaction = new BookNameReaction($BookNameThrottle);

//use as an anonymous class
$BookNameReaction = new class($BookNameThrottle) extends Reaction {
    public function react(array $data, Pore $pore): array {
        return ['NAME'=> strtoupper($data['name'])];
    }
};

     BookNameReaction upper-cases the book name. Because BookNameThrottle returns an empty array for Harry Potter and the Sorcerer's Stone, only the names of the other books are passed on for conversion.
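
     How a throttle and a transformation combine can be sketched in plain PHP. This is not Pider's implementation: it assumes that a record for which the throttle returns an empty array is not handed to the transformation at all, and it says nothing about how Pider merges the result back into the record. $bookNameSelect, $toUpperName and $sampleBooks are hypothetical names.

```php
<?php
// Stand-in for the Throttle callback used in this section.
$bookNameSelect = function (array $book): array {
    if ($book['name'] === 'Harry Potter and the Sorcerer\'s Stone') {
        return [];
    }
    return ['name' => $book['name']];
};

// Stand-in for the body of react(): upper-case the selected name.
$toUpperName = function (array $data): array {
    return ['NAME' => strtoupper($data['name'])];
};

// Hypothetical sample records, shortened from the book data above.
$sampleBooks = [
    ['name' => 'Harry Potter and the Sorcerer\'s Stone', 'price' => '177.40 RMB'],
    ['name' => 'Disney Princess Snow White', 'price' => '62.00 RMB'],
];

$results = [];
foreach ($sampleBooks as $book) {
    $selected = $bookNameSelect($book);
    if ($selected === []) {
        continue; // assumption: nothing selected means nothing to transform
    }
    $results[] = $toUpperName($selected);
}
print_r($results);
```

     Under these assumptions $results holds a single entry, the upper-cased name of the second book.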

Filter

     Filter works much like Absorber, except that it does not keep the data it sifts out.

  • Format
use Pider\Prepost\Data\Filter;
use Pider\Prepost\Data\Pore;

//use as a normal class
class BookNameFilter extends Filter {
    public function filter(array $data, Pore $pore): bool {
        //filter operation performed
    }
}

//use as an anonymous class; $rule is either a Throttle or a Rule instance
$BookNameFilter = new class($rule) extends Filter {
    public function filter(array $data, Pore $pore): bool {
       //filter operation performed
    }
};

     The actual work is done by the filter() method of the Filter class. filter() accepts the data that has passed through the Rule (or Throttle) and a Pore instance as parameters, and returns a bool.

  • Create a Filter
use Pider\Prepost\Data\Throttle;
use Pider\Prepost\Data\Filter;
use Pider\Prepost\Data\Pore;

$BookNameThrottle = new Throttle(function($book) {
        if ($book['name'] == 'Harry Potter and the Sorcerer\'s Stone') {
            return [];
        }
        return ['name'=> $book['name']];
});

class BookNameFilter extends Filter {
    public function filter(array $data, Pore $pore): bool {
        //an empty array means the throttle rejected the book
        if (empty($data)) {
            return false;
        }
        return true;
    }
}
$BookNameFilter = new BookNameFilter($BookNameThrottle);

     As defined in BookNameFilter, the book Harry Potter and the Sorcerer's Stone will be ignored in the following processes.

Pore

     A Pore can be considered a collection of Actions, each of which is an Absorber, a Reaction or a Filter.

  • Format
     All Actions must be defined inside the Pore.selfFeatures() method and returned as an associative array.
class BookNamePore extends Pore {
    public function selfFeatures():array {
        //actions defined
        ...
        ...
        //an associative array must be returned
        return [
            'absorber'=> [$absorber1,$absorber2], 
            'reaction'=> [$reaction1,$reaction2],
            'filter'=>   [$filter1,$filter2]
        ];
    }
}
  • Create a Pore
use Pider\Prepost\Data\Pore;
use Pider\Prepost\Data\Rule;
use Pider\Prepost\Data\Throttle;
use Pider\Prepost\Data\Reaction;
use Pider\Prepost\Data\Filter;

class BookNamePore extends Pore {
    protected function selfFeatures():array {
        $BookNameThrottle = new Throttle(function($book) {
                if ($book['name'] == 'Harry Potter and the Sorcerer\'s Stone') {
                    return [];
                }
                return ['name'=> $book['name']];
        });
        $BookNameReaction = new class($BookNameThrottle) extends Reaction {
            public function react(array $data, Pore $pore): array {
                return ['NAME'=> strtoupper($data['name'])];
            }
        };
        //this pore only defines a reaction; absorber and filter stay empty
        return ['reaction'=> [$BookNameReaction],'absorber'=> [], 'filter'=>[]];
    }
}

ActivedCarbon

     ActivedCarbon is a container of Pores.

  • Format
     All Pores should be set in ActivedCarbon.selfPores() and returned as an array of Pore instances.
use Pider\Prepost\Data\Pore;
use Pider\Prepost\Data\ActivedCarbon;

class BookActivedCarbon extends ActivedCarbon {
    /**
     * Define several pores 
     */
    protected function selfPores():array {
        $pores = [
            new Pore1(),
            new Pore2(),
            new Pore3()
        ];
        //a pore array must be returned
        return $pores;
    }
}
  • Create an ActivedCarbon
use Pider\Prepost\Data\Pore;
use Pider\Prepost\Data\ActivedCarbon;

class BookActivedCarbon extends ActivedCarbon {
    /**
     * Define several pores 
     */
    protected function selfPores():array {
        $pores = [
            new BookNamePore(),
            new BookAuthorPore(),
            new BookPricePore()
        ];
        return $pores;
    }
}

Used with Pider

     Pider allows users to extend the framework by defining their own components. All components are located in ProjectRoot/Components. You can integrate your tailored ActivedCarbon in this way. For example:

  • Create BookActivedCarbon component
Component/
`-- Preprocess
    `-- Book
        |-- BookNamePore.php
        |-- BookAuthorPore.php
        |-- BookPricePore.php
        `-- BookActivedCarbon.php
  • Use it with Pider
touch examples/BookSpider.php
use Pider\Spider;
use Pider\Http\Request;
use Pider\Http\Response;
use Preprocess\Book\BookActivedCarbon;

class BookSpider extends Spider {
        
    public function parse(Response $response) {
        //your crawler code
        ...
        $books = ...; // the data you crawled
        $book_clean = (new BookActivedCarbon($books))();
        var_dump($book_clean);
    }
}
  • Run it
../pider examples/BookSpider;
