my gift to the php multi-curl data scrapers
https://sergeyzhuk.me/assets/images/posts/fast-webscraping-reactphp/logo.jpg

Hi,
my name is Joe (a.k.a. @rxhector on twitter/steemit),

and this is the result of 15 years of data-scraping using php w/multi-curl.

I have just recently solved one of the biggest problems I've had in scraping using php w/multi-curl:
the problem where you have to post to a first page and use that data to get to yet another page of results (without the never-ending, unwieldy cascading if/then/else url-check bullshit).


It took 15 years because I started as a carpenter by trade (15 years of that) and just
kind of accidentally discovered php/mysql,
but we'll save that story for another time.

I have a working example to check out so you won't be left flying blind like I was while I learned this shit.

https://github.com/rxhector/ultimate-multicurl


And I have tried to comment the code (it still needs a ton of better / prettier formatting).



The first goodie from the code is this little xml beauty:
load any web page into xml without it breaking the shit out of php's simplexml_import_dom.


~~~
if (!function_exists('load_simplexml_page')) {
    /*
        this will 'force' xml to load a web page (pure html)

        simplexml_import_dom sometimes breaks when trying to import html
        with bad markup (i.e. old crappy coding / scripts) -
        DOMDocument will auto-magically fix shitty html,
        then we can simplexml-ize it !!!

        NOTE :
            php's file_get_contents, DOMDocument, and simplexml
            all send the php.ini user_agent on web requests

            most web sites will not answer a request with an empty user_agent
            or the user_agent "PHP",
            so I ALWAYS set the php.ini user_agent to a valid browser string

            ; php.ini
            user_agent="Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"
    */
    // load from url (default) or string
    function load_simplexml_page($page, $type = 'url') {
        /*
            DOMDocument::loadHTMLFile() warns on malformed documents;
            new DOMDocument('1.0', 'UTF-8') forces the doctype/utf-8 (gets rid of some warnings)
            and the @ prefix tells php to ignore the remaining warnings/errors
        */
        $dom = new DOMDocument('1.0', 'UTF-8');
        $type === 'url'
            ? @$dom->loadHTMLFile(trim($page))
            : @$dom->loadHTML($page);
        return simplexml_import_dom($dom);
    } // end function
} // end function check
~~~


Here's a quick and dirty script that lets you test your php.ini user_agent from a curl request:

~~~
<?php
    // save this on your web server so you can hit it as a web page, e.g. /www/test_user_agent.php
    // then run it from the cmd line, e.g. php /www/test_user_agent.php

    // require( somefile with load_simplexml_page )

    if (isset($_SERVER['REQUEST_METHOD'])) {
        // this is a web page hit
        $return = print_r($_SERVER, true);
        echo $return;   // return something to the browser/curl request
    } else {
        // no server request - this is running as a cmd line script

        // shows the default php user_agent "PHP" (if php.ini user_agent is not set)
        //$text = file_get_contents("http://localhost/test_user_agent.php");
        //echo $text;

        // the dumped $_SERVER shows what user_agent php actually sent
        $xml = load_simplexml_page("http://localhost/test_user_agent.php");
        echo $xml->asXML();
    }
~~~


I tried regex and substr and the other normal php string-processing functions,
but I learned early on that this can become unwieldy and complex.
That's when I discovered simplexml - what an awesome tool for html !!!
It is super easy to use xpath expressions to get to any element on a web page and pull data (see the sketch below) !!!
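
For example, here's a minimal sketch of pulling every link off a page with one xpath expression, using the load_simplexml_page helper from above (the target URL is just a placeholder):

~~~
// minimal sketch: grab every link from a page with one xpath expression
// (example.com is just a placeholder target)
$xml = load_simplexml_page('https://example.com/');

// '//a[@href]' = every <a> tag with an href, anywhere in the document
foreach ($xml->xpath('//a[@href]') as $a) {
    echo (string)$a['href'], ' => ', trim((string)$a), "\n";
}
~~~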


On my first few scraping jobs I was a NOOB, using good old file_get_contents and getting one slow page at a time,
but that has its limitations...
When you get a little more advanced you have to learn to build a query string and post data (pagination is a bitch) - see the sketch below.
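
A minimal sketch of building pagination query strings with http_build_query (the 'page' / 'per_page' parameter names are just made-up examples):

~~~
// minimal sketch: walk paginated results by building the query string
// ('page' / 'per_page' are made-up parameter names for illustration)
for ($page = 1; $page <= 5; $page++) {
    $qs  = http_build_query(['page' => $page, 'per_page' => 100]);
    $url = 'https://example.com/results?' . $qs;
    $xml = load_simplexml_page($url);
    // ... xpath out the rows you want here ...
}
~~~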

That's when I discovered curl - wow, it took me about a week to learn how to set cookies so I could log in to sites and get back-end data.
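
Here's a minimal sketch of that cookie dance, assuming a made-up login form; CURLOPT_COOKIEJAR saves cookies when the handle closes, and CURLOPT_COOKIEFILE sends them on the next request:

~~~
// minimal sketch: log in with curl cookies (login URL/fields are made up)
$jar = '/tmp/cookies.txt';

// 1) post the login form; curl saves the session cookie to $jar
$ch = curl_init('https://example.com/login');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(['user' => 'me', 'pass' => 'secret']));
curl_setopt($ch, CURLOPT_COOKIEJAR, $jar);   // write cookies here on curl_close
curl_exec($ch);
curl_close($ch);

// 2) hit the back-end page; curl sends the saved cookie back
$ch = curl_init('https://example.com/backend/data');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIEFILE, $jar);  // read cookies from here
$html = curl_exec($ch);
curl_close($ch);
~~~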

So my load_simplexml_page and curl tools came in handy - but man, was it slow doing 'synchronous' page loading, one slow page at a time.
When you get a client that wants 10,000 pages at a time (instead of 100), you'd better figure out how to do it AS FAST AS POSSIBLE !!!

Then I stumbled into multi-curl - HOLY SHIT, talk about a learning curve...
I know there are a few 'wizards' out there who probably picked it up right away,
but you gotta remember - I was a 15-year carpenter/guitar player/stoner - it took me a bit
to grasp the 'asynchronous' concept: load 1000 pages all at once, then process them.
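
A bare-bones sketch of that concept (not my library's full object, just the core curl_multi loop; the URLs are placeholders):

~~~
// minimal sketch of the core multi-curl loop - fire off all requests at
// once, then collect the responses as they finish (URLs are placeholders)
$urls = ['https://example.com/page/1', 'https://example.com/page/2'];

$mh  = curl_multi_init();
$chs = [];
foreach ($urls as $i => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $chs[$i] = $ch;
}

// pump the handles until every transfer is done
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);     // wait for activity instead of busy-looping
} while ($running > 0);

// grab each response and clean up
foreach ($chs as $i => $ch) {
    $html = curl_multi_getcontent($ch);
    echo $urls[$i], ' => ', strlen($html), " bytes\n";
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
~~~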



So now we can move on to the main multi-curl object itself.
I won't post the code in its entirety here - you can find that posted at https://github.com/rxhector/ultimate-multicurl

Some of my favorite code tricks I've discovered along the way:

~~~
// notice the &(reference) here
if ($x = &$this->result_callback($this->rs[$i]['options'])) {

    // $x is now a string and can be called as a variable function
    $x($this->rs[$i], $i);

    // old school would be something like:
    //     mixed call_user_func_array ( callable $callback , array $param_arr )
    // call_user_func_array($x, [$this->rs[$i], $i]);
}
~~~
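
If you haven't seen php's 'variable functions' before, here's a tiny standalone sketch of the trick (the function names are made up):

~~~
// tiny standalone demo of variable functions (names are made up)
function parse_search_page($row) { echo "search: {$row['url']}\n"; }
function parse_detail_page($row) { echo "detail: {$row['url']}\n"; }

$x = 'parse_detail_page';             // a plain string...
$x(['url' => 'https://example.com']); // ...called as a function
~~~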



Another pretty cool use for references:



~~~
public function start_callback(&$options, $set = false) {
    // $options can now be changed from within the callback function
    // (trust me - it comes in handy for passing variables around in multi-curl)

    $options['result_callback'] = $set;  // now $options is changed in the main (calling) flow
}
~~~
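
A tiny standalone sketch of what that by-reference parameter buys you (the names are made up):

~~~
// tiny standalone demo: a &$options parameter mutates the caller's array
function start_callback(&$options, $set = false) {
    $options['result_callback'] = $set;
}

$options = ['url' => 'https://example.com'];
start_callback($options, 'parse_detail_page');
print_r($options);  // 'result_callback' is now set in the calling scope
~~~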


The next really cool goodie you get is a proxy scraper - you know, get a list of 300 proxies,
then do a quick check against the target domain to make sure you have a good proxy ;) (a minimal version of that check is sketched below)
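
A minimal sketch of that proxy check, assuming a plain ip:port proxy list; each handle gets CURLOPT_PROXY set, and anything that answers with HTTP 200 inside the timeout is kept (addresses and target URL are placeholders):

~~~
// minimal sketch: test a proxy list against the target domain
// (proxy addresses and target URL are placeholders)
$proxies = ['1.2.3.4:8080', '5.6.7.8:3128'];
$target  = 'https://example.com/';

$mh  = curl_multi_init();
$chs = [];
foreach ($proxies as $i => $proxy) {
    $ch = curl_init($target);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);  // slow proxies are bad proxies
    curl_multi_add_handle($mh, $ch);
    $chs[$i] = $ch;
}

do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);
} while ($running > 0);

$good = [];
foreach ($chs as $i => $ch) {
    if (curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200) {
        $good[] = $proxies[$i];  // proxy answered on the target domain
    }
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
print_r($good);
~~~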



So - the whole point for me was to get a better understanding of how multi-curl works.
If you want to know how something works, you gotta break it and try to rebuild it.



Looking forward:
this code really needs some more clean-up and better comments/formatting.

I would really like to add some of the functionality from zebra-curl (I didn't need all the bells right away, so I just built this for quick get/post json requests).

I hope you guys like it


https://github.com/rxhector/ultimate-multicurl


This is running in production for a mid-market-cap company - we are scraping about 30k records at 1000/hr (25 pages/second) - not bad !!!

Tipping is allowed:
the old slow way - PayPal rxhector2k5@yahoo.com
the super fast ~3 second way - twitter @xrptipbot https://twitter.com/@rxhector
twitter trx bot @goseedit