Analysing Petabytes of Websites

For each crawl (there is usually one a month) there can be upwards of 60,000 warc.gz files. These are all listed in a manifest for that month's crawl. The following is the manifest file for the January 2017 crawl. The manifest is around 7 MB when decompressed and contains 57,800 lines. Each line is a single URI of a warc.gz file. There is no prefixed protocol or host name as S3 and HTTPS are both supported.

    $ curl -O
    $ gunzip -c warc.paths.gz | head

    crawl-data/CC-MAIN-2017-04/segments/1484560279169.4/warc/CC-MAIN-20170116095119-00000-ip-10-171-10-70.ec2.internal.warc.gz
    crawl-data/CC-MAIN-2017-04/segments/1484560279169.4/warc/CC-MAIN-20170116095119-00001-ip-10-171-10-70.ec2.internal.warc.gz
    crawl-data/CC-MAIN-2017-04/segments/1484560279169.4/warc/CC-MAIN-20170116095119-00002-ip-10-171-10-70.ec2.internal.warc.gz
    crawl-data/CC-MAIN-2017-04/segments/1484560279169.4/warc/CC-MAIN-20170116095119-00003-ip-10-171-10-70.ec2.internal.warc.gz
    crawl-data/CC-MAIN-2017-04/segments/1484560279169.4/warc/CC-MAIN-20170116095119-00004-ip-10-171-10-70.ec2.internal.warc.gz
    crawl-data/CC-MAIN-2017-04/segments/1484560279169.4/warc/CC-MAIN-20170116095119-00005-ip-10-171-10-70.ec2.internal.warc.gz
    crawl-data/CC-MAIN-2017-04/segments/1484560279169.4/warc/CC-MAIN-20170116095119-00006-ip-10-171-10-70.ec2.internal.warc.gz
    crawl-data/CC-MAIN-2017-04/segments/1484560279169.4/warc/CC-MAIN-20170116095119-00007-ip-10-171-10-70.ec2.internal.warc.gz
    crawl-data/CC-MAIN-2017-04/segments/1484560279169.4/warc/CC-MAIN-20170116095119-00008-ip-10-171-10-70.ec2.internal.warc.gz
    crawl-data/CC-MAIN-2017-04/segments/1484560279169.4/warc/CC-MAIN-20170116095119-00009-ip-10-171-10-70.ec2.internal.warc.gz

When a page is written into a warc.gz file there are two derived datasets that are created off the back of it. The first is a JSON extract of the request, response and HTML metadata which is stored in a warc.wat.gz file. These are usually around 350 MB in size.
The second is a plain-text extract of the HTML contents which is stored in warc.wet.gz files. These are usually around 150 MB in size.

To demonstrate what these files look like I'll fetch a random HackerNews page. First, I'll search for all Hacker News pages using Common Crawl's index for their January 2017 crawl.

    $ curl --silent '*&output=json' > hn.paths

That request returned 13,591 pages from the January crawl.

    $ wc -l hn.paths

    13591 hn.paths

Note that any one page may have been crawled and stored more than once and there is no guarantee a page crawled in one month's crawl will be crawled in another.

Each line in the hn.paths results file is a JSON string representing the metadata of the page crawled. It contains the URI of the warc.gz file that holds the page contents, the byte offset within that gzip file and the length of the contents when gzip-compressed. Here is one page picked at random:

    $ sort -R hn.paths | head -n1 | python -m json.tool

    {
        "digest": "D6UPIKJTS6XLRLUWTW3HL2S44IE2GUZS",
        "filename": "crawl-data/CC-MAIN-2017-04/segments/1484560279657.18/warc/CC-MAIN-20170116095119-00156-ip-10-171-10-70.ec2.internal.warc.gz",
        "length": "2049",
        "mime": "text/html",
        "offset": "822555329",
        "status": "200",
        "timestamp": "20170117120519",
        "url": "",
        "urlkey": "com,ycombinator,news)/item?id=4781011"
    }

I'll download the warc.gz file and extract the page. I'll run the head command to take the first 822,555,329 + 2,049 bytes of raw gzip data, then pipe that into tail and take the last 2,049 bytes, isolating the compressed content for just this one page. I'll then decompress the contents using gunzip.

    $ curl -O
    $ head -c 822557378 CC-MAIN-20170116095119-00156-ip-10-171-10-70.ec2.internal.warc.gz | tail -c 2049 | gunzip -c

If you want to save some bandwidth you could provide the offset and range to curl so it only fetches those 2,049 bytes from Amazon in the first place.
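The offset arithmetic works because every record in a warc.gz file is compressed as its own gzip member, so a byte slice starting at the offset and running for the compressed length decompresses on its own. Here is a minimal Python sketch of that idea. The helper names range_header and extract_member are hypothetical, and the in-memory blob is a made-up stand-in for a real warc.gz file rather than crawl data. HTTP byte ranges are inclusive at both ends, so the last byte requested is offset + length - 1.

```python
import gzip


def range_header(offset, length):
    """Build an HTTP Range header value for `length` bytes starting at `offset`.

    Byte ranges are inclusive at both ends, hence the "- 1".
    """
    return "bytes=%d-%d" % (offset, offset + length - 1)


def extract_member(blob, offset, length):
    """Slice one gzip member out of `blob` and decompress it."""
    return gzip.decompress(blob[offset:offset + length])


# Stand-in for a warc.gz file: three payloads, each compressed on its own
# and then concatenated (hypothetical data, for illustration only).
records = [b"record one\n", b"record two\n", b"record three\n"]
members = [gzip.compress(r) for r in records]
blob = b"".join(members)

offset = len(members[0])  # the second member starts right after the first
length = len(members[1])  # compressed length of the second member

print(range_header(822555329, 2049))
print(extract_member(blob, offset, length))
```

The same slice could come from a local file (as with head and tail) or from a ranged HTTP request; only the way the bytes are obtained differs.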
    $ curl -H "range: bytes=822555329-822557377" -O
    $ gunzip -c CC-MAIN-20170116095119-00156-ip-10-171-10-70.ec2.internal.warc.gz 2>/dev/null

Below are the headers in full followed by the HTML. I've truncated the HTML for this blog post but I can assure you the entire HTML of the page is present.

    WARC/1.0
    WARC-Type: response
    WARC-Date: 2017-01-17T12:05:19Z
    WARC-Record-ID: <urn:uuid:9cd7b193-4ce0-44e5-924f-478d69798b52>
    Content-Length: 3955
    Content-Type: application/http; msgtype=response
    WARC-Warcinfo-ID: <urn:uuid:c915b66c-b823-44f6-94f4-452aa439a12f>
    WARC-Concurrent-To: <urn:uuid:20114dd5-747e-4321-80da-0c449bb37894>
    WARC-IP-Address:
    WARC-Target-URI:
    WARC-Payload-Digest: sha1:D6UPIKJTS6XLRLUWTW3HL2S44IE2GUZS
    WARC-Block-Digest: sha1:PF6MXD5SQPBVXVKSRW6WUIW36MPMS2RS
    WARC-Truncated: length

    HTTP/1.1 200 OK
    Set-Cookie: __cfduid=daf9300df4da2584d3a99b5afc474f47f1484654719; expires=Wed, 17-Jan-18 12:05:19 GMT; path=/;; HttpOnly
    Connection: close
    Server: cloudflare-nginx
    Cache-Control: max-age=0
    X-Frame-Options: DENY
    Strict-Transport-Security: max-age=31556900; includeSubDomains
    Vary: Accept-Encoding
    Date: Tue, 17 Jan 2017 12:05:19 GMT
    CF-RAY: 3229acbe785d23d8-IAD
    Content-Type: text/html; charset=utf-8

    <html op="item"><head><meta name="referrer" content="origin"><meta name="viewport" content="width=device-width, initial-scale=1.0"><link rel="stylesheet" type="text/css" href="news.css?vg9HEiw8gAskbHjOLY38">
    <link rel="shortcut icon" href="favicon.ico">
    <title>What are business hours when you are a developer platform used by developers glo… | Hacker News</title></head><body><center><table id="hnmain" border="0" cellpadding="0" cellspacing="0" width="85%" bgcolor="#f6f6ef">
    …

If I want to fetch the JSON metadata extract for this page I can find its content in the .warc.gz file's warc.wat.gz sibling. To get the right URL, change the URL's 5th sub-folder from warc to wat and change the file extension from warc.gz to warc.wat.gz.
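The renaming rule can be sketched as a small Python helper. This is only a sketch: the function name wat_path is hypothetical and not part of any Common Crawl tooling, and it assumes the six-part path layout seen in the index result above (crawl-data, crawl name, segments, segment ID, warc, file name).

```python
def wat_path(warc_path):
    """Derive the warc.wat.gz sibling path from a warc.gz path."""
    parts = warc_path.split("/")

    # Expected layout: crawl-data/<crawl>/segments/<segment>/warc/<file>
    # so the 5th sub-folder (index 4) must be "warc".
    suffix = ".warc.gz"
    if len(parts) != 6 or parts[4] != "warc" or not parts[5].endswith(suffix):
        raise ValueError("unexpected WARC path layout: %r" % warc_path)

    parts[4] = "wat"
    parts[5] = parts[5][:-len(suffix)] + ".warc.wat.gz"
    return "/".join(parts)


example = ("crawl-data/CC-MAIN-2017-04/segments/1484560279657.18/warc/"
           "CC-MAIN-20170116095119-00156-ip-10-171-10-70.ec2.internal.warc.gz")
print(wat_path(example))
```

Raising on an unexpected layout is a deliberate choice in this sketch, so a path from a differently structured crawl fails loudly instead of producing a wrong URI.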
    $ curl -O

I don't have any offset information for the warc.wat.gz file so I'll run zgrep to find the content instead. The JSON payloads below have been truncated for readability purposes.

    $ zgrep -B3 -A7 'id=4781011' CC-MAIN-20170116095119-00156-ip-10-171-10-70.ec2.internal.warc.wat.gz

    WARC/1.0
    WARC-Type: metadata
    WARC-Target-URI:
    WARC-Date: 2017-01-17T12:05:19Z
    WARC-Record-ID: <urn:uuid:cbdb944d-15b1-4e6a-aa26-262791508a94>
    WARC-Refers-To: <urn:uuid:20114dd5-747e-4321-80da-0c449bb37894>
    Content-Type: application/json
    Content-Length: 1358

    {"Envelope":{"Format":"WARC","WARC-Header-Length":"361","Block-Digest":"sha1:QBYGW7UDVVNPHNJ2XECOCJD4L7YI6UPF",…

    WARC/1.0
    WARC-Type: metadata
    WARC-Target-URI:
    WARC-Date: 2017-01-17T12:05:19Z
    WARC-Record-ID: <urn:uuid:c87f2618-7d4a-41d2-95c8-271721681a7d>
    WARC-Refers-To: <urn:uuid:9cd7b193-4ce0-44e5-924f-478d69798b52>
    Content-Type: application/json
    Content-Length: 4080

    {"Envelope":{"Format":"WARC","WARC-Header-Length":"575","Block-Digest":"sha1:PF6MXD5SQPBVXVKSRW6WUIW36MPMS2RS",…

    WARC/1.0
    WARC-Type: metadata
    WARC-Target-URI:
    WARC-Date: 2017-01-17T12:05:19Z
    WARC-Record-ID: <urn:uuid:e8242cc7-9787-4c35-b9d7-6b6531b99e90>
    WARC-Refers-To: <urn:uuid:ed64f4c4-494a-4ca3-9118-228bd9cce3a0>
    Content-Type: application/json
    Content-Length: 1109

    {"Envelope":{"Format":"WARC","WARC-Header-Length":"389","Block-Digest":"sha1:I55H52HFRALSA2RHZ2TCEKYNIIZVDUUT",…

There were three records stored for this page in the file. Here is the JSON for the longest of the three. I've truncated the HTML meta links.
    {
        "Envelope": {
            "Format": "WARC",
            "WARC-Header-Length": "575",
            "Block-Digest": "sha1:PF6MXD5SQPBVXVKSRW6WUIW36MPMS2RS",
            "Actual-Content-Length": "3955",
            "WARC-Header-Metadata": {
                "WARC-Type": "response",
                "WARC-Truncated": "length",
                "WARC-Date": "2017-01-17T12:05:19Z",
                "WARC-Warcinfo-ID": "<urn:uuid:c915b66c-b823-44f6-94f4-452aa439a12f>",
                "Content-Length": "3955",
                "WARC-Record-ID": "<urn:uuid:9cd7b193-4ce0-44e5-924f-478d69798b52>",
                "WARC-Block-Digest": "sha1:PF6MXD5SQPBVXVKSRW6WUIW36MPMS2RS",
                "WARC-Payload-Digest": "sha1:D6UPIKJTS6XLRLUWTW3HL2S44IE2GUZS",
                "WARC-Target-URI": "",
                "WARC-IP-Address": "",
                "WARC-Concurrent-To": "<urn:uuid:20114dd5-747e-4321-80da-0c449bb37894>",
                "Content-Type": "application/http; msgtype=response"
            },
            "Payload-Metadata": {
                "Trailing-Slop-Length": "4",
                "Actual-Content-Type": "application/http; msgtype=response",
                "HTTP-Response-Metadata": {
                    "Headers": {
                        "X-Frame-Options": "DENY",
                        "Strict-Transport-Security": "max-age=31556900; includeSubDomains",
                        "Date": "Tue, 17 Jan 2017 12:05:19 GMT",
                        "Vary": "Accept-Encoding",
                        "CF-RAY": "3229acbe785d23d8-IAD",
                        "Set-Cookie": "__cfduid=daf9300df4da2584d3a99b5afc474f47f1484654719; expires=Wed, 17-Jan-18 12:05:19 GMT; path=/;; HttpOnly",
                        "Content-Type": "text/html; charset=utf-8",
                        "Connection": "close",
                        "Server": "cloudflare-nginx",
                        "Cache-Control": "max-age=0"
                    },
                    "Headers-Length": "453",
                    "Entity-Length": "3502",
                    "Entity-Trailing-Slop-Bytes": "0",
                    "Response-Message": {
                        "Status": "200",
                        "Version": "HTTP/1.1",
                        "Reason": "OK"
                    },
                    "HTML-Metadata": {
                        "Links": [
                            {"path": "IMG@/src", "url": "y18.gif"},
                            {"path": "A@/href", "url": ""},
                            {"text": "Hacker News", "path": "A@/href", "url": "news"},
                            {"text": "new", "path": "A@/href", "url": "newest"},
                            {"path": "FORM@/action", "method": "get", "url": "//"}
                        ],
                        "Head": {
                            "Link": [
                                {"path": "LINK@/href", "rel": "stylesheet", "type": "text/css", "url": "news.css?vg9HEiw8gAskbHjOLY38"},
                                {"path": "LINK@/href", "rel": "shortcut icon", "url": "favicon.ico"}
                            ],
                            "Scripts": [
                                {"path": "SCRIPT@/src", "type": "text/javascript", "url": "hn.js?vg9HEiw8gAskbHjOLY38"}
                            ],
                            "Metas": [
                                {"content": "origin", "name": "referrer"},
                                {"content": "width=device-width, initial-scale=1.0", "name": "viewport"}
                            ],
                            "Title": "What are business hours when you are a developer platform used by developers glo… | Hacker News"
                        }
                    },
                    "Entity-Digest": "sha1:D6UPIKJTS6XLRLUWTW3HL2S44IE2GUZS"
                }
            }
        },
        "Container": {
            "Compressed": true,
            "Gzip-Metadata": {
                "Footer-Length": "8",
                "Deflate-Length": "2049",
                "Header-Length": "10",
                "Inflated-CRC": "-1732134004",
                "Inflated-Length": "4534"
            },
            "Offset": "822555329",
            "Filename": "CC-MAIN-20170116095119-00156-ip-10-171-10-70.ec2.internal.warc.gz"
        }
    }

As you can see there is a rich set of metadata, all nicely structured in a way that is easy to work with. The warc.wat.gz files are around a third the size of the warc.gz files so they save a substantial amount of bandwidth if you can design your job around not needing the entire page contents.

AWS EMR Up & Running

All the following commands were run on a fresh install of Ubuntu 14.04.3.

To start, I'll install the AWS CLI tool and a few dependencies it needs to run.

    $ sudo apt-get update
    $ sudo apt-get -y install python-pip python-virtualenv
    $ virtualenv amazon
    $ source amazon/bin/activate
    $ pip install awscli

I'll then enter my AWS credentials.

    $ read AWS_ACCESS_KEY_ID
    $ read AWS_SECRET_ACCESS_KEY
    $ export AWS_ACCESS_KEY_ID
    $ export AWS_SECRET_ACCESS_KEY

I'll run configure to make sure us-east-1 is my default region.

    $ aws configure

    AWS Access Key ID [********************]:
    AWS Secret Access Key [********************]:
    Default region name [us-east-1]: us-east-1
    Default output format [None]:

I'll be launching a 5-node Hadoop cluster of m3.xlarge instances using the 5.3.1 release of AWS EMR. This comes with Hadoop 2.7.3, Hive 2.1.1, Spark 2.1.0 and Presto 0.157.1.
I don't recommend using spot instances for master or core nodes but, in the interest of keeping my costs for this blog post low, all five nodes are spot instances where I've bid to pay at most $0.07 / hour for each node.

    $ aws emr create-cluster \
        --applications Name=Hadoop Name=Hive Name=Spark Name=Presto \
        --auto-scaling-role EMR_AutoScaling_DefaultRole \
        --ec2-attributes '{
            "KeyName": "emr",
            "InstanceProfile": "EMR_EC2_DefaultRole",
            "SubnetId": "subnet-0489ed5c",
            "EmrManagedSlaveSecurityGroup": "sg-2d321350",
            "EmrManagedMasterSecurityGroup": "sg-3332134e"
        }' \
        --enable-debugging \
        --instance-groups '[{
            "InstanceCount": 2,
            "BidPrice": "0.07",
            "InstanceGroupType": "TASK",
            "InstanceType": "m3.xlarge",
            "Name": "Task - 3"
        }, {
            "InstanceCount": 1,
            "BidPrice": "0.07",
            "InstanceGroupType": "MASTER",
            "InstanceType": "m3.xlarge",
            "Name": "Master - 1"
        }, {
            "InstanceCount": 2,
            "BidPrice": "0.07",
            "InstanceGroupType": "CORE",
            "InstanceType": "m3.xlarge",
            "Name": "Core - 2"
        }]' \
        --log-uri 's3n://aws-logs-591231097547-us-east-1/elasticmapreduce/' \
        --name 'My cluster' \
        --region us-east-1 \
        --release-label emr-5.3.1 \
        --scale-down-behavior TERMINATE_AT_INSTANCE_HOUR \
        --service-role EMR_DefaultRole \
        --termination-protected

After 12 minutes the machines were up and running and I was able to SSH into the master node.
    $ ssh -o ServerAliveInterval=50 -i ~/.ssh/emr.pem hadoop@

    [Amazon Linux AMI ASCII-art banner]

    11 package(s) needed for security, out of 17 available
    Run "sudo yum update" to apply all updates.

    [EMR ASCII-art banner]

I need three Python-based dependencies installed on the master and task nodes. I ran the following command by hand but I'd recommend this task be wrapped up in a bootstrap step when launching EMR.

    $ sudo pip install boto warc

Pointing Spark at Common Crawl

On the master node I'll download the list of warc.wat.gz URIs for the January 2017 crawl. I'll then pick 100 URIs at random and save them onto HDFS. Note that if you want to work with data over several crawls it's possible to download the paths.gz files for multiple months and concatenate them into a single manifest.

    $ curl -O
    $ gunzip -c wat.paths.gz | sort -R | head -n100 | gzip > wat.paths.100.gz
    $ hdfs dfs -copyFromLocal wat.paths.100.gz /user/hadoop/

I will then launch pyspark and start the data extraction job on those 100 warc.wat.gz files.

    $ pyspark

    import json

    import boto
    from boto.s3.key import Key
    from gzipstream import GzipStreamFile
    from pyspark.sql.types import *
    import warc


    def get_servers(id_, iterator):
        # Anonymous connection: no AWS credentials are sent to S3.
        conn = boto.connect_s3(anon=True, host='')
        bucket = conn.get_bucket('commoncrawl')
        for uri in iterator:
            key_ = Key(bucket, uri)
            # Stream-decompress the WAT file and walk its WARC records.
            file_ = warc.WARCFile(fileobj=GzipStreamFile(key_))
            for record in file_:
                if record['Content-Type'] == 'application/json':
                    record = json.loads(record.payload.read())
                    try:
                        yield record['Envelope']\
                                    ['Payload-Metadata']\
                                    ['HTTP-Response-Metadata']\
                                    ['Headers']\
                                    ['Server'].strip().lower()
                    except KeyError:
                        # Records without a Server header are counted as None.
                        yield None


    files = sc.textFile('/user/hadoop/wat.paths.100.gz')

    servers = files.mapPartitionsWithSplit(get_servers) \
                   .map(lambda x: (x, 1)) \
                   .reduceByKey(lambda x, y: x + y)

    schema = StructType([
        StructField("server_name", StringType(), True),
        StructField("page_count", LongType(), True)
    ])

    sqlContext.createDataFrame(servers, schema=schema) \
              .write \
              .format("parquet") \
              .saveAsTable('servers')

In the above script I read in the 100 WAT URIs off the file stored on HDFS:

    files = sc.textFile('/user/hadoop/wat.paths.100.gz')

I then iterated over each line in the file, each of which represents a single URI. These all ran through the get_servers function. Afterwards, I ran a MapReduce job to count how many pages each server was used to serve. To be clear, when I refer to server I'm referring to the software that was reported to handle the request and serve the resulting contents (such as Apache, Nginx, IIS).

    servers = files.mapPartitionsWithSplit(get_servers) \
                   .map(lambda x: (x, 1)) \
                   .reduceByKey(lambda x, y: x + y)

In the get_servers function an anonymous connection to AWS S3 is made and a handle to the commoncrawl S3 bucket is acquired.
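To see what the extraction and counting steps do without a cluster, here is a minimal local sketch in plain Python. It is not the Spark job itself: collections.Counter stands in for the map and reduceByKey pair, the helper name server_name is hypothetical, and the payloads are made-up, heavily trimmed stand-ins for WAT JSON records rather than real crawl data.

```python
import json
from collections import Counter


def server_name(wat_json):
    """Return the normalised Server header from one WAT JSON payload, or None."""
    record = json.loads(wat_json)
    try:
        headers = (record["Envelope"]["Payload-Metadata"]
                   ["HTTP-Response-Metadata"]["Headers"])
        return headers["Server"].strip().lower()
    except KeyError:
        # Request and metadata records carry no HTTP response headers.
        return None


def make_payload(headers=None):
    """Build a trimmed, hypothetical WAT payload for illustration."""
    metadata = {}
    if headers is not None:
        metadata["HTTP-Response-Metadata"] = {"Headers": headers}
    return json.dumps({"Envelope": {"Payload-Metadata": metadata}})


payloads = [
    make_payload({"Server": "cloudflare-nginx"}),
    make_payload({"Server": " Cloudflare-Nginx "}),
    make_payload({"Server": "Apache"}),
    make_payload(),  # no response metadata at all
]

# Counter plays the role of .map(lambda x: (x, 1)).reduceByKey(...)
counts = Counter(server_name(p) for p in payloads)

for name, page_count in counts.most_common():
    print(name, page_count)
```

Stripping and lower-casing before counting means differently cased or padded spellings of the same server name land in one bucket, which is the same normalisation the Spark function applies.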
