
    Analyzing Apache logs with Pig



Analyzing log files, churning them, and extracting meaningful information is a common use case for Hadoop. We don't have to write MapReduce programs for such analyses; instead we can use tools like Pig and Hive. Here I'll just give you a start on the analysis part, using Pig for Apache log analysis. Pig has some built-in libraries that help us load Apache log files into Pig and perform cleanup operations on the raw string values in crude log files. These functionalities are available in piggybank.jar, usually found under the pig/contrib/piggybank/java/ directory. As the first step we need to register this jar with our Pig session; only then can we use its functionalities in our Pig Latin.
1. Register the PiggyBank jar
    REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar;
Once we have registered the jar file, we need to define a few functions to use in our Pig Latin. For any basic Apache log analysis we need a loader that loads the log files into Pig in a column-oriented format; we can create an Apache log loader as follows.
2. Define a log loader
    DEFINE ApacheCommonLogLoader org.apache.pig.piggybank.storage.apachelog.CommonLogLoader();
    (Piggy Bank has other log loaders as well)
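For example, if your server writes the combined log format (the common format plus referer and user-agent fields, which is what the nine-column schema in step 4 suggests), PiggyBank also ships a CombinedLogLoader. A minimal sketch of its definition, assuming the same piggybank.jar has been registered:
-- an alternative loader for combined-format Apache logs
DEFINE ApacheCombinedLogLoader org.apache.pig.piggybank.storage.apachelog.CombinedLogLoader();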
In Apache log files the default date format is 'dd/MMM/yyyy:HH:mm:ss Z'. Such a full timestamp won't help us much in log analysis; we often need to extract the date without the time component. For that we use DateExtractor().
3. Define a date extractor
    DEFINE DayExtractor org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('yyyy-MM-dd');
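To make the conversion concrete: with this output format, DayExtractor turns a raw timestamp such as '01/Jan/2011:18:30:45 -0800' into '2011-01-01'. A minimal sketch of its use (it assumes the logs relation loaded in step 4 below):
-- dt holds raw timestamps like '01/Jan/2011:18:30:45 -0800'
days = FOREACH logs GENERATE DayExtractor(dt) AS day;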
Once we have the required functions defined, we first need to load the log file into Pig.
4. Load the Apache log file into Pig
    --load the log files from hdfs into pig using CommonLogLoader
    logs = LOAD '/userdata/bejoys/pig/p01/access.log.2011-01-01' USING ApacheCommonLogLoader AS (ip_address, rfc, userId, dt, request, serverstatus, returnobject, referersite, clientbrowser);
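Before moving on, it's worth sanity-checking that the loader parsed the file the way we expect. A quick sketch that prints the schema and a few rows (note that DUMP, like STORE, triggers execution; see the notes at the end):
-- show the schema of the loaded relation
DESCRIBE logs;
-- peek at a handful of records (this launches a job)
sample_logs = LIMIT logs 5;
DUMP sample_logs;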
Now we are ready to dive into the actual log analysis. There are multiple pieces of information you may need to extract from a log; we'll look at a few of the common requirements here.
Note: you need to register the jar, define the classes to be used, and load the log files into Pig before trying out any of the Pig Latin below.
    Requirement 1: Find unique hits per day
    PIG Latin
    --Extracting the day alone and grouping records based on days
grpd = GROUP logs BY DayExtractor(dt);
    --looping through each group to get the unique no of userIds
cntd = FOREACH grpd
{
    tempId = logs.userId;
    uniqueUserId = DISTINCT tempId;
    GENERATE group AS day, COUNT(uniqueUserId) AS cnt;
}
    --sorting the processed records based on no of unique user ids in descending order
    srtd = ORDER cntd BY cnt desc;
    --storing the final result into a hdfs directory
    STORE srtd INTO '/userdata/bejoys/pig/ApacheLogResult1';
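Since srtd is already sorted in descending order, a quick way to eyeball the busiest days without storing anything is to take just the top rows (a sketch; the choice of 5 is arbitrary):
-- top 5 days by unique hits (a LIMIT right after ORDER preserves the ordering)
top5 = LIMIT srtd 5;
DUMP top5;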
Requirement 2: Find unique hits to websites (IPs) per day
    PIG Latin
    --Extracting the day alone and grouping records based on days and ip address
grpd = GROUP logs BY (DayExtractor(dt), ip_address);
    --looping through each group to get the unique no of userIds
cntd = FOREACH grpd
{
    tempId = logs.userId;
    uniqueUserId = DISTINCT tempId;
    GENERATE FLATTEN(group) AS (day, ip_address), COUNT(uniqueUserId) AS cnt;
}
    --sorting the processed records based on no of unique user ids in descending order
    srtd = ORDER cntd BY cnt desc;
    --storing the final result into a hdfs directory
STORE srtd INTO '/userdata/bejoys/pig/ApacheLogResult2';
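If you only care about one particular day, you can also filter the counted records before sorting or storing them. A sketch, assuming the date literal below actually occurs in your data:
-- keep only the records for a single (hypothetical) day
oneday = FILTER cntd BY day == '2011-01-01';
DUMP oneday;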
Note: when you use Pig Latin in the grunt shell, keep a few things in mind:
1. When we issue a Pig statement in grunt and press enter, only syntax and semantic checks are performed; no execution is triggered.
2. The Pig statements are executed only after a STORE (or DUMP) command is submitted, i.e. the MapReduce jobs are triggered only at that point.
3. You don't have to load the log files into Pig again and again; once loaded, the same relation can be used for all related operations in that session. Once you exit the grunt shell the loaded relations are lost, and you'd have to perform the register and load steps all over again.
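To see this lazy evaluation for yourself, you can chain a few statements in grunt and note that nothing runs until the final DUMP. A minimal sketch, assuming serverstatus holds the HTTP status code as text:
-- each line below only passes checks and extends the logical plan
errs = FILTER logs BY serverstatus == '404';
err_sample = LIMIT errs 10;
-- only now does Pig compile and launch the MapReduce job(s)
DUMP err_sample;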

posted on 2013-04-08 02:06 by paulwong, filed under: PIG
