When a map task starts running and produces intermediate data, the intermediate results are not simply written straight to disk. The process in between is fairly involved: an in-memory buffer caches the partial results produced so far, and some pre-sorting is done inside that buffer to optimize overall map performance. As shown in the figure above, each map has a corresponding in-memory buffer (the MapOutputBuffer, the "buffer in memory" in the figure) into which the map writes the partial results it has produced. This buffer is 100MB by default, but its size can be adjusted with a parameter set at job submission time: io.sort.mb. When a map produces a very large amount of data, increasing io.sort.mb reduces the number of spills over the map's whole computation, so the map task touches the disk less often; if the map tasks are bottlenecked on disk, this adjustment can greatly improve map performance. The in-memory structure the map uses for sort and spill is shown in the figure below:
While the map runs, it keeps writing its computed results into this buffer, but the buffer cannot necessarily hold the entire map output. When the map output exceeds a certain threshold (say 100M), the map must write the buffer's data out to disk; in MapReduce this process is called spill. The map does not wait until the buffer is completely full before spilling, because spilling only after the buffer fills up would inevitably stall the map's computation while it waits for the buffer to free up space. Instead, the map starts spilling once the buffer is filled to a certain degree (say 80%). This threshold is also controlled by a job configuration parameter, io.sort.spill.percent, which defaults to 0.80, i.e. 80%. This parameter likewise affects how often the map spills, and hence how often the map task reads and writes disk over its lifetime. Outside of special cases it normally does not need manual tuning; adjusting io.sort.mb is more convenient for users.
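As a concrete illustration, both knobs can be set on the job's Configuration before submission. This is a minimal sketch using the old-style parameter names quoted in this article (later Hadoop versions renamed them, e.g. to mapreduce.task.io.sort.mb); the values are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpillTuningDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Enlarge the map-side sort buffer so a map that emits a lot of
        // data spills less often (default is 100 MB).
        conf.setInt("io.sort.mb", 200);
        // Spill threshold; the 0.80 default rarely needs changing.
        conf.setFloat("io.sort.spill.percent", 0.80f);
        Job job = new Job(conf, "spill-tuning-demo");
        // ... set mapper class, input/output paths, then submit.
    }
}
```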
When the map task's computation completes, if the map produced any output there will be one or more spill files, and these files are the map's output. Before the map exits normally, it needs to merge these spills into a single file, so the map has a merge phase before it finishes. One parameter tunes the merge behavior: io.sort.factor, which defaults to 10. It is the maximum number of streams that can be merged in parallel into the merge output at once. For example, if a map produces a very large amount of data and more than 10 spill files, while io.sort.factor is left at the default 10, then when the map finishes computing there is no way to merge all spill files in one pass; the merge happens over multiple rounds, each with at most 10 streams. In other words, when a map's intermediate results are very large, increasing io.sort.factor helps cut the number of merge passes, which in turn cuts the map's disk read/write traffic and may achieve the goal of optimizing the job.
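A quick back-of-the-envelope illustration (the real merge planner is slightly cleverer about sizing the first pass, so treat this as approximate): with 25 spill files and the default io.sort.factor = 10, the map might merge 10 files into 1, then another 10 into 1, and finally merge the remaining 7 streams (1 + 1 + 5) into the output, so the data in the first 20 files is read and written twice. With io.sort.factor = 25 (or more), everything merges in a single pass and each record is rewritten only once.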
When a job specifies a combiner, we all know that the map side uses the combiner-defined function to combine the map results on the map side. The combiner function may run before or after the merge; this timing can be controlled by a parameter: min.num.spill.for.combine (default 3). When the job has a combiner set and there are at least 3 spills, the combiner function runs before the merge produces the result file. This way, when there are many spills to merge and a lot of data that can be combined, the amount of data written to disk files shrinks; again the goal is to reduce disk read/write traffic, which may optimize the job.
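A hedged sketch of wiring this up in the job driver; reusing the reducer as the combiner is an assumption that only holds when the combine operation is associative and commutative (as in wordcount), and WordCountReducer is a hypothetical class name:

```java
// Assumes WordCountReducer sums counts, so it is safe to reuse as a combiner.
job.setCombinerClass(WordCountReducer.class);
// Run the combiner before merge only when at least 3 spill files exist (the default).
job.getConfiguration().setInt("min.num.spill.for.combine", 3);
```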
These are not the only ways to reduce intermediate-result disk traffic; there is also compression. That is, the map's intermediate output, both at spill time and in the final merged result file, can be compressed. The benefit of compression is that it reduces the amount of data written to and read from disk. It is especially useful for jobs whose intermediate results are very large and whose map execution is bottlenecked on disk speed. The parameter controlling whether map intermediate results are compressed is mapred.compress.map.output (true/false). When set to true, the map compresses data before writing intermediate results to disk, and readers decompress first when consuming the results. The consequence: the volume of intermediate data written to disk shrinks, but some CPU is spent on compressing and decompressing. So this approach usually suits jobs whose intermediate results are very large and whose bottleneck is not CPU but disk read/write; put bluntly, it trades CPU for I/O. From observation, CPU is not the bottleneck for the large majority of jobs, unless the computation logic is unusually complex, so compressing intermediate results usually pays off. Below is a comparison of the local disk traffic of map intermediate results for a wordcount job with and without compression:
Map intermediate results without compression:

Map intermediate results with compression:

As you can see, for the same job on the same data, with compression enabled the map intermediate results shrink by nearly 10x; if the map bottleneck is on disk, the job's performance gain will be very noticeable.
When map intermediate-result compression is enabled, users can also choose which compression format to use. The formats Hadoop currently supports include GzipCodec, LzoCodec, BZip2Codec, LzmaCodec, and so on. Generally, for a reasonably balanced trade between CPU cost and on-disk compression ratio, LzoCodec is a good fit, though it also depends on the specifics of the job. Users who want to pick the intermediate-result compression algorithm themselves can set the configuration parameter: mapred.map.output.compression.codec = org.apache.hadoop.io.compress.DefaultCodec, or whichever other codec they choose.
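For example, both settings can be applied on the Configuration from the driver. GzipCodec ships with core Hadoop, while LzoCodec lives in a separately installed LZO package, so the codec choice here is illustrative only:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;

public class MapOutputCompression {
    // Call from the job driver before submission.
    static void enable(Configuration conf) {
        conf.setBoolean("mapred.compress.map.output", true);
        // GzipCodec is bundled with Hadoop; swap in LzoCodec if the
        // LZO libraries are installed on the cluster.
        conf.setClass("mapred.map.output.compression.codec",
                GzipCodec.class, CompressionCodec.class);
    }
}
```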
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| io.sort.mb | int | 100 | Size (in MB) of the buffer that caches map intermediate results |
| io.sort.record.percent | float | 0.05 | Fraction of io.sort.mb used to store map output record boundaries; the rest of the buffer stores the data itself |
| io.sort.spill.percent | float | 0.80 | Buffer-usage threshold at which the map starts a spill |
| io.sort.factor | int | 10 | Upper limit on the number of streams operated on at once during a merge |
| min.num.spill.for.combine | int | 3 | Minimum number of spills before the combiner function runs |
| mapred.compress.map.output | boolean | false | Whether map intermediate results are compressed |
| mapred.map.output.compression.codec | class name | org.apache.hadoop.io.compress.DefaultCodec | Compression codec for map intermediate results |
Reduce execution is divided into three phases: copy -> sort -> reduce. Since every map in the job splits its output into n partitions according to the number of reduces (n), each map's intermediate results may contain part of the data that every reduce needs to process. Therefore, to optimize reduce execution time, Hadoop waits only until the job's first map finishes, and then all reduces start trying to download the partition of data belonging to them from the completed maps. This process is what is commonly called shuffle, i.e. the copy phase.
When a reduce task does shuffle, it is really downloading its own portion of the data from the various maps that have already finished. Because there are usually many maps, a single reduce can download from multiple maps in parallel, and this parallelism is adjustable via mapred.reduce.parallel.copies (default 5). By default each reduce has only 5 parallel download threads pulling data from maps; if 100 or more of the job's maps finish within some window, the reduce can still download from at most 5 maps at a time. So this parameter is worth increasing for jobs with many maps that finish quickly, helping the reduce fetch its share of the data faster.
While a reduce download thread is fetching some map's data, the download can fail: the machine holding that map's intermediate results may have an error, the intermediate files may be lost, the network may drop, and so on. The reduce download thread therefore does not wait forever; when the download still fails after a certain time, the thread gives up on that attempt and later tries to download from somewhere else (because the map may have been rerun in the meantime). This maximum download window for reduce download threads is adjustable via mapred.reduce.copy.backoff (default 300 seconds). If the cluster's network itself is the bottleneck, users can increase this parameter to avoid reduce download threads being judged as failed. In a reasonably good network environment, though, there is no need: a professional cluster network should not have serious problems, so this parameter rarely needs tuning.
Reduce 灝?/span> map 緇撴灉涓嬭澆鍒版湰鍦版椂錛屽悓鏍蜂篃鏄渶瑕佽繘琛?/span> merge 鐨勶紝鎵浠?/span> io.sort.factor 鐨勯厤緗夐」鍚屾牱浼氬獎鍝?/span> reduce 榪涜 merge 鏃剁殑琛屼負錛岃鍙傛暟鐨勮緇嗕粙緇嶄笂鏂囧凡緇忔彁鍒幫紝褰撳彂鐜?/span> reduce 鍦?/span> shuffle 闃舵 iowait 闈炲父鐨勯珮鐨勬椂鍊欙紝灝辨湁鍙兘閫氳繃璋冨ぇ榪欎釜鍙傛暟鏉ュ姞澶т竴嬈?/span> merge 鏃剁殑騫跺彂鍚炲悙錛屼紭鍖?/span> reduce 鏁堢巼銆?/span>
During shuffle, the reduce does not write downloaded map data to disk immediately; it caches it in memory first and only flushes to disk once memory usage reaches a certain amount. This memory size is not controlled the way it is on the map side via io.sort.mb, but through a different parameter: mapred.job.shuffle.input.buffer.percent (default 0.7). The parameter is in fact a percentage, meaning shuffle data in reduce memory may use at most 0.7 × maxHeap of the reduce task. In other words, a fixed proportion of the reduce task's maximum heap usage (usually set via mapred.child.java.opts, e.g. -Xmx1024m) is used to cache data; by default the reduce uses 70% of its heap size for the in-memory cache. If the reduce's heap is sized up for business reasons, the cache grows with it, which is exactly why the reduce's cache setting is a percentage rather than a fixed value.
Suppose mapred.job.shuffle.input.buffer.percent is 0.7 and the reduce task's max heapsize is 1G; then the memory used to cache downloaded data is roughly 700MB. Like the map side, these 700MB are not filled completely before being flushed to disk: flushing starts once usage hits a certain limit (normally a percentage). This threshold, too, is settable via a job parameter: mapred.job.shuffle.merge.percent (default 0.66). If downloads are fast and easily swell the in-memory cache, tuning this parameter may help reduce performance.
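Putting the two percentages together gives a quick sanity check (ignoring JVM overhead, so approximate): with -Xmx1024m, the shuffle cache is about 0.7 × 1G ≈ 700MB, and flushing to disk begins once roughly 0.66 × 700MB ≈ 462MB of it is occupied.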
Once the reduce has downloaded all the data of its partition from all the maps, the real reduce computation phase begins. (There is a sort phase in between, but it usually takes very little time, finishing in seconds, because the whole download phase already sorts while downloading and merges as it goes.) When the reduce task truly enters the compute phase of the reduce function, one more parameter tunes its behavior: mapred.job.reduce.input.buffer.percent (default 0.0). Since the reduce computation certainly consumes memory itself, while reading the data the reduce needs also requires memory as a buffer, this parameter controls what percentage of the heap is used as the buffer for the reduce to read its already-sorted data. The default is 0, meaning that by default the reduce reads and processes all data starting from disk. If the parameter is greater than 0, a certain amount of data is kept cached in memory and delivered to the reduce; when the reduce's computation logic consumes little memory, part of the memory can be spent caching data, since that reduce memory would otherwise sit idle.
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| mapred.reduce.parallel.copies | int | 5 | Maximum number of threads each reduce uses to download map results in parallel |
| mapred.reduce.copy.backoff | int | 300 | Maximum wait time (in seconds) for a reduce download thread |
| io.sort.factor | int | 10 | Same as above (also governs the reduce-side merge) |
| mapred.job.shuffle.input.buffer.percent | float | 0.7 | Percentage of the reduce task heap used to cache shuffle data |
| mapred.job.shuffle.merge.percent | float | 0.66 | Fill percentage of the cache at which the merge starts |
| mapred.job.reduce.input.buffer.percent | float | 0.0 | Percentage of heap used to cache data for the reduce compute phase after sort completes |
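As with the map-side knobs, the parameters summarized above can be set on the job Configuration in the driver. A hedged sketch; the values are illustrative, not recommendations:

```java
// In the job driver, before submission:
conf.setInt("mapred.reduce.parallel.copies", 10);   // more fetch threads for many fast maps
conf.setInt("mapred.reduce.copy.backoff", 600);     // tolerate a slow network longer
conf.setFloat("mapred.job.reduce.input.buffer.percent", 0.2f); // keep some sorted data in memory
```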
When processing a batch of data with a Hadoop MapReduce job, the business requirement was that one file correspond to exactly one map; if two or more maps processed the same file, there could be problems. At first I thought of controlling the number of maps by setting dfs.blocksize or the mapreduce.input.fileinputformat.split.minsize/maxsize parameters; later I realized it needn't be that complicated: in a custom InputFormat, simply prevent files from being split.
```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class CustemDocInputFormat extends TextInputFormat {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        DocRecordReader reader = null;
        try {
            reader = new DocRecordReader(); // custom reader
        } catch (IOException e) {
            e.printStackTrace();
        }
        return reader;
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split, so one file goes to exactly one map
    }
}
```
This way, the job launches exactly as many maps as there are input files.
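Wiring the format into a job is then one line in the driver (sketch; the rest of the driver boilerplate is assumed):

```java
// With splitting disabled, the job starts exactly one map per input file.
job.setInputFormatClass(CustemDocInputFormat.class);
```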
Hadoop provides MultiOutputFormat to write result data out to different directories, and FileInputFormat can read data from multiple directories at once, but by default a job can only call job.setInputFormatClass to set a single InputFormat handling a single data format. If you need to read files of different formats from different directories within one job, you have to implement a MultiInputFormat of your own to read the different formats (as it turns out, Hadoop already provides MultipleInputs for this).
For example: a MapReduce job needs to read two formats of data at the same time. One format is plain text files, read line by line with LineRecordReader; the other is XML-like files, read with a custom AJoinRecordReader.
I implemented a simple MultiInputFormat myself, as follows:
```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MultiInputFormat extends TextInputFormat {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        RecordReader<LongWritable, Text> reader = null;
        try {
            String inputfile = ((FileSplit) split).getPath().toString();
            String xmlpath = context.getConfiguration().get("xml_prefix");
            String textpath = context.getConfiguration().get("text_prefix");
            if (-1 != inputfile.indexOf(xmlpath)) {
                reader = new AJoinRecordReader();   // custom XML-like reader
            } else if (-1 != inputfile.indexOf(textpath)) {
                reader = new LineRecordReader();    // plain text, line by line
            } else {
                reader = new LineRecordReader();    // fall back to line reading
            }
        } catch (IOException e) {
            // do something ...
        }
        return reader;
    }
}
```
The principle is actually quite simple: in createRecordReader, get the name of the file currently being processed via ((FileSplit) split).getPath().toString(), then match it against distinguishing features and pick the corresponding RecordReader. xml_prefix and text_prefix can be passed into the Configuration with -D when the program starts.
For example, the values printed during one run were:
```
inputfile=hdfs://test042092.sqa.cm4:9000/test/input_xml/common-part-00068
xmlpath_prefix=hdfs://test042092.sqa.cm4:9000/test/input_xml
textpath_prefix=hdfs://test042092.sqa.cm4:9000/test/input_txt
```
Here the matching is done simply on the file path against a marker string; more elaborate schemes could be used, for example matching on the file name, the file extension, and so on.
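For those prefixes to arrive via -D, the driver must run through ToolRunner, which feeds generic options into the Configuration. A minimal hedged sketch; MultiInputDriver is a hypothetical name:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MultiInputDriver extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        // xml_prefix / text_prefix were injected by, e.g.:
        // hadoop jar app.jar MultiInputDriver -D xml_prefix=/test/input_xml -D text_prefix=/test/input_txt
        System.out.println("xml_prefix=" + conf.get("xml_prefix"));
        // ... build and submit the Job here
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new MultiInputDriver(), args));
    }
}
```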
Then, inside the map class, different processing can likewise be applied according to the different file-name features:
```java
@Override
public void map(LongWritable offset, Text inValue, Context context)
        throws IOException {
    // textpath/xmlpath are assumed to be fields initialized in setup()
    // from the same Configuration keys used by the InputFormat.
    String inputfile = ((FileSplit) context.getInputSplit()).getPath()
            .toString();
    if (-1 != inputfile.indexOf(textpath)) {
        // ......
    } else if (-1 != inputfile.indexOf(xmlpath)) {
        // ......
    } else {
        // ......
    }
}
```
Then it turned out all of this was wasted effort: Hadoop already ships with MultipleInputs, which lets you assign each directory its own InputFormat and a corresponding map class.
```java
MultipleInputs.addInputPath(conf, new Path("/foo"), TextInputFormat.class,
        MapClass.class);
MultipleInputs.addInputPath(conf, new Path("/bar"),
        KeyValueTextInputFormat.class, MapClass2.class);
```
One day I took over a program, written by a colleague, that copies data from one Hadoop cluster to another. The program is a job that runs on the Hadoop cluster; the job has only a map phase, reading the data under an HDFS directory and writing it to the other cluster.
Clearly, this program never considered large data volumes: if the input directory contains many files or a lot of data, there will be very many maps. And one of the data sources we actually needed to copy was over 6T; the job launched with more than 70,000 maps, instantly filling the whole queue's resources. Although the map count (i.e. the concurrency) can be controlled by tweaking a few parameters, it cannot be controlled precisely, and switching data sources means reconfiguring the parameters all over again.
The first improved version added a reduce phase, hoping to control the concurrency by setting the number of reduces. This does control the concurrency precisely, but it adds a shuffle step, and in actual runs the input data turned out to be skewed (and the partition key could not be changed for business reasons), so some machines' network links were saturated, affecting other applications on the cluster. Even limiting the shuffle via the mapred.reduce.parallel.copies parameter treats the symptom rather than the cause. This shuffle step, added for nothing, actually wasted a great deal of network bandwidth and I/O.
Ideally, of course, the job would have only a map phase, with precisely controllable concurrency.
So the second optimized version was born. This job has only a map phase and uses CombineFileInputFormat, which can pack multiple small files into one InputSplit handed to a single map, avoiding the flood of maps launched for masses of small files. Through the mapred.max.split.size parameter the concurrency can be controlled roughly (see the sketch below). I thought this would settle the problem, but then a data-skew problem surfaced: this coarse way of carving out splits leaves some maps with little data to process and others with a lot, quite unevenly, and a few straggling maps made the job's actual running time more than twice as long.
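For reference, the rough split-size cap in this second version amounted to a setting like the following; the 1GB figure is illustrative:

```java
// Pack small files into splits of at most ~1GB each; the map count then
// falls out of (total input size / max split size), but only approximately.
conf.setLong("mapred.max.split.size", 1024L * 1024 * 1024);
```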
It seemed the problem could only be solved perfectly by making every map process the same amount of data.
Thus the third version was born: this time CombineFileInputFormat was rewritten, implementing the getSplits method myself. Since the input data is in SequenceFile format, a SequenceFileRecordReaderWrapper class is needed.
The implementation code follows.

CustomCombineSequenceFileInputFormat.java
```java
import java.io.IOException;

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReaderWrapper;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

/**
 * Input format that is a <code>CombineFileInputFormat</code>-equivalent for
 * <code>SequenceFileInputFormat</code>.
 *
 * @see CombineFileInputFormat
 */
@InterfaceAudience.Public
@InterfaceStability.Stable
public class CustomCombineSequenceFileInputFormat<K, V> extends MultiFileInputFormat<K, V> {
    @SuppressWarnings({"rawtypes", "unchecked"})
    public RecordReader<K, V> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException {
        return new CombineFileRecordReader((CombineFileSplit) split, context,
                SequenceFileRecordReaderWrapper.class);
    }

    /**
     * A record reader that may be passed to <code>CombineFileRecordReader</code> so that it can be
     * used in a <code>CombineFileInputFormat</code>-equivalent for
     * <code>SequenceFileInputFormat</code>.
     *
     * @see CombineFileRecordReader
     * @see CombineFileInputFormat
     * @see SequenceFileInputFormat
     */
    private static class SequenceFileRecordReaderWrapper<K, V>
            extends CombineFileRecordReaderWrapper<K, V> {
        // this constructor signature is required by CombineFileRecordReader
        public SequenceFileRecordReaderWrapper(CombineFileSplit split, TaskAttemptContext context,
                Integer idx) throws IOException, InterruptedException {
            super(new SequenceFileInputFormat<K, V>(), split, context, idx);
        }
    }
}
```
MultiFileInputFormat.java
```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

/**
 * multiple files can be combined in one InputSplit so that InputSplit number can be limited!
 */
public abstract class MultiFileInputFormat<K, V> extends CombineFileInputFormat<K, V> {
    private static final Log LOG = LogFactory.getLog(MultiFileInputFormat.class);
    public static final String CONFNAME_INPUT_SPLIT_MAX_NUM = "multifileinputformat.max_split_num";
    public static final Integer DEFAULT_MAX_SPLIT_NUM = 50;

    public static void setMaxInputSplitNum(Job job, Integer maxSplitNum) {
        job.getConfiguration().setInt(CONFNAME_INPUT_SPLIT_MAX_NUM, maxSplitNum);
    }

    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        // get all the files in input path
        List<FileStatus> stats = listStatus(job);
        List<InputSplit> splits = new ArrayList<InputSplit>();
        if (stats.size() == 0) {
            return splits;
        }
        // compute the average split length
        long totalLen = 0;
        for (FileStatus stat : stats) {
            totalLen += stat.getLen();
        }
        int maxSplitNum = job.getConfiguration().getInt(CONFNAME_INPUT_SPLIT_MAX_NUM, DEFAULT_MAX_SPLIT_NUM);
        int expectSplitNum = maxSplitNum < stats.size() ? maxSplitNum : stats.size();
        long averageLen = totalLen / expectSplitNum;
        LOG.info("Prepare InputSplit : averageLen(" + averageLen + ") totalLen(" + totalLen
                + ") expectSplitNum(" + expectSplitNum + ") ");
        // build the InputSplits
        List<Path> pathLst = new ArrayList<Path>();
        List<Long> offsetLst = new ArrayList<Long>();
        List<Long> lengthLst = new ArrayList<Long>();
        long currentLen = 0;
        for (int i = 0; i < stats.size(); i++) {
            FileStatus stat = stats.get(i);
            pathLst.add(stat.getPath());
            offsetLst.add(0L);
            lengthLst.add(stat.getLen());
            currentLen += stat.getLen();
            if (splits.size() < expectSplitNum - 1 && currentLen > averageLen) {
                Path[] pathArray = new Path[pathLst.size()];
                CombineFileSplit thissplit = new CombineFileSplit(pathLst.toArray(pathArray),
                        getLongArray(offsetLst), getLongArray(lengthLst), new String[0]);
                LOG.info("combineFileSplit(" + splits.size() + ") fileNum(" + pathLst.size()
                        + ") length(" + currentLen + ")");
                splits.add(thissplit);
                // start accumulating the next split
                pathLst.clear();
                offsetLst.clear();
                lengthLst.clear();
                currentLen = 0;
            }
        }
        // the remaining files form the last split
        if (pathLst.size() > 0) {
            Path[] pathArray = new Path[pathLst.size()];
            CombineFileSplit thissplit =
                    new CombineFileSplit(pathLst.toArray(pathArray), getLongArray(offsetLst),
                            getLongArray(lengthLst), new String[0]);
            LOG.info("combineFileSplit(" + splits.size() + ") fileNum(" + pathLst.size()
                    + ") length(" + currentLen + ")");
            splits.add(thissplit);
        }
        return splits;
    }

    private long[] getLongArray(List<Long> lst) {
        long[] rst = new long[lst.size()];
        for (int i = 0; i < lst.size(); i++) {
            rst[i] = lst.get(i);
        }
        return rst;
    }
}
```
With the multifileinputformat.max_split_num parameter the map count can be controlled fairly precisely, and it turns out each map processes a very even amount of data. With that, the problem is finally solved.
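Usage then reduces to a couple of driver calls (sketch; the job name and split cap are illustrative):

```java
Job job = new Job(conf, "cross-cluster-copy");
job.setInputFormatClass(CustomCombineSequenceFileInputFormat.class);
// Cap the job at 500 splits, i.e. at most 500 maps, each handling
// roughly 1/500 of the total input bytes.
MultiFileInputFormat.setMaxInputSplitNum(job, 500);
```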