Countless Files and One Laptop

Using an online store as the example, let's look at how to analyze one million files on a laptop.

If you have a reasonably modern computer, sensible use of the GNU Parallel utility and stream processing lets you handle "medium-sized" data.

Step 1: Concatenation (cat * >> out.txt ?!)

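The cat utility on Unix systems is familiar to almost anyone who has ever opened a Terminal. Normally it is enough to select all or some of the files in a folder and join them into one big file. With a very large number of files, however, you quickly run into this: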

    $ cat * >> out.txt
    -bash: /bin/cat: Argument list too long

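There are more files than a single command invocation can accept. The shell expands the asterisk into 1,234,567 file names before cat ever runs, and the resulting argument list exceeds the limit the operating system places on a command's arguments. Many Unix tools accept only about 10,000 arguments; the exact limit is a size in bytes that varies by system. The result is the error message above.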
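To make the limit concrete, here is a minimal sketch that prints the system's limit and measures roughly how large a glob expansion would be. The demo directory and file names are hypothetical, and the byte count is only an approximation of what the kernel actually accounts for.

```shell
#!/usr/bin/env bash
# Limit (in bytes) on the combined size of a new process's arguments and
# environment; the value differs between systems.
arg_max=$(getconf ARG_MAX)
echo "ARG_MAX: ${arg_max} bytes"

# Hypothetical demo directory with five small files.
demo_dir=$(mktemp -d)
for i in 1 2 3 4 5; do
  echo "row $i" > "$demo_dir/file_$i.csv"
done

# Approximate size of the argument list that `cat *` would build here:
# every expanded name plus its terminating NUL byte. printf is a shell
# builtin, so this measurement never hits the limit itself.
glob_bytes=$(cd "$demo_dir" && printf '%s\0' * | wc -c | tr -d ' ')
echo "glob expansion: ${glob_bytes} bytes"
```

Comparing the two numbers for a real directory gives a rough idea of whether a plain `cat *` has any chance of working.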

Instead, you can loop over the files one at a time:

 for f in *; do cat "$f" >> ../transactions_cat/transactions.csv; done

About 10,093 seconds later, the combined file is ready.

Step 2: GNU Parallel & Concatenation

GNU Parallel can speed this up, however:

 ls | parallel -m -j $f "cat {} >> ../transactions_cat/transactions.csv"

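The $f argument is called out in the code because it sets the level of parallelism, that is, how many jobs run at once. The speedup does not scale linearly with the number of jobs, however (the original post illustrated this with a chart and linked the graph code).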
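As a hedged illustration of choosing a value for the job count, the sketch below starts from the number of online CPU cores, which is an assumption rather than a recommendation from the source. The demo files and the `n_jobs` variable are hypothetical. Unlike the command above, it lets parallel collect the output instead of appending from inside each job, and it falls back to a serial xargs call when GNU Parallel is not installed.

```shell
#!/usr/bin/env bash
# Hypothetical demo data: four one-line files.
src_dir=$(mktemp -d)
out_file=$(mktemp)
for i in 1 2 3 4; do
  echo "transaction $i" > "$src_dir/t_$i.csv"
done

# Assumption: one job per online CPU core is a sensible starting point.
n_jobs=$(getconf _NPROCESSORS_ONLN)

if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
  # -0 reads NUL-separated names, -m packs many names into each cat call,
  # -j caps the number of simultaneous jobs. parallel buffers each job's
  # output, so concurrent jobs do not interleave lines in the result.
  find "$src_dir" -type f -print0 | parallel -0 -m -j "$n_jobs" cat {} > "$out_file"
else
  # Serial stand-in so the sketch still runs without GNU Parallel.
  find "$src_dir" -type f -print0 | xargs -0 cat > "$out_file"
fi

line_count=$(wc -l < "$out_file" | tr -d ' ')
echo "combined $line_count lines using up to $n_jobs jobs"
```

Timing the same command with a few different `-j` values is the simplest way to see where the returns start to diminish on a given machine.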

Step 3: Data > RAM

Once the million files have been merged into one, another problem appears: 19.93 GB of data does not fit in RAM (this is a 2014 MBP laptop with 16 GB of RAM). The analysis therefore needs either a more powerful machine or a way to process the data as a stream or in chunks.

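Let's stay with GNU Parallel, though, and use it to answer a few questions about operational data (again using the online store example):

How many unique products were sold?

How many transactions were made per day?

How many products did each store sell per month?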

Unique products

    # Serial method (i.e. no parallelism)
    # This is a simple implementation of map & reduce; tr statements represent one map, sort -u statements one reducer
    # cut -d ' ' -f 5- transactions.csv | \ - Using cut, take everything from the 5th column and over from the transactions.csv file
    # tr -d \" | \ - Using tr, trim off double-quotes. This leaves us with a comma-delimited string of products representing a transaction
    # tr ',' '\n' | \ - Using tr, turn commas into newlines so that each product sits on its own line
    # sort -u | \ - Using sort, put similar items together, but only output the unique values
    # wc -l - Count number of unique lines, which after de-duping, represents number of unique products
    $ time cut -d ' ' -f 5- transactions.csv | tr -d \" | tr ',' '\n' | sort -u | wc -l
    331
    real 292m7.116s

    # Parallelized version, default chunk size of 1MB. This will use 100% of all CPUs (real and virtual)
    # Also map & reduce; tr statements a single map, sort -u statements multiple reducers (8 by default)
    $ time cut -d ' ' -f 5- transactions.csv | tr -d \" | tr ',' '\n' | parallel --pipe --block 1M sort -u | sort -u | wc -l
    331

    # block size performance - Making block size smaller might improve performance
    # Number of jobs can also be manipulated (not evaluated)
    # --500K: 73m57.232s
    # --Default 1M: 75m55.268s (3.84x faster than serial)
    # --2M: 79m30.950s
    # --3M: 80m43.311s
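The parallel version needs a second `sort -u` because each block is de-duplicated on its own, so the same product can survive in several blocks. The sketch below shows this on a tiny sample. The row layout is an assumption inferred from the cut and tr commands above, the data is made up, and `split` stands in for the block splitting that `parallel --pipe` performs, so that the example runs without GNU Parallel.

```shell
#!/usr/bin/env bash
# Hypothetical sample rows: space-delimited, product list in field 5.
work_dir=$(mktemp -d)
cat > "$work_dir/transactions.csv" <<'EOF'
1 00001 "2014-01-01 10:15:00" "milk,bread,eggs"
2 00001 "2014-01-01 11:30:00" "bread"
3 00002 "2014-01-02 09:00:00" "eggs,tea"
4 00002 "2014-02-03 16:45:00" "tea,milk,jam,bread"
EOF

# Map step: emit one product per line.
to_products () { cut -d ' ' -f 5- "$1" | tr -d '"' | tr ',' '\n'; }

# Serial: a single reducer sees every product.
serial_count=$(to_products "$work_dir/transactions.csv" | sort -u | wc -l | tr -d ' ')

# Chunked: split (a stand-in for parallel --pipe) cuts the stream into
# blocks of 3 lines; each block is de-duplicated separately.
to_products "$work_dir/transactions.csv" | split -l 3 - "$work_dir/chunk_"
partial_count=$(for c in "$work_dir"/chunk_*; do sort -u "$c"; done | wc -l | tr -d ' ')

# The final sort -u merges the per-block results into the true answer.
chunked_count=$(for c in "$work_dir"/chunk_*; do sort -u "$c"; done | sort -u | wc -l | tr -d ' ')

echo "serial=$serial_count partial=$partial_count chunked=$chunked_count"
```

Here the per-block results still contain duplicates across blocks, which is why the count before the final merge is larger than the serial answer.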

Daily transactions

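While the file format was arguably awkward for the first question, it is perfect for the second. Each row represents one transaction, so all we need is the equivalent of a SQL "Group By" on the date, counting the rows in each group.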

    # Data is at transaction level, so just need to do equivalent of 'group by' operation
    # Using cut again, we choose field 3, which is the date part of the timestamp
    # sort | uniq -c is a common pattern for doing a 'group by' count operation
    # Final tr step is to trim the leading quotation mark from date string
    time cut -d ' ' -f 3 transactions.csv | sort | uniq -c | tr -d \"
    real 76m51.223s

    # Parallelized version
    # Quoting can be annoying when using parallel, so writing a Bash function is often much easier than dealing with escaping quotes
    # To do 'group by' operation using awk, need to use an associative array
    # Because we are doing parallel operations, need to pass awk output to awk again to return final counts
    awksub () { awk '{a[$3]+=1;}END{for(i in a)print i" "a[i];}';}
    export -f awksub
    time parallel --pipe awksub < transactions.csv | \
        awk '{a[$1]+=$2;}END{for(i in a)print i" "a[i];}' | \
        tr -d \" | \
        sort
    real 8m22.674s (9.05x faster than serial)
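The two awk stages can be traced by hand on a small sample. In this sketch the rows are made up, the layout is an assumption inferred from the commands above, and `split` again stands in for `parallel --pipe`. Each chunk yields partial counts per date, and the second awk adds up the partial counts that share a date.

```shell
#!/usr/bin/env bash
# Hypothetical sample: three transactions on one day, two on the next.
work_dir=$(mktemp -d)
cat > "$work_dir/transactions.csv" <<'EOF'
1 00001 "2014-01-01 10:15:00" "milk"
2 00001 "2014-01-01 11:30:00" "bread"
3 00002 "2014-01-01 12:00:00" "eggs"
4 00002 "2014-01-02 09:00:00" "tea"
5 00001 "2014-01-02 16:45:00" "jam"
EOF

# Per-chunk counter from the text: count rows per date (field 3).
awksub () { awk '{a[$3]+=1;}END{for(i in a)print i" "a[i];}'; }

# Chunks of 2 rows, so 2014-01-01 is spread over two chunks.
split -l 2 "$work_dir/transactions.csv" "$work_dir/chunk_"
partial=$(for c in "$work_dir"/chunk_*; do awksub < "$c"; done)
partial_rows=$(printf '%s\n' "$partial" | wc -l | tr -d ' ')

# Second stage: sum the partial counts per date, strip the leading quote.
daily=$(printf '%s\n' "$partial" |
  awk '{a[$1]+=$2;}END{for(i in a)print i" "a[i];}' |
  tr -d '"' |
  sort)
echo "$daily"
```

The first stage emits four partial rows for only two distinct dates, which is exactly the situation the second awk is there to resolve.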

Total products sold per store per month

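In this example it may simply be that my command-line fu is weak, but the serial method turned out to be the fastest of the approaches tried. Of course, with a run time of 14 minutes, the real-time benefit of parallelizing would not be very large anyway.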

    # Serial method uses 40-50% all available CPU prior to `sort` step. Assuming linear scaling, best we could achieve is halving the time.
    # Grand Assertion: this pipeline actually gives correct answer! This is a very complex way to calculate this, SQL would be so much easier...
    # cut -d ' ' -f 2,3,5 - Take fields 2, 3, and 5 (store, timestamp, transaction)
    # tr -d '[A-Za-z\"/\- ]' - Strip out all the characters and spaces, to just leave the store number, timestamp, and commas to represent the number of items
    # awk '{print (substr($1,1,5)"-"substr($1,6,6)), length(substr($1,14))+1}' - Split the string at the store, yearmo boundary, then count number of commas + 1 (since 3 commas = 4 items)
    # awk '{a[$1]+=$2;}END{for(i in a)print i" "a[i];}' - Sum by store-yearmo combo
    # sort - Sort such that the store number is together, then the month
    time cut -d ' ' -f 2,3,5 transactions.csv | \
        tr -d '[A-Za-z\"/\- ]' | \
        awk '{print (substr($1,1,5)"-"substr($1,6,6)), length(substr($1,14))+1}' | \
        awk '{a[$1]+=$2;}END{for(i in a)print i" "a[i];}' | \
        sort
    real 14m5.657s

    # Parallelize the substring awk step
    # Actually lowers processor utilization!
    awksub2 () { awk '{print (substr($1,1,5)"-"substr($1,6,6)), length(substr($1,14))+1}';}
    export -f awksub2
    time cut -d ' ' -f 2,3,5 transactions.csv | \
        tr -d '[A-Za-z\"/\- ]' | \
        parallel --pipe -m awksub2 | \
        awk '{a[$1]+=$2;}END{for(i in a)print i" "a[i];}' | \
        sort
    real 19m27.407s (worse!)

    # Move parallel to aggregation step
    awksub3 () { awk '{a[$1]+=$2;}END{for(i in a)print i" "a[i];}';}
    export -f awksub3
    time cut -d ' ' -f 2,3,5 transactions.csv | \
        tr -d '[A-Za-z\"/\- ]' | \
        awk '{print (substr($1,1,5)"-"substr($1,6,6)), length(substr($1,14))+1}' | \
        parallel --pipe awksub3 | \
        awksub3 | \
        sort
    real 19m24.851s (Same as other parallel run)
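Because the substring arithmetic is hard to follow, here is a small worked example of the serial pipeline. The sample rows are made up and rest on assumptions that the source implies but does not state: store numbers are five digits, dates are written as YYYY-MM-DD, and product names contain no digits. The first awk is written without the outer parentheses but computes the same values.

```shell
#!/usr/bin/env bash
# Hypothetical sample rows (store in field 2, date in field 3, products in field 5).
work_dir=$(mktemp -d)
cat > "$work_dir/transactions.csv" <<'EOF'
1 00001 "2014-01-01 10:15:00" "milk,bread,eggs"
2 00001 "2014-01-01 11:30:00" "bread"
3 00002 "2014-01-02 09:00:00" "eggs,tea"
4 00002 "2014-02-03 16:45:00" "tea,milk,jam,bread"
EOF

# After cut and tr only digits and commas survive: a 5-digit store,
# an 8-digit date, then one comma per product separator.
stripped=$(head -n 1 "$work_dir/transactions.csv" |
  cut -d ' ' -f 2,3,5 |
  tr -d '[A-Za-z\"/\- ]')
echo "first row after tr: $stripped"

# Characters 1-5 are the store, 6-11 the year and month, and everything
# from position 14 on is commas, so item count = commas + 1.
totals=$(cut -d ' ' -f 2,3,5 "$work_dir/transactions.csv" |
  tr -d '[A-Za-z\"/\- ]' |
  awk '{print substr($1,1,5) "-" substr($1,6,6), length(substr($1,14))+1}' |
  awk '{a[$1]+=$2;}END{for(i in a)print i" "a[i];}' |
  sort)
echo "$totals"
```

A transaction with a single product leaves no commas at all, which is why the `+1` is needed for the count to come out right.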

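These three examples show that GNU Parallel makes it possible to process datasets larger than RAM in a reasonable amount of time. They also show, however, how complicated working with Unix utilities can become. Once a pipeline grows so long that you lose track of its logic, moving it into a command-line script helps you escape the "one-liner" syndrome. In the end, though, problems like these are often easier to solve with other tools.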
