Crawling dynamic pages for the Google and Yandex search engines (snapshots, _escaped_fragment_, ajax, fragment)

Peace to everyone!

Contents:

1. What crawling is

2. Dynamic crawling

3. Tasks, tools, solution

4. Further reading

5. Conclusion

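1. What crawling is

Crawling is a search engine scanning the pages of a site to obtain the information it needs. The result of this scan is the HTML representation served at the endpoint, also called a "snapshot" of the site. Each search engine has its own settings for this: whether to load JS (and whether to execute it), CSS, images, and so on.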

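2. Dynamic crawling

This section is about crawling dynamic pages, that is, sites whose content is dynamic (also called Ajax content). My project uses Angular.js with the HTML5 router, so URLs look like domain.ru/path rather than domain.ru/#!path. All content changes inside <ng-view></ng-view>, and there is a single index.php plus a special .htaccess configuration, so that everything is still displayed correctly after the page is refreshed.

This is what is set in the Angular router configuration: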

$locationProvider.html5Mode({ enabled: true, requireBase: false });

This is what is written in .htaccess:

 RewriteEngine on

 # Don't rewrite files or directories
 RewriteCond %{REQUEST_FILENAME} -f [OR]
 RewriteCond %{REQUEST_FILENAME} -d
 RewriteRule ^ - [L]

 # Rewrite everything else to index.php to allow html5 state links
 RewriteRule ^ index.php [L]

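3. Tasks, tools, solution

Tasks:

1. Serve the dynamic content of a page exactly as it looks after the application has rendered and initialized

2. Create, optimize, and compress HTML snapshots of the pages

3. Serve the HTML snapshots to search engines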

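Tools:

1. npm installed (npm is the node.js package manager; it lets you manage modules and their dependencies)

2. The html-snapshots module, installed with the following command (the scripts in this article also require html-minifier, which is installed the same way):

  npm install html-snapshots

3. The correct configuration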

Solution:

For speed, I recommend running the crawl on localhost (a local web server).

First, add the following meta tag to the head of the main index.php:

 <meta name="fragment" content="!">

Example sitemap.xml:

 <?xml version="1.0" encoding="UTF-8"?>
 <!-- created with www.mysitemapgenerator.com -->
 <urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
         xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
   <url>
     <loc>http://localhost/domain.ru/www/product/30</loc>
     <lastmod>2016-07-22T19:47:25+01:00</lastmod>
     <priority>1.0</priority>
   </url>
 </urlset>
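The <loc> entries of a sitemap like this one make up the page list for the crawl. If you need them as a plain array, for example to filter them before crawling, a few lines of Node.js are enough. This is only a minimal sketch: extractLocs is a hypothetical helper (not part of html-snapshots), the sample sitemap is made up for the demonstration, and a regular expression stands in for a real XML parser, which is reasonable only for simple sitemaps.

```javascript
// Hypothetical helper: collect every <loc> value from a sitemap string.
// Assumption: the sitemap is simple enough that a regex is sufficient.
function extractLocs(sitemapXml) {
  var locs = [];
  var re = /<loc>\s*([^<]+?)\s*<\/loc>/g;
  var match;
  while ((match = re.exec(sitemapXml)) !== null) {
    locs.push(match[1]);
  }
  return locs;
}

// Made-up sample data for the demonstration.
var sampleSitemap =
  '<?xml version="1.0" encoding="UTF-8"?>' +
  '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' +
  '<url><loc>http://localhost/domain.ru/www/product/30</loc></url>' +
  '<url><loc> http://localhost/domain.ru/www/product/31 </loc></url>' +
  '</urlset>';

var urls = extractLocs(sampleSitemap);
console.log(urls);
```

The resulting array has the same shape as a hand-written list of page URLs, with surrounding whitespace trimmed from each entry.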

Server.js configuration:

 var fs = require("fs");
 var path = require("path");
 var assert = require("assert");
 var htmlSnapshots = require("html-snapshots");
 var minify = require("html-minifier").minify;

 htmlSnapshots.run({
   // Option 1: take the list of pages from a sitemap
   //input: "sitemap",
   //source: "sitemap_localhost.xml",

   // Option 2: take the list of pages from an array of URLs
   input: "array",
   source: ["http://localhost/domain.ru/www/product/30"],
   //protocol: "https",

   // setup and manage the output
   outputDir: path.join(__dirname, "./tmp"),
   // do not empty the output directory before taking snapshots
   outputDirClean: false,
   // the page counts as rendered once this element (inside <ng-view></ng-view>) appears
   selector: "#product",
   // how long to wait for the selector, in milliseconds
   timeout: 120000,
   // PhantomJS options used for the crawl
   phantomjsOptions: [
     "--ssl-protocol=any",
     "--ignore-ssl-errors=true",
     "--load-images=false"
   ]
 }, function (err, snapshotsCompleted) {
   assert.ifError(err);
   console.log("completed snapshots:");

   snapshotsCompleted.forEach(function (snapshotFile) {
     var body = fs.readFileSync(snapshotFile, { encoding: "utf8" });

     // remove inline <style> blocks ([\s\S] also matches line breaks, "." does not)
     var clearBody = body.replace(/<style[^>]*>[\s\S]*?<\/style>/ig, "");

     // replace the local address with the production domain
     clearBody = clearBody.replace(/http:\/\/localhost\/domain\.ru\/www/ig, "//domain.ru");

     // minify the html
     clearBody = minify(clearBody, {
       conservativeCollapse: true,
       removeComments: true,
       removeEmptyAttributes: true,
       removeEmptyElements: true,
       collapseWhitespace: true
     });

     // overwrite the snapshot with the optimized version
     fs.writeFileSync(snapshotFile, clearBody);
   });

   // reached only after every snapshot has been processed
   console.log("FINISH");
 });

Run Server.js with the command:

 node server

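How the algorithm works:

1. First it crawls all the pages

2. It creates the files and names the folders after the URL: product/30/index.html (the file can be index.html or product/30.html, whichever is more convenient for you)

3. It then calls the callback (snapshotsCompleted), which optimizes the index.html snapshot of each page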
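The naming scheme in step 2 can be pictured with a small function. This sketch only illustrates the "folders named after the URL path" idea; snapshotPathFor is a hypothetical helper and not html-snapshots' own code, so the module's real layout may differ in details.

```javascript
var path = require("path");

// Hypothetical helper: map a page URL to the snapshot file that represents it,
// using one folder per URL path segment and an index.html inside the last one.
function snapshotPathFor(pageUrl, outputDir) {
  var pathname = new URL(pageUrl).pathname;            // "/domain.ru/www/product/30"
  var segments = pathname.split("/").filter(Boolean);  // drop empty segments
  return path.posix.join.apply(null, [outputDir].concat(segments, ["index.html"]));
}

console.log(snapshotPathFor("http://localhost/domain.ru/www/product/30", "tmp"));
// tmp/domain.ru/www/product/30/index.html
```

With this layout the folder structure mirrors the site's URLs, so a snapshot can later be located from the request path alone.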

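The snapshots of your site are ready; what remains is to serve them to search bots when they arrive:

index.php

 <?php
 if (isset($_GET['_escaped_fragment_'])) {
     if ($_GET['_escaped_fragment_'] != '') {
         // hashbang URLs: the path arrives in the _escaped_fragment_ parameter
         $val = $_GET['_escaped_fragment_'];
     } else {
         // html5 push state URLs: the path is the request path itself
         $val = (string) parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
     }

     // never include a path taken from the request without validating it:
     // resolve it and make sure it stays inside the snapshots directory
     $base = realpath(__DIR__ . '/snapshots');
     $file = realpath($base . '/' . trim($val, '/') . '/index.html');

     if ($base !== false && $file !== false
             && strpos($file, $base . DIRECTORY_SEPARATOR) === 0) {
         // snapshots are static html, so output them as-is
         readfile($file);
     } else {
         http_response_code(404);
     }
 } else {
     include_once('pages/home.php');
 }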

Explanation

1. html5 push state

If you use html5 push state (recommended):

Just add this meta tag to the head of your pages

 <meta name="fragment" content="!">

If your URLs look like this:

www.example.com/user/1

then the crawler will request the URL like this:

www.example.com/user/1?_escaped_fragment_=

2. Hashbang

If you use the hashbang (#!):

If your URLs look like this:

www.example.com/#!/user/1

then the crawler will request the URL like this:

www.example.com/?_escaped_fragment_=/user/1
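Both mappings can be written down as a short sketch. The function names toCrawlerUrl and snapshotKeyFromRequest are hypothetical and exist only for illustration: the first shows which URL a crawler following the _escaped_fragment_ scheme would request, the second shows how a server could get back from that request to the path of the page.

```javascript
// Hypothetical helper: the URL a crawler requests for a given "pretty" URL.
function toCrawlerUrl(prettyUrl) {
  var hashIndex = prettyUrl.indexOf("#!");
  var base = hashIndex === -1 ? prettyUrl : prettyUrl.slice(0, hashIndex);
  var fragment = hashIndex === -1 ? "" : prettyUrl.slice(hashIndex + 2);
  var separator = base.indexOf("?") === -1 ? "?" : "&";
  // Escape the characters that would change the meaning of the query string.
  var escaped = fragment.replace(/[%#&+\s]/g, function (ch) {
    return encodeURIComponent(ch);
  });
  return base + separator + "_escaped_fragment_=" + escaped;
}

// Hypothetical helper: the page path behind a crawler request,
// or null when the request is not a crawler request at all.
function snapshotKeyFromRequest(requestUrl) {
  var parsed = new URL(requestUrl);
  var fragment = parsed.searchParams.get("_escaped_fragment_");
  if (fragment === null) {
    return null;
  }
  // hashbang: path is in the parameter; push state: path is the URL path
  return fragment !== "" ? fragment : parsed.pathname;
}

console.log(toCrawlerUrl("http://www.example.com/user/1"));
// http://www.example.com/user/1?_escaped_fragment_=
console.log(toCrawlerUrl("http://www.example.com/#!/user/1"));
// http://www.example.com/?_escaped_fragment_=/user/1
console.log(snapshotKeyFromRequest("http://www.example.com/?_escaped_fragment_=/user/1"));
// /user/1
```

In both cases the server ends up with the same key, /user/1, which is why one snapshot folder per page is enough for either URL style.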

Additionally, if you already have the snapshots but have not optimized them yet:

 var fs = require("fs");
 var path = require("path");
 var minify = require("html-minifier").minify;

 // directory that holds the snapshots
 var myPath = path.join(__dirname, "./tmp/domain.ru/www/");

 // recursively collect every file under dir
 function getFiles(dir, files_) {
   files_ = files_ || [];
   fs.readdirSync(dir).forEach(function (entry) {
     var name = path.join(dir, entry);
     if (fs.statSync(name).isDirectory()) {
       getFiles(name, files_);
     } else {
       files_.push(name);
     }
   });
   return files_;
 }

 var allFiles = getFiles(myPath);
 //var allFiles = [ 'C:\\xampp\\htdocs\\nodejs\\crawler\\tmp\\domain.ru\\www\\/product/30/index.html' ];

 allFiles.forEach(function (snapshotFile) {
   var body = fs.readFileSync(snapshotFile, { encoding: "utf8" });

   // remove inline <style> blocks ([\s\S] also matches line breaks, "." does not)
   var clearBody = body.replace(/<style[^>]*>[\s\S]*?<\/style>/ig, "");

   // replace the local address with the production domain
   clearBody = clearBody.replace(/http:\/\/localhost\/domain\.ru\/www/ig, "//domain.ru");

   // minify the html
   clearBody = minify(clearBody, {
     conservativeCollapse: true,
     removeComments: true,
     removeEmptyAttributes: true,
     removeEmptyElements: true,
     collapseWhitespace: true
   });

   // remove the social links block
   clearBody = clearBody.replace(/<ul class="social-links">[\s\S]*?<\/ul>/ig, "");

   // overwrite the snapshot with the optimized version
   fs.writeFileSync(snapshotFile, clearBody);
 });

 console.log("COMPLETE");
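The clean-up steps used in both scripts can be tried in isolation on a string before running them over real files. A minimal sketch, assuming a made-up sample page: cleanSnapshot is a hypothetical helper, and its pattern uses [\s\S] rather than "." so that a <style> block spanning several lines is matched as well (in JavaScript, "." does not match line breaks unless the s flag is set).

```javascript
// Hypothetical helper: strip <style> blocks and point local links at the live domain.
function cleanSnapshot(html) {
  return html
    .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, "")
    .replace(/http:\/\/localhost\/domain\.ru\/www/gi, "//domain.ru");
}

// Made-up sample page with a multi-line style block and a local link.
var sample =
  '<head><style type="text/css">\n.ng-hide { display: none; }\n</style></head>' +
  '<body><a href="http://localhost/domain.ru/www/product/30">Product</a></body>';

console.log(cleanSnapshot(sample));
// <head></head><body><a href="//domain.ru/product/30">Product</a></body>
```

Keeping these replacements in one pure function also makes it easy to check a new pattern against a saved snapshot before overwriting any files.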

4. Further reading

stackoverflow.com/questions/2727167/getting-all-filenames-in-a-directory-with-node-js - working with files in node.js

github.com/localnerve/html-snapshots - snapshot module documentation

perfectionkills.com/experimenting-with-html-minifier - html-minifier options

yandex.ru/support/webmaster/robot-workings/ajax-indexing.xml - Yandex crawler information

developers.google.com/webmasters/ajax-crawling/docs/specification - Google crawler information

www.ng-newsletter.com/posts/serious-angular-seo.html - article

prerender.io/js-seo/angularjs-seo-get-your-site-indexed-and-to-the-to-the-the-search-results - article

prerender.io/documentation - article

regexr.com - regexr

stackoverflow.com/questions/15618005/jquery-regexp-selecting-and-removeclass - regular expressions

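5. Conclusion

You can now safely build any SPA application without worrying about how search bots will crawl it. You can also choose a suitable configuration of the tools for both the "server" and the "client".

Success to all professionals!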
