Rendering & Crawlers


Posted by TempuraEngineer on 2022-09-17



Static/Dynamic Websites & Scripts

script(scripting language)

A script is a program written in a scripting language. Scripts are divided into client-side scripts and server-side scripts.

A script or scripting language is a computer language with several commands within a file capable of being executed without being compiled.

  • Client-side script

    A script that runs on the client, for example JavaScript.

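  • Server-side script

    Code that runs on the server, written in, for example, Java, JavaScript (on Node.js), PHP, or C. Strictly speaking, Node.js is a runtime rather than a language, and Java and C are compiled languages rather than scripting languages, but all of them can run server-side code. It provides an interface to users and controls which data they can operate on.

    When the client sends a request, the server prepares the data that goes into the page and sends it back to the client.

    When the client receives the response, if it contains client-side scripts, the browser runs them as part of loading the page.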
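The request/response flow described above can be sketched with Node.js's built-in `http` module. This is only a minimal illustration, not how any particular site is built: `loadMessage`, `buildPage`, `handleRequest`, and `startServer` are hypothetical names, and the message is a hard-coded stand-in for data that would normally come from a database.

```javascript
// Minimal sketch of a server-side script (Node.js built-in http module only).
const http = require('http');

// Hypothetical data source; a real site would typically query a database here.
function loadMessage() {
  return 'Hello from the server';
}

// Builds the HTML response. The inline <script> is a client-side script:
// the server only sends it as text, and the browser is what executes it.
function buildPage(message) {
  return [
    '<!DOCTYPE html>',
    '<html>',
    '<body>',
    `<p id="msg">${message}</p>`,
    '<script>',
    "document.getElementById('msg').textContent += ' (updated by client-side script)';",
    '</script>',
    '</body>',
    '</html>',
  ].join('\n');
}

// Request handler: runs on the server for every incoming request.
function handleRequest(req, res) {
  res.writeHead(200, { 'Content-Type': 'text/html; charset=utf-8' });
  res.end(buildPage(loadMessage()));
}

// Not started automatically; call startServer(3000) to try it locally.
function startServer(port) {
  return http.createServer(handleRequest).listen(port);
}
```

The server decides what data goes into the page; the browser only runs the inline script after the response arrives.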


Static vs Dynamic

[Figure: static vs dynamic]

(Image from learnwebskill)

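| | Static | Dynamic |
|---|---|---|
| Definition | Built only with HTML, CSS, and JavaScript | Built with a server-side language (server-side script) |
| Database, server | Not required | Required |
| Advantages | 1. Loads faster than a dynamic site (it is already HTML, or is converted to HTML at build time). Loading speed is a key factor in how Google evaluates site performance, and poor performance lowers the SEO score. 2. More secure: it sends no requests, so there is little to attack. | Users can send requests that change the page content, so the content is flexible. |
| Disadvantages | 1. The content is fixed; changing it means editing the HTML. 2. On a large site, one file per page is hard to maintain. | 1. Switching views requires a request, which hurts the user experience. 2. Heavier load on the server. |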

In practice, though, not every website can be cleanly classified as static or dynamic.


Rendering

Rendering refers to when, where, and how a template and data are combined to produce the final page content.

Rendering in the context of this series refers to how/when/where template (a preliminary version of markup) and data are combined to create the final markup content of a site.
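To make "template and data are combined" concrete, here is a minimal sketch of a template function. It is a hand-rolled stand-in, not the API of EJS, Pug, or any real template engine; the `{{ key }}` placeholder syntax and the names `escapeHtml` and `renderTemplate` are assumptions made for illustration.

```javascript
// Escape characters that have special meaning in HTML.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Replaces {{ key }} placeholders in the template with escaped values from
// `data`. Placeholders without a matching key are left untouched.
function renderTemplate(template, data) {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, key) =>
    Object.prototype.hasOwnProperty.call(data, key) ? escapeHtml(data[key]) : match
  );
}

// Example:
// renderTemplate('<h1>{{ title }}</h1>', { title: 'Rendering' })
```

Whether this function runs on the server or in the browser is exactly the difference between the rendering modes discussed next.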

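Rendering is divided into client-side rendering (CSR) and server-side rendering (SSR).

  • Client-side rendering (CSR)

    CSR means rendering is done entirely by JavaScript on the client. On mobile devices its performance can therefore be worse than SSR, because manipulating the DOM is often more computationally expensive than requesting a new page from the server.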

    CSR SPAs are Javascript intensive therefore, features and performance depend heavily on the browser and the device. DOM manipulation can often be more computationally expensive than requesting a new page from a server.

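    CSR is common in SPAs. These sites usually send requests with AJAX (XMLHttpRequest), fetch, or APIs provided by third-party libraries and frameworks.

    CSR does not have to mean SPA; it can also be an MPA (e.g. a multi-page static site). Likewise, an SPA does not have to be CSR; it can also be universal JavaScript.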


  • Server-side rendering (SSR)

    SSR means that when the client sends a request, a template engine on the server (e.g. EJS, Pug) renders the template and returns HTML, which the client then loads.


  • Isomorphic / Universal JavaScript

    Isomorphic JavaScript and universal JavaScript are the same concept: the same JavaScript code can run on both the client and the server.

    Isomorphic JavaScript applications are applications written in JavaScript that can run both on the client and on the server.

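    Universal JavaScript can be thought of as a hybrid of SSR and CSR (with hydration): the first page is rendered on the server and subsequent navigation is rendered on the client. This effectively addresses the SEO problem and gives a better user experience than traditional SSR. Nuxt's SSR (universal) mode works this way.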
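The three modes above can be sketched with one shared render function. This is a simplified illustration under stated assumptions, not how Nuxt or Next actually implement it: the `posts` data shape, the `/app.js` bundle path, and the `window.__INITIAL_STATE__` variable name are all hypothetical.

```javascript
// Shared code: the same functions run on the server and in the browser.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

function renderPosts(posts) {
  const items = posts.map((post) => `<li>${escapeHtml(post.title)}</li>`);
  return `<ul>${items.join('')}</ul>`;
}

// Server side (SSR): produce complete HTML for the first request and embed
// the data so the client can pick up where the server left off.
function renderFirstPage(posts) {
  // Escape "<" so the JSON cannot close the surrounding <script> element.
  const state = JSON.stringify(posts).replace(/</g, '\\u003c');
  return [
    '<!DOCTYPE html>',
    '<html><body>',
    `<div id="app">${renderPosts(posts)}</div>`,
    `<script>window.__INITIAL_STATE__ = ${state};</script>`,
    '<script src="/app.js"></script>',
    '</body></html>',
  ].join('\n');
}

// Client side (CSR): later updates re-render in the browser without asking
// the server for a new HTML page. `container` is a DOM element such as
// document.getElementById('app').
function updateOnClient(container, posts) {
  container.innerHTML = renderPosts(posts);
}
```

Because the first response already contains the rendered list, a crawler that does not run JavaScript still sees the content, while users get client-side updates afterwards.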


Comparison

A quick summary:

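| | SSR | CSR |
|---|---|---|
| SPA | Sites built with Nuxt or Next (actually universal JavaScript) | Sites built with Vue or React |
| MPA | Traditional dynamic pages | Plain static pages without a router |

| | SEO | Link preview | Server (hosting) | Maintenance as users grow | Offline support | User experience | Performance |
|---|---|---|---|---|---|---|---|
| SSR MPA | 🌸🌸🌸 | 🌸🌸🌸 | 🌸🌸 | 🌸 | 🌸 | 🌸 | 🌸🌸 |
| SSR SPA | 🌸🌸🌸 | 🌸🌸🌸 | 🌸🌸 | 🌸 | 🌸🌸 | 🌸🌸🌸 | 🌸🌸 |
| CSR MPA | 🌸🌸🌸 | 🌸🌸🌸 | 🌸🌸🌸 | 🌸🌸🌸 | 🌸🌸 | 🌸🌸 | 🌸🌸🌸 |
| CSR SPA | 🌸 | 🌸 | 🌸🌸🌸 | 🌸🌸🌸 | 🌸🌸🌸 | 🌸🌸🌸 | 🌸 |

(More 🌸 means easier.)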


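SEO & JavaScript

SPA sites are often said to have very poor SEO (because CSR renders with JavaScript), but it is not hopeless; it is fairer to say that it makes SEO harder.

Google & Bing: search engine improvements

Around 2010, the Google and Bing crawlers were already able to crawl the content of JavaScript pages, and both have kept improving how their crawlers handle JavaScript sites since then.

The following were published in 2019: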

The new evergreen Bingbot simplifying SEO by leveraging Microsoft Edge
Introducing a new JavaScript SEO video series
The new evergreen Googlebot

The 2019 Bing article says that Bing is adopting Microsoft Edge as its engine to run JavaScript and render pages, and that this will ease SEO for web developers:

Today we’re announcing that Bing is adopting Microsoft Edge as the Bing engine to run JavaScript and render web pages. Doing so will create less fragmentation of the web and ease Search Engines Optimization (SEO) for all web developers.

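In fact, Googlebot could already crawl the content of JavaScript pages in a rudimentary way in 2008, and Google's 2019 article also says the crawler's rendering engine was upgraded: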

we are happy to announce that Googlebot now runs the latest Chromium rendering engine (74 at the time of this post) when rendering pages for Search.
Moving forward, Googlebot will regularly update its rendering engine to ensure support for latest web platform features.
Compared to the previous version, Googlebot now supports 1000+ new features, like:

・ ES6 and newer JavaScript features
・ IntersectionObserver for lazy-loading
・ Web Components v1 APIs


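What the crawler does

(Image from Google Search Central)

Googlebot is the crawler of Google's search engine.

Processing a JavaScript site can be roughly divided into three phases: crawling, rendering, and indexing.

  1. Before Googlebot fetches a URL, it first reads the robots.txt file in the site root to check which paths may be crawled. If the URL is disallowed, Googlebot skips it.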

    The disallow directive specifies paths that must not be accessed by the crawlers

    Google can't index the content of pages which are disallowed for crawling, but it may still index the URL and show it in search results without a snippet.

    # If the user agent is Googlebot or AdsBot-Google, no URL starting with https://tempura327.github.io/The-F2E-tourism/ may be crawled
    # Paths in Disallow rules are case-sensitive
    
    User-agent: Googlebot
    User-agent: AdsBot-Google
    Disallow: /The-F2E-tourism/
    
  2. Parse the HTML, find the URLs in the href attributes of <a> elements, and add them to the crawl queue. For links you don't want crawled, add rel="nofollow".

    Use the nofollow value when other values don't apply, and you'd rather Google not associate your site with, or crawl the linked page from, your site. For links within your own site, use the robots.txt disallow rule.

    <!-- Do not follow this link -->
    <a rel="nofollow" href="https://...">Foo</a>
    
    <!-- Do not show this element's text in the search result snippet -->
    <div data-nosnippet>not in snippet</div>
    
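  3. Pages that have passed the steps above are put into the render queue.

    If you don't want a page indexed, add noindex to a robots meta tag; Googlebot then does not queue the page for rendering. Other pages may stay in the render queue for a few seconds, or longer.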

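    <!-- Do not show this page, media, or resource in search results -->
    <meta name="googlebot" content="noindex">
    
    <!-- Do not show a text snippet or video preview for this page in search results. A static image thumbnail may still be shown if it results in a better user experience -->
    <meta name="googlebot-news" content="nosnippet">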
    
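  4. Rendering

    • Ordinary static sites and SSR sites
      If the HTML contains no script, the rendering engine renders the page directly.

      If the HTML contains scripts, they are executed before rendering; an external script has to be downloaded by the crawler first.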

      <!-- inline script -->
      <script>
       function foo(){
         return 'foo';
       }
      
       foo();
      </script>
      
      <!-- external script -->
      <script type="text/javascript" src="https://www./.../index.js"></script>
      

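      External scripts, however, may run into the crawl budget problem.

      Crawl budget is the number of times a site can be crawled within a period without causing problems or degrading the experience of its users.

      Search engines calculate it from how much crawling the site's server can handle (the crawl rate limit) and from crawl demand (which depends on the site's popularity and staleness), but the number is never unlimited.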

      Googlebot is designed to be a good citizen of the web. Crawling is its main priority, while making sure it doesn't degrade the experience of users visiting the site. We call this the "crawl rate limit," which limits the maximum fetching rate for a given site.

      ・ Crawl health: If the site responds really quickly for a while, the limit goes up, meaning more connections can be used to crawl. If the site slows down or responds with server errors, the limit goes down and Googlebot crawls less.

      ・ Limit set in Search Console: Website owners can reduce Googlebot's crawling of their site. Note that setting higher limits doesn't automatically increase crawling.

      Even if the crawl rate limit isn't reached, if there's no demand from indexing, there will be low activity from Googlebot.

      ・ Popularity: URLs that are more popular on the Internet tend to be crawled more often to keep them fresher in our index.

      ・ Staleness: Our systems attempt to prevent URLs from becoming stale in the index.

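    • SPA
      JavaScript has to run first to render the page.

      Another potential risk at this stage is that the site's JS files are incompatible with the JavaScript engine the crawler uses. For Google, though, the 2019 article says Googlebot will regularly update its rendering engine, so this should not be much of a concern.

  5. Indexing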
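The checks in the steps above can be sketched as three small functions. This is a deliberately simplified illustration and not how Googlebot is implemented: the robots.txt matcher handles only `User-agent` groups and plain `Disallow` prefixes (no `Allow` precedence, no wildcards, and `*` groups are applied alongside specific ones), and the HTML is scanned with regular expressions instead of a real parser. All function names are hypothetical.

```javascript
// Step 1: is `path` disallowed for `userAgent`? (simplified prefix matching)
function isDisallowed(robotsTxt, userAgent, path) {
  const agent = userAgent.toLowerCase();
  let groupAgents = [];     // user agents of the group currently being read
  let readingRules = false; // true once the group's rules have started
  for (const rawLine of robotsTxt.split('\n')) {
    const line = rawLine.split('#')[0].trim(); // drop comments
    const colon = line.indexOf(':');
    if (colon === -1) continue;
    const field = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();
    if (field === 'user-agent') {
      // A User-agent line after rules starts a new group.
      if (readingRules) { groupAgents = []; readingRules = false; }
      groupAgents.push(value.toLowerCase());
    } else if (field === 'allow' || field === 'disallow') {
      readingRules = true;
      const applies = groupAgents.includes(agent) || groupAgents.includes('*');
      // Paths are compared case-sensitively; Allow rules are ignored here.
      if (field === 'disallow' && applies && value !== '' && path.startsWith(value)) {
        return true;
      }
    }
  }
  return false;
}

// Step 2: collect href values of <a> tags, skipping rel="nofollow" links.
function extractFollowableLinks(html) {
  const links = [];
  for (const tag of html.match(/<a\b[^>]*>/gi) || []) {
    const href = /\bhref\s*=\s*"([^"]*)"/i.exec(tag);
    const rel = /\brel\s*=\s*"([^"]*)"/i.exec(tag);
    const relValues = rel ? rel[1].toLowerCase().split(/\s+/) : [];
    if (href && !relValues.includes('nofollow')) links.push(href[1]);
  }
  return links;
}

// Step 3: does a robots/googlebot meta tag ask not to be indexed?
function hasNoindex(html) {
  return (html.match(/<meta\b[^>]*>/gi) || []).some((tag) => {
    const name = /\bname\s*=\s*"([^"]*)"/i.exec(tag);
    const content = /\bcontent\s*=\s*"([^"]*)"/i.exec(tag);
    if (!name || !content) return false;
    const isRobotsMeta = ['robots', 'googlebot'].includes(name[1].toLowerCase());
    const directives = content[1].toLowerCase().split(',').map((d) => d.trim());
    return isRobotsMeta && directives.includes('noindex');
  });
}

// Ties the steps together for one page. `html` is passed in directly to keep
// the sketch free of network calls.
function processPage({ robotsTxt, userAgent, path, html }) {
  if (isDisallowed(robotsTxt, userAgent, path)) {
    return { crawlable: false, links: [], queueForRender: false };
  }
  return {
    crawlable: true,
    links: extractFollowableLinks(html),
    queueForRender: !hasNoindex(html),
  };
}
```

With the robots.txt example from step 1, `isDisallowed` reports paths under `/The-F2E-tourism/` as blocked for Googlebot, while the lowercase `/the-f2e-tourism/` is not matched because paths are case-sensitive.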


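Google's SEO guidelines

Understand the JavaScript SEO basics lists several things that hurt or help SEO.

  1. fragment URL
    When Googlebot looks for links in a page, it only considers the href attribute of <a> elements. Avoid fragment URLs for loading different content, because the fragment is not sent to the server and the crawler will not crawl it.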

    <!-- fragment URL -->
    <a href="#/chapter1"></a>
    

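    As an alternative, use the History API.

  2. Use meaningful HTTP status codes

    Do not use soft 404s in an SPA. A soft 404 is a URL that returns a page telling the user the page does not exist, yet responds with status 200.

    A soft 404 lets an error page be mistaken for a real, live page; Google Search excludes such pages.

    As alternatives, redirect to a URL for which the server responds with 404 Not Found, or, when the page has no content, add a robots meta tag with its content set to noindex.

    If Googlebot encounters noindex in a robots meta tag before running the page's JavaScript, it neither renders nor indexes the page.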

    If Google encounters the noindex tag, it skips rendering and JavaScript execution. Because Google skips your JavaScript in this case, there is no chance to remove the tag from the page.

     fetch(`/api/products/${productId}`)
       .then(response => response.json())
       .then(product => {
         if(product.exists) {
           showProductDetails(product); // shows the product information on the page
         } else {
           // this product does not exist, so this is an error page.
           // Note: This example assumes there is no other meta robots tag present in the HTML.
           const metaRobots = document.createElement('meta');
           metaRobots.name = 'robots';
           metaRobots.content = 'noindex';
           document.head.appendChild(metaRobots);
         }
       })
    
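  3. Use structured data

    Structured data is a standardized format for providing information about a page and classifying its content. JSON-LD is one such format; others are Microdata and RDFa.

    JSON-LD usually appears as a script element in the head, but it may also be placed in the body.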

    A JavaScript notation embedded in a script tag in the page head or body. The markup is not interleaved with the user-visible text, which makes nested data items easier to express, such as the Country of a PostalAddress of a MusicVenue of an Event. Also, Google can read JSON-LD data when it is dynamically injected into the page's contents, such as by JavaScript code or embedded widgets in your content management system.

     <head>
       <title>Party Coffee Cake</title>
       <script type="application/ld+json">
       {
         "@context": "https://schema.org/",
         "@type": "Recipe",
         "name": "Party Coffee Cake",
         "author": {
           "@type": "Person",
           "name": "Mary Stone"
         },
         "datePublished": "2018-03-10",
         "description": "This coffee cake is awesome and perfect for parties.",
         "prepTime": "PT20M"
       }
       </script>
     </head>
    

    Googlebot uses the JSON-LD it finds to understand the page content, and it can even enable special display features in search results.
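Since the quoted guideline says Google can also read JSON-LD that is injected dynamically, a CSR page could add it with JavaScript. The following is a minimal sketch under stated assumptions: `injectJsonLd` is a hypothetical helper name, and the `recipe` object simply reuses values from the example above as if they had been loaded at runtime.

```javascript
// Builds a <script type="application/ld+json"> element and appends it to
// <head>. `doc` is the page's document object.
function injectJsonLd(doc, data) {
  const script = doc.createElement('script');
  script.type = 'application/ld+json';
  script.textContent = JSON.stringify(data);
  doc.head.appendChild(script);
  return script;
}

// Hypothetical recipe data, e.g. loaded from an API at runtime.
const recipe = {
  '@context': 'https://schema.org/',
  '@type': 'Recipe',
  name: 'Party Coffee Cake',
  author: { '@type': 'Person', name: 'Mary Stone' },
  datePublished: '2018-03-10',
  prepTime: 'PT20M',
};

// In the browser: injectJsonLd(document, recipe);
```

Passing the document in as a parameter keeps the helper easy to try outside a browser; on a real page the call would simply use the global `document`.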


References

Server-side scripting
Static vs Dynamic Websites

Understanding Rendering in Web Apps: Intro

Understanding Rendering in Web Apps: SSR
Isomorphic JavaScript Applications — the Future of the Web?

Understanding Rendering in Web Apps: CSR
Understanding Rendering in Web Apps: SPA vs MPA

Understanding Rendering in Web Apps: CSR vs SSR

SEO & JavaScript: The Good, the Bad & the Uncertainty
Understand the JavaScript SEO basics
Fix Search-related JavaScript problems
What Crawl Budget Means for Googlebot

Create a robots.txt file
Understand how structured data works
General structured data guidelines
Explore the search gallery


#CSR #universal JS #SEO #ssr #googlebot






