How to unload data with a nested structure from Google BigQuery using custom Google Analytics as an example

image






Google BigQuery is a popular cloud database used by companies around the world. It is especially convenient for working with raw Google Analytics data: in GA 360, integration with BigQuery is configured in a few clicks, and for the free version there are third-party scripts and modules.



In the raw data of Google Analytics, each record (row) corresponds to a session. Inside such an entry are nested fields that correspond to the hits of the session:



image






Often such a nested data structure is confusing for users who do not understand how to work with it and unload such data.



Using the Google Analytics custom parameters as an example, I’ll try to explain “on the fingers” how embedded data is stored in Google BigQuery and how it can be uploaded.



The given query codes are working, they can be used in solving problems, substituting their table names and the required Custom Dimensions index numbers.



• Unloading rows

• Unloading while maintaining the nesting structure

• Example of replacing user parameter values



The basics



BigQuery supports 2 dialects of SQL: Legacy and Standard. Google recommends using the newer SQL Standard, on which we will write queries for unloading.

Everyone who has worked with SQL at least a little knows the standard query construct:



SELECT *  * FROM *  * WHERE * *
      
      





This design works if the table structure is simple, without other fields nested in the cells:



image






We consider tables with nested fields. The structure of such a table (for example, Google Analytics user parameters):



image




Custom GA Options in BQ



In Google BigQuery, such a table will have the following column names (the separator "." Shows the nesting structure):



image






So how do we unload data from nested fields?



Unloading Rows



Back to the table with an example of GA custom parameters in BQ.



The columns customDimensions.index and customDimensions.value are the indexes and values ​​of the session and user-defined Custom Dimensions.



The hits.customDimensions.index and hits.customDimensions.value columns are the indexes and values ​​of the hit Custom Dimensions.



In Google BigQuery there is another level of user parameter action - the product. The names and values ​​of Google Custom Big Dimension Custom Dimensions are in the hits.product.customDimensions.index and hits.product.customDimensions.value columns. They are unloaded by analogy with hit user parameters, you only need to take into account another level of nesting.



Session and user level user options



What should we do if we need to unload the values ​​of session (custom) Custom Dimensions for each date without saving the nested structure (i.e. line by line)?



To answer the question, let's take a closer look at the table with GA user parameters in BQ.

It shows that the values ​​of the cells in the customDimensions column are another table :



image






It is enough to make a subquery to this table in the main query:



 SELECT --   date, --    value (SELECT value --   customDimensions,     t FROM t.customDimensions --       WHERE index = 1) AS customDimensions1 FROM --    t     `project.dataset.tablename` AS t
      
      





The output is a table:



image






If we need to add a column with the value of another user parameter, we make another subquery:



 SELECT date, (SELECT value FROM t.customDimensions WHERE index = 1) AS customDimensions1, --      customDimensions (SELECT value FROM t.customDimensions WHERE index = 2) AS customDimensions2 FROM `project.dataset.tablename` AS t
      
      





We get the following:



image






Hit-level custom options



Hit user parameters are unloaded in the same way as session (user) parameters, except that the subquery must be made to the hits hits table. In other words, the values ​​of the cells in the hits column in the raw data table of Google Analytics are a nested table in which the customDimensions table is nested:



image






The request for downloading line-by-line hit user parameters will be as follows:



 SELECT --   date, --    value (SELECT value --   customDimensions,     h FROM h.customDimensions --       WHERE index = 3) AS customDimensions3 FROM --    t     `project.dataset.tablename` AS t, --   t.hits  h     t.hits AS h
      
      





The result of the query will be a table:



image






You can unload several hit user parameters and add the hitNumber parameter (the sequence number of the hit in the session):



 SELECT date, h.hitNumber AS hitNumber, (SELECT value FROM h.customDimensions WHERE index = 3) AS customDimensions3, --        h.customDimensions (SELECT value FROM h.customDimensions WHERE index = 4) AS customDimensions4 FROM `project.dataset.tablename` AS t, t.hits AS h
      
      





Get the table:



image






Session (user) + hit user parameters



If in one query we want to unload session and hit user parameters, you just need to make the necessary subqueries for the main and nested tables:



 SELECT date, h.hitNumber AS hitNumber, --     (SELECT value FROM t.customDimensions WHERE index=1) AS customDimensions1, (SELECT value FROM t.customDimensions WHERE index=2) AS customDimensions2, --     (SELECT value FROM h.customDimensions WHERE index=3) AS customDimensions3, (SELECT value FROM h.customDimensions WHERE index=4) AS customDimensions4 FROM `project.dataset.tablename` AS t, t.hits AS h
      
      





The table that will be obtained as a result of the query:



image






Unloading while maintaining the nesting structure



Such an unloading may be necessary when replacing the values ​​of a custom parameter in Google BigQuery.



Example

In Google Analytics, the name of the country in full format is transferred to the session user parameter with an index of 12 and the hit user parameter with an index of 25 for users from Russia: RUSSIA. It is necessary to change the country format to abbreviated: RUS.



To do this, you need to replace the required values ​​of the user parameter with the user's country for the entire history of data in Google BigQuery.



The procedure for solving the problem:



  1. Unload all data while maintaining the nesting structure
  2. Replace the value of the user parameter with the country
  3. Rewrite the table


To upload data while maintaining the nesting structure, you must use the ARRAY function and the SELECT AS STRUCT construct. Let’s figure out what it is.



The syntax of the ARRAY function is as follows:



 ARRAY(**)
      
      





It returns an array of elements.



Comparing an array with line-by-line recording:



image




On the left is an array, on the right is a line recording



If we want to save a nested structure and unload an array with several columns, we must use ARRAY (SELECT AS STRUCT ...) :



image




Nested Array



Session and user level user options



For unloading while maintaining the structure of session (custom) Custom Dimensions, we use the request:



 SELECT date, --  ARRAY(SELECT AS STRUCT...)    ARRAY(SELECT AS STRUCT index, value FROM t.customDimensions) AS customDimensions FROM `project.dataset.tablename` AS t
      
      





As a result of its execution, a table is obtained in which the nesting structure of the “raw” Google Analytics data is saved:



image






Hit-level custom options



To unload the values ​​of hit user parameters from Google BigQuery while maintaining the nesting structure, it is important to note that the customDimensions table is nested in the hits table. In other words, you need to make the ARRAY subquery 2 times (SELECT AS STRUCT ...): first, to the nested hits table, then to the nested customDimensions table:



 SELECT date, --    t.hits ARRAY(SELECT AS STRUCT hitNumber, --    h.customDimensions ARRAY(SELECT AS STRUCT index, value FROM h.customDimensions) AS customDimensions FROM t.hits AS h ) AS hits FROM `project.dataset.tablename` AS t
      
      





The result of this query will be a table:



image






Session (user) + hit user parameters



As with line-by-line unloading, we need to combine the ARRAY subqueries (SELECT AS STRUCT ...) in the same query to the desired nested tables:



 SELECT date, --  () Custom Dimensions ARRAY(SELECT AS STRUCT index, value FROM t.customDimensions ) AS customDimensions, --  Custom Dimensions ARRAY(SELECT AS STRUCT hitNumber, ARRAY(SELECT AS STRUCT index, value FROM h.customDimensions ) AS customDimensions FROM t.hits AS h) AS hits FROM `project.dataset.tablename` AS t
      
      





What is the result:



image






Example of replacing user parameter values



Let's get back to our example .

In the previous section, we received a request for downloading session (user) and hit user parameters of Google Analytics while maintaining the nesting structure.

We supplement this query with SELECT * REPLACE constructs for unloading with replacement and CASE for updating the values ​​of the necessary user parameters:



 --      t.customDimensions  t.hits SELECT *REPLACE( --  () Custom Dimensions ARRAY(SELECT AS STRUCT index, --      CASE WHEN index=12 AND value='RUSSIA' THEN 'RUS' ELSE value END AS value FROM t.customDimensions) AS customDimensions, --  Custom Dimensions --   t.hits      h.customDimensions ARRAY(SELECT AS STRUCT *REPLACE( ARRAY(SELECT AS STRUCT index, --      CASE WHEN index=25 AND value='RUSSIA' THEN 'RUS' ELSE value END AS value FROM h.customDimensions) AS customDimensions) FROM t.hits AS h) AS hits) FROM `project.dataset.tablename` AS t
      
      





As a result of this request, we will get the original table with the raw data from Google Analytics. She will completely retain the original nesting structure, but the values ​​of the necessary user parameters will be changed to new ones.



The topic of working with a nested data structure in Google BigQuery is not easy.



I hope I managed to clarify this issue. But, let me remind you, the best way to learn how to do something is to practice more.



All Articles