Browser fingerprinting: how users are tracked on the web

I was always bothered by how persistently Google AdSense served contextual ads based on my old search queries. A lot of time would pass after a search, I would clear cookies and the browser cache more than once, and the ads would still be there. How did they keep track of me? It turns out there are plenty of ways to do this.

Short introduction

Identification, user tracking, or simply web tracking means computing and assigning a unique identifier to each browser that visits a particular site. It was not originally conceived as some universal evil. Like everything else, it has another side and was intended to be useful: for example, to let site owners distinguish ordinary users from bots, or to store user preferences and apply them on a subsequent visit. The advertising industry, however, found this capability very appealing. Cookies are one of the most popular ways to identify users, and advertisers have been using them actively since the mid-nineties.

Much has changed since then. Technology has moved far ahead, and user tracking is no longer limited to cookies. Users can be identified in several ways. The most obvious is to set some kind of identifier, like a cookie. The next is to use data about the user's PC that can be obtained from the requests it sends, including their HTTP headers: address, operating system, time, and the like. Finally, a user can be distinguished by behavior and habits (cursor movements, favorite sections of a site, and so on).

Explicit Identifiers

This approach is fairly obvious: all that is required is to store some long-lived identifier on the user's side that can be requested on a subsequent visit to the resource. Modern browsers offer plenty of ways to do this without the user noticing. First of all, there are good old cookies. Then there are plugin features that are functionally close to cookies, for example Local Shared Objects in Flash or Isolated Storage in Silverlight. HTML5 also includes several client-side storage mechanisms, including the localStorage, File, and IndexedDB APIs. Beyond these, unique tokens can be stored in locally cached resources or in cache metadata (Last-Modified, ETag). A user can also be identified by fingerprints derived from the Origin Bound Certificates that the browser generates for SSL connections, by data contained in SDCH dictionaries, and by the metadata of those dictionaries. In short, there is no shortage of options.

Cookies

When it comes to storing a small amount of data on the client side, cookies are usually the first thing that comes to mind. The web server assigns a unique identifier to a new user and stores it in a cookie, and the client sends it back to the server with every subsequent request. All popular browsers have long had a convenient interface for managing cookies, and the Web is full of third-party utilities for managing and blocking them, yet cookies are still actively used for tracking. The fact is that few people inspect or clear them (try to remember when you last did). Perhaps the main reason is that everyone is afraid of accidentally deleting a cookie they need, for example one used for authorization. And although some browsers let you restrict third-party cookies, the problem does not go away, because browsers very often treat cookies received through HTTP redirects or by other means during page load as first-party. Unlike most of the mechanisms discussed later, the use of cookies is visible to the end user. To "tag" a user, it is not even necessary to store a unique identifier in a separate cookie: it can be assembled from the values of several cookies or hidden in metadata such as the expiration time. So at this stage it is quite difficult to tell whether a particular cookie is used for tracking or not.
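The basic cookie round trip described above can be sketched in a few lines of Python. This is only an illustration, not how any particular ad network does it: the cookie name `uid`, the one-year lifetime, and the function name `identify` are all assumptions made for the example.

```python
import secrets
from http.cookies import SimpleCookie
from typing import Optional, Tuple

ONE_YEAR = 365 * 24 * 60 * 60  # cookie lifetime in seconds (assumed)


def identify(cookie_header: str) -> Tuple[str, Optional[str]]:
    """Return (visitor_id, Set-Cookie value or None) for one request.

    cookie_header is the raw value of the request's Cookie header
    ("" if the browser sent none).
    """
    jar = SimpleCookie()
    jar.load(cookie_header)
    if "uid" in jar:
        # Returning visitor: the browser sent our identifier back.
        return jar["uid"].value, None

    # New visitor: mint a random identifier and ask the browser to keep it.
    visitor_id = secrets.token_hex(16)
    out = SimpleCookie()
    out["uid"] = visitor_id
    out["uid"]["max-age"] = ONE_YEAR
    out["uid"]["path"] = "/"
    return visitor_id, out["uid"].OutputString()
```

On the first visit the second element of the result goes into a Set-Cookie response header. On later visits it is None, and the first element is the identifier the browser handed back.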

Local shared objects

Adobe Flash uses the LSO mechanism to store data on the client side. It is an analogue of HTTP cookies but, unlike them, it can store more than short fragments of text, which in turn complicates the analysis and inspection of such objects. Prior to version 10.3, the behavior of Flash cookies was configured separately from the browser settings: you had to visit the Flash Settings Manager hosted on macromedia.com. Today, this can be done directly from the control panel. In addition, most modern browsers integrate fairly tightly with the Flash player: for example, when cookies and other site data are deleted, LSOs are deleted too. On the other hand, the integration is not that tight in every respect, so the browser's third-party cookie policy will not always apply to Flash cookies (the Adobe website explains how to disable them manually).

Silverlight Isolated Storage

The Silverlight platform has quite a lot in common with Adobe Flash. Its analogue of Flash's Local Shared Objects is a mechanism called Isolated Storage. Unlike Flash, though, its privacy settings are not tied to the browser, so even if cookies and the browser cache are completely cleared, data saved in Isolated Storage remains. More interesting still, the storage is shared across all browser windows (except those opened in Incognito mode) and all profiles on the same machine. As with LSOs, there are no technical obstacles to storing session identifiers in it. Nevertheless, even though this mechanism cannot yet be reached through browser settings, it has not become as widespread as a store for unique identifiers.

HTML5 and client data storage

HTML5 provides a set of mechanisms for storing structured data on the client. These include localStorage, the File API, and IndexedDB. Despite their differences, all of them are designed to provide persistent storage of arbitrary chunks of binary data associated with a specific resource. Plus, unlike HTTP and Flash cookies, there are no significant limits on the size of the stored data. In modern browsers, HTML5 storage sits alongside other site data. However, it is hard to guess how to manage this storage through browser settings. For example, to remove data from localStorage in Firefox, the user has to select offline website data or site preferences and set the time range to "everything". Another unusual feature, unique to IE, is that the data exists only for the lifetime of the tabs that were open when it was saved. In addition, these mechanisms do not really try to follow the restrictions that apply to HTTP cookies. For example, you can write to localStorage and read from it through cross-domain frames even with third-party cookies disabled.

Cached objects

Everyone wants the browser to be fast and free of lag. So it has to add the resources of the sites it visits to a local cache (to avoid requesting them again on the next visit). Although this mechanism was clearly never intended to be used as a random access store, it can be turned into one. For example, a server can return a JavaScript document with a unique identifier inside its body and set a date in the distant future in the Expires / max-age headers. The script, and with it the unique identifier, is then written to the browser cache. After that, it can be accessed from any page on the Web simply by requesting the script from a well-known URL. Of course, the browser will periodically ask, using the If-Modified-Since header, whether a new version of the script has appeared. But if the server keeps returning code 304 (Not Modified), the cached copy will be used forever. What else is interesting about the cache? It has no concept of "third-party" objects, unlike HTTP cookies, for example. At the same time, disabling caching can seriously hurt performance. And automatically detecting the tricky resources that carry identifiers or tags is difficult because of the sheer volume and complexity of the JavaScript found on the Web. Of course, all browsers let the user clear the cache manually. But as practice shows (our own included), this is not done very often, if at all.
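The server side of this trick might look roughly like the sketch below. It is not tied to any web framework: the function `tracking_script_response`, the JavaScript variable name `trackingId`, and the one-year lifetime are all made up for illustration, and the function simply returns a status code, headers, and a body.

```python
import secrets
import time
from email.utils import formatdate

ONE_YEAR = 365 * 24 * 60 * 60  # cache lifetime in seconds (assumed)


def tracking_script_response(if_modified_since=None):
    """Build (status, headers, body) for a request to the tracking script URL."""
    if if_modified_since is not None:
        # The browser is revalidating its cached copy. Always claim that
        # nothing changed, so the copy holding the identifier stays in use.
        return 304, {}, b""

    # First request from this browser: bake a fresh identifier into the body
    # and mark the response as cacheable far into the future.
    visitor_id = secrets.token_hex(16)
    now = time.time()
    headers = {
        "Content-Type": "application/javascript",
        "Cache-Control": "private, max-age=%d" % ONE_YEAR,
        "Expires": formatdate(now + ONE_YEAR, usegmt=True),
        "Last-Modified": formatdate(now, usegmt=True),
    }
    body = ('var trackingId = "%s";' % visitor_id).encode("ascii")
    return 200, headers, body
```

Any page that includes this script from the same URL gets the cached body, and therefore the same `trackingId`, until the cache entry is evicted or cleared.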

ETag and Last-Modified

For caching to work correctly, the server needs some way to tell the browser that a newer version of a document is available. The HTTP/1.1 standard offers two ways to solve this problem. The first is based on the date the document was last modified, and the second on an abstract identifier known as an ETag. With ETag, the server initially returns a so-called version tag in the response header along with the document itself. On subsequent requests to the same URL, the client sends the value associated with its local copy back to the server in the If-None-Match header. If the version given in this header is current, the server responds with HTTP code 304 (Not Modified), and the client can safely use the cached version. Otherwise, the server sends a new version of the document with a new ETag. This approach is somewhat reminiscent of HTTP cookies: the server stores an arbitrary value on the client only in order to read it back later. The other method, based on the Last-Modified header, lets you store at least 32 bits of data in a date string, which the client then sends to the server in the If-Modified-Since header. Interestingly, most browsers do not even require this string to be a correctly formatted date. As with identification through cached objects, ETag and Last-Modified are not affected in any way by deleting cookies and site data. You can get rid of them only by clearing the cache.
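Both variants can be sketched in Python. The function names here (`respond`, `id_to_last_modified`, `last_modified_to_id`) are invented for the example, and the Cache-Control value is just one plausible way to make the browser revalidate on every request. The Last-Modified helpers treat a 32-bit identifier as a number of seconds since 1970, which is one simple way to fit it into a valid date string.

```python
import calendar
import secrets
from email.utils import formatdate, parsedate


def respond(if_none_match=None):
    """Return (status, headers, visitor_id) for a request to the tagged URL."""
    if if_none_match:
        # The browser echoed our tag back: that value *is* the identifier.
        return 304, {"ETag": if_none_match}, if_none_match.strip('"')

    # First visit: issue a unique "version tag" for this browser only.
    visitor_id = secrets.token_hex(8)
    headers = {
        "ETag": '"%s"' % visitor_id,
        # Allow storing the response, but force revalidation each time.
        "Cache-Control": "private, no-cache",
    }
    return 200, headers, visitor_id


def id_to_last_modified(visitor_id: int) -> str:
    """Hide a 32-bit identifier in a Last-Modified date (seconds since 1970)."""
    return formatdate(visitor_id, usegmt=True)


def last_modified_to_id(header: str) -> int:
    """Recover the identifier from an If-Modified-Since header value."""
    return calendar.timegm(parsedate(header))
```

From the browser's point of view nothing unusual happens in either case: it is just revalidating a cached document.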

HTML5 AppCache

Application Cache lets you specify which parts of a site should be stored on disk and remain accessible even when the user is offline. Everything is controlled by manifests that set the rules for storing and retrieving cache entries. Like the traditional caching mechanism, AppCache can also store unique, user-specific data, both inside the manifest itself and inside resources that are kept indefinitely (unlike the regular cache, whose resources are evicted after a while). AppCache sits somewhere between the HTML5 storage mechanisms and the ordinary browser cache. In some browsers it is cleared when cookies and site data are deleted, in others only when the browsing history and all cached documents are deleted.

SDCH dictionaries

SDCH is a compression algorithm developed by Google. It relies on dictionaries provided by the server and achieves a higher compression ratio than Gzip or deflate. The point is that in everyday operation a web server sends a lot of duplicate information: page headers / footers, embedded JavaScript / CSS, and so on. With this approach, the client receives a dictionary file from the server containing strings that may appear in subsequent responses (those same headers / footers / JS / CSS). After that, the server can simply refer to these elements inside the dictionary, and the client assembles the page from them on its own. As you might guess, these dictionaries can easily be used to store unique identifiers, which can be placed both in the dictionary IDs that the client returns to the server in the Avail-Dictionary header and directly in the content itself. They can then be used just like the ordinary browser cache.

Other storage mechanisms

But there are more options. Using JavaScript and related mechanisms, you can save and retrieve a unique identifier so that it survives even after the entire browsing history and site data are deleted. One option is to store it in window.name or sessionStorage. Even if the user erases all cookies and site data but does not close the tab in which the tracking site was open, the identifying token will reach the server on the next visit, and the user will again be linked to the data already collected about them. JavaScript itself behaves the same way: any open JavaScript context retains its state even if the user deletes site data. Such JavaScript does not have to belong to the displayed site. It can also hide in iframes, web workers, and so on. For example, an advertisement loaded in an iframe will pay no attention to the deletion of browsing history and site data and will keep using the identifier stored in a local JS variable.

Protocols

In addition to the mechanisms related to caching, the use of JS and various plug-ins, modern browsers have several more network features that allow you to store and retrieve unique identifiers.

Origin Bound Certificates (aka ChannelID) - persistent self-signed certificates that identify the client to the HTTPS server. A separate certificate is created for each new domain, which is used for future connections. Sites can use OBC to track users without taking any actions that will be visible to the client. As a unique identifier, you can take the cryptographic hash of the certificate provided by the client as part of a legitimate SSL handshake.

Similarly, TLS also has two mechanisms - session identifiers and session tickets, which allow clients to resume interrupted HTTPS connections without performing a full handshake. This is achieved through the use of cached data. These two mechanisms, over a short period of time, allow servers to identify requests originating from a single client.

Almost all modern browsers implement their own internal DNS cache to speed up the name resolution process (and in some cases reduce the risk of DNS rebinding attacks). Such a cache can easily be used to store small amounts of information. For example, if you have 16 available IP addresses, about 8–9 cached names will be enough to identify each computer on the Web. However, this approach is limited by the size of the browsers' internal DNS cache and can potentially lead to name resolution conflicts with the DNS provider.
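The arithmetic behind "16 addresses and 8-9 names" is simple: each cached name can point at one of 16 addresses, which is 4 bits, so 8 names carry 32 bits and 9 names carry 36. Here is a small Python sketch of the encoding. The host names under `tracker.example`, the addresses from the 192.0.2.0/24 documentation range, and the function names are all hypothetical. A real tracker would learn the "answers" by observing which address the browser connects to for each name.

```python
import math
from typing import Dict

# Hypothetical names and addresses, chosen only for illustration.
NAMES = ["n%d.tracker.example" % i for i in range(8)]
ADDRESSES = ["192.0.2.%d" % i for i in range(16)]

BITS_PER_NAME = int(math.log2(len(ADDRESSES)))  # 16 addresses -> 4 bits
TOTAL_BITS = BITS_PER_NAME * len(NAMES)         # 8 names -> 32 bits


def encode(visitor_id: int) -> Dict[str, str]:
    """Choose the DNS answer for each name from successive 4-bit chunks."""
    return {
        name: ADDRESSES[(visitor_id >> (BITS_PER_NAME * i)) & 0xF]
        for i, name in enumerate(NAMES)
    }


def decode(answers: Dict[str, str]) -> int:
    """Rebuild the identifier from the address cached for each name."""
    return sum(
        ADDRESSES.index(answers[name]) << (BITS_PER_NAME * i)
        for i, name in enumerate(NAMES)
    )
```

With 32 bits there are about 4.3 billion distinct identifiers, which is why a handful of cached names is enough.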

Machine specifications

All the methods discussed so far rely on assigning the user a unique identifier that is sent to the server on subsequent requests. There is another, less obvious approach to tracking that relies on querying or measuring the characteristics of the client machine. Taken individually, each characteristic represents only a few bits of information, but if you combine several, they can uniquely identify any computer on the Internet. Such surveillance is much harder to recognize and prevent, and on top of that the technique can identify a user across different browsers or in private mode.

Browser fingerprints
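To see how "a few bits" add up, here is a back-of-the-envelope calculation in Python. The fractions below are purely illustrative, not measurements, and the sum assumes the attributes are independent, which real attributes rarely are. An attribute value shared by a fraction p of all browsers contributes -log2(p) bits.

```python
import math


def bits(fraction_sharing_value: float) -> float:
    """Identifying information carried by a value that a given
    fraction of all browsers share."""
    return -math.log2(fraction_sharing_value)


# Illustrative numbers only: "1 browser in N has this exact value".
attributes = {
    "user_agent": 1 / 1024,
    "time_zone": 1 / 32,
    "screen_size": 1 / 64,
    "font_list": 1 / 4096,
}

# Assuming independence, the bits simply add up: 10 + 5 + 6 + 12.
total_bits = sum(bits(p) for p in attributes.values())

# Bits needed to single out one device among roughly eight billion.
needed_bits = math.log2(8e9)
```

Under these made-up numbers, four attributes already give 33 bits, which is about what it takes to tell apart eight billion devices.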

The simplest approach to tracking is to build identifiers by combining a set of parameters available in a browser environment, each of which individually is of no interest, but together they form a unique value for each machine:

User-Agent. Gives the browser version, the OS version, and some of the installed add-ons. When the User-Agent is absent or you want to check its "veracity", you can determine the browser version by testing for features that were implemented or changed between releases.

Clock. If the system does not synchronize its clock with a third-party time server, the clock will sooner or later start to run slow or fast, creating a unique difference between real and system time that can be measured with microsecond precision using JavaScript. In fact, even with NTP synchronization there will still be small deviations that can also be measured.

Information about the CPU and GPU. You can get it either directly (via GL_RENDERER) or through benchmarks and tests implemented using JavaScript.

Monitor resolution and browser window size (including settings for the second monitor in the case of a multi-monitor system).

A list of fonts installed in the system, obtained, for example, using the getComputedStyle API.

A list of all installed plugins, ActiveX controls, and Browser Helper Objects, including their versions. It can be obtained through navigator.plugins[] (some plugins also reveal their presence in HTTP headers).

Information about installed extensions and other software. Extensions, such as ad blockers, make certain changes to the pages viewed, by which you can determine what the extension is and its settings.
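Once the parameters from the list above have been collected (on the client, with JavaScript) and sent to the server, turning them into a single identifier is trivial. Here is a minimal Python sketch, assuming the attributes arrive as a plain dictionary. The attribute names and the choice of SHA-256 are assumptions for the example, not a description of any real tracker.

```python
import hashlib
import json
from typing import Any, Dict


def fingerprint(attributes: Dict[str, Any]) -> str:
    """Collapse a set of browser attributes into one stable identifier."""
    # Serialize with sorted keys so the same attributes always hash the
    # same way, regardless of the order in which they were collected.
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

The weakness of such a naive hash is that a single changed attribute (a browser update, a newly installed font) produces a completely different value, so in practice the individual attributes would also need to be compared one by one.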

Network fingerprints

A number of other signs lie in the architecture of the local network and the configuration of network protocols. Such signs will be characteristic of all browsers installed on the client machine, and they cannot simply be hidden with the help of privacy settings or some security utilities. They include:

The external IP address. With IPv6 this vector is especially interesting, since in some cases the last octets are derived from the device's MAC address and therefore stay the same even when the device connects to different networks.

Port numbers of outgoing TCP/IP connections (most operating systems usually pick them sequentially).

The local IP address for users who are behind a NAT or an HTTP proxy. Together with the external IP, it uniquely identifies most clients.

Information about proxy servers used by the client, obtained from the HTTP header (X-Forwarded-For). In combination with the real client address obtained through several possible methods of bypassing the proxy, it also allows you to identify the user.
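The IPv6 case mentioned in the list above is easy to demonstrate. When a device builds its address from its MAC using the modified EUI-64 scheme, the MAC is split in half, the bytes ff:fe are inserted in the middle, and one bit of the first byte is flipped. The Python sketch below reverses that. The function name is invented, and the addresses in the usage note come from the 2001:db8::/32 documentation range.

```python
import ipaddress
from typing import Optional


def mac_from_ipv6(address: str) -> Optional[str]:
    """Recover the MAC address embedded in a modified EUI-64 IPv6 address.

    Returns None if the interface identifier does not contain the ff:fe
    marker (for example, when it was generated randomly).
    """
    interface_id = ipaddress.IPv6Address(address).packed[8:]  # low 64 bits
    if interface_id[3:5] != b"\xff\xfe":
        return None
    mac = (
        bytes([interface_id[0] ^ 0x02])  # flip the universal/local bit back
        + interface_id[1:3]
        + interface_id[5:8]
    )
    return ":".join("%02x" % byte for byte in mac)
```

For example, both 2001:db8:1::21a:2bff:fe3c:4d5e and 2001:db8:2::21a:2bff:fe3c:4d5e yield the same MAC, 00:1a:2b:3c:4d:5e, even though the network prefixes differ.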

Behavioral Analysis and Habits

Another option is to look towards characteristics that are not tied to a PC, but rather to the end user, such as regional settings and behavior. This method again allows you to identify clients between different browser sessions, profiles, and in the case of private browsing. Conclusions can be made on the basis of the following data, which are always available for study:

Preferred language, default encoding and time zone (all of this lives in HTTP headers and is accessible from JavaScript).

Data in the client's cache and browsing history. Cache entries can be detected with timing attacks: the tracker can detect long-lived cache entries belonging to popular resources simply by measuring load time (and cancelling the navigation if the time exceeds what a load from the local cache would take). URLs stored in the browsing history can also be extracted, although in modern browsers such an attack requires a little user interaction.

Mouse gestures, frequency and duration of keystrokes, data from the accelerometer - all of these parameters are unique to each user.

Any changes to standard site fonts and their sizes, the zoom level, and the use of accessibility features such as text color and size.

The state of certain browser features configured by the client: blocking third-party cookies, DNS prefetching, blocking pop-ups, Flash security settings and so on (ironically, users who change the default settings actually make their browser much easier to identify).

And these are just the obvious options that lie on the surface. If you dig deeper, you can come up with more.

To summarize

As you can see, in practice there are many different ways to track a user. Some of them are the result of implementation errors or oversights and can, in theory, be fixed. Others are almost impossible to eradicate without completely changing the principles on which computer networks, web applications, and browsers work. Some techniques can be countered by clearing the cache, cookies, and other places where unique identifiers can be stored. Others work completely invisibly to the user, and there is little chance of protecting yourself from them. So the most important thing to remember when traveling the Web, even in private browsing mode, is that your movements can still be tracked.