"Saying you don't care about privacy because you have nothing to hide is like saying you don't care about freedom of speech because you have nothing to say. It's not that you have something to hide, it's that you have something to protect: your freedom." Edward Snowden
Recently, in response to privacy concerns about surfing the Web, many browsers have added a private browsing mode to their user interface. Depending on the browser, this mode is known as Private Browsing, InPrivate or Incognito. However, these modes do not offer a complete solution for privacy-preserving browsing.
The main objectives of these private browsing modes are twofold.
- First, to leave no trace of the visited websites on the user's computer.
- Second, to ensure that the user's activity cannot be linked to the websites visited and that activities performed in private mode are not visible in public mode.
Even so, these modes offer only a partial solution: users can still be tracked, for example, through their Internet Protocol (IP) address.
The high-level process
- When a user enters the URL of a website into their web browser, the browser sends an HTTP GET request to the web server.
- The server responds with an HTTP 200 OK response that includes the requested HTML page.
- The browser processes the HTML page and downloads all the web objects included in it, such as images, scripts, style sheets, etc. These objects are requested from the web server or from other third-party servers if the HTML page includes elements found on other websites. These third-party websites may be ad servers, market researchers, affiliate marketers, retargeters, third-party data collectors, etc. and may pose risks to user privacy. These objects may also be advertisements or Web bugs.
- The web browser activity ends once all objects have been downloaded, and the process repeats when the user clicks on a new link.
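The steps above can be sketched in a few lines of Python. This is a minimal illustration, not how any real browser works: it parses a hard-coded sample page and lists the hosts of embedded objects that are not served by the page's own host. The names `ResourceCollector`, `third_party_hosts` and `SAMPLE_HTML`, and all the example domains, are assumptions made for this sketch. Comparing exact host names is also a simplification, since a real tool would compare registrable domains.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ResourceCollector(HTMLParser):
    """Collects the URLs of objects a browser would fetch for a page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Images, scripts and iframes reference objects via "src";
        # style sheets are referenced via <link href=...>.
        if tag in ("img", "script", "iframe") and attrs.get("src"):
            self.resources.append(urljoin(self.page_url, attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            self.resources.append(urljoin(self.page_url, attrs["href"]))


def third_party_hosts(page_url, html):
    """Return the sorted hosts of embedded objects not served by the page's host."""
    collector = ResourceCollector(page_url)
    collector.feed(html)
    first_party = urlparse(page_url).hostname
    hosts = {urlparse(url).hostname for url in collector.resources}
    return sorted(h for h in hosts if h and h != first_party)


# Hypothetical page: two first-party objects and two third-party objects.
SAMPLE_HTML = """
<html><head>
<link rel="stylesheet" href="/style.css">
<script src="https://ads.example.net/ad.js"></script>
</head><body>
<img src="logo.png">
<img src="https://tracker.example.org/pixel.gif" width="1" height="1">
</body></html>
"""

print(third_party_hosts("https://www.example.com/index.html", SAMPLE_HTML))
# ['ads.example.net', 'tracker.example.org']
```

Each host in the result receives its own HTTP request from the browser, which is what gives third parties the chance to observe the visit.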
Websites may collect information such as user name, email address, location, interests and access patterns. This information is used to create user profiles, which serve both to enhance the user experience through personalized services and to support marketing and attract more advertisers.
The information collected is also used for targeted advertising and dynamic pricing. However, this collection poses a threat to user privacy unless it is done with the knowledge and explicit consent of the user.
Self-identifying information can be obtained from three different conceptual layers: TCP/IP layer, HTTP layer and application layer.
TCP/IP layer
The first thing we expose when we surf the Internet is our IP address. The information that can be obtained about a user from the TCP/IP connection that carries their HTTP requests includes:
- IP address
- Port used
- Domain name
- Geolocation
- Internet Service Provider (ISP)
- City, country, region and continent
- Connection round trip time
- Operating system
- NAT detection.
From the domain name, information about the organization can be obtained, including the name of the administrator.
From the TCP level trace you can also get the computers involved in the communication, the uptime, and some other properties of the connection.
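To illustrate how little the user has to do to expose this information, the sketch below opens a TCP connection over the loopback interface and shows what the accepting side learns before any application data is sent. The function name `observe_client` is an assumption made for this example, and the loopback address stands in for a real client and server. Derived items such as geolocation or ISP would require an external lookup that is not shown here.

```python
import socket


def observe_client():
    """Open a loopback connection and return what the server side sees."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    client = socket.create_connection(server.getsockname())
    conn, (peer_ip, peer_port) = server.accept()
    try:
        # The client has sent no application data, yet the server
        # already knows its IP address and source port.
        return peer_ip, peer_port, client.getsockname()[1]
    finally:
        conn.close()
        client.close()
        server.close()


ip, port, client_port = observe_client()
print(f"server sees client at {ip}:{port}")
```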
HTTP layer
HTTP follows a request-response pattern. The HTTP request contains the URL of the web page to be accessed, and the HTTP response contains the HTML page; further requests fetch the additional resources needed to display the web page correctly.
These Web resources can be images, Flash objects, CSS (Cascading Style Sheets), JavaScript, VBScript, etc.
At this level there are two elements that can be used to identify and track the user: HTTP headers and HTTP cookies.
From the connection established for the request we can obtain:
- IP and port of connection to the server.
- User's web browser (User-agent)
- Language, encoding and character set preferences (Accept-Language, Accept-Encoding, Accept-Charset)
- URL of the previously visited website (Referer)
- User's e-mail address (From)
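As a rough sketch of how a server could read these fields, the following Python function extracts the headers listed above from a raw HTTP/1.1 request. The request text, the browser name `ExampleBrowser/1.0` and the function name `identifying_headers` are all made up for the example. A production server would rely on its framework's header parsing instead.

```python
# Headers from the list above that help identify or track a user.
IDENTIFYING_HEADERS = (
    "User-Agent", "Accept-Language", "Accept-Encoding",
    "Accept-Charset", "Referer", "From",
)


def identifying_headers(raw_request):
    """Return the identifying headers present in a raw HTTP/1.1 request."""
    head = raw_request.split("\r\n\r\n", 1)[0]
    lines = head.split("\r\n")[1:]  # skip the request line (GET ... HTTP/1.1)
    wanted = {name.lower(): name for name in IDENTIFYING_HEADERS}
    found = {}
    for line in lines:
        # Split on the first colon only: header values may contain colons.
        name, sep, value = line.partition(":")
        key = name.strip().lower()
        if sep and key in wanted:
            found[wanted[key]] = value.strip()
    return found


# Hypothetical request, as a browser might send it.
SAMPLE_REQUEST = (
    "GET /products.html HTTP/1.1\r\n"
    "Host: www.example.com\r\n"
    "User-Agent: ExampleBrowser/1.0\r\n"
    "Accept-Language: en-GB,en;q=0.8\r\n"
    "Accept-Encoding: gzip, deflate\r\n"
    "Referer: https://search.example.org/?q=shoes\r\n"
    "\r\n"
)

for name, value in identifying_headers(SAMPLE_REQUEST).items():
    print(f"{name}: {value}")
```

Note that the Referer value in this sample reveals not only the previous site but also the search query that led to the page.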
Cookies are used for personalization based on user preferences, identification of an authentication session, automatic login, location memory, customer sorting, etc.
A cookie is a text string containing a name and its value, an expiration date (set in the optional Expires and Max-Age attributes) and the originating site (set in the optional Domain and Path attributes). Cookies can be set for the domain from which the user is downloading the web page (first-party cookie) or for a different domain (third-party cookie).
The web server sends a cookie in the Set-Cookie header. To set several cookies in the same response, the server includes one Set-Cookie header per cookie.
Once the cookie is set, the web browser sends it back in the Cookie header every time the user accesses the website. Depending on their lifetime, cookies are classified into session cookies and persistent cookies.
The former are deleted when the web browser is closed, so they reside only in memory. The latter are stored on the hard disk and persist after the web browser is closed, until they expire or the user deletes them. For this reason, they are also called tracking cookies.
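The distinction between session and persistent cookies can be sketched with Python's standard `http.cookies` module. The function name `classify_cookies` and the cookie names and values below are assumptions made for this example. The rule it applies is the one described above: a cookie with neither an Expires nor a Max-Age attribute is treated as a session cookie.

```python
from http.cookies import SimpleCookie


def classify_cookies(set_cookie_headers):
    """Map each cookie name to 'session' or 'persistent'."""
    result = {}
    for header in set_cookie_headers:
        jar = SimpleCookie()
        jar.load(header)  # parse one Set-Cookie header value
        for name, morsel in jar.items():
            # Unset attributes are stored as empty strings.
            persistent = bool(morsel["expires"] or morsel["max-age"])
            result[name] = "persistent" if persistent else "session"
    return result


# Hypothetical Set-Cookie header values from a single response.
HEADERS = [
    "sessionid=38afes7a8; Path=/",
    "uid=abc123; Max-Age=31536000; Domain=example.com; Path=/",
]

print(classify_cookies(HEADERS))
# {'sessionid': 'session', 'uid': 'persistent'}
```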
From cookies you can:
- Track, profile and monitor users' browsing activities.
- Know the number of times the user has visited the website, the web pages visited and when they were visited.
- Store the user's movement on the Web site.
Application layer
Some objects embedded in web pages, such as web bugs and banner ads, can leak personally identifiable information (PII) and also degrade download performance.
A Web bug (also known as a Web beacon, 1x1 gif or tracking bug) is an element or object (usually a transparent GIF) that is embedded in a Web page or even in an e-mail.
Web search engines can obtain identifying information about the user. Web bugs are used by third parties to monitor user activity and to compile statistics, and can be combined with cookies to create a user profile. Scripts, objects and plugins such as JavaScript, ActiveX, Java applets and Flash can also pose a threat to privacy, as they can be used to fingerprint and identify the user.
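A simple heuristic for spotting web bugs is to look for images declared with a size of 1x1 (or 0x0) pixels. The sketch below applies that heuristic to a hard-coded HTML fragment. The names `WebBugFinder` and `find_web_bugs` and the tracker URL are hypothetical. The heuristic is deliberately naive: it misses beacons sized through CSS or loaded by scripts.

```python
from html.parser import HTMLParser


class WebBugFinder(HTMLParser):
    """Collects the sources of images declared as 1x1 or 0x0 pixels."""

    def __init__(self):
        super().__init__()
        self.suspects = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attrs = dict(attrs)
        # An image this small is invisible to the reader, so it is
        # unlikely to be content and may be a tracking beacon.
        if attrs.get("width") in ("0", "1") and attrs.get("height") in ("0", "1"):
            self.suspects.append(attrs.get("src"))


def find_web_bugs(html):
    """Return the src of every suspiciously small image in the HTML."""
    finder = WebBugFinder()
    finder.feed(html)
    return finder.suspects


# Hypothetical fragment: one ordinary photo and one invisible image.
SAMPLE = (
    '<p>Hello</p>'
    '<img src="photo.jpg" width="640" height="480">'
    '<img src="https://tracker.example.org/b.gif?id=42" width="1" height="1">'
)

print(find_web_bugs(SAMPLE))
# ['https://tracker.example.org/b.gif?id=42']
```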
User data that can be obtained:
- Name
- Social security number
- Location
- Job
- Family
- Interests
- Future plans
- Whether the user has viewed a web page or e-mail
- The computer the user is accessing from
- Web page opened
- Start time of the visit
- Number of times the user has accessed a website
- Servers from which the user accessed it
- Fingerprints of the user's machine
Fingerprinting is the identification of a set of browser characteristics, such as user agent, HTTP Accept header content types, screen resolution, time zone, browser plugins, plugin versions and MIME types, system fonts, and certain information provided by some cookie tests.
If this information is sufficiently distinctive, it allows the identification of a user.
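The idea can be sketched by hashing a set of browser characteristics into a single identifier. This is an illustrative assumption about how a fingerprint might be computed, not a description of any particular tracker. The attribute names, their values and the function name `fingerprint` are invented for the example.

```python
import hashlib
import json


def fingerprint(attributes):
    """Hash a dict of browser attributes into a short, stable identifier."""
    # Sorting the keys makes the result independent of attribute order.
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Hypothetical characteristics collected from one browser.
browser_a = {
    "user_agent": "ExampleBrowser/1.0",
    "accept": "text/html,application/xhtml+xml",
    "screen": "1920x1080",
    "timezone": "UTC+0",
    "plugins": ["PDF Viewer 1.2"],
    "fonts": ["Arial", "Courier New", "Verdana"],
}
# A second browser that differs only in its time zone.
browser_b = dict(browser_a, timezone="UTC+1")

print(fingerprint(browser_a))
print(fingerprint(browser_b))
print(fingerprint(browser_a) == fingerprint(browser_b))  # False
```

The same browser produces the same identifier on every visit without any cookie being stored, while a single differing characteristic is enough to tell two browsers apart.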
Users can even be fingerprinted by the way they type.
Web search engines can also draw inferences from query terms and link them together, and can use redirections in the results they provide, to distinguish between users.
For some activities, such as behavioral targeting, the information provided by search queries is several times better than the information provided by the pages the user clicks. This information, combined with the time of day (as well as online information), allows web search engines to obtain valuable information about the user (their job, interests, future plans, etc.) and their activities at a given time.