Text mining users’ definitions of browsing privacy

One issue that’s been on everyone’s mind lately is privacy. Privacy is extremely important to us at Mozilla, but it isn’t exactly clear how Firefox users define privacy. For example, what do Firefox users consider to be essential privacy issues? What features of a browsing experience lead users to consider a browser to be more or less private?

In order to answer these questions, we asked users to give us their definitions of privacy, specifically privacy while browsing, in order to answer these questions. The assumption was that users will have different definitions, but that there will be enough similarities between groups of responses that we could identify “themes” amongst the responses. By text mining user responses to an open-ended survey question asking for definitions of browsing privacy, we were able to identify themes directly from the users’ mouths:

Regarding privacy issues, people know that tracking and browser history are different issues, validating the need for browser features that address these issues independently (“private browsing” and “do not track”)
People’s definition of personal information vary, but we can group people according to the different ways they refer to personal information (this leads to a natural follow-up question; what makes some information more personal than others?)
Previous focus group research, contracted by Mozilla, showed that users are aware that spam indicates a security risk, but what didn’t come out of the focus group research was that users also also consider spam to be an invasion of their privacy (a follow-up question, what do users define as “spam?” Do they consider targeted ads to be spam?)
There are users who don’t distinguish privacy and security from each other

Some previous research on browsing and privacy

We knew from our own focus group research that users are concerned about viruses, theft of their personal information and passwords, that a website might misuse their information, that someone may track their online “footprint”, or that their browser history is visible to others. Users view things like targeted ads, spam, browser crashes, popups, and windows imploring them to install updates as security risks.

But it’s difficult to broadly generalize findings from focus groups. One group may or may not have the same concerns as the general population. The quality of the discussion moderator, or some unique combination of participants, the moderator, and/or the setting can also influence the findings you get from focus groups.

One way of validating the representativeness of focus group research is to use surveys. But while surveys may increase the representativeness of your findings, they are not as flexible as focus groups. You have to give survey respondents their answer options up front. Therefore, by providing the options that a respondent can endorse, you are limiting their voice.

A typical way to approach this problem in surveys is to use open-ended survey questions. In the pre-data mining days, we would have to manually code each of these survey responses: a first pass of all responses to get an idea of respondent “themes” or “topics” and a second pass to code each response according to those themes. This approach is costly in terms of time and effort, plus it also suffers from the problem of reproducibility; unless themes are extremely obvious, different coders might not classify a response as part of the same theme. But with modern text mining methods, we can simulate this coding process much more quickly and reproducibly.

Text mining open-ended survey questions

Because text mining is growing in popularity primarily due to its computational feasibility , it’s important to review the methods in some detail. Text mining, as with any machine learning-based approach, isn’t magic. There are a number of caveats to make about the text mining approach used. First, the clustering algorithm I chose to use requires an arbitrary and a priori decision regarding the number of clusters. I looked at 4 to 8 clusters and decided that 6 provided the best trade-off between themes expressed and redundancy. Second, there is a random component to clustering, meaning that one clustering of the same set of data may not produce the exact same results as another clustering. Theoretically, there shouldn’t be tremendous differences between the themes expressed in one clustering over another, but it’s important to keep these details in mind.

The general idea of text mining is to assume that you can represent documents as “bags of words”, that bags of words can be represented or coded quantitatively, and that the quantitative representation of text can be projected into a multi-dimensional space. For example, I can represent survey respondents in two dimensions, where each point is a respondent’s answer. Points that are tightly clustered together mean that these responses are theoretically very similar with respect to lexical content (e.g., commonality of words).

I also calculated a score that identifies the relative frequency of each word in a cluster, which is reflected in the size of the word on each cluster’s graph. In essence, the larger the word, the more it “defines” the cluster (i.e. its location and shape in the space).

Higher resolution .pdf files of these graphs can be found here and here.

Cluster summaries

“Privacy and Personal information”: Clusters 1, 4, and 5 are dominated by, unsurprisingly, concerns about information. What’s interesting are the lower-level associations between the clusters and the words. The largest, densest cluster (cluster 4) deals mostly with access to personal information whereas cluster 1 addresses personal information as it relates to identity issues (such as when banking). Cluster 5 is subtly different from both 1 and 4. The extra emphasis on “share” could imply that users have different expectations of privacy with personal information that they explicitly choose to leak onto the web as opposed to personal information that they aren’t aware they are expressing. One area of further investigation would be to seek out user definitions on personal information; what makes some information more “personal” than others?

“Privacy and Tracking”: Cluster 6 clearly shows that people associate being tracked as a privacy issue. The lower-scored words indicate what kind of tracked information concerns them (e.g., keystrokes, cookies, site visits), but in general the notion of “tracking” is paramount to respondents in this cluster. Compare this with cluster 2, which is more strongly defined by the words “look” and “history.” This is obviously a reference to the role that browsing history has in defining privacy. It’s interesting that these clusters are so distinct from each other, because it implies that users are aware there is a difference between their browser history and other behaviors they exhibit that could be tracked. It’s also interesting that users who consider browser history a privacy issue also consider advertising and ads (presumably a reference to targeted ads) as privacy issues as well. We can use this information to extend the focus group research on targeted ads; in addition to a security risk, some users also view targeted ads as an invasion of privacy. One interesting question naturally arises: do users differentiate between spam and targeted advertisements?

“Privacy and Security”: The weakest defined group is cluster 3, which can be interpreted in many ways. The least controversial inference could be that these users simply don’t have a strong definition of privacy aside from a notion that privacy is related to identity and security. This validates a notion from our focus group research that some users really don’t differentiate between privacy and security.

Final thoughts

User privacy and browser security are very important to us at Mozilla, and developing a product that improves on both requires a deep and evolving understanding of what those words mean to people of all communities – our entire user population. In this post, we’ve shown how text mining can enhance our understanding of pre-existing focus group research and generate novel directions for further research. Moreover, we’ve also shown how it can provide insight into users’ perception by looking at the differences in the language they use to define a concept. In the next post, I’ll be using the same text mining approach to evaluate user definitions of security while browsing the web.

5 responses

Donnie Berkholz wrote on August 15, 2011 at 12:40 pm :

Is the code available that you used for this?
Rebecca Weiss wrote on August 15, 2011 at 3:11 pm:

Yes, but I’d want to comment it a bit more thoroughly before I’d share it.

I did everything in R, from data preprocessing, text mining, and plotting.
Raf wrote on August 23, 2011 at 8:55 am:

(Sorry for late comment)

Hi, cool stuff. As a suggestion, it may help to use tf-idf (http://en.wikipedia.org/wiki/Tf%E2%80%93idf) to filter out the more common words like ‘look’ and ‘info’, an gain a better insight into the content of each cluster.

If you already did this, feel free to ignore my comment of course 🙂
smo wrote on August 25, 2011 at 1:30 pm:

R rules OK:)
Rebecca Weiss wrote on September 7, 2011 at 11:21 pm:

Late comments are always welcome!

I did try with and without tf-idf 🙂

I still intend to upload some code for people to look at/play with. I primarily used the tm package in R, which is quite powerful and makes it very easy to see the results of tf-idf weighting.