{"id":4960,"date":"2011-08-15T11:32:11","date_gmt":"2011-08-15T18:32:11","guid":{"rendered":"http:\/\/blog.mozilla.org\/metrics\/?p=4960"},"modified":"2019-09-18T12:05:34","modified_gmt":"2019-09-18T19:05:34","slug":"text-mining-users-definitions-of-browsing-privacy","status":"publish","type":"post","link":"https:\/\/blog.mozilla.org\/metrics\/2011\/08\/15\/text-mining-users-definitions-of-browsing-privacy\/","title":{"rendered":"Text mining users&#8217; definitions of browsing privacy"},"content":{"rendered":"<p><span style=\"color: #000000;\">One issue that\u2019s been on everyone\u2019s mind lately is privacy.\u00a0 Privacy is <span style=\"color: #0000ff;\"><a href=\"http:\/\/firstpersoncookie.wordpress.com\/2011\/01\/12\/mozillas-draft-privacy-data-operating-principles\/\"><span style=\"color: #0000ff;\">extremely<\/span><\/a> <a href=\"https:\/\/wiki.mozilla.org\/Privacy\/Roadmap_2011#Operating_Principles:\"><span style=\"color: #0000ff;\">important<\/span><\/a><\/span> to us at Mozilla, but it isn&#8217;t exactly clear how Firefox users define privacy.\u00a0 For example, what do Firefox users consider to be essential privacy issues?\u00a0 What features of a browsing experience lead users to consider a browser to be more or less private?<\/span><\/p>\n<div>\n<p><span style=\"color: #000000;\">To answer these questions, we asked users to give us their definitions of privacy, specifically\u00a0<strong><em>privacy<\/em><em>\u00a0while browsing<\/em><\/strong>.\u00a0 The assumption was that users would have different definitions, but that there would be enough similarities between groups of responses that we could identify \u201cthemes\u201d amongst the responses. 
By text mining\u00a0user\u00a0responses to an open-ended survey question asking for definitions of browsing privacy,\u00a0 we were able to identify themes directly from the users&#8217; mouths:<\/span><\/p>\n<ol>\n<li><span style=\"color: #000000;\">Regarding\u00a0 privacy issues, people know that tracking and browser history are\u00a0 different issues, validating the need for browser features that address\u00a0 these issues independently (&#8220;private browsing&#8221; and &#8220;do not track&#8221;)<\/span><\/li>\n<li><span style=\"color: #000000;\">People&#8217;s definitions of personal information vary, but we can group people\u00a0 according to the different ways they refer to personal information (this leads to a natural follow-up question: what makes some information more personal than others?)<\/span><\/li>\n<li><span style=\"color: #000000;\">Previous focus group research, contracted by Mozilla, showed that users are aware that spam indicates a\u00a0 security risk, but\u00a0what didn&#8217;t come out of the focus group research was that users also consider spam to be an invasion of their privacy (a follow-up question: how do users define \u201cspam\u201d?\u00a0 Do they consider targeted ads to be spam?)<br \/>\n<\/span><\/li>\n<li><span style=\"color: #000000;\">There are users who don&#8217;t distinguish privacy and security from each other<\/span><\/li>\n<\/ol>\n<h2 id=\"magicdomid12\"><span style=\"color: #000000;\">Some previous research on browsing and privacy<\/span><\/h2>\n<p id=\"magicdomid14\"><span style=\"color: #000000;\">We\u00a0 knew from our own focus group research that users are concerned about viruses, about theft of their personal information and passwords, and about the possibility that a\u00a0 website might misuse their information, that someone may track their\u00a0 online \u201cfootprint\u201d, or that their browser history is visible to others.\u00a0\u00a0 Users view things like targeted ads, spam, browser crashes, popups, and\u00a0 windows imploring 
them to install updates as security risks.<\/span><\/p>\n<p><span style=\"color: #000000;\">But it&#8217;s difficult to broadly generalize\u00a0findings from focus groups.\u00a0 One group may or may not have the same concerns as the general population.\u00a0 The quality of the discussion moderator, or some unique combination of participants,\u00a0 the moderator, and\/or the setting can also influence the findings you get from focus groups.<\/span><\/p>\n<p><span style=\"color: #000000;\">One way of validating the representativeness of focus group research is to use surveys.\u00a0 But while surveys may increase the representativeness of your findings, they are not as flexible as focus groups.\u00a0 You have to give survey respondents their answer options up front.\u00a0 Therefore, by providing the options that a respondent can endorse, you are limiting their voice.<\/span><\/p>\n<p><span style=\"color: #000000;\">A typical\u00a0 way to approach this problem in surveys is to use open-ended survey questions.\u00a0 Before data mining, we had to manually code\u00a0 each of these survey responses: a first pass over all responses to get an idea of respondent \u201cthemes\u201d or \u201ctopics\u201d and a second pass to code each\u00a0 response according to those themes.\u00a0 This approach is costly in terms of time and effort, and it also suffers from a reproducibility problem: unless themes are extremely obvious, different coders might not classify a response as part of the same theme.\u00a0 But with modern text mining methods, we can simulate this coding process much more quickly and reproducibly.<\/span><\/p>\n<h2 id=\"magicdomid19\"><span style=\"color: #000000;\"><strong>Text mining open-ended survey questions<\/strong><\/span><\/h2>\n<p><span style=\"color: #000000;\">Because text mining is growing in popularity, primarily due to its computational feasibility, it\u2019s important to review the\u00a0 methods in some detail.\u00a0 Text 
mining, as with any machine learning-based approach, isn\u2019t magic.\u00a0 There are a number of caveats to make about\u00a0the\u00a0text mining\u00a0approach used. First, the clustering algorithm I chose to use requires an arbitrary\u00a0and a priori\u00a0decision regarding the number of clusters.\u00a0 I looked at 4 to 8 clusters and decided that 6 provided the best trade-off between themes expressed and redundancy.\u00a0 Second, there is a random component to\u00a0 clustering, meaning that one clustering of the same set of data may not produce the exact same results as another clustering. Theoretically,\u00a0 there shouldn&#8217;t be tremendous differences between the themes expressed in one clustering over another, but it&#8217;s important to keep these details in mind.<\/span><\/p>\n<p id=\"magicdomid20\"><span style=\"color: #000000;\">The general idea of text mining is to assume that you can represent documents as &#8220;bags of words&#8221;, that bags of words can be represented\u00a0or coded\u00a0quantitatively, and that the quantitative representation of text can be projected into\u00a0a multi-dimensional space. 
For example, I can represent survey respondents in two dimensions, where each point is a respondent&#8217;s answer.\u00a0 Points that are tightly clustered together represent responses that are theoretically very similar with respect to lexical content\u00a0(e.g., commonality of words).<\/span><\/p>\n<div><span style=\"color: #000000;\"><a href=\"http:\/\/blog.mozilla.org\/metrics\/files\/2011\/06\/p_kmeans_cosdist_clusters.png\"><span style=\"color: #000000;\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-4986 aligncenter\" title=\"p_kmeans_cosdist_clusters\" src=\"http:\/\/blog.mozilla.org\/metrics\/files\/2011\/06\/p_kmeans_cosdist_clusters.png\" alt=\"\" width=\"480\" height=\"480\" srcset=\"https:\/\/blog.mozilla.org\/metrics\/files\/2011\/06\/p_kmeans_cosdist_clusters.png 480w, https:\/\/blog.mozilla.org\/metrics\/files\/2011\/06\/p_kmeans_cosdist_clusters-150x150.png 150w, https:\/\/blog.mozilla.org\/metrics\/files\/2011\/06\/p_kmeans_cosdist_clusters-300x300.png 300w\" sizes=\"(max-width: 480px) 100vw, 480px\" \/><\/span><\/a><\/span><\/div>\n<p><span style=\"color: #000000;\">I\u00a0 also calculated a score that identifies the relative frequency of each word in a cluster, which is reflected in the size of the word on each\u00a0 cluster\u2019s graph.\u00a0 In essence, the larger the word, the more it \u201cdefines\u201d\u00a0 the cluster\u00a0(i.e.\u00a0its location and shape in the space).<\/span><\/p>\n<p><span style=\"color: #000000;\"><a href=\"http:\/\/blog.mozilla.org\/metrics\/files\/2011\/06\/p_kmeans_freq.png\"><span style=\"color: #000000;\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-4988 aligncenter\" title=\"p_kmeans_freq\" src=\"http:\/\/blog.mozilla.org\/metrics\/files\/2011\/06\/p_kmeans_freq.png\" alt=\"\" width=\"480\" height=\"480\" srcset=\"https:\/\/blog.mozilla.org\/metrics\/files\/2011\/06\/p_kmeans_freq.png 480w, 
https:\/\/blog.mozilla.org\/metrics\/files\/2011\/06\/p_kmeans_freq-150x150.png 150w, https:\/\/blog.mozilla.org\/metrics\/files\/2011\/06\/p_kmeans_freq-300x300.png 300w\" sizes=\"(max-width: 480px) 100vw, 480px\" \/><\/span><\/a><\/span><\/p>\n<p><span style=\"color: #000000;\">Higher resolution .pdf files of these graphs can be found <span style=\"color: #0000ff;\"><a href=\"http:\/\/blog.mozilla.org\/metrics\/?attachment_id=4991\"><span style=\"color: #0000ff;\">here<\/span><\/a><\/span> and <span style=\"color: #0000ff;\"><a href=\"http:\/\/blog.mozilla.org\/metrics\/?attachment_id=4990\"><span style=\"color: #0000ff;\">here<\/span><\/a><\/span>.<\/span><\/p>\n<h2><span style=\"color: #000000;\"><strong>Cluster summaries<\/strong><\/span><\/h2>\n<div id=\"magicdomid26\">\n<ul>\n<li><span style=\"color: #000000;\"><strong>&#8220;Privacy and Personal information&#8221;:\u00a0<\/strong>Clusters\u00a0 1, 4, and 5 are dominated by, unsurprisingly, concerns about\u00a0\u00a0<strong>information<\/strong>.\u00a0 What\u2019s interesting are the lower-level associations\u00a0 between the clusters and the words.\u00a0 The largest, densest cluster\u00a0 (cluster 4)\u00a0 deals mostly with access to\u00a0<strong>personal\u00a0<\/strong>information whereas\u00a0 cluster 1 addresses personal information as it relates to\u00a0<strong>identity<\/strong>\u00a0 issues (such as when banking).\u00a0 Cluster 5 is subtly different from both 1\u00a0 and 4.\u00a0 The extra emphasis on &#8220;share&#8221; could imply that users have\u00a0 different expectations of privacy with personal information that they\u00a0<strong>explicitly<\/strong>\u00a0choose to leak onto the web as opposed to personal information that they\u00a0 aren&#8217;t aware they are expressing.\u00a0 One area of further investigation would be to seek out user definitions of personal information: what makes some information more &#8220;personal&#8221; than others?<\/span><\/li>\n<\/ul>\n<\/div>\n<ul>\n<li><span 
style=\"color: #000000;\"><strong>&#8220;Privacy and Tracking&#8221;:\u00a0<\/strong>Cluster\u00a0 6 clearly shows that people regard being tracked as a\u00a0<strong>privacy<\/strong>\u00a0issue.\u00a0\u00a0 The lower-scored words indicate what kind of tracked information\u00a0 concerns them (e.g., keystrokes, cookies, site visits), but in general\u00a0 the notion of \u201ctracking\u201d is paramount to respondents in this cluster.\u00a0\u00a0 Compare this with cluster 2, which is more strongly defined by the words\u00a0 \u201clook\u201d and \u201chistory.\u201d\u00a0 This is obviously a reference to the role that\u00a0<strong>\u00a0browsing<\/strong>\u00a0history has in defining privacy.\u00a0 It&#8217;s interesting that these clusters are so distinct from each other, because it implies that users\u00a0 are aware there is a difference between their browser history and other\u00a0 behaviors they exhibit that could be tracked.\u00a0 It&#8217;s also interesting\u00a0 that users who consider browser history a privacy issue also consider\u00a0\u00a0<strong>advertising and ads<\/strong>\u00a0(presumably a reference to targeted ads) to be privacy\u00a0 issues.\u00a0 We can use this information to extend the focus group\u00a0 research on targeted ads; in addition to seeing them as a security risk, some users\u00a0 also view targeted ads as an invasion of privacy.\u00a0 One interesting question naturally arises: do users differentiate between spam and\u00a0 targeted advertisements?<\/span><\/li>\n<\/ul>\n<div id=\"magicdomid28\">\n<ul>\n<li><span style=\"color: #000000;\"><strong>&#8220;Privacy and Security&#8221;:\u00a0<\/strong>The\u00a0 most weakly defined group is cluster 3, which can be interpreted in many ways.\u00a0 The least controversial inference could be that these users simply don&#8217;t have a strong definition of privacy aside from a notion\u00a0 that privacy is related to identity and security.\u00a0 This validates\u00a0a\u00a0notion\u00a0from our focus group 
research\u00a0that some users really\u00a0<strong>don&#8217;t differentiate<\/strong>\u00a0between privacy and\u00a0 security.<\/span><\/li>\n<\/ul>\n<\/div>\n<h2 id=\"magicdomid29\"><span style=\"color: #000000;\"><strong>Final thoughts<\/strong><\/span><\/h2>\n<p id=\"magicdomid30\"><span style=\"color: #000000;\">User\u00a0 privacy and browser security are very important to us at Mozilla, and\u00a0 developing a product that improves on both requires a deep and evolving\u00a0 understanding of what those words mean to people of all communities\u00a0&#8211; our entire user population.\u00a0\u00a0\u00a0 In this post, we\u2019ve shown how text mining can enhance our understanding\u00a0 of pre-existing focus group research and generate novel directions for\u00a0 further research.\u00a0Moreover, we\u2019ve also shown how it can provide insight into\u00a0 users&#8217; perception by looking at the differences in the language they use\u00a0 to define a concept.\u00a0 In the next post, I&#8217;ll be using the same text\u00a0 mining approach to evaluate user definitions of security while browsing\u00a0 the web.<\/span><\/p>\n<\/div>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>One issue that\u2019s been on everyone\u2019s mind lately is privacy.\u00a0 Privacy is extremely important to us at Mozilla, but it isn&#8217;t exactly clear how Firefox users define privacy.\u00a0 For example, what do Firefox users consider to be essential privacy issues?\u00a0 &hellip; <a class=\"go\" href=\"https:\/\/blog.mozilla.org\/metrics\/2011\/08\/15\/text-mining-users-definitions-of-browsing-privacy\/\">Continue 
reading<\/a><\/p>\n","protected":false},"author":268,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[101,491],"tags":[],"_links":{"self":[{"href":"https:\/\/blog.mozilla.org\/metrics\/wp-json\/wp\/v2\/posts\/4960"}],"collection":[{"href":"https:\/\/blog.mozilla.org\/metrics\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.mozilla.org\/metrics\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/metrics\/wp-json\/wp\/v2\/users\/268"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/metrics\/wp-json\/wp\/v2\/comments?post=4960"}],"version-history":[{"count":0,"href":"https:\/\/blog.mozilla.org\/metrics\/wp-json\/wp\/v2\/posts\/4960\/revisions"}],"wp:attachment":[{"href":"https:\/\/blog.mozilla.org\/metrics\/wp-json\/wp\/v2\/media?parent=4960"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.mozilla.org\/metrics\/wp-json\/wp\/v2\/categories?post=4960"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.mozilla.org\/metrics\/wp-json\/wp\/v2\/tags?post=4960"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}