{"id":743,"date":"2014-07-30T14:50:58","date_gmt":"2014-07-30T14:50:58","guid":{"rendered":"http:\/\/mozscienceblog.wpengine.com\/?p=743"},"modified":"2019-02-28T12:40:46","modified_gmt":"2019-02-28T20:40:46","slug":"discovery-of-scientific-software","status":"publish","type":"post","link":"https:\/\/blog.mozilla.org\/foundation-archive\/mozilla-science\/discovery-of-scientific-software\/","title":{"rendered":"Discovery of scientific software"},"content":{"rendered":"<p><em>This is a guest post by Jure Triglav, an open science hacker. You can check out <a title=\"http:\/\/juretriglav.si\" href=\"http:\/\/juretriglav.si\">http:\/\/juretriglav.si<\/a>\u00a0for more of his projects or <a href=\"https:\/\/twitter.com\/juretriglav\">follow him on Twitter.<\/a><\/em><\/p>\n<p><strong>TL;DR: An open API for science helps researchers discover great software. <a title=\"Install the Scholar Ninja extension\" href=\"https:\/\/chrome.google.com\/webstore\/detail\/scholar-ninja\/mngpckgljabecionknlpnnbamopcehgp?hl=en\">Install the Scholar Ninja extension<\/a>\u00a0and you\u2019ll get recommendations (based on software citations) on-the-fly while browsing GitHub.<\/strong><\/p>\n<p><img decoding=\"async\" class=\"aligncenter wp-post-image\" style=\"float: none\" alt=\"scientificsoftware-blogpost\" src=\"https:\/\/mozscienceblog.wpengine.com\/wp-content\/uploads\/2014\/07\/scientificsoftware-blogpost.png\" width=\"700\" \/><\/p>\n<p>A while back I wrote about an open distributed search engine for science, <a href=\"http:\/\/juretriglav.si\/an-open-distributed-search-engine-for-science\/\">Scholar Ninja<\/a>, and about how great it will be to have an open API which you can query and <strong>get to all of science, no matter if you\u2019re human or machine<\/strong>. Having the world\u2019s knowledge openly accessible like that will result in a paradigm shift. I dare you to say it ain\u2019t so! 
(Also, do check out <a href=\"http:\/\/contentmine.org\">ContentMine<\/a>!)<\/p>\n<p>While that project is still in the early stages of development (<a href=\"https:\/\/github.com\/ScholarNinja\/extension\/issues\/8\">most recently, we even had to turn off our core feature, WebRTC<\/a>), lots of people have asked what the use cases for such a search engine\/API would be, anyway. A great many don\u2019t appreciate how important this will be, and while that might seem silly at first glance, the vast majority of researchers are quite satisfied with the closed, machine-unfriendly Google Scholar and other closed behemoths. How can we show them the light?<\/p>\n<p>After a lot of conversations at <a href=\"http:\/\/2014.okfestival.org\/\">#OKFest14<\/a> and more recently at the <a href=\"https:\/\/mozillascience.org\/how-to-join-us-virtually-for-the-global-sprint-july-22-23\/\">Mozilla Science Lab Code Sprint<\/a>, one use case that surfaced from virtually every discussion was helping scientists (or anyone) discover great software in their field (or any field, or anywhere!). As a tsunami of code is reaching the shores of Science, discovering scientific software will only get more important:<\/p>\n<p><a href=\"https:\/\/mozscienceblog.wpengine.com\/wp-content\/uploads\/2014\/07\/Bj1axSBCMAAjBsd.png\"><img decoding=\"async\" class=\"aligncenter wp-post-image\" style=\"float: none\" alt=\"Bj1axSBCMAAjBsd\" src=\"https:\/\/mozscienceblog.wpengine.com\/wp-content\/uploads\/2014\/07\/Bj1axSBCMAAjBsd.png\" width=\"400\" \/><\/a><\/p>\n<p>Luckily, it\u2019s quite possible to use Scholar Ninja for scientific software discovery. 
In fact, it\u2019s more than possible \u2014 it\u2019s already done.<\/p>\n<p>Before we get ahead of ourselves, let me provide just a bit of necessary backstory: <a href=\"https:\/\/chrome.google.com\/webstore\/detail\/scholar-ninja\/mngpckgljabecionknlpnnbamopcehgp?hl=en\">Scholar Ninja<\/a> indexes (or rather, <em>did and will again<\/em> index, see <a href=\"https:\/\/github.com\/ScholarNinja\/extension\/issues\/\">issues<\/a>) every paper you read online and <strong>adds the paper\u2019s metadata, keywords and URLs to a globally distributed search index<\/strong>, which is based on browsers, WebRTC and magic. Everyone who has the extension installed is a node in a <a href=\"http:\/\/en.wikipedia.org\/wiki\/Chord_(peer-to-peer)\">Chord DHT network<\/a> and is both an indexer and a server of content. <strong>Scholar Ninja\u2019s mission in life is to become a complete and completely open search engine for science.<\/strong><\/p>\n<p>To get right back to the main story here \u2014 if we can get URLs from each paper, that means we can get a large portion of software citations, which usually look something like this:<\/p>\n<blockquote><p>&#8230; reads longer than 300 bases were separated by barcode and trimmed using sickle (<a href=\"https:\/\/github.com\/najoshi\/sickle\">https:\/\/github.com\/najoshi\/sickle<\/a>); 72.3% of reads were retained &#8230;<\/p><\/blockquote>\n<p>That is, the URL is inlined directly into the text. Or it is referenced classically in the References section:<\/p>\n<blockquote><p>24. Nikhil J. Sickle &#8211; a windowed adaptive trimming tool for FASTQ files using quality. <a href=\"https:\/\/github.com\/najoshi\/sickle\">https:\/\/github.com\/najoshi\/sickle<\/a>.<\/p><\/blockquote>\n<p>In both cases the URL is right there, and that appears to be the case for most software citations. 
The URL itself usually points to places like GitHub, BitBucket, SourceForge, Google Code, R Project, etc.<\/p>\n<p>Alright, enough mumbling. Imagine that you\u2019re new to the field of bioinformatics and your first task on the job is to process a vast number of FASTQ files representing partial genetic sequences and assemble them into a coherent sequence. \u201cNo problem\u201d, you say, \u201cfirst I need to analyse the sequences and trim them when their <a href=\"http:\/\/en.wikipedia.org\/wiki\/FASTQ_format#Quality\">quality<\/a> gets too low.\u201d You start to do this by hand and a week goes by, then two, then three weeks, <em>woosh<\/em>. Luckily a wise co-worker walks by your tiny windowless office, sees what you\u2019re doing, chuckles kindly, then proceeds to tell you about this great piece of software called <a href=\"https:\/\/github.com\/najoshi\/sickle\"><strong>sickle<\/strong><\/a> and gives you a link to the GitHub repository. Now, naturally, you find <a href=\"https:\/\/github.com\/najoshi\/sickle\"><strong>sickle<\/strong><\/a> fantastic and wonder how you ever could have processed your FASTQ files without its windowed adaptive trimming functionality. <strong>\u201cSickle is the best!\u201d<\/strong> You\u2019d like to know if there are more tools like it out there that could save you hours and days and weeks of work in the future. Unfortunately your co-worker just left the department to go sailing around the world for a year, so you have no one to ask. What can you do?<\/p>\n<p>Here is where the power of an open API for science can really start to shine: How about you take a weekend and create <strong>a scientific software recommender system based on software citations that you get from the API?<\/strong><\/p>\n<p>Let\u2019s do it!<\/p>\n<h2 id=\"step0createopenapiforscience\">Step 0. Create open API for science<\/h2>\n<p>This is what Scholar Ninja hopes to provide in the future. 
It\u2019s a sizeable project, fraught with many perils, not least of which are the many licensing restrictions publishers place on scientific content. But I digress. We\u2019ve had a working version of the network running with around 70 nodes at peak, but due to <a href=\"https:\/\/code.google.com\/p\/chromium\/issues\/detail?id=392651\">bugs in Chrome<\/a> and our own implementation, it turned out to not be ready for prime time just yet. However, once complete, if you run a Scholar Ninja super-node in Node.js, you\u2019ll be able to join the network and query it just like you would query a good old regular HTTP API.<\/p>\n<h2 id=\"step1useopenapiforscience\">Step 1. Use open API for science<\/h2>\n<p>OK, step 0 just gave you a great API to work with. For our purpose, you could, for example, ask it to give you all citations of <strong>sickle<\/strong>:<\/p>\n<p><code>GET \/citations\/software?url=github.com\/najoshi\/sickle<\/code><\/p>\n<p>And you&#8217;ll get a nice JSON back, with the papers that cite <strong>sickle<\/strong>. OK, so now query the API to give you other software which these papers cite:<\/p>\n<p><code class=\"`\">GET \/software?cited_in=[\"10.1155\/2014\/404578\", \"10.1371\/journal.pone.0101021\", ...]<br \/>\n<\/code><\/p>\n<p>And again get a nice JSON back:<\/p>\n<p><code>[{<br \/>\n\"id\": \"github.com\/jstjohn\/SeqPrep\",<br \/>\n\"source\": \"doi:10.1155\/2014\/404578\"<br \/>\n},<br \/>\n{<br \/>\n\"id\": \"github.com\/vsbuffalo\/scythe\",<br \/>\n\"source\": \"doi:10.1186\/gb-2013-14-6-r66\"<br \/>\n},<br \/>\n...]<br \/>\n<\/code><\/p>\n<p>Well that was easy. We have our data, all we need to do now is package it up and present it to the user.<\/p>\n<h2 id=\"step2win\">Step 2. 
Win.<\/h2>\n<p>For the purposes of this demonstration, I\u2019ve indexed 848,418 open access papers from <a href=\"http:\/\/europepmc.org\">Europe PMC<\/a> and analysed 18,765,516 citations, plus 10,837 software citations found within (<a href=\"https:\/\/github.com\/ScholarNinja\/importer\">source code here<\/a>). Once Scholar Ninja\u2019s network is fully operational, the indexing will be real-time and organic, as users navigate the web and read scientific papers \u2014 but have patience, we\u2019re not there yet. However, to help pass the time, if you install <a href=\"https:\/\/chrome.google.com\/webstore\/detail\/scholar-ninja\/mngpckgljabecionknlpnnbamopcehgp\">the extension<\/a> right now, every GitHub page where scientific software is detected and recommendations are available will be enhanced by a neat, unobtrusive panel full of great scientific software:<\/p>\n<p><img decoding=\"async\" class=\"aligncenter wp-post-image\" style=\"float: none\" alt=\"interface2\" src=\"https:\/\/mozscienceblog.wpengine.com\/wp-content\/uploads\/2014\/07\/interface2.png\" width=\"300\" \/><\/p>\n<p>There are <a href=\"https:\/\/www.youtube.com\/watch?v=sAz_UvnUeuU\">eleven<\/a> recommendations for <a href=\"https:\/\/github.com\/najoshi\/sickle\">https:\/\/github.com\/najoshi\/sickle<\/a>. Eleven pieces of software were cited in papers where <strong>sickle<\/strong> was cited and in papers which the paper where <strong>sickle<\/strong> was cited cited. <em>Wooh.<\/em> The external link icon above signifies citations; for example, we found 6 citations for <strong>scythe<\/strong>. The rest should be self-explanatory. This view is then simply embedded into the standard GitHub interface:<\/p>\n<p><img decoding=\"async\" class=\"aligncenter wp-post-image\" style=\"float: none\" alt=\"interface3\" src=\"https:\/\/mozscienceblog.wpengine.com\/wp-content\/uploads\/2014\/07\/interface3.png\" width=\"700\" \/><\/p>\n<p>Awesome? I sure think so. 
Be sure to <a href=\"https:\/\/chrome.google.com\/webstore\/detail\/scholar-ninja\/mngpckgljabecionknlpnnbamopcehgp\">install the extension<\/a> and try it out! And <strong>this is just a very very tiny glimpse of what the future holds if we build an open API for science.<\/strong> <a href=\"https:\/\/github.com\/ScholarNinja\/extension\">Come on, join us!<\/a><\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter wp-post-image\" style=\"float: none\" alt=\"XKpHM\" src=\"https:\/\/mozscienceblog.wpengine.com\/wp-content\/uploads\/2014\/07\/XKpHM.gif\" width=\"259\" height=\"196\" \/><\/p>\n","protected":false},"excerpt":{"rendered":"<p>This is a guest post by Jure Triglav, an open science hacker. You can check out http:\/\/juretriglav.si\u00a0for more of his projects or follow him on Twitter. TL;DR: An open API for science helps researchers discover great software. Install the Scholar &hellip; <a class=\"go\" href=\"https:\/\/blog.mozilla.org\/foundation-archive\/mozilla-science\/discovery-of-scientific-software\/\">Continue 
reading<\/a><\/p>\n","protected":false},"author":144,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[290376],"tags":[],"_links":{"self":[{"href":"https:\/\/blog.mozilla.org\/foundation-archive\/wp-json\/wp\/v2\/posts\/743"}],"collection":[{"href":"https:\/\/blog.mozilla.org\/foundation-archive\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.mozilla.org\/foundation-archive\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/foundation-archive\/wp-json\/wp\/v2\/users\/144"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/foundation-archive\/wp-json\/wp\/v2\/comments?post=743"}],"version-history":[{"count":0,"href":"https:\/\/blog.mozilla.org\/foundation-archive\/wp-json\/wp\/v2\/posts\/743\/revisions"}],"wp:attachment":[{"href":"https:\/\/blog.mozilla.org\/foundation-archive\/wp-json\/wp\/v2\/media?parent=743"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.mozilla.org\/foundation-archive\/wp-json\/wp\/v2\/categories?post=743"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.mozilla.org\/foundation-archive\/wp-json\/wp\/v2\/tags?post=743"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}