Forty-nine (49) dissertations on machine learning from Carnegie Mellon as of today.
Cold weather is setting in (in the Northern Hemisphere) so take it as additional reading material.
I saw a tweet today from Clojure/conj:
We will be doing same day conference video publishing this week at #clojure_conj! Watch youtube.com/user/ClojureTV for updates
Now there’s a great idea!
Mining Idioms from Source Code by Miltiadis Allamanis and Charles Sutton.
We present the first method for automatically mining code idioms from a corpus of previously written, idiomatic software projects. We take the view that a code idiom is a syntactic fragment that recurs across projects and has a single semantic role. Idioms may have metavariables, such as the body of a for loop. Modern IDEs commonly provide facilities for manually defining idioms and inserting them on demand, but this does not help programmers to write idiomatic code in languages or using libraries with which they are unfamiliar. We present HAGGIS, a system for mining code idioms that builds on recent advanced techniques from statistical natural language processing, namely, nonparametric Bayesian probabilistic tree substitution grammars. We apply HAGGIS to several of the most popular open source projects from GitHub. We present a wide range of evidence that the resulting idioms are semantically meaningful, demonstrating that they do indeed recur across software projects and that they occur more frequently in illustrative code examples collected from a Q&A site. Manual examination of the most common idioms indicate that they describe important program concepts, including object creation, exception handling, and resource management.
A deeply interesting paper that identifies code idioms without the idioms being specified in advance.
Opens up a path to further investigation of programming idioms and annotation of such idioms.
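To make the idea concrete, here is a drastically simplified, frequency-based sketch in Python. HAGGIS uses nonparametric Bayesian tree substitution grammars, not raw counting, and the shape encoding below is invented for illustration. But it shows the core intuition: syntactic fragments that recur across code bases, with identifiers abstracted away (playing the role of metavariables), are idiom candidates.

```python
import ast
from collections import Counter

def shape(node, depth=2):
    """Encode a node's syntactic shape to a bounded depth, ignoring
    identifiers and literals (they act like metavariables)."""
    if depth == 0:
        return type(node).__name__
    kids = [shape(c, depth - 1) for c in ast.iter_child_nodes(node)]
    return type(node).__name__ + (f"({','.join(kids)})" if kids else "")

def mine_fragments(sources, min_count=2):
    """Keep fragments that recur across snippets (counted once per snippet)."""
    counts = Counter()
    for src in sources:
        counts.update({shape(n) for n in ast.walk(ast.parse(src))})
    return {s: n for s, n in counts.items() if n >= min_count}

# Two snippets sharing the resource-management idiom `with open(...) as x:`.
snippets = [
    "with open('a.txt') as f:\n    data = f.read()",
    "with open('b.txt') as g:\n    text = g.read()",
]
fragments = mine_fragments(snippets)
# The `With` fragment recurs even though names and filenames differ.
```

Note how the two snippets share no identifiers, yet the `with open(...)` fragment is detected; that is the "single semantic role" the paper is after.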
I first saw this in: Mining Idioms from Source Code – Miltiadis Allamanis, a review of a presentation by Felienne Hermans.
Science fiction fanzines to be digitized as part of major UI initiative by Kristi Bontrager.
From the post:
The University of Iowa Libraries has announced a major digitization initiative, in partnership with the UI Office of the Vice President for Research and Economic Development. 10,000 science fiction fanzines will be digitized from the James L. “Rusty” Hevelin Collection, representing the entire history of science fiction as a popular genre and providing the content for a database that documents the development of science fiction fandom.
Hevelin was a fan and a collector for most of his life. He bought pulp magazines from newsstands as a boy in the 1930s, and by the early 1940s began attending some of the first organized science fiction conventions. He remained an active collector, fanzine creator, book dealer, and fan until his death in 2011. Hevelin’s collection came to the UI Libraries in 2012, contributing significantly to the UI Libraries’ reputation as a major international center for science fiction and fandom studies.
Interesting content for many of us but an even more interesting work flow model for the content:
Once digitized, the fanzines will be incorporated into the UI Libraries’ DIY History interface, where a select number of interested fans (up to 30) will be provided with secure access to transcribe, annotate, and index the contents of the fanzines. This group will be modeled on an Amateur Press Association (APA) structure, a fanzine distribution system developed in the early days of the medium that required contributions of content from members in order to qualify for, and maintain, membership in the organization. The transcription will enable the UI Libraries to construct a full-text searchable fanzine resource, with links to authors, editors, and topics, while protecting privacy and copyright by limiting access to the full set of page images.
The similarity between the Amateur Press Association (APA) structure and modern open source projects is interesting. I checked the APA’s homepage; they have a more traditional membership fee now.
In Suffix Trees and their Applications in String Algorithms, I pointed out that a subset of the terms for “suffix tree” resulted in About 1,830,000 results (0.22 seconds).
Not a very useful result, even for the most dedicated of graduate students.
A better result would be an index entry for “suffix tree” that included results using its alternative names and enabled the user to quickly navigate to sub-entries under “suffix tree.”
To illustrate the benefit from actual indexing, consider that “Suffix Trees and their Applications in String Algorithms” lists only three keywords: “Pattern matching, String algorithms, Suffix tree.” Would you look at this paper for techniques on software maintenance?
Probably not, which would be a mistake. Section 4 covers the use of “parameterized pattern matching” for software maintenance of large programs in a fair amount of depth. Certainly more so than it covers “multidimensional pattern matching,” which, despite being presented in the abstract and conclusion as a major theme of the paper, appears nowhere else in it. (“Higher dimensions” is mentioned on page 3, but only in two sentences with references.)
A properly constructed index would break out both “parameterized pattern matching” and “software maintenance” as key subjects that occur in this paper. A bit easier to find than wading through 1,830,000 “results.”
Before anyone comments that such granular indexing would be too time consuming or expensive, recall the citation rates for computer science, 2000–2010:

Field             2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  All years
Computer science  7.17  7.66  7.93  5.35  3.99  3.51  2.51  3.26  2.13  0.98  0.15  3.75
The reason for the declining numbers is that citations to papers from the year 2000 decline over time.
But even the highest rate, 7.93 in 2002, covers only a small fraction of the papers published in 2000.
At one point in journal publication history, manual indexing was universal. But that was before full text searching became a reality and the scientific publication rate exploded.
The STM Report by Mark Ware and Michael Mabe.
Rather than an all human indexing model (not possible due to the rate of publication, costs) or an all computer-based searching model (leads to poor results as described above), why not consider a bifurcated indexing/search model?
The well over 90% of CS publications that aren’t cited should be subject to computer-based indexing and search models. On the other hand, the meager 8% that are cited, perhaps subject to some scale of citation, could be curated by human/machine assisted indexing.
Human/machine assisted indexing would increase access to material already selected by other readers. Perhaps even as a value-add product as opposed to take your chances with search access.
Suffix Trees and their Applications in String Algorithms by Roberto Grossi and Giuseppe F. Italiano.
The suffix tree is a compacted trie that stores all suffixes of a given text string. This data structure has been intensively employed in pattern matching on strings and trees, with a wide range of applications, such as molecular biology, data processing, text editing, term rewriting, interpreter design, information retrieval, abstract data types and many others.
In this paper, we survey some applications of suffix trees and some algorithmic techniques for their construction. Special emphasis is given to the most recent developments in this area, such as parallel algorithms for suffix tree construction and generalizations of suffix trees to higher dimensions, which are important in multidimensional pattern matching.
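As a reminder of the data structure being surveyed, here is a naive suffix trie sketch in Python. A real suffix tree compacts unary paths and can be built in linear time (McCreight's or Ukkonen's algorithm); this quadratic version only illustrates why every substring query becomes a single root-to-node walk.

```python
def build_suffix_trie(text):
    """Naive O(n^2) suffix trie; real suffix trees compact unary paths
    and are built in O(n) time (McCreight, Ukkonen)."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:] + "$":  # terminator marks where each suffix ends
            node = node.setdefault(ch, {})
    return root

def contains(trie, pattern):
    """Every substring is a prefix of some suffix, so substring search
    is one walk from the root, character by character."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("banana")
contains(trie, "ana")  # → True
```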
The authors point out that “suffix tree” is only one of the names for this subject:
The importance of the suffix tree is underlined by the fact that it has been rediscovered many times in the scientific literature, disguised under different names, and that it has been studied under numerous variations. Just to mention a few appearances of the suffix tree, we cite the compacted bi-tree, the prefix tree, the PAT tree, the position tree [3, 65, 75], the repetition finder, and the subword tree [8, 24]….
Which is an advantage if you are researching another survey paper and tracing every thread on suffix trees by whatever name, but not so much of an advantage if you miss this paper, or an application of it, because it appears under a name other than “suffix tree.”
Of course, a search with:
“suffix tree” OR “compacted bi-tree” OR “prefix tree” OR “PAT tree” OR “position tree” OR “repetition finder” OR “subword tree”
that returns About 1,830,000 results (0.22 seconds), isn’t very helpful.
In part because no one is going to examine 1,830,000 results and that is a subset of all the terms for suffix trees.
I think we can do better than that and without an unreasonable expenditure of resources. (See: Less Than Universal & Uniform Indexing)
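A sketch of what such an index entry might look like, with the alternative names from Grossi and Italiano folded in. The sub-entries below are hypothetical, invented for illustration:

```python
# Hypothetical index entry: a canonical term, its alternative names,
# and navigable sub-entries (sub-entries invented for illustration).
INDEX = {
    "suffix tree": {
        "aliases": {"compacted bi-tree", "prefix tree", "PAT tree",
                    "position tree", "repetition finder", "subword tree"},
        "subentries": ["construction, parallel",
                       "higher dimensions",
                       "pattern matching, parameterized"],
    },
}

def canonical(term):
    """Map a query term, by any of its names, to its canonical entry."""
    term = term.lower()
    for entry, data in INDEX.items():
        if term == entry or term in {a.lower() for a in data["aliases"]}:
            return entry
    return None

canonical("PAT tree")  # → "suffix tree"
```

A reader who only knows the name “PAT tree” lands on the same entry, and its sub-entries, as one who searched for “suffix tree.” That is the navigation the seven-term OR query above cannot provide.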
Unless you have been in a coma or just arrived from off-world, you have probably heard about #shirtgate/#shirtstorm. If not, take a minute to search on those hash tags to come up to speed.
During the ensuing flood of posts, tweets, etc., I happened to stumble upon To the science guys who want to understand #shirtstorm by Janet D. Stemwedel.
It is impressive because despite the inability of men and women to fully appreciate the rhetoric of the other gender, Stemwedel finds a third rhetoric, that of science, in which to conduct her argument.
Not that the rhetoric of science is a perfect fit for either gender but it is a rhetoric in which both genders share some assumptions and methods of reasoning. Those partially shared assumptions and methods make Stemwedel’s argument effective.
Take her comments on data gathering (formatted on her blog as tweets):
So, first big point: women’s accounts of their own experiences are better data than your preexisting hunches about their experiences.
Another thing you science guys know: sometimes we observe unexpected outcomes. We don’t say, That SHOULDN’T happen! but, WHY did it happen?
Imagine, for sake of arg, that women’s rxn to @mggtTaylor’s porny shirt was a TOTAL surprise. Do you claim that rxn shouldn’t hv happened?
Or, do you think like a scientist & try to understand WHY it happened? Do you stay stuck in your hunches or get some relevant data?
Do you recognize that women’s experiences in & with science (plus larger society) may make effect of porny shirt on #Rosetta publicity…
…on those women different than effect of porny shirt was on y’all science guys? Or that women KNOW how they feel about it better than you?
Science guys telling women “You shouldn’t be mad about porny shirt on #Rosetta video because…” is modeling bad scientific method!
Finding a common rhetoric is at the core of creating sustainable mappings between differing semantics. Stemwedel illustrates the potential for such a rhetoric even in a highly charged situation.
PS: You need to read Stemwedel’s post in the original.
If you want to start a debate among faculty this holiday season, print this graphic out and leave it lying around with one or two local names penciled in.
For example, I would not list naive realism as a “philosophy of science” as much as an error, taken for a “philosophy of science.”
I first saw this as Positions in the philosophy of science by Chris Blattman.
When Information Design is a Matter of Life or Death by Thomas Bohm.
From the post:
In 2008, Lloyds Pharmacy conducted 20-minute interviews with 1,961 UK adults. Almost one in five people admitted to having taken prescription medicines incorrectly; more than eight million adults have either misread medicine labels or misunderstood the instructions, resulting in them taking the wrong dose or taking medication at the wrong time of day. In addition, the overall problem seemed to be more acute among older patients.
Medicine or patient information leaflets refer to the document included inside medicine packaging and are typically printed on thin paper (see figures 1.1–1.4). They are essential for the safe use of medicines and help answer people’s questions when taking the medicine.
If the leaflet works well, it can lead to people taking the medicine correctly, hopefully improving their health and wellness. If it works poorly, it can lead to adverse side effects, harm, or even death. Subsequently, leaflets are heavily regulated in the way they need to be designed, written, and produced. European and individual national legislation sets out the information to be provided, in a specific order, within a medicine information leaflet.
A good reminder that failure to communicate in some information systems has more severe penalties than others.
I was reminded while reading the “thin paper” example:
Medicine information leaflets are often printed on thin paper and folded many times to fit into the medicine package. There is a lot of show-through from the information printed on the back of the leaflet, which decreases readability. When the leaflet is unfolded, the paper crease marks affect the readability of the text (see figures 1.3 and 1.4). A possible improvement would be to print the leaflet on a thicker paper.
of an information leaflet that unfolded to be 18 inches wide and 24 inches long. A real tribute to the folding art. The typeface was challenging even with glasses and a magnifying glass. Too tiring to read much of it.
I don’t think thicker paper would have helped, unless the information leaflet became an information booklet.
What are the consequences if someone misreads your interface?
From the post:
For enterprise customers who value agility but can’t compromise on resiliency, MarkLogic software is the only database platform that integrates Google-like search with rich query and semantics into an intelligent and extensible data layer that works equally well in a data center or in the cloud. Unlike other NoSQL solutions, MarkLogic provides ACID transactions, HA, DR, and other hardened features that enterprises require, along with the scalability and agility they need to accelerate their business.
“As more complex data, much of it semi-structured, becomes increasingly important to businesses’ daily operations, enterprises are realizing that they must look beyond relational databases to help them understand, integrate, and manage all of their data, deriving maximum value in a simple, yet sophisticated manner,” said Carl Olofson, research vice president at IDC. “MarkLogic has a history of bringing advanced data management technology to market and many of their customers and partners are accustomed to managing complex data in an agile manner. As a result, they have a more mature and creative view of how to manage and use data than do mainstream database users. MarkLogic 8 offers some very advanced tools and capabilities, which could expand the market’s definition of enterprise database technology.”
I’m not in the early release program but if you are, heads up!
By “semantics,” MarkLogic means RDF triples and the ability to query those triples with text, values, etc.
Since we can all see triples, text and values with different semantics, your semantic mileage with MarkLogic may vary greatly.
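For readers who haven’t met RDF triples, here is a toy, dictionary-level illustration of triple pattern matching in Python. This is not MarkLogic’s API (MarkLogic exposes SPARQL and its own search layer); it only shows the subject/predicate/object model being queried with wildcards. All identifiers below are invented.

```python
# Toy in-memory triple store; real triple stores expose SPARQL and
# index all permutations of subject/predicate/object. Data is invented.
triples = {
    ("ex:grossi94", "dc:subject", "suffix tree"),
    ("ex:grossi94", "dc:creator", "Roberto Grossi"),
    ("ex:allamanis14", "dc:subject", "code idioms"),
}

def query(s=None, p=None, o=None):
    """Match triples against a pattern; None acts as a wildcard."""
    return sorted(t for t in triples
                  if s in (None, t[0]) and p in (None, t[1]) and o in (None, t[2]))

subjects_about_suffix_trees = [t[0] for t in query(p="dc:subject", o="suffix tree")]
# → ["ex:grossi94"]
```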
From the introduction:
This two-part blog post tells the story of my venturing into Clojure. To get a better grasp of the language, I wanted to move beyond solving programming puzzles and build something tangible in the browser. Omingard is a Solitaire-like HTML5 card game built with Om, a ClojureScript interface to Facebook’s React.
In this first part, “My Way into Clojure”, I’ll provide some background on why I built Omingard and introduce the concepts behind Clojure. What’s so fascinating about functional programming in general, and Clojure in particular, and why was the appearance of Om a game changer for me?
In the upcoming second part, “Building a Card Game with Om”, we’ll look at how I built Omingard. What are the rules of the game, and what role do React, ClojureScript, Om, Leiningen, Garden, and Google Closure Tools play in its implementation? We’ll also take a detailed look at the concepts behind Om, and how it achieves even faster performance than React.
This is a very cool exercise in learning Clojure.
Do try the game. The version I know has slightly different rules than the ones I observe here.
Apache Lucene™ 5.0.0 is coming! by Michael McCandless.
There are no promises for the exact timing (it’s done when it’s done!), but we already have a volunteer release manager (thank you Anshum!).
A major release in Lucene means all deprecated APIs (as of 4.10.x) are dropped, support for 3.x indices is removed while the numerous 4.x index formats are still supported for index backwards compatibility, and the 4.10.x branch becomes our bug-fix only release series (no new features, no API changes).
5.0.0 already contains a number of exciting changes, which I describe below, and they are still rolling in with ongoing active development.
Michael has a great list and explanation of changes you will be seeing in Lucene 5.0.0. Pick your favorite(s) to follow and/or contribute to the next release.
Programming in the Life Sciences by Egon Willighagen.
From the first post in this series, Programming in the Life Sciences #1: a six day course (October, 2013):
Our department will soon start the course Programming in the Life Sciences for a group of some 10 students from the Maastricht Science Programme. This is the first time we give this course, and over the next weeks I will be blogging about this course. First, some information. These are the goals, to use programming to:
So, this course will be a mix of things. I will likely start with a lecture or two about scientific programming, such as the importance of reproducibility, licensing, documentation, and (unit) testing. To achieve these learning goals we have set a problem. The description is:
So, it becomes pretty clear what the students will be doing. They only have six days, so it won’t be much. It’s just to learn them the basic skills. The students are in their 3rd year at the university, and because of the nature of the programme they follow, a mixed background in biology, mathematics, chemistry, and physics. So, I have a good hope they will surprise me in what they will get done.
Pharmacology is the basic topic: drug-protein interaction, but the students are free to select a research question. In fact, I will not care that much what they like to study, as long as they do it properly. They will start with Open PHACTS’ Linked Data API, but here too, they are free to complement data from the OPS cache with additional information. I hope they do.
(For the Dutch readers, would #mscpils be a good tag?)
For quite a few “next weeks,” Egon’s blogging has gone on and life sciences, to say nothing of his readers, are all better off for it! His most recent post is titled: Programming in the Life Sciences #20: extracting data from JSON.
Definitely a series to catch or to pass along for anyone involved in life sciences.
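As a taste of what post #20 in the series covers, extracting fields from a JSON response takes only a few lines of Python. The response shape below is hypothetical, loosely modeled on what a Linked Data API such as Open PHACTS might return; the field names are invented.

```python
import json

# Hypothetical, simplified response in the shape a Linked Data API
# might return (field names invented for illustration).
raw = """{
  "result": {
    "items": [
      {"compound": "aspirin", "target": "COX-1", "activity": 1.2},
      {"compound": "aspirin", "target": "COX-2", "activity": 0.9}
    ]
  }
}"""

data = json.loads(raw)
targets = [item["target"] for item in data["result"]["items"]]
# → ["COX-1", "COX-2"]
```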
Apache Spark RefCardz by Ashwini Kuntamukkala.
Counting the cover, four (4) of the eight (8) pages don’t qualify for inclusion in a cheat sheet or refcard. Depending on how hard you want to push that, it could easily be six (6) of the eight (8) pages that should not be in a cheat sheet or refcard.
The “extra” information present on this RefCard is useful but you will rapidly outgrow it. Unless you routinely need help installing Apache Spark or working a basic word count problem.
A two (2) page (front/back) cheatsheet for Spark would be more useful.
This is your Brain on Big Data: A Review of “The Organized Mind” by Stephen Few.
From the post:
In the past few years, several fine books have been written by neuroscientists. In this blog I’ve reviewed those that are most useful and placed Daniel Kahneman’s Thinking, Fast & Slow at the top of the heap. I’ve now found its worthy companion: The Organized Mind: Thinking Straight in the Age of Information Overload.
This new book by Daniel J. Levitin explains how our brains have evolved to process information and he applies this knowledge to several of the most important realms of life: our homes, our social connections, our time, our businesses, our decisions, and the education of our children. Knowing how our minds manage attention and memory, especially their limitations and the ways that we can offload and organize information to work around these limitations, is essential for anyone who works with data.
See Stephen’s review for an excerpt from the introduction and summary comments on the work as a whole.
I am particularly looking forward to reading Levitin’s take on the transfer of information tasks to us and the resulting cognitive overload.
I don’t have the volume yet, but it occurs to me that the shift from indexes (Readers Guide to Periodical Literature and the like) and librarians to full text search engines is yet another example of the transfer of information tasks to us.
Indexers and librarians do a better job of finding information than we do because discovery of information is a difficult intellectual task. Well, perhaps, discovering relevant and useful information is a difficult task. Almost without exception, every search produces a result on major search engines. Perhaps not a useful result but a result none the less.
Using indexers and librarians will produce a line item in someone’s budget. What is needed is research on the differential between the results from indexer/librarians versus us and what that translates to as a line item in enterprise budgets.
That type of research could influence university, government and corporate budgets as the information age moves into high gear.
The Organized Mind by Daniel J. Levitin is a must have for the holiday wish list!
Functional and Reactive Domain Modeling by Debasish Ghosh.
From the post:
Manning has launched the MEAP of my upcoming book on Domain Modeling.
The first time I was formally introduced to the topic was way back when I played around with Eric Evans’ awesome text on the subject of Domain Driven Design. In the book he discusses various object lifecycle patterns like the Factory, Aggregate or Repository that help separation of concerns when you are implementing the various interactions between the elements of the domain model. Entities are artifacts with identities, value objects are pure values while services model the coarse level use cases of the model components.
In Functional and Reactive Domain Modeling I look at the problem with a different lens. The primary focus of the book is to encourage building domain models using the principles of functional programming. It’s a completely orthogonal approach than OO and focuses on verbs first (as opposed to nouns first in OO), algebra first (as opposed to objects in OO), function composition first (as opposed to object composition in OO), lightweight objects as ADTs (instead of rich class models).
The book starts with the basics of functional programming principles and discusses the virtues of purity and the advantages of keeping side-effects decoupled from the core business logic. The book uses Scala as the programming language and does an extensive discussion on why the OO and functional features of Scala are a perfect fit for modelling complex domains. Chapter 3 starts the core subject of functional domain modeling with real world examples illustrating how we can make good use of patterns like smart constructors, monads and monoids in implementing your domain model. The main virtue that these patterns bring to your model is genericity – they help you extract generic algebra from domain specific logic into parametric functions which are far more reusable and less error prone. Chapter 4 focuses on advanced usages like typeclass based design and patterns like monad transformers, kleislis and other forms of compositional idioms of functional programming. One of the primary focus of the book is an emphasis on algebraic API design and to develop an appreciation towards ability to reason about your model.
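The book’s examples are in Scala, but the “smart constructor” pattern mentioned above translates to a few lines of any language. A hedged Python sketch, with an invented AccountNo type for illustration: construction is the only sanctioned way to obtain a value, so the invariant holds everywhere downstream.

```python
from dataclasses import dataclass

# AccountNo is an invented example type; the book works in Scala.
@dataclass(frozen=True)
class AccountNo:
    value: str

def make_account_no(raw: str) -> AccountNo:
    """Smart constructor: the only sanctioned way to build an AccountNo,
    so every AccountNo in the system satisfies the invariant."""
    cleaned = raw.strip()
    if not cleaned:
        raise ValueError("account number cannot be blank")
    return AccountNo(cleaned)

acct = make_account_no(" 12345 ")
# acct.value == "12345"; make_account_no("   ") raises ValueError
```

The payoff is the one the book stresses: validation lives in one place, and the rest of the model can reason algebraically about values that are correct by construction.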
An easy choice for your holiday wish list! Being a MEAP, it will continue to be “new” for quite some time.
From the post:
NLM is pleased to announce the following releases available for download:
For more information about CMT, please see the NLM CMT Frequently Asked Questions page.
The purpose of this subset is to provide the frequently used SNOMED CT concepts for use in general/family practice electronic health records within the following data fields: reason for encounter, and health issue. The purpose of the map from the SNOMED CT GP/FP subset to ICPC-2 is to allow for the granular concepts to be recorded by GPs/FPs at the point of care using SNOMED CT, with subsequent analysis and reporting using the internationally recognized ICPC-2 classification. However please note that use within clinical systems cannot be supported at this time. This Candidate Baseline is distributed for evaluation purposes only and should not be used in production clinical systems or in clinical settings.
The subsets are aligned to the July 2014 SNOMED CT International Release. The SNOMED CT to ICPC-2 map is a Candidate Baseline, which IHTSDO expects to confirm as the Baseline release following the January 2015 SNOMED CT International Release.
If your work in any way touches upon medical terminology, Convergent Medical Terminology (CMT) and SNOMED CT (Systematized Nomenclature of Medicine–Clinical Terms), among other collections of medical terminology, will be of interest to you.
Medical terminology is a small part of the world at large and you can see what it takes for the NLM to maintain a semblance of order. Great benefits flow even from a semblance of order, but those benefits are not free.
Greek Digitisation Project Update: 40 Manuscripts Newly Uploaded by Sarah J Biggs.
From the post:
We have now passed the half-way point of this phase of the Greek Manuscripts Digitisation Project, generously funded by the Stavros Niarchos Foundation and many others, including the A. G. Leventis Foundation, Sam Fogg, the Sylvia Ioannou Foundation, the Thriplow Charitable Trust, and the Friends of the British Library. What treasures are in store for you this month? To begin with, there are quite a few interesting 17th- and 18th-century items to look at, including two very fine 18th-century charters, with seals intact, an iconographic sketch-book (Add MS 43868), and a fascinating Greek translation of an account of the siege of Vienna in 1683 (Add MS 38890). We continue to upload some really exciting Greek bindings – of particular note here are Add MS 24372 and Add MS 36823. A number of scrolls have also been uploaded, mostly containing the Liturgy of Basil of Caesarea. A number of Biblical manuscripts are included, too, but this month two manuscripts of classical authors take pride of place: Harley MS 5600, a stunning manuscript of the Iliad from 15th-century Florence, and Burney MS 111, a lavishly decorated copy of Ptolemy’s Geographia.
Additional riches from the British Library!
Spark: Parse CSV file and group by column value by Mark Needham.
Mark parses a 1GB file that details 4 million crimes from the City of Chicago.
And he does it two ways: Using Unix and Spark.
Results? One way took more than 2 minutes, the other way, less than 10 seconds.
Place your bets with office staff and then visit Mark’s post for the results.
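No spoilers on Mark’s numbers, but the shape of the task itself (group rows by a column’s value and count them) fits in a few lines of plain Python. The miniature CSV below is a hypothetical stand-in for the Chicago crime data:

```python
import csv
from collections import Counter
from io import StringIO

# Hypothetical miniature of the Chicago crime CSV (the real file is
# roughly 1GB with about 4 million rows; columns here are stand-ins).
sample = """ID,Primary Type,Year
1,THEFT,2014
2,BATTERY,2014
3,THEFT,2013
"""

def count_by_column(lines, column):
    """Group rows by one column's value and count the rows in each group."""
    return Counter(row[column] for row in csv.DictReader(lines))

counts = count_by_column(StringIO(sample), "Primary Type")
# counts == Counter({'THEFT': 2, 'BATTERY': 1})
```

At 4 million rows the interesting question becomes which tool does this fastest, which is exactly the comparison Mark runs.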
Defence: a quick guide to key internet links by David Watt and Nicole Brangwin.
While browsing at Full Text Reports, I saw this title with the following listing of contents:
The document is a five (5) page PDF file that has a significant number of links, particularly to Australian military resources. Under “Foreign defence” I did find the Chinese People’s Liberation Army but no link for ISIL.
This may save you some time if you are spidering Australian military sites but appears to be incomplete for other areas.