Another word for it

Subscribe to Another word for it feed
Updated: 6 days 7 hours ago

Scikit-learn 0.15 release

Thu, 07/17/2014 - 23:16


Topic Maps

Scikit-learn 0.15 release by Gaël Varoquaux.

From the post:


Quality— Looking at the commit log, there has been a huge amount of work to fix minor annoying issues.

Speed— There has been a huge effort put in making many parts of scikit-learn faster. Little details all over the codebase. We do hope that you’ll find that your applications run faster. For instance, we find that the worst case speed of Ward clustering is 1.5 times faster in 0.15 than 0.14. K-means clustering is often 1.1 times faster. KNN, when used in brute-force mode, got faster by a factor of 2 or 3.

Random Forest and various tree methods— The random forest and various tree methods are much much faster, use parallel computing much better, and use less memory. For instance, the picture on the right shows the scikit-learn random forest running in parallel on a fat Amazon node, and nicely using all the CPUs with little RAM usage.

Hierarchical aglomerative clusteringComplete linkage and average linkage clustering have been added. The benefit of these approach compared to the existing Ward clustering is that they can take an arbitrary distance matrix.

Robust linear models— Scikit-learn now includes RANSAC for robust linear regression.

HMM are deprecated— We have been discussing for a long time removing HMMs, that do not fit in the focus of scikit-learn on predictive modeling. We have created a separate hmmlearn repository for the HMM code. It is looking for maintainers.

And much more— plenty of “minor things”, such as better support for sparse data, better support for multi-label data…

Get thee to Scikit-learn!

April 2014 Crawl Data Available

Thu, 07/17/2014 - 19:40


Topic Maps

April 2014 Crawl Data Available by Stephen Merity.

From the post:

The April crawl of 2014 is now available! The new dataset is over 183TB in size containing approximately 2.6 billion webpages. The new data is located in the aws-publicdatasets bucket at /common-crawl/crawl-data/CC-MAIN-2014-15/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://aws-publicdatasets/ or to each line, you end up with the S3 and HTTP paths respectively.

Thanks again to Blekko for their ongoing donation of URLs for our crawl!

Well, at 183TB, I don’t guess I am going to have a local copy.


FDA Recall Data

Wed, 07/16/2014 - 23:53


Topic Maps

OpenFDA Provides Ready Access to Recall Data by Taha A. Kass-Hout.

From the post:

Every year, hundreds of foods, drugs, and medical devices are recalled from the market by manufacturers. These products may be labeled incorrectly or might pose health or safety issues. Most recalls are voluntary; in some cases they may be ordered by the U.S. Food and Drug Administration. Recalls are reported to the FDA, and compiled into its Recall Enterprise System, or RES. Every week, the FDA releases an enforcement report that catalogues these recalls. And now, for the first time, there is an Application Programming Interface (API) that offers developers and researchers direct access to all of the drug, device, and food enforcement reports, dating back to 2004.

The recalls in this dataset provide an illuminating window into both the safety of individual products and the safety of the marketplace at large. Recent reports have included such recalls as certain food products (for not containing the vitamins listed on the label), a soba noodle salad (for containing unlisted soy ingredients), and a pain reliever (for not following laboratory testing requirements).

You will get warnings that this data is “not for clinical use.”

Sounds like a treasure trove of data if you are looking for products still being sold despite being recalled.

Or if you want to advertise for “victims” of faulty products that have been recalled.

I think both of those are non-clinical uses.

Darwin’s ship library goes online

Wed, 07/16/2014 - 20:48


Topic Maps

Darwin’s ship library goes online by Dennis Normile.

From the post:

As Charles Darwin cruised the world on the HMS Beagle, he had access to an unusually well-stocked 400-volume library. That collection, which contained the observations of numerous other naturalists and explorers, has now been recreated online. As of today, all of more than 195,000 pages and 5000 illustrations from the works are available for the perusal of scholars and armchair naturalists alike, thanks to the Darwin Online project.

Perhaps it isn’t the amount of information you have available but how deeply you understand it that makes a difference.


Which gene did you mean?

Wed, 07/16/2014 - 20:38


Topic Maps

Which gene did you mean? by Barend Mons.


Computational Biology needs computer-readable information records. Increasingly, meta-analysed and pre-digested information is being used in the follow up of high throughput experiments and other investigations that yield massive data sets. Semantic enrichment of plain text is crucial for computer aided analysis. In general people will think about semantic tagging as just another form of text mining, and that term has quite a negative connotation in the minds of some biologists who have been disappointed by classical approaches of text mining. Efforts so far have tried to develop tools and technologies that retrospectively extract the correct information from text, which is usually full of ambiguities. Although remarkable results have been obtained in experimental circumstances, the wide spread use of information mining tools is lagging behind earlier expectations. This commentary proposes to make semantic tagging an integral process to electronic publishing.

From within the post:

If all words had only one possible meaning, computers would be perfectly able to analyse texts. In reality however, words, terms and phrases in text are highly ambiguous. Knowledgeable people have few problems with these ambiguities when they read, because they use context to disambiguate ‘on the fly’. Even when fed a lot of semantically sound background information, however, computers currently lag far behind humans in their ability to interpret natural language. Therefore, proper semantic tagging of concepts in texts is crucial to make Computational Biology truly viable. Electronic Publishing has so far only scratched the surface of what is needed.

Open Access publication shows great potential, andis essential for effective information mining, but it will not achieve its full potential if information continues to be buried in plain text. Having true semantic mark up combined with open access for mining is an urgent need to make possible a computational approach to life sciences.

Creating semantically enriched content as part and parcel of the publication process should be a winning strategy.

First, for current data, estimates of what others will be searching for should not be hard to find out. That will help focus tagging on the material users are seeking. Second, a current and growing base of enriched material will help answer questions about the return on enriching material.

Other suggestions for BMC Bioinformatics?

Introducing Source Han Sans:…

Wed, 07/16/2014 - 19:57


Topic Maps

Introducing Source Han Sans: An open source Pan-CJK typeface by Caleb Belohlavek.

From the post:

Adobe, in partnership with Google, is pleased to announce the release of Source Han Sans, a new open source Pan-CJK typeface family that is now available on Typekit for desktop use. If you don’t have a Typekit account, it’s easy to set one up and start using the font immediately with our free subscription. And for those who want to play with the original source files, you can get those from our download page on SourceForge.

It’s rather difficult to describe your semantics when you can’t write in your own language.

Kudos to Adobe and Google for sponsoring this project!

I first saw this in a tweet by James Clark.

…[S]emantically enriched open pharmacological space…

Wed, 07/16/2014 - 19:25


Topic Maps

Scientific competency questions as the basis for semantically enriched open pharmacological space development by Kamal Azzaoui, et al. (Drug Discovery Today, Volume 18, Issues 17–18, September 2013, Pages 843–852)


Molecular information systems play an important part in modern data-driven drug discovery. They do not only support decision making but also enable new discoveries via association and inference. In this review, we outline the scientific requirements identified by the Innovative Medicines Initiative (IMI) Open PHACTS consortium for the design of an open pharmacological space (OPS) information system. The focus of this work is the integration of compound–target–pathway–disease/phenotype data for public and industrial drug discovery research. Typical scientific competency questions provided by the consortium members will be analyzed based on the underlying data concepts and associations needed to answer the questions. Publicly available data sources used to target these questions as well as the need for and potential of semantic web-based technology will be presented.

Pharmacology may not be your space but this is a good example of what it takes for semantic integration of resources in a complex area.

Despite the “…you too can be a brain surgeon with our new web-based app…” from various sources, semantic integration has been, is and will remain difficult under the best of circumstances.

I don’t say that to discourage anyone but to avoid the let-down when integration projects don’t provide easy returns.

It is far better to plan for incremental and measurable benefits along the way than to fashion grandiose goals that are ever receding on the horizon.

I first saw this in a tweet by ChemConnector.

Free Companies House data to boost UK economy

Tue, 07/15/2014 - 21:57


Topic Maps

Free Companies House data to boost UK economy

From the post:

Companies House is to make all of its digital data available free of charge. This will make the UK the first country to establish a truly open register of business information.

As a result, it will be easier for businesses and members of the public to research and scrutinise the activities and ownership of companies and connected individuals. Last year (2013/14), customers searching the Companies House website spent £8.7 million accessing company information on the register.

This is a considerable step forward in improving corporate transparency; a key strand of the G8 declaration at the Lough Erne summit in 2013.

It will also open up opportunities for entrepreneurs to come up with innovative ways of using the information.

This change will come into effect from the second quarter of 2015 (April – June).

In a side bar, Business Secretary Vince Cable said in part:

Companies House is making the UK a more transparent, efficient and effective place to do business.

I’m not sure about “efficient,” but providing incentives for lawyers and others to track down insider trading and other business as usual practices and arming them with open data would be a start in the right direction.

I first saw this in a tweet by Hadley Beeman.

Spy vs. Spies

Tue, 07/15/2014 - 21:41


Topic Maps

XRay: Enhancing the Web’s Transparency with Differential Correlation by Mathias Lécuyer, et al.


Today’s Web services – such as Google, Amazon, and Facebook – leverage user data for varied purposes, including personalizing recommendations, targeting advertisements, and adjusting prices. At present, users have little insight into how their data is being used. Hence, they cannot make informed choices about the services they choose. To increase transparency, we developed XRay, the first fine-grained, robust, and scalable personal data tracking system for the Web. XRay predicts which data in an arbitrary Web account (such as emails, searches, or viewed products) is being used to target which outputs (such as ads, recommended products, or prices). XRay’s core functions are service agnostic and easy to instantiate for new services, and they can track data within and across services. To make predictions independent of the audited service, XRay relies on the following insight: by comparing outputs from different accounts with similar, but not identical, subsets of data, one can pinpoint targeting through correlation. We show both theoretically, and through experiments on Gmail, Amazon, and YouTube, that XRay achieves high precision and recall by correlating data from a surprisingly small number of extra accounts.

Not immediately obvious, until someone explains it, but any system that reacts based on input you control can be investigated. Whether that includes dark marketing forces or government security agencies.

Be aware that provoking government security agencies is best left to professionals.

The next step will be to have bots that project false electronic trails for us to throw advertisers (or others) off track.

Very much worth your time to read.

Graph Classes and their Inclusions

Tue, 07/15/2014 - 21:25


Topic Maps

Information System on Graph Classes and their Inclusions

From the webpage:

What is ISGCI?

ISGCI is an encyclopaedia of graphclasses with an accompanying java application that helps you to research what’s known about particular graph classes. You can:

  • check the relation between graph classes and get a witness for the result
  • draw clear inclusion diagrams
  • colour these diagrams according to the complexity of selected problems
  • find the P/NP boundary for a problem
  • save your diagrams as Postscript, GraphML or SVG files
  • find references on classes, inclusions and algorithms

As of 214-07-06, the database contains 1497 classes and 176,888 inclusions.

If you are past the giddy stage of “Everything’s a graph!,” you may find this site useful.

Why Extended Attributes are Coming to HDFS

Sat, 06/28/2014 - 21:23


Topic Maps

Why Extended Attributes are Coming to HDFS by Charles Lamb.

From the post:

Extended attributes in HDFS will facilitate at-rest encryption for Project Rhino, but they have many other uses, too.

Many mainstream Linux filesystems implement extended attributes, which let you associate metadata with a file or directory beyond common “fixed” attributes like filesize, permissions, modification dates, and so on. Extended attributes are key/value pairs in which the values are optional; generally, the key and value sizes are limited to some implementation-specific limit. A filesystem that implements extended attributes also provides system calls and shell commands to get, list, set, and remove attributes (and values) to/from a file or directory.

Recently, my Intel colleague Yi Liu led the implementation of extended attributes for HDFS (HDFS-2006). This work is largely motivated by Cloudera and Intel contributions to bringing at-rest encryption to Apache Hadoop (HDFS-6134; also see this post) under Project Rhino – extended attributes will be the mechanism for associating encryption key metadata with files and encryption zones — but it’s easy to imagine lots of other places where they could be useful.

For instance, you might want to store a document’s author and subject in sometime like and user.subject=HDFS. You could store a file checksum in an attribute called user.checksum. Even just comments about a particular file or directory can be saved in an extended attribute.

In this post, you’ll learn some of the details of this feature from an HDFS user’s point of view.

Extended attributes sound like an interesting place to tuck away additional information about a file.

Such as the legend to be used to interpret it?

Communicating and resolving entity references

Fri, 06/27/2014 - 18:17


Topic Maps

Communicating and resolving entity references by R.V. Guha.


Statements about entities occur everywhere, from newspapers and web pages to structured databases. Correlating references to entities across systems that use different identifiers or names for them is a widespread problem. In this paper, we show how shared knowledge between systems can be used to solve this problem. We present “reference by description”, a formal model for resolving references. We provide some results on the conditions under which a randomly chosen entity in one system can, with high probability, be mapped to the same entity in a different system.

An eye appointment is going to prevent me from reading this paper closely today.

From a quick scan, do you think Guha is making a distinction between entities and subjects (in the topic map sense)?

What do you make of literals having no identity beyond their encoding? (page 4, #3)

Redundant descriptions? (page 7) Would you say that defining a set of properties that must match would qualify? (Or even just additional subject indicators?)

Expect to see a lot more comments on this paper.


I first saw this in a tweet by Stefano Bertolo.

Propositions as Types

Fri, 06/27/2014 - 17:56


Topic Maps

Propositions as Types by Philip Wadler.

From the Introduction::

Powerful insights arise from linking two fields of study previously thought separate. Examples include Descartes’s coordinates, which links geometry to algebra, Planck’s Quantum Theory, which links particles to waves, and Shannon’s Information Theory, which links thermodynamics to communication. Such a synthesis is offered by the principle of Propositions as Types, which links logic to computation. At first sight it appears to be a simple coincidence—almost a pun—but it turns out to be remarkably robust, inspiring the design of automated proof assistants and programming languages, and continuing to influence the forefronts of computing.

Propositions as Types is a notion with many names and many origins. It is closely related to the BHK Interpretation, a view of logic developed by the intuitionists Brouwer, Heyting, and Kolmogorov in the 1930s. It is often referred to as the Curry-Howard Isomorphism, referring to a correspondence observed by Curry in 1934 and refined by Howard in 1969 (though not published until 1980, in a Festschrift dedicated to Curry). Others draw attention to significant contributions from de Bruijn’s Automath and Martin-Löf’s Type Theory in the 1970s. Many variant names appear in the literature, including Formulae as Types, Curry-Howard-de Bruijn Correspondence, Brouwer’s Dictum, and others.

Propositions as Types is a notion with depth. It describes a correspondence between a given logic and a given programming language, for instance, between Gentzen’s intuitionistic natural deduction (a logic) and Church’s simply-typed lambda calculus (which may be viewed as a programming language). At the surface, it says that for each proposition in the logic there is a corresponding type in the programming language—and vice versa…

Important work even if it is very heavy sledding!

BTW, Wadler mentions two textbook treatments of the subject:

M. H. Sørensen and P. Urzyczyn. Lectures on the Curry-Howard isomorphism. Elsevier, 2006. Amazon has it listed for $146.33.

S. Thompson. Type Theory and Functional Programming. Addison-Wesley, 1991. Better luck here, out of print and posted online by the author: Errata page was last updated October 2013.

I just glanced at 4.10 Equality and 5.1 Assumptions – 5.2 Naming and abbreviations in Thompson and it promises to be an interesting read!


I first saw this in a tweet by Chris Ford.

Charities, Transparency and Trade Secrets

Fri, 06/27/2014 - 00:00


Topic Maps

Red Cross: How We Spent Sandy Money Is a ‘Trade Secret’ by Justin Elliott.

From the post:

Just how badly does the American Red Cross want to keep secret how it raised and spent over $300 million after Hurricane Sandy?

The charity has hired a fancy law firm to fight a public request we filed with New York state, arguing that information about its Sandy activities is a “trade secret.”

The Red Cross’ “trade secret” argument has persuaded the state to redact some material, though it’s not clear yet how much since the documents haven’t yet been released.

The documents include “internal and proprietary methodology and procedures for fundraising, confidential information about its internal operations, and confidential financial information,” wrote Gabrielle Levin of Gibson Dunn in a letter to the attorney general’s office.

If those details were disclosed, “the American Red Cross would suffer competitive harm because its competitors would be able to mimic the American Red Cross’s business model for an increased competitive advantage,” Levin wrote.

The letter doesn’t specify who the Red Cross’ “competitors” are.

I see bizarre stories on a regular basis but this is a real “man bites dog” sort of story.

See Justin’s post for the details, such as are known now. I am sure there will be follow up stories on these records.

It may just be my background but when anyone, government, charity, industry, assures me that information I can’t see is ok, that sets off multiple alarm bells.


PS: Not that I think transparency automatically leads to better government or decision making. I do know that a lack of transparency, cf. the NSA, leads to very poor decision making.

Graphing 173 Million Taxi Rides

Thu, 06/26/2014 - 23:43


Topic Maps

Interesting taxi rides dataset by Danny Bickson.

From the post:

I got the following from my collaborator Zach Nation. NY taxi ride dataset that was not properly anonymized and was reverse engineered to find interesting insights in the data.

Danny mapped the data using GraphLab and asks some interesting questions of the data.

BTW, Danny is offering the iPython notebook to play with!


This is the same data set I mentioned in: On Taxis and Rainbows

Asteroid Hunting!

Thu, 06/26/2014 - 23:29


Topic Maps

Planetary Resources Wants Public to Help Find Asteroids by Doug Messier.

From the post:

Planetary Resources, the asteroid mining company, and Zooniverse today launched Asteroid Zoo (, empowering students, citizen scientists and space enthusiasts to aid in the search for previously undiscovered asteroids. The program allows the public to join the search for Near Earth Asteroids (NEAs) of interest to scientists, NASA and asteroid miners, while helping to train computers to better find them in the future.

Asteroid Zoo joins the Zooniverse’s family of more than 25 citizen science projects! It will enable participants to search terabytes of imaging data collected by Catalina Sky Survey (CSS) for undiscovered asteroids in a fun, game-like process from their personal computers or devices. The public’s findings will be used by scientists to develop advanced automated asteroid-searching technology for telescopes on Earth and in space, including Planetary Resources’ ARKYD.

“With Asteroid Zoo, we hope to extend the effort to discover asteroids beyond astronomers and harness the wisdom of crowds to provide a real benefit to Earth,” said Chris Lewicki, President and Chief Engineer, Planetary Resources, Inc. “Furthermore, we’re excited to introduce this program as a way to thank the thousands of people who supported Planetary Resources through Kickstarter. This is the first of many initiatives we’ll introduce as a result of the campaign.”

The post doesn’t say who names an asteroid that qualifies for an Extinction Event. If it is a committee, it may go forever nameless.

Visualizing Algorithms

Thu, 06/26/2014 - 21:03


Topic Maps

Visualizing Algorithms by Mike Bostock.

From the post:

Algorithms are a fascinating use case for visualization. To visualize an algorithm, we don’t merely fit data to a chart; there is no primary dataset. Instead there are logical rules that describe behavior. This may be why algorithm visualizations are so unusual, as designers experiment with novel forms to better communicate. This is reason enough to study them.

But algorithms are also a reminder that visualization is more than a tool for finding patterns in data. Visualization leverages the human visual system to augment human intellect: we can use it to better understand these important abstract processes, and perhaps other things, too.

Better start with fresh pot of coffee when you read Mike’s post. Mike covers visualization of sampling algorithms using Van Gogh’s The Starry Night, sorting and maze generation (2-D). It is well written and illustrated but it is a lot of material to cover in one read.

The post finishes up with numerous references to other algorithm visualization efforts.

Put on your “must read” list for this weekend!

Who Needs Terrorists? We Have The NSA.

Thu, 06/26/2014 - 20:45


Topic Maps

Germany dumps Verizon for government work over NSA fears by David Meyer.

From the post:

The German government is ditching Verizon as its network infrastructure provider, and it’s citing Edward Snowden’s revelations about NSA surveillance as a reason.

David summarizes and gives pointers to all the statements you will need for “thank you” notes to the NSA or complaints to the current and past administrations.

United States citizens don’t need to worry about possible terrorist attacks. Our own government agencies are working to destroy any trust or confidence in U.S. technology companies. Care to compare that damage to the fictional damage from imagined terrorists?

Are there terrorists in the world? You bet. But the relevant question is: Other than blowing smoke for contracts and appropriations, what real danger exists for average U.S. citizen?

I read recently that “6 times more likely to die from hot weather than from a terrorist attack.” For similar numbers and sources, see: Fear of Terror Makes People Stupid.

Let’s not worry the country into the poor house over terrorism.

When anyone claims we are in danger from terrorism, press them for facts. What data? What intelligence? Press for specifics.

If they claim the details are “secret,” know that they don’t know and don’t want you to know they don’t know. (Remembering the attack that was going to happen at the Russian Olympics. Not a threat, not a warning, but was going to happen. Which didn’t happen, by the way.)

Storm 0.9.2 released

Thu, 06/26/2014 - 00:36


Topic Maps

Storm 0.9.2 released

From the post:

We are pleased to announce that Storm 0.9.2-incubating has been released and is available from the downloads page. This release includes many important fixes and improvements.

There are a number of fixes and improvements but the topology visualization tool by Kyle Nusbaum (@knusbaum) will be the one that catches your eye.

Upgrade before the next release catches you.