Another word for it
PHPTMAPI 3 by Johannes Schmidt.
From the webpage:
PHPTMAPI 3 is the succession project of http://phptmapi.sourceforge.net/
PHPTMAPI is a PHP5 API for creating and manipulating topic maps, based on the http://tmapi.sourceforge.net/ project. This API enables PHP developers to implement ISO/IEC 13250 Topic Maps in their applications in an easy and standardized way.
What is TMAPI?
TMAPI is a programming interface for accessing and manipulating data held in a topic map. The TMAPI specification defines a set of core interfaces which must be implemented by a compliant application as well as (eventually) a set of additional interfaces which may be implemented by a compliant application or which may be built upon the core interfaces.
Please spread the word to our PHP brethren.
Elementary Algorithms by Xinyu LIU.
From the github page:
AlgoXY is a free book about elementary algorithms and data structures. This book doesn’t only focus on an imperative (or procedural) approach, but also includes purely functional algorithms and data structures. It doesn’t require readers to master any programming languages, because all the algorithms are described using mathematical functions and pseudocode.
For reference and implementation purposes, source code in C, C++, Haskell, Python, Scheme/Lisp is available in addition to the book.
The contents of the book are provided under GNU FDL and the source code is under GNU GPLv3.
The PDF version can be downloaded from github: https://github.com/liuxinyu95/AlgoXY/blob/algoxy/preview/elementary-algorithms.pdf?raw=true
This book is also available online at: https://sites.google.com/site/algoxy/home
I was concerned when the HTML version of the trie chapter was only 2 pages long. You need to view the PDF version, where the trie chapter runs some forty (40) pages, to get an idea of the coverage of any particular algorithm.
I first saw this in a tweet by 0xAX.
Metadata: Organizing and Discovering Information by Jeffrey Pomerantz.
Coursera course described in part as follows:
If you use nearly any digital technology, you make use of metadata. Use an ATM today? You interacted with metadata about your account. Searched for songs in iTunes or Spotify? You used metadata about those songs. We use and even create metadata constantly, but we rarely realize it. Metadata — or data about data — describes real and digital objects, so that those objects may be organized now and found later.
Metadata is a tool that enables the information age functions performed by humans as well as those performed by computers. Metadata is important to many fields, particularly Computer Science; but this course is not purely a Computer Science course. This course approaches Metadata from the perspective of Information Science, which is a broad interdisciplinary field that studies how people create and manage information.
Unit 1: Organizing Information
Unit 2: Dublin Core
Unit 3: How to Build a Metadata Schema
Unit 4: Alphabet Soup: Metadata Schemas That You (Will) Know and Love
Unit 5: Metadata for the Web
Unit 6: Metadata for Networks
Unit 7: How to Create Metadata
Unit 8: How to Evaluate Metadata
An eight-week course, July 14 – September 8, 2014, at 4 to 6 hours per week.
I first saw this in a tweet by Aaron Kirschenfeld that reads:
Every one of your legal hackers out there: where’s the metadata? Please learn from @jpom #metadatamooc on @coursera. My brain is crackling.
My follow-up question being: Where are the subject identifications to help map between heterogeneous metadata systems?
- 1 August 2014: abstracts of ~750 words and a minimal bio sent to firstname.lastname@example.org.
- 31 August 2014: deadline for the early registration discount.
- 19 September 2014: deadline for group-rate reservations at the Orrington Hotel.
- 23-24 October 2014: colloquium.
From the call for papers:
The ninth annual meeting of the Chicago Colloquium on Digital Humanities and Computer Science (DHCS) will be hosted by Northwestern University on October 23-24, 2014.
The DHCS Colloquium has been a lively regional conference (with non-trivial bi-coastal and overseas sprinkling), rotating since 2006 among the University of Chicago (where it began), DePaul, IIT, Loyola, and Northwestern. At the first Colloquium Greg Crane asked his memorable question “What to do with a million books?” Here are some highlights that I remember across the years:
- An NLP programmer at Los Alamos talking about the ways security clearances prevented CIA analysts and technical folks from talking to each other.
- A demonstration that if you replaced all content words in Arabic texts and focused just on stop words you could determine with a high degree of certainty the geographical origin of a given piece of writing.
- A visualization of phrases like “the king’s daughter” in a sizable corpus, telling you much about who owned what.
- A social network analysis of Alexander the Great and his entourage.
- An amazingly successful extraction of verbal parallels from very noisy data.
- Did you know that Jane Austen was a game theorist before her time and that her characters were either skillful or clueless practitioners of this art?
And so forth. Given my own interests, I tend to remember “Text as Data” stuff, but there was much else about archaeology, art, music, history, and social or political life. You can browse through some of the older programs at http://lucian.uchicago.edu/blogs/dhcs/.
One of the weather sites promises that October averages a low of 42 F and a high of 62 F. Sounds like a nice time to visit Northwestern University!
To say nothing of an exciting conference!
I first saw this in a tweet by David Bamman.
Anita gives all the high-minded and very legitimate reasons for creating highly effective data, with examples.
Read her slides to pick up the rhetoric you need and leads on how to create highly effective data.
Let me add one concern to drive your interest in creating highly effective data:
Funders want researchers to create highly effective data.
Answers to creating highly effective data continue to evolve, but not attempting to create it at all is a losing proposition.
Word Meanings Evolve to Selectively Preserve Distinctions on Salient Dimensions by Catriona Silvey, Simon Kirby, and Kenny Smith.
Words refer to objects in the world, but this correspondence is not one-to-one: Each word has a range of referents that share features on some dimensions but differ on others. This property of language is called underspecification. Parts of the lexicon have characteristic patterns of underspecification; for example, artifact nouns tend to specify shape, but not color, whereas substance nouns specify material but not shape. These regularities in the lexicon enable learners to generalize new words appropriately. How does the lexicon come to have these helpful regularities? We test the hypothesis that systematic backgrounding of some dimensions during learning and use causes language to gradually change, over repeated episodes of transmission, to produce a lexicon with strong patterns of underspecification across these less salient dimensions. This offers a cultural evolutionary mechanism linking individual word learning and generalization to the origin of regularities in the lexicon that help learners generalize words appropriately.
I can’t seem to access the article today but the premise is intriguing.
Perhaps people can have different “…less salient dimensions…” and therefore are generalizing words “inappropriately” from the standpoint of another person.
Curious if a test can be devised to identify those “…less salient dimensions…” in some target population? It might lead to faster identification of terms likely to be misunderstood.
Clojure Destructuring Tutorial and Cheat Sheet by John Louis Del Rosario.
From the post:
When I try to write or read some Clojure code, every now and then I get stuck on some destructuring forms. It’s like a broken record. One moment I’m in the zone, then this thing hits me and I have to stop what I’m doing to try and grok what I’m looking at.
So I decided I’d write a little tutorial/cheatsheet for Clojure destructuring, both as an attempt to really grok it (I absorb stuff more quickly if I write it down), and as future reference for myself and others.
Below is the whole thing, copied from the original gist. I’m planning on adding more (elaborate) examples and a section for compojure’s own destructuring forms. If you want to bookmark the cheat sheet, I recommend the gist since it has proper syntax highlighting and will be updated first.
John’s right, the gist version is easier to read.
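If it helps to see the idea in another language: Clojure's sequential and associative destructuring are roughly analogous to Python's iterable unpacking and explicit key lookups. The comparison below is mine, not John's, and it is only an approximation:

```python
# Clojure: (let [[a b & more] [1 2 3 4 5]] ...)
a, b, *more = [1, 2, 3, 4, 5]

# Clojure: (let [{:keys [x y]} {:x 1 :y 2}] ...)
# Python has no direct analogue of :keys, so pull the keys out explicitly.
point = {"x": 1, "y": 2}
x, y = point["x"], point["y"]
```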
As of 27 July 2014, the sections on “More Examples” and “Compojure” are blank if you feel like contributing.
I first saw this in a tweet by Daniel Higginbotham.
The Simplicity of Clojure by Bridget Hillyer and Clinton N. Dreisbach. OSCON 2014.
A great overview of Clojure that covers:
- Clojure Overview
- Flow Control
- Clojure Libraries
Granted, they are slides, so you need to fill in content from other sources, such as Clojure for the Brave and True, but they do provide an outline for learning more.
I first saw this in a tweet by Christophe Lalanne.
Stanford Large Network Dataset Collection by Jure Leskovec.
From the webpage:
- Social networks : online social networks, edges represent interactions between people
- Networks with ground-truth communities : ground-truth network communities in social and information networks
- Communication networks : email communication networks with edges representing communication
- Citation networks : nodes represent papers, edges represent citations
- Collaboration networks : nodes represent scientists, edges represent collaborations (co-authoring a paper)
- Web graphs : nodes represent webpages and edges are hyperlinks
- Amazon networks : nodes represent products and edges link commonly co-purchased products
- Internet networks : nodes represent computers and edges communication
- Road networks : nodes represent intersections and edges roads connecting the intersections
- Autonomous systems : graphs of the internet
- Signed networks : networks with positive and negative edges (friend/foe, trust/distrust)
- Location-based online social networks : Social networks with geographic check-ins
- Wikipedia networks and metadata : Talk, editing and voting data from Wikipedia
- Twitter and Memetracker : Memetracker phrases, links and 467 million Tweets
- Online communities : Data from online communities such as Reddit and Flickr
- Online reviews : Data from online review systems such as BeerAdvocate and Amazon
- Information cascades : …
If you need software to go with these datasets, consider the Stanford Network Analysis Platform (SNAP):
Stanford Network Analysis Platform (SNAP) is a general purpose, high performance system for analysis and manipulation of large networks. Graphs consist of nodes and directed/undirected/multiple edges between the graph nodes. Networks are graphs with data on nodes and/or edges of the network.
The core SNAP library is written in C++ and optimized for maximum performance and compact graph representation. It easily scales to massive networks with hundreds of millions of nodes, and billions of edges. It efficiently manipulates large graphs, calculates structural properties, generates regular and random graphs, and supports attributes on nodes and edges. Besides scalability to large graphs, an additional strength of SNAP is that nodes, edges and attributes in a graph or a network can be changed dynamically during the computation.
A Python interface is available for SNAP.
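Most of the datasets above are distributed as plain edge lists ('#'-prefixed comment lines followed by whitespace-separated node-id pairs), so you can poke at them even without SNAP installed. A minimal sketch in Python, using a made-up three-edge sample in that format:

```python
from collections import defaultdict

def read_edge_list(lines):
    """Parse a SNAP-style edge list: skip '#' comments and blank
    lines, then collect whitespace-separated (from, to) node pairs."""
    adj = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        u, v = line.split()[:2]
        adj[u].add(v)
    return adj

# Hypothetical sample in the format SNAP files use.
sample = [
    "# Directed graph: example.txt",
    "# FromNodeId\tToNodeId",
    "0\t1",
    "0\t2",
    "1\t2",
]
adj = read_edge_list(sample)
```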
I first saw this at: Stanford Releases Large Network Datasets by Ryan Swanstrom.
Scikit-learn 0.15 release by Gaël Varoquaux.
From the post:
Quality— Looking at the commit log, there has been a huge amount of work to fix minor annoying issues.
Speed— There has been a huge effort put in making many parts of scikit-learn faster. Little details all over the codebase. We do hope that you’ll find that your applications run faster. For instance, we find that the worst case speed of Ward clustering is 1.5 times faster in 0.15 than 0.14. K-means clustering is often 1.1 times faster. KNN, when used in brute-force mode, got faster by a factor of 2 or 3.
Random Forest and various tree methods— The random forest and various tree methods are much much faster, use parallel computing much better, and use less memory. For instance, the picture on the right shows the scikit-learn random forest running in parallel on a fat Amazon node, and nicely using all the CPUs with little RAM usage.
Hierarchical agglomerative clustering— Complete linkage and average linkage clustering have been added. The benefit of these approaches compared to the existing Ward clustering is that they can take an arbitrary distance matrix.
Robust linear models— Scikit-learn now includes RANSAC for robust linear regression.
HMMs are deprecated— We have been discussing for a long time removing HMMs, which do not fit the focus of scikit-learn on predictive modeling. We have created a separate hmmlearn repository for the HMM code. It is looking for maintainers.
And much more— plenty of “minor things”, such as better support for sparse data, better support for multi-label data…
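The RANSAC addition is worth a closer look. Its core idea, fit candidate models to small random samples and keep the one with the most inliers, can be sketched in a few lines of plain Python. This illustrates the algorithm only; it is not scikit-learn's RANSACRegressor API:

```python
import random

def ransac_line(points, n_iters=100, threshold=0.5, seed=0):
    """Fit y = a*x + b by RANSAC: repeatedly fit a line through two
    random points and keep the model with the most inliers."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(n_iters):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:  # degenerate sample (vertical line), skip
            continue
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        inliers = [(x, y) for x, y in points
                   if abs(y - (a * x + b)) < threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (a, b), inliers
    return best_model, best_inliers

# Ten points on y = 2x + 1 plus two gross outliers.
pts = [(x, 2 * x + 1) for x in range(10)] + [(3, 40), (7, -30)]
(a, b), inliers = ransac_line(pts)
```

An ordinary least-squares fit would be dragged toward the two outliers; the RANSAC loop simply never rewards a model that only those outliers agree with.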
Get thee to Scikit-learn!
April 2014 Crawl Data Available by Stephen Merity.
From the post:
The April crawl of 2014 is now available! The new dataset is over 183TB in size containing approximately 2.6 billion webpages. The new data is located in the aws-publicdatasets bucket at /common-crawl/crawl-data/CC-MAIN-2014-15/.
To assist with exploring and using the dataset, we’ve provided gzipped files that list:
- all segments (CC-MAIN-2014-15/segment.paths.gz)
- all WARC files (CC-MAIN-2014-15/warc.paths.gz)
- all WAT files (CC-MAIN-2014-15/wat.paths.gz)
- all WET files (CC-MAIN-2014-15/wet.paths.gz)
By simply adding either s3://aws-publicdatasets/ or https://aws-publicdatasets.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.
Thanks again to Blekko for their ongoing donation of URLs for our crawl!
Well, at 183TB, I don’t guess I am going to have a local copy.
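The prefixing step Stephen describes is a one-liner once you read the gzipped path lists; a small sketch (the path below is a made-up example, real ones come from files like warc.paths.gz):

```python
S3_PREFIX = "s3://aws-publicdatasets/"
HTTP_PREFIX = "https://aws-publicdatasets.s3.amazonaws.com/"

def to_urls(path_line):
    """Prefix one line from a *.paths.gz listing with the S3 and
    HTTP roots, returning both full URLs."""
    p = path_line.strip()
    return S3_PREFIX + p, HTTP_PREFIX + p

# Hypothetical path line for illustration.
s3_url, http_url = to_urls("common-crawl/crawl-data/CC-MAIN-2014-15/example.warc.gz\n")
```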
OpenFDA Provides Ready Access to Recall Data by Taha A. Kass-Hout.
From the post:
Every year, hundreds of foods, drugs, and medical devices are recalled from the market by manufacturers. These products may be labeled incorrectly or might pose health or safety issues. Most recalls are voluntary; in some cases they may be ordered by the U.S. Food and Drug Administration. Recalls are reported to the FDA, and compiled into its Recall Enterprise System, or RES. Every week, the FDA releases an enforcement report that catalogues these recalls. And now, for the first time, there is an Application Programming Interface (API) that offers developers and researchers direct access to all of the drug, device, and food enforcement reports, dating back to 2004.
The recalls in this dataset provide an illuminating window into both the safety of individual products and the safety of the marketplace at large. Recent reports have included such recalls as certain food products (for not containing the vitamins listed on the label), a soba noodle salad (for containing unlisted soy ingredients), and a pain reliever (for not following laboratory testing requirements).
You will get warnings that this data is “not for clinical use.”
Sounds like a treasure trove of data if you are looking for products still being sold despite being recalled.
Or if you want to advertise for “victims” of faulty products that have been recalled.
I think both of those are non-clinical uses.
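If you want to poke at the enforcement reports yourself, a request is just a URL against the openFDA API. A hedged sketch (the endpoint follows openFDA's published conventions; treat the exact search field names as assumptions to verify against the docs):

```python
from urllib.parse import urlencode

BASE = "https://api.fda.gov/drug/enforcement.json"

def enforcement_query(search, limit=10):
    """Build a query URL for drug enforcement (recall) reports."""
    return BASE + "?" + urlencode({"search": search, "limit": limit})

# e.g. Class I recalls (the most serious classification).
url = enforcement_query('classification:"Class I"', limit=5)
```

Fetching that URL (with urllib or requests) returns JSON; the same pattern applies to the device and food enforcement endpoints.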
Darwin’s ship library goes online by Dennis Normile.
From the post:
As Charles Darwin cruised the world on the HMS Beagle, he had access to an unusually well-stocked 400-volume library. That collection, which contained the observations of numerous other naturalists and explorers, has now been recreated online. As of today, all of the more than 195,000 pages and 5000 illustrations from the works are available for the perusal of scholars and armchair naturalists alike, thanks to the Darwin Online project.
Perhaps it isn’t the amount of information you have available but how deeply you understand it that makes a difference.
Which gene did you mean? by Barend Mons.
Computational Biology needs computer-readable information records. Increasingly, meta-analysed and pre-digested information is being used in the follow up of high throughput experiments and other investigations that yield massive data sets. Semantic enrichment of plain text is crucial for computer aided analysis. In general people will think about semantic tagging as just another form of text mining, and that term has quite a negative connotation in the minds of some biologists who have been disappointed by classical approaches of text mining. Efforts so far have tried to develop tools and technologies that retrospectively extract the correct information from text, which is usually full of ambiguities. Although remarkable results have been obtained in experimental circumstances, the wide spread use of information mining tools is lagging behind earlier expectations. This commentary proposes to make semantic tagging an integral process to electronic publishing.
From within the post:
If all words had only one possible meaning, computers would be perfectly able to analyse texts. In reality however, words, terms and phrases in text are highly ambiguous. Knowledgeable people have few problems with these ambiguities when they read, because they use context to disambiguate ‘on the fly’. Even when fed a lot of semantically sound background information, however, computers currently lag far behind humans in their ability to interpret natural language. Therefore, proper semantic tagging of concepts in texts is crucial to make Computational Biology truly viable. Electronic Publishing has so far only scratched the surface of what is needed.
Open Access publication shows great potential, and is essential for effective information mining, but it will not achieve its full potential if information continues to be buried in plain text. Having true semantic mark up combined with open access for mining is an urgent need to make possible a computational approach to life sciences.
Creating semantically enriched content as part and parcel of the publication process should be a winning strategy.
First, for current data, estimates of what others will be searching for should not be hard to come by. That will help focus tagging on the material users are seeking. Second, a current and growing base of enriched material will help answer questions about the return on enriching material.
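As a concrete, if toy, illustration of what tagging at publication time buys you: once a term carries an identifier, downstream tools no longer have to disambiguate it. A minimal sketch (the lexicon, identifier scheme, and markup element are all illustrative assumptions, not a proposal from the paper):

```python
import re

# Hypothetical symbol-to-identifier lexicon; identifiers illustrative.
LEXICON = {"p53": "NCBIGene:7157"}

def tag_genes(text, lexicon=LEXICON):
    """Wrap known gene symbols with an inline identifier so that
    downstream tools need not re-disambiguate the plain text."""
    pattern = re.compile("|".join(re.escape(t) for t in lexicon))
    return pattern.sub(
        lambda m: f'<concept id="{lexicon[m.group(0)]}">{m.group(0)}</concept>',
        text,
    )

tagged = tag_genes("Mutations in p53 are common in tumours.")
```

Real systems would of course resolve context (is "p53" the gene, the protein, or a cell line?) before emitting an identifier; the point is that the resolution happens once, at publication.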
Other suggestions for BMC Bioinformatics?
Introducing Source Han Sans: An open source Pan-CJK typeface by Caleb Belohlavek.
From the post:
Adobe, in partnership with Google, is pleased to announce the release of Source Han Sans, a new open source Pan-CJK typeface family that is now available on Typekit for desktop use. If you don’t have a Typekit account, it’s easy to set one up and start using the font immediately with our free subscription. And for those who want to play with the original source files, you can get those from our download page on SourceForge.
It’s rather difficult to describe your semantics when you can’t write in your own language.
Kudos to Adobe and Google for sponsoring this project!
I first saw this in a tweet by James Clark.
Scientific competency questions as the basis for semantically enriched open pharmacological space development by Kamal Azzaoui, et al. (Drug Discovery Today, Volume 18, Issues 17–18, September 2013, Pages 843–852)
Molecular information systems play an important part in modern data-driven drug discovery. They do not only support decision making but also enable new discoveries via association and inference. In this review, we outline the scientific requirements identified by the Innovative Medicines Initiative (IMI) Open PHACTS consortium for the design of an open pharmacological space (OPS) information system. The focus of this work is the integration of compound–target–pathway–disease/phenotype data for public and industrial drug discovery research. Typical scientific competency questions provided by the consortium members will be analyzed based on the underlying data concepts and associations needed to answer the questions. Publicly available data sources used to target these questions as well as the need for and potential of semantic web-based technology will be presented.
Pharmacology may not be your space but this is a good example of what it takes for semantic integration of resources in a complex area.
Despite the “…you too can be a brain surgeon with our new web-based app…” from various sources, semantic integration has been, is and will remain difficult under the best of circumstances.
I don’t say that to discourage anyone but to avoid the let-down when integration projects don’t provide easy returns.
It is far better to plan for incremental and measurable benefits along the way than to fashion grandiose goals that are ever receding on the horizon.
I first saw this in a tweet by ChemConnector.
From the post:
Companies House is to make all of its digital data available free of charge. This will make the UK the first country to establish a truly open register of business information.
As a result, it will be easier for businesses and members of the public to research and scrutinise the activities and ownership of companies and connected individuals. Last year (2013/14), customers searching the Companies House website spent £8.7 million accessing company information on the register.
This is a considerable step forward in improving corporate transparency; a key strand of the G8 declaration at the Lough Erne summit in 2013.
It will also open up opportunities for entrepreneurs to come up with innovative ways of using the information.
This change will come into effect from the second quarter of 2015 (April – June).
In a side bar, Business Secretary Vince Cable said in part:
Companies House is making the UK a more transparent, efficient and effective place to do business.
I’m not sure about “efficient,” but providing incentives for lawyers and others to track down insider trading and other business as usual practices and arming them with open data would be a start in the right direction.
I first saw this in a tweet by Hadley Beeman.
XRay: Enhancing the Web’s Transparency with Differential Correlation by Mathias Lécuyer, et al.
Today’s Web services – such as Google, Amazon, and Facebook – leverage user data for varied purposes, including personalizing recommendations, targeting advertisements, and adjusting prices. At present, users have little insight into how their data is being used. Hence, they cannot make informed choices about the services they choose. To increase transparency, we developed XRay, the first fine-grained, robust, and scalable personal data tracking system for the Web. XRay predicts which data in an arbitrary Web account (such as emails, searches, or viewed products) is being used to target which outputs (such as ads, recommended products, or prices). XRay’s core functions are service agnostic and easy to instantiate for new services, and they can track data within and across services. To make predictions independent of the audited service, XRay relies on the following insight: by comparing outputs from different accounts with similar, but not identical, subsets of data, one can pinpoint targeting through correlation. We show both theoretically, and through experiments on Gmail, Amazon, and YouTube, that XRay achieves high precision and recall by correlating data from a surprisingly small number of extra accounts.
Not immediately obvious, until someone explains it, but any system that reacts to input you control can be investigated, whether that means dark marketing forces or government security agencies.
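The correlation insight is easy to demonstrate on toy data: seed several accounts with overlapping subsets of inputs, then for each output intersect the inputs of the accounts where it appeared. This is a cartoon of the idea, not XRay's actual algorithm:

```python
def correlate(accounts):
    """For each output, intersect the input sets of accounts that saw
    it, then subtract inputs from accounts that did not. Whatever
    survives is the guessed targeting input."""
    all_outputs = set()
    for _, outs in accounts.values():
        all_outputs |= outs
    guesses = {}
    for out in all_outputs:
        seen = [set(inp) for inp, outs in accounts.values() if out in outs]
        unseen = [set(inp) for inp, outs in accounts.values() if out not in outs]
        guess = set.intersection(*seen)
        for inp in unseen:
            guess -= inp
        guesses[out] = guess
    return guesses

# Three hypothetical accounts: inputs (e.g. email topics) and outputs (ads shown).
accounts = {
    "A": ({"flights", "yoga"}, {"ad_travel"}),
    "B": ({"flights", "chess"}, {"ad_travel"}),
    "C": ({"yoga", "chess"}, set()),
}
targets = correlate(accounts)
```

With only three accounts the "flights" input is already pinned down as the trigger for the travel ad, which matches the paper's claim that a surprisingly small number of extra accounts suffices.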
Be aware that provoking government security agencies is best left to professionals.
The next step will be to have bots that project false electronic trails for us to throw advertisers (or others) off track.
Very much worth your time to read.
From the webpage:
What is ISGCI?
ISGCI is an encyclopaedia of graph classes with an accompanying Java application that helps you research what’s known about particular graph classes. You can:
- check the relation between graph classes and get a witness for the result
- draw clear inclusion diagrams
- colour these diagrams according to the complexity of selected problems
- find the P/NP boundary for a problem
- save your diagrams as Postscript, GraphML or SVG files
- find references on classes, inclusions and algorithms
As of 2014-07-06, the database contains 1497 classes and 176,888 inclusions.
If you are past the giddy stage of “Everything’s a graph!,” you may find this site useful.