Another word for it


Enter, Update, Exit… [D3.js]

Wed, 04/16/2014 - 00:49

Categories:

Topic Maps

Enter, Update, Exit – An Introduction to D3.js, The Web’s Most Popular Visualization Toolkit by Christian Behrens.

From the webpage:

Over the past couple of years, D3, the groundbreaking JavaScript library for data-driven document manipulation developed by Mike Bostock, has become the Swiss Army knife of web-based data visualization. However, talking to other designers or developers who use D3 in their projects, I noticed that one of the core concepts of it remains somewhat obscure and is often referred to as »D3’s magic«: Data joins and selections.

Given a solid command of basic JavaScript, this article should help you to wrap your head around these two fundamental concepts and get you started using D3 for your dataviz projects.
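
The join itself is easy to illustrate outside of D3. Purely as a conceptual sketch in Python (this is not D3’s API, which exposes the three buckets through selection.data(), .enter() and .exit(); the function and keys here are made up for illustration), a keyed data join just partitions incoming data against existing elements:

def data_join(existing, data, key=lambda d: d["id"]):
    """Partition a keyed data join, D3-style:
    enter  = data with no matching element yet,
    update = data that matches an existing element,
    exit   = elements whose data has gone away."""
    incoming = {key(d): d for d in data}
    enter = {k: d for k, d in incoming.items() if k not in existing}
    update = {k: d for k, d in incoming.items() if k in existing}
    exit_ = {k: v for k, v in existing.items() if k not in incoming}
    return enter, update, exit_

# elements currently "on screen", keyed the same way the data is keyed
elements = {"a": "<circle a>", "b": "<circle b>"}
data = [{"id": "b", "value": 2}, {"id": "c", "value": 3}]
enter, update, exit_ = data_join(elements, data)
print(sorted(enter), sorted(update), sorted(exit_))  # ['c'] ['b'] ['a']

D3’s selections then let you attach different behavior (append, transition, remove) to each of those buckets.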

If you encounter anyone not already using D3.js, pass this page along to them.

I first saw this in a tweet by Halftone.

GraphChi-DB [src released]

Wed, 04/16/2014 - 00:33

Categories:

Topic Maps

GraphChi-DB

From the webpage:

GraphChi-DB is a scalable, embedded, single-computer online graph database that can also execute large-scale graph computations similar to GraphChi’s. It has been developed by Aapo Kyrola as part of his Ph.D. thesis.

GraphChi-DB is written in Scala, with some Java code. Generally, you need to know Scala quite well to be able to use it.

IMPORTANT: GraphChi-DB is early release, research code. It is buggy, it has an awful API, and it is provided with no guarantees. DO NOT USE IT FOR ANYTHING IMPORTANT.

GraphChi-DB source code arrives!

Enjoy!

Wandora – New Version [TMQL]

Wed, 04/16/2014 - 00:23

Categories:

Topic Maps

Wandora – New Version

From the webpage:

It is over six months since the last Wandora release. Now we are finally ready to publish a new version with some very interesting new features. Release 2014-04-15 features TMQL support and an embedded HTML browser, for example. TMQL is the topic map query language, and Wandora allows the user to search, query and modify topics and associations with TMQL scripts. The embedded HTML browser expands Wandora’s internal visualization repertoire. Wandora’s embedded HTTP server services are now available inside the Wandora application….

Change Log, Download.

Two of the biggest changes are TMQL support and the embedded HTML browser.

Download your copy today!

I will post a review by mid-May, 2014.

Interested to hear your comments, questions and suggestions in the mean time.

BTW, the first suggestion I have is that the download file should NOT be wandora.zip but rather wandora-(date).zip if nothing else. Ditto for the source files and javadocs.

The Bw-Tree: A B-tree for New Hardware Platforms

Tue, 04/15/2014 - 21:22

Categories:

Topic Maps

The Bw-Tree: A B-tree for New Hardware Platforms by Justin J. Levandoski, David B. Lomet, and Sudipta Sengupta.

Abstract:

The emergence of new hardware and platforms has led to reconsideration of how data management systems are designed. However, certain basic functions such as key indexed access to records remain essential. While we exploit the common architectural layering of prior systems, we make radically new design decisions about each layer. Our new form of B-tree, called the Bw-tree, achieves its very high performance via a latch-free approach that effectively exploits the processor caches of modern multi-core chips. Our storage manager uses a unique form of log structuring that blurs the distinction between a page and a record store and works well with flash storage. This paper describes the architecture and algorithms for the Bw-tree, focusing on the main memory aspects. The paper includes results of our experiments that demonstrate that this fresh approach produces outstanding performance.
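
To make one of those ideas concrete, here is a toy, single-threaded Python model of the delta-update scheme the paper builds on: pages are reached indirectly through a mapping table, and updates prepend immutable delta records instead of modifying a page in place. This is only a sketch of the concept; the real Bw-tree installs each new chain head with an atomic compare-and-swap on the mapping table (that is what makes it latch-free) and also handles node splits, consolidation policy, and flushing to its log-structured store.

class ToyBwPageStore:
    """Single-threaded illustration: mapping table + delta chains, no in-place updates."""

    def __init__(self):
        self.mapping_table = {}            # logical page id -> head of delta chain
        self._next_id = 0

    def new_page(self, records=None):
        pid, self._next_id = self._next_id, self._next_id + 1
        self.mapping_table[pid] = ("base", dict(records or {}), None)
        return pid

    def insert(self, pid, key, value):
        head = self.mapping_table[pid]
        # Prepend a delta; in the real Bw-tree this swap is an atomic CAS.
        self.mapping_table[pid] = ("delta", {key: value}, head)

    def lookup(self, pid, key):
        node = self.mapping_table[pid]
        while node is not None:            # walk deltas newest-first, then the base page
            _, payload, older = node
            if key in payload:
                return payload[key]
            node = older
        raise KeyError(key)

    def consolidate(self, pid):
        """Fold a long delta chain back into a fresh base page (done lazily in the paper)."""
        payloads, node = [], self.mapping_table[pid]
        while node is not None:
            payloads.append(node[1])
            node = node[2]
        merged = {}
        for payload in reversed(payloads): # oldest first, so newer deltas win
            merged.update(payload)
        self.mapping_table[pid] = ("base", merged, None)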

With the easy availability of multi-core chips, what new algorithms are you going to discover while touring SICP or TAOCP?

Is that going to be an additional incentive to tour one or both of them?

Why and How to Start Your SICP Trek

Tue, 04/15/2014 - 13:35

Categories:

Topic Maps

Why and How to Start Your SICP Trek by Kai Wu.

From the post:

This post was first envisioned for those at Hacker Retreat – or thinking of attending – before it became more general. It’s meant to be a standing answer to the question, “How can I best improve as a coder?”

Because I hear that question from people committed to coding – i.e. professionally for the long haul – the short answer I always give is, “Do SICP!” *

Since that never seems to be convincing enough, here’s the long answer. I’ll give a short overview of SICP’s benefits, then use arguments from (justified) authority and argument by analogy to convince you that working through SICP is worth the time and effort. Then I’ll share some practical tips to help you on your SICP trek.

* Where SICP = The Structure and Interpretation of Computer Programs by Hal Abelson and Gerald Sussman of MIT, aka the Wizard book.

BTW, excuse my enthusiasm for SICP if it comes across at times as monolingual theistic fanaticism. I’m aware that there are many interesting developments in CS and software engineering outside of the Wizard book – and no single book can cover everything. Nevertheless, SICP has been enormously influential as an enduring text on the nature and fundamentals of computing – and tends to pay very solid dividends on your investments of attention.

A great post with lots of suggestions on how to work your way through SICP.

What it can’t supply is the discipline to actually make your way through SICP.

I was at a Unicode Conference some years ago and met Don Knuth. I said something in the course of the conversation about reading some part of TAOCP and Don said rather wistfully that he wished he had met someone who had read it all.

It seems sad that so many of us have dipped into it here or there but not really taken the time to explore it completely. Rather like reading Romeo and Juliet for the sexy parts and ignoring the rest.

Do you have a reading plan for TAOCP after you finish SICP?

I first saw this in a tweet by Computer Science.

SIGBOVIK 2014

Mon, 04/14/2014 - 15:14

Categories:

Topic Maps

SIGBOVIK 2014 (pdf)

From the cover page:

The Association for Computational Heresy

presents

A Record of the Proceedings of

SIGBOVIK 2014

The eighth annual intercalary robot dance in celebration of workshop on symposium about Harry Q. Bovik’s 26th birthday.

Just in case news on computer security is as grim this week as last, something to brighten your spirits.

Enjoy!

I first saw this in a tweet by John Regehr.

tagtog: interactive and text-mining-assisted annotation…

Mon, 04/14/2014 - 13:55

Categories:

Topic Maps

tagtog: interactive and text-mining-assisted annotation of gene mentions in PLOS full-text articles by Juan Miguel Cejuela, et al.

Abstract:

The breadth and depth of biomedical literature are increasing year upon year. To keep abreast of these increases, FlyBase, a database for Drosophila genomic and genetic information, is constantly exploring new ways to mine the published literature to increase the efficiency and accuracy of manual curation and to automate some aspects, such as triaging and entity extraction. Toward this end, we present the ‘tagtog’ system, a web-based annotation framework that can be used to mark up biological entities (such as genes) and concepts (such as Gene Ontology terms) in full-text articles. tagtog leverages manual user annotation in combination with automatic machine-learned annotation to provide accurate identification of gene symbols and gene names. As part of the BioCreative IV Interactive Annotation Task, FlyBase has used tagtog to identify and extract mentions of Drosophila melanogaster gene symbols and names in full-text biomedical articles from the PLOS stable of journals. We show here the results of three experiments with different sized corpora and assess gene recognition performance and curation speed. We conclude that tagtog-named entity recognition improves with a larger corpus and that tagtog-assisted curation is quicker than manual curation.

Database URL: www.tagtog.net, www.flybase.org.

Encouraging because the “tagging” is not wholly automated nor is it wholly hand-authored. Rather the goal is to create an interface that draws on the strengths of automated processing as moderated by human expertise.

Annotation remains at the document level, which consigns subsequent users to mining full text, but this is definitely a step in the right direction.

3 Common Time Wasters at Work

Sun, 04/13/2014 - 21:32

Categories:

Topic Maps

3 Common Time Wasters at Work by Randy Krum.

See Randy’s post for the graphic but #2 was:

Non-work related Internet Surfing

It occurred to me that “Non-work related Internet Surfing” is indistinguishable from… search. At least when viewed from arm’s length or farther.

And so many people search poorly that a lack of useful results is easy to explain.

Yes?

So, what is the strategy to get the rank and file to use more efficient information systems than search?

Their non-use or ineffective use of your system can torpedo a sale just as quickly as any other cause.

Suggestions?

Testing Lucene’s index durability after crash or power loss

Sun, 04/13/2014 - 01:08

Categories:

Topic Maps

Testing Lucene’s index durability after crash or power loss by Mike McCandless.

From the post:

One of Lucene’s useful transactional features is index durability which ensures that, once you successfully call IndexWriter.commit, even if the OS or JVM crashes or power is lost, or you kill -KILL your JVM process, after rebooting, the index will be intact (not corrupt) and will reflect the last successful commit before the crash.
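
If you want a feel for what testing that claim involves, here is a generic, hypothetical crash-test harness in Python (not Lucene’s actual test, and not its API): a child process “commits” records by fsyncing after every write, the parent kills it with SIGKILL at a random moment, and the surviving file is checked for corruption. Real power-loss testing takes more machinery than this, but the shape is the same.

import os, random, subprocess, sys, time

CHILD = r"""
import os
f = open("journal.log", "ab", buffering=0)
i = 0
while True:
    f.write(f"{i}\n".encode())
    os.fsync(f.fileno())   # the "commit": the record must survive any crash after this point
    i += 1
"""

def crash_once():
    proc = subprocess.Popen([sys.executable, "-c", CHILD])
    time.sleep(random.uniform(0.2, 1.0))
    proc.kill()                               # SIGKILL: no chance to clean up or flush
    proc.wait()
    if not os.path.exists("journal.log"):     # killed before the first write
        return 0
    with open("journal.log", "rb") as f:
        lines = f.read().split(b"\n")
    body, _torn_tail = lines[:-1], lines[-1]  # the tail is empty or a torn final record
    for expected, line in enumerate(body):    # everything before the tail must be intact and in order
        assert int(line) == expected, f"corrupt record {expected}: {line!r}"
    os.remove("journal.log")
    return len(body)

if __name__ == "__main__":
    for run in range(5):
        print(f"run {run}: {crash_once()} records survived intact")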

If anyone at your startup is writing an indexing engine, be sure to pass this post from Mike along.

Ask them to demonstrate equal index durability before using their work instead of Lucene.

You have enough work to do without replicating (poorly) work that already has enterprise-level reliability.

Read Access on Google Production Servers

Sun, 04/13/2014 - 00:26

Categories:

Topic Maps

How we got read access on Google’s production servers

From the post:

To stay on top of the latest security alerts we often spend time on bug bounties and CTFs. When we were discussing the challenge for the weekend, Mathias got an interesting idea: What target can we use against itself?

Of course. The Google search engine!

What could be better than scanning Google for bugs by using the search engine itself? What kind of software tends to contain the most vulnerabilities?

  • Old and deprecated software
  • Unknown and hardly accessible software
  • Proprietary software that only a few people have access to
  • Alpha/Beta releases and otherwise new technologies (software in early stages of its lifetime)

I read recently that computer security defense is 10 years behind computer security offense.

Do you think that’s in part due to the difference in sharing of information between the two communities?

Computer offense aggressively sharing and computer defense aggressively hoarding.

Yes?

If you are interested in a less folkloric way of gathering computer security information (such as all the software versions that are known to have the Heartbleed OpenSSL vulnerability), think about using topic maps.

Reasoning that the pattern that led to the Heartbleed memory leak was not unique.

As you build a list of Heartbleed-susceptible software, you have a suspect list for similar issues. Find another leak and you can associate it with all those packages, subject to verification.
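
Even before reaching for a full topic map, the bookkeeping described above is simple to sketch. A minimal Python illustration (the vulnerability and package names are made up):

import collections

# vulnerability -> packages confirmed or suspected to share the pattern (illustrative data)
affected = {
    "heartbleed": {"openssl-1.0.1", "libfoo-2.3"},      # hypothetical package names
}

suspects_for = collections.defaultdict(set)             # package -> related vulnerabilities
for vuln, packages in affected.items():
    for pkg in packages:
        suspects_for[pkg].add(vuln)

def report_new_leak(vuln, found_in):
    """A new leak found in one package makes every package that shared an
    earlier pattern with it a suspect, subject to verification."""
    co_suspects = set()
    for prior in suspects_for[found_in]:
        co_suspects |= affected[prior]
    affected.setdefault(vuln, set()).add(found_in)
    suspects_for[found_in].add(vuln)
    return co_suspects - {found_in}

print(report_new_leak("new-leak-2014", "openssl-1.0.1"))   # {'libfoo-2.3'}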

BTW, a good starting point for your research is the detectify blog.

Facebook Gets Smarter with Graph Engine Optimization

Sun, 04/13/2014 - 00:07

Categories:

Topic Maps

Facebook Gets Smarter with Graph Engine Optimization by Alex Woodie.

From the post:

Last fall, the folks in Facebook’s engineering team talked about how they employed the Apache Giraph engine to build a graph on its Hadoop platform that can host more than a trillion edges. While the Graph Search engine is capable of massive graphing tasks, there were some workloads that remained outside the company’s technical capabilities–until now.

Facebook turned to the Giraph engine to power its new Graph Search offering, which it unveiled in January 2013 as a way to let users perform searches on other users to determine, for example, what kind of music their Facebook friends like, what kinds of food they’re into, or what activities they’ve done recently. An API for Graph Search also provides advertisers with a new revenue source for Facebook. It’s likely the world’s largest graph implementation, and a showcase of what graph engines can do.

The company picked Giraph because it worked on their existing Hadoop implementation, including HDFS and its MapReduce infrastructure stack (known as Corona). Compared to running the computation workload on Hive, an internal Facebook test of a 400-billion edge graph ran 126x faster on Giraph, and had a 26x performance advantage, as we explained in a Datanami story last year.

When Facebook scaled its internal test graph up to 1 trillion edges, they were able to keep the processing of each iteration of the graph under four minutes on a 200-server cluster. That amazing feat was done without any optimization, the company claimed. “We didn’t cheat,” Facebook developer Avery Ching declared in a video. “This is a random hashing algorithm, so we’re randomly assigning the vertices to different machines in the system. Obviously, if we do some separation and locality optimization, we can get this number down quite a bit.”

A high-level view, with technical references, of how Facebook is optimizing its Apache Giraph engine.

If you are interested in graphs, this is much more of a real world scenario than building “big” graphs out of uniform time slices.

PyCon US 2014 – Videos (Tutorials)

Sat, 04/12/2014 - 19:04

Categories:

Topic Maps

The tutorial videos from PyCon US 2014 are online! Talks to follow.

Tutorials arranged by author for your finding convenience:

  • Blomo, Jim mrjob: Snakes on a Hadoop

    This tutorial will take participants through basic usage of mrjob by writing analytics jobs over Yelp data. mrjob lets you easily write, run, and test distributed batch jobs in Python, on top of Hadoop. Hadoop is a MapReduce platform for processing big data but requires a fair amount of Java boilerplate. mrjob is an open source Python library written by Yelp and used to process TBs of data every day. (A minimal mrjob sketch appears after this list.)
  • Clifford, Williams, G. 0 to 00111100 with web2py

    This tutorial teaches basic web development for people who have some experience with HTML. No experience with CSS or JavaScript is required. We will build a basic web application using AJAX, web forms, and a local SQL database.
  • Grisel, Olivier; Vanderplas, Jake Exploring Machine Learning with Scikit-learn

    This tutorial will offer an introduction to the core concepts of machine learning, and how they can be easily applied in Python using Scikit-learn. We will use the scikit-learn API to introduce and explore the basic categories of machine learning problems, related topics such as feature selection and model validation, and the application of these tools to real-world data sets. (See the short scikit-learn sketch after this list.)
  • Love, Kenneth Getting Started with Django, a crash course

    Getting Started With Django is a well-established series of videos teaching best practices and common approaches for building web apps to people new to Django. This tutorial combines the first few lessons into a single lesson. Attendees will follow along as I start and build an entire simple web app and, network permitting, deploy it to Heroku.
  • Ma, Eric How to formulate a (science) problem and analyze it using Python code

    Are you interested in doing analysis but don’t know where to start? This tutorial is for you. Python packages & tools (IPython, scikit-learn, NetworkX) are powerful for performing data analysis. However, little is said about formulating the questions and tying these tools together to provide a holistic view of the data. This tutorial will provide you with an introduction on how this can be done.
  • Müller, Mike Descriptors and Metaclasses – Understanding and Using Python's More Advanced Features

    Descriptors and metaclasses are advanced Python features. While it is possible to write Python programs without active knowledge of them, knowing how they work provides a deeper understanding of the language. Using examples, you will learn how they work and when to use them, as well as when it is better not to. Use cases provide working code that can serve as a base for your own solutions. (A small descriptor example appears after this list.)
  • Vanderplas, Jake; Grisel, Olivier Exploring Machine Learning with Scikit-learn

    This tutorial will offer an introduction to the core concepts of machine learning, and how they can be easily applied in Python using Scikit-learn. We will use the scikit-learn API to introduce and explore the basic categories of machine learning problems, related topics such as feature selection and model validation, and the application of these tools to real-world data sets.

Tutorials or talks with multiple authors are listed under each author. (I don’t know which one you will remember.)
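
To give a flavor of what Blomo’s mrjob tutorial covers, this is essentially mrjob’s canonical word-count example (the tutorial itself works over Yelp data): a job is a class with mapper/combiner/reducer methods, and the same script runs locally for testing or on a Hadoop cluster.

# word_count.py -- requires `pip install mrjob`
import re

from mrjob.job import MRJob

WORD_RE = re.compile(r"[\w']+")

class MRWordFreqCount(MRJob):
    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def combiner(self, word, counts):     # local pre-aggregation before the shuffle
        yield word, sum(counts)

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordFreqCount.run()

Run it locally with “python word_count.py input.txt”, or point it at a cluster with the -r hadoop (or -r emr) runner.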
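
For the Grisel/Vanderplas tutorial, the core scikit-learn workflow (the estimator API plus model validation) fits in a few lines. A minimal sketch, using the current scikit-learn API and the built-in iris data:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)                               # every estimator: fit ...
print("held-out accuracy:", clf.score(X_test, y_test))  # ... then predict / score

# model validation: 5-fold cross-validation instead of trusting a single split
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())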
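
And a small preview of Müller’s topic: a descriptor is just an object with __get__/__set__ that lives on a class and intercepts attribute access on its instances. A validating attribute is the classic example:

class Positive:
    """Data descriptor that rejects non-positive values."""

    def __set_name__(self, owner, name):   # Python 3.6+: learn the attribute's name
        self.storage = "_" + name

    def __get__(self, obj, objtype=None):
        if obj is None:                    # accessed on the class itself
            return self
        return getattr(obj, self.storage)

    def __set__(self, obj, value):
        if value <= 0:
            raise ValueError(f"{self.storage[1:]} must be positive, got {value!r}")
        setattr(obj, self.storage, value)

class Order:
    price = Positive()       # descriptors are declared on the class, not per instance
    quantity = Positive()

    def __init__(self, price, quantity):
        self.price = price               # routed through Positive.__set__
        self.quantity = quantity

print(Order(9.99, 3).price)              # 9.99
# Order(9.99, 0) raises ValueError: quantity must be positive, got 0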

I am going to spin up the page for the talks so that when the videos appear, all I need do is insert the video links.

Enjoy!

Lost Boolean Operator?

Sat, 04/12/2014 - 18:15

Categories:

Topic Maps

Binary Boolean Operator: The Lost Levels

From the post:

The most widely known of these four siblings is operator number 11. This operator is called the “material conditional”. It is used to test if a statement fits the logical pattern “P implies Q”. It is equivalent to !P || Q by the material implication.

I only know one language that implements this operation: VBScript.

The post has a good example of why the material conditional is useful.
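
Most languages, Python included, don’t have a dedicated operator for it, but the definition is a one-liner. A quick sketch and its truth table:

def implies(p: bool, q: bool) -> bool:
    """Material conditional: false only when p is true and q is false."""
    return (not p) or q      # the material-implication rewrite from the post

for p in (False, True):
    for q in (False, True):
        print(f"{p!s:5} -> {q!s:5} : {implies(p, q)}")
# False -> False : True
# False -> True  : True
# True  -> False : False
# True  -> True  : True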

Will your next language have a material conditional operator?

I first saw this in Pete Warden’s Five short links for April 3, 2014.

Prescription vs. Description

Sat, 04/12/2014 - 15:59

Categories:

Topic Maps

Kurt Cagle posted this image on Facebook:

with this comment:

The difference between INTJs and INTPs in a nutshell. Most engineers, and many programmers, are INTJs. Theoretical scientists (and I’m increasingly putting data scientists in that category) are far more INTPs – they are observers trying to understand why systems of things work, rather than people who use that knowledge to build, control or constrain those systems.

I would rephrase the distinction to be one of prescription (engineers) versus description (scientists), but that too is a false dichotomy.

You have to have some real or imagined description of a system before you can start prescribing for it, and any method for exploring a system has some prescriptive aspects.

The better course is to recognize that exploring or building systems has aspects of both. Making that recognition may (or may not) make it easier to discuss assumptions of either perspective that aren’t often voiced.

Being more from the descriptive side of the house, I enjoy pointing out that behind most prescriptive approaches are software and services to help you implement those prescriptions. Hardly seems like an unbiased starting point to me.

To be fair, however, the descriptive side of the house often has trouble distinguishing between important things to describe and describing everything it can to system capacity, for fear of missing some edge case. The “edge” cases may be larger than the system but if they lack business justification, pragmatics should reign over purity.

Or to put it another way: Prescription alone is too brittle and description alone is too endless.

Effective semantic modeling/integration needs to consist of varying portions of prescription and description depending upon the requirements of the project and projected ROI.

PS: The “ROI” of a project that is not in your domain and doesn’t use your data, your staff, etc., is not a measure of the potential “ROI” of your project. Crediting such reports is “ROI” for the marketing department that created the news. It is very important to distinguish “your ROI” from the “vendor’s ROI”; they are not the same thing. If you need help with that distinction, you know where to find me.

Hemingway App

Sat, 04/12/2014 - 00:10

Categories:

Topic Maps

Hemingway App

We are a long way from something equivalent to the Hemingway App for topic maps or other semantic technologies, but it struck me that that may not always be true.

Take it for a spin and see what you think.

What modifications would be necessary to make this concept work for a semantic technology?

Definitions Extractions from the Code of Federal Regulations

Sat, 04/12/2014 - 00:03

Categories:

Topic Maps

Definitions Extractions from the Code of Federal Regulations by Mohammad M. AL Asswad, Deepthi Rajagopalan, and Neha Kulkarni. (poster)

From a description of the project:

Imagine you’re opening a new business that uses water in the production cycle. If you want to know what federal regulations apply to you, you might do a Google search that leads to the Code of Federal Regulations. But that’s where it gets complicated, because the law contains hundreds of regulations involving water that are difficult to narrow down. (The CFR alone contains 13898 references to water.) For example, water may be defined one way when referring to a drinkable liquid and another when defined as an emission from a manufacturing facility. If the regulation says your water must maintain a certain level of purity, to which water are they referring? Definitions are the building blocks of the law, and yet poring through them to find what applies to you is frustrating to an average business owner. Computer automation might help, but how can a computer understand exactly what kind of water you’re looking for? We at the Legal Information Institute think this is a pretty important challenge, and apparently Google does too.

Looking forward to learning more about this project!

BTW, this is the same Code of Federal Regulations that some members of Congress don’t think needs to be indexed.

Knowing what legal definitions apply is a big step towards making legal material more accessible.

Google Top 10 Search Tips

Fri, 04/11/2014 - 23:47

Categories:

Topic Maps

Google Top 10 Search Tips by Karen Blakeman.

From the post:

These are the top 10 tips from the participants of a recent workshop on Google, organised by UKeiG and held on 9th April 2014. The edited slides from the day can be found on authorSTREAM at http://www.authorstream.com/Presentation/karenblakeman-2121264-making-google-behave-techniques-better-results/ and on Slideshare at http://www.slideshare.net/KarenBlakeman/making-google-behave-techniques-for-better-results

Ten search tips from the trenches. Makes a very nice cheat sheet.

Transcribing Piano Rolls…

Fri, 04/11/2014 - 23:14

Categories:

Topic Maps

Transcribing Piano Rolls, the Pythonic Way by Zulko.

From the post:

Piano rolls are these rolls of perforated paper that you put in the saloon’s mechanical piano. They were very popular until the 1950s, and the piano roll repertoire counts thousands of arrangements (some by the greatest names of jazz) which have never been published in any other form.

NSA news isn’t going to subside anytime soon, so I am including this post as one way to relax over the weekend.

I’m not a musicologist, but I think transcribing music from an image of a piano roll being played is quite fascinating.
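
Zulko’s post walks through his own pipeline; purely to give a flavor of the “Pythonic way,” here is a hypothetical sketch (the parameter names and the 0.5 coverage threshold are mine, not his) that thresholds a scanned roll with NumPy and Pillow and turns each track’s runs of punched paper into note on/off positions:

import numpy as np
from PIL import Image

def roll_to_notes(path, n_tracks=88, hole_is_dark=True, threshold=128):
    img = np.array(Image.open(path).convert("L"))   # grayscale, rows run along the roll
    holes = img < threshold if hole_is_dark else img > threshold
    track_width = holes.shape[1] // n_tracks
    notes = []
    for track in range(n_tracks):
        band = holes[:, track * track_width:(track + 1) * track_width]
        punched = band.mean(axis=1) > 0.5           # a row is "on" if most of the band is hole
        edges = np.flatnonzero(np.diff(punched.astype(int)))
        bounds = np.concatenate(([0], edges + 1, [len(punched)]))
        for start, end in zip(bounds[:-1], bounds[1:]):
            if punched[start]:
                notes.append((track, int(start), int(end)))   # (track, note-on row, note-off row)
    return notes

Mapping tracks to pitches and rows to time is where the actual musicology (and Zulko’s post) comes in.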

I first saw this in a tweet from Lars Marius Garshol.