Tuesday, July 27, 2010

It's not Open Data, so stop calling it that...

While it is a great positive change that data is being released through numerous efforts around the world, data release is not the same as Open Data release. A number of Canadian cities have announced Open Data initiatives, but they are not releasing Open Data. They are just releasing data. Of course, this is better than not releasing data. But let's at least be honest about what we are doing.

Why aren't they Open Data? Because their licenses are not Open Data licenses:
  • Not Open Data: Edmonton: "The City may, in its sole discretion, cancel or suspend your access to the datasets without notice and for any reason..." - from Terms of Use
  • Not Open Data: Vancouver: "The City may, in its sole discretion, cancel or suspend your access to the datasets without notice and for any reason..." - from Terms of Use
  • Not Open Data: Ottawa: "The City may, in its sole discretion, cancel or suspend your access to the datasets without notice and for any reason..." - from Terms of Use
  • Not Open Data: Toronto: "The City may, in its sole discretion, cancel or suspend your access to the datasets without notice and for any reason..." - from Terms of Use
All of these licenses also suffer from the additional mis-feature of arbitrary retroactivity:
"The City may at any time and from time to time add, delete, or change the datasets or these Terms of Use. Notice of changes may be posted on the home page for these datasets or this page. Any change is effective immediately upon posting, unless otherwise stated"

These two clauses mean that there is no stability for someone using this data. If something they do or say (data-related or not) is not liked by the city whose data they are using, they can lose access. Or, if the city finds that many data users are doing things it does not like, it can change the Terms of Use in a way that affects data users have already obtained.

How to fix
Make versioning of both datasets and licenses obligatory, and drop the above two clauses. When a dataset is released, it is given a version, and that release is matched to a license version (usually the most recent one), which will always apply to that version of that data release. Any change to a license generates a new license version, applicable only to subsequent releases that choose to use the new license.

This is how things work in the Open Source world. It means that if you possess a piece of Open Source software, with a license of a specific version, someone half-way across the world from you cannot turn you into a criminal and/or shut you down by retroactively changing the license. It means that you have stability. Of course, you may be shut out of the next version if they change its license, but that doesn't necessarily shut you down today. You have some level of stability.
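
The versioning scheme described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the names LicenseVersion, DatasetRelease and Catalogue are hypothetical and do not come from any city's data portal or from any existing library. The point it tries to show is that a release is bound to one license version when it is published, and that publishing a new license version leaves earlier releases untouched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LicenseVersion:
    """One immutable version of a license text (hypothetical name)."""
    version: int
    text: str


@dataclass(frozen=True)
class DatasetRelease:
    """One immutable dataset release, bound to exactly one license version."""
    dataset: str
    version: int
    license: LicenseVersion


class Catalogue:
    """Hypothetical registry of license versions and dataset releases."""

    def __init__(self):
        self._licenses = []   # every license version ever published, in order
        self._releases = {}   # (dataset name, release version) -> DatasetRelease

    def publish_license(self, text):
        # Changing the license never edits an old version; it appends a new one.
        lic = LicenseVersion(len(self._licenses) + 1, text)
        self._licenses.append(lic)
        return lic

    def release(self, dataset, lic=None):
        # Default to the most recent license version at release time.
        # (Assumes at least one license version has been published.)
        if lic is None:
            lic = self._licenses[-1]
        version = 1 + sum(1 for (name, _) in self._releases if name == dataset)
        rel = DatasetRelease(dataset, version, lic)
        self._releases[(dataset, version)] = rel
        return rel

    def terms_for(self, dataset, version):
        # The terms of an existing release are looked up, never recomputed.
        return self._releases[(dataset, version)].license


catalogue = Catalogue()
catalogue.publish_license("Use for any purpose.")
first = catalogue.release("bus-stops")

# The city later tightens its terms; only subsequent releases are affected.
catalogue.publish_license("Non-commercial use only.")
second = catalogue.release("bus-stops")

print(catalogue.terms_for("bus-stops", first.version).text)
print(catalogue.terms_for("bus-stops", second.version).text)
```

Someone who obtained the first release keeps the terms they obtained it under, which is the stability argued for above. There is deliberately no operation that cancels, suspends or rewrites an existing release.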

An example: an SME builds a business based on data released by the cities. This business perhaps includes data mining tools that reveal some things that some of the cities do not like revealed or discussed. Those cities change the license, or simply cancel or suspend the company's data access (remember: "...cancel or suspend ...without notice and for any reason...") to shut this company out, and the company goes out of business.


So, if you want to release Open Source code or Open Data, you must be willing to accept that it will be used in ways that you (and/or your constituents) may find offensive. That is how it works.

Update: 2010 10 14: Eight Principles of Open Data from Open Government Data Principles:
  1. Primary: Data is as collected at the source, with the highest possible level of granularity, not in aggregate or modified forms.
  2. Complete: All public data is made available. Public data is data that is not subject to valid privacy, security or privilege limitations.
  3. Timely: Data is made available as quickly as necessary to preserve the value of the data.
  4. Accessible: Data is available to the widest range of users for the widest range of purposes.
  5. Machine processable: Data is reasonably structured to allow automated processing.
  6. Non-discriminatory: Data is available to anyone, with no requirement of registration.
  7. Non-proprietary: Data is available in a format over which no entity has exclusive control.
  8. License-free: Data is not subject to any copyright, patent, trademark or trade secret regulation. Reasonable privacy, security and privilege restrictions may be allowed.
The above cities' licenses are not compliant with #4 and #6 of these eight principles. See also http://zzzoot.blogspot.com/2010/08/what-is-open-gov-data-sunlight.html

Update: 2010 Nov 7: http://acrosscanadatrails.posterous.com/civicaccess-discuss-importance-of-true-open-d

Saturday, July 24, 2010

University visitor @ Australian National University

Tomorrow is (sadly) my last official day* as a university visitor at the Australian National University (ANU), Canberra.
I've been here since late June, invited by ANU adjunct and Funnelback chief scientist (and ex-CSIRO) David Hawking, to visit the Algorithms and Data Research Group, School of Computer Science, College of Engineering and Computer Science.

I was installed in a lovely office

looking out into the campus,

where I've been working on large-scale journal visualization, a continuation of the Torngat project. I've been working on a couple of things, including applying Mulan to the multi-label problem of the corpus I am working with, so I can get precision and recall to evaluate this method empirically. My productivity has been hampered by a recurring stomach problem (which appears to be gone this last week: yay!), so I've not progressed as much as I would have wanted to.... :-(
At the end of last week I gave a presentation at CSIRO (in the same building) on this work entitled: Search refinement: visualizing research journals in semantic space. After this talk I had a discussion with Alex Krumpholz and Hanna Suominen, and it is looking like we will be working together on a project involving Torngat.

I've also enjoyed the company of John Maindonald, one of the important players in the R universe (I went on a very nice Sunday walk with him, his wife and some of their friends). He's arranged an invite for me to talk tomorrow to the Canberra R Users Group about how I've used R in the Torngat project.

I also enjoyed an afternoon this past week meeting with the Australian National Data Service (ANDS) people, arranged by the wonderful Monica Omodei (formerly Berko), learning about their success in putting together ANDS and where they were going. They were also interested in Torngat, so I gave them a brief presentation on it.

A bit of a surprise collaboration: I have committed myself to helping improve the single-threaded Lucene indexing benchmark in the DaCapo Java benchmarks, after discussions with ANU's Steve Blackburn, a Java VM and GC guru. I've also committed to implementing a new multi-threaded indexing benchmark. Most of the code will be derived from existing code from my LuSql tool (it is actually from the yet un-released LuSql v1.0 codebase).

While it has been winter (Spring/Fall by Canadian standards...) here in Canberra, I have still been amazed at the fantastic birds that are (still) here. Like nothing we have at home, there are the loud and raucous-yet-endearing sulfur-crested cockatoos:

Gang-gang cockatoos, crimson and eastern rosellas, and many others that flittered past in a rush of colour or sang in the distance, unseen and unrecognized by me. These guys below were also fairly common, but I'm not sure what they are:
A number of people have been very helpful and gracious with their time, and I'm going to list them here, in no particular order: Alex Krumpholz & family (thanks for the hike up Black Mountain & dinner & talk afterwards!); Tom Rowlands; Tim Jones; Paul Thomas; David Hawking & Kathy Griffiths; Diane Kossatz; Chelsea Holton; John Maindonald; Monica Omodei; Hanna Suominen; Steve Blackburn.


I think this is Mount Stromlo (the low hill/mountain to the right, with what I think are the mountains of the Brindabella Range in the background), seen from Black Mountain. You can see some of the Mount Stromlo Observatory as white dots on the crest of Mt. Stromlo. The observatory and the forest that was on Mt. Stromlo were mostly destroyed in the 2003 Canberra bushfires. When I was up at the observatory earlier in the month there were many burnt-out tree stumps to be seen. And 'roos. :-)

*Today is July 25th: here in Canberra I am 14 hours ahead of the eastern time zone in North America. But blogger thinks I am there on the 24th...

Thursday, July 08, 2010

E-Government ICT Conference

The recent conference E-Government ICT Professionalism and Competences Service Science (IFIP International Federation for Information Processing, IFIP 20th World Computer Congress, Industry-Oriented Conferences, September 7-10, 2008, Milano, Italy) is of interest to those involved with ICT in government and eGovernment in general (although the conference is rather EU-centred).
Note the content is behind a pay-wall, so you can't read the articles unless you belong to an institution that has a subscription or you have one yourself.

Of interest:

Search: "open source":
  • Business Process Monitoring: BT Italy case study
  • The Italian Public Administration Electronic Market: Scenario, Operation, Trends:
    "The selected software to develop the dashboard has been Pentaho suite, which is an open source application. It better fits all the project needs that can be summarized by the following drivers:
    • Low license costs. It has no license costs.
    • Low impact on current systems architecture. It does not need a complex integration with source systems.
    • Availability of “off the shelf” features (reporting and KPIs analysis). It has rich libraries of graphical objects and reports to better show indicators.
    • Short Time to Delivery. The Dashboard has been delivered in three months including a tuning phase in which some new features had been added."

Table of Contents: