Tuesday, November 13, 2007

W3C Proposed SPARQL Recommendations

The W3C has just released three SPARQL-related proposed recommendations:

CSIRO research program takes data leadership

CSIRO has announced a new program, Terabyte Science, (press release: From molecules to the Milky Way: dealing with the data deluge) that is oriented around dealing with the issues of the large volumes of data generated by much of modern science. While this includes the difficult problem of the management of large volumes of data, this program will also focus on new ways to analyse and exploit this data.

Kudos to CSIRO for recognizing this issue and establishing a data science program that matters both to their own country and to the rest of us. Hopefully similar initiatives will take hold in other countries. And perhaps we will be seeing some of their work published in CODATA's Data Science Journal. (Disclosure: I am an observer on the Canadian National Committee for CODATA.)

Wednesday, October 31, 2007

"When Is Open Access Not Open Access?"

The article When Is Open Access Not Open Access? (CJ MacCallum, PLoS Biology) examines the slippery activities of publishers that try to fly the flag of Open Access (with varying degrees of capitalization) but who only offer the free-as-in-beer definition of freedom, as opposed to the Open Access definition, which includes -- as well as free-gratis freedom -- extensive intellectual property rights permitting unrestricted derivative use. This issue and these distinctions were discussed earlier this year in "Free but not open?" at the PLoS blog. I have noticed that many journals use weasel words like "We conform to open access as defined by SHERPA". The SHERPA definition does not include the extensive IP rights described by Open Access:

By "open access" to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.
-Budapest Open Access Initiative
This watering-down of freedom from "free-gratis and free-to-use-and-modify-and-distribute" to simply "free-gratis" (and maybe some IP freedom for the authors) and the general obfuscation/duplicity/ignorance by publishers parallels similar activities in the software world, where the freedom issue has also been confused and watered-down in various "open source" (note case) licenses. See Open Source vs. Free Software.

Saturday, October 27, 2007

Tag Cloud inspired HTML Select lists

I have been working with Tag clouds and other Web 2.0 sorts of things quite a bit lately [see earlier post: Drill Clouds for Search Refinement] and couldn't help noticing that it might be useful to use the Tag cloud "Size reflects frequency/importance" idiom in HTML select lists, so I did a little bit of experimenting (BTW, I did look for these on the Web but didn't find them: it doesn't mean they are not already out there...).

So I played with the styles of these elements, and was able to get something that looks like this:

I am not sure how the above HTML renders in your browser (Update: Daniel has some info on how/if this works in different browsers), but here is how it renders in mine (Firefox on Linux, Suse 10.2):

It is interesting how the browser allocates space: it seems like it uses the largest (tallest) item in the list to allocate the height of the widget, which makes sense. But while the version of Firefox appropriately sizes the pull-down contents (i.e. above, left), when a term is selected, it is sized at the default text font size (above right), even if its font size as defined and as displayed in the pull-down is larger. This appears to be a bug. But it is easily possible that there is some CSS that I should be using to look after this but do not know about. I have not tested this behaviour in other browsers, but I have for other versions of Firefox (1.06,

Notwithstanding this behaviour, on experimenting with these select variations, I think that they work well and are useful in the appropriate situations.

Update 2008 Oct 16: Oh, here is how I made this:

<select>
<option style="font-size: 80%;" value="Aggregators"> Aggregators</option>
<option style="font-size: 155%;" value="Blogs"> Blogs</option>
<option style="font-size: 80%;" value="Collaboration"> Collaboration</option>
<option style="font-size: 125%;" value="Joy of Use"> Joy of Use</option>
<option style="font-size: 80%;" value="Podcasting"> Podcasting</option>
<option style="font-size: 125%;" value="RSS"> RSS</option>
<option style="font-size: 200%;" value="Web 2.0"> Web 2.0</option>
<option style="font-size: 80%;" value="XHTML"> XHTML</option>
</select>


<select size="5">
<option style="font-size: 80%;" value="Aggregators"> Aggregators</option>
<option style="font-size: 155%;" value="Blogs"> Blogs</option>
<option style="font-size: 80%;" value="Collaboration"> Collaboration</option>
<option style="font-size: 125%;" value="Joy of Use"> Joy of Use</option>
<option style="font-size: 80%;" value="Podcasting"> Podcasting</option>
<option style="font-size: 125%;" value="RSS"> RSS</option>
<option style="font-size: 200%;" value="Web 2.0"> Web 2.0</option>
<option style="font-size: 80%;" value="XHTML"> XHTML</option>
</select>
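
The font sizes in the markup above were set by hand. A minimal sketch of how they could be derived from tag frequencies instead, assuming a simple linear scale between 80% and 200% (the function names, the scaling rule and the example counts are all hypothetical, invented for illustration):

```python
from html import escape


def font_size_percent(count, min_count, max_count, min_pct=80, max_pct=200):
    """Linearly map a tag count onto a font-size percentage (assumed scale)."""
    if max_count == min_count:
        return min_pct
    scale = (count - min_count) / (max_count - min_count)
    return round(min_pct + scale * (max_pct - min_pct))


def cloud_select(tag_counts, size=None):
    """Build a <select> whose options are sized by tag frequency."""
    lo, hi = min(tag_counts.values()), max(tag_counts.values())
    size_attr = f' size="{size}"' if size else ""
    lines = [f"<select{size_attr}>"]
    for tag in sorted(tag_counts):
        pct = font_size_percent(tag_counts[tag], lo, hi)
        lines.append(
            f'<option style="font-size: {pct}%;" value="{escape(tag)}">'
            f"{escape(tag)}</option>"
        )
    lines.append("</select>")
    return "\n".join(lines)


# Hypothetical tag counts, not taken from any real data set.
example_counts = {"Blogs": 12, "RSS": 8, "Web 2.0": 20, "XHTML": 2}
html_output = cloud_select(example_counts, size=5)
print(html_output)
```

A logarithmic scale might suit skewed tag distributions better; the linear one is used here only because it is the simplest to read.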



Tuesday, October 23, 2007

Intellectual Property articles in CACM

The October issue of the Communications of the ACM has two complementary articles in the area of Intellectual Property. Complementary in that one is on copyright reform and the other is on (software) patents:

NIH Open Access at Risk in U.S. bill

Peter Suber of Open Access News reports that a U.S. Senate labour bill has recently had an amendment added to it, putting the Open Access mandate of the NIH at risk:
The provision to mandate OA at the NIH is in trouble. Late Friday, just before the filing deadline, a Senator acting on behalf of the publishing lobby filed two harmful amendments, one to delete the provision and one to weaken it significantly.

Cyberinfrastructure and Data preservation

Richard Akerman - my colleague here at CISTI - has a couple of excellent pointers to digital preservation and cyberinfrastructure resources at Science Library Pad:
-- CLIR cyberinfrastructure short articles:
-- PV 2007 - Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical Data:

Thursday, October 18, 2007

Minister of Industry (Canada) Appoints Members of Science, Technology and Innovation Council

It is good to see that this advisory body -- promised in the Canadian government's science and technology strategy (Mobilizing Science and Technology to Canada's Advantage) released in May 2007 -- has now been created and its appointments made. I hope that it will be effective in its activities. Now that it is in place, perhaps this body might lend some focus (and hopefully its support) to various national science activities, initiatives and proposals, such as the recommendations of the National Consultation on Access to Scientific Research Data (NCASRD).

Saturday, October 13, 2007

IJDL Special Issue: Connecting digital libraries to eScience

The International Journal on Digital Libraries has a special issue entitled "Connecting digital libraries to eScience". I haven't had a chance to read any of the articles, but they look very interesting, and include some discussion on various scientific data issues, collaboration, repositories, research infrastructure, etc:

Thursday, October 11, 2007

New JISC Data Sharing Documents

As part of its DISC-UK DataShare project, JISC has released two documents:
  1. DISC-UK DataShare: State-of-the-Art Review, Harry Gibbs
  2. Data Sharing Continuum graphic, Robin Rice
The former is a summary of recent projects and policy, and introduced me to a number of projects and initiatives that I hadn't previously known about. The latter is a well thought-out view of the data sharing continuum, showing us where we have been (and perhaps for some of us, still are!) and a good idea of where we will/should be going. A good graphic to show to a manager trying to understand the big picture.

Monday, October 08, 2007

Drill Clouds for Search Refinement

I'd like to introduce something I call drill clouds, an extension to tag clouds for search refinement in information retrieval.

I will be using an experimental Lucene-based search platform that I have developed, called Ungava (more on this later), which includes my implementation of drill clouds. Note that much of this posting is derived from a posting of mine on drill clouds on the CISTI Lab wiki.

Drill clouds are what I call an extension to tag clouds to make them a useful tool for search refinement. That is, to use a tag cloud to refine an existing query by adding new elements to the query through interactions with the cloud. As this results in a kind of drill-down search behaviour, these new clouds have been named drill clouds. Some differences between traditional tag clouds and drill clouds:
  • Drill clouds are applied to search results and -- as search results can be very large and include many result items and many tags -- the cloud that is presented is created from a subset of the result set (usually the top N). This is done for both user interface and performance reasons. This is different from traditional tag clouds, which are usually applied to all items. In Ungava, the number of tags and the number of search result articles from which those tags were derived are displayed and can be manipulated by the user.
  • When a tag is used in the query refinement, this tag is excluded from the subsequent cloud, as it exists in every result item of the new search. This is perhaps the most distinguishing attribute of a drill cloud: the exclusion of the accumulating search-refinement tags from the clouds of subsequent queries.
For example:
  1. Using the default Ungava search form, the user searches for fulltext: cell

  2. The user now clicks on the keyword cloud link, getting:

  3. The user now clicks on the chromatin keyword cloud entry, which adds keyword:chromatin to the original search query, resulting in a new set of results:
     Note that this refined search results in 52 hits, down from the original 3461 hits.

  4. Now when the user clicks on the Keyword cloud link, they get the keyword drill cloud for the new results, but with the keyword (tag) chromatin excluded from the cloud, removing its dominating influence on the cloud (as all articles would have chromatin as a keyword). Here is the resulting keyword drill cloud:
     If chromatin were not excluded, its dominance would shrink the other cloud entries, reducing the discriminating power of the cloud and its overall usefulness. Here is what the cloud would look like in our example:
     The user can continue to iteratively refine their search using the drill cloud from each search. Note that users are not constrained to using the same type of metadata tag cloud for refinement, i.e. they can follow a keyword drill cloud refined search with one from one of the other available drill clouds.

  5. Continuing our example that produced this results list:
  6. Selecting the Author cloud produces the following drill cloud:

  7. Clicking on the author tag Ausió, Juan produces the following results:
Note how after only a small number of drill cloud iterations involving only mouse clicks (not typing in forms), the original result set was reduced from the original 3461 hits to 5 hits.
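
A minimal sketch of the two drill cloud rules described above (build the cloud from the top N results only, and exclude tags already used in the refinement). This is not Ungava's code: the result shape (a list of dicts with a "keywords" list), the function names and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def drill_cloud(results, query_tags, top_n=100, max_tags=50):
    """Count tags over the top N results, skipping tags already in the query.

    results: ranked list of dicts, each with a "keywords" list (assumed shape).
    query_tags: tags added to the query by earlier refinements.
    """
    excluded = set(query_tags)
    counts = Counter()
    for item in results[:top_n]:
        # Count each tag at most once per result item.
        for tag in set(item["keywords"]):
            if tag not in excluded:
                counts[tag] += 1
    return counts.most_common(max_tags)


def refine(results, tag):
    """Keep only the results that carry the clicked tag."""
    return [item for item in results if tag in item["keywords"]]


# Toy result set, invented for illustration.
results = [
    {"title": "A", "keywords": ["chromatin", "histone"]},
    {"title": "B", "keywords": ["chromatin", "nucleosome", "histone"]},
    {"title": "C", "keywords": ["membrane"]},
]

first_cloud = drill_cloud(results, query_tags=[])
refined = refine(results, "chromatin")
second_cloud = drill_cloud(refined, query_tags=["chromatin"])
print(first_cloud)
print(second_cloud)
```

In the second cloud, chromatin no longer appears even though every refined result carries it, which is the behaviour that keeps the remaining tags distinguishable.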

I am hoping to put together a paper on drill clouds and submit it to JCDL 2008.

  • Tuesday, Oct 9: Ungava appears to be back up again.
  • Monday, Oct 8 2007, 1300 EST: It seems that Ungava is down. Today is Canadian Thanksgiving, so I don't think I'll be able to have this brought back up before tomorrow

Friday, September 28, 2007

New NSF Program: Cyber-Enabled Discovery and Innovation (CDI)

The NSF today announced a very exciting -- at least to me and my research colleagues -- program:

CDI seeks ambitious, transformative, multidisciplinary research proposals within or across the following three thematic areas:

  • From Data to Knowledge: enhancing human cognition and generating new knowledge from a wealth of heterogeneous digital data;
  • Understanding Complexity in Natural, Built, and Social Systems: deriving fundamental insights on systems comprising multiple interacting elements; and
  • Building Virtual Organizations: enhancing discovery and innovation by bringing people and resources together across institutional, geographical and cultural boundaries.
Congruent with the three thematic areas, CDI projects will enable transformative discovery to identify patterns and structures in massive datasets; exploit computation as a means of achieving deeper understanding in the natural and social sciences and engineering; simulate and predict complex stochastic or chaotic systems; explore and model nature’s interactions, connections, complex relations, and interdependencies, scaling from sub-particles to galactic, from subcellular to biosphere, and from the individual to the societal; train future generations of scientists and engineers to enhance and use cyber resources; and facilitate creative, cyber-enabled boundary-crossing collaborations, including those with industry and international dimensions, to advance the frontiers of science and engineering and broaden participation in STEM fields.

My own research work fits very well into this program. It is an area that will alter the nature of how science is done, and promises to -- as I said earlier today in an email -- radically increase the usefulness and value of the research ecosystem, both to those inside and outside of research. But more important, it will change many aspects of the dominant way in which science is done, including researchers being restricted to narrow and deep knowledge in a particular research area, sometimes comically (and somewhat inaccurately) illustrated by this limerick:

There once was an old man from Esser,
Whose knowledge grew lesser and lesser.
It at last grew so small,
He knew nothing at all,
And now he's a college professor.

The tools that this program suggests, along with the research programs at other organizations (including CISTI Research here at CISTI), allow for rich, productive and integrative inter-disciplinary approaches. Some of these tools will themselves grow towards being a researcher's de facto collaborator(s): they can represent expert knowledge external to the researcher's own focus and flag important external and out-of-band connections to the researcher. Who wouldn't want to have 30-40 (friendly!) experts in domains -- in which you are not an expert -- monitoring everything you write, everything you search, everything you read, having an understanding of your core research interests, ready to identify and flag important external connections, collaborators, publications, initiatives, datasets, claims, knowledge?

Notice I never used the phrase "paradigm-shift". :-)

Thanks Teaching Ideas for Primary Teachers for the limerick!

Sustainable Digital Data Preservation and Access Network Partners (DATANET) CFP

The US National Science Foundation Office of Cyberinfrastructure has a call for proposals. From the call:

The new types of organizations envisioned in this solicitation will integrate library and archival sciences, cyberinfrastructure, computer and information sciences, and domain science expertise to:

  • provide reliable digital preservation, access, integration, and analysis capabilities for science and/or engineering data over a decades-long timeline;
  • continuously anticipate and adapt to changes in technologies and in user needs and expectations;
  • engage at the frontiers of computer and information science and cyberinfrastructure with research and development to drive the leading edge forward; and
  • serve as component elements of an interoperable data preservation and access network.
...these exemplar organizations can serve as the basis for rational investment in digital preservation and access by diverse sectors of society at the local, regional, national, and international levels, paving the way for a robust and resilient national and global digital data framework.
This is a significant step forward, in the spirit of many of the plans and ideas contained in the Canadian NCASRD (and other efforts), and hopefully some of this visionary investment in data management and data for science will take hold here in Canada.

Wednesday, September 26, 2007

Scholarly Electronic Publishing Bibliography, 2006

I've just discovered this great resource, created by Charles W. Bailey, Jr.: a 266-page bibliography (PDF) that includes sections on:
  • Economic Issues
  • Electronic Books and Texts
  • Electronic Serials (including Electronic Distribution of Printed Journals)
  • Legal Issues (including Intellectual Property Rights and License Agreements)
  • Library Issues (including Digital Libraries and Information Integrity and Preservation)
  • New Publishing Models
  • Publisher Issues (including Digital Rights Management)
  • Repositories, E‐Prints, and OAI
This is the latest edition of this bibliography, first described in:
Bailey, Charles W., Jr. "Evolution of an Electronic Book: The Scholarly Electronic Publishing Bibliography." The Journal of Electronic Publishing 7 (December 2001).

Wednesday, September 19, 2007

Open Data for Global Science

The CODATA (Committee on Data for Science and Technology) Data Science Journal has a special issue entitled "Open Data for Global Science" (June 2007) which has a series of excellent articles. It is broken into two parts (Recent International and National Governmental Data Policy Developments and Analysis of Data Policy Issues). The Canadian National Consultation on Access to Scientific Research Data (NCASRD) is reported on, and other articles of interest include:
The "Big Opportunities..." article I find particularly engaging for a couple of reasons: 1) their introduction of the "Commons of Scientific and Technical Data (CSTD)" as managed by the network of university libraries sounds promising, and 2) my own pet belief, that the data produced by "small science" is also useful and needs to be looked after.

Thanks to Cliff Lynch for the pointer.

Friday, September 14, 2007

New Zealand Science and Open Access

In "An Information Revolution", David Penman discusses Open Access and Open Data (especially as applied to government-funded research) in general, and more specifically as applied to New Zealand science and scientists. While there is some good news:
"The Foundation for Research, Science and Technology is now reviewing its data policy and moving towards the norm for the OECD – greater open access for publicly-funded data. Rather than the research provider deciding on access, all information is openly and freely available unless restrictions such as national security, environmental damage (eg, the GPS co-ordinates of threatened species), or clear commercial disadvantage can be justified."

He has some blunt - and appropriate - words for NZ scientists:
Our researchers will also have to change. No longer can they sit with filing cabinets full of data waiting for the definitive experiment or the life time monograph. Publish quickly in electronic media, make your data and models freely available and get rewards from both publishing and showing that your data are being used by others – this should become the norm.
He also has an interesting view of the future of libraries, one that many libraries would be unwise to ignore:
Libraries are becoming available to all without leaving your home, information on your environment will become openly and freely available and communities will be able to use the internet to take more control of our institutions – a new style of democracy will emerge.
These comments are reminiscent of the meeting that was held in Wellington in 2003, described in Peter Davis' report "Saving and sharing research data: issues of policy and practice", where the relative vacuum in this area was also pointed out:
"If New Zealand is to assess this international science trend and respond to it, a much more united scientific response is required, and also a “whole of government” approach. Not only the traditional science agencies, but also the National Library, Statistics New Zealand, even local government, all may need to contribute in some way and assist with a consensual and comprehensive solution."
Fortunately, there are other examples where New Zealand is moving ahead in this area: the recent (July 2007) announcement of a free data policy by the New Zealand National Institute of Water & Atmospheric Research (NIWA). Hopefully more like this will follow....


Wednesday, September 05, 2007

CIHR announces "Policy on Access to Research Outputs"

The Canadian Institutes of Health Research (CIHR) have announced this new policy, which includes publications AND data. Basically, they have taken a gentle but significant step in opening up the research outputs of the grant recipients they support. It does not impact all forms of publications. Two of the most salient points:
5.1.1 Peer-reviewed Journal Publications
Grant recipients are now required to make every effort to ensure that their peer-reviewed publications are freely accessible through the Publisher's website (Option #1) or an online repository as soon as possible and in any event within six months of publication (Option #2).
5.1.2 Publication-related Research Data
Recognizing that access to research data promotes the advancement of science and further high-quality and ethical investigation, CIHR explored current best practices and standards related to the deposition of publication-related data in openly accessible databases. As a first step, CIHR will now require grant recipients to deposit bioinformatics, atomic, and molecular coordinate data into the appropriate public database, as already required by most journals, immediately upon publication of research results (e.g., deposition of nucleic acid sequences into GenBank). Please refer to the Annex for examples of research outputs and the corresponding publicly accessible repository or database.

Gentle but firm: "as soon as possible....in any event within six months".

Thanks to Heather Morrison.


Financial Times on Open Access

In "The irony of a web without science" (Sept 4 2007) James Boyle decries the state of scientific research and describes the limited amount of scientific output - in particular journal publications - that can be accessed in an Open Access manner. The author cannot reconcile his observation that the "genius of the web is that it is an open network" with the closed and expensive nature of modern science and the modern scientific publishing landscape.

But the author goes on to say:

Thus I do not support the proposal that all articles based on state-funded research must pass immediately into the public domain. But there are more modest proposals that deserve our attention.

Pending legislation in the US balances the interest of commercial publishers and the public by requiring that, a year after its publication, NIH-funded research must be available, online, in full...

I think the author muddles a number of different ideas here (OA does not imply public domain...) and does not properly understand what the Open Access movement is. But it is interesting to see this discussed in something like the FT.

Wednesday, August 29, 2007

Fedora to grow to include open access publishing, eScience, and eScholarship

Sandy Payette has plans to expand Fedora to support open access publishing, eScience and eScholarship. With a recent $4.9M grant from the Moore Foundation, it looks like she might have the opportunity to do this...

Clifford Lynch on Cyberinfrastructure and E-Research

Clifford Lynch's closing keynote to the 2007 Seminars On Academic Computing entitled "The Institutional Challenges of Cyberinfrastructure and E-Research" is now available as a podcast.

Abstract: It has become clear that scholarly practice and scholarly communication across a wide range of disciplines are being transfigured by a series of developments in IT and networked information. While this has been widely discussed at the national and international levels in the context of large-scale advanced scientific projects, the challenges at the level of individual universities and colleges may prove more complex and more difficult. This presentation will focus on these challenges, as well as the development of truly institution-wide strategies that can support and advance the promises of e-research.

Friday, August 24, 2007

Study: Reduced Open Source developer productivity linked to "restrictive" FLOSS licenses (where "restrictive"=GPL and "non-restrictive"=BSD)

A study by economists from Tel Aviv University and the Centre for Economic Policy Research (CEPR) entitled "Open source software: Motivation and restrictive licensing"[1] (pre-print) looks at the productivity of developers on Open Source projects and concludes:
"...that the output per contributor in open source projects is much higher when licenses are less restrictive and more commercially oriented."

and observes:
"Projects written for the Linux operating system have lower output per contributor than projects written for other operating systems..."
"Output per contributor in projects oriented towards end users (DESKTOP) is significantly lower than that in projects for developers."
The authors also observed that the median number of contributors in "restrictive" projects (13) was much lower than in "non-restrictive" projects (35).

They chose the 71 most active projects on SourceForge in January 2000 and studied them over an 18 month period starting in January 2002. They measured these projects every 2 months over this period, resulting in 9 samples. The metrics they used include: source lines of code (SLOC), number of contributors, the "restrictiveness" of the license (GPL = very; LGPL, Mozilla, NPL, MPL = moderate; BSD = non), operating system, age of project, whether it is a desktop or system application, language (C++ or C = 1; all others = 0), and others. They took into account differences in LOC between languages by also looking separately at just the C++ or C projects.

I do not understand the lag in choosing the projects (January 2000) and the start of the data sampling (January 2002). This in itself could have skewed the results, i.e. the 71 most active projects in 2000 would almost definitely NOT be the most active 2 years later. I think this may be a major flaw in this study.

I also don't think that the sample size is large enough, and I think the sampling method should have been a random selection of projects that met some reasonable criteria, like:
  • had at least C contributors
  • had at least L lines of code contributed over the last M months
  • had at least D downloads over the last M months (to screen out very new and very unpopular projects?)
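
A minimal sketch of the kind of sampling suggested above: filter projects by the three thresholds, then draw a random sample from those that qualify. The field names, thresholds and toy projects are hypothetical; nothing here comes from the study or from SourceForge.

```python
import random


def sample_projects(projects, k, min_contributors, min_loc, min_downloads, seed=None):
    """Randomly sample up to k projects that meet all three thresholds."""
    eligible = [
        p for p in projects
        if p["contributors"] >= min_contributors
        and p["recent_loc"] >= min_loc
        and p["recent_downloads"] >= min_downloads
    ]
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(eligible, min(k, len(eligible)))


# Toy data, invented for illustration.
projects = [
    {"name": "alpha", "contributors": 12, "recent_loc": 4000, "recent_downloads": 900},
    {"name": "beta", "contributors": 2, "recent_loc": 150, "recent_downloads": 30},
    {"name": "gamma", "contributors": 40, "recent_loc": 9000, "recent_downloads": 5000},
    {"name": "delta", "contributors": 7, "recent_loc": 2500, "recent_downloads": 10},
]

sample = sample_projects(
    projects, k=2, min_contributors=5, min_loc=1000, min_downloads=100, seed=1
)
print(sorted(p["name"] for p in sample))
```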
I also believe that they made another possible error: they observe in their discussion that the median number of LOC per project was 53K for "non-restrictive" and 60K for "restrictive". They suggest that this is not a big difference (they do not appear to verify the nature of the distribution of LOC in projects by license grouping statistically). But I would suggest that 500 lines of code for a project that has 5K LOC can often be a more significant contribution than 500 LOC to a 100K LOC project. They should have looked into the effect of normalizing the contributed LOC by the total LOC in the project.
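
To make the normalization point concrete, here is a small sketch using the two project sizes from the paragraph above. The function name is hypothetical, and dividing contributed LOC by total LOC is only one possible normalization, not something the study did.

```python
def normalized_contribution(contributed_loc, total_loc):
    """Return contributed LOC as a fraction of the project's total LOC."""
    if total_loc <= 0:
        raise ValueError("total_loc must be positive")
    return contributed_loc / total_loc


small_project = normalized_contribution(500, 5_000)    # 500 LOC into a 5K LOC project
large_project = normalized_contribution(500, 100_000)  # 500 LOC into a 100K LOC project

print(f"small project: {small_project:.1%} of the code base")
print(f"large project: {large_project:.1%} of the code base")
print(f"ratio: {small_project / large_project:.0f}x")
```

The same 500 lines amount to a tenth of the smaller code base but only half a percent of the larger one, a twenty-fold difference that a raw LOC count hides.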

I haven't taken too much time to go over all of their experimental design, model & stats....

This study builds on an earlier study titled "The Scope of Open Source Licensing"[2] 2005, (pre-print), which is where the authors get their view of "restrictiveness" for licenses. This study found:
"Projects geared toward end-users tend to have restrictive licenses, while those oriented toward developers are less likely to do so. Projects that are designed to run on commercial operating systems and whose primary language is English are less likely to have restrictive licenses. Projects that are likely to be attractive to consumers—such as games—and software developed in a corporate setting are more likely to have restrictive licenses. Projects with unrestricted licenses attract more contributors."
This study used all 40k SourceForge projects available (2002).

[1] Fershtman, C. & N. Gandal. 2007. Open source software: Motivation and restrictive licensing. International Economics and Economic Policy. http://dx.doi.org/10.1007/s10368-007-0086-4

[2] Lerner J, Tirole J (2005) The scope of open source licensing. Journal of Law, Economics and Organization 21:20–56

Monday, August 20, 2007

Australia talks about Research data archiving

I see that the Australians appear to have the good fortune of having the discussion on research data archiving moving forward, as suggested by the upcoming meeting in September, "Long-Lived Collections: the Future of Australia's research data" at the National Library of Australia. This meeting is a follow-up to some very good efforts, including the Australian government's "Data for Science (DfS)" prepared for the Prime Minister’s Science, Engineering and Innovation Council, and the Australian Partnership for Sustainable Repositories' "Sustainability Issues for Australian Research Data: The report of the Australian eResearch Sustainability Survey Project".

I can only be envious of this activity, given the -- unfortunately -- almost complete vacuum of activity following the release of Canada's National Consultation on Access to Scientific Research Data (NCASRD).

The two reports - DfS and NCASRD - are very similar in scope and in recommendations, reflecting the similar but not identical situations in the two countries.

Related blog entries:

Update 2008 09 01:

Tuesday, August 14, 2007

Data Archiving of Publicly Funded Research in Canada

Carol Perry presented this revealing study at last year's Access & Privacy Workshop 2006 held in Toronto. Its objectives were:
  • "To assess the attitudes of academic researchers regarding the archiving of data resulting from publicly funded research
  • To assess impediments to the creation of a national data archive program in Canada"
She randomly polled 173 SSHRC grant recipients for 2004-2005 (with 75 respondents). Her results:
  • "41% indicated they had current plans to archive their research data
  • Of these, only 18.7% identified an established data archive as a deposit site for their data.
  • 72% were not aware of SSHRC’s mandatory data archiving policy for all grant recipients
  • 90% were not aware that Canada is a recent signatory to the OECD declaration on access to publicly funded data."

"In 2001:
  • 60% favoured a national data archive
  • 39% analyzed data created by others
In 2006:
  • 69% favoured a national data archive
  • 48% analyzed data created by others"

"86.7% would not alter their grant-seeking behaviour if SSHRC enforced its data archiving policy"

These results are both hopeful and frustrating: meta-analysis is up, support for a national data archive is up, researchers perceive little negative impact from SSHRC's data archiving policy, and almost half of the respondents indicated that they were planning to archive their data; on the other hand, about 3/4 of the respondents didn't even know about SSHRC's policy, and 9/10 didn't know about Canada's recent OECD commitment (which has led to the recent publication from the OECD: OECD Principles and Guidelines for Access to Research Data from Public Funding, which has a great deal of overlap with Canada's National Consultation on Access to Scientific Research Data (2005) and the earlier National Data Archive Consultation Building Infrastructure for Access to and Preservation of Research Data (2002)).

I believe the level of acceptance of researchers is high enough to move forward on an national data archive, and clearly there also needs to be a better education campaign by SSHRC and other Canadian research funding bodies both at the strategic level - read "policy and funding" - and at the tactical level - read "engaging, informing and educating researchers".

"Sharing the fruits of science"

University Affairs has an interesting article on Open Science that examines the patents and licensing regime and its impacts on science and the ability to do science. While at times advocating an Open Source-like model of Open Science, the author is a little too wishy-washy and supports hybrid models, which are too much of a slippery slope for me.

I also don't agree with a number of statements including:
But now an international scientific counterculture is emerging. Often referred to as "open science" this growing movement proposes that we err on the side of collaboration and sharing.
Counter-culture? I think that he has it backwards: despite the many biotechnologists and biotech companies and other science-based industries that use the patent system to support their business interests - usually encumbering further scientific discovery - the vast majority of scientists - at least those working in academia, and of course with exceptions - have long worked, and will continue to work, in an Open Science environment. This is not to take away from the Open Science movement and what it is trying to do. But it existed before someone decided to call it Open Science, and it is the default model / mode for most scientists in academia. The tail is wagging the dog a little here...

Thanks to Mary Zborowsky and Michel Sabourin for pointing out this article.

Related article in University Affairs: "The bottom line on open access" by John Lorinc, March 2006.

Monday, July 16, 2007

Data Curation Report

Liz Lyon, of UKOLN and DCC, has produced an excellent report titled "Dealing with Data". It is a very applied look at the issues around data curation and preservation, and examines "the roles, rights, responsibilities and relationships of institutions, data centres and other key stakeholders who work with data." While it is UK-oriented, most of its recommendations can be applied to other regions.

It includes 35 recommendations in eight categories:
  1. Co-ordination and Strategy
  2. Policy and Planning
  3. Practice
  4. Technical Integration and Interoperability
  5. Legal and Ethical Issues
  6. Sustainability
  7. Advocacy
  8. Training and Skills
Many of these resonate with the recommendations of the National Consultation on Access to Scientific Research Data (NCASRD) here in Canada that I and others helped organize in 2005.

Some recommendations of particular interest:
  • REC 2. Research funding organisations should jointly develop a co-ordinated Data Curation and Preservation Strategy to address critical data issues over the longer term.
  • REC 6. Each research funding organisation should openly publish, implement and enforce, a Data Management, Preservation and Sharing Policy.
  • REC 9. Each funded research project, should submit a structured Data Management Plan for peer-review as an integral part of the application for funding.
  • REC 10. Each higher education institution should implement an institutional Data Management, Preservation and Sharing Policy, which recommends data deposit in an appropriate open access data repository and/or data centre where these exist.
  • REC 19. All relevant stakeholders should commission a study to evaluate the re-purposing of data-sets, to identify the significant properties which facilitate re-use, and to develop and promote good practice guidelines and effective quality assurance mechanisms.
  • REC 20. JISC should initiate a survey to gather user requirements from practising researchers to inform the development of value-added tools and services to interpret, transform and re-use data held in archives and repositories.
  • REC 26. JISC should fund repository technical development projects which build on OAI-ORE work and create robust, bi-directional interdisciplinary links between data objects and derived resources.
  • REC 27. JISC should fund technical development projects seeking to enhance data discovery services, which operate across the entire data and information environment.
  • REC 30. The JISC should work in partnership with the research funding bodies and jointly commission a cost-benefit study of data curation and preservation infrastructure.

Related links:

Wednesday, July 11, 2007

Microsoft Open XML efforts good? - British Library. Update

This is an update to my earlier post, Microsoft Open XML efforts good? - British Library.

For more problems with the OOXML "open" standard, see Slashdot's Microsoft's OOXML Formulas Could Be Dangerous and the original article by Rob Weir, The Formula for Failure.

And perhaps of more significance, some real questions from FSF Europe to national standards bodies, perhaps lessons learned (or those which should be learned) from the OOXML standardization fiasco: Six Questions to national standardization bodies:
  1. Application Independence?
  2. Supporting pre-existing Open Standards?
  3. Backward compatibility for all vendors?
  4. Proprietary extensions?
  5. Dual Standards?
  6. Legally safe?

Monday, July 09, 2007

Microsoft Open XML efforts good? - British Library

It seems that - in a BBC article ("Warning of Data Ticking Time Bomb", discovered at the ACM TechNews for this week) - Adam Farquhar, head of e-architecture at the British Library, has made a rather disappointing comment on Microsoft and the Open XML format:
Microsoft has taken tremendous strides forward in addressing this problem. There has been a sea change in attitude.

Sigh. This is very sad. The original press release from the U.K. National Archives is here: The National Archives and Microsoft join forces to preserve the UK's digital heritage.

The Wikipedia article on Open XML shows why this is such a disappointing comment.

Thursday, June 28, 2007

Some catching up....
I am rather behind on some posts (for example, I attended JCDL2007 in Vancouver last week - sans wireless - and need to post on some goings-on there...), and I would like to point out some excellent work presented by a colleague of mine at CISTI: Richard Akerman's presentations at ICSTI 2007 in Nancy, titled "Web tools for web reviewers...and Everyone", and at IATUL, titled "Library service-oriented architecture to enhance access to science".

[Thanks to Richard for correcting my earlier confusions....]

Wednesday, June 27, 2007

W3C Releases WSDL 2.0 Recommendation
Today the W3C released version 2.0 of WSDL, which supports both REST-style HTTP and SOAP. The release also includes a converter from WSDL 1.1 to WSDL 2.0.

Tuesday, June 19, 2007

Nature Precedings

Nature has announced what is basically a repository, Nature Precedings - similar to arXiv.org for physics - for researchers in biology, medicine, chemistry and the Earth sciences to share early findings: "pre-publication research, unpublished manuscripts, presentations, posters, white papers, technical papers, supplementary findings, and other scientific documents". There is no peer review, but staff curators filter out materials that are not legitimate scientific contributions. There are also 13 subject RSS feeds. Of particular interest is that every item is given a DOI or Handle, making it more easily citable. More discussion at O'Reilly Radar and Connotea.

Tuesday, June 12, 2007

Stewardship of digital research data: a framework of principles and guidelines

Sub-title: "Responsibilities of research institutions and funders, data managers, learned societies and publishers"

This draft report from the Research Information Network (RIN), UK, for consultation is a must-read for those wrestling with policies and guidelines concerned with the long-term management, access to, and archiving of digital data generated by the activities of researchers. It outlines a comprehensive policy framework, based around five principles:
  1. Roles and responsibilities
  2. Standards and quality assurance
  3. Access, usage and credit
  4. Benefits and cost effectiveness
  5. Preservation and sustainability
This draft report is a follow-up to the excellent January 2007 report, Research Funders’ Policies for the management of information outputs, and the June 2005 RCUK position on the issue of improved access to research outputs, the latter focusing solely on research outputs as publications. Of particular interest are the responses of the various research funding councils in the U.K.:

Update 2008 01 15:

Ontario Data Documentation, Extraction Service and Infrastructure Initiative (ODESI) - Launched

In what likely will become a busy trend, the Ontario Council of University Libraries (OCUL) has announced a project for the creation of a data service providing researchers access to "a significant number of datasets". ODESI will be part of OCUL's already popular Scholar's Portal.

The press release is unclear as to whether this will only house standard data sets (like those from Statistics Canada, etc.) or whether the service will also allow researchers to deposit their own data. I would argue that a data deposit archive service is much more important at this time, as described and argued in the National Consultation on Access to Scientific Research Data (NCASRD), of which I was a participant.

I also was not able to find any mention of this on the OCUL or Scholar's Portal sites.

Tuesday, June 05, 2007

Tag Cloud inspired HTML Select lists

I have been working with Tag clouds and other Web 2.0 sorts of things quite a bit lately and couldn't help noticing that it might be useful to use the Tag cloud "Size reflects frequency/importance" idiom in HTML select lists, so I did a little bit of experimenting (BTW, I did look for these on the Web but didn't find them: that doesn't mean they are not already out there...).

So I played with the styles of these elements, and was able to get something that looks like this:

I am not sure how the above HTML renders in your browser, but here is how it renders in mine (Firefox on Linux, SUSE 10.2):

It is interesting how the browser allocates space: it seems to use the largest (tallest) item in the list to set the height of the widget, which makes sense. But while this version of Firefox appropriately sizes the pull-down contents (i.e. above, left), when a term is selected, it is shown at the default text font size (above right), even if its font size as defined and as displayed in the pull-down is larger. This appears to be a bug. But it is easily possible that there is some CSS that I should be using to look after this but do not know about. I have not tested this behaviour in other browsers, but I have for other versions of Firefox (1.06,

Notwithstanding this behaviour, on experimenting with these select variations, I think that they work well and are useful in the appropriate situations.
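
For readers who want to try the idea, here is a minimal sketch in Python that generates such a select list. It is not the markup from my experiment: the function name tag_cloud_select, the min_pct and max_pct parameters, the linear scaling and the sample counts are all assumptions made for illustration.

```python
from html import escape


def tag_cloud_select(name, counts, min_pct=80, max_pct=200):
    """Build an HTML <select> whose options are sized by frequency.

    Hypothetical helper: `counts` maps each term to its frequency, and the
    font size of each option is scaled linearly between min_pct and max_pct.
    """
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # avoid dividing by zero when all counts are equal
    lines = [f'<select name="{escape(name, quote=True)}">']
    for term in sorted(counts):
        pct = round(min_pct + (counts[term] - lo) * (max_pct - min_pct) / span)
        lines.append(f'  <option style="font-size: {pct}%">{escape(term)}</option>')
    lines.append("</select>")
    return "\n".join(lines)


# Made-up sample frequencies, for illustration only
html = tag_cloud_select("topic", {"open access": 12, "data": 30, "sparql": 3})
print(html)
```

Whether a browser honours font sizes on individual option elements varies, as the behaviour described above suggests, so the output should be checked in each target browser.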

Monday, May 28, 2007

Geist: Open Data and Open Access
In an article in the Toronto Star ("Science and Tech Strategy a Missed Opportunity"; archived version), Michael Geist strongly advocates that the new Canadian government's science and technology strategy go further and mandate Open Access for articles derived from publicly supported research (that is, research supported by the Federal research funding agencies: NSERC, SSHRC, CIHR, etc.), as well as the opening of publicly supported research data ("raw scientific data"). This would better support re-use by both industry and researchers, without the complicated and onerous licensing regimes that currently encumber these data.

Thursday, May 24, 2007

NSF Community-based Data Interoperability Networks (INTEROP) Proposal Solicitation

A solicitation for proposals has been issued by the US National Science Foundation (NSF) Office of Cyberinfrastructure with the goals of funding projects supporting the re-use and re-purposing of data, data discovery, interoperability and "consensus-building activities and for providing the expertise necessary to turn the consensus into technical standards with associated implementation tools and resources."

Scientific data is very expensive to acquire, and much of it cannot be reproduced, due to its temporal nature. Vast resources of data acquired through publicly-funded research languish due to the lack of archiving of these data sets. Much of this exists on the hard-drives and (yes) floppy disks of researchers, much of which is thrown away when the researcher retires.

Owing both to the loss of data sets and to the lack of standard metadata (some disciplines are better off than others) and of tools for the discovery and use (interoperability) of existing data sets, re-use and re-purposing of data is -- at present -- very limited, and the kind of unforeseen and creative use of data analogous to Web 2.0 mashups is not possible. The metastudies and metasyntheses that are often more revealing and more powerful than the original works need to be made possible.

Thanks to Cliff Lynch (CNI) for pointing out this solicitation.

Additional references:

Thursday, May 10, 2007

I am at WWW2007 only for today (Thursday) after attending the W3C meeting from Sunday to Tuesday. I must say that I regret not registering for the rest of the WWW2007 conference, as it has moved to what I believe to be a more relevant, robust venue for web research and activities.

This morning I attended the panel session "Building a Semantic Web in Which Our Data Can Participate", moderated by Paul Miller of Talis, with panelists
  • Steve Coast (OpenStreetMap)
  • Peter Murray-Rust (University of Cambridge)
  • Rob Styles (Talis)
  • Jamie Taylor (Metaweb)
It was a very good panel, although the discussion revolved more around getting access to data as opposed to the Semantic Web aspect.

Tuesday, May 08, 2007

"Everyone uses Linux, because everyone uses Google" - Tim O'Reilly
Tim O'Reilly points out in his presentation at the W3C AC meeting at Banff, Alberta, that since Google is the largest deployed Linux app (the backend Google farm is Linux boxen), and since everyone uses Google, therefore everyone uses Linux.

"Embracing Web 3.0"

In IEEE Internet Computing, Ora Lassila and James (Jim) Hendler discuss Web 3.0 in their article Embracing Web 3.0, describing it as a union of (some) parts of Web 2.0 and the Semantic Web.

Monday, May 07, 2007

W3C AC Meeting
I've made my way to Banff, Alberta for the May 2007 W3C advisory committee (AC) meeting (I am the NRC's W3C AC rep). I will be reporting on various aspects of the meeting.