[Clifford]: Welcome to our closing session
for the Fall 2018 Member Meeting. I hope that you've had a good meeting here. Certainly the sessions I've been able to look
in on seem to have gone really well. I have to say, I'm seeing an unusual amount
of energy in the discussion at many of the sessions, and that's really delightful to
see. I also feel like our use of Sched to help
relocate some sessions into larger rooms is definitely having a good effect. As far as I know, there was only one session
in the last round that was very crowded, so we will continue to strive to do better with
that, but I feel like definitely that's on the right track, so that's proving to be a
very useful tool. It really just falls to me to do a couple
of quick things. The first thing I want to do is ask you to
join me in thanking our wonderful staff at CNI who have made this conference run very,
very smoothly, and also the two volunteer helpers that we had from Georgetown who have
worked with the staff and have made it possible for us to capture a considerably larger number
of breakout sessions than we would have been able to do otherwise, so please join me in
a round of applause for all of them.
The other folks I'd like to thank are all
of the presenters who contributed their time, their effort, their work, their good thinking
to all of the very, very rich set of breakout sessions that we've enjoyed. Those are absolutely the backbone of the meeting
and I am so grateful to everyone who took the time and the energy and the effort to
give us that superb set of breakout sessions. Please join me in a round of applause for
our presenters. And now let me get on to why you're really
here. Our closing plenary today is going to be given
– and I'm just so pleased that we were able to work this out – is going to be given by
the National Librarian of Medicine, Patricia Flatley Brennan. You have Dr. Brennan's official biography
in your conference materials and it's on the website, but I want to say a couple of other
contextual sorts of things.
First, it is really hard for me to say enough
good things about the National Library of Medicine and the amazing contributions that
that organization has made over the last 20 or 30 years to the advancement of science
and healthcare leadership, in the way we think about science as it's become increasingly
driven by data and information technology. It's a fabulous organization that has had
wonderful people and now is thriving under a new leader who is just carrying it on into
the future in a wonderful way. Let me say a few words about Patty herself.
I actually, I believe – and I have not tracked
this down and I haven't even told it to Patty – first became aware of her work probably
a good decade ago, when I saw some of it presented at a conference showing off high-performance
computing and communications applications. Yeah, interesting place for someone who comes
to the profession from the side of nursing. She's done amazing work with AR and VR in
healthcare, and was quite early to that work, I believe. At the same time, her pathway in through nursing
has really, in my conversations with her, resulted in her having a very broad view of
the challenges and opportunities of data and information in the healthcare context. Sometimes bioinformatics, in my experience,
can become extremely physician-driven and really, as all of us know, it's a much bigger
world out there than just physicians.
There's an army of professions, and the patients
themselves, who are all part of this very, very complex healthcare system that we're
trying to evolve into the digital world. So I think you will find her perspectives
to be both very nuanced but also very broad, and I am just thrilled to have her here to
tell us about all the wonderful things that NLM is doing and all of the very fabulous
thinking that they're doing about their role within NIH and within the national and global
healthcare and biomedical sciences world.
Patty, welcome. [Patricia]: Good afternoon, thank you very
much for coming indoors on one of the most beautiful days in Washington and for actually
staying here for this time. I'm going to make it worth your while, I promise. There has never been a greater need for trusted,
secure, accessible, valued information in the world than we face right now at this very
time. Libraries, data scientists, informaticists,
networked information specialists: we are essential to the future. I'm very proud to be here as the director
of the National Library of Medicine, I'm proud to see some of my NLM colleagues here, our
Regional Medical Library National Network of Libraries of Medicine colleagues, and good
friends.
But I'm actually here because I owe a debt,
and I owe a debt to Cliff and a few of the others of you who are in the room, because in my very first year at the National Library of Medicine, with the transition guidance
from Betsy Humphreys, we established a strategic plan. And that strategic plan is now guiding our
investments into the future. I'm here today to talk with you about the
National Library of Medicine, its strategic plan, where we fit in with the NIH, how we
are preserving trust in data, and what options are coming forward that might be of interest
in your institutions.
And I promise not to keep you here too much
after 6 PM, no worries. I want you to look at the line in the last
box: how the National Library of Medicine directs trust in data, creates trust in data. We are. What does the library do? Lots of places have shelves. Lots of places have server racks. There are very smart people who know how to
enumerate and count and curate things. So what does a library do? We fundamentally create trust, and the substrate
of discovery right now is data. We are, at the National Library of Medicine,
committed to a data-driven discovery, a process of partnership with the NIH to make sure that
data-powered health becomes a reality in our society. As I said today, I'm going to talk to you
about our strategic plan and also about the National Institutes of Health Strategic Plan
for Data Science.
Data don't organize themselves, you've probably
figured out already. You've got to do something, you've got to
put some stakes in the ground, and that's what we're all about and trying to establish. We are preparing and actually doing some things
for data-driven discovery – I'm going to be sharing them with you today, very pleased
to see what's happening – and we'll be talking about training. But I suspect there might be one or two of
you in the room who don't really know much about the National Library of Medicine. So let me first introduce you to our library
by showing you a brief video. [Instrumental music] As an institute of the National Institutes
of Health, we are fundamentally a research engine.
The National Library of Medicine supports
direct research in the areas that you saw here, but we are best known for our products
and services that we deliver to the world millions of times a day – PubMed, MEDLINE,
MedlinePlus for laypeople, our early, out-of-the-box innovations such as the Visible Human Project,
or our very essential WISER, which provides assistance in the moment of crisis. The National Library of Medicine is committed
to delivering high quality, trusted, trustable information, to serving as a repository of
that information, and to applying research techniques to make that information discoverable
and useful. We know today that a million citations a year
are added into PubMed. Not everybody's reading all those citations,
and we must find ways as a library to make them accessible, to use knowledge extraction
techniques, modern machine learning techniques, to make the information useful to society,
and also to preserve its relevance to our society and our biomedical community. Now, the National Library of Medicine will
continue to provide its fundamental core services – dbGaP, our database of Genotypes and Phenotypes,
or clinicaltrials.gov, which is our repository of both the declaration of clinical trials
and their results.
But over the next decade we're committed to
advancing in three key areas, and that's what I want to talk to you about right now. First, our first pillar is to accelerate discovery
and advance health through data-driven research; second, reach more people in more ways through
enhanced engagement and dissemination; and third, to build a workforce for data-driven
research and for health. Let's take these apart a little bit. To accelerate discovery and advance health
through data-driven research requires that the library stay clear and true to its core
mission, and yet we envision a future where we'll be fostering the ecosphere of discovery
by connecting digital research objects.
What a library does is build the arcs and
structure the ovals in this diagram that you see on the screen. We are envisioning a time where there is a
seamless pathway from the literature to the data underlying that literature, perhaps to
the people who conducted the investigation or the pathways and protocols used to carry
out the investigations. We are envisioning interconnections that allow
an investigator to extract knowledge in an efficient fashion. What we know the library does and has always done is make connections, name things, and make them visible to others. We are committed to continuing to do this. We are committed to doing this under a new framing
of knowledge, data science, and open science, and this is a game-changer for the way we
think about what is trustable information.
The National Library of Medicine, like many libraries around the country, has had a successful long-term partnership with publishers who
provide the vetting of the knowledge, who provide the structuring of the knowledge,
and, in our case, send us the XML files to create the bibliographies. Partnership has been important. When we move to data-driven discovery, we've
removed the intermediary. We're dealing with raw substances. This requires we rethink what the library
does.
And we're doing this in an era where open
science is a philosophy that is rapidly being embraced around the world. Open science, open access to data, open participation
in research, open sharing of information. While we've had two decades where the economic
value of discovery has been heralded as the reason why we do research, now we're realizing
that opening the data, sharing the discoveries, actually accelerates the economy in a much
quicker fashion.
An open science model, though, changes the
game a lot in terms of how we think about, what is the protection of intellectual property
rights? What is the protection of an individual participant’s
data? And fundamentally, what are the rights of
our investigators to exploration without intrusive supervision, which we are able to do now with
all of our digital pathways? So we move into this era of open science,
recognizing there's new players, there's new rules, and there's new kinds of data to deal
with. But fundamentally, we know, as a library,
fewer and fewer people come up to Bethesda to see us. There's a great big fence around NIH right
now, it's a lot harder to get on campus than it used to be, but we remain accessible and
available to people, and we are fortunate because we know that the people who know us
like us. But what we have to remember is, we need to
reach out, we need to get to more people, we need to get to those people who don't know
us.
We are committed to enhancing our engagement,
reaching more people through better understanding, through knowing in the moment what the person
is after, through going beyond pattern matching of terms and concepts so that the question
and the answer are responded to at a level that is factual. We now need to move into a level that becomes
operational. We're committed to enhancing information delivery. We are expanding our investment in standards,
particularly health data standards, but also the structuring of data around formalized
terminologies, including biological data, as well as image data.
We are experimenting with PubMed Labs. Now, those of you who've grown to love PubMed
might be getting a little nervous when I say we're going to change it, but we're going
to change it for the better. We've got some very exciting things coming
your way. If you haven't seen our new experimental site,
when I'm done speaking, please Google “PubMed Labs” and you'll get to see all of the new
innovations. One of the most exciting innovations we're
using is taking machine learning strategies to present your list of citations, not in
the reverse chronological order with which you're familiar – the most recent citation first
– but in something we call “relevance-based ranking." That is, understanding how what you're after,
and what relates to that in terms of the behavioral patterns of others – actually a thousand different
factors – might actually give you information more quickly, in a more timely fashion, and
with greater relevance.
We know this is important for many reasons,
but the fundamental reason is, 80% of the people who launch a PubMed search never go
to the second page. So if your best article is on the second page,
or the piece they need most for that next study is on the second page, they won't find
it. So we're working to make information more
accessible in the moment. We also recognize that many of our resources
are used from a machine-to-machine process – that is, there isn't a human in the loop
at the point of searching or receiving results, so we need to do several things. We need to make sure our pathways are trustable. We need to make sure that the National Library
of Medicine brand is well-known and well-understood, because the trust that that can imbue in the
user of our information should extend their ability to do their work. We recognize that less than 50% of our users
nowadays are human eyeballs, and we have to figure out how to convey the same level of
resources to them.
We're looking to reach new users in new ways. We're experimenting with augmented reality
and virtual reality strategy, so if you look at the second box from the left, you see what
looks like a nutrition label hovering over an orange juice container. Imagine delivering information in the moment
that someone needs it. Now, we can't do this today – that Google
Glass thing failed – but there will be things in the future where we can do this, and the
library is getting ready for that. The library is getting ready to see, how do
we translate information that was once in a permanent structure on a piece of paper
to information that's floating in the air? We're looking to use augmented reality, experiential
information presentation, so that the information that is needed at the moment of need, not
at the moment of want, can be held by individuals. We also recognize that, no matter what happens
with technology, we are fundamentally a human enterprise, and we need to use our skill set
to be reaching people wherever.
We are very proud of the fact that the National
Library of Medicine has cultivated the National Network of Libraries of Medicine, organized
across the United States. We now have 7,200 points of presence around
the country – health science libraries, hospital libraries, public libraries where trusted
health information is accessible. And this platform is now powering engagement,
so that the All of Us program, the major initiative by the NIH to engage a million people in a
massive scientific exploration, now has a person on the ground in the neighborhood that
can answer health questions, that can be there and be present. So the National Library of Medicine is looking
towards a period of engagement, mediated by technology, fostered by individuals. We also recognize the importance of building
a workforce for data-driven research in health. We recognize there's different types of workforces
that are going to be needed. Certainly we need data scientists, and we
partner with our colleagues across the NIH to determine how to best prepare researchers
who have the data science skill within a health science framing.
We also need partners on our university
campuses and research institutes, where your roles become critically important. So the alignment of data science investigation
– whether it's in physics, biochemistry, or chemical engineering – can all share the important
shareable parts of the analytics, the data management strategies, and the advanced visualizations. And the uniqueness of health information,
biochemistry information, or mechanical engineering information can actually be built in those areas.
for biomedical informatics and data science, because it's essential to remember that what
biomedical informatics brings to the conversation – formalization tools, structuring of data,
making it possible to understand information relevant to the culture and context of health
is critical – but we also recognize that that must be done in a way that fosters a diverse
workforce, that brings individuals into the academic workforce and the research workforce
in new ways.
So we are committed to training across society. We are committed to using hackathons, which
can engage young people cleverly in using different kinds of computational techniques
to get excited and be willing to and interested in becoming part of the workforce. We recognize that we have to support laypeople
in understanding the value of data science and data-driven health for discoveries, so
we've established a new extramural research program called “personal health libraries”
that brings the power of data science into the hands of your neighbor.
We recognize fundamentally that what the National
Library of Medicine must do is foster our distinctiveness as a reliable, trusted source
of health information and biomedical data, including the analytics that are done and
the visualization tools that are used to make this possible. Now, we don't do this alone; we are one of
27 institutes and centers at the NIH. We are very delighted to right now be enjoying
a $37 billion annual budget, and you have a right to know what we're doing with that
money and how we're fostering data-driven discovery. If you've not visited the NIH campus before,
though, let me call your attention to the diagram on the back. That's our physical layout, and then on the left-hand
side, midway up, you see a building with a diamond-type roof on it, and then a tall building
next to it. This is the National Library of Medicine space. We are in a wonderful spot on the campus because
we're the gateway to the campus, and yet we recognize we share the responsibility for
data management, we are not the data dump of NIH.
We have to accelerate effective use of data
across NIH, and with 27 institutes and centers, I can tell you, being one of 10 children at
home, getting 27 institutes and centers to look in the same direction, much less walk
in the same direction, is a significant challenge. However, we've made progress this year, in
part stimulated by Congress. We built the National Institutes of Health
Strategic Plan for Data Science. We received a recognition from Congress that
data science is important, we've received additional funding to foster data science. We now have a plan of how this will be rolled
out, and we're in the midst of an implementation process. Let me walk you through it briefly, though. The NIH Strategic Plan for Data Science should
be aligning with the plans that are going on in your institutions, and we should be
developing synergies with them. We are focused first on building a good data
infrastructure; second on modernizing the data ecosystem; third on improving data management, analytics, and tools, because the t-tests that work in an experiment of 40 people are not going to work on a billion dots, I can tell you that right now.
We also need to focus on workforce development,
and remain committed to stewardship and sustainability of our data. Underneath each of these columns you see the
key activities; please note, in data infrastructure, the NIH recognizes that optimizing storage and security is our primary responsibility. We are dealing with very precious, patient-level
data and we must make sure it's secure. We're also, though, proposing to interconnect
the NIH data systems, because, frankly, 27 institutes and centers have led us to about
300 different data repositories, not all of which talk to each other.
In order to modernize our data ecosystem,
we've taken steps towards both creating better repositories, finding strategies to safely
share individual-level data, and improving the integration of observational data with
information that comes out of traditional research activities. In terms of data management, we're committed
to generalizable workflows, generalizable visualization tools, increased ability to
catalog and know where our resources are. As for workforce development, as I explained earlier, NLM is spearheading this, but across the NIH there is a commitment to expanding, from the standpoint of researchers as well as clinicians, the understanding of data science. And finally, in terms of stewardship, we are
committed to the FAIR principles: all data should be findable, accessible, interoperable,
and reusable. Wonderful aspirations, a little hard to deliver
in public. We've already started down the plan for implementing
these, and we're recognizing some cross-cutting things that are requiring the NIH to actually
come together and sing from the same prayer book. We first know that there must be common infrastructure
and architecture, upon which more specialized services can be built, but we need a basic,
underlying infrastructure that supports the entire NIH operation.
We will not do this alone. Although $37 billion sounds like a lot of
money, we must leverage commercial tools, commercial resources, new kinds of expertise
from other fields, because that $37 billion should be driven towards cure, should be driven
towards health, should be driven towards people, and not necessarily creating an infrastructure
that could be best supported and used through the Department of Energy, the National Science
Foundation, or other government bodies.
We have recently launched an initiative with
private-sector cloud computing resources, particularly with Amazon Web Services and
with Google, to provide low-cost, permanent, accessible cloud storage to our major research
projects. We're committed to enhancing the training,
and what we recognize is, enhancing training can't simply mean adding courses to a program,
but it is fundamentally rethinking, what is the core knowledge that people have to come
to doctoral training with in the biomedical sciences? And what is the fundamental knowledge we should
be helping mid-career individuals access and acquire? We're committed to the securing and structuring of our data resources as FAIR, and also to ensuring information security.
We have moved to a model of identity and access
management that we believe will extend beyond NIH-identified investigators, so we'll be
able to move towards sharing the NIH data resources more broadly, making use of identity
and access management resources that are vetted by institutions such as yours, as you give individuals permission rights in your institution, but that will also allow citizen
scientists to actually engage with and use our data resources. A significant challenge the library is taking
on is curation at scale. We recognize the most expensive aspect of
any kind of storage that goes on nowadays is the curation part, the human engagement
and understanding, how do we label? How do we make this findable and identifiable? And we've launched several research programs
to develop new tools for computational-based curation, as well as to engender best practices.
And that requires that we work closely with
community input to promote and refine data standards. The word "community" takes on new meaning
when we move into interdisciplinary models where communities intersect at the edges,
and so as we build new standards, build new terminologies, we are committed to working
with some partners we may not have worked with before – physicists, chemists, humanists
– to understand how to extract the essentials of curation that are valid across all disciplines. And we are fundamentally committed to coordination
with funding agencies so that we can avoid unnecessary duplication. It makes no sense to repeatedly stand up and
structure data repositories, and I state that, frankly, more as a question that I want your
guidance on than as a conclusion that we have reached completely. We recognize that institutional data repositories
are becoming a big focus of concern in many of our universities; we need to hear how to
best work with you. The vision that we have at the NIH is, there
should be a sustainable infrastructure that rests on a federated Data Commons model. There will not be a single point of all data
in the world collected within the NIH fence, but rather think about how we federate these
with interoperable tools that you see written down the center: common user authentication
strategies, shareable APIs for data access and computing, automatic implementation of
the FAIR principles, dockerizing or containerizing data and analytics, reusable workflow management,
digital object identifiers – the world is finally realizing how critical it is to identify
digital objects with unique IDs – and, finally, data standards and sharing.
What you see floating around here are some
of the major NIH investments. What's outside of the cloud, that we know
needs to be connected for our scientists, is environmental analysis, transportation
information, agriculture trends, data that needs to be interconnected. The NIH cannot do this alone, and we need
to do this in working with our partners. In addition, though, we recognize that the
NIH must make sure the data are available from the moment they are exported from a research
project. When I started my academic career, the answer
to a research grant was the hypothesis resolution, you closed it up, you wrote a paper, you went
away. Now the research data is becoming a very important – perhaps even the most important – part of the research process, so how we
foster an era of shareability during this time of transition where we do not have all
the answers and all the pathways, is requiring the NIH to make, first, some very early fledgling
steps.
Let me show you where we are with our data
sharing policies at this point in time. We encourage our investigators, we encourage. We do not require data sharing yet, but we're
encouraging all NIH-funded investigators to share their data through open access data
repositories. For datasets that are small, it's possible
to attach the datasets to PubMed Central articles. That is, as an article is published, the supplementary
materials can include data. For datasets that are slightly larger, up
to 20 gigabytes, we recognize that commercial resources such as Dryad and figshare are
very good partners, because they assign unique IDs, they store and manage datasets, they
allow for very interesting investigator-driven management of those datasets. Now, though, when we get to very, very large
datasets, terabyte- and petabyte-size datasets, we need different approaches to storing them,
and that's why the NIH is investing in these partnerships with commercial cloud providers.
We are going to be continuing to evolve in
this area, we need to hear from the communities, including the information networking communities,
about where the investments should be, what you rely on NIH to provide, what you would
rather see provided within your own institution. The National Library of Medicine, though,
is key to making all of this happen; you've seen things come all the way through here. So let me show you a little bit about where
the National Library of Medicine is going to make this happen. [Instrumental music] The National Library of Medicine is committed
to creating the 21st Century Collection. The 21st Century Collection has the characteristics
that you see written across the bottom line here. We are serving as custodians for some of the
content, we are serving as connectors for some of the content, and we are building discovery
tools for the remainder of the content that might be needed.
In order to make the 21st Century Collection
possible, we need to find new ways for attribution. Who is the author of a dataset? How do we hold accountabilities here? How do we devise automatic indexing strategies
so we can make data accessible as quickly as possible, and how do we create personalized
presentation and delivery, whether you're a group of kids sitting around a high school
gymnasium looking at a health education video or you're a scientist in the moment needing
a specialized piece of information? We need to know you better so we can deliver
to you better.
We recognize that the 21st Century Collection
of Biomedical Knowledge has to be a permanent and discoverable archive of text and data,
a wide range of data – image data, sound, genomic data. We are listening to trends in science and
scientific communication to understand how to best partner with the way the scientific
community is interacting. We recognize open science principles are important,
preprints and other interim products of research – which are now acceptable as part of an NIH
dossier – must be supported, and we also recognize a growing but not completely adopted approach
to data sharing, which we want to be fostering. So we are planning, as the library, to prepare
and lead new directions through collaboration. Now, when we create a collection at the National
Library of Medicine, we're first driven by our Board of Regents' policy. The Board of Regents provides the federal
and public oversight of how we make a collection. Here's what our collection looks like. If you think of the National Library of Medicine's
collection as the large oval here, on the right-hand side you see PubMed Central, that's
our full-text repository, over five million articles, about half of which are fully open
access and machine-interpretable.
On the left-hand side you see a circle reflecting
PubMed. That is our bibliographic database, 27 million
citations in there now, and within that bibliographic database, slightly over 90% is what we consider our Specialized Collection, our MEDLINE collection, our highly indexed, highly accessible collection. So first we think in terms of, how do we create
a literature repository, either by holding or connecting to important literature? We are driven in many ways by the NIH Public
Access Policy. The Public Access Policy is almost ten years
old, and it specifically states that after twelve months, full text of any article reporting
NIH-funded research must be available. And we created the PubMed Central archive
to be able to make that possible. Recently, as I've mentioned, preprints have
become an important part of communication, and yet the NIH has taken what appears to
be paradoxical stands on this. On one hand, we do accept preprints as an
interim product of research that you can report in grant applications and that satisfies grant responsibilities, but we are not housing preprints at the National Library
of Medicine and we don't anticipate doing that at this point in time.
We're working with publicly accessible archives,
such as bioRxiv, to make preprints available. We are increasingly focused on how to share
data. Now, NIH has had a data sharing policy for
over a decade, largely driven by the genome process and the work on the human genome,
which came with a commitment that said, "We want to make sure data are accessible and
not stored or held for private gain alone." Any researcher requesting NIH funding must
provide a data sharing plan. At present this is not a scored part of a
grant, which means it doesn't contribute to the evaluation of a grant, but in the future
they will become more central to the evaluation of a project.
And it's possible for an NIH-funded researcher
to use NIH funds to actually support data sharing, and we are encouraging experimentation
among our researchers to do this. We're also providing them services, and that's
where our data discovery emphasis and PubMed Central and PubMed become very important. If you think about the article as the nexus
for discovery, surrounding the article are some of the things you saw earlier in our
ecosphere of data-driven discovery – patterns, profiles, research grants, preprints – and
we recognize that the structure in this present day, 2018, anchoring around an article is
still the most common way people will think about organizing information. So we're looking at ways to link data to articles,
and have been successful in doing this.
Within our strategic plan, we say we're going
to stimulate new forms of scientific communication to make linkages for data, and we do this
in part because we've had a 20-year history with NCBI of doing it well with genomic
data, but also because we recognize that, by making data shareable and accessible, it
enhances the rigor and reproducibility of research projects, better earning the public's
trust. Now, if you look at publications in PubMed
Central – that's our full-text database – authors say many funny things: "the data
are available on the author's website" or, "all the data can be downloaded from this"
and a URL is provided. In these different sections, you find a wide
range, from references and pointers to specific NCBI accession numbers for genomic data, to
an individual investigator's laboratory where you can contact to get information.
We recognize this is not enough, so as of
October of '17, we've started to allow the connection of supplementary data into records
and articles that are published in PubMed Central. Supplementary files may include computer code,
implementable algorithms and computational models, but can also include the actual raw
data. Investigators are expected to, and responsible
for, ensuring that the data are shared under the appropriate human subjects considerations.
We allow for data citations that should facilitate
access to the data and any associated metadata, code, or related materials, so the
data can actually be reused, can actually be employed in secondary analyses. And you see in the upper right-hand corner
what the data citation looks like in a PubMed Central record, how accessible they are, and
then below that on the green you see the XML code. This is often provided by the publishers. I would say in most cases, the data are coming
in from publisher sources. Our current snapshot
shows we're making progress: there are data availability statements in about 136,000 of our articles,
and almost 20% now have either some kind of supplementary material, including data, or
a data availability statement. And this allows us now to get some good experience. How big are the files? How complicated are they to work with? And frankly they're working really quite well. We find, each time we make data slightly more
available, we see a rise of 20 to 30% in downloads per day. According to the most recent statistics I received this
morning, there were almost 40,000 downloads of data from articles with data accessibility statements in
them.
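Those publisher-supplied data citations come in as structured XML of the kind shown on the slide. A minimal sketch of pulling one apart might look like the following; the element names follow the public JATS tag set used by PubMed Central, but the sample record itself is invented for illustration:

```python
# Sketch: extracting a data citation from JATS-style XML of the kind
# PubMed Central receives from publishers. The sample record below is
# invented; element names follow the JATS tag set.
import xml.etree.ElementTree as ET

SAMPLE = """
<ref id="bib42">
  <element-citation publication-type="data">
    <data-title>Example expression dataset</data-title>
    <source>Dryad</source>
    <year>2017</year>
    <pub-id pub-id-type="doi">10.5061/dryad.example</pub-id>
  </element-citation>
</ref>
"""

def extract_data_citation(xml_text):
    """Return (title, repository, doi) for a publication-type='data' citation."""
    ref = ET.fromstring(xml_text)
    cite = ref.find(".//element-citation[@publication-type='data']")
    if cite is None:
        return None
    title = cite.findtext("data-title")
    repo = cite.findtext("source")
    doi = cite.findtext("pub-id[@pub-id-type='doi']")
    return title, repo, doi

print(extract_data_citation(SAMPLE))
# → ('Example expression dataset', 'Dryad', '10.5061/dryad.example')
```

Because the citation is structured rather than free text like "data available on the author's website," both humans and machines can resolve it to the dataset.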
So we know that the data are being used, we're
exploring different ways to understand how they're being used by individuals. So far I've been describing to you data that's
associated with a full-text article, but it's possible to have data associated with our
bibliographic record within PubMed, and these we refer to as our Secondary Source IDs. These Secondary Source IDs can be provided
by the individual author or by the journal itself, and they appear at the lower end of the record. On the right-hand side of the screen you see
a typical PubMed citation record with the abstract displayed, and the related secondary
data are available right there at that point.
Now, this still requires a great deal of human
engagement. We're working to make sure machine engagement
can happen with it. We do recognize, though, that public data
services like Dryad or figshare are becoming important, so we use our resource called LinkOut
that allows you to link out from articles, link datasets that are deposited into Dryad
back to a PubMed record, and this record can be maintained and updated – that is,
when an individual indicates a new dataset has been added, the PubMed record is updated.
Not by itself, though; humans still have to
work on that. We are seeing on PubMed Labs a whole range
of new possibilities for data citations. PubMed Labs allows us to experiment with different
kinds of data accessibility tools, so as we prepare for the future, we try to develop
new ways to expose data. We are also looking at bringing better scientific
data into PubMed Central, so that a range of associated datasets can be accessed by
IDs or by URLs. An important feature that we're working on
right now is the linkage of datasets to an article through an individual's controlled
My Bibliography.
My Bibliography is an NLM-sponsored utility
that an individual can use to list all of
their articles together. They are able to update linkages
over time, so if a year or two after an article is published an individual wants
to attach a different dataset or an additional visualization to that article, it's possible
for the individual to do that, and the PubMed record will then be updated so that we're
able to engage our authors in helping to keep our information current. Lots of things going on with data. We are trying to follow the principles that
we know make library resources useful, secure, and available to the public. I want to briefly touch on some additional
data resources that are available at the National Library of Medicine and then return to the
major NIH directions in data science.
Many of you are familiar with our clinicaltrials.gov
repository. This is a place where you can declare clinical
trials and have the results reported. Importantly, it serves as a public accountability
mechanism for trials that is useful for FDA applications and for NIH accountability. But I want you to think not in terms of the
interface that allows individuals to locate a trial that might be useful for somebody
in their family to participate in, but rather to think about the concept of clinicaltrials.gov
as an information scaffold. Think back to the idea of creating an ecosphere
of discovery, and you'll see on here that we've moved from the focus being sharply on
an article to now the focus being on the trial declaration.
So we are able to connect additional relevant
parts of a research study – the protocols, the analysis plan, the results database, even
individual participant data – through a single point. We recognize that investigators and scientists
and society will come to our resources in lots of different ways and we need to provide
flexibility in how they do that, and clinicaltrials.gov is one example.
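The scaffold idea above can be sketched as a simple data structure: one registration record acting as the single point that links out to the other parts of a study. The field names here are illustrative only, not the actual clinicaltrials.gov schema:

```python
# Sketch of the "information scaffold": a trial registration as the single
# point connecting the other parts of a research study. Field names are
# invented for illustration, not the clinicaltrials.gov schema.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TrialScaffold:
    nct_id: str                        # registry identifier, e.g. "NCT00000000"
    title: str
    protocol_url: Optional[str] = None
    analysis_plan_url: Optional[str] = None
    results_summary_url: Optional[str] = None
    participant_data_urls: list = field(default_factory=list)

    def linked_resources(self):
        """Everything reachable from this one registration record."""
        links = [self.protocol_url, self.analysis_plan_url, self.results_summary_url]
        return [u for u in links if u] + self.participant_data_urls

# Invented example: one trial linking a protocol and one participant dataset.
trial = TrialScaffold(
    nct_id="NCT00000000",
    title="An invented example trial",
    protocol_url="https://example.org/protocol.pdf",
    participant_data_urls=["https://example.org/ipd.csv"],
)
print(len(trial.linked_resources()))
# → 2
```

The design point is that a user who arrives at any one artifact can navigate to all the others through the registration, rather than treating the article as the only entry point.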
In clinicaltrials.gov, we know that
research studies use different kinds of instruments and measurements, and having standard approaches to measuring
familiar and important concepts in biomedical research is a critical accelerator of interoperability,
as well as of extending studies. So we have developed something referred to
as the Common Data Elements Portal. Common Data Elements is a commitment across
the NIH to enhance the findability and interoperability of data for common concepts such as depression,
family structure, adherence to medications. The use of Common Data Elements in a research
project allows for validated measures to be used, so it increases the rigor of research
studies, it makes it easier to correlate findings from studies, and it improves our ability
to extract knowledge when studies may be too small to resolve individual hypotheses. The National Library of Medicine is fostering
the Common Data Elements in both human- and machine-readable forms, allowing for structured
information to be available earlier, at the point of research planning.
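What a machine-readable common data element buys you can be sketched as follows; the element and its permitted values are invented for illustration, and the point is simply that a shared, structured definition lets every study validate responses the same way:

```python
# Sketch: a common data element in machine-readable form. The element and
# its permissible values are invented; real CDEs live in the NIH CDE portal.
CDE_MEDICATION_ADHERENCE = {
    "name": "medication_adherence_past_week",
    "question": "In the past week, how often did you take your medication as prescribed?",
    "datatype": "integer",
    "permissible_values": {0: "Never", 1: "Sometimes", 2: "Often", 3: "Always"},
}

def validate_response(cde, value):
    """True if a response is one of the element's permissible values."""
    return value in cde["permissible_values"]

print(validate_response(CDE_MEDICATION_ADHERENCE, 2))   # → True
print(validate_response(CDE_MEDICATION_ADHERENCE, 7))   # → False
```

Two studies that both record this element can be pooled or compared directly, which is exactly the interoperability gain described above.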
It also allows investigators to harmonize
across different measures to be able to select the most appropriate measures and to reduce
the duplication of effort. At the other end of the spectrum, we maintain
an enumeration of data sharing repositories, over 350 repositories of completed studies
with varying levels of accessibility, summarized on our summary chart of
data sharing repositories. This allows us to take data collected for one study that
has been approved for reuse and make it accessible and reusable. We're making small steps, but the steps are
important on this trajectory of fostering data-driven discovery. My slides, by the way, will be available afterwards,
so that if you've been trying to jot down URLs, we'll be able to get these to you through
the CNI listing. Let me finally move to closing, and then to conversation,
by telling you what's happening at the NIH to integrate these solutions, to move them out of
a single institute and bring in a broad strategy.
We have just launched something called the
Office of Data Science Strategy within the NIH Office of the Director. Susan Gregory is the director of this office;
she's a fantastic colleague and a great advocate; most of her experience is with
NSF and large-scale data systems there. Susan's efforts in the Office of Data Science
Strategy represent a commitment from the NIH that a number of the projects that I've
just described to you – data sharing within articles, Common Data Elements, the scaffolding
of information around clinical trials, the ecosphere of data discovery – are becoming
an NIH priority. So we are making another significant step
in ensuring that data will become an important substrate for discovery. But we recognize that we require a rapid infusion
of new workers into this workforce, so the NIH is launching several data science workforce
fellowships, and I'm bringing them to your attention here because applications are open
now and you may have students or faculty in your institutions that might be able to take
advantage of these fellowships.
The first fellowships we have are Graduate
Data Science Summer Programs. Graduate Data Science Summer Programs are
managed through the Office of Education on the NIH campus, and they are built on an initial
consortium of local universities – you see UVA, George Mason, and George Washington – but
students from all universities can apply. Projects have been proposed, we anticipate
bringing in fifteen interns this summer, who may be master's-level or early PhD-level students,
to come and address specific data science challenges within biomedical data. Now, we're doing this for several reasons. We certainly want to expand the workforce,
but we also recognize that many of these young students are bringing in approaches to data
analytics that will be novel to us, so we will be able to learn from them. And it will be a sharing process that we're
really quite excited about. If you go to the NIH website and simply look
for data science training, you'll find this.
Another fellowship training program that's
happening right now is something called Coding it Forward. Coding it Forward is part of the federal Civic
Digital Fellowship Program. It's a student-led nonprofit that
sets up partnerships with students – usually these are college-level students,
they don't have to be graduate students – who participate in an internship program for ten
to twelve weeks at NIH. Our Office of Data Science is going to be
coordinating some activities so the fellows have a point of presence; they will have somewhat
more supervised training than our graduate fellows, who will be deeply embedded in
laboratories.
The third fellowship that I want to bring
to your attention is a fellowship that is designed to increase the capacity of NIH to
handle large-scale data analytics. Our goal here – and this will be launched
by mid-summer of this year – is to create a program called the National Data and Technology
Advancement Fellows, the NIH Data Fellows. The idea is that we would bring individuals
from industry or from academic programs, maybe on a one- to two-year IPA – a sabbatical program
would be appropriate for this – where their data science expertise – not necessarily people
who have experience with biomedical data, but people with expertise in data science – will be applied to
our massive data resources, and they will be able to investigate how their approaches and
strategies can benefit from, and maybe expand, our ability to work with the data that we
have available at NIH.
Our goal is to bring in a cohort in 2019 for
a two-year period. Our hope is to accelerate both the capacity
we have for analytics at NIH, as well as to stimulate interest in biomedical data around
the country. We expect to have two to five fellows in each
cohort, and we recognize one of our major challenges here is going to be bringing individuals
in at the appropriate level of salary, because – as you have all discovered, I'm sure – data
science training commands very high salaries, so we need to find ways to
be supportive of that. Let me close by talking to you about a critical
policy issue on the docket right now, and to invite you to send me your personal comments
if you have any thoughts after this particular session that I'm presenting to you.
NIH is proposing a data management and
sharing policy for all NIH-funded research. This policy is available on the website now,
even though the comment period is closed, but it is designed to help the NIH hear from
the community about what policies make the most sense in light of the way institutions
are managing data, the way institutions hope that the NIH will be funding data management,
and to drive us towards a period where data-driven discovery becomes the norm, not the exception. Before we establish a policy, we need to understand
from the community the different ways that data, and biomedical data, are defined. We need to understand what the institutions
and the individuals believe are the requirements for data management and sharing plans.
When I was an investigator, it was enough
to say, "The data will be made available by the investigator via email." That doesn't work anymore: it doesn't provide
protection for the datasets, and it doesn't make the datasets accessible. But without
excessively burdening the institutions, what are the next steps it would make sense to
have NIH consider? And what is the timing of this? Should we be doing experiments over the next
three to five years? Is the community ready for us to put some
standards in place for data sharing? We're hearing both responses, frankly; some
people want us to get moving more quickly, others are saying, "Test out a few different
structures," because we know institutional capacity for data management isn't where it
needs to be right now. The considerations that we're bringing together
here have to do with the budget for data management and sharing, particularly, who will be paying
for this? The use of existing repositories – and we
are particularly interested in hearing from institutions that have already made heavy
investments in intramural data repositories. We see a lot of
encouragement to use local repositories, and we need to know how we're going to make data
in those repositories accessible, shareable, and understandable by others.
And, finally, we need to hear from around
the country about what the community expectations
for shared data actually are. We're investing in data management, in part
because we recognize it's good stewardship, but in part because we believe it will accelerate
us toward that future. Over the last hour I've taken you on a very
fast walk through the National Library of Medicine's data strategic plan, our focus
on new ways for data sharing, the NIH investment and interest in this, and now I'd like to
hear a little bit about your questions and your comments and directions we should be
going. Thank you very much for your time.
And we've left some time for questions and
I believe there's two microphones up if anyone would like to begin the comments or questions. [Audience member]: Hi, thanks so much for
that, it was great. In your discovery section you mentioned the
desire to do more personalization to deliver the information that your users need, and
we've been having a couple of conversations
here about patient privacy and what steps
we should be taking to protect it, so do you have any thoughts on the tension between
those two goals? [Patricia]: Well, the tension is an appropriate
tension, for certain.
And the issues about privacy, I would be – actually,
don't go away, because I'm going to ask you, what are the key privacy concerns that you
had? I can tell you the ones we have. So we're first and foremost interested in
reducing the number of cliques, secondly we're interested in allowing individuals the unfettered
access to the information they want without undue scrutiny, and at the same time we recognize
we need to understand the trajectory through our resources so we can determine how to best
organize them for individuals.
So what are people willing to have captured
about them, whether or not it's identifiable? And the third part that we're particularly
interested in is, how much of your history should inform your future? So are individuals, if you will, self-reflective
enough and adept enough to be able to say, "Toss that search, don't ever use that one,
but build on this one because it made sense for me." So can you tell me some of the issues that
came up in the discussion here? [Audience member]: Yeah, I think some of them
were the informed consent – how educated are our users when they're making decisions about
what they're willing to share? And then, I think, more alarmingly, probably,
it's the tools we are using to aggregate that information: what data is getting shared among
publishers and third-party vendors and non-library entities? [Patricia]: I see, so when data becomes an
asset, a material asset? Yeah, I see.
So I'm from the government, I'm here to help
you. Of course we would not do that. I say that with … we don't do that, but
I really do recognize – you saw me say the word "trust" fifteen times in this talk. I know the federal government has a lot of
trust-building to do around health data. We have not always been honest or good stewards
about this. And I would be interested in guidance from
this community, if there is some, about, how do we explain what we do and don't do already? So we do not profile users, we do not know
users who choose not to let themselves be known to us, but we do have an option called
My NCBI through which you can actually be known to us on a very regular basis, and we develop
a lot of dialogue. That's the second level of helping people
understand: what exactly is it that we do with your data? We do not share or expose our service logs.
We do not allow investigators, except for
the improvement of a specific service, to come in and look for our Thursday night
users or our high-profile users or our everyday users. We don't have the ability to track individuals
at that level. But saying what we do and don't do is one
level of building trust, illustrating how we use the information is another. Now, I would be interested also in knowing,
as we interact with the publisher community, as we interact with commercial information
sharers, what kinds of questions should we be asking them about best practices? And to what extent should the National Library
of Medicine actually be fostering the dialogue around the best practices? I've been in some conversations that have
been really … that this is a neat time for scientific communication, very exciting. But there's also the awareness that scientific communication,
particularly the labeling of journal titles and articles, has taken on all sorts of meanings,
so beyond impact factors and whether or not your h-index is high enough, the presence
or absence of a Nature cover article in your CV speaks volumes according to some people.
Others are arguing that we should redact journal
names from [inaudible] 57:44 cases and from CVs so that we level the playing field
and base the evaluations on the science and not on the presentation of the science. When we move to the citizen scientist, a
layperson who, if you will, is not of the guild that academics are in, we have a whole new way that we have to
explain what we're collecting and not collecting, what we know and don't know about individuals.
And, to be very honest, you can get PubMed
results through Google. So we have these even very confusing displays
of information. When you're on our site, there's certain behaviors
you can count on us for, but when you're receiving our resources but not on our site – that's
that NLM inside – we have another level of information. We do not, at this point in time, provide
page view statistics. I don't know, was there interest in page view
or was there concern about page view? I've heard both sides from people – that's
a measure of how widely your work is being read, or, no, it's distortion. [Audience member]: Specifically in any of
the sessions. [Patricia]: The page view issue? [Audience member]: Yeah. [Patricia]: Okay, that's helpful. We're very open to policy and ethics, we've
expanded our investment in this area.
Jerry Sheehan had been our public policy lead
for many years; he's now the deputy director, so he brings to that position at NLM a very
keen awareness of our public responsibility and also our collaboration across the federal
government. Dina Paltoo is now our director of public
policy and she's supported by Rebecca Goodwin, and also Lisa Federer's work is really helping
us think about the ethics of communication and the ethics of knowing what kind of information
people are looking at. Other comments or questions? Yes. [Audience member]: Hi, Patricia.
Erik Mitchell, UC San Diego. I'm really intrigued by the data sharing and
management policy [inaudible] 59:48 after the last topic. As we picked up this topic at San Diego, we
realized … actually, as we picked up the topic about making data more openly available
at San Diego, we realized we didn't have a good grasp on the policies, even, that govern
how research data is managed within our institution. And I'm curious, as we take on this kind of
modest endeavor, how can we do so in a way that will dovetail well with some of these
– if government agencies were also going to pick up this topic, how can we make sure we're
working together on it? Or maybe you've got advice for us on we should
stay away from as we get into it.
[Patricia]: Thank you very much for the question,
and this is exactly what I had hoped to have come out of having this conversation here,
is to learn from what's happening in the communities around the country and how does it align with,
or need to be aligned with, what we do? So I'd like you to think about three things
that we find really important to think about. One is the lifecycle of data and helping investigators
early on in their planning to think about, what are the data products of their resource
and how long are they going to be valued? Now, everyone thinks their data is really
valuable and it's going to be used by lots of people, so that's a fine starting point,
but then we might want to help them think about, what is the long term? And even if it's not required in a federal
document, to begin at the department level or to begin at, whether you have an office
of research or research and sponsored projects office, to begin to get that information,
because it will give the institution an idea of what kind of downstream commitments might
be there.
So, first of all, think about the lifecycle
of data. It's pretty c