The End of SQL Databases?

2012-02-21 13:32

Part 1

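SQL’s story began with Dr. E. F. Codd’s paper “A Relational Model of Data for Large Shared Data Banks”, published in Communications of the ACM in June 1970. At the time, his IBM colleagues Donald Chamberlin and Raymond Boyce were working on a query language (originally named SQUARE, for Specifying Queries As Relational Expressions), work that culminated in the 1974 paper “SEQUEL: A Structured English Query Language”. Since then, SQL has been the dominant language of relational database systems. In recent years the software industry has produced frameworks and architectures whose main aim is to hide direct use of SQL and the relational database (or to drop it entirely), so that developers can concentrate on the user interface, business logic and platform support. At the same time, a crop of databases regarded as alternatives to the relational model has appeared under the name “NoSQL”. Are we witnessing the end of SQL and relational databases?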

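In a December DDJ podcast interview hosted by Mike Riley, I was asked: “With the popularity of ORMs (object-relational mappers), some software developers argue that SQL has lost its value. What do you think of that view?” I spent the whole New Year holiday thinking about the question, what it implies, and the future of ORM, and I took some time to look at frameworks such as Ruby on Rails Active Record and Hibernate. These frameworks still require developers to understand how relational data is designed, developed and maintained. Microsoft’s LINQ (.NET Language Integrated Query) likewise only reduces the impedance mismatch between programming languages and database languages.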
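
To make that point concrete, here is a minimal hand-rolled sketch of what a mapping layer does, using only Python’s standard sqlite3 module. The `Customer` class, the `customers` table and the helper functions are assumptions invented for illustration; they are not the API of Active Record, Hibernate or LINQ. Even in this toy, someone still has to design the table and write the SQL.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Customer:
    """Plain object the application works with (hypothetical example class)."""
    id: int
    name: str


def create_schema(conn: sqlite3.Connection) -> None:
    # The relational design still has to be written by someone.
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")


def save(conn: sqlite3.Connection, customer: Customer) -> None:
    # Object -> row: the mapping layer issues the INSERT for the caller.
    conn.execute("INSERT INTO customers (id, name) VALUES (?, ?)",
                 (customer.id, customer.name))


def find_by_name(conn: sqlite3.Connection, name: str) -> list[Customer]:
    # Row -> object: the SELECT is hidden from the caller, but it still runs.
    rows = conn.execute("SELECT id, name FROM customers WHERE name = ? ORDER BY id",
                        (name,)).fetchall()
    return [Customer(id=row[0], name=row[1]) for row in rows]


conn = sqlite3.connect(":memory:")
create_schema(conn)
save(conn, Customer(1, "Ada"))
save(conn, Customer(2, "Grace"))
found = find_by_name(conn, "Ada")
print(found)
```

The calling code only ever sees `Customer` objects, yet every operation is still a SQL statement against a table with a fixed design.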

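The “NoSQL movement” and cloud-based data stores both aim to cut the developer’s dependence on the SQL language and relational databases entirely. Some programmers think the NoSQL movement is a brand-new idea. It is not: object databases first appeared in the 1980s, and in the 1990s Ray Ozzie built the first commercial document-centric data store business with Lotus Notes.

Charlie Caro, a senior software engineer who works on the InterBase SQL database engine at Embarcadero, told me: “Back then, it was widely believed that a database without concurrency control over its data could hardly win broad acceptance. But Ozzie recognized that the benefits of distribution, replication and easy setup far outweighed those of update conflict control, since conflicting concurrent updates are rare when managing documents and messages. And if documents did have to be modified correctly with no lost data, concurrency control could be switched on as an option. By default, though, update conflicts were not checked.”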
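
The trade-off Caro describes can be sketched with a toy in-memory document store in which conflict checking is an opt-in switch. Everything here (`DocumentStore`, `UpdateConflict`, the revision counter) is a hypothetical illustration, and the revision check is just one common way to detect stale writes; the source does not say how Lotus Notes implemented its option.

```python
class UpdateConflict(Exception):
    """Raised when a write is based on a stale revision."""


class DocumentStore:
    """Toy in-memory document store (hypothetical, not the Lotus Notes API)."""

    def __init__(self, check_conflicts: bool = False):
        # Conflict checking is off by default, mirroring the default Caro describes.
        self.check_conflicts = check_conflicts
        self._docs = {}  # doc_id -> (revision, body)

    def read(self, doc_id):
        """Return (revision, body); unknown documents are at revision 0."""
        return self._docs.get(doc_id, (0, None))

    def write(self, doc_id, body, based_on_revision):
        current_revision, _ = self.read(doc_id)
        if self.check_conflicts and based_on_revision != current_revision:
            # Someone else saved since this writer last read the document.
            raise UpdateConflict(f"{doc_id}: expected revision {current_revision}")
        self._docs[doc_id] = (current_revision + 1, body)
        return current_revision + 1


def concurrent_edit(store):
    """Two writers read the same revision, then both try to save."""
    store.write("memo", "draft", based_on_revision=0)
    rev_a, _ = store.read("memo")
    rev_b, _ = store.read("memo")
    store.write("memo", "edit by A", based_on_revision=rev_a)
    store.write("memo", "edit by B", based_on_revision=rev_b)  # stale revision
    return store.read("memo")[1]


# Default mode: the second write silently replaces the first (a lost update).
relaxed_result = concurrent_edit(DocumentStore())

# Optional mode: the stale write is rejected instead.
try:
    concurrent_edit(DocumentStore(check_conflicts=True))
    strict_outcome = "no conflict"
except UpdateConflict:
    strict_outcome = "conflict detected"
```

With checking off, writer A’s edit is lost without any error; with checking on, writer B is told to re-read before saving. For documents and messages that are rarely edited by two people at once, the unchecked default costs little.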

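NoSQL, according to Wikipedia, is “an umbrella term for a loosely defined class of non-relational data stores.” The term is credited to Rackspace employee Eric Evans, whose blog post last October discussed the name NoSQL (now generally read as “Not Only SQL”). The real highlight of that post is this: “the whole point of seeking alternatives is that you need to solve a problem that relational databases are a bad fit for.” Adam Keys offered a similar term on his blog, The Real Adam: “Post-Relational”. Some NoSQL databases also aim to eliminate the processing and memory overhead that relational databases impose. Other NoSQL goals include looser coupling to the programming language, access through web technologies and RPC calls, and optional forms of data query.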

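In a recent blog post, “The ‘NoSQL’ Discussion has Nothing to Do With SQL”, Professor Michael Stonebraker compares SQL and NoSQL databases. The two can be compared, partially or fully, on the features and characteristics below. (Note: there are surely more features that belong on this list. Feel free to add in the comments any that you think distinguish the two kinds of database.)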

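Scale-out and scale-up – Relational (traditional) databases usually live on a single server and are upgraded by adding processors, memory and disks. Relational databases deployed across several servers usually rely on replication to keep the data in sync. NoSQL databases can run on a single server, but are more often deployed as a distributed cloud (NoSQL: distributed and scalable non-relational database systems).
Columns, key/value stores, tuple stores – Relational databases are usually made up of columns in tables or views (a fixed schema, related to one another through join operations). NoSQL databases usually store key/value pairs or tuples (no fixed schema, just an ordered list of elements).
Memory and disk usage – A relational database usually resides on a disk or on network storage. SQL queries and stored procedures pull data sets into memory. Some (but not all) NoSQL databases can operate directly on disk, or use memory to go faster.
Document-oriented, collection-oriented, column-oriented, object-oriented, set-oriented, row-oriented – Document-oriented databases store documents, attributes and XML. Collection-oriented data sets offer features that suit object-oriented programming languages better. Relational databases organize data in tables, rows and columns. A SQL query usually returns a cursor over the set of rows that contain the requested columns. Object-oriented databases arose from the popularity of object-oriented programming, but so far (and for many years to come) relational databases remain the dominant data storage model. Are object databases also NoSQL databases? The rise of object-relational mapping (ORM) frameworks has tied object-oriented programming tightly to most relational databases. Data in a NoSQL database is usually stored as objects, key/value pairs or tuples, and queries against it are usually carried out in program code or through an interface.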
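
As a rough illustration of the fixed-schema versus key/value contrast in the list above, the sketch below stores the same three users twice: once in a relational table (via Python’s standard sqlite3 module) and once in a plain dictionary standing in for a key/value database. The table, keys and field names are made up for the example, and the dictionary is only a stand-in, not any particular NoSQL product.

```python
import json
import sqlite3

# Relational side: a fixed schema, queried declaratively with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)",
                 [(1, "Ada", "London"), (2, "Linus", "Helsinki"), (3, "Alan", "London")])
sql_names = [row[0] for row in
             conn.execute("SELECT name FROM users WHERE city = ? ORDER BY id", ("London",))]

# Key/value side: opaque values under a key, no schema enforced by the store.
kv_store = {}  # stands in for a key/value database
kv_store["user:1"] = json.dumps({"name": "Ada", "city": "London"})
kv_store["user:2"] = json.dumps({"name": "Linus", "city": "Helsinki", "os": "Linux"})
kv_store["user:3"] = json.dumps({"name": "Alan", "city": "London"})

# Lookup by key is direct ...
user_2 = json.loads(kv_store["user:2"])

# ... but a query on a non-key field has to be written in application code.
kv_names = [json.loads(value)["name"]
            for key, value in sorted(kv_store.items())
            if json.loads(value)["city"] == "London"]
```

Note that `user:2` carries an extra `os` field without any schema change, while the “who lives in London?” question moved out of the database and into the program.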

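In an email exchange, Charlie Caro told me: “If Facebook has to manage 100,000,000 user profiles, a distributed, context-free key-value store is about the best fit there is. Queries across that many users are not the problem, but a single user’s update can be enough to overload a traditional database: one user updating data while many users read it calls for concurrency control. In most cases, what attracts users to NoSQL solutions is how easy they are to set up and use. SQL databases demand more up front (schemas and so on), but it is exactly those schemas that give parallel relational systems their high performance. The ease-of-use benefit shows up mostly at development time. Many programmers today prefer scripting languages to safer, statically type-checked compiled languages of equal power. Scripting languages are simply more forgiving and easier to pick up, and some tools can compile scripts to .NET/Java bytecode to improve runtime performance.” He and I agree that all of this is about having better tools for the job, and it always has been. Who would drive a screw with a hammer when a screwdriver is at hand?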
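
A distributed key-value store of the kind Caro mentions can be sketched, in one process, as a set of dictionaries with a hash deciding which “node” owns each key. The class and key names below are hypothetical, and plain modulo hashing is used only because it is the simplest scheme to read; production systems such as Dynamo use consistent hashing so that adding a node does not reshuffle most keys.

```python
import hashlib


class ShardedProfileStore:
    """Toy key/value store that spreads keys over several in-memory 'nodes'."""

    def __init__(self, node_count: int):
        # Each dict stands in for one server; a real system would use the network.
        self.nodes = [{} for _ in range(node_count)]

    def _node_for(self, key: str) -> int:
        # A stable hash (not Python's per-process salted hash()) picks the node.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.nodes)

    def put(self, key: str, value: dict) -> None:
        self.nodes[self._node_for(key)][key] = value

    def get(self, key: str):
        # Only the owning node is consulted; no other node is touched.
        return self.nodes[self._node_for(key)].get(key)


store = ShardedProfileStore(node_count=4)
for user_id in range(1000):
    store.put(f"user:{user_id}", {"name": f"User {user_id}"})

profile = store.get("user:42")
sizes = [len(node) for node in store.nodes]
```

Because each profile lives on exactly one node and is looked up only by its key, reads and writes for different users never have to coordinate with each other.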

Part 2

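You would not believe how many open source and closed source NoSQL database products exist today, and new ones appear every day. If my list leaves out your favorite NoSQL database, please post a comment and let me know. Below you will find NoSQL database products of many kinds: document-oriented, collection-oriented, column-oriented, object-oriented, graph-oriented, set-oriented, row-oriented, and more.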

AllegroGraph
Company/Organization:	Franz Inc.
Type:	Graph
Description:	Modern, high performance, persistent graph database.
Storage:	Disk based, meta-data and data triples.
API(s):	SPARQL, Prolog
BerkeleyDB
Company/Organization:	Oracle
Type:	Key/Value
Description:	C language embeddable library for enterprise-grade, concurrent, transactional storage services. Thread safe to avoid data corruption or loss.
Storage:	B-tree, hash table, persistent queue
API(s):	C, C++ and Java
Notes:	Use the BerkeleyDB XML layer on top of BerkeleyDB for XML based applications. Comparison of BerkeleyDB and relational databases
BigTable
Company/Organization:	Google
Type:	Sparse, distributed, persistent multidimensional sorted map.
Description:	Distributed storage system for structured data. Data model provides dynamic control over data layout and format. Data can live in memory or on disk.
Storage:	Data is stored as an uninterpreted array of bytes. Client applications can create structured and semi-structured data inside the byte arrays.
API(s):	Python, GQL, Sawzall API, REST, various.
Notes:	Overview: Bigtable: A Distributed Storage System for Structured Data (PDF format)
Cassandra
Company/Organization:	Apache
Type:	Dimensional hash table
Description:	Highly scalable distributed database. Combines Dynamo’s distributed design and Bigtable’s column family data model.
Storage:	Clusters of multiple keyspaces. The keyspace is a name space for column families. Columns consist of a name, value and timestamp.
API(s):	Java, Ruby, Perl, Python, C#, Thrift framework.
Notes:	Open sourced by Facebook in 2008. Wiki, FAQ, Examples
CouchDB
Company/Organization:	Apache
Type:	Document
Description:	Distributed database with incremental replication, bi-directional conflict detection and management.
Storage:	Ad-hoc and schema-free with a flat address space.
API(s):	RESTful JSON API. JavaScript query language.
Notes:	CouchDB Introduction, Technical Overview
db4o
Company/Organization:	Versant
Type:	Object
Description:	Java and .NET dual license (commercial and open source) object database.
Storage:	Data objects are stored in the way they are defined in the application.
API(s):	Java, .NET languages.
Notes:	db4o database runtime engine, about db4o
Dovetaildb
Company/Organization:	Millstone Creative Works
Type:	JSON-based
Description:	Schemaless database similar to Amazon’s SimpleDB. Open source, standalone Java application server.
Storage:	JSON data format, “bags” (similar to tables).
API(s):	HTTP and JavaScript APIs
Notes:	Dovetaildb JavaScript API reference manual
Dynomite
Company/Organization:	Cliff Moon
Type:	Key/Value
Description:	Open source Amazon Dynamo clone written in Erlang.
Storage:	Distributed key/value store, pluggable storage engines.
API(s):	Thrift API
Notes:	Dynomite Wiki
eXtreme Scale
Company/Organization:	IBM
Type:	In-memory grid/cache
Description:	Distributed cache processes, partitions, replicates and manages data across servers.
Storage:	Data and database cache, “near cache” for local subset of data. Java persistent cache. Map reduce support.
API(s):	Java APIs, REST data service
Notes:	eXtreme Scale Document library web site
GT.M
Company/Organization:	FIS
Type:	Hierarchical, multi-dimensional sparse arrays, content associative memory
Description:	Small footprint, multi-dimensional array with full support for ACID transactions, optimistic concurrency and software transactional memory.
Storage:	Unstructured array of bytes. Can be Key/Value, document oriented, schema-less, dictionary or any other data model.
API(s):	MUMPS, C/C++, SQL
Notes:	GT.M FAQ
hamsterDB
Company/Organization:	Christoph Rupp
Type:	Embedded storage library
Description:	Lightweight embedded database engine. Supports on disk and in memory databases.
Storage:	B+tree with variable length keys.
API(s):	C++, Python, .NET and Java
Notes:	hamsterdb FAQ, examples, tutorial
HBase
Company/Organization:	Apache
Type:	Sparse, distributed, persistent multidimensional sorted map.
Description:	Open source, distributed, column-oriented, “Bigtable like” store.
Storage:	Data row has a sortable row key and an arbitrary number of columns, each containing arrays of bytes.
API(s):	Java API, Thrift API, RESTful API
Notes:	Part of Apache Hadoop project. HBase Wiki, FAQ
Hypertable
Company/Organization:	Zvents Inc.
Type:	Sparse, distributed, persistent multidimensional sorted map.
Description:	High performance distributed data storage system designed to run on distributed filesystems (but can run on local filesystems). Modeled after Google Bigtable.
Storage:	Row key (primary key), column family, column qualifier, time stamp.
API(s):	C++, Thrift API, HQL
Notes:	Hypertable Architectural overview, FAQ
Infinispan
Company/Organization:	JBoss Community
Type:	Grid/Cache
Description:	Scalable, highly available, peer to peer, data grid platform.
Storage:	Key/Value pair with optional expiration lifespan.
API(s):	Java, PHP, Python, Ruby, C
Notes:	Infinispan FAQ, Wiki
InfoGrid
Company/Organization:
Type:	Graph
Description:	Internet graph database made up of nodes and edges. Supports in-memory and persistent storage alternatives including RDBMS, file system, file grid, and custom storage.
Storage:	Nodes (MeshObjects) and edges (relationships). MeshObjects can have entity types and properties, and can participate in relationships. MeshObjects raise events.
API(s):	RESTful web services.
Notes:	InfoGrid Overview, FAQ
Keyspace
Company/Organization:	Scalien
Type:	Key/Value
Description:	Distributed (master/slave) key-value data store delivering strong consistency, fault-tolerance and high availability.
Storage:	Uses the BerkeleyDB library for local storage. Key/Value pairs and their state are replicated to multiple servers.
API(s):	C/C++, Python, PHP, HTTP
Notes:	Keyspace Overview, FAQ
MemcacheDB
Company/Organization:
Type:	Key/Value
Description:	High performance, high reliability persistent storage engine for key/value object storage.
Storage:	Uses BerkeleyDB as storage library/backend.
API(s):	Memcache protocol, C, Python, Java, Perl
Notes:	MemcacheDB complete guide (PDF format)
Mnesia
Company/Organization:	Ericsson
Type:	Key/Value
Description:	Multiuser distributed database including support for replication and dynamic reconfiguration.
Storage:	Organized as a set of tables made up of Erlang records. Tables also have properties including type, location, persistence, etc.
API(s):	Erlang
Notes:	Mnesia Reference manual
MongoDB
Company/Organization:	10gen
Type:	Document
Description:	Scalable, high-performance, open source, schema-free, document-oriented database.
Storage:	JSON-like data schemas, dynamic queries, indexing, replication, MapReduce
API(s):	C, C++, Java, JavaScript, Perl, PHP, Python, Ruby, C#, Erlang, Go, Groovy, Haskell, Scala, F#
Notes:	MongoDB Documentation Index
Neo4J
Company/Organization:	Neo Technology
Type:	Graph
Description:	Embedded, small footprint, disk based, transactional graph database written in Java. Dual license – free and commercial.
Storage:	Graph-oriented data model with nodes, relationships and properties.
API(s):	Java, Python, Ruby, Scala, Groovy, PHP, RESTful API.
Notes:	Neo4J Wiki, API, FAQ
Redis
Company/Organization:
Type:	Key/Value
Description:	Key/Value store with the dataset kept in memory and saved to disk asynchronously. “not just another key-value DB”
Storage:	Values can be strings, lists, sets and sorted sets.
API(s):	Python, Ruby, PHP, Erlang, Lua, C, C#, Java, Scala, Perl
Notes:	Redis Wiki
SimpleDB
Company/Organization:	Amazon
Type:	Item/Attribute/Value
Description:	Scalable Web Service providing data storage, query and indexing in Amazon’s cloud.
Storage:	Items (like rows of data), Attributes (like column headers), and Values (can be multiple values)
API(s):	SOAP, REST
Notes:	SimpleDB FAQ, Getting Started Guide, Developer Guide, API
Tokyo Cabinet
Company/Organization:	Mikio Hirabayashi
Type:	Key/Value
Description:	Library (written in C) of functions for managing files of key/value pairs. Multi-thread support.
Storage:	Keys and Values can have variable byte length. Binary data and strings can be used as a key and a value.
API(s):	C, Perl, Ruby, Java, Lua.
Notes:	Tokyo Cabinet Specifications, presentation (PDF format). Also available: Tokyo Tyrant (remote service), Tokyo Dystopia (full text search), Tokyo Promenade (content management).
Voldemort
Company/Organization:	LinkedIn
Type:	Hash Table
Description:	“It is basically just a big, distributed, persistent, fault-tolerant hash table.” High performance and availability.
Storage:	Each key is unique to a store. Each key can have at most one value. Supported types: JSON, string, identity, protobuf, java-serialization.
API(s):	Java, C++, custom clients
Notes:	Project Voldemort Wiki, Client how-to
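
Several entries in the list above (BigTable, HBase, Hypertable) give their type as a “sparse, distributed, persistent multidimensional sorted map”. The sketch below illustrates only the data model behind that phrase, in a single process with no distribution or persistence: a sorted map from (row key, column, timestamp) to uninterpreted bytes. The class name, method names and sample keys are invented for the example and are not the API of any of those products.

```python
import bisect


class SparseSortedMap:
    """Single-process toy: (row key, column, timestamp) -> uninterpreted bytes."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column, -timestamp)
        self._values = {}  # same tuple -> bytes

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        key = (row, column, -timestamp)  # negated so newer versions sort first
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, row: str, column: str):
        """Return the newest value stored for a cell, or None if it is absent."""
        probe = (row, column, float("-inf"))
        index = bisect.bisect_left(self._keys, probe)
        if index < len(self._keys) and self._keys[index][:2] == (row, column):
            return self._values[self._keys[index]]
        return None

    def scan(self, start_row: str, end_row: str):
        """Yield (row, column, timestamp, value) for start_row <= row < end_row."""
        index = bisect.bisect_left(self._keys, (start_row,))
        while index < len(self._keys) and self._keys[index][0] < end_row:
            row, column, negated_ts = self._keys[index]
            yield row, column, -negated_ts, self._values[self._keys[index]]
            index += 1


table = SparseSortedMap()
table.put("com.example/index", "contents:html", 1, b"<html>v1</html>")
table.put("com.example/index", "contents:html", 2, b"<html>v2</html>")
table.put("com.example/index", "anchor:home", 1, b"Example")
table.put("org.sample/about", "contents:html", 1, b"<html>about</html>")

latest = table.get("com.example/index", "contents:html")
rows_in_range = [row for row, _, _, _ in table.scan("com.", "com.~")]
```

The map is sparse because a cell that was never written takes no space, multidimensional because each value is addressed by row, column and timestamp, and sorted because keeping row keys in order makes range scans cheap.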

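Having so many non-relational databases to choose from is a good thing. Some NoSQL knowledge and hands-on experience helps managers, architects and developers weigh the known strengths and weaknesses of relational databases against those of NoSQL databases. Relational databases and the SQL query language remain central to the design, development and management of most database applications today. But when the time comes to adopt cloud database architectures, the knowledge and material we have gathered will let us migrate quickly. Whether to stay with existing relational technology or replace it with NoSQL is a decision that should be driven entirely by user and business requirements.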

Part 3

If you want more information on NoSQL and non-relational databases, see the following sites, blogs and articles:

No to SQL? Anti-database movement gains steam, by Eric Lai, Computerworld.
Dynamo: Amazon’s “highly available key-value store.” Werner Vogels, Amazon CTO, from his blog post and team article.
Google BigTable: “distributed storage system for managing structured data.” Google Labs home page and paper.
“Death to Relational Databases”, a generic intro to NoSQL by Ben Scofield, CodeMash, January 14, 2010.
Scalable Transactions for Web Applications in the Cloud, by Zhou Wei, Guillaume Pierre and Chi-Hung Chi. Euro-Par 2009 conference (and the PDF paper).
Is Microsoft Feeling the “NoSQL” Heat?, by David Ramel for Redmond Developer News.
It’s not NoSQL, it’s post-relational, by Adam Keys, software developer and writer, on The Real Adam blog, August 2009.
The Future Is Big Data in the Cloud, by Ping Li, Accel Partners.
The Dark Side of NoSQL, from the Code Monkeyism blog.
NoSQL Ecosystem, by Jonathan Ellis, on the Rackspace cloud blog.
NoSQL: A Modest Proposal, by Chris Williams, author of Naked JavaScript and Co-Curator of the NoSQL East conference, from his Voodoo Tiki God blog.
NoSql Databases – Part 1 – Landscape, by Vineet Gupta, GM Software Engineering at Directi Group, on his blog.
NoSQL meetup groups around the world, from meetup.com.
nosql-databases.org – website that is “Your Ultimate Guide to the Non-Relational Universe!”
nosql-discussion Google web discussion group

Below are some upcoming and recent conferences on NoSQL where architects and developers can pick up valuable information. The list is only a partial one:

NoSQL Live, March 11, 2010. Boston, Massachusetts. Hosted by 10gen (provides commercial support for MongoDB).
Glue Conference 2010 (Gluecon), May 26-27, Broomfield, Colorado.
Scandinavian Web Developer Conference 2010, June 2-3, Stockholm, Sweden.
ICOODB 2010 – 3rd International Conference on Objects and Databases, September 28-30, 2010, Frankfurt/Main, Germany. Workshops: NoSQL Workshop & Meetup, September 28, 2010.
FOSDEM – http://nosqldevroom.pbworks.com/NoSQL-devroom-Talks
Oakland, California NoSQL meetup, November 2009. The meetup web site links to several of the papers that were presented, including: No SQL is a Horseless Carriage, Project Voldemort: What’s New, Cassandra in a nutshell, CouchDB, MarkLogic Server, JCR in 15 minutes.

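The comments and suggestions that visitors left on Digg and on the Computerworld blog are well worth reading. Thanks to everyone who joined the discussion of relational and non-relational databases. Here is a selection of those comments: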

Emil Eifrem (Neo4j) commented: “You talk about scaling to size and
handling Facebook’s 100M user profiles. That’s an important use case and
one that for example a key-value store handles brilliantly. But it
turns out most companies aren’t Facebook. You can categorize the four
emerging categories of NOSQL databases (key-value stores, column family
stores, document dbs and graph databases) along the axes of scaling to
size and scaling to complexity. For more information about that, see this
blog post. Graph databases (like e.g. Neo4j,
which I’m involved with, or Sones)
excels at representing complex and rapidly evolving domain models and
then traversing them with high performance.”
Mongo-DB Developer commented: “We have seen the most common use case
to date being use of nosql solutions as operational data store of web
infrastructure projects. By operational, I mean, problems with real time
writes and reads (contrast with data warehousing with bulk occasional
loading). For these sort of problems these solutions work well and also
fit well with agile development methods where the somewhat ‘schemaless’
(or more accurately, columnless) nature of some of the solutions, and
the dynamically typed nature of the storage, really helps.”
Peter R commented: “I have already seen, in the domain I work in,
the movement away from straight up SQL databases. XML databases are one
technology that will be stealing a lot of SQL’s thunder (if they haven’t
already). Do I think SQL will ever die? No. But the key is that there
will be/are more options that need to be thought about when designing a
system now.”
Anonymous commented: “I agree object databases have a purpose. They
are great for large datasets that need to be replicated and called by a
key. However SQL provides a very important capability and that it is to
be able to query data across a number of datasets very efficiently, this
will be very hard to duplicate in a simple key value database.”
Johannes Ernst commented: “One of the difficulties for “normal”
developers with many of the NoSQL technologies that you’ve described so
far has been the learning curve and the additional work required: e.g.
it’s easy and everybody knows how to put “every customer can place one
or more orders” into a relational database, but what if the only thing
you have is keys and opaque values? Compared to many other NoSQL
alternatives, graph databases provide a high level of abstraction,
freeing developers to concentrate on their application, while still
bringing many of the same NoSQL benefits. For example, in InfoGrid (http://infogrid.org/), a project I’m
involved in, you can define “Customer” and “Order” and their
relationship, and the InfoGrid graph database takes care of storing and
retrieving data and enforcing the relationship. In our experience, that
makes graph databases much more approachable to developers than many
other NoSQL technologies.”
Database-ed commented: “The problem is that when folks think about
storing information that they need to retrieve, they are so ingrained to
SQL that they fail to think of other means. The Facebook example is a
case in point. Who is ever going to ask for an accurate report of every
user in Facebook? If you miss something the first time you go looking,
you can always present it later. The end user doesn’t know you lost it,
they assume it didn’t exist at the time and now it does. Yet you still
need to store the data for easy retrieval. One problem with SQL is that
it ties you into the relationships. Facebook is about letting people
build the relationships based on the fields they want to build them on,
not the ones you might think of. I know, it can be done within the
confines of SQL, but it is a lot harder to do when the size gets large.”
Raptor007 commented:
“Some tasks that are poorly serviced by SQL may get switched over to a
new method, but other implementations that are perfectly suited to SQL
will continue using it. As they quoted Eric Evans in the article, “the
whole point of seeking alternatives is that you need to solve a problem
that relational databases are a bad fit for.””
Miracle Blue commented: “While I highly doubt there’s going to be any significant
migration away from SQL and the like any time soon, I think more web
developers will start experimenting with data stores and other data
solutions as we move further into the cloud.”
TheUnGod commented:
“And as companies turn to ask their SQL DBAs what they think of this,
they’ll say “lets stick with SQL.” Honestly, there are so many people
that support SQL right now that will not switch any time soon this
article is just bogus. You can’t make a switch like that until people
can support it properly.”
SteelChicken commented: “Document centric is pretty dumb if you plan on doing any
sort of analytics and data mining. Great for workflow and such.”
Angusm commented: “The
significance of the NoSQL movement is that it adds new tools that offer
better solutions to specific problems. The future probably belongs to
NoSQL in the sense of ‘not-only SQL’, rather than ‘no SQL’. Don’t
imagine that NoSQL solutions offer a free lunch though. I had an
educational experience when I changed a view definition in a CouchDB
data store and my first trivial query took an hour to come back. CouchDB
can be pleasingly fast when all its indexes are built, but if you have
to rebuild those indexes from scratch … well, let’s just say that’s
not something you want to do on a live client-facing site.”
Afex2win commented: “digg is one of the bigger proponents of
Cassandra, a distributed data store in the vein of which the article is
talking about. http://about.digg.com/blog/looking-future-cassandra”
Drmangrum commented:
“SQL will be around for awhile. It’s good at doing what it was designed
to do. However, there are many times when people use SQL simply because
there is nothing better out there. As data complexity rises, a new
method for accessing and persisting that data will have to be
investigated. Part of the problem with many of the alternate solutions
is that few people know how to use them.”

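Years from now, I expect most of us will still rely on relational databases and SQL. I certainly intend to keep looking for better ways to abstract and encapsulate data access. As always, any engineering decision has to fit user and business requirements. For the software projects ahead, I am confident we will find a suitable non-relational data store. Are you using a non-relational database? Have you given up SQL and relational databases? Are you moving your data to a public or private cloud database? Please post a comment.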

Original English article: LINK

VIA http://www.aqee.net