Big Data/Cassandra

From Wikiversity

Apache Cassandra is a distributed, scalable NoSQL wide-column database management system. By 2015 it had become one of the world's most popular DBMSs[1].

Installation

The Java sources are available at https://github.com/apache/cassandra, and a tarball can be downloaded from http://cassandra.apache.org/download/.

  • macOS (Homebrew): brew install cassandra && brew services start cassandra

See also http://cassandra.apache.org/doc/latest/getting_started/installing.html for more information.

To launch the server:

  • On Linux: /cassandra/bin/cassandra
  • On Windows: \cassandra\bin\cassandra.bat

Graphical user interface

Several GUIs are available for managing Cassandra. One example is Helenos: its Java sources are available at https://github.com/tomekkup/helenos, and a compiled version at http://sourceforge.net/projects/helenos-gui/.

It bundles an Apache Tomcat server, which is started with \helenos\bin\startup.bat. The web interface should then be reachable at http://localhost:8080 (login: admin / password: admin).

[Image: Helenos screenshot]

NB: Helenos can create column families, but it does not display those that were created through CQL.


Data manipulation

Cassandra introduced the Cassandra Query Language (CQL) in 2011[2][3]. CQL statements can be run with the cqlsh client, which lets you create keyspaces and tables, insert rows and query tables, among other operations. The CQL 3.0 syntax looks like this[4]:

CREATE KEYSPACE MyBase1 WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
USE MyBase1;

CREATE TABLE MyTable1 (
  id text,
  FirstName text,
  LastName text,
  PRIMARY KEY (id)
);

INSERT INTO MyTable1 (id, LastName) VALUES ('1', 'Test');

SELECT * FROM MyTable1;

DROP TABLE MyTable1;
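
The same statements can also be sent from a program rather than typed into cqlsh. The following is a minimal sketch, not a reference implementation: it assumes only that the client exposes an execute() method (as the Session object of the DataStax Python driver does), and RecordingSession is a hypothetical stand-in used here so that the example runs without a server.

```python
# Sketch: send the CQL statements of this page through any object that has
# an execute() method. Using a driver at all is an assumption; the page
# itself only uses cqlsh.

STATEMENTS = [
    "CREATE KEYSPACE IF NOT EXISTS MyBase1 WITH REPLICATION = "
    "{ 'class' : 'SimpleStrategy', 'replication_factor' : 1 }",
    "USE MyBase1",
    "CREATE TABLE IF NOT EXISTS MyTable1 ("
    "id text, FirstName text, LastName text, PRIMARY KEY (id))",
    "INSERT INTO MyTable1 (id, LastName) VALUES ('1', 'Test')",
]


def run_statements(session, statements=STATEMENTS):
    """Execute each statement in order and return how many were sent."""
    count = 0
    for statement in statements:
        session.execute(statement)
        count += 1
    return count


class RecordingSession:
    """Hypothetical stand-in for a real session: it only records statements."""

    def __init__(self):
        self.executed = []

    def execute(self, statement):
        self.executed.append(statement)


if __name__ == "__main__":
    fake = RecordingSession()
    print(run_statements(fake), "statements sent")
```

With the DataStax driver installed (pip install cassandra-driver) and a node listening locally, a real session could presumably be obtained with Cluster(["127.0.0.1"]).connect() from cassandra.cluster and passed to run_statements in place of the stand-in.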

Additional Notes:

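  • There is no auto-increment option; unique keys (for example UUIDs) have to be generated by the client or with a CQL function such as uuid().
  • Unquoted column names are not case-sensitive (they are folded to lowercase); a name must be enclosed in double quotes to preserve its case.
  • Inserting a row whose primary key already exists silently overwrites the supplied columns of the existing row (an "upsert"), without any warning. Adding IF NOT EXISTS to the INSERT prevents the overwrite.
  • When inserting more than 1,000 records, cqlsh may ignore the rest. For bulk imports it is recommended to use the sstableloader bulk-loading tool.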
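
Two of the notes above can be illustrated in a few lines. This is a sketch under stated assumptions: new_row generates the key on the client side with Python's uuid module because Cassandra has no auto-increment, and upsert is a purely in-memory model (a dict, not Cassandra) of what happens when a row is inserted twice with the same primary key. INSERT_CQL uses the %s placeholder style of the DataStax Python driver; the helper names are hypothetical.

```python
import uuid

# Parameterized statement; the values are bound by the driver at execution.
INSERT_CQL = "INSERT INTO MyTable1 (id, FirstName, LastName) VALUES (%s, %s, %s)"


def new_row(first_name, last_name):
    """Return bind values for INSERT_CQL with a freshly generated text key."""
    return (str(uuid.uuid4()), first_name, last_name)


def upsert(table, row_id, **columns):
    """In-memory model of INSERT: supplied columns overwrite existing ones,
    columns that are not supplied keep their previous value."""
    table.setdefault(row_id, {}).update(columns)


if __name__ == "__main__":
    print(INSERT_CQL, new_row("Ada", "Lovelace"))

    table = {}
    upsert(table, "1", LastName="Test")
    upsert(table, "1", LastName="Other")  # same key: silently overwritten
    print(table)
```

Random UUIDs avoid any coordination between clients, which is why they are a common substitute for an auto-increment counter in a database without a master node.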

Cassandra port usage

  • 7000: cluster communication[5]
  • 7001: cluster communication if SSL is enabled[6]
  • 7199: JMX (was 8080 before Cassandra 0.8.xx)[7]
  • 9042: CQL native clients
  • 9160: Thrift client API[8]
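
To see which of these ports actually accept connections on a given machine, a simple TCP probe is enough. The sketch below uses only the Python standard library; the function name check_ports, the default host and the timeout are assumptions made for illustration, and a closed result only means that the TCP connection could not be established (the service may be stopped, bound to another address, or filtered by a firewall).

```python
import socket

# Port numbers and roles as listed above.
CASSANDRA_PORTS = {
    7000: "cluster communication",
    7001: "cluster communication (SSL)",
    7199: "JMX",
    9042: "CQL native clients",
    9160: "Thrift client API",
}


def check_ports(host="127.0.0.1", ports=CASSANDRA_PORTS, timeout=0.5):
    """Return {port: True/False}, True when a TCP connection succeeds."""
    results = {}
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # connect_ex returns 0 on success instead of raising an exception
            results[port] = sock.connect_ex((host, port)) == 0
    return results


if __name__ == "__main__":
    for port, is_open in check_ports().items():
        state = "open" if is_open else "closed"
        print(f"{port:5d}  {state:6s}  {CASSANDRA_PORTS[port]}")
```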

How to use several nodes

For the servers to communicate with one another, the following ports must be open[9]: 7000, 7001 (SSL), 7199, 9042 and 9160.

There is no master node, so fail-over is automatic. Each node must list at least one "seed node" in its configuration (the seeds parameter of seed_provider in cassandra.yaml) so that it can discover the rest of the cluster. The data center and rack of each node are described in \cassandra\conf\cassandra-rackdc.properties.

To let the nodes communicate, set the endpoint_snitch parameter in cassandra.yaml to RackInferringSnitch (the default is SimpleSnitch).

Then, the nodes list is visible with:

  • On Linux: /cassandra/bin/nodetool status
  • On Windows: \cassandra\bin\nodetool.bat status
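
The node list printed by nodetool status can be turned into structured data for monitoring scripts. The sketch below is an assumption-laden illustration: it relies on node lines starting with a two-letter code (U or D for up or down, then N, L, J or M for normal, leaving, joining or moving) followed by the address, which is the usual layout but may differ between Cassandra versions. The function name parse_status is hypothetical.

```python
STATUS = {"U": "up", "D": "down"}
STATE = {"N": "normal", "L": "leaving", "J": "joining", "M": "moving"}


def parse_status(output):
    """Extract address, status and state from the text of 'nodetool status'.

    Lines that do not start with a two-letter status/state code (headers,
    separators, legends) are skipped.
    """
    nodes = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        code = fields[0]
        if len(code) == 2 and code[0] in STATUS and code[1] in STATE:
            nodes.append(
                {
                    "address": fields[1],
                    "status": STATUS[code[0]],
                    "state": STATE[code[1]],
                }
            )
    return nodes
```

The text to parse could be captured with subprocess.run(["/cassandra/bin/nodetool", "status"], capture_output=True, text=True).stdout, assuming the installation path used on this page.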

NB: when a keyspace is created with a replication_factor greater than one, each row is stored on several nodes, which makes the nodes redundant (mirroring).
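
The effect of replication_factor can be pictured with a small ring model. This is a simplified sketch of SimpleStrategy-style placement, not Cassandra's code: each node gets a single token, MD5 stands in for the partitioner (Cassandra's default partitioner is Murmur3), and the node names are hypothetical. The first replica is the node that owns the key's token, and further replicas are the next nodes clockwise on the ring.

```python
import bisect
import hashlib


def token(value):
    """Map a string to a ring position (MD5 is an assumption made for this
    illustration; it is not Cassandra's default partitioner)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def replicas(key, nodes, replication_factor):
    """Return the nodes that store `key`, walking the ring clockwise."""
    ring = sorted((token(node), node) for node in nodes)
    tokens = [t for t, _ in ring]
    # First node whose token is >= the key's token, wrapping around the ring.
    start = bisect.bisect_left(tokens, token(key)) % len(ring)
    # A key cannot have more replicas than there are nodes.
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c"]
    for rf in (1, 2, 3):
        print(rf, replicas("1", nodes, rf))
```

With a replication factor of 1, losing the single owner makes the row unavailable; with 2 or more, another node still holds a copy, which is the redundancy described above.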

Related Technologies

  • Amazon DynamoDB[10]
  • Redis[11]

References

  1. http://db-engines.com/en/ranking
  2. https://grokbase.com/t/cassandra/user/1162fkpwx2/release-0-8-0
  3. https://docs.datastax.com/en/cql/3.3/cql/cqlIntro.html
  4. https://cassandra.apache.org/doc/cql3/CQL.html
  5. http://cassandra.apache.org/doc/latest/faq/index.html#what-ports
  6. http://cassandra.apache.org/doc/latest/faq/index.html#what-ports
  7. https://stackoverflow.com/questions/2359159/cassandra-port-usage-how-are-the-ports-used
  8. https://stackoverflow.com/questions/2359159/cassandra-port-usage-how-are-the-ports-used
  9. http://docs.datastax.com/en/cassandra/2.0/cassandra/initialize/initializeSingleDS.html
  10. https://en.wikipedia.org/wiki/Amazon_DynamoDB
  11. https://en.wikipedia.org/wiki/Redis