If you're migrating from HDInsight on Windows to HDInsight on Linux, you're probably also upgrading from Spark 1.3 to Spark 1.6 and shifting from Zeppelin to Jupyter. That jump brings some fairly fundamental changes. If you, like me, get cross-eyed trying to navigate the Scala docs, I'll be writing a few posts about the key changes, with examples of the new syntax.
In this post, we'll begin at the beginning: loading a CSV text file.
In the previous version of HDI you would have loaded a text file using the following syntax:
val textLines = sparkContext.textFile("wasb:///subfolder/myfile.csv")
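For reference, textFile returns an RDD[String], so each element of textLines is already a plain line of text. A quick way to sanity-check the load in the shell (the file path above is just a placeholder):

// Peek at the first few raw lines of the RDD
textLines.take(3).foreach(println)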
In Spark 1.6, data loading has been unified under the DataFrameReader, which you access through the read property of the SQLContext (not the SparkContext):
val textLines = sqlContext.read.text("wasb:///subfolder/myfile.csv")
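Note that read.text returns a DataFrame rather than an RDD[String]; in Spark 1.6 it has a single string column named value. A quick sketch of how you might confirm that in the shell:

// Inspect the schema of the loaded DataFrame
textLines.printSchema()
// root
//  |-- value: string (nullable = true)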
When it comes time to process the lines of your file, map has changed as well. In Spark 1.3 you would have iterated over the lines like this:
val myRdd = textLines.map(line => line.split(","))
In Spark 1.6, the function you pass to map receives a Row rather than a String, so you first have to extract the string from it:
val myRdd = textLines.map(line => line.getString(0).split(","))
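Putting the pieces together, here's a minimal end-to-end sketch for Spark 1.6. The path, the assumption of at least two columns, and the lack of a header row are all illustrative, not specific to your data:

// Load the raw text; each element is a Row holding one line
val textLines = sqlContext.read.text("wasb:///subfolder/myfile.csv")

// Extract the string from each Row and split it on commas
val fields = textLines.map(row => row.getString(0).split(","))

// Pull the (assumed) first two columns out as a pair RDD and peek at it
val pairs = fields.map(f => (f(0), f(1)))
pairs.take(5).foreach(println)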
Hopefully this helps you get started quickly!