Scala: How can I replace a value in DataFrames using Scala

Spark 1.6.2, Java code (sorry). This changes every instance of Tesla in the make column to S without passing through an RDD:
    dataframe.withColumn("make", when(col("make").equalTo("Tesla"), "S").otherwise(col("make")));
Edited to add @marshall245’s “otherwise”, which ensures non-Tesla values aren’t converted to NULL.

PySpark count rows on condition

count doesn’t sum Trues; it only counts the number of non-null values. To count the True values, you need to convert the conditions to 1 / 0 and then sum:
    import pyspark.sql.functions as F

    # Sum 1 for rows where the condition holds and 0 otherwise
    cnt_cond = lambda cond: F.sum(F.when(cond, 1).otherwise(0))

    test.groupBy('x').agg(
        cnt_cond(F.col('y') > 12453).alias('y_cnt'),
        cnt_cond(F.col('z') > 230).alias('z_cnt')
    ).show()
    +---+-----+-----+
    |  x|y_cnt|z_cnt|
    +---+-----+-----+
    | bn| …

How to provide a reproducible copy of your DataFrame with to_clipboard()

First: Do not post images of data, text only please.
Second: Do not paste data in the comments section or as an answer; edit your question instead.
How to quickly provide sample data from a pandas DataFrame
There is more than one way to answer this question. However, this answer isn’t meant as an exhaustive …

Creating a new column in pandas by using a lambda function on two existing columns

You can use the map function and select with np.where:
    print(df)
    #     a     b
    #0  aaa  rrrr
    #1   bb     k
    #2  ccc     e

    # if the condition is True, take the length of column a, else the length of column b
    df['c'] = np.where(df['a'].map(len) > df['b'].map(len), df['a'].map(len), df['b'].map(len))
    print(df)
    #     a     b  c
    #0  aaa  rrrr  4
    …
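As a self-contained sketch of the approach above, the snippet below rebuilds the three-row sample frame from the excerpt and computes the longer of the two string lengths per row. The `c_apply` column is an addition for comparison only (it is not in the original answer): it shows the row-wise lambda the question title asks about, which gives the same values but is usually slower than the vectorised np.where version.

```python
import numpy as np
import pandas as pd

# Sample data copied from the excerpt above
df = pd.DataFrame({"a": ["aaa", "bb", "ccc"], "b": ["rrrr", "k", "e"]})

# Vectorised version: compute both lengths once, then pick per row
len_a = df["a"].map(len)
len_b = df["b"].map(len)
df["c"] = np.where(len_a > len_b, len_a, len_b)

# Row-wise lambda over two columns (hypothetical comparison column)
df["c_apply"] = df.apply(lambda row: max(len(row["a"]), len(row["b"])), axis=1)

print(df)
```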

Spark add new column to dataframe with value from previous row

You can use the lag window function as follows:
    from pyspark.sql.functions import lag, col
    from pyspark.sql.window import Window

    df = sc.parallelize([(4, 9.0), (3, 7.0), (2, 3.0), (1, 5.0)]).toDF(["id", "num"])
    w = Window().partitionBy().orderBy(col("id"))
    df.select("*", lag("num").over(w).alias("new_col")).na.drop().show()

    ## +---+---+-------+
    ## | id|num|new_col|
    ## +---+---+-------+
    ## |  2|3.0|    5.0|
    ## |  3|7.0|    3.0|
    ## |  4|9.0|    7.0|
    ## +---+---+-------+
but …

Preserve Dataframe column data type after outer merge

This should really only be an issue with bool or int dtypes. float, object and datetime64[ns] can already hold NaN or NaT without changing the type. Because of this, I’d recommend using the new nullable dtypes. You can use 'Int64' for your integer and 'boolean' for your Boolean columns. Both of these now support missing …
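A minimal sketch of that recommendation, assuming a pandas version that ships the nullable 'Int64' and 'boolean' dtypes (1.0 or later). The frames and column names (`left`, `right`, `key`, `count`, `flag`, `score`) are made up for illustration. It compares an outer merge on plain int/bool columns with the same merge after casting to the nullable dtypes.

```python
import pandas as pd

# Hypothetical sample frames; key 3 exists only on the right side
left = pd.DataFrame({"key": [1, 2], "count": [10, 20], "flag": [True, False]})
right = pd.DataFrame({"key": [2, 3], "score": [0.5, 0.7]})

# Plain dtypes: the unmatched row forces NaN into count and flag,
# so count is upcast to float64 and flag stops being bool
plain = left.merge(right, on="key", how="outer")
print(plain.dtypes)

# Nullable dtypes: missing entries are stored as pd.NA and the dtypes survive
left_nullable = left.astype({"count": "Int64", "flag": "boolean"})
merged = left_nullable.merge(right, on="key", how="outer")
print(merged.dtypes)
```

Casting before the merge avoids a lossy round trip through float64 afterwards.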