Commit 7667326d authored by eriwe600's avatar eriwe600
parents be43fc0f f846dfdf
File deleted
File deleted
File deleted
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("/user/s2875462/airline.csv.shuffle", header="true")
'''
QUESTION: What is the average delay time when it is caused by security reasons vs when it is caused by weather conditions?
Local run: 6m50.311s
Cluster run: 2m29.112s
Output:
The average delay caused by security is 20 minutes.
The average delay caused by weather conditions is 43 minutes.
'''
'''
--- Average Security Delay ---
'''
df_s = df.select(col("SecurityDelay")).where(col("SecurityDelay") > 0)
# Cast each CSV string to int before reducing, so a single-row result is still numeric.
sum_s = df_s.rdd.map(lambda x: (1, int(x[0]))).reduceByKey(lambda x, y: x + y).collect()[0][1]
avg_s = sum_s / df_s.count()
print("The average delay caused by security is "+str(avg_s)+" minutes.")
'''
--- Average Weather Delay ---
'''
df_w = df.select(col("WeatherDelay")).where(col("WeatherDelay") > 0)
# Cast each CSV string to int before reducing, so a single-row result is still numeric.
sum_w = df_w.rdd.map(lambda x: (1, int(x[0]))).reduceByKey(lambda x, y: x + y).collect()[0][1]
avg_w = sum_w / df_w.count()
print("The average delay caused by weather conditions is "+str(avg_w)+" minutes.")
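The averaging logic used for both delay columns can be checked without a cluster. The following is a minimal plain-Python sketch, not the Spark job itself: the helper name `average_positive_delay` and the sample values are hypothetical, and it assumes the CSV columns arrive as strings in which non-numeric entries (for example `NA`) should be ignored, as the `> 0` filter appears to do in the script.

```python
def average_positive_delay(raw_values):
    """Average of the strictly positive delays in a column of CSV strings.

    Hypothetical helper: mirrors the filter (> 0), sum, and count steps of
    the Spark script on an in-memory list.
    """
    delays = []
    for raw in raw_values:
        try:
            minutes = int(raw)
        except (TypeError, ValueError):
            continue  # skip blanks and non-numeric markers such as "NA"
        if minutes > 0:
            delays.append(minutes)
    if not delays:
        return None  # no positive delays, so no average is defined
    return sum(delays) / len(delays)


# Made-up sample values, not taken from airline.csv.shuffle.
sample = ["15", "NA", "0", "30", "", "45"]
print(average_positive_delay(sample))  # 30.0
```

Under Python 3 the `/` operator performs true division, so the result is a float; a whole-number result such as the one shown in the recorded output would need `//` or an explicit `int(...)`.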