ImageProcessing_and_Exporting_with_DataFrame(Python)

Batch-process the image files (JPG) held in a DataFrame, then export them

1. Create a sample image DataFrame

image_df = (
  spark
  .read
  .format("binaryFile")
  .option("mimeType", "image/*")
  .load('/databricks-datasets/cctvVideos/train_images/label=0/Browse2frame000*.jpg')
)
 
display( image_df ) 
 
#   path                                                                            modificationTime              length  content
1   dbfs:/databricks-datasets/cctvVideos/train_images/label=0/Browse2frame0005.jpg  2018-12-06T19:17:41.000+0000  53787
2   dbfs:/databricks-datasets/cctvVideos/train_images/label=0/Browse2frame0004.jpg  2018-12-06T19:17:41.000+0000  53307
3   dbfs:/databricks-datasets/cctvVideos/train_images/label=0/Browse2frame0003.jpg  2018-12-06T19:17:42.000+0000  53165
4   dbfs:/databricks-datasets/cctvVideos/train_images/label=0/Browse2frame0002.jpg  2018-12-06T19:17:42.000+0000  52801
5   dbfs:/databricks-datasets/cctvVideos/train_images/label=0/Browse2frame0001.jpg  2018-12-06T19:17:42.000+0000  52545
6   dbfs:/databricks-datasets/cctvVideos/train_images/label=0/Browse2frame0009.jpg  2018-12-06T19:17:41.000+0000  51129
7   dbfs:/databricks-datasets/cctvVideos/train_images/label=0/Browse2frame0008.jpg  2018-12-06T19:17:41.000+0000  50835
8   dbfs:/databricks-datasets/cctvVideos/train_images/label=0/Browse2frame0007.jpg  2018-12-06T19:17:41.000+0000  49379
9   dbfs:/databricks-datasets/cctvVideos/train_images/label=0/Browse2frame0006.jpg  2018-12-06T19:17:41.000+0000  45644
10  dbfs:/databricks-datasets/cctvVideos/train_images/label=0/Browse2frame0000.jpg  2018-12-06T19:17:41.000+0000  45025

(The content column holds the binary image data and is not shown as text.)

Showing all 10 rows.
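
The load path ends in the glob `Browse2frame000*.jpg`, and the result has ten rows, one per frame from 0000 to 0009. The sketch below imitates that pattern with Python's `fnmatch` module on a hypothetical list of file names (the real directory listing is not shown in this notebook). Spark resolves globs with its own matcher, so treat this only as an approximation of how a simple `*` wildcard behaves.

```python
from fnmatch import fnmatchcase

# Same wildcard as the file-name part of the load() path
pattern = "Browse2frame000*.jpg"

# Hypothetical file names, used only for illustration
candidates = [f"Browse2frame{i:04d}.jpg" for i in range(25)]

# Keep the names that the wildcard would select
matched = [name for name in candidates if fnmatchcase(name, pattern)]

print(len(matched))
print(matched[0], "...", matched[-1])
```

Names such as `Browse2frame0010.jpg` fall outside the pattern because their fourth digit from the end is not `0`.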


2. Apply image processing (grayscale) to all images at once

2.1 Wrap the image processing in a UDF

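from pyspark.sql.functions import udf

@udf('binary')
def convert_grayscale(content):
  '''
  Convert the image in `content` to grayscale.
  
  param: 
    content: column holding the image binary data
  return: 
    processed_image_binary: binary data of the processed image (JPEG)
  '''
  import io
  from PIL import Image
  
  # Load the image binary from the `content` column
  f = io.BytesIO(content)
  im = Image.open(f)
  
  # Image processing: convert to 8-bit grayscale (mode 'L')
  grayscaled_im = im.convert('L')
  
  # Return the processed image as JPEG-encoded bytes
  out = io.BytesIO()
  grayscaled_im.save(out, format='JPEG')
  return out.getvalue()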
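
To make the `convert('L')` step less of a black box: Pillow documents the color-to-grayscale conversion as the ITU-R 601-2 luma transform, L = R * 299/1000 + G * 587/1000 + B * 114/1000. The function below is a minimal pure-Python sketch of that formula for a single pixel. The name `luma` is hypothetical and not part of Pillow, and Pillow's internal rounding may differ from this integer division by one gray level.

```python
def luma(r: int, g: int, b: int) -> int:
    """Approximate the 8-bit gray level for one RGB pixel (ITU-R 601-2 weights)."""
    return (r * 299 + g * 587 + b * 114) // 1000

# A few reference colors and their approximate gray levels
samples = {
    "black": (0, 0, 0),
    "white": (255, 255, 255),
    "red": (255, 0, 0),
    "green": (0, 255, 0),
    "blue": (0, 0, 255),
}

for name, rgb in samples.items():
    print(name, luma(*rgb))
```

Green contributes the most to the gray level and blue the least, which is why a pure blue area turns out much darker than a pure green one after conversion.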

2.2 Apply the UDF to run the image processing (grayscale conversion)

grayscaled_df = image_df.withColumn('grayscaled_content', convert_grayscale('content'))
 
display( 
  grayscaled_df
)
 
#   path                                                                            modificationTime              length
1   dbfs:/databricks-datasets/cctvVideos/train_images/label=0/Browse2frame0005.jpg  2018-12-06T19:17:41.000+0000  53787
2   dbfs:/databricks-datasets/cctvVideos/train_images/label=0/Browse2frame0004.jpg  2018-12-06T19:17:41.000+0000  53307
3   dbfs:/databricks-datasets/cctvVideos/train_images/label=0/Browse2frame0003.jpg  2018-12-06T19:17:42.000+0000  53165
4   dbfs:/databricks-datasets/cctvVideos/train_images/label=0/Browse2frame0002.jpg  2018-12-06T19:17:42.000+0000  52801
5   dbfs:/databricks-datasets/cctvVideos/train_images/label=0/Browse2frame0001.jpg  2018-12-06T19:17:42.000+0000  52545
6   dbfs:/databricks-datasets/cctvVideos/train_images/label=0/Browse2frame0009.jpg  2018-12-06T19:17:41.000+0000  51129
7   dbfs:/databricks-datasets/cctvVideos/train_images/label=0/Browse2frame0008.jpg  2018-12-06T19:17:41.000+0000  50835
8   dbfs:/databricks-datasets/cctvVideos/train_images/label=0/Browse2frame0007.jpg  2018-12-06T19:17:41.000+0000  49379
9   dbfs:/databricks-datasets/cctvVideos/train_images/label=0/Browse2frame0006.jpg  2018-12-06T19:17:41.000+0000  45644
10  dbfs:/databricks-datasets/cctvVideos/train_images/label=0/Browse2frame0000.jpg  2018-12-06T19:17:41.000+0000  45025

(The binary columns content and grayscaled_content are not reproduced per row. The binary value displayed for every row begins with the same base64 text:
/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/wAALCAEgAYABAREA/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQ= (truncated))

Showing all 10 rows.
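
The binary value in the output above is displayed as truncated base64 text. As a rough sanity check, that text can be decoded and inspected. The sketch below takes the leading part of the displayed string, confirms the JPEG and JFIF signatures, and reads the frame header (the SOF0 segment) to get the image size and the number of color components. The header parsing is an illustration added here, not part of the original notebook, and it only works because the truncated prefix happens to include the frame header.

```python
import base64
import struct

# Leading part of the base64 text shown in the table (cut at a 4-character boundary)
shown = (
    "/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRof"
    "Hh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/wAALCAEgAYABAREA"
)
data = base64.b64decode(shown)

# JPEG files start with the SOI marker FF D8; JFIF files carry "JFIF" in the APP0 segment
is_jpeg = data[:2] == b"\xff\xd8"
has_jfif = data[6:10] == b"JFIF"

# Baseline frame header (SOF0, marker FF C0):
# segment length, sample precision, height, width, number of components
sof = data.find(b"\xff\xc0")
length, precision, height, width, components = struct.unpack(
    ">HBHHB", data[sof + 2:sof + 10]
)

print(is_jpeg, has_jfif)
print(width, height, components)
```

A single component in the frame header indicates a grayscale JPEG, which is consistent with the displayed bytes being the output of `convert_grayscale` rather than the original color image.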


2.3 Pull out one converted image and check it with a preview

import io
from matplotlib.pyplot import imshow
import numpy as np
from PIL import Image
 
byte_image = grayscaled_df.limit(1).select('grayscaled_content').collect()[0]['grayscaled_content'] # <= converted (grayscale)
#byte_image = grayscaled_df.limit(1).select('content').collect()[0]['content'] # <= original
 
f=io.BytesIO(byte_image)
im = Image.open(f, formats=['JPEG'])
imshow(np.asarray(im), cmap = "gray")
 
Out[37]:
<matplotlib.image.AxesImage at 0x7fce0e24c3d0>

3. Export all image files in the DataFrame at once

3.1 Create a UDF that exports an image file

from pyspark.sql.functions import udf

@udf('string')
def export_as_jpg(path, content):
  '''
  Write `content` out as a JPG file.
  
  param: 
    path: column holding the source file path
    content: column holding the image binary data
  return: 
    export_path: path of the written file
  
  '''
  import os, io
  from PIL import Image
  
  # Build the output file name.
  # If the original is `image001.jpg`, the output is `image001_proc.jpg`.
  base = os.path.basename(path)
  basename, ext = os.path.splitext(base)
  export_filename = f'{basename}_proc{ext}'
  
  export_base = '/dbfs/tmp/images/' # output directory
  os.makedirs(export_base, exist_ok=True)
  export_path = os.path.join(export_base, export_filename)
  
  # Decode the bytes and save them; convert('L') leaves an already-grayscale image unchanged
  f = io.BytesIO(content)
  im = Image.open(f)
  im.convert('L').save(export_path)
  
  return export_path
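
The file-name logic inside `export_as_jpg` can be tried on its own, without Spark or Pillow. The helper below is a minimal sketch that repeats the same `os.path` steps. The function name `build_export_path` and its `export_base` parameter are hypothetical and introduced only for this illustration.

```python
import os

def build_export_path(path: str, export_base: str = "/dbfs/tmp/images/") -> str:
    """Return the output path for a source file, adding `_proc` before the extension."""
    base = os.path.basename(path)         # e.g. Browse2frame0005.jpg
    stem, ext = os.path.splitext(base)    # ('Browse2frame0005', '.jpg')
    return os.path.join(export_base, f"{stem}_proc{ext}")

src = "dbfs:/databricks-datasets/cctvVideos/train_images/label=0/Browse2frame0005.jpg"
print(build_export_path(src))
```

Because only the base name of the source path is kept, two source files with the same name in different directories would be written to the same output file. If that can happen in your data, consider including part of the directory in the output name.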

3.2 Apply the UDF to the DataFrame

display( 
  grayscaled_df.withColumn('output_path', export_as_jpg('path', 'grayscaled_content'))
)
 
#   path                                                                            modificationTime              length  output_path
1   dbfs:/databricks-datasets/cctvVideos/train_images/label=0/Browse2frame0005.jpg  2018-12-06T19:17:41.000+0000  53787   /dbfs/tmp/images/Browse2frame0005_proc.jpg
2   dbfs:/databricks-datasets/cctvVideos/train_images/label=0/Browse2frame0004.jpg  2018-12-06T19:17:41.000+0000  53307   /dbfs/tmp/images/Browse2frame0004_proc.jpg
3   dbfs:/databricks-datasets/cctvVideos/train_images/label=0/Browse2frame0003.jpg  2018-12-06T19:17:42.000+0000  53165   /dbfs/tmp/images/Browse2frame0003_proc.jpg
4   dbfs:/databricks-datasets/cctvVideos/train_images/label=0/Browse2frame0002.jpg  2018-12-06T19:17:42.000+0000  52801   /dbfs/tmp/images/Browse2frame0002_proc.jpg
5   dbfs:/databricks-datasets/cctvVideos/train_images/label=0/Browse2frame0001.jpg  2018-12-06T19:17:42.000+0000  52545   /dbfs/tmp/images/Browse2frame0001_proc.jpg
6   dbfs:/databricks-datasets/cctvVideos/train_images/label=0/Browse2frame0009.jpg  2018-12-06T19:17:41.000+0000  51129   /dbfs/tmp/images/Browse2frame0009_proc.jpg
7   dbfs:/databricks-datasets/cctvVideos/train_images/label=0/Browse2frame0008.jpg  2018-12-06T19:17:41.000+0000  50835   /dbfs/tmp/images/Browse2frame0008_proc.jpg
8   dbfs:/databricks-datasets/cctvVideos/train_images/label=0/Browse2frame0007.jpg  2018-12-06T19:17:41.000+0000  49379   /dbfs/tmp/images/Browse2frame0007_proc.jpg
9   dbfs:/databricks-datasets/cctvVideos/train_images/label=0/Browse2frame0006.jpg  2018-12-06T19:17:41.000+0000  45644   /dbfs/tmp/images/Browse2frame0006_proc.jpg
10  dbfs:/databricks-datasets/cctvVideos/train_images/label=0/Browse2frame0000.jpg  2018-12-06T19:17:41.000+0000  45025   /dbfs/tmp/images/Browse2frame0000_proc.jpg

(The binary columns content and grayscaled_content are not reproduced per row. The binary value displayed for every row begins with the same base64 text:
/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/wAALCAEgAYABAREA/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQ= (truncated))

Showing all 10 rows.
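
The `output_path` values are local file paths under the `/dbfs` mount, which is how the UDF's ordinary Python file I/O reaches DBFS. Spark readers usually address the same location with the `dbfs:/` scheme instead. The helper below is a small sketch of that mapping. The name `fuse_to_dbfs_uri` is hypothetical, and the conversion assumes the standard `/dbfs` mount point.

```python
def fuse_to_dbfs_uri(local_path: str) -> str:
    """Map a local /dbfs/... path to the corresponding dbfs:/... URI."""
    prefix = "/dbfs/"
    if not local_path.startswith(prefix):
        raise ValueError(f"not a /dbfs path: {local_path}")
    return "dbfs:/" + local_path[len(prefix):]

print(fuse_to_dbfs_uri("/dbfs/tmp/images/Browse2frame0005_proc.jpg"))
```

With this mapping, the exported files could be read back for a check with the same `binaryFile` reader used in section 1, pointed at `dbfs:/tmp/images/`.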
