Python: Improving Program Performance by Reading and Writing Line by Line

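Copyright notice: This is an original article by the blogger, licensed under CC 4.0 BY. When reposting, please include a link to the original source and this notice.
Article link: https://bluebird.blog.csdn.net/article/details/54378685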

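My laptop has 4 GB of RAM, with about 40% in use. Before leaving last night I started a program to process 300 MB of data. When I came back the next day it still had not finished, and I realized how serious the problem was.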

The problematic code:

def getTopModes():
    with open(file_sim, "r") as fd:
        # Problem (1): readlines() loads the whole file into memory at once.
        lis = [it.strip().split(",") for it in fd.readlines()]
    print("Read file successfully!")
    # get_dict_from_keylist is not shown in this post (presumably it comes from the author's utils module).
    user_browse_dict = dict(get_dict_from_keylist(list(set([il[0] for il in lis]))))
    print("Count begin!")
    for uid in user_browse_dict.keys():
        # Problem (2): the full list is scanned again for every single user.
        temp_user_lis = [[it[0], it[1]+","+it[2]] for it in lis if it[0] == uid]
        df = pd.DataFrame(temp_user_lis, columns = ["id", "actions"])
        temp_browse_dict = dict(list(df.groupby("actions")))
        for k, v in temp_browse_dict.items():
            temp_browse_dict[k] = len(v)
        temp_vlist = sorted(temp_browse_dict.items(), key = lambda x: x[1], reverse = True)[:3]
        user_browse_dict[uid] = temp_vlist
    print("Write begin!")
    with open("browse_history_train.pickle", "wb") as fp:
        pickle.dump(user_browse_dict, fp)
    with open(file_feature, "w") as fw:
        for k,v in user_browse_dict.items():
            tmp_v = ["(" + tv[0] + ")" + ":" + str(tv[1]) for tv in v]
            fw.write(k + ":" + ",".join(tmp_v) + "\n")
    print(user_browse_dict)
getTopModes()

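Problems with the code:

(1) Coding problem: fd.readlines() reads the entire file into memory at once, which takes up a lot of space.

(2) Logic problem: too much is processed in one go. The code looks concise, but it hides heavily nested loops (for every user, the whole list is scanned again), which wastes resources.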
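To make problem (1) concrete, here is a minimal sketch (written for Python 3) that compares the peak memory of readlines() with that of plain file iteration, using the standard tracemalloc module. The sample file, its size, and the helper names are assumptions made up for illustration; they are not the data or code from this post, and the exact byte counts will vary by machine.

```python
import os
import tempfile
import tracemalloc


def peak_bytes(func, path):
    """Return the peak number of bytes allocated while func(path) runs."""
    tracemalloc.start()
    func(path)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


def count_with_readlines(path):
    with open(path, "r") as fd:
        lines = fd.readlines()        # every line of the file is held in one list
    return len(lines)


def count_with_iteration(path):
    n = 0
    with open(path, "r") as fd:
        for _ in fd:                  # only the current line is kept alive
            n += 1
    return n


# Hypothetical sample data: 50,000 rows shaped like "uid,action,subaction".
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    for i in range(50000):
        tmp.write("%d,%d,%d\n" % (i // 10, i % 200, i % 7))
    sample_path = tmp.name

peak_readlines = peak_bytes(count_with_readlines, sample_path)
peak_iteration = peak_bytes(count_with_iteration, sample_path)
os.remove(sample_path)

print("readlines peak: %d bytes" % peak_readlines)
print("iteration peak: %d bytes" % peak_iteration)
```

The readlines() peak grows with the file size, while the iteration peak stays roughly at the size of the read buffer.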

Memory usage was as follows (the program stayed in the loop and could not exit, running for 12+ hours):


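Solution:

(1) Use the file iterator to read and process the file line by line.

(2) Adjust the logic: read one user's records sequentially and process that user as soon as they have all been read. (The earlier version searched the whole data set for each user's records, wasting the spatial locality the data already had.)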
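The idea in (2) can also be sketched with itertools.groupby from the standard library, which groups consecutive items that share a key. This is only an illustration in Python 3 and not the code used in this post; the helper name iter_users and the sample rows are hypothetical, and it relies on the same assumption as the post: all rows of a user are contiguous in the file.

```python
import io
from itertools import groupby


def iter_users(lines):
    """Yield (uid, rows) for each run of consecutive lines that share a uid."""
    # Generator expression: lines are parsed lazily, one at a time.
    parsed = (line.strip().split(",") for line in lines)
    for uid, rows in groupby(parsed, key=lambda row: row[0]):
        # Only the rows of the current user are materialized.
        yield uid, list(rows)


# Hypothetical sample standing in for the real file; rows of a user are contiguous.
sample = io.StringIO("u1,117,2\nu1,117,2\nu1,30,1\nu2,5,9\n")
groups = list(iter_users(sample))
for uid, rows in groups:
    print(uid, len(rows))
```

If the rows of one user were scattered through the file, groupby would yield that uid several times, so the input would have to be sorted by uid first.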

The revised code is as follows:

# coding: utf-8
import pickle

import pandas as pd

file_sim = "a.csv"
file_feature = "a.feature"
file_pickle = "a.pickle"

def flushUser(temp_user_lis, user_browse_dict, fw):
    # Count one user's actions, keep the top 3, and write them to the feature file.
    # The uid is taken from the buffered rows, not from the line that was just read.
    uid = temp_user_lis[0][0]
    df = pd.DataFrame(temp_user_lis, columns = ["id", "actions"])
    temp_browse_dict = {k: len(v) for k, v in df.groupby("actions")}
    # Sort on the integer count; sorting the stringified count would put "9" ahead of "10".
    top = sorted(temp_browse_dict.items(), key = lambda x: x[1], reverse = True)[:3]
    temp_vlist = [(k, str(n)) for k, n in top]     # e.g. ("117,2", "8")
    user_browse_dict[uid] = temp_vlist
    fw.write(uid + "," + ",".join([",".join(tv) for tv in temp_vlist]) + "\n")

def getTopModes():
    user_browse_dict = dict()
    temp_user_lis = []
    print("Count begin")
    with open(file_sim, "r") as fd, open(file_feature, "w") as fw:
        # Read line by line; the rows of one user are assumed to be contiguous.
        for line in fd:
            uid, action, subact = line.strip().split(",")
            if temp_user_lis and uid != temp_user_lis[-1][0]:
                # A new uid starts: process the previous user, then reset the buffer.
                flushUser(temp_user_lis, user_browse_dict, fw)
                temp_user_lis = []
            temp_user_lis.append([uid, action + "," + subact])
        # Process the last user.
        if temp_user_lis:
            flushUser(temp_user_lis, user_browse_dict, fw)
    # Dump the data.
    print("Dump begin!")
    with open(file_pickle, "wb") as fp:
        pickle.dump(user_browse_dict, fp)

getTopModes()

Memory usage is as follows (46.1 MB; the program has changed from memory-bound to CPU-bound):


The program finished in 361 s in total. Note that the baseline memory usage had already reached 46%.
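Since the program is now CPU-bound, one possible next step (not measured in this post, so treat it as an assumption) would be to replace the per-user DataFrame with collections.Counter, whose most_common method returns the most frequent items directly. A minimal Python 3 sketch, with a hypothetical helper top_modes and made-up action strings:

```python
from collections import Counter


def top_modes(actions, n=3):
    """Return the n most frequent actions as (action, count) pairs."""
    return Counter(actions).most_common(n)


# Hypothetical action strings of a single user, in the "action,subaction" form.
actions = ["117,2"] * 8 + ["30,1"] * 10 + ["5,9"] * 2 + ["7,7"]
top = top_modes(actions)
print(top)
```

The counts stay integers here, so the ordering is numeric; they only need to be converted to strings when the line is written to the feature file.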


Last published: 2017-01-12 16:22:14