
秒过分数线数据导入

chengjie, 2 weeks ago
Parent
Commit
88cd236ab1

+ 2 - 1
.gitignore

@@ -32,4 +32,5 @@ src/web_crawler/Throne-of-Magical-Arcana/Throne-of-Magical-Arcana_chapters_with_
32 32
 src/web_crawler/Throne-of-Magical-Arcana/Throne-of-Magical-Arcana.jpeg
33 33
 src/web_crawler/Throne-of-Magical-Arcana/Throne-of-Magical-Arcana.txt
34 34
 src/web_crawler/Throne-of-Magical-Arcana/Throne-of-Magical-Arcana.epub
35
-.gitignore
35
+__pycache__/
36
+*.py[cod]

+ 338 - 0
秒过分数线数据导入/README.md

@@ -0,0 +1,338 @@
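+# Shanghai Zhongkao Enrollment Plan and Score Import Guide
+
+This document records the requirements, steps, and caveats for the yearly job of extracting Shanghai Zhongkao (senior high school entrance exam) enrollment plans and scores from PDFs/images and importing them into the MySQL table `kylx365_db.MPS_Score`.  
+For 2026, items 1, 2, and 3 under "Plans" are done: independent enrollment (自主招生), quota-to-district (名额到区), and quota-to-school (名额到校).
+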
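+## Annual scope
+
+Two broad categories of data have to be processed every year.
+
+I. Plans (计划)
+
+1. Independent enrollment (自主招生)
+2. Quota-to-district (名额到区)
+3. Quota-to-school (名额到校)
+4. Choices 1-15 (1-15 志愿)
+
+II. Scores (成绩)
+
+1. Independent enrollment (自主招生)
+2. Quota-to-district (名额到区)
+3. Quota-to-school (名额到校)
+4. Choices 1-15 (1-15 志愿)
+
+Status for 2026:
+
+- Plans / independent enrollment: imported
+- Plans / quota-to-district: imported
+- Plans / quota-to-school: imported
+- Plans / choices 1-15: to be imported once the official file is published
+- Scores, all four types: expected to be imported after mid-July
+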
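+## Database and core tables
+
+Target database: `kylx365_db`
+
+Core tables:
+
+- `MPS_School`: the school table; it is the source of truth for all school information.
+- `MPS_Score`: the plan and score table; every import writes here.
+
+Common reference queries:
+
+```sql
+SELECT *
+FROM kylx365_db.MPS_School
+WHERE SchoolType1 = '高中';
+```
+
+```sql
+SELECT *
+FROM kylx365_db.MPS_Score
+WHERE ScoreYear = '2025'
+  AND ScoreType = '名额到校'
+  AND DistrictID = 1;
+```
+
+Do not put database connection details in this README or commit them to the repository. The current scripts connect through the PyMySQL driver using configuration that lives on the local machine; ideally this should be moved into a separate local configuration file or environment variables.
+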
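Review note: one way to follow the advice above is to read the connection settings from environment variables. The following is a minimal sketch, not code from this commit. The `MPS_DB_*` variable names and the `load_db_config` helper are hypothetical; only the `database` and `charset` values are taken from the scripts.

```python
import os


def load_db_config(env=None):
    """Build a PyMySQL-style connection dict from environment variables.

    The MPS_DB_* names are hypothetical; use whatever fits the deployment.
    """
    env = os.environ if env is None else env
    required = ("MPS_DB_HOST", "MPS_DB_PORT", "MPS_DB_USER", "MPS_DB_PASSWORD")
    missing = [name for name in required if name not in env]
    if missing:
        # Fail early with a clear message instead of a connection timeout.
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "host": env["MPS_DB_HOST"],
        "port": int(env["MPS_DB_PORT"]),
        "user": env["MPS_DB_USER"],
        "password": env["MPS_DB_PASSWORD"],
        "database": "kylx365_db",
        "charset": "utf8mb4",
    }
```

A script would then call `pymysql.connect(**load_db_config())`, so no secret has to appear in the repository.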
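+## DistrictID reference
+
+```text
+1  黄浦区 (Huangpu)
+2  徐汇区 (Xuhui)
+3  长宁区 (Changning)
+4  静安区 (Jing'an)
+5  普陀区 (Putuo)
+6  虹口区 (Hongkou)
+7  杨浦区 (Yangpu)
+8  闵行区 (Minhang)
+9  宝山区 (Baoshan)
+10 嘉定区 (Jiading)
+11 浦东新区 (Pudong New Area)
+12 金山区 (Jinshan)
+13 松江区 (Songjiang)
+14 青浦区 (Qingpu)
+15 奉贤区 (Fengxian)
+16 崇明区 (Chongming)
+```
+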
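+## MPS_Score write rules
+
+Plan imports generally write only plan counts, not scores.
+
+General field rules:
+
+- `ScoreYear`: the year, e.g. `2026`
+- `ScoreType`: one of `自主招生`, `名额到区`, `名额到校`, `1-15志愿`
+- `DistrictID`: the ID of the district
+- `SchoolTarget`: the senior high school's `MPS_School.ID`, written as a string
+- `SchoolFullName`: must be the `MPS_School.SchoolFullName` of that senior high school ID
+- `PlanNum`: planned enrollment count
+- `ScoreTotal`, `Score1`, `Score2`, `Score3`, `Score4`: `0` for plan imports
+- `ScoreTotalDifferenceValue`: `0` for plan imports
+- `PlanNumDifferenceValue`: this year's plan count minus last year's plan count for the same dimension
+- `OrderID`: `0` for the current plan imports
+- `SchoolNumber`, `SchoolNumber2`: empty string for the current plan imports
+- `SchoolOfGraduation1`: `"0"` for the current plan imports
+
+Additional rules for quota-to-school (名额到校):
+
+- `SchoolOfGraduation`: the junior high school's `MPS_School.ID`
+- `SchoolFullNameJunior`: must be the `MPS_School.SchoolFullName` of that junior high school ID
+- The business key of a row can be read as `ScoreYear + ScoreType + DistrictID + SchoolOfGraduation + SchoolTarget`.
+- Never write an abbreviated name from the PDF directly into `SchoolFullNameJunior` or `SchoolFullName`.
+
+Additional rules for independent enrollment (自主招生):
+
+- Regular independent enrollment is split into:
+  - `1学科` (academic subjects)
+  - `2体育` (sports)
+  - `3艺术` (arts)
+- International curriculum classes / Sino-foreign cooperative schools are split into:
+  - `4国际(本市)` (international, local students)
+  - `5国际(非本市)` (international, non-local students)
+- `SchoolTargetRemark2` can follow last year's remark for the same school and category; sports/arts rows usually reuse descriptions such as "市级优秀体育学生" (municipal-level outstanding sports students) and "市级艺术骨干学生" (municipal-level key arts students).
+
+Additional rules for quota-to-district (名额到区):
+
+- `SchoolOfGraduation = 0`
+- `SchoolFullNameJunior = NULL`
+- `SchoolTargetRemark = ""`
+- The dimension is "district + senior high school".
+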
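Review note: the `PlanNumDifferenceValue` rule and the quota-to-school business key can be sketched as below. This is an illustration under assumptions, not the import scripts' code. `business_key` and `plan_diff` are hypothetical helpers, and the sample numbers are made up. Treating a key that is missing from last year as 0 mirrors the `.get(..., 0)` default used by the import scripts in this commit.

```python
def business_key(district_id, junior_id, high_id):
    """Key within one ScoreYear + ScoreType: district, junior high ID, senior high ID."""
    # SchoolTarget is stored as a string; the other two parts are integers.
    return (int(district_id), int(junior_id), str(high_id))


def plan_diff(key, plan_num, previous_plan_nums):
    """This year's plan count minus last year's count for the same key (0 if absent)."""
    return plan_num - previous_plan_nums.get(key, 0)


# Made-up last-year data: one junior/senior pair in district 1 had 4 places.
previous = {business_key(1, 101, 2001): 4}

shrunk = plan_diff(business_key(1, 101, 2001), 3, previous)   # pair existed last year
brand_new = plan_diff(business_key(1, 102, 2001), 2, previous)  # pair is new this year
print(shrunk, brand_new)
```

For a pair that did not exist last year, the difference equals the full plan count, which is worth keeping in mind when reading large positive values.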
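+## Overall workflow
+
+Every category of data follows the same steps:
+
+1. Study last year's PDF and last year's database rows first, to confirm what each field means and what shape the written data takes.
+2. Read the new year's PDF/images. Prefer table extraction; fall back to OCR or manual reading only when table extraction fails or the PDF is really an image.
+3. Match schools first. Do not import uncertain rows; write them to the problem list.
+4. Do a dry run, or print the "ready" summary, and check row counts and plan totals per district.
+5. Insert new rows only; never delete or modify existing rows.
+6. After importing, query `MPS_Score` for total rows, total plan count, and the per-district summary.
+7. After fixing problem schools in `MPS_School`, run the backfill script, which inserts only missing rows and refreshes the problem list.
+
+Key principles:
+
+- Anything unclear stays out of the database and goes into the JSON problem list.
+- If one district has many parsing problems, leave the whole district alone and handle it separately after the others are done.
+- Every backfill must skip business keys that already exist, to avoid duplicate inserts.
+- For new or renamed schools, fix `MPS_School` first, then rematch and import.
+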
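Review note: the "skip existing business keys" principle can be sketched as a small filter. It follows the same idea as the loop in `fix_mps_score_school_quota_problems_2026.py`, but `select_missing` is a hypothetical helper written for illustration, and the sample keys and records are made up.

```python
def select_missing(candidates, existing_keys):
    """Return the records whose key is not present yet, without duplicates.

    candidates: iterable of (key, record) pairs.
    existing_keys: keys already stored in the database; not mutated here.
    """
    seen = set(existing_keys)
    fresh = []
    for key, record in candidates:
        if key in seen:
            continue
        seen.add(key)  # also guards against duplicates inside this batch
        fresh.append(record)
    return fresh


existing = {(1, 10, "20")}
batch = [
    ((1, 10, "20"), "already stored"),
    ((1, 11, "20"), "new row"),
    ((1, 11, "20"), "duplicate of the new row"),
]
print(select_missing(batch, existing))
```

Because the filter only ever adds rows, running a backfill twice in a row should insert nothing the second time.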
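+## PDF/image parsing notes
+
+Matching priority:
+
+1. A 6-digit school code is present: match by code first.
+2. A full school name is present: match on `SchoolFullName`.
+3. An abbreviation or alias is present: match on `SchoolShortName` and `SchoolOtherName`.
+4. Still no unique match: record it as a problem row.
+
+Common school-name problems:
+
+- PDFs use abbreviations, and junior high schools are abbreviated more often than senior high schools.
+- Schools get renamed, and the PDF may write "原名(现新名/校区)", i.e. "old name (now new name/campus)".
+- Some schools are new and not yet in the school table.
+- OCR can mix line breaks, spaces, serial numbers, and remarks into the school name.
+- In some PDF tables a school name is split across several lines; strip the line breaks before matching.
+
+Lessons from this round:
+
+- Senior high schools usually have a 6-digit code, so matching is relatively stable.
+- Quota-to-school involves many junior high schools, and their names cause the most trouble.
+- `SchoolOtherName` is a good place for a school's current name after a rename, or for its former names.
+- For text like "原名(现某某)" ("old name (now X)"), try the old name, the current name inside the parentheses, and the name with the parentheses removed.
+- When an image is clear, OCR or reading it by eye works, but the result must be turned into structured rows and then written using school-table IDs.
+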
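Review note: the advice on "原名(现某某)" text can be sketched as a function that produces the names to try, in order. This is an illustration under assumptions, not the logic in `research_mps_score_school_quota_2026.py` (which is not shown here). `name_candidates` is a hypothetical helper, the school names are placeholders, and reading "name with the parentheses removed" as "the text before the parentheses" is an interpretation.

```python
import re


def name_candidates(raw):
    """Return candidate names to try against the school table, in order."""
    # Drop whitespace and line breaks that OCR or table extraction mixed in.
    text = re.sub(r"\s+", "", raw)
    # Normalise full-width parentheses to ASCII ones.
    text = text.replace("(", "(").replace(")", ")")
    candidates = [text]
    match = re.fullmatch(r"(.+?)\((.+)\)", text)
    if match:
        outer, inner = match.groups()
        candidates.append(outer)  # old name, parenthetical removed
        if inner.startswith("现"):  # "现" means "now"
            inner = inner[1:]
        # "新名/校区" style text may hold several parts separated by "/".
        candidates.extend(part for part in inner.split("/") if part)
    # Preserve order while dropping duplicates.
    return list(dict.fromkeys(candidates))


print(name_candidates("甲中学\n(现乙中学)"))
```

Each candidate would then go through the code, full name, short name, and alias lookups; a row is only usable when exactly one school matches.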
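+## Current scripts
+
+The scripts fall into three groups: main-flow scripts, shared parsing/backfill scripts, and one-off 2026 supplement scripts. For future years, the main-flow and shared scripts can be copied with the year changed; the one-off supplement scripts mainly document the special handling done for 2026 and should not be run as-is against a new year.
+
+Independent enrollment (自主招生):
+
+- `import_mps_score_2026.py`
+  - Reads the 2026 independent enrollment plan PDF and the international curriculum class / Sino-foreign cooperative school PDF.
+  - Imports `ScoreType = '自主招生'`.
+  - Refuses to insert again if 2026 independent enrollment rows already exist.
+
+Quota-to-district (名额到区):
+
+- `import_mps_score_quota_2026.py`
+  - Reads the quota-to-district PDFs of all 16 districts.
+  - Supports `--dry-run`.
+  - Skips and reports any district that already has data.
+  - For districts that are images or fail to parse, use `import_mps_score_quota_manual_2026.py` for manual/OCR supplements.
+
+Quota-to-school (名额到校):
+
+- `research_mps_score_school_quota_2026.py`
+  - Handles school loading, name cleaning, PDF table parsing, and school matching.
+  - Supports matching by code, by full name/abbreviation/alias, by the "current name" inside parentheses, and by unique substring containment within the same district.
+
+- `import_mps_score_school_quota_2026.py`
+  - Main import script; reads the quota-to-school PDFs of all 16 districts.
+  - Supports `--dry-run`.
+  - Writes rows it cannot parse with certainty to `mps_score_school_quota_2026_problems.json`.
+
+- `import_mps_score_school_quota_supplement_2026.py`
+  - Supplements districts with special tables/OCR, such as Xuhui and Jiading.
+
+- `import_mps_score_school_quota_hongkou_2026.py`
+  - Handles the structured data read from the Hongkou image.
+
+- `fix_mps_score_school_quota_problems_2026.py`
+  - After schools are added or corrected in `MPS_School`, re-parses the problem districts and inserts whatever can now be matched.
+  - Skips `DistrictID + SchoolOfGraduation + SchoolTarget` combinations that already exist in the database.
+  - Rewrites `mps_score_school_quota_2026_problems.json`, keeping only the problems that are still unresolved.
+
+One-off 2026 supplement scripts:
+
+- `import_mps_score_quota_manual_2026.py`: manual backfill for the 2026 quota-to-district districts that needed image/OCR handling; not a general entry point for new years.
+- `import_mps_score_school_quota_hongkou_2026.py`: imports the hand-built matrix read from the 2026 Hongkou quota-to-school image; not a general entry point for new years.
+- `import_mps_score_school_quota_supplement_2026.py`: contains the hand-built 2026 Xuhui matrix and the special Jiading PDF parsing; `collect_jiading` is still imported by `fix_mps_score_school_quota_problems_2026.py`, so do not delete this file on its own.
+
+Generated files:
+
+- `__pycache__/` and `*.pyc` are Python runtime caches, not business data or scripts; the main repository's `.gitignore` already ignores them.
+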
+## 2026 results so far
+
+Plans / independent enrollment:
+
+- `ScoreYear = 2026`
+- `ScoreType = 自主招生`
+- 265 rows imported
+- Plan total: 7813
+
+Plans / quota-to-district:
+
+- `ScoreYear = 2026`
+- `ScoreType = 名额到区`
+- 947 rows imported
+- Plan total: 7171
+
+Plans / quota-to-school:
+
+- `ScoreYear = 2026`
+- `ScoreType = 名额到校`
+- 3892 rows imported
+- Plan total: 12833
+- The problem list `mps_score_school_quota_2026_problems.json` is now empty
+
+Final per-district summary for 2026 quota-to-school:
+
+| DistrictID | District | Rows | Plan count |
+| --- | --- | ---: | ---: |
+| 1 | 黄浦区 (Huangpu) | 217 | 996 |
+| 2 | 徐汇区 (Xuhui) | 221 | 899 |
+| 3 | 长宁区 (Changning) | 63 | 418 |
+| 4 | 静安区 (Jing'an) | 271 | 1102 |
+| 5 | 普陀区 (Putuo) | 179 | 736 |
+| 6 | 虹口区 (Hongkou) | 80 | 488 |
+| 7 | 杨浦区 (Yangpu) | 144 | 707 |
+| 8 | 闵行区 (Minhang) | 460 | 1290 |
+| 9 | 宝山区 (Baoshan) | 348 | 1076 |
+| 10 | 嘉定区 (Jiading) | 130 | 612 |
+| 11 | 浦东新区 (Pudong New Area) | 1259 | 2082 |
+| 12 | 金山区 (Jinshan) | 56 | 355 |
+| 13 | 松江区 (Songjiang) | 190 | 779 |
+| 14 | 青浦区 (Qingpu) | 93 | 725 |
+| 15 | 奉贤区 (Fengxian) | 131 | 345 |
+| 16 | 崇明区 (Chongming) | 50 | 223 |
+
+## Common verification SQL
+
+Totals:
+
+```sql
+SELECT COUNT(*) AS c, SUM(PlanNum) AS total
+FROM MPS_Score
+WHERE ScoreYear = '2026'
+  AND ScoreType = '名额到校';
+```
+
+Per district:
+
+```sql
+SELECT DistrictID, COUNT(*) AS c, SUM(PlanNum) AS total
+FROM MPS_Score
+WHERE ScoreYear = '2026'
+  AND ScoreType = '名额到校'
+GROUP BY DistrictID
+ORDER BY DistrictID;
+```
+
+Last year's reference rows for one district:
+
+```sql
+SELECT *
+FROM MPS_Score
+WHERE ScoreYear = '2025'
+  AND ScoreType = '名额到校'
+  AND DistrictID = 1
+ORDER BY ID;
+```
+
+Look up a junior high school by name:
+
+```sql
+SELECT ID, DistrictID, SchoolNumber, SchoolFullName, SchoolShortName, SchoolOtherName
+FROM MPS_School
+WHERE SchoolType1 = '初中'
+  AND (
+    SchoolFullName LIKE '%school-name-keyword%'
+    OR SchoolShortName LIKE '%school-name-keyword%'
+    OR SchoolOtherName LIKE '%school-name-keyword%'
+  );
+```
+
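Review note: the result of the per-district query can be checked against the expected grand totals in a few lines. This is a sketch under assumptions: `check_totals` is a hypothetical helper, and the rows are assumed to arrive as dicts with the `DistrictID`, `c`, and `total` keys that a PyMySQL `DictCursor` would return for that query. The sample rows reuse two lines of the 2026 summary table.

```python
def check_totals(district_rows, expected_rows, expected_plan):
    """Compare per-district counts against the expected grand totals.

    Returns a list of mismatch messages; an empty list means the totals agree.
    """
    problems = []
    rows = sum(int(row["c"]) for row in district_rows)
    # SUM() can come back as NULL/None for an empty group, hence the "or 0".
    plan = sum(int(row["total"] or 0) for row in district_rows)
    if rows != expected_rows:
        problems.append(f"row count {rows} != expected {expected_rows}")
    if plan != expected_plan:
        problems.append(f"plan total {plan} != expected {expected_plan}")
    return problems


sample = [
    {"DistrictID": 12, "c": 56, "total": 355},
    {"DistrictID": 16, "c": 50, "total": 223},
]
print(check_totals(sample, expected_rows=106, expected_plan=578))
```

Run against all 16 districts, the expected values for 2026 quota-to-school would be 3892 rows and a plan total of 12833.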
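+## What to change when copying the scripts next year
+
+After copying a script from 2026 to a new year, check at least these constants:
+
+- `YEAR`
+- `PREVIOUS_YEAR`
+- `BASE_DIR`
+- PDF file names
+- The problem JSON file name
+- Senior high school codes, junior high school codes, and plan matrices in the manual-data scripts for special districts
+- The international curriculum class PDF name in the independent enrollment script
+- Year literals hardcoded in SQL and file-name strings, for example `ScoreYear = '2026'` and `2026名额到校…pdf` in `fix_mps_score_school_quota_problems_2026.py`
+
+Before importing, confirm that the target year and type have no existing rows, or that the script explicitly supports skipping/backfilling.  
+Do not delete old database rows just to rerun a script, unless a redo has been explicitly confirmed and a backup exists.
+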
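Review note: most of the checklist above shrinks if the year-dependent names are derived from a single value. A minimal sketch, assuming the naming patterns seen in the 2026 scripts stay the same; `year_settings` and `district_pdf_name` are hypothetical helpers, not functions from this commit.

```python
def year_settings(year):
    """Derive the year-dependent names used across the scripts from one value."""
    year = str(year)
    return {
        "YEAR": year,
        "PREVIOUS_YEAR": str(int(year) - 1),
        "problem_file": f"mps_score_school_quota_{year}_problems.json",
    }


def district_pdf_name(year, score_type, district_name):
    """File name pattern seen in the 2026 scripts, e.g. 2026名额到区黄浦区.pdf."""
    return f"{year}{score_type}{district_name}.pdf"


settings = year_settings(2027)
print(settings["PREVIOUS_YEAR"], district_pdf_name(settings["YEAR"], "名额到区", "黄浦区"))
```

The hand-built matrices and special-district scripts still need manual review each year; only the mechanical renames are covered here.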
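+## To do
+
+Plans / choices 1-15:
+
+- Process once the 2026 official file is published.
+- Study the 2025 PDF and how it was written to the database first, then decide on `ScoreType`, fields, dimension, and the difference calculation.
+
+Score imports:
+
+- Expected to start after mid-July.
+- For all four score types, study last year's data first.
+- Score imports involve `ScoreTotal`, `Score1`, `Score2`, `Score3`, `Score4`, and related fields, so the plan-import rule of filling everything with 0 does not apply.
+- Before importing scores, pin down the meaning of each score column, how absent/no-score/not-admitted cases are represented, and whether differences need to be computed.
+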

+ 140 - 0
秒过分数线数据导入/fix_mps_score_school_quota_problems_2026.py

@@ -0,0 +1,140 @@
1
+import json
2
+import os
3
+import sys
4
+
5
+sys.path.insert(0, "/private/tmp/codex_mysql_driver")
6
+import pymysql  # noqa: E402
7
+
8
+import research_mps_score_school_quota_2026 as parser  # noqa: E402
9
+from import_mps_score_school_quota_2026 import (  # noqa: E402
10
+    INSERT_COLUMNS,
11
+    PROBLEM_FILE,
12
+    build_record,
13
+    load_previous_plan_nums,
14
+)
15
+from import_mps_score_school_quota_supplement_2026 import collect_jiading  # noqa: E402
16
+
17
+
18
+DISTRICTS_TO_FIX = [7, 8, 9, 10, 11, 12, 13, 14, 15]
19
+
20
+
21
+def existing_keys(cursor):
22
+    cursor.execute(
23
+        """
24
+        SELECT DistrictID, SchoolOfGraduation, SchoolTarget
25
+        FROM MPS_Score
26
+        WHERE ScoreYear = '2026' AND ScoreType = '名额到校'
27
+        """
28
+    )
29
+    return {
30
+        (int(row["DistrictID"]), int(row["SchoolOfGraduation"]), str(row["SchoolTarget"]))
31
+        for row in cursor.fetchall()
32
+    }
33
+
34
+
35
+def insert_records(cursor, records):
36
+    if not records:
37
+        return 0
38
+    columns = ", ".join(INSERT_COLUMNS)
39
+    placeholders = ", ".join(["%s"] * len(INSERT_COLUMNS))
40
+    sql = f"INSERT INTO MPS_Score ({columns}) VALUES ({placeholders})"
41
+    cursor.executemany(sql, [[row[column] for column in INSERT_COLUMNS] for row in records])
42
+    return len(records)
43
+
44
+
45
+def problem_to_json(problem):
46
+    try:
47
+        raw, high_method, junior_method = problem
48
+        return {"raw": raw, "high_match": high_method, "junior_match": junior_method}
49
+    except (TypeError, ValueError):  # not a 3-item sequence
50
+        return {"raw": repr(problem)}
51
+
52
+
53
+def collect_regular(district_id, high_by_code, high_by_name, junior_by_code, junior_by_name):
54
+    district_name = parser.DISTRICTS[district_id]
55
+    path = os.path.join(parser.BASE_DIR, f"2026名额到校{district_name}.pdf")
56
+    rows, problems = parser.parse_tables(
57
+        path, district_id, high_by_code, high_by_name, junior_by_code, junior_by_name
58
+    )
59
+    return rows, [problem_to_json(item) for item in problems]
60
+
61
+
62
+def main():
63
+    conn = pymysql.connect(**parser.DB_CONFIG)
64
+    try:
65
+        with conn.cursor(pymysql.cursors.DictCursor) as cursor:
66
+            high_by_code, high_by_name, _ = parser.load_schools(cursor, "高中")
67
+            junior_by_code, junior_by_name, _ = parser.load_schools(cursor, "初中")
68
+            previous = load_previous_plan_nums(cursor)
69
+            keys = existing_keys(cursor)
70
+
71
+            all_records = []
72
+            remaining = {}
73
+            inserted_summary = {}
74
+
75
+            for district_id in DISTRICTS_TO_FIX:
76
+                if district_id == 10:
77
+                    rows, problems = collect_jiading(high_by_code, high_by_name, junior_by_name)
78
+                    json_problems = problems
79
+                else:
80
+                    rows, json_problems = collect_regular(
81
+                        district_id, high_by_code, high_by_name, junior_by_code, junior_by_name
82
+                    )
83
+
84
+                new_records = []
85
+                for row in rows:
86
+                    junior, high, _plan_num, _junior_method, _high_method = row
87
+                    key = (district_id, int(junior["ID"]), str(high["ID"]))
88
+                    if key in keys:
89
+                        continue
90
+                    keys.add(key)
91
+                    new_records.append(build_record(district_id, row, previous))
92
+
93
+                inserted_summary[str(district_id)] = {
94
+                    "district": parser.DISTRICTS[district_id],
95
+                    "rows": len(new_records),
96
+                    "plan": sum(row["PlanNum"] for row in new_records),
97
+                }
98
+                all_records.extend(new_records)
99
+
100
+                if json_problems:
101
+                    remaining[str(district_id)] = {
102
+                        "district": parser.DISTRICTS[district_id],
103
+                        "status": "partial",
104
+                        "file": os.path.join(
105
+                            parser.BASE_DIR, f"2026名额到校{parser.DISTRICTS[district_id]}.pdf"
106
+                        ),
107
+                        "problems": json_problems,
108
+                    }
109
+
110
+            inserted = insert_records(cursor, all_records)
111
+            conn.commit()
112
+
113
+            with open(PROBLEM_FILE, "w", encoding="utf-8") as handle:
114
+                json.dump(remaining, handle, ensure_ascii=False, indent=2, default=str)
115
+                handle.write("\n")
116
+
117
+            print("inserted", inserted)
118
+            print("inserted_summary", json.dumps(inserted_summary, ensure_ascii=False, default=str))
119
+            print(
120
+                "remaining_summary",
121
+                json.dumps(
122
+                    {
123
+                        key: {
124
+                            "district": value["district"],
125
+                            "problem_count": len(value.get("problems", [])),
126
+                        }
127
+                        for key, value in remaining.items()
128
+                    },
129
+                    ensure_ascii=False,
130
+                ),
131
+            )
132
+    except Exception:
133
+        conn.rollback()
134
+        raise
135
+    finally:
136
+        conn.close()
137
+
138
+
139
+if __name__ == "__main__":
140
+    main()

+ 350 - 0
秒过分数线数据导入/import_mps_score_2026.py

@@ -0,0 +1,350 @@
+import os
+import re
+import sys
+from collections import defaultdict
+
+import pdfplumber
+
+sys.path.insert(0, "/private/tmp/codex_mysql_driver")
+import pymysql  # noqa: E402
+
+
+# Connection details come from the environment so that no credentials are
+# committed. The MPS_DB_* variable names are an assumption of this edit.
+DB_CONFIG = {
+    "host": os.environ["MPS_DB_HOST"],
+    "port": int(os.environ["MPS_DB_PORT"]),
+    "user": os.environ["MPS_DB_USER"],
+    "password": os.environ["MPS_DB_PASSWORD"],
+    "database": "kylx365_db",
+    "charset": "utf8mb4",
+    "connect_timeout": 10,
+    "read_timeout": 30,
+    "write_timeout": 30,
+}
22
+
23
+YEAR = "2026"
24
+SCORE_TYPE = "自主招生"
25
+
26
+AUTONOMOUS_PDF = (
27
+    "/Volumes/程杰外接SD盘/上海中考招生计划/2026/计划/自主招生/"
28
+    "2026年上海市高中自主招生计划 .pdf"
29
+)
30
+INTERNATIONAL_PDF = (
31
+    "/Volumes/程杰外接SD盘/上海中考招生计划/2026/计划/自主招生/"
32
+    "2026年上海市高中国际课程班和中外合作办学学校招生计划.pdf"
33
+)
34
+
35
+INSERT_COLUMNS = [
36
+    "ScoreYear",
37
+    "ScoreType",
38
+    "DistrictID",
39
+    "SchoolOfGraduation",
40
+    "SchoolFullNameJunior",
41
+    "SchoolTarget",
42
+    "SchoolFullName",
43
+    "SchoolTargetRemark",
44
+    "PlanNum",
45
+    "ScoreTotal",
46
+    "Score1",
47
+    "Score2",
48
+    "Score3",
49
+    "Score4",
50
+    "SchoolTargetRemark2",
51
+    "PlanNumDifferenceValue",
52
+    "ScoreTotalDifferenceValue",
53
+    "OrderID",
54
+    "SchoolNumber",
55
+    "SchoolNumber2",
56
+    "SchoolOfGraduation1",
57
+]
58
+
59
+
60
+def clean_code(value):
61
+    match = re.search(r"\d{6}", str(value or ""))
62
+    return match.group(0) if match else None
63
+
64
+
65
+def clean_num(value):
66
+    nums = re.findall(r"\d+", str(value or ""))
67
+    return int(nums[-1]) if nums else None
68
+
69
+
70
+def connect():
71
+    return pymysql.connect(**DB_CONFIG)
72
+
73
+
74
+def load_schools(cursor):
75
+    cursor.execute(
76
+        """
77
+        SELECT ID, DistrictID, SchoolNumber, SchoolFullName, SchoolType1, SchoolType2
78
+        FROM MPS_School
79
+        WHERE SchoolType1 = '高中' AND SchoolNumber IS NOT NULL AND SchoolNumber <> ''
80
+        """
81
+    )
82
+    by_code = defaultdict(list)
83
+    for row in cursor.fetchall():
84
+        by_code[row["SchoolNumber"]].append(row)
85
+
86
+    schools = {}
87
+    for code, rows in by_code.items():
88
+        rows.sort(
89
+            key=lambda row: (
90
+                row["SchoolType2"] is None,
91
+                row["SchoolType2"] == "",
92
+                row["ID"],
93
+            )
94
+        )
95
+        schools[code] = rows[0]
96
+    return schools
97
+
98
+
+def load_2025_remarks(cursor):
+    # Despite the name, this reads the year before YEAR, so the SQL needs no
+    # edit when the script is copied for a new year.
+    cursor.execute(
+        """
+        SELECT SchoolTarget, SchoolTargetRemark, SchoolTargetRemark2
+        FROM MPS_Score
+        WHERE ScoreYear = %s
+          AND ScoreType = %s
+          AND SchoolTargetRemark IN ('2体育', '3艺术')
+        """,
+        (str(int(YEAR) - 1), SCORE_TYPE),
+    )
+    return {
+        (str(row["SchoolTarget"]), row["SchoolTargetRemark"]): row["SchoolTargetRemark2"]
+        for row in cursor.fetchall()
+    }
+
+
+def load_previous_plan_nums(cursor):
+    # Last year's plan counts, keyed by (SchoolTarget, SchoolTargetRemark).
+    cursor.execute(
+        """
+        SELECT SchoolTarget, SchoolTargetRemark, PlanNum
+        FROM MPS_Score
+        WHERE ScoreYear = %s AND ScoreType = %s
+        """,
+        (str(int(YEAR) - 1), SCORE_TYPE),
+    )
+    return {
+        (str(row["SchoolTarget"]), row["SchoolTargetRemark"]): int(row["PlanNum"] or 0)
+        for row in cursor.fetchall()
+    }
127
+
128
+
129
+def parse_autonomous_pdf(path, schools):
130
+    rows = []
131
+    missing_codes = []
132
+    with pdfplumber.open(path) as pdf:
133
+        for page in pdf.pages:
134
+            for table in page.extract_tables():
135
+                for raw in table:
136
+                    code = clean_code(raw[1] if len(raw) > 1 else None)
137
+                    if not code:
138
+                        continue
139
+                    school = schools.get(code)
140
+                    if not school:
141
+                        missing_codes.append((code, raw))
142
+                        continue
143
+                    total, sport, art = [clean_num(cell) for cell in raw[-3:]]
144
+                    sport = sport or 0
145
+                    art = art or 0
146
+                    total = total or 0
147
+                    rows.append(
148
+                        {
149
+                            "school": school,
150
+                            "total": total,
151
+                            "subject": total - sport - art,
152
+                            "sport": sport,
153
+                            "art": art,
154
+                        }
155
+                    )
156
+    return rows, missing_codes
157
+
158
+
159
+def parse_international_pdf(path, schools):
160
+    rows = []
161
+    missing_codes = []
162
+    with pdfplumber.open(path) as pdf:
163
+        for page in pdf.pages:
164
+            for table in page.extract_tables():
165
+                for raw in table:
166
+                    code = clean_code(raw[1] if len(raw) > 1 else None)
167
+                    if not code:
168
+                        continue
169
+                    school = schools.get(code)
170
+                    if not school:
171
+                        missing_codes.append((code, raw))
172
+                        continue
173
+                    total, local, nonlocal_plan = [clean_num(cell) for cell in raw[-3:]]
174
+                    rows.append(
175
+                        {
176
+                            "school": school,
177
+                            "total": total or 0,
178
+                            "local": local or 0,
179
+                            "nonlocal": nonlocal_plan or 0,
180
+                        }
181
+                    )
182
+    return rows, missing_codes
183
+
184
+
185
+def build_record(school, remark, plan_num, remark2, previous_plan_nums):
186
+    school_target = str(school["ID"])
187
+    previous = previous_plan_nums.get((school_target, remark), 0)
188
+    return {
189
+        "ScoreYear": YEAR,
190
+        "ScoreType": SCORE_TYPE,
191
+        "DistrictID": school["DistrictID"],
192
+        "SchoolOfGraduation": 0,
193
+        "SchoolFullNameJunior": None,
194
+        "SchoolTarget": school_target,
195
+        "SchoolFullName": school["SchoolFullName"],
196
+        "SchoolTargetRemark": remark,
197
+        "PlanNum": plan_num,
198
+        "ScoreTotal": 0,
199
+        "Score1": 0,
200
+        "Score2": 0,
201
+        "Score3": 0,
202
+        "Score4": 0,
203
+        "SchoolTargetRemark2": remark2 or "",
204
+        "PlanNumDifferenceValue": plan_num - previous,
205
+        "ScoreTotalDifferenceValue": 0,
206
+        "OrderID": 0,
207
+        "SchoolNumber": "",
208
+        "SchoolNumber2": "",
209
+        "SchoolOfGraduation1": "0",
210
+    }
211
+
212
+
213
+def build_records(autonomous_rows, international_rows, previous_plan_nums, remark_2025):
214
+    records = []
215
+
216
+    for item in autonomous_rows:
217
+        school = item["school"]
218
+        records.append(
219
+            build_record(
220
+                school,
221
+                "1学科",
222
+                item["subject"],
223
+                school["SchoolType2"],
224
+                previous_plan_nums,
225
+            )
226
+        )
227
+        if item["sport"] > 0:
228
+            records.append(
229
+                build_record(
230
+                    school,
231
+                    "2体育",
232
+                    item["sport"],
233
+                    remark_2025.get((str(school["ID"]), "2体育"), "市级优秀体育学生"),
234
+                    previous_plan_nums,
235
+                )
236
+            )
237
+        if item["art"] > 0:
238
+            records.append(
239
+                build_record(
240
+                    school,
241
+                    "3艺术",
242
+                    item["art"],
243
+                    remark_2025.get((str(school["ID"]), "3艺术"), "市级艺术骨干学生"),
244
+                    previous_plan_nums,
245
+                )
246
+            )
247
+
248
+    for item in international_rows:
249
+        school = item["school"]
250
+        if item["local"] > 0:
251
+            records.append(
252
+                build_record(
253
+                    school,
254
+                    "4国际(本市)",
255
+                    item["local"],
256
+                    "国际课程班/中外合作办学高中",
257
+                    previous_plan_nums,
258
+                )
259
+            )
260
+        if item["nonlocal"] > 0:
261
+            records.append(
262
+                build_record(
263
+                    school,
264
+                    "5国际(非本市)",
265
+                    item["nonlocal"],
266
+                    "国际课程班/中外合作办学高中",
267
+                    previous_plan_nums,
268
+                )
269
+            )
270
+    return records
271
+
272
+
273
+def summarize(records):
274
+    summary = defaultdict(lambda: {"count": 0, "plan": 0})
275
+    for row in records:
276
+        bucket = summary[row["SchoolTargetRemark"]]
277
+        bucket["count"] += 1
278
+        bucket["plan"] += row["PlanNum"]
279
+    return dict(sorted(summary.items()))
280
+
281
+
282
+def main():
283
+    conn = connect()
284
+    try:
285
+        with conn.cursor(pymysql.cursors.DictCursor) as cursor:
286
+            cursor.execute(
287
+                """
288
+                SELECT COUNT(*) AS count
289
+                FROM MPS_Score
290
+                WHERE ScoreYear = %s AND ScoreType = %s
291
+                """,
292
+                (YEAR, SCORE_TYPE),
293
+            )
294
+            existing_count = cursor.fetchone()["count"]
295
+            if existing_count:
296
+                raise RuntimeError(
297
+                    f"Refusing to insert: {YEAR} {SCORE_TYPE} already has {existing_count} rows."
298
+                )
299
+
300
+            schools = load_schools(cursor)
301
+            previous_plan_nums = load_previous_plan_nums(cursor)
302
+            remark_2025 = load_2025_remarks(cursor)
303
+
304
+            autonomous_rows, missing_autonomous = parse_autonomous_pdf(AUTONOMOUS_PDF, schools)
305
+            international_rows, missing_international = parse_international_pdf(
306
+                INTERNATIONAL_PDF, schools
307
+            )
308
+            missing = missing_autonomous + missing_international
309
+            if missing:
310
+                raise RuntimeError(f"Missing school codes: {missing[:10]}")
311
+
312
+            records = build_records(
313
+                autonomous_rows,
314
+                international_rows,
315
+                previous_plan_nums,
316
+                remark_2025,
317
+            )
318
+
319
+            print("autonomous_pdf_rows", len(autonomous_rows))
320
+            print("international_pdf_rows", len(international_rows))
321
+            print("insert_rows", len(records))
322
+            print("summary", summarize(records))
323
+
324
+            placeholders = ", ".join(["%s"] * len(INSERT_COLUMNS))
325
+            columns = ", ".join(INSERT_COLUMNS)
326
+            sql = f"INSERT INTO MPS_Score ({columns}) VALUES ({placeholders})"
327
+            values = [[row[column] for column in INSERT_COLUMNS] for row in records]
328
+            cursor.executemany(sql, values)
329
+            conn.commit()
330
+
331
+            cursor.execute(
332
+                """
333
+                SELECT SchoolTargetRemark, COUNT(*) AS count, SUM(PlanNum) AS plan
334
+                FROM MPS_Score
335
+                WHERE ScoreYear = %s AND ScoreType = %s
336
+                GROUP BY SchoolTargetRemark
337
+                ORDER BY SchoolTargetRemark
338
+                """,
339
+                (YEAR, SCORE_TYPE),
340
+            )
341
+            print("db_summary", cursor.fetchall())
342
+    except Exception:
343
+        conn.rollback()
344
+        raise
345
+    finally:
346
+        conn.close()
347
+
348
+
349
+if __name__ == "__main__":
350
+    main()

+ 289 - 0
秒过分数线数据导入/import_mps_score_quota_2026.py

@@ -0,0 +1,289 @@
1
+import argparse
2
+import os
3
+import re
4
+import sys
5
+from collections import defaultdict
6
+
7
+import pdfplumber
8
+
9
+sys.path.insert(0, "/private/tmp/codex_mysql_driver")
10
+import pymysql  # noqa: E402
11
+
12
+
13
+DB_CONFIG = {
14
+    "host": "589ae8e08493d.sh.cdb.myqcloud.com",
15
+    "port": 8124,
16
+    "user": "cdb_outerroot",
17
+    "password": "kylx!@#!QAZ@WSX",
18
+    "database": "kylx365_db",
19
+    "charset": "utf8mb4",
20
+    "connect_timeout": 10,
21
+    "read_timeout": 30,
22
+    "write_timeout": 30,
23
+}
24
+
25
+YEAR = "2026"
26
+PREVIOUS_YEAR = "2025"
27
+SCORE_TYPE = "名额到区"
28
+BASE_DIR = "/Volumes/程杰外接SD盘/上海中考招生计划/2026/计划/名额到区"
29
+
30
+DISTRICTS = {
31
+    1: "黄浦区",
32
+    2: "徐汇区",
33
+    3: "长宁区",
34
+    4: "静安区",
35
+    5: "普陀区",
36
+    6: "虹口区",
37
+    7: "杨浦区",
38
+    8: "闵行区",
39
+    9: "宝山区",
40
+    10: "嘉定区",
41
+    11: "浦东新区",
42
+    12: "金山区",
43
+    13: "松江区",
44
+    14: "青浦区",
45
+    15: "奉贤区",
46
+    16: "崇明区",
47
+}
48
+
49
+INSERT_COLUMNS = [
50
+    "ScoreYear",
51
+    "ScoreType",
52
+    "DistrictID",
53
+    "SchoolOfGraduation",
54
+    "SchoolFullNameJunior",
55
+    "SchoolTarget",
56
+    "SchoolFullName",
57
+    "SchoolTargetRemark",
58
+    "PlanNum",
59
+    "ScoreTotal",
60
+    "Score1",
61
+    "Score2",
62
+    "Score3",
63
+    "Score4",
64
+    "SchoolTargetRemark2",
65
+    "PlanNumDifferenceValue",
66
+    "ScoreTotalDifferenceValue",
67
+    "OrderID",
68
+    "SchoolNumber",
69
+    "SchoolNumber2",
70
+    "SchoolOfGraduation1",
71
+]
72
+
73
+
74
+def clean_code(value):
75
+    match = re.search(r"\d{6}", str(value or ""))
76
+    return match.group(0) if match else None
77
+
78
+
79
+def clean_num(value):
80
+    nums = re.findall(r"\d+", str(value or ""))
81
+    return int(nums[-1]) if nums else None
82
+
83
+
84
+def connect():
85
+    return pymysql.connect(**DB_CONFIG)
86
+
87
+
88
+def district_file_name(district_name):
89
+    return f"{YEAR}名额到区{district_name}.pdf"
90
+
91
+
92
+def load_schools(cursor):
93
+    cursor.execute(
94
+        """
95
+        SELECT ID, DistrictID, SchoolNumber, SchoolFullName, SchoolType1, SchoolType2
96
+        FROM MPS_School
97
+        WHERE SchoolType1 = '高中' AND SchoolNumber IS NOT NULL AND SchoolNumber <> ''
98
+        """
99
+    )
100
+    by_code = defaultdict(list)
101
+    for row in cursor.fetchall():
102
+        by_code[row["SchoolNumber"]].append(row)
103
+
104
+    schools = {}
105
+    for code, rows in by_code.items():
106
+        rows.sort(
107
+            key=lambda row: (
108
+                row["SchoolType2"] is None,
109
+                row["SchoolType2"] == "",
110
+                row["ID"],
111
+            )
112
+        )
113
+        schools[code] = rows[0]
114
+    return schools
115
+
116
+
117
+def load_previous_plan_nums(cursor):
118
+    cursor.execute(
119
+        """
120
+        SELECT DistrictID, SchoolTarget, PlanNum
121
+        FROM MPS_Score
122
+        WHERE ScoreYear = %s AND ScoreType = %s
123
+        """,
124
+        (PREVIOUS_YEAR, SCORE_TYPE),
125
+    )
126
+    return {
127
+        (int(row["DistrictID"]), str(row["SchoolTarget"])): int(row["PlanNum"] or 0)
128
+        for row in cursor.fetchall()
129
+    }
130
+
131
+
132
+def parse_pdf(path, district_name, schools):
133
+    rows = []
134
+    missing = []
135
+    with pdfplumber.open(path) as pdf:
136
+        for page in pdf.pages:
137
+            for table in page.extract_tables():
138
+                for raw in table:
139
+                    code = clean_code(raw[1] if len(raw) > 1 else None)
140
+                    if not code:
141
+                        continue
142
+                    plan_num = clean_num(raw[-1] if raw else None)
143
+                    if plan_num is None:
144
+                        missing.append(("plan_num", raw))
145
+                        continue
146
+                    raw_text = " ".join(str(cell or "") for cell in raw)
147
+                    # The full name contains its own first two characters, so one check suffices.
+                    if district_name[:2] not in raw_text:
148
+                        missing.append(("district_mismatch", raw))
149
+                        continue
150
+                    school = schools.get(code)
151
+                    if not school:
152
+                        missing.append(("school_code", raw))
153
+                        continue
154
+                    rows.append({"school": school, "plan_num": plan_num, "raw": raw})
155
+    return rows, missing
156
+
157
+
158
+def build_record(district_id, parsed_row, previous_plan_nums):
159
+    school = parsed_row["school"]
160
+    school_target = str(school["ID"])
161
+    previous = previous_plan_nums.get((district_id, school_target), 0)
162
+    plan_num = parsed_row["plan_num"]
163
+    return {
164
+        "ScoreYear": YEAR,
165
+        "ScoreType": SCORE_TYPE,
166
+        "DistrictID": district_id,
167
+        "SchoolOfGraduation": 0,
168
+        "SchoolFullNameJunior": None,
169
+        "SchoolTarget": school_target,
170
+        "SchoolFullName": school["SchoolFullName"],
171
+        "SchoolTargetRemark": "",
172
+        "PlanNum": plan_num,
173
+        "ScoreTotal": 0,
174
+        "Score1": 0,
175
+        "Score2": 0,
176
+        "Score3": 0,
177
+        "Score4": 0,
178
+        "SchoolTargetRemark2": None,
179
+        "PlanNumDifferenceValue": plan_num - previous,
180
+        "ScoreTotalDifferenceValue": 0,
181
+        "OrderID": 0,
182
+        "SchoolNumber": "",
183
+        "SchoolNumber2": "",
184
+        "SchoolOfGraduation1": "0",
185
+    }
186
+
187
+
188
+def collect_records(cursor):
189
+    schools = load_schools(cursor)
190
+    previous_plan_nums = load_previous_plan_nums(cursor)
191
+    records_by_district = {}
192
+    problems = {}
193
+
194
+    for district_id, district_name in DISTRICTS.items():
195
+        cursor.execute(
196
+            """
197
+            SELECT COUNT(*) AS count
198
+            FROM MPS_Score
199
+            WHERE ScoreYear = %s AND ScoreType = %s AND DistrictID = %s
200
+            """,
201
+            (YEAR, SCORE_TYPE, district_id),
202
+        )
203
+        existing = cursor.fetchone()["count"]
204
+        if existing:
205
+            problems[district_id] = f"already has {existing} rows"
206
+            continue
207
+
208
+        pdf_name = district_file_name(district_name)
209
+        path = os.path.join(BASE_DIR, pdf_name)
210
+        if not os.path.exists(path):
211
+            jpg_path = os.path.join(BASE_DIR, f"2026名额到区{district_name}.jpg")
212
+            if os.path.exists(jpg_path):
213
+                problems[district_id] = f"image file requires OCR: {jpg_path}"
214
+            else:
215
+                problems[district_id] = f"missing file: {path}"
216
+            continue
217
+
218
+        parsed_rows, parse_problems = parse_pdf(path, district_name, schools)
219
+        if parse_problems:
220
+            problems[district_id] = f"parse problems: {parse_problems[:3]}"
221
+            continue
222
+        if not parsed_rows:
223
+            problems[district_id] = "no table rows extracted"
224
+            continue
225
+
226
+        records_by_district[district_id] = [
227
+            build_record(district_id, row, previous_plan_nums) for row in parsed_rows
228
+        ]
229
+
230
+    return records_by_district, problems
231
+
232
+
233
+def print_summary(records_by_district, problems):
234
+    for district_id in sorted(records_by_district):
235
+        records = records_by_district[district_id]
236
+        print(
237
+            "ready",
238
+            district_id,
239
+            DISTRICTS[district_id],
240
+            "rows",
241
+            len(records),
242
+            "plan",
243
+            sum(row["PlanNum"] for row in records),
244
+        )
245
+    for district_id in sorted(problems):
246
+        print("problem", district_id, DISTRICTS[district_id], problems[district_id])
247
+
248
+
249
+def insert_records(cursor, records_by_district):
250
+    rows = [
251
+        row
252
+        for district_id in sorted(records_by_district)
253
+        for row in records_by_district[district_id]
254
+    ]
255
+    if not rows:
256
+        return 0
257
+    placeholders = ", ".join(["%s"] * len(INSERT_COLUMNS))
258
+    columns = ", ".join(INSERT_COLUMNS)
259
+    sql = f"INSERT INTO MPS_Score ({columns}) VALUES ({placeholders})"
260
+    values = [[row[column] for column in INSERT_COLUMNS] for row in rows]
261
+    cursor.executemany(sql, values)
262
+    return len(rows)
263
+
264
+
265
+def main():
266
+    parser = argparse.ArgumentParser()
267
+    parser.add_argument("--dry-run", action="store_true")
268
+    args = parser.parse_args()
269
+
270
+    conn = connect()
271
+    try:
272
+        with conn.cursor(pymysql.cursors.DictCursor) as cursor:
273
+            records_by_district, problems = collect_records(cursor)
274
+            print_summary(records_by_district, problems)
275
+            if args.dry_run:
276
+                conn.rollback()
277
+                return
278
+            inserted = insert_records(cursor, records_by_district)
279
+            conn.commit()
280
+            print("inserted", inserted)
281
+    except Exception:
282
+        conn.rollback()
283
+        raise
284
+    finally:
285
+        conn.close()
286
+
287
+
288
+if __name__ == "__main__":
289
+    main()
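
The `insert_records` helper above builds one parameterised `INSERT` from `INSERT_COLUMNS` and passes the ordered values to `cursor.executemany`. A minimal sketch of that construction as pure functions, so it can be checked without a database connection. The names `build_insert_sql` and `rows_to_values` and the shortened column list are illustrative assumptions, not part of the commit.

```python
def build_insert_sql(table, columns):
    """Return a parameterised INSERT using PyMySQL-style %s placeholders."""
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"


def rows_to_values(rows, columns):
    """Order each record dict by `columns`; a missing key raises KeyError."""
    return [[row[column] for column in columns] for row in rows]


# Shortened, illustrative column list; the scripts use the full INSERT_COLUMNS.
columns = ["ScoreYear", "ScoreType", "PlanNum"]
records = [{"ScoreYear": "2026", "ScoreType": "名额到区", "PlanNum": 7}]

sql = build_insert_sql("MPS_Score", columns)
values = rows_to_values(records, columns)
print(sql)
print(values)
```

Keeping the column order in a single list means the placeholder count and the value order cannot drift apart, which is presumably why every script in this commit reuses `INSERT_COLUMNS`.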

+ 232 - 0
秒过分数线数据导入/import_mps_score_quota_manual_2026.py

@@ -0,0 +1,232 @@
1
+import sys
2
+
3
+sys.path.insert(0, "/private/tmp/codex_mysql_driver")
4
+import pymysql  # noqa: E402
5
+
6
+from import_mps_score_quota_2026 import (  # noqa: E402
7
+    DB_CONFIG,
8
+    INSERT_COLUMNS,
9
+    SCORE_TYPE,
10
+    YEAR,
11
+    build_record,
12
+    load_previous_plan_nums,
13
+    load_schools,
14
+)
15
+
16
+
17
+MANUAL_ROWS = {
18
+    6: [
19
+        ("042032", 7),
20
+        ("102056", 20),
21
+        ("102057", 14),
22
+        ("152003", 9),
23
+        ("012001", 9),
24
+        ("012003", 2),
25
+        ("012007", 1),
26
+        ("012008", 2),
27
+        ("012009", 3),
28
+        ("012011", 2),
29
+        ("042001", 4),
30
+        ("042008", 2),
31
+        ("042035", 3),
32
+        ("043015", 3),
33
+        ("052001", 5),
34
+        ("052002", 2),
35
+        ("053004", 3),
36
+        ("062002", 2),
37
+        ("062003", 6),
38
+        ("062004", 5),
39
+        ("062011", 2),
40
+        ("063004", 7),
41
+        ("064001", 1),
42
+        ("072002", 2),
43
+        ("073003", 7),
44
+        ("073082", 5),
45
+        ("092001", 8),
46
+        ("092002", 7),
47
+        ("093001", 6),
48
+        ("102004", 12),
49
+        ("102032", 10),
50
+        ("103002", 1),
51
+        ("122001", 4),
52
+        ("123001", 2),
53
+        ("122002", 3),
54
+        ("122003", 2),
55
+        ("132001", 8),
56
+        ("132002", 5),
57
+        ("133001", 4),
58
+        ("132003", 4),
59
+        ("133003", 6),
60
+        ("142001", 4),
61
+        ("142002", 10),
62
+        ("142004", 2),
63
+        ("152001", 10),
64
+        ("152002", 2),
65
+        ("152004", 8),
66
+        ("153001", 7),
67
+        ("153004", 3),
68
+        ("153005", 9),
69
+        ("151078", 2),
70
+        ("152005", 6),
71
+        ("162000", 1),
72
+        ("163002", 3),
73
+        ("172001", 2),
74
+        ("173001", 4),
75
+        ("174003", 2),
76
+        ("182001", 5),
77
+        ("183002", 4),
78
+        ("182002", 3),
79
+        ("202001", 2),
80
+        ("202002", 2),
81
+        ("512000", 2),
82
+        ("512001", 3),
83
+    ],
84
+    10: [
85
+        ("042032", 11),
86
+        ("102056", 14),
87
+        ("102057", 10),
88
+        ("152003", 10),
89
+        ("152006", 1),
90
+        ("012001", 5),
91
+        ("012003", 3),
92
+        ("012007", 2),
93
+        ("012008", 2),
94
+        ("012010", 10),
95
+        ("012011", 4),
96
+        ("042001", 3),
97
+        ("042008", 3),
98
+        ("042035", 1),
99
+        ("043015", 2),
100
+        ("052001", 6),
101
+        ("052002", 5),
102
+        ("053004", 4),
103
+        ("062002", 9),
104
+        ("062003", 3),
105
+        ("062004", 2),
106
+        ("062011", 4),
107
+        ("064001", 6),
108
+        ("072001", 16),
109
+        ("072002", 9),
110
+        ("073003", 10),
111
+        ("073082", 10),
112
+        ("092001", 5),
113
+        ("092002", 3),
114
+        ("093001", 2),
115
+        ("102004", 3),
116
+        ("102032", 5),
117
+        ("103002", 1),
118
+        ("122001", 5),
119
+        ("123001", 10),
120
+        ("122002", 4),
121
+        ("122003", 2),
122
+        ("132001", 8),
123
+        ("132002", 12),
124
+        ("133001", 9),
125
+        ("132003", 6),
126
+        ("133003", 6),
127
+        ("142001", 14),
128
+        ("142002", 10),
129
+        ("142004", 6),
130
+        ("152001", 3),
131
+        ("152002", 7),
132
+        ("152004", 14),
133
+        ("153001", 5),
134
+        ("153004", 5),
135
+        ("153005", 6),
136
+        ("151078", 2),
137
+        ("152005", 4),
138
+        ("162000", 2),
139
+        ("163002", 3),
140
+        ("172001", 1),
141
+        ("173001", 4),
142
+        ("172002", 2),
143
+        ("172004", 3),
144
+        ("174003", 5),
145
+        ("182001", 14),
146
+        ("183002", 10),
147
+        ("182002", 6),
148
+        ("202001", 3),
149
+        ("512000", 3),
150
+    ],
151
+}
152
+
153
+
154
+def connect():
155
+    return pymysql.connect(**DB_CONFIG)
156
+
157
+
158
+def build_manual_records(cursor):
159
+    schools = load_schools(cursor)
160
+    previous_plan_nums = load_previous_plan_nums(cursor)
161
+    records_by_district = {}
162
+
163
+    for district_id, code_rows in MANUAL_ROWS.items():
164
+        cursor.execute(
165
+            """
166
+            SELECT COUNT(*) AS count
167
+            FROM MPS_Score
168
+            WHERE ScoreYear = %s AND ScoreType = %s AND DistrictID = %s
169
+            """,
170
+            (YEAR, SCORE_TYPE, district_id),
171
+        )
172
+        existing = cursor.fetchone()["count"]
173
+        if existing:
174
+            raise RuntimeError(f"District {district_id} already has {existing} rows.")
175
+
176
+        records = []
177
+        for code, plan_num in code_rows:
178
+            school = schools.get(code)
179
+            if not school:
180
+                raise RuntimeError(f"Missing school code {code} in district {district_id}.")
181
+            records.append(
182
+                build_record(
183
+                    district_id,
184
+                    {"school": school, "plan_num": plan_num},
185
+                    previous_plan_nums,
186
+                )
187
+            )
188
+        records_by_district[district_id] = records
189
+
190
+    return records_by_district
191
+
192
+
193
+def insert_records(cursor, records):
194
+    placeholders = ", ".join(["%s"] * len(INSERT_COLUMNS))
195
+    columns = ", ".join(INSERT_COLUMNS)
196
+    sql = f"INSERT INTO MPS_Score ({columns}) VALUES ({placeholders})"
197
+    values = [[row[column] for column in INSERT_COLUMNS] for row in records]
198
+    cursor.executemany(sql, values)
199
+
200
+
201
+def main():
202
+    conn = connect()
203
+    try:
204
+        with conn.cursor(pymysql.cursors.DictCursor) as cursor:
205
+            records_by_district = build_manual_records(cursor)
206
+            for district_id in sorted(records_by_district):
207
+                records = records_by_district[district_id]
208
+                print(
209
+                    "ready",
210
+                    district_id,
211
+                    "rows",
212
+                    len(records),
213
+                    "plan",
214
+                    sum(row["PlanNum"] for row in records),
215
+                )
216
+            rows = [
217
+                row
218
+                for district_id in sorted(records_by_district)
219
+                for row in records_by_district[district_id]
220
+            ]
221
+            insert_records(cursor, rows)
222
+            conn.commit()
223
+            print("inserted", len(rows))
224
+    except Exception:
225
+        conn.rollback()
226
+        raise
227
+    finally:
228
+        conn.close()
229
+
230
+
231
+if __name__ == "__main__":
232
+    main()
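
`MANUAL_ROWS` is typed by hand from image files, so a repeated school code or a mistyped count would go straight into `MPS_Score`. A minimal sketch of a check that could run before the insert, assuming the same `(code, plan_num)` tuple layout. The helper names `duplicate_codes` and `district_total` are hypothetical and the sample data is invented for illustration.

```python
from collections import Counter


def duplicate_codes(code_rows):
    """Return school codes that appear more than once in one district's list."""
    counts = Counter(code for code, _plan in code_rows)
    return sorted(code for code, seen in counts.items() if seen > 1)


def district_total(code_rows):
    """Sum the plan numbers so the total can be compared with the source table."""
    return sum(plan for _code, plan in code_rows)


# Invented sample with a deliberate duplicate, not taken from MANUAL_ROWS.
sample = [("042032", 7), ("102056", 20), ("042032", 3)]
print(duplicate_codes(sample))
print(district_total(sample))
```

Comparing `district_total` with the total printed on the official sheet is a cheap way to catch a single mistyped digit.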

+ 224 - 0
秒过分数线数据导入/import_mps_score_school_quota_2026.py

@@ -0,0 +1,224 @@
1
+import argparse
2
+import json
3
+import os
4
+import sys
5
+
6
+sys.path.insert(0, "/private/tmp/codex_mysql_driver")
7
+import pymysql
8
+
9
+import research_mps_score_school_quota_2026 as parser
10
+
11
+
12
+INSERT_COLUMNS = [
13
+    "ScoreYear",
14
+    "ScoreType",
15
+    "DistrictID",
16
+    "SchoolOfGraduation",
17
+    "SchoolFullNameJunior",
18
+    "SchoolTarget",
19
+    "SchoolFullName",
20
+    "SchoolTargetRemark",
21
+    "PlanNum",
22
+    "ScoreTotal",
23
+    "Score1",
24
+    "Score2",
25
+    "Score3",
26
+    "Score4",
27
+    "SchoolTargetRemark2",
28
+    "PlanNumDifferenceValue",
29
+    "ScoreTotalDifferenceValue",
30
+    "OrderID",
31
+    "SchoolNumber",
32
+    "SchoolNumber2",
33
+    "SchoolOfGraduation1",
34
+]
35
+
36
+PROBLEM_FILE = "mps_score_school_quota_2026_problems.json"
37
+
38
+
39
+def load_previous_plan_nums(cursor):
40
+    cursor.execute(
41
+        """
42
+        SELECT DistrictID, SchoolOfGraduation, SchoolTarget, PlanNum
43
+        FROM MPS_Score
44
+        WHERE ScoreYear = '2025' AND ScoreType = '名额到校'
45
+        """
46
+    )
47
+    return {
48
+        (int(row["DistrictID"]), int(row["SchoolOfGraduation"]), str(row["SchoolTarget"])): int(
49
+            row["PlanNum"] or 0
50
+        )
51
+        for row in cursor.fetchall()
52
+    }
53
+
54
+
55
+def build_record(district_id, row, previous_plan_nums):
56
+    junior, high, plan_num, _junior_method, _high_method = row
57
+    previous = previous_plan_nums.get((district_id, int(junior["ID"]), str(high["ID"])), 0)
58
+    return {
59
+        "ScoreYear": "2026",
60
+        "ScoreType": "名额到校",
61
+        "DistrictID": district_id,
62
+        "SchoolOfGraduation": int(junior["ID"]),
63
+        "SchoolFullNameJunior": junior["SchoolFullName"],
64
+        "SchoolTarget": str(high["ID"]),
65
+        "SchoolFullName": high["SchoolFullName"],
66
+        "SchoolTargetRemark": "",
67
+        "PlanNum": int(plan_num),
68
+        "ScoreTotal": 0,
69
+        "Score1": 0,
70
+        "Score2": 0,
71
+        "Score3": 0,
72
+        "Score4": 0,
73
+        "SchoolTargetRemark2": None,
74
+        "PlanNumDifferenceValue": int(plan_num) - previous,
75
+        "ScoreTotalDifferenceValue": 0,
76
+        "OrderID": 0,
77
+        "SchoolNumber": "",
78
+        "SchoolNumber2": "",
79
+        "SchoolOfGraduation1": "0",
80
+    }
81
+
82
+
83
+def problem_to_json(problem):
84
+    try:
85
+        raw, high_method, junior_method = problem
86
+        return {
87
+            "raw": raw,
88
+            "high_match": high_method,
89
+            "junior_match": junior_method,
90
+        }
91
+    except Exception:
92
+        return {"raw": repr(problem)}
93
+
94
+
95
+def collect(cursor):
96
+    high_by_code, high_by_name, _ = parser.load_schools(cursor, "高中")
97
+    junior_by_code, junior_by_name, _ = parser.load_schools(cursor, "初中")
98
+    previous_plan_nums = load_previous_plan_nums(cursor)
99
+
100
+    records_by_district = {}
101
+    problems_by_district = {}
102
+
103
+    for district_id, district_name in parser.DISTRICTS.items():
104
+        cursor.execute(
105
+            """
106
+            SELECT COUNT(*) AS count
107
+            FROM MPS_Score
108
+            WHERE ScoreYear = '2026' AND ScoreType = '名额到校' AND DistrictID = %s
109
+            """,
110
+            (district_id,),
111
+        )
112
+        existing = cursor.fetchone()["count"]
113
+        if existing:
114
+            problems_by_district[str(district_id)] = {
115
+                "district": district_name,
116
+                "status": f"already has {existing} rows",
117
+                "problems": [],
118
+            }
119
+            continue
120
+
121
+        pdf_path = os.path.join(parser.BASE_DIR, f"2026名额到校{district_name}.pdf")
122
+        jpg_path = os.path.join(parser.BASE_DIR, f"2026名额到校{district_name}.jpg")
123
+        if not os.path.exists(pdf_path):
124
+            problems_by_district[str(district_id)] = {
125
+                "district": district_name,
126
+                "status": "image_or_missing",
127
+                "file": jpg_path if os.path.exists(jpg_path) else pdf_path,
128
+                "problems": [],
129
+            }
130
+            continue
131
+
132
+        rows, problems = parser.parse_tables(
133
+            pdf_path, district_id, high_by_code, high_by_name, junior_by_code, junior_by_name
134
+        )
135
+        if not rows:
136
+            problems_by_district[str(district_id)] = {
137
+                "district": district_name,
138
+                "status": "no_safe_rows",
139
+                "file": pdf_path,
140
+                "problems": [problem_to_json(item) for item in problems[:200]],
141
+            }
142
+            continue
143
+
144
+        records_by_district[district_id] = [
145
+            build_record(district_id, row, previous_plan_nums) for row in rows
146
+        ]
147
+        if problems:
148
+            problems_by_district[str(district_id)] = {
149
+                "district": district_name,
150
+                "status": "partial",
151
+                "file": pdf_path,
152
+                "problems": [problem_to_json(item) for item in problems[:500]],
153
+            }
154
+
155
+    return records_by_district, problems_by_district
156
+
157
+
158
+def write_problem_file(problems):
159
+    with open(PROBLEM_FILE, "w", encoding="utf-8") as handle:
160
+        json.dump(problems, handle, ensure_ascii=False, indent=2, default=str)
161
+
162
+
163
+def insert_records(cursor, records_by_district):
164
+    rows = [
165
+        row
166
+        for district_id in sorted(records_by_district)
167
+        for row in records_by_district[district_id]
168
+    ]
169
+    if not rows:
170
+        return 0
171
+    columns = ", ".join(INSERT_COLUMNS)
172
+    placeholders = ", ".join(["%s"] * len(INSERT_COLUMNS))
173
+    sql = f"INSERT INTO MPS_Score ({columns}) VALUES ({placeholders})"
174
+    values = [[row[column] for column in INSERT_COLUMNS] for row in rows]
175
+    cursor.executemany(sql, values)
176
+    return len(rows)
177
+
178
+
179
+def main():
180
+    arg_parser = argparse.ArgumentParser()
181
+    arg_parser.add_argument("--dry-run", action="store_true")
182
+    args = arg_parser.parse_args()
183
+
184
+    conn = pymysql.connect(**parser.DB_CONFIG)
185
+    try:
186
+        with conn.cursor(pymysql.cursors.DictCursor) as cursor:
187
+            records_by_district, problems = collect(cursor)
188
+            write_problem_file(problems)
189
+            for district_id in sorted(records_by_district):
190
+                rows = records_by_district[district_id]
191
+                print(
192
+                    "ready",
193
+                    district_id,
194
+                    parser.DISTRICTS[district_id],
195
+                    "rows",
196
+                    len(rows),
197
+                    "plan",
198
+                    sum(row["PlanNum"] for row in rows),
199
+                )
200
+            for district_id in sorted(problems, key=int):
201
+                info = problems[district_id]
202
+                print(
203
+                    "problem",
204
+                    district_id,
205
+                    info["district"],
206
+                    info["status"],
207
+                    "count",
208
+                    len(info.get("problems", [])),
209
+                )
210
+            if args.dry_run:
211
+                conn.rollback()
212
+                return
213
+            inserted = insert_records(cursor, records_by_district)
214
+            conn.commit()
215
+            print("inserted", inserted)
216
+    except Exception:
217
+        conn.rollback()
218
+        raise
219
+    finally:
220
+        conn.close()
221
+
222
+
223
+if __name__ == "__main__":
224
+    main()
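
`PlanNumDifferenceValue` in this script is the 2026 plan number minus the 2025 plan number for the same (district, junior school, high school) triple, with 0 used when no 2025 row exists. A minimal sketch of that lookup on plain dictionaries. The IDs in the sample are invented and the function names are illustrative assumptions.

```python
def index_previous(rows):
    """Key last year's plan numbers by (DistrictID, junior ID, high school ID)."""
    return {
        (int(r["DistrictID"]), int(r["SchoolOfGraduation"]), str(r["SchoolTarget"])): int(
            r["PlanNum"] or 0
        )
        for r in rows
    }


def plan_difference(previous, district_id, junior_id, high_id, plan_num):
    """Return this year's plan minus last year's; a missing pair counts as 0."""
    key = (district_id, int(junior_id), str(high_id))
    return int(plan_num) - previous.get(key, 0)


# Invented 2025 rows standing in for the MPS_Score query result.
previous = index_previous(
    [{"DistrictID": 6, "SchoolOfGraduation": 501, "SchoolTarget": "88", "PlanNum": 14}]
)
print(plan_difference(previous, 6, 501, 88, 16))  # existing pair
print(plan_difference(previous, 6, 502, 88, 5))   # pair with no 2025 row
```

One consequence of the default of 0 is that a pair that is new in 2026 reports its full plan number as the difference, which may or may not be the intended reading downstream.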

+ 92 - 0
秒过分数线数据导入/import_mps_score_school_quota_hongkou_2026.py

@@ -0,0 +1,92 @@
1
+import sys
2
+
3
+sys.path.insert(0, "/private/tmp/codex_mysql_driver")
4
+import pymysql  # noqa: E402
5
+
6
+import research_mps_score_school_quota_2026 as parser  # noqa: E402
7
+from import_mps_score_school_quota_2026 import INSERT_COLUMNS, build_record, load_previous_plan_nums  # noqa: E402
8
+
9
+
10
+HIGH_CODES = ["092001", "092002", "093001", "042032", "152003", "102057", "102056"]
11
+
12
+ROWS = [
13
+    ("上海市虹口实验学校", [16, 15, 12, 0, 0, 0, 0]),
14
+    ("上海市曲阳第二中学", [11, 11, 7, 1, 0, 0, 0]),
15
+    ("上海市钟山初级中学", [9, 8, 6, 0, 0, 1, 0]),
16
+    ("上海市长青学校", [6, 6, 4, 0, 1, 0, 0]),
17
+    ("华东师范大学第一附属初级中学", [9, 9, 6, 0, 0, 0, 0]),
18
+    ("上海市丰镇中学", [10, 10, 8, 0, 0, 0, 0]),
19
+    ("上海市北郊学校", [6, 4, 4, 0, 0, 0, 0]),
20
+    ("上海市江湾初级中学", [7, 6, 4, 0, 0, 0, 0]),
21
+    ("上海市复兴实验中学", [6, 6, 4, 0, 0, 1, 0]),
22
+    ("上海音乐学院虹口区实验中学", [7, 6, 5, 0, 0, 0, 0]),
23
+    ("上海市虹口区教育学院实验中学", [6, 5, 4, 0, 0, 0, 1]),
24
+    ("上海市继光初级中学", [4, 3, 3, 0, 0, 0, 0]),
25
+    ("上海市海南中学", [1, 1, 2, 0, 0, 0, 0]),
26
+    ("上海市鲁迅初级中学", [7, 6, 4, 0, 0, 0, 0]),
27
+    ("上海市第五十二中学", [5, 4, 4, 0, 0, 0, 0]),
28
+    ("上海师范大学附属虹口中学", [4, 4, 4, 1, 0, 0, 0]),
29
+    ("同济大学附属澄衷中学", [1, 1, 2, 0, 0, 0, 1]),
30
+    ("上海世外教育附属虹口区欧阳学校", [4, 4, 4, 0, 0, 0, 0]),
31
+    ("上海市民办新华初级中学", [15, 14, 11, 0, 0, 0, 1]),
32
+    ("上海市民办新复兴初级中学", [16, 15, 11, 0, 0, 1, 0]),
33
+    ("上海市民办新北郊初级中学", [16, 15, 12, 0, 0, 1, 0]),
34
+    ("上海市民办迅行中学", [7, 7, 5, 0, 0, 0, 0]),
35
+    ("上海民办克勒外国语学校", [7, 6, 5, 0, 0, 0, 1]),
36
+]
37
+
38
+
39
+def insert_records(cursor, rows):
40
+    columns = ", ".join(INSERT_COLUMNS)
41
+    placeholders = ", ".join(["%s"] * len(INSERT_COLUMNS))
42
+    sql = f"INSERT INTO MPS_Score ({columns}) VALUES ({placeholders})"
43
+    values = [[row[column] for column in INSERT_COLUMNS] for row in rows]
44
+    cursor.executemany(sql, values)
45
+
46
+
47
+def main():
48
+    conn = pymysql.connect(**parser.DB_CONFIG)
49
+    try:
50
+        with conn.cursor(pymysql.cursors.DictCursor) as cursor:
51
+            cursor.execute(
52
+                """
53
+                SELECT COUNT(*) AS count
54
+                FROM MPS_Score
55
+                WHERE ScoreYear = '2026' AND ScoreType = '名额到校' AND DistrictID = 6
56
+                """
57
+            )
58
+            existing = cursor.fetchone()["count"]
59
+            if existing:
60
+                raise RuntimeError(f"District 6 already has {existing} rows.")
61
+
62
+            high_by_code, _high_by_name, _ = parser.load_schools(cursor, "高中")
63
+            _junior_by_code, junior_by_name, _ = parser.load_schools(cursor, "初中")
64
+            previous = load_previous_plan_nums(cursor)
65
+
66
+            parsed_rows = []
67
+            for junior_name, plans in ROWS:
68
+                junior, method = parser.match_school(None, junior_name, {}, junior_by_name, 6)
69
+                if not junior:
70
+                    raise RuntimeError(f"Cannot match junior school: {junior_name} ({method})")
71
+                for high_code, plan in zip(HIGH_CODES, plans):
72
+                    if not plan:
73
+                        continue
74
+                    high = high_by_code.get(high_code)
75
+                    if not high:
76
+                        raise RuntimeError(f"Cannot match high school code: {high_code}")
77
+                    parsed_rows.append((junior, high, plan, "name", "code"))
78
+
79
+            records = [build_record(6, row, previous) for row in parsed_rows]
80
+            print("ready 6 虹口区 rows", len(records), "plan", sum(row["PlanNum"] for row in records))
81
+            insert_records(cursor, records)
82
+            conn.commit()
83
+            print("inserted", len(records))
84
+    except Exception:
85
+        conn.rollback()
86
+        raise
87
+    finally:
88
+        conn.close()
89
+
90
+
91
+if __name__ == "__main__":
92
+    main()
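
The Hongkou script stores the source table in wide form: one list of plan numbers per junior school, aligned by position with `HIGH_CODES`. Because `zip` stops at the shorter input, a row with a missing or extra value would be truncated silently. A minimal sketch of the same wide-to-long expansion with an explicit length check added. The function name `expand_wide_rows` is an assumption; the sample row is the first entry of `ROWS` above.

```python
def expand_wide_rows(high_codes, wide_rows):
    """Turn (junior, [plan per high school]) rows into (junior, code, plan) triples."""
    long_rows = []
    for junior_key, plans in wide_rows:
        if len(plans) != len(high_codes):
            raise ValueError(
                f"{junior_key}: expected {len(high_codes)} values, got {len(plans)}"
            )
        for high_code, plan in zip(high_codes, plans):
            if plan:  # zero means no quota for this pair, so no row is produced
                long_rows.append((junior_key, high_code, plan))
    return long_rows


HIGH_CODES = ["092001", "092002", "093001", "042032", "152003", "102057", "102056"]
rows = expand_wide_rows(HIGH_CODES, [("上海市虹口实验学校", [16, 15, 12, 0, 0, 0, 0])])
print(rows)
```

On Python 3.10 or later, `zip(high_codes, plans, strict=True)` gives the same protection without the manual check.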

+ 202 - 0
秒过分数线数据导入/import_mps_score_school_quota_supplement_2026.py

@@ -0,0 +1,202 @@
1
+import json
2
+import sys
3
+
4
+import pdfplumber
5
+
6
+sys.path.insert(0, "/private/tmp/codex_mysql_driver")
7
+import pymysql  # noqa: E402
8
+
9
+import research_mps_score_school_quota_2026 as parser  # noqa: E402
10
+from import_mps_score_school_quota_2026 import INSERT_COLUMNS, build_record, load_previous_plan_nums  # noqa: E402
11
+
12
+
13
+XUHUI_HIGH_CODES = [
14
+    "042008",
15
+    "042035",
16
+    "042001",
17
+    "042002",
18
+    "043015",
19
+    "042036",
20
+    "042032",
21
+    "102056",
22
+    "102057",
23
+    "152003",
24
+    "152006",
25
+]
26
+
27
+XUHUI_ROWS = [
28
+    ("041302", [10, 10, 5, 2, 8, 4, 2, 0, 0, 0, 0]),
29
+    ("041305", [5, 6, 3, 2, 4, 2, 1, 1, 0, 0, 0]),
30
+    ("041306", [11, 11, 5, 2, 9, 3, 2, 1, 0, 0, 0]),
31
+    ("041316", [9, 7, 5, 1, 6, 3, 1, 0, 0, 0, 1]),
32
+    ("041318", [5, 5, 5, 1, 4, 1, 1, 0, 0, 0, 0]),
33
+    ("041319", [4, 5, 3, 1, 3, 1, 1, 0, 0, 0, 0]),
34
+    ("041320", [13, 11, 6, 2, 10, 5, 3, 0, 0, 0, 0]),
35
+    ("041326", [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]),
36
+    ("041327", [9, 8, 5, 2, 8, 3, 1, 1, 0, 0, 0]),
37
+    ("041328", [4, 5, 2, 2, 3, 1, 0, 0, 0, 0, 1]),
38
+    ("041329", [7, 8, 4, 1, 7, 4, 1, 1, 0, 0, 0]),
39
+    ("041331", [3, 2, 2, 1, 3, 1, 0, 0, 0, 0, 0]),
40
+    ("041334", [5, 3, 1, 1, 3, 1, 0, 0, 0, 0, 0]),
41
+    ("041336", [5, 5, 3, 2, 7, 1, 1, 0, 1, 0, 0]),
42
+    ("041347", [8, 8, 4, 1, 8, 2, 2, 0, 0, 0, 0]),
43
+    ("041363", [8, 8, 5, 3, 8, 4, 1, 0, 1, 0, 0]),
44
+    ("041385", [8, 10, 5, 2, 9, 3, 0, 0, 0, 0, 1]),
45
+    ("044103", [10, 9, 5, 2, 10, 3, 2, 0, 0, 0, 0]),
46
+    ("044107", [3, 3, 1, 1, 4, 1, 1, 0, 0, 0, 0]),
47
+    ("044109", [13, 14, 7, 3, 12, 5, 0, 0, 0, 0, 1]),
48
+    ("044110", [7, 6, 4, 1, 7, 3, 0, 0, 0, 0, 0]),
49
+    ("044111", [10, 11, 6, 3, 9, 4, 0, 0, 0, 0, 2]),
50
+    ("044114", [4, 4, 2, 2, 5, 1, 1, 0, 0, 0, 0]),
51
+    ("044133", [9, 7, 5, 1, 7, 2, 0, 0, 0, 0, 0]),
52
+    ("044162", [12, 13, 10, 2, 11, 4, 0, 0, 1, 2, 0]),
53
+    ("044164", [12, 13, 9, 2, 11, 4, 1, 1, 0, 1, 0]),
54
+    ("044181", [3, 3, 3, 1, 2, 1, 1, 0, 0, 0, 0]),
55
+    ("044182", [3, 4, 1, 1, 3, 1, 0, 0, 0, 0, 1]),
56
+    ("045304", [2, 2, 2, 1, 3, 1, 0, 0, 0, 0, 0]),
57
+    ("045306", [2, 2, 1, 1, 3, 1, 0, 0, 1, 0, 0]),
58
+    ("045444", [4, 5, 3, 1, 5, 2, 0, 0, 0, 1, 0]),
59
+]
60
+
61
+JIADING_PDF = (
62
+    "/Volumes/程杰外接SD盘/上海中考招生计划/2026/计划/名额到校/"
63
+    "2026名额到校嘉定区.pdf"
64
+)
65
+
66
+
67
+def make_row(junior, high, plan):
68
+    return (junior, high, int(plan), "code_or_name", "code")
69
+
70
+
71
+def collect_xuhui(high_by_code, junior_by_code):
72
+    rows = []
73
+    problems = []
74
+    for junior_code, values in XUHUI_ROWS:
75
+        junior = junior_by_code.get(junior_code)
76
+        if not junior:
77
+            problems.append({"type": "junior", "code": junior_code})
78
+            continue
79
+        for high_code, plan in zip(XUHUI_HIGH_CODES, values):
80
+            if not plan:
81
+                continue
82
+            high = high_by_code.get(high_code)
83
+            if not high:
84
+                problems.append({"type": "high", "code": high_code})
85
+                continue
86
+            rows.append(make_row(junior, high, plan))
87
+    return rows, problems
88
+
89
+
90
+def collect_jiading(high_by_code, high_by_name, junior_by_name):
91
+    rows = []
92
+    problems = []
93
+    with pdfplumber.open(JIADING_PDF) as pdf:
94
+        for page in pdf.pages:
95
+            for table in page.extract_tables():
96
+                for raw in table[2:]:
97
+                    junior_name = parser.clean_text(raw[0])
98
+                    if not junior_name:
99
+                        continue
100
+                    junior, junior_method = parser.match_school(
101
+                        None, junior_name, {}, junior_by_name, 10
102
+                    )
103
+                    if not junior:
104
+                        problems.append(
105
+                            {
106
+                                "type": "junior",
107
+                                "name": junior_name,
108
+                                "method": junior_method,
109
+                                "raw": raw,
110
+                            }
111
+                        )
112
+                        continue
113
+                    for col in [1, 4, 7, 10]:
114
+                        code = parser.clean_code(raw[col] if col < len(raw) else None)
115
+                        plan = parser.clean_num(raw[col + 2] if col + 2 < len(raw) else None)
116
+                        if not code or not plan:
117
+                            continue
118
+                        high, high_method = parser.match_school(
119
+                            code, raw[col + 1] if col + 1 < len(raw) else "", high_by_code, high_by_name
120
+                        )
121
+                        if not high:
122
+                            problems.append(
123
+                                {
124
+                                    "type": "high",
125
+                                    "code": code,
126
+                                    "method": high_method,
127
+                                    "raw": raw,
128
+                                }
129
+                            )
130
+                            continue
131
+                        rows.append(make_row(junior, high, plan))
132
+    return rows, problems
133
+
134
+
135
+def check_empty(cursor, district_id):
136
+    cursor.execute(
137
+        """
138
+        SELECT COUNT(*) AS count
139
+        FROM MPS_Score
140
+        WHERE ScoreYear = '2026' AND ScoreType = '名额到校' AND DistrictID = %s
141
+        """,
142
+        (district_id,),
143
+    )
144
+    count = cursor.fetchone()["count"]
145
+    if count:
146
+        raise RuntimeError(f"District {district_id} already has {count} rows.")
147
+
148
+
149
+def insert_records(cursor, rows):
150
+    if not rows:
151
+        return 0
152
+    columns = ", ".join(INSERT_COLUMNS)
153
+    placeholders = ", ".join(["%s"] * len(INSERT_COLUMNS))
154
+    sql = f"INSERT INTO MPS_Score ({columns}) VALUES ({placeholders})"
155
+    values = [[row[column] for column in INSERT_COLUMNS] for row in rows]
156
+    cursor.executemany(sql, values)
157
+    return len(rows)
158
+
159
+
160
+def main():
161
+    conn = pymysql.connect(**parser.DB_CONFIG)
162
+    problems = {}
163
+    try:
164
+        with conn.cursor(pymysql.cursors.DictCursor) as cursor:
165
+            high_by_code, high_by_name, _ = parser.load_schools(cursor, "高中")
166
+            junior_by_code, junior_by_name, _ = parser.load_schools(cursor, "初中")
167
+            previous = load_previous_plan_nums(cursor)
168
+
169
+            check_empty(cursor, 2)
170
+            check_empty(cursor, 10)
171
+
172
+            xuhui_rows, xuhui_problems = collect_xuhui(high_by_code, junior_by_code)
173
+            jiading_rows, jiading_problems = collect_jiading(
174
+                high_by_code, high_by_name, junior_by_name
175
+            )
176
+            problems["2"] = {"district": "徐汇区", "problems": xuhui_problems}
177
+            problems["10"] = {"district": "嘉定区", "problems": jiading_problems}
178
+
179
+            records = []
180
+            for row in xuhui_rows:
181
+                records.append(build_record(2, row, previous))
182
+            for row in jiading_rows:
183
+                records.append(build_record(10, row, previous))
184
+
185
+            print("ready 2 徐汇区 rows", len(xuhui_rows), "plan", sum(row[2] for row in xuhui_rows))
186
+            print("ready 10 嘉定区 rows", len(jiading_rows), "plan", sum(row[2] for row in jiading_rows))
187
+            print("problems", json.dumps(problems, ensure_ascii=False, default=str))
188
+
189
+            inserted = insert_records(cursor, records)
190
+            conn.commit()
191
+            with open("mps_score_school_quota_2026_supplement_problems.json", "w", encoding="utf-8") as handle:
192
+                json.dump(problems, handle, ensure_ascii=False, indent=2, default=str)
193
+            print("inserted", inserted)
194
+    except Exception:
195
+        conn.rollback()
196
+        raise
197
+    finally:
198
+        conn.close()
199
+
200
+
201
+if __name__ == "__main__":
202
+    main()
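
`collect_jiading` reads each extracted table row as a junior school name followed by up to four (code, name, plan) groups starting at columns 1, 4, 7 and 10. A minimal sketch of that column walk on a plain list. It uses simple `strip` and `isdigit` checks as stand-ins for `parser.clean_code` and `parser.clean_num`, whose real behaviour is not shown in this part of the diff, and `iter_code_plan_groups` is a hypothetical name.

```python
def iter_code_plan_groups(raw, starts=(1, 4, 7, 10)):
    """Yield (high_school_code, plan) for every filled three-column group in a row."""
    for col in starts:
        if col + 2 >= len(raw):
            continue  # the row is too short to hold this group
        code = (raw[col] or "").strip()
        plan_text = (raw[col + 2] or "").strip()
        if not code or not plan_text.isdigit():
            continue  # empty group, as in the first block of the Jiading rows
        plan = int(plan_text)
        if plan:
            yield code, plan


# Row shape taken from the supplement problem file in this commit.
raw = [
    "上海市民办嘉宜初级中学", "", "", "",
    "142001", "上海市嘉定区\n第一中学", "8",
    "142002", "上海交通大学附属\n中学嘉定分校", "9",
    "142004", "上海师范大学附属中\n学嘉定新城分校", "5",
]
groups = list(iter_code_plan_groups(raw))
print(groups)
```

Separating the column walk from the school matching makes it possible to test the table layout assumption on its own before touching the database.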

+ 1 - 0
秒过分数线数据导入/mps_score_school_quota_2026_problems.json

@@ -0,0 +1 @@
1
+{}

+ 71 - 0
秒过分数线数据导入/mps_score_school_quota_2026_supplement_problems.json

@@ -0,0 +1,71 @@
+{
+  "2": {
+    "district": "徐汇区",
+    "problems": []
+  },
+  "10": {
+    "district": "嘉定区",
+    "problems": [
+      {
+        "type": "junior",
+        "name": "上海市民办嘉宜初级中学",
+        "method": "not_found",
+        "raw": [
+          "上海市民办嘉宜初级中学",
+          "",
+          "",
+          "",
+          "142001",
+          "上海市嘉定区\n第一中学",
+          "8",
+          "142002",
+          "上海交通大学附属\n中学嘉定分校",
+          "9",
+          "142004",
+          "上海师范大学附属中\n学嘉定新城分校",
+          "5"
+        ]
+      },
+      {
+        "type": "junior",
+        "name": "上海嘉定区世外学校",
+        "method": "not_found",
+        "raw": [
+          "上海嘉定区世外学校",
+          "",
+          "",
+          "",
+          "142001",
+          "上海市嘉定区\n第一中学",
+          "4",
+          "142002",
+          "上海交通大学附属\n中学嘉定分校",
+          "4",
+          "142004",
+          "上海师范大学附属中\n学嘉定新城分校",
+          "2"
+        ]
+      },
+      {
+        "type": "junior",
+        "name": "上海市嘉定区嘉一实验初级中学",
+        "method": "not_found",
+        "raw": [
+          "上海市嘉定区嘉一实验初级中学",
+          "",
+          "",
+          "",
+          "142001",
+          "上海市嘉定区\n第一中学",
+          "5",
+          "142002",
+          "上海交通大学附属\n中学嘉定分校",
+          "5",
+          "142004",
+          "上海师范大学附属中\n学嘉定新城分校",
+          "3"
+        ]
+      }
+    ]
+  }
+}

+ 425 - 0
秒过分数线数据导入/research_mps_score_school_quota_2026.py

@@ -0,0 +1,425 @@
+import os
+import re
+import sys
+from collections import defaultdict
+
+import pdfplumber
+
+sys.path.insert(0, "/private/tmp/codex_mysql_driver")
+import pymysql  # noqa: E402
+
+
+DB_CONFIG = {
+    "host": os.environ["MPS_DB_HOST"],  # credentials are read from the environment; the MPS_DB_* names are illustrative
+    "port": int(os.environ["MPS_DB_PORT"]),
+    "user": os.environ["MPS_DB_USER"],
+    "password": os.environ["MPS_DB_PASSWORD"],
+    "database": "kylx365_db",
+    "charset": "utf8mb4",
+    "connect_timeout": 10,
+    "read_timeout": 30,
+}
+
+BASE_DIR = "/Volumes/程杰外接SD盘/上海中考招生计划/2026/计划/名额到校"
+YEAR = "2026"
+SCORE_TYPE = "名额到校"
+
+DISTRICTS = {
+    1: "黄浦区",
+    2: "徐汇区",
+    3: "长宁区",
+    4: "静安区",
+    5: "普陀区",
+    6: "虹口区",
+    7: "杨浦区",
+    8: "闵行区",
+    9: "宝山区",
+    10: "嘉定区",
+    11: "浦东新区",
+    12: "金山区",
+    13: "松江区",
+    14: "青浦区",
+    15: "奉贤区",
+    16: "崇明区",
+}
+
+NOISE = set("不得转载未经许可允许可经载转许未得允,")
+
+HIGH_ALIAS_CODES = {
+    "华二": "152003",
+    "华师大二附中": "152003",
+    "华东师范大学第二附属中学": "152003",
+    "上中": "042032",
+    "上海中学": "042032",
+    "复附": "102057",
+    "复旦附中": "102057",
+    "交附": "102056",
+    "交大附中": "102056",
+    "上师大": "152006",
+    "上师大附中": "152006",
+    "上师附中": "152006",
+    "华二普陀": "073082",
+    "二中": "072002",
+    "曹杨二中": "072002",
+    "晋元": "072001",
+    "宜川": "073003",
+    "上师附中宝山": "132003",
+    "格致奉贤": "012002",
+    "格致中学奉贤校区": "012002",
+}
+
+
+def clean_text(value):
+    text = str(value or "")
+    text = text.replace("\n", "")
+    text = "".join(ch for ch in text if ch not in NOISE)
+    text = re.sub(r"\s+", "", text)
+    return text
+
+
+def clean_code(value):
+    match = re.search(r"\d{6}", str(value or ""))
+    return match.group(0) if match else None
+
+
+def clean_num(value):
+    text = clean_text(value)
+    if text in {"", "/", "/", "-", "—"}:  # empty-cell placeholders, half- and full-width
+        return None
+    match = re.search(r"-?\d+", text)
+    return int(match.group(0)) if match else None
+
+
+def is_dataish(row):
+    cells = [clean_text(cell) for cell in row]
+    joined = "".join(cells[:4])
+    return bool(re.search(r"\d{6}", joined)) or any("中学" in cell or "学校" in cell for cell in cells[:4])
+
+
+def school_key(name):
+    name = clean_text(name)
+    for token in ["上海市", "上海", "区", "初级", "高级", "中学", "学校", "实验"]:
+        name = name.replace(token, "")
+    return name
+
+
+def add_name(names, value, school):
+    value = clean_text(value)
+    if value:
+        names[value].append(school)
+        key = school_key(value)
+        if key and key != value:
+            names[key].append(school)
+
+
+def name_variants(name):
+    cleaned = clean_text(name)
+    variants = []
+
+    def add(value):
+        value = clean_text(value)
+        if value and value not in variants:
+            variants.append(value)
+
+    add(cleaned)
+    for part in re.findall(r"[((]现([^))]+)[))]", cleaned):  # accept half- and full-width parentheses
+        add(part)
+    add(re.sub(r"[((].*?[))]", "", cleaned))
+    return variants
+
+
+def load_schools(cursor, school_type):
+    cursor.execute(
+        """
+        SELECT ID, DistrictID, SchoolNumber, SchoolFullName, SchoolShortName, SchoolOtherName, SchoolType1
+        FROM MPS_School
+        WHERE SchoolType1 = %s
+        """,
+        (school_type,),
+    )
+    by_code = {}
+    by_name = defaultdict(list)
+    rows = cursor.fetchall()
+    seen_names = defaultdict(set)
+    for row in rows:
+        if row["SchoolNumber"]:
+            by_code[row["SchoolNumber"]] = row
+        for field in ["SchoolFullName", "SchoolShortName", "SchoolOtherName"]:
+            value = row[field]
+            cleaned = clean_text(value)
+            if cleaned and row["ID"] not in seen_names[cleaned]:
+                add_name(by_name, value, row)
+                seen_names[cleaned].add(row["ID"])
+    return by_code, by_name, rows
+
+
+def match_school(code, name, by_code, by_name, district_id=None):
+    if code and code in by_code:
+        return by_code[code], "code"
+    cleaned = clean_text(name)
+    if by_code is not None:
+        for alias, alias_code in HIGH_ALIAS_CODES.items():
+            if alias in cleaned and alias_code in by_code:
+                return by_code[alias_code], f"alias:{alias}"
+    candidates = []
+    for variant in name_variants(name):
+        if variant in by_name:
+            candidates.extend(by_name[variant])
+        else:
+            key = school_key(variant)
+            if key in by_name:
+                candidates.extend(by_name[key])
+        if candidates:
+            break
+    if candidates:
+        seen = set()
+        candidates = [row for row in candidates if not (row["ID"] in seen or seen.add(row["ID"]))]
+    if district_id is not None:
+        district_candidates = [row for row in candidates if row["DistrictID"] == district_id]
+        if len(district_candidates) == 1:
+            return district_candidates[0], "name_district"
+    if len(candidates) == 1:
+        return candidates[0], "name"
+    if candidates:
+        return None, f"ambiguous:{[row['SchoolFullName'] for row in candidates[:4]]}"
+    if district_id is not None:
+        fuzzy_candidates = []
+        for variant in name_variants(name):
+            if len(variant) < 6:
+                continue
+            for school_list in by_name.values():
+                for row in school_list:
+                    if row["DistrictID"] != district_id:
+                        continue
+                    fields = [
+                        clean_text(row["SchoolFullName"]),
+                        clean_text(row["SchoolShortName"]),
+                        clean_text(row["SchoolOtherName"]),
+                    ]
+                    if any(variant in field or field in variant for field in fields if field):
+                        fuzzy_candidates.append(row)
+        if fuzzy_candidates:
+            seen = set()
+            fuzzy_candidates = [
+                row for row in fuzzy_candidates if not (row["ID"] in seen or seen.add(row["ID"]))
+            ]
+            if len(fuzzy_candidates) == 1:
+                return fuzzy_candidates[0], "name_contains_district"
+            return None, f"ambiguous_contains:{[row['SchoolFullName'] for row in fuzzy_candidates[:4]]}"
+    return None, "not_found"
+
+
+def extract_codes_from_header(header_rows, col_index):
+    for row in header_rows:
+        if col_index < len(row):
+            code = clean_code(row[col_index])
+            if code:
+                return code
+    return None
+
+
+def extract_name_from_header(header_rows, col_index):
+    parts = []
+    for row in header_rows:
+        if col_index < len(row):
+            value = clean_text(row[col_index])
+            value = re.sub(r"^(委属|区属)?市?实验性示范性高中分配结果", "", value)
+            value = value.replace("区属名额数", "").replace("委属名额", "")
+            if value and not re.fullmatch(r"\d{6}", value) and "计划数" not in value and "合计" not in value:
+                parts.append(value)
+    return "".join(parts)
+
+
+def data_start_index(table):
+    for index, row in enumerate(table):
+        left_header = "".join(clean_text(cell) for cell in row[:2])
+        if any(token in left_header for token in ["初中代码", "初中学校名称", "学校代码", "学校名称"]):
+            continue
+        first_cell = clean_text(row[0] if row else "")
+        first_code = clean_code(row[0] if row else "")
+        second_code = clean_code(row[1] if len(row) > 1 else "")
+        left_code = first_code or (second_code if re.fullmatch(r"\d{1,3}", first_cell) else None)
+        if left_code and any(clean_num(cell) is not None for cell in row[1:]):
+            return index
+        if index > 0 and any(clean_num(cell) is not None for cell in row[2:]):
+            if any("中学" in clean_text(cell) or "学校" in clean_text(cell) for cell in row[:3]):
+                return index
+    return None
+
+
+def parse_long_table(table, district_id, high_by_code, high_by_name, junior_by_code, junior_by_name, state=None):
+    rows = []
+    problems = []
+    state = state if state is not None else {}
+    current_high_code = state.get("high_code")
+    current_high_name = state.get("high_name")
+    data_rows = table[1:] if table and any("招生学校代码" in clean_text(cell) for cell in table[0]) else table
+    for raw in data_rows:
+        if len(raw) >= 5:
+            high_code = clean_code(raw[0]) or current_high_code
+            high_name = clean_text(raw[1]) or current_high_name
+            junior_code = clean_code(raw[2])
+            junior_name = clean_text(raw[3])
+            plan_num = clean_num(raw[4])
+            if clean_code(raw[0]):
+                current_high_code = high_code
+                current_high_name = high_name
+                state["high_code"] = current_high_code
+                state["high_name"] = current_high_name
+        elif len(raw) >= 3 and current_high_code:
+            high_code = current_high_code
+            high_name = current_high_name
+            junior_code = clean_code(raw[0])
+            junior_name = clean_text(raw[1])
+            plan_num = clean_num(raw[2])
+        else:
+            continue
+        if clean_code(raw[0]):
+            current_high_code = high_code
+            current_high_name = high_name
+            state["high_code"] = current_high_code
+            state["high_name"] = current_high_name
+        if plan_num is None or plan_num == 0:
+            continue
+        high, high_method = match_school(high_code, high_name, high_by_code, high_by_name)
+        junior, junior_method = match_school(
+            junior_code, junior_name, junior_by_code, junior_by_name, district_id
+        )
+        if not high or not junior:
+            problems.append((raw, high_method, junior_method))
+            continue
+        rows.append((junior, high, plan_num, junior_method, high_method))
+    return rows, problems
+
+
+def parse_matrix_table(table, district_id, high_by_code, high_by_name, junior_by_code, junior_by_name):
+    start = data_start_index(table)
+    if start is None:
+        return [], [("no_data_start", table[:3])]
+    header_rows = table[:start]
+    data_rows = table[start:]
+
+    # Locate the junior-code/name columns using the first data row.
+    sample = data_rows[0]
+    code_col = 0
+    name_col = 1 if len(sample) > 1 else 0
+    for i, cell in enumerate(sample[:4]):
+        if clean_code(cell):
+            code_col = i
+            same_cell_text = clean_text(cell)
+            name_col = i if re.sub(r"\d{6}", "", same_cell_text) else min(i + 1, len(sample) - 1)
+            break
+    first_target_col = name_col + 1
+
+    targets = {}
+    target_problems = []
+    for col in range(first_target_col, max(len(row) for row in table)):
+        header_name = extract_name_from_header(header_rows, col)
+        if not header_name or "合计" in header_name or "总计" in header_name:
+            continue
+        code = extract_codes_from_header(header_rows, col)
+        high, method = match_school(code, header_name, high_by_code, high_by_name)
+        if high:
+            targets[col] = (high, method)
+        else:
+            target_problems.append((col, header_name, method))
+
+    rows = []
+    problems = []
+    for raw in data_rows:
+        junior_code = clean_code(raw[code_col] if code_col < len(raw) else None)
+        junior_name = clean_text(raw[name_col] if name_col < len(raw) else None)
+        if not junior_code and (not junior_name or "合计" in junior_name):
+            continue
+        junior, junior_method = match_school(
+            junior_code, junior_name, junior_by_code, junior_by_name, district_id
+        )
+        if not junior:
+            problems.append((raw, "junior", junior_method))
+            continue
+        for col, (high, high_method) in targets.items():
+            plan_num = clean_num(raw[col] if col < len(raw) else None)
+            if plan_num is None or plan_num == 0:
+                continue
+            rows.append((junior, high, plan_num, junior_method, high_method))
+    return rows, target_problems + problems
+
+
+def parse_tables(path, district_id, high_by_code, high_by_name, junior_by_code, junior_by_name):
+    all_rows = []
+    all_problems = []
+    long_state = {}
+    with pdfplumber.open(path) as pdf:
+        for page in pdf.pages:
+            for table in page.extract_tables():
+                if not table:
+                    continue
+                first = [clean_text(cell) for cell in table[0]]
+                is_long = (
+                    any("招生学校代码" in cell for cell in first)
+                    and any("初中学校代码" in cell for cell in first)
+                ) or (long_state.get("high_code") and len(table[0]) == 3 and clean_code(table[0][0]))
+                if is_long:
+                    rows, problems = parse_long_table(
+                        table, district_id, high_by_code, high_by_name, junior_by_code, junior_by_name, long_state
+                    )
+                else:
+                    rows, problems = parse_matrix_table(
+                        table, district_id, high_by_code, high_by_name, junior_by_code, junior_by_name
+                    )
+                all_rows.extend(rows)
+                all_problems.extend(problems)
+    return all_rows, all_problems
+
+
+def main():
+    conn = pymysql.connect(**DB_CONFIG)
+    try:
+        with conn.cursor(pymysql.cursors.DictCursor) as cursor:
+            high_by_code, high_by_name, _ = load_schools(cursor, "高中")
+            junior_by_code, junior_by_name, _ = load_schools(cursor, "初中")
+            cursor.execute(
+                """
+                SELECT DistrictID, COUNT(*) AS count
+                FROM MPS_Score
+                WHERE ScoreYear = %s AND ScoreType = %s
+                GROUP BY DistrictID
+                """,
+                (YEAR, SCORE_TYPE),
+            )
+            existing = {row["DistrictID"]: row["count"] for row in cursor.fetchall()}
+
+            for district_id, district_name in DISTRICTS.items():
+                if existing.get(district_id):
+                    print("existing", district_id, district_name, existing[district_id])
+                    continue
+                pdf_path = os.path.join(BASE_DIR, f"2026名额到校{district_name}.pdf")
+                jpg_path = os.path.join(BASE_DIR, f"2026名额到校{district_name}.jpg")
+                if not os.path.exists(pdf_path):
+                    if os.path.exists(jpg_path):
+                        print("problem", district_id, district_name, "image", jpg_path)
+                    else:
+                        print("problem", district_id, district_name, "missing")
+                    continue
+                rows, problems = parse_tables(
+                    pdf_path, district_id, high_by_code, high_by_name, junior_by_code, junior_by_name
+                )
+                print(
+                    "district",
+                    district_id,
+                    district_name,
+                    "rows",
+                    len(rows),
+                    "plan",
+                    sum(row[2] for row in rows),
+                    "problems",
+                    len(problems),
+                )
+                for problem in problems[:8]:
+                    print("  problem_sample", problem)
+    finally:
+        conn.close()
+
+
+if __name__ == "__main__":
+    main()